Volume 5, No 1, 2008

Generating best features for web page classification


M. Indra Devi, R. Rajaram and K. Selvakuberan

Abstract

Because the Internet returns millions of web pages for any given search term, quickly finding interesting and relevant results on the Web is very difficult. Automatic classification of web pages into relevant categories is an active research topic that helps search engines return relevant results. Web pages contain many irrelevant words, infrequent words and stop words that reduce classifier performance, so extracting or selecting representative features from a web page is an essential pre-processing step. The goal of this paper is to find a minimal number of high-quality features by integrating feature selection techniques. We conducted experiments with various numbers of features selected by different feature selection algorithms on a well-defined initial set of features, and show that the cfssubset evaluator combined with the term frequency method yields a minimal set of high-quality features sufficient to attain considerable classification accuracy.


Pages: 1-12

Keywords: Feature selection; Subset generation; Subset evaluation; Machine learning; Classification
