Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (10): 43-52    DOI: 10.11925/infotech.2096-3467.2017.0702
Automatic Classification of Documents from Wikipedia
Xiangdong Li1,2, Tao Ruan1, Kang Liu1
1School of Information Management, Wuhan University, Wuhan 430072, China
2Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
[Objective] This paper aims to improve the performance of text classification systems through Wikipedia-based feature expansion. [Methods] First, we established the CDFmax-IDF method, a modification of TF-IDF, to retrieve the candidate word list. Then, we used Wikipedia to extend the document features, determining the semantic relevance of words from their direct links, categories and indirect links. Finally, we proposed wLDA, an improved LDA model, to model the extended features and the text. [Results] The proposed method improved macro-F1 and micro-F1 on Naive Bayes, KNN and SVM classifiers by 1.6%-2.8% and 1.4%-2.7%, respectively. [Limitations] We did not consider the properties of the words or the relationships among them. [Conclusions] Feature expansion based on Wikipedia improves the effectiveness of automatic document classification methods.
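The reported results use macro-F1 and micro-F1. As a minimal sketch of how these two averages differ, the following uses the standard definitions: macro-F1 averages the per-class F1 scores, while micro-F1 pools true positives, false positives and false negatives over all classes before computing F1. The function name `macro_micro_f1` and the toy labels are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1      # correct prediction for class t
        else:
            fp[p] += 1      # p was predicted but is wrong
            fn[t] += 1      # t was the true class but was missed

    def f1(tp_, fp_, fn_):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Macro: unweighted mean of per-class F1 scores
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: F1 computed from counts pooled over all classes
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy labels, for illustration only
y_true = ["a", "a", "a", "b"]
y_pred = ["a", "a", "b", "b"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro-F1={macro:.3f}, micro-F1={micro:.3f}")
```

Because macro-F1 weights every class equally, it is more sensitive to performance on small classes; micro-F1 is dominated by the frequent classes.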

Key words: Various Types of Documents; Text Classification; Feature Selection; Feature Expansion; Wikipedia
Received: 17 July 2017      Published: 08 November 2017

Xiangdong Li, Tao Ruan, Kang Liu. Automatic Classification of Documents from Wikipedia. Data Analysis and Knowledge Discovery, 2017, 1(10): 43-52.

