Automatic Classification of Documents from Wikipedia
Li Xiangdong1,2, Ruan Tao1, Liu Kang1
1School of Information Management, Wuhan University, Wuhan 430072, China
2Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
[Objective] This paper aims to improve the performance of text classification by using Wikipedia to expand document features. [Methods] First, we established the CDFmax-IDF method, a modification of TF-IDF, to retrieve the candidate word list. Then, we used Wikipedia to extend the document features, determining the semantic relevance of words from their direct links, categories and indirect links. Finally, we proposed wLDA, an improved LDA model, to model the extended features and the text. [Results] The proposed method improved macro-F1 by 1.6%-2.8% and micro-F1 by 1.4%-2.7% on the Naive Bayes, KNN and SVM classifiers. [Limitations] We did not consider the properties of the words or the relationships among them. [Conclusions] Wikipedia-based feature expansion improves the effectiveness of automatic document classification.
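The results are reported as macro-F1 and micro-F1, which can diverge when categories are unevenly sized. The following minimal sketch shows how the two averages are conventionally computed for single-label classification. It is not taken from the paper: the function names and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro-F1, micro-F1) for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes first, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy data (hypothetical): one large class and two small ones.
y_true = ["sports", "sports", "sports", "sports", "finance", "health"]
y_pred = ["sports", "sports", "sports", "finance", "finance", "sports"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro-F1 = {macro:.4f}, micro-F1 = {micro:.4f}")
```

In this toy case the missed "health" class pulls macro-F1 well below micro-F1, which is why reporting both gives a fuller picture of classifier performance across small and large categories.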