Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (10): 43-52    DOI: 10.11925/infotech.2096-3467.2017.0702
Original Article
Automatic Classification of Documents from Wikipedia
Xiangdong Li (1,2), Tao Ruan (1), Kang Liu (1)
(1) School of Information Management, Wuhan University, Wuhan 430072, China
(2) Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper aims to improve the performance of text classification systems by using Wikipedia for feature expansion. [Methods] First, we proposed the CDFmax-IDF method, a modification of TF-IDF, to select a candidate word list. Then, we used Wikipedia to expand the document features, computing the semantic relevance between words from their direct-link, category, and indirect-link relations. Finally, we proposed wLDA, an improved LDA model, to model the expanded features together with the text. [Results] With Naive Bayes, KNN, and SVM classifiers, the proposed method improved macro-F1 by 1.6%-2.8% and micro-F1 by 1.4%-2.7%. [Limitations] We did not consider the properties of the words or the relationships among them. [Conclusions] Wikipedia-based feature expansion improves the effectiveness of automatic document classification.
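
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how macro-F1 and micro-F1 are conventionally computed for single-label, multi-class classification. It is not the authors' evaluation code; the function names `f1` and `macro_micro_f1` and the toy labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no counts at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro-F1, micro-F1) for two equal-length, non-empty label lists."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but wrongly
            fn[t] += 1  # true class t was missed
    # Macro: unweighted mean of per-class F1, so every class counts equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: F1 over counts pooled across classes, so every document counts equally.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Toy example (illustrative labels, not data from the paper).
macro, micro = macro_micro_f1(["a", "a", "a", "b"], ["a", "a", "b", "b"])
print(round(macro, 4), round(micro, 4))
```

Macro-F1 is sensitive to performance on small classes, while micro-F1 is dominated by large ones, which is why the two are usually reported together.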

Key words: Various Types of Documents; Text Classification; Feature Selection; Feature Expansion; Wikipedia
Received: 17 July 2017      Published: 08 November 2017

Cite this article:

Xiangdong Li, Tao Ruan, Kang Liu. Automatic Classification of Documents from Wikipedia. Data Analysis and Knowledge Discovery, 2017, 1(10): 43-52.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.0702     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I10/43
