New Technology of Library and Information Service  2016, Vol. 32 Issue (9): 17-26    DOI: 10.11925/infotech.1003-3513.2016.09.02
Original Article
Review of Digital Documents Automatic Classification Research
Li Xiangdong1,2, Ba Zhichao1,3, Gao Fan1
1School of Information Management, Wuhan University, Wuhan 430072, China
2Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
3Information Research Institute of Shandong Academy of Sciences, Ji’nan 250014, China
Abstract  

[Objective] This paper discusses the existing issues in, and possible solutions to, the automatic classification of digital documents (i.e. library bibliographies, news pages and social media posts). [Coverage] We reviewed the literature on machine-learning-based automatic classification, focusing on feature semantic conversion, feature expansion and weighting strategies. [Methods] We analyzed the leading studies, key technologies, current achievements and future directions reported in the published articles. [Results] Previous studies are limited in their semantic representation of texts and in their use of knowledge bases. [Limitations] We did not discuss classification algorithms. [Conclusions] To improve the effectiveness of automatic classification of digital documents, future research could combine the Vector Space Model with probabilistic topic models, use knowledge bases to improve concept similarity computation, and construct composite weighting strategies.

Key words: Automatic classification; Feature semantic association; Feature semantic conversion; Feature expansion; Weighting strategy
Received: 22 January 2016      Published: 19 October 2016
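
As a rough illustration of the first direction named in the conclusions (combining a Vector Space Model representation with a probabilistic topic model), the sketch below concatenates a TF-IDF vector with a per-document topic-proportion vector. It is a minimal sketch under stated assumptions, not the method of the reviewed studies: the toy documents, the hard-coded topic proportions (standing in for the output of a topic model such as LDA), the function names and the mixing parameter `alpha` are all hypothetical.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, one TF-IDF vector per tokenized document)."""
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[t] / len(doc)) * math.log(n_docs / df[t]) if doc else 0.0
            for t in vocab
        ])
    return vocab, vectors


def l2_normalize(vec):
    """Scale a vector to unit length; leave an all-zero vector unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def combine(vsm_vec, topic_vec, alpha=0.5):
    """Concatenate normalized VSM and topic vectors.

    alpha is a hypothetical mixing weight: alpha scales the VSM part,
    (1 - alpha) scales the topic part.
    """
    vsm_part = [alpha * x for x in l2_normalize(vsm_vec)]
    topic_part = [(1 - alpha) * x for x in l2_normalize(topic_vec)]
    return vsm_part + topic_part


# Toy tokenized documents (illustrative only).
docs = [
    ["library", "catalog", "record"],
    ["news", "page", "report"],
    ["library", "news"],
]
# Assumed topic proportions per document, e.g. as produced by an LDA model.
topic_proportions = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]

vocab, vsm_vectors = tf_idf(docs)
features = [combine(v, t) for v, t in zip(vsm_vectors, topic_proportions)]
```

Each resulting feature vector has one dimension per vocabulary term plus one per topic, and could be passed to any standard classifier. Normalizing each part before weighting keeps either representation from dominating purely because of its scale.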

Cite this article:

Li Xiangdong, Ba Zhichao, Gao Fan. Review of Digital Documents Automatic Classification Research. New Technology of Library and Information Service, 2016, 32(9): 17-26.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.09.02     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I9/17
