Review of Digital Documents Automatic Classification Research
Li Xiangdong1,2, Ba Zhichao1,3, Gao Fan1
1 School of Information Management, Wuhan University, Wuhan 430072, China; 2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China; 3 Information Research Institute of Shandong Academy of Sciences, Ji’nan 250014, China
[Objective] This paper discusses open issues in the automatic classification of digital documents (i.e., library bibliographies, news pages and social media posts) and possible solutions to them. [Coverage] We reviewed the machine-learning-based automatic classification literature on feature semantic conversion, feature expansion and weighting strategies. [Methods] We analyzed the leading studies, key technologies, current achievements and future directions reported in the published articles. [Results] Previous studies are limited in how they represent the semantics of texts and in how they make use of knowledge bases. [Limitations] We did not discuss classification algorithms. [Conclusions] To improve the effectiveness of automatic classification of digital documents, future research could combine the Vector Space Model with Probabilistic Topic Models, use knowledge bases to improve concept similarity computation, and construct composite weighting strategies.
Li Xiangdong, Ba Zhichao, Gao Fan. Review of Digital Documents Automatic Classification Research[J]. New Technology of Library and Information Service, 2016, 32(9): 17-26.