Categorizing Documents Automatically within Common Semantic Space
Li Xiangdong1,2, Gao Fan1, Li Youhai3
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
3 Wuhan Foreign Languages School, Wuhan 430072, China
[Objective] This paper aims to reduce the semantic differences among documents that arise from differing file types and writing styles. [Methods] First, we selected domain-independent features, which appear in both document sets, and domain-dependent features, which appear in only one set. Then, we used the domain-independent features to construct a bidirectional graph and applied spectral clustering to the domain-dependent features. Finally, we correlated the domain-dependent features across the two sets and generated a common semantic space defined by the resulting feature clusters. [Results] The proposed model improved classification results by 3.0% to 6.9% compared with traditional methods. [Limitations] The proposed model requires a large number of documents from the same field to build the common semantic space. [Conclusions] The common semantic space could help organize digital resources of different file types effectively.
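The three steps in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes binary co-occurrence counts as edge weights, a normalized affinity matrix, and the top-k eigenvectors as the cluster representation, in the spirit of spectral feature alignment. The function names (`split_features`, `embed_dependent_features`, `to_common_space`), the toy vocabulary, and the choice of k are all hypothetical.

```python
import numpy as np


def split_features(X_a, X_b):
    """Split vocabulary columns by where they occur.

    Assumption: a feature is domain-independent if it occurs in both
    document sets and domain-dependent if it occurs in exactly one.
    """
    in_a = (X_a > 0).any(axis=0)
    in_b = (X_b > 0).any(axis=0)
    independent = np.flatnonzero(in_a & in_b)
    dependent = np.flatnonzero(in_a ^ in_b)
    return independent, dependent


def embed_dependent_features(X, independent, dependent, k):
    """Spectral embedding of the domain-dependent features (illustrative)."""
    B = (X > 0).astype(float)
    # Edge weight: number of documents in which a domain-dependent
    # feature co-occurs with a domain-independent feature.
    M = B[:, dependent].T @ B[:, independent]
    n_dep, n_ind = M.shape
    # Symmetric adjacency matrix of the two-sided feature graph.
    A = np.zeros((n_dep + n_ind, n_dep + n_ind))
    A[:n_dep, n_dep:] = M
    A[n_dep:, :n_dep] = M.T
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    S = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(S)  # eigenvalues come back in ascending order
    # Rows: domain-dependent features; columns: top-k eigenvectors.
    return vecs[:n_dep, -k:]


def to_common_space(X, independent, dependent, U_dep):
    """Represent documents by shared features plus cluster features."""
    return np.hstack([X[:, independent], X[:, dependent] @ U_dep])


# Toy data: columns are good, bad, sharp, blurry, gripping, boring.
X_a = np.array([[1, 0, 1, 0, 0, 0],
                [0, 1, 0, 1, 0, 0],
                [1, 0, 1, 0, 0, 0]])  # e.g. camera reviews
X_b = np.array([[1, 0, 0, 0, 1, 0],
                [0, 1, 0, 0, 0, 1],
                [0, 1, 0, 0, 0, 1]])  # e.g. book reviews
X = np.vstack([X_a, X_b])

independent, dependent = split_features(X_a, X_b)
U_dep = embed_dependent_features(X, independent, dependent, k=2)
Z = to_common_space(X, independent, dependent, U_dep)
```

In this toy example, "sharp" and "gripping" never appear in the same document set, but both co-occur with "good", so their rows in `U_dep` point in the same direction. A classifier trained on `Z` for one set can therefore reuse that signal on the other. A full pipeline would also need feature selection and a clustering step such as k-means on the embedding, both omitted here.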