Categorizing Documents Automatically within Common Semantic Space
Li Xiangdong1,2, Gao Fan1, Li Youhai3
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
3 Wuhan Foreign Languages School, Wuhan 430072, China
[Objective] This paper aims to reduce the semantic differences among documents that arise from differing file types and writing styles. [Methods] First, we selected domain-independent features, which appear in both document sets, and domain-dependent features, which appear in only one set. Then, we used the domain-independent features to construct a bidirectional graph and applied spectral clustering to the domain-dependent features. Finally, we correlated the domain-dependent features and generated a common semantic space defined by the clustered features. [Results] The proposed model improved the classification results by 3.0% to 6.9% compared with traditional methods. [Limitations] The proposed model requires a large number of documents belonging to the same field to build the common semantic space. [Conclusions] The common semantic space can help to effectively organize digital resources of different file types.
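The first two steps of the method can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that documents are already tokenized, that a feature counts as domain-independent when it occurs in both sets, and that edges in the graph are weighted by document-level co-occurrence. The function names `split_features` and `cooccurrence_graph` and the toy documents are hypothetical.

```python
from collections import Counter
from itertools import product


def split_features(source_docs, target_docs):
    """Split the vocabulary by where each feature occurs.

    Assumption: a feature is domain-independent if it occurs in both
    document sets, and domain-dependent if it occurs in only one.
    """
    src_vocab = {tok for doc in source_docs for tok in doc}
    tgt_vocab = {tok for doc in target_docs for tok in doc}
    independent = sorted(src_vocab & tgt_vocab)
    dependent = sorted(src_vocab ^ tgt_vocab)
    return independent, dependent


def cooccurrence_graph(docs, independent, dependent):
    """Weight each (dependent, independent) edge by document co-occurrence.

    weights[(d, i)] is the number of documents that contain both the
    domain-dependent feature d and the domain-independent feature i.
    """
    ind, dep = set(independent), set(dependent)
    weights = Counter()
    for doc in docs:
        tokens = set(doc)
        for d, i in product(tokens & dep, tokens & ind):
            weights[(d, i)] += 1
    return weights


# Toy data (hypothetical): two document sets of different types.
source_docs = [["algorithm", "model", "experiment"],
               ["model", "theorem", "data"]]
target_docs = [["model", "click", "data"],
               ["data", "download", "algorithm"]]

independent, dependent = split_features(source_docs, target_docs)
weights = cooccurrence_graph(source_docs + target_docs, independent, dependent)
```

The resulting edge weights form the matrix on which spectral clustering of the domain-dependent features would operate; that step needs an eigendecomposition and is omitted here.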