Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (9): 66-73    DOI: 10.11925/infotech.2096-3467.2018.0314
Categorizing Documents Automatically within Common Semantic Space
Xiangdong Li1,2, Fan Gao1, Youhai Li3
1School of Information Management, Wuhan University, Wuhan 430072, China
2Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
3Wuhan Foreign Languages School, Wuhan 430072, China
Abstract  

[Objective] This paper aims to reduce the semantic differences among documents that arise from differing file types and writing styles. [Methods] First, we selected domain-independent features, which appear in both document sets, and domain-dependent features, which appear in only one set. Then, we used the domain-independent features to construct a bidirectional graph and performed spectral clustering on the domain-dependent features. Finally, we correlated the domain-dependent features and generated the common semantic space defined by the clustered features. [Results] The proposed model improved the classification results by 3.0% to 6.9% compared with traditional methods. [Limitations] The proposed model requires a large number of documents from the same field to build the common semantic space. [Conclusions] The common semantic space can help organize digital resources of different file types effectively.
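
The first two steps of the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how graph edges are weighted, so document-level co-occurrence counts are assumed here, in the spirit of spectral feature alignment. The function names and the toy corpora are hypothetical, and the later spectral clustering step is omitted.

```python
from collections import Counter
from itertools import product


def split_features(docs_a, docs_b):
    """Split the vocabulary into shared and set-specific features.

    docs_a, docs_b: lists of tokenized documents (lists of strings).
    """
    vocab_a = {tok for doc in docs_a for tok in doc}
    vocab_b = {tok for doc in docs_b for tok in doc}
    independent = vocab_a & vocab_b   # features appearing in both sets
    dependent = vocab_a ^ vocab_b     # features appearing in exactly one set
    return independent, dependent


def cooccurrence_edges(docs, independent, dependent):
    """Count document-level co-occurrences between the two feature groups.

    Each (dependent, independent) pair is one edge of the graph and the
    count is its weight (an assumed weighting, not taken from the paper).
    """
    edges = Counter()
    for doc in docs:
        tokens = set(doc)
        dep_here = tokens & dependent
        ind_here = tokens & independent
        for d, i in product(dep_here, ind_here):
            edges[(d, i)] += 1
    return edges


# Toy corpora invented for illustration: two document sets of different types.
books = [["library", "catalog", "classification"],
         ["monograph", "classification", "index"]]
news = [["reporter", "classification", "index"],
        ["headline", "index"]]

independent, dependent = split_features(books, news)
edges = cooccurrence_edges(books + news, independent, dependent)
```

In this toy case "classification" and "index" act as the shared bridge, while words such as "monograph" and "headline" are set-specific. Clustering the set-specific words by their links to the bridge words is what would let features from different document types fall into the same cluster.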

Key words: Common Semantic Space; Automatic Text Categorization; Spectral Clustering; Cross-domain Categorization
Received: 21 March 2018      Published: 25 October 2018

Cite this article:

Xiangdong Li, Fan Gao, Youhai Li. Categorizing Documents Automatically within Common Semantic Space. Data Analysis and Knowledge Discovery, 2018, 2(9): 66-73.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0314     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I9/66
