Categorizing Documents Automatically within Common Semantic Space
Li Xiangdong1,2, Gao Fan1, Li Youhai3
1School of Information Management, Wuhan University, Wuhan 430072, China
2Center for Electronic Commerce Research and Development, Wuhan University, Wuhan 430072, China
3Wuhan Foreign Languages School, Wuhan 430072, China

Abstract [Objective] This paper aims to reduce the semantic differences among documents that arise from differing file types and writing styles. [Methods] First, we chose domain-independent features, which appear in both document sets, and domain-dependent features, which appear in only one set. Then, we used the domain-independent features to construct a bipartite graph and applied spectral clustering to the domain-dependent features. Finally, we correlated the domain-dependent features and generated a common semantic space defined by the clustered features. [Results] The proposed model improved the classification results by 3.0% to 6.9% compared with traditional methods. [Limitations] The proposed model requires a large number of documents from the same field to build the common semantic space. [Conclusions] The common semantic space can help organize digital resources of different file types effectively.
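
The first two steps of the method can be illustrated with a minimal sketch. It only mirrors the abstract's wording: features found in both document sets are treated as domain-independent, and features found in exactly one set as domain-dependent. The function names, the toy documents, and the choice to weight each edge by document co-occurrence counts are assumptions made for illustration, not details taken from the paper. The spectral clustering step is left out, and a real pipeline would probably apply a stricter feature selection than plain set membership.

```python
from collections import defaultdict


def split_features(source_docs, target_docs):
    """Split the vocabulary into shared and single-collection features."""
    source_vocab = {tok for doc in source_docs for tok in doc}
    target_vocab = {tok for doc in target_docs for tok in doc}
    independent = source_vocab & target_vocab  # appear in both document sets
    dependent = source_vocab ^ target_vocab    # appear in exactly one set
    return independent, dependent


def cooccurrence_graph(docs, independent, dependent):
    """Weight each (dependent, independent) edge by the number of
    documents that contain both features (an assumed weighting)."""
    edges = defaultdict(int)
    for doc in docs:
        tokens = set(doc)
        for dep in tokens & dependent:
            for ind in tokens & independent:
                edges[(dep, ind)] += 1
    return dict(edges)


# Toy tokenized documents, invented for illustration only.
journal_docs = [["algorithm", "theorem", "model"],
                ["model", "corollary", "data"]]
news_docs = [["model", "reporter", "data"],
             ["algorithm", "headline", "data"]]

independent, dependent = split_features(journal_docs, news_docs)
graph = cooccurrence_graph(journal_docs + news_docs, independent, dependent)
```

In this sketch every edge links one domain-dependent feature to one domain-independent feature, so the resulting structure is a weighted graph between the two feature groups. Clustering the domain-dependent side of such a graph is the part that a spectral method would then take over.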
Received: 21 March 2018
Published: 25 October 2018