[Objective] This paper addresses the feature mismatch that arises when documents of different types are classified together, with the aim of improving the performance of automatic classification. [Methods] We propose a method that uses documents whose types differ from that of the documents to be classified as the corpus, and introduces the third-party resource HowNet to extend their semantic features. [Results] In our classification experiment, which covered four document types (webpages, books, non-academic periodicals, and academic journals), the proposed method increased the F-measure by 1.2% to 11.0% compared with classification without feature extension. [Limitations] Not every document type was tested on a publicly accessible corpus, so further tests are needed to assess the generality and objectivity of the method. [Conclusions] The proposed method is feasible. Through corpus construction and feature extension, it can effectively eliminate the semantic differences among collections of various types and improve the performance of automatic text classification.
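The idea of semantic feature extension can be illustrated with a minimal sketch. It is not the authors' implementation: the lexicon below is a hypothetical toy mapping that stands in for HowNet, its entries are invented, and the names `TOY_LEXICON`, `extend_features`, and `shared_features` are assumptions made for illustration. The sketch shows only how adding shared concept labels can give two documents of different types a common feature even when their surface vocabulary does not overlap.

```python
from collections import Counter

# Hypothetical toy lexicon standing in for HowNet: word -> concept labels.
# These entries are invented for illustration and are not taken from HowNet.
TOY_LEXICON = {
    "laptop": ["computer"],
    "notebook": ["computer"],
    "physician": ["doctor"],
    "surgeon": ["doctor"],
}


def extend_features(tokens, lexicon=TOY_LEXICON):
    """Return a bag of words that also counts each token's concept labels."""
    features = Counter(tokens)
    for token in tokens:
        for concept in lexicon.get(token, []):
            features[concept] += 1
    return features


def shared_features(bag_a, bag_b):
    """Return the set of features that occur in both bags."""
    return set(bag_a) & set(bag_b)


# Two short documents of different types with no word in common.
webpage = ["laptop", "review"]
journal = ["notebook", "benchmark"]

plain_overlap = shared_features(Counter(webpage), Counter(journal))
extended_overlap = shared_features(extend_features(webpage),
                                   extend_features(journal))

print(plain_overlap)     # no shared features before extension
print(extended_overlap)  # a shared concept label after extension
```

In this toy setting the two documents share no features until the concept label is added, which is the kind of mismatch the abstract describes; a real system would draw the concept labels from HowNet rather than from a hand-written dictionary.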
Li Xiangdong, Liu Kang, Ding Cong, Gao Fan. A New Automatic Categorization Method with Documents Based on HowNet[J]. New Technology of Library and Information Service, 2016, 32(2): 59-66. DOI: 10.11925/infotech.1003-3513.2016.02.08.