A New Automatic Categorization Method with Documents Based on HowNet
Li Xiangdong1,2, Liu Kang1, Ding Cong1, Gao Fan1
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
[Objective] This paper aims to solve the feature mismatch problem that arises when documents of different types are classified together, and to improve the performance of automatic classification. [Methods] We proposed a method that uses documents of various types, all different from the type of the uncategorized documents, as the training corpus, and extends their semantic features with the third-party resource HowNet. [Results] In classification experiments on four document types (webpages, books, non-academic periodicals, and academic journals), the proposed method increased the F-measure by 1.2% to 11.0% compared with classification without feature extension. [Limitations] Not every document type was tested on a publicly accessible corpus, so more tests are needed to examine the generalizability and objectivity of the method. [Conclusions] The proposed method is feasible. Through corpus construction and feature extension, it can effectively eliminate the semantic differences among various types of collections and improve the performance of automatic text classification.
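The idea of semantic feature extension described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TOY_SEMEMES` dictionary, the `SEM:` prefix, and the helper names `extend_features` and `overlap` are all hypothetical, and the dictionary entries are invented stand-ins for HowNet lookups. The sketch only shows why adding shared concept-level features can make documents of different types comparable when their surface vocabulary does not match.

```python
from collections import Counter

# Hypothetical word -> sememe mapping standing in for HowNet lookups.
# These entries are illustrative only; they are not taken from HowNet.
TOY_SEMEMES = {
    "laptop": ["computer"],
    "notebook": ["computer"],
    "physician": ["medical", "human"],
    "doctor": ["medical", "human"],
}


def extend_features(tokens, sememes=TOY_SEMEMES):
    """Return term counts with each token's sememes added as extra features."""
    features = Counter(tokens)
    for token in tokens:
        for sememe in sememes.get(token, []):
            # Prefix sememe features so they cannot collide with surface terms.
            features["SEM:" + sememe] += 1
    return features


def overlap(a, b):
    """Count features shared by two feature bags (sum of minimum counts)."""
    return sum((a & b).values())


# Two toy documents of different types that share no surface terms.
webpage = ["laptop", "doctor"]
journal = ["notebook", "physician"]

plain = overlap(Counter(webpage), Counter(journal))
extended = overlap(extend_features(webpage), extend_features(journal))
print(plain, extended)
```

In this toy case the two documents have no terms in common before extension, while after extension they share the sememe features for both word pairs. A real system would draw the mapping from HowNet itself and feed the extended features to a classifier.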
Li Xiangdong, Liu Kang, Ding Cong, Gao Fan. A New Automatic Categorization Method with Documents Based on HowNet[J]. New Technology of Library and Information Service (现代图书情报技术), 2016, 32(2): 59-66.