New Technology of Library and Information Service  2016, Vol. 32 Issue (2): 59-66    DOI: 10.11925/infotech.1003-3513.2016.02.08
Original Article
A New Automatic Categorization Method with Documents Based on HowNet
Li Xiangdong1,2, Liu Kang1, Ding Cong1, Gao Fan1
1School of Information Management, Wuhan University, Wuhan 430072, China
2Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper aims to solve the feature mismatch problem caused by differences between document types and to improve the performance of automatic classification. [Methods] We propose a method that uses documents of types different from those of the uncategorized documents as the training corpus and extends their semantic features by introducing the third-party resource HowNet. [Results] Compared with classification without feature extension, the proposed method increased the F-measure by 1.2% to 11.0% in our classification experiments. The four document types used in the study were webpages, books, non-academic periodicals, and academic journals. [Limitations] Not every document type was tested on a publicly accessible corpus, so more tests are needed to examine the generality and objectivity of the method. [Conclusions] The study shows that the proposed method is feasible. Through corpus construction and feature extension, it can effectively eliminate the semantic differences among collections of different types and improve the performance of automatic text classification.
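The idea of similarity-based feature extension and the F-measure used for evaluation can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TOY_SIMILARITY` table is a made-up stand-in for a HowNet-based word-similarity function, and the function names, the `threshold` value, and the weighting rule (original weight scaled by similarity) are all assumptions chosen for illustration.

```python
from typing import Dict, Set, Tuple

# Hypothetical similarity table standing in for a HowNet-based
# word-similarity function; the values are invented for illustration.
TOY_SIMILARITY: Dict[Tuple[str, str], float] = {
    ("novel", "book"): 0.9,
    ("website", "webpage"): 0.95,
    ("novel", "webpage"): 0.1,
}


def similarity(a: str, b: str) -> float:
    """Symmetric lookup in the toy table; identical words score 1.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get((a, b), TOY_SIMILARITY.get((b, a), 0.0))


def extend_features(doc: Dict[str, float], vocabulary: Set[str],
                    threshold: float = 0.8) -> Dict[str, float]:
    """Add vocabulary terms that are semantically close to the document's terms."""
    extended = dict(doc)
    for term, weight in doc.items():
        for candidate in vocabulary:
            sim = similarity(term, candidate)
            if candidate not in doc and sim >= threshold:
                # Assumed rule: keep the largest similarity-scaled weight.
                extended[candidate] = max(extended.get(candidate, 0.0),
                                          weight * sim)
    return extended


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Usage: a document to classify and the vocabulary of a training corpus
# drawn from a different document type.
doc = {"novel": 1.0, "website": 0.5}
training_vocabulary = {"book", "webpage", "journal"}
extended = extend_features(doc, training_vocabulary)
```

In this sketch, a document whose terms never appear in the training vocabulary gains features that do appear there, which is one way the mismatch between document types could be narrowed before a classifier is applied.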

Key words: Third-party resource; HowNet; Feature extension; Semantic difference
Received: 12 August 2015      Published: 08 March 2016

Cite this article:

Li Xiangdong, Liu Kang, Ding Cong, Gao Fan. A New Automatic Categorization Method with Documents Based on HowNet. New Technology of Library and Information Service, 2016, 32(2): 59-66.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.02.08     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I2/59
