New Technology of Library and Information Service  2015, Vol. 31 Issue (2): 31-38    DOI: 10.11925/infotech.1003-3513.2015.02.05
Short-text Classification Based on HowNet and Domain Keyword Set Extension
Li Xiangdong1,2, Cao Huan1, Ding Cong1, Huang Li3
1. School of Information Management, Wuhan University, Wuhan 430072, China;
2. Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3. Wuhan University Library, Wuhan 430072, China
Abstract  

[Objective] This paper extends the features of short texts in order to improve short-text classification performance. [Methods] For each class in the training set, high-frequency words and topic core words are extracted as a domain keyword set, using two feature granularities: words and latent topics. The topic probability distribution of each test text is derived with an LDA model; when the probability of a topic exceeds a given threshold, the keywords of that topic are added to the test text. The semantic similarity between the test text and the domain keyword set of each class is then calculated with HowNet. [Results] Compared with the LDA-based short-text classification method, the proposed algorithm improves Macro F1 by an average of 4.9%, 5.9% and 4.2% on the Fudan corpus, the Sogou corpus and the Micro-blog corpus, respectively, and Micro F1 by an average of 4.6%, 6.2% and 4.6%. Compared with the VSM-based short-text classification method, it raises the F-measure by more than 13% on all three corpora. The experiments also show that extension combining high-frequency words and topic core words classifies better than extension using only high-frequency words or only topic core words. [Limitations] Many words are not included in HowNet, so their similarity cannot be calculated with it, which affects the classification results. [Conclusions] The proposed method effectively improves short-text classification performance.
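
The extension and matching steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the threshold value, the toy data, and the "average of best matches" aggregation are all assumptions, and `toy_word_sim` is a hypothetical stand-in for a HowNet-based word similarity. The topic probabilities are assumed to come from an already trained LDA model.

```python
from typing import Callable, Dict, List, Sequence


def extend_text(tokens: Sequence[str],
                topic_probs: Sequence[float],
                topic_keywords: Sequence[Sequence[str]],
                threshold: float) -> List[str]:
    """Append the keywords of every topic whose probability exceeds the threshold."""
    extended = list(tokens)
    for topic_id, prob in enumerate(topic_probs):
        if prob > threshold:
            for word in topic_keywords[topic_id]:
                if word not in extended:
                    extended.append(word)
    return extended


def text_class_similarity(tokens: Sequence[str],
                          keyword_set: Sequence[str],
                          word_sim: Callable[[str, str], float]) -> float:
    """Assumed aggregation: mean over text words of the best match in the keyword set."""
    if not tokens or not keyword_set:
        return 0.0
    best = [max(word_sim(t, k) for k in keyword_set) for t in tokens]
    return sum(best) / len(tokens)


def classify(tokens: Sequence[str],
             class_keywords: Dict[str, Sequence[str]],
             word_sim: Callable[[str, str], float]) -> str:
    """Pick the class whose domain keyword set is most similar to the text."""
    return max(class_keywords,
               key=lambda c: text_class_similarity(tokens, class_keywords[c], word_sim))


# Hypothetical similarity table standing in for HowNet (values are made up).
TOY_SIM = {
    frozenset(("goal", "football")): 0.8,
    frozenset(("match", "team")): 0.6,
}


def toy_word_sim(a: str, b: str) -> float:
    """Identical words score 1.0; unknown pairs score 0.0, like words missing from HowNet."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


# Toy inputs: a two-word test text, an assumed LDA topic distribution,
# per-topic keywords, and per-class domain keyword sets.
tokens = ["match", "goal"]
topic_probs = [0.7, 0.3]
topic_keywords = [["football", "league"], ["stock", "market"]]
class_keywords = {"sports": ["football", "team"], "finance": ["stock", "bank"]}

extended = extend_text(tokens, topic_probs, topic_keywords, threshold=0.5)
predicted = classify(extended, class_keywords, toy_word_sim)
print(extended, predicted)
```

In this toy run only the first topic passes the threshold, so its keywords are added to the text before matching. Words that the similarity resource does not cover contribute a score of zero, which mirrors the limitation the abstract reports for words missing from HowNet.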

Key words: Short-text classification; Keyword set; LDA; Feature extension; HowNet
Received: 25 July 2014      Published: 17 March 2015
CLC Number: TP391

Cite this article:

Li Xiangdong, Cao Huan, Ding Cong, Huang Li. Short-text Classification Based on HowNet and Domain Keyword Set Extension. New Technology of Library and Information Service, 2015, 31(2): 31-38.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.02.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I2/31

