New Technology of Library and Information Service  2015, Vol. 31 Issue (2): 31-38    DOI: 10.11925/infotech.1003-3513.2015.02.05
Short-text Classification Based on HowNet and Domain Keyword Set Extension
Li Xiangdong1,2, Cao Huan1, Ding Cong1, Huang Li3
1. School of Information Management, Wuhan University, Wuhan 430072, China;
2. Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China;
3. Wuhan University Library, Wuhan 430072, China
[Objective] This paper aims to implement feature extension for short texts and improve short-text classification performance. [Methods] High-frequency words and topic core words are extracted from each class of the training set to form a domain keyword set, using two feature granularities: words and latent topics. The topic probability distribution of each test text is derived with the LDA model; when the probability of a topic exceeds a given threshold, the keywords of that topic are added to the test text. The semantic similarity between the test text and the domain keyword set of each class is then calculated using HowNet. [Results] Compared with the short-text classification method based on the LDA model, the proposed algorithm improves Macro F1 by an average of 4.9%, 5.9% and 4.2% on the Fudan corpus, the Sogou corpus and the micro-blog corpus, respectively, and Micro F1 by an average of 4.6%, 6.2% and 4.6%. Compared with the short-text classification method based on the VSM model, it increases the F-measure by more than 13% on all three corpora. The experiments also show that extension combining high-frequency words and topic core words of the domain gives better classification performance than extension using only high-frequency words or only topic core words. [Limitations] Many words are not included in HowNet, so their similarity cannot be calculated with HowNet, which affects the classification results. [Conclusions] The proposed method can effectively improve short-text classification performance.
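
The extension and matching steps described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the threshold value, and the toy data are hypothetical. The `toy_sim` function is only a stand-in for a HowNet word-similarity measure. The aggregation rule (averaging each word's best match against a keyword set) is an assumption, since the abstract does not state how word similarities are combined.

```python
from typing import Callable, Dict, List


def extend_text(words: List[str], topic_probs: List[float],
                topic_keywords: List[List[str]], threshold: float) -> List[str]:
    """Append the keywords of every topic whose probability exceeds the threshold."""
    extended = list(words)
    for topic, prob in enumerate(topic_probs):
        if prob > threshold:
            for kw in topic_keywords[topic]:
                if kw not in extended:
                    extended.append(kw)
    return extended


def set_similarity(words: List[str], keywords: List[str],
                   word_sim: Callable[[str, str], float]) -> float:
    """Assumed aggregation: mean over text words of the best match in the keyword set."""
    if not words or not keywords:
        return 0.0
    best = [max(word_sim(w, k) for k in keywords) for w in words]
    return sum(best) / len(words)


def classify(words: List[str], domain_keywords: Dict[str, List[str]],
             word_sim: Callable[[str, str], float]) -> str:
    """Return the class whose domain keyword set is most similar to the text."""
    return max(domain_keywords,
               key=lambda c: set_similarity(words, domain_keywords[c], word_sim))


def toy_sim(a: str, b: str) -> float:
    # Hypothetical stand-in for HowNet similarity: exact match only.
    return 1.0 if a == b else 0.0


# Toy data (hypothetical): two topics and two classes.
topic_keywords = [["match", "team"], ["stock", "market"]]
domain_keywords = {
    "sports": ["match", "team", "goal"],
    "finance": ["stock", "market", "bank"],
}
text = ["goal", "late"]
extended = extend_text(text, [0.7, 0.3], topic_keywords, threshold=0.5)
label = classify(extended, domain_keywords, toy_sim)
```

In this toy run only the first topic passes the threshold, so its keywords are appended to the short text before matching. In practice the topic distribution would come from a trained LDA model and `toy_sim` would be replaced by a HowNet-based similarity function.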

Key words: Short-text classification; Keyword set; LDA; Feature extension; HowNet
Received: 25 July 2014      Published: 17 March 2015
CLC Number: TP391

Cite this article:

Li Xiangdong, Cao Huan, Ding Cong, Huang Li. Short-text Classification Based on HowNet and Domain Keyword Set Extension. New Technology of Library and Information Service, 2015, 31(2): 31-38.


