Please wait a minute...
New Technology of Library and Information Service  2012, Vol. 28 Issue (3): 47-52    DOI: 10.11925/infotech.1003-3513.2012.03.08
Current Issue | Archive | Adv Search |
Research on Chinese Short Text Classification Based on Wikipedia
Fan Yunjie, Liu Huailiang
School of Economics and Management, Xidian University, Xi’an 710071, China
Download: PDF(654 KB)   HTML  
Export: BibTeX | EndNote (RIS)      
Abstract  According to the characteristics of Chinese short texts, a method of feature extension is introduced to help text classification. Firstly, related concepts are extracted from Wikipedia and concept associativity is calculated based on the combination of statistical laws and categories. Then the semantic related concept sets are built to extend the eigenvector of short text in order to supply its semantic features. The contrast experiment shows that the algorithm of short text classification based on Wikipedia can get a better classified effect.
Key wordsShort text      Wikipedia      Text classification      Feature extension     
Received: 01 February 2012      Published: 19 April 2012
: 

TP391.1

 

Cite this article:

Fan Yunjie, Liu Huailiang. Research on Chinese Short Text Classification Based on Wikipedia. New Technology of Library and Information Service, 2012, 28(3): 47-52.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2012.03.08     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2012/V28/I3/47

[1] 王细薇, 沈云琴. 中文短文本分类方法研究[J]. 现代计算机, 2010(7): 28-31.(Wang Xiwei, Shen Yunqin. Research on Chinese Short Text Classification Method[J]. Modern Computer, 2010(7):28-31.)

[2] Metaler D, Dumais S C, Meek C. Similarity Measures for Short Segments of Text[C]. In: Proceedings of the 29th European Conference on Information Retrieval. Berlin: Springer-Verlag, 2007.

[3] Sahami M, Heilman T D. A Web-based Kernel Function for Measuring the Similarity of Short Text Snippets[C]. In: Proceedings of the 15th International World Wide Web Conference Committee (IW3C2), Edinburgh, Scotland. New York: ACM Press, 2006: 377-386.

[4] Hynek J, Jezek K, Rohlik O. Short Document Categorization-Itemsets Method[C]. In : Proceedings of the 4th European Conference on Principles and Practice of Knowledge Discovery in Databases, Workshop Machine Learning and Textual Information Access, Lyon, France. 2000:14-19.

[5] Zelikovitz S, Transductive M F. Learning for Short-Text Classification Problem Using Latent Semantic Indexing International[J]. Journal of Pattern Recognition and Artificial Intelligence, 2005, 19(2):143-163.

[6] 王鹏, 樊兴华. 中文文本分类中利用依存关系的实验研究[J]. 计算机工程与应用, 2010, 46(3): 131-133.(Wang Peng, Fan Xinghua. Study on Chinese Text Classification Based on Dependency Relation[J]. Computer Engineering and Applications, 2010, 46(3): 131-133.)

[7] 宁亚辉, 樊兴华, 吴渝. 基于领域词语本体的短文本分类[J]. 计算机科学, 2009,36(3): 142-145.(Ning Yahui, Fan Xinghua, Wu Yu. Short Text Classification Based on Domain Word Ontology[J]. Computer Science, 2009,36(3): 142-145.)

[8] 王盛, 樊兴华, 陈现麟. 利用上下位关系的中文短文本分类[J]. 计算机应用, 2010,30(3): 603-611.(Wang Sheng, Fan Xinghua, Chen Xianlin. Chinese Short Text Classification Based on Hyponymy Relation[J]. Journal of Computer Application, 2010,30(3): 603-611.)

[9] 张海粟, 马大明, 邓智龙. 基于维基百科的语义知识库及其构建方法研究[J]. 计算机应用研究, 2011,28(8): 2807-2811. (Zhang Haisu, Ma Daming, Deng Zhilong. Semantic Knowledge Bases Construction Based on Wikipedia[J]. Application Research of Computers, 2011,28(8): 2807-2811.)

[10] Wang P, Domeniconi C. Building Semantic Kernels for Text Classification Using Wikipedia[C]. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada,USA. ACM:New York,2008:713-721.

[11] 裘江南, 秦璇, 仲秋雁. 异质知识网络相关度算法研究[J]. 情报学报, 2011,30(5): 495-502.(Qiu Jiangnan, Qin Xuan, Zhong Qiuyan. Research on Relatedness Algorithms in Heterogeneous Knowledge Network[J]. Journal of the China Society for Scientific and Technical Information, 2011,30(5): 495-502.)

[12] Wikipedia[EB/OL].[2011-12-08]. http://zh.wikipedia.org.

[13] 盛志超, 陶晓鹏. 基于维基百科的语义相似度计算方法[J]. 计算机工程, 2011,37(7): 193-195.(Sheng Zhichao, Tao Xiaopeng. Semantic Similarity Computing Method Based on Wikipedia[J]. Computer Engineering, 2011,37(7): 193-195.)

[14] 苏小康. 基于维基百科构建语义知识库及其在文本分类领域的应用研究[D]. 武汉:华中师范大学, 2010.(Su Xiaokang. Research on Building Wikipedia Semantic Knowledge Base and Its Application in Text Classification[D]. Wuhan: Central China Normal University, 2010)

[15] 王元珍, 钱铁云, 冯小年. 基于关联规则挖掘的中文文本自动分类[J]. 小型微型计算机系统, 2005, 26(8): 1380-1383.(Wang Yuanzhen, Qian Tieyun, Feng Xiaonian. Association Rules Based Automatic Chinese Text Categorization[J]. Mini-micro Systems, 2005, 26(8):1380-1383)

[16] Salton G, McGillM J. Introduction to Modern Information Retrieval[M]. New York, NY, USA:McGraw Hill, 1983.
[1] Bengong Yu,Yangnan Chen,Ying Yang. Classifying Short Text Complaints with nBD-SVM Model[J]. 数据分析与知识发现, 2019, 3(5): 77-85.
[2] Zixuan Zhang,Hao Wang,Liping Zhu,Sanhong eng. Identifying Risks of HS Codes by China Customs[J]. 数据分析与知识发现, 2019, 3(1): 72-84.
[3] Xinlei Li,Hao Wang,Xiaomin Liu,Sanhong Deng. Comparing Text Vector Generators for Weibo Short Text Classification[J]. 数据分析与知识发现, 2018, 2(8): 41-50.
[4] Liu Liu,Dongbo Wang. Identifying Interdisciplinary Social Science Research Based on Article Classification[J]. 数据分析与知识发现, 2018, 2(3): 30-38.
[5] Xiangdong Li,Tao Ruan,Kang Liu. Automatic Classification of Documents from Wikipedia[J]. 数据分析与知识发现, 2017, 1(10): 43-52.
[6] Yonghe Lu,Jinghuang Chen. Optimizing Feature Selection Method for Text Classification with Shuffled Frog Leaping Algorithm[J]. 数据分析与知识发现, 2017, 1(1): 91-101.
[7] Zhou Pengcheng,Wu Chuan,Lu Wei. Entity Linking Method for Short Texts with Multi-Knowledge Bases: Case Study of Wikipedia and Freebase[J]. 现代图书情报技术, 2016, 32(6): 1-11.
[8] Xia Tian. Generating Hierarchical Paths of Chinese Text from Wikipedia[J]. 现代图书情报技术, 2016, 32(3): 25-32.
[9] Li Xiangdong,Liu Kang,Ding Cong,Gao Fan. A New Automatic Categorization Method with Documents Based on HowNet[J]. 现代图书情报技术, 2016, 32(2): 59-66.
[10] Qun Zhang, Hongjun Wang, Lunwen Wang. Classifying Short Texts with Word Embedding and LDA Model[J]. 数据分析与知识发现, 2016, 32(12): 27-35.
[11] Hu Juxiang, Lv Xueqiang, Liu Kehui. Complaint Text Classification Based on Guiding Words[J]. 现代图书情报技术, 2015, 31(7-8): 97-103.
[12] Li Xiangdong, Ba Zhichao, Huang Li. Allocation and Multi-granularity[J]. 现代图书情报技术, 2015, 31(5): 42-49.
[13] Lu Yonghe, Wang Hongbin. Feature Weighting Method Affected by Part of Speech in Text Classification[J]. 现代图书情报技术, 2015, 31(4): 18-25.
[14] Li Hui, Xiang Huating, Tang Qiang. A Trust Model for Wikipedia Based on Structure Information and Edit History[J]. 现代图书情报技术, 2015, 31(3): 33-38.
[15] Li Xiangdong, Cao Huan, Ding Cong, Huang Li. Short-text Classification Based on HowNet and Domain Keyword Set Extension[J]. 现代图书情报技术, 2015, 31(2): 31-38.
  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn