New Technology of Library and Information Service  2015, Vol. 31 Issue (2): 39-45    DOI: 10.11925/infotech.1003-3513.2015.02.06
Research on Chinese Text Categorization Based on Semantic Similarity of HowNet
Liu Huailiang, Du Kun, Qin Chunxiu
School of Economics & Management, Xidian University, Xi'an 710126, China
Abstract  

[Objective] This paper aims to improve the precision of Chinese text classification by calculating the similarity between Chinese texts more accurately. [Methods] Using the TF-IDF algorithm to calculate term weights and HowNet to analyze the semantic relationships between lexical items, this paper proposes a weighted text similarity algorithm based on HowNet semantic similarity and evaluates it in a Chinese text classification experiment. [Results] The experimental results show that the proposed method improves text categorization performance compared with traditional methods. [Limitations] The algorithm has high time complexity, and its classification speed needs to be improved. [Conclusions] Analyzing the semantic relationships between feature items is an effective way to enhance the accuracy of Chinese text classification.
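
The abstract does not give the paper's formulas, so the following is only a minimal sketch of the general idea it describes: TF-IDF term weights combined with a word-level semantic similarity to score two texts. Everything here is an assumption made for illustration: the smoothed IDF variant, the "best match per term" aggregation, and the names `tf_idf`, `word_sim`, `directed_sim`, `text_sim`, and `TOY_SIM`. In particular, `word_sim` is a hand-written toy lookup table standing in for a real HowNet-based similarity measure, not HowNet itself.

```python
import math
from collections import Counter

# Toy word-similarity table (hypothetical values, NOT taken from HowNet).
TOY_SIM = {
    frozenset(("汽车", "轿车")): 0.9,     # car / sedan
    frozenset(("发动机", "引擎")): 0.95,  # motor / engine
    frozenset(("维修", "保养")): 0.8,     # repair / maintenance
}


def word_sim(a, b):
    """Stand-in for a HowNet-based word similarity in [0, 1]."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, so every weight is
    positive. This variant is an assumption, not the paper's formula.
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


def directed_sim(wa, wb, sim=word_sim):
    """Weighted average, over terms of A, of each term's best match in B."""
    total = sum(wa.values())
    if total == 0 or not wb:
        return 0.0
    matched = sum(w * max(sim(t, u) for u in wb) for t, w in wa.items())
    return matched / total


def text_sim(wa, wb, sim=word_sim):
    """Symmetric text similarity: mean of the two directed scores."""
    return 0.5 * (directed_sim(wa, wb, sim) + directed_sim(wb, wa, sim))


if __name__ == "__main__":
    # Pre-segmented documents; a real pipeline would use a word segmenter.
    docs = [
        ["汽车", "发动机", "维修"],
        ["轿车", "引擎", "保养"],
        ["股票", "市场", "上涨"],
    ]
    w = tf_idf(docs)
    print(text_sim(w[0], w[1]))  # related texts, no shared surface terms
    print(text_sim(w[0], w[2]))  # unrelated texts
```

Under these assumptions, the first two documents share no terms, so plain cosine similarity over TF-IDF vectors would score them as 0, while the semantic variant scores them as similar. The nested best-match search costs on the order of |A| × |B| word comparisons per document pair, which is consistent with the high time complexity noted under [Limitations].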

Key words: Text classification; Semantic similarity; HowNet
Received: 22 September 2014      Published: 17 March 2015
CLC number: G353.1

Cite this article:

Liu Huailiang, Du Kun, Qin Chunxiu. Research on Chinese Text Categorization Based on Semantic Similarity of HowNet. New Technology of Library and Information Service, 2015, 31(2): 39-45.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.02.06     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I2/39

