[Objective] This paper aims to improve the precision of Chinese text classification by calculating the similarity between Chinese texts more accurately. [Methods] Using the TF-IDF algorithm to calculate term weights and HowNet to analyze the semantic relationships between lexical items, this paper proposes a weighted text similarity algorithm based on HowNet semantic similarity and tests it in a Chinese text classification experiment. [Results] The experimental results show that the proposed method improves text categorization performance compared with traditional methods. [Limitations] The algorithm has high time complexity, and its classification speed needs to be improved. [Conclusions] Analyzing the semantic relationships between feature items is shown to be an effective way to enhance the accuracy of Chinese text classification.
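The abstract does not give the weighting formula, so the following is only a minimal sketch of the general idea, under stated assumptions: term weights come from a smoothed TF-IDF variant, and text similarity is computed with a soft cosine measure, swapped in here as a stand-in for the paper's own formula. The `TOY_SIM` table and the `word_sim`, `tf_idf`, and `semantic_similarity` names are hypothetical; the similarity score is invented for illustration and is not taken from HowNet.

```python
import math
from collections import Counter

# Hypothetical word-pair scores standing in for HowNet similarity (invented value).
TOY_SIM = {frozenset(("电脑", "计算机")): 0.9}


def word_sim(a, b):
    """Return a similarity in [0, 1]; identical words score 1.0."""
    if a == b:
        return 1.0
    return TOY_SIM.get(frozenset((a, b)), 0.0)


def exact_match(a, b):
    """Baseline with no semantics: words are similar only if identical."""
    return 1.0 if a == b else 0.0


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (smoothed IDF variant)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights


def _cross(wa, wb, sim):
    """Sum of weight products over all term pairs, scaled by word similarity."""
    return sum(x * y * sim(t, u) for t, x in wa.items() for u, y in wb.items())


def semantic_similarity(wa, wb, sim=word_sim):
    """Soft cosine between two weighted term dicts (assumed stand-in formula)."""
    den = math.sqrt(_cross(wa, wa, sim) * _cross(wb, wb, sim))
    return _cross(wa, wb, sim) / den if den else 0.0


# Already-segmented toy documents; real input would come from a word segmenter.
docs = [["电脑", "价格"], ["计算机", "价格"], ["足球", "比赛"]]
w = tf_idf(docs)
```

With these toy inputs, `semantic_similarity(w[0], w[1])` is higher than the same call with `sim=exact_match`, because the two near-synonyms contribute to the score instead of being treated as unrelated terms. The nested sums over term pairs also hint at why such an approach has the high time complexity noted under Limitations.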
刘怀亮, 杜坤, 秦春秀. 基于知网语义相似度的中文文本分类研究[J]. 现代图书情报技术, 2015, 31(2): 39-45.
Liu Huailiang, Du Kun, Qin Chunxiu. Research on Chinese Text Categorization Based on Semantic Similarity of HowNet. New Technology of Library and Information Service, 2015, 31(2): 39-45.