Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (6): 56-64    DOI: 10.11925/infotech.2096-3467.2017.06.06
Original Article
An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature
Rujiang Bai1, Fuhai Leng2, Junhua Liao1
1Institute of Scientific and Technical Information, Shandong University of Technology, Zibo 255049, China
2Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100190, China

[Objective] This paper aims to improve the performance of the cosine text similarity method by using semantic chunk features of the text. [Methods] First, we retrieved project data on carbon nanotube research and pre-processed it with stemming and part-of-speech (POS) tagging. Second, we identified the semantic chunks of the text with a conditional random field model. Third, we calculated text similarity based on the semantic chunk features. Finally, we compared our results with those obtained from unlabeled data. [Results] The proposed method improved the performance of cosine similarity calculation by up to 26%. [Limitations] The performance of the method depends on the semantic chunk annotation. [Conclusions] The proposed method can effectively identify similar texts and reduces the dimensionality of the vector space model, which improves computing efficiency. The method is robust and could be transferred to other fields.
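
The similarity step described in the abstract can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes each text has already been reduced to a list of semantic chunk features (the paper uses a conditional random field model for that step, which is not shown here), and it weights features by raw frequency. The function name `cosine_similarity`, the `type:phrase` chunk labels, and the sample documents are all hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(features_a, features_b):
    """Cosine of the angle between two raw-frequency feature vectors."""
    vec_a, vec_b = Counter(features_a), Counter(features_b)
    # Only features present in both texts contribute to the dot product.
    shared = vec_a.keys() & vec_b.keys()
    dot = sum(vec_a[f] * vec_b[f] for f in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical semantic chunk features, one entry per identified chunk.
doc_a = [
    "method:chemical vapor deposition",
    "material:carbon nanotube",
    "property:conductivity",
]
doc_b = [
    "material:carbon nanotube",
    "property:conductivity",
    "application:sensor",
]

chunk_score = cosine_similarity(doc_a, doc_b)

# Word-level features for the same texts, to compare vector space sizes.
words_a = [w for chunk in doc_a for w in chunk.split(":")[1].split()]
words_b = [w for chunk in doc_b for w in chunk.split(":")[1].split()]
chunk_dims = len(set(doc_a) | set(doc_b))
word_dims = len(set(words_a) | set(words_b))
```

In this toy example the chunk-level vector space has fewer dimensions than the word-level one, because each multi-word chunk counts as a single feature. This is the kind of dimension reduction the abstract refers to, although the sizes here say nothing about the reported results.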

Key words: Text Similarity; Semantic Chunks; Vector Space Model; Ontology
Received: 27 April 2017      Published: 25 August 2017

Cite this article:

Rujiang Bai, Fuhai Leng, Junhua Liao. An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature. Data Analysis and Knowledge Discovery, 2017, 1(6): 56-64.

