Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (6): 56-64    DOI: 10.11925/infotech.2096-3467.2017.06.06
Original Article
An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature
Bai Rujiang1, Leng Fuhai2, Liao Junhua1
1Institute of Scientific and Technical Information, Shandong University of Technology, Zibo 255049, China
2Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100190, China
Abstract  

[Objective] This paper aims to improve the performance of the cosine text similarity method by using the semantic chunk features of texts. [Methods] First, we retrieved project data on carbon nanotube research and pre-processed it with stemming and part-of-speech (POS) tagging. Second, we identified the semantic chunks in the text with a conditional random field model. Third, we calculated text similarity based on the semantic chunk features. Finally, we compared our results with those obtained from the unlabeled data. [Results] The proposed method improved the performance of cosine similarity calculation by up to 26%. [Limitations] Our method relies on semantic chunk annotation, which affects its computing performance. [Conclusions] The proposed method effectively identifies similar texts and reduces the dimensionality of the vector space model, which improves computing efficiency. The method is robust and could be transferred to other fields.

Key words: Text Similarity; Semantic Chunks; Vector Space Model; Ontology
Received: 27 April 2017      Published: 25 August 2017
ZTFLH:  G250  
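
The comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the conditional random field chunker is replaced by a lookup against a small hand-written chunk list (`CHUNK_LEXICON`), raw term counts stand in for whatever term weighting the paper uses, and the two sample texts are invented for illustration. It only shows how restricting the vector space to semantic chunks changes both the cosine score and the number of dimensions.

```python
import math
from collections import Counter

# Hypothetical chunk list; the paper identifies chunks with a CRF model instead.
CHUNK_LEXICON = ["carbon nanotube", "carbon fiber", "oxide catalyst"]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def word_features(text: str) -> Counter:
    """Baseline: one dimension per whitespace-separated token."""
    return Counter(text.lower().split())


def chunk_features(text: str, lexicon=CHUNK_LEXICON) -> Counter:
    """One dimension per known semantic chunk found in the text."""
    lowered = text.lower()
    return Counter({c: lowered.count(c) for c in lexicon if c in lowered})


# Invented sample texts, not taken from the project data.
doc_a = "develop a carbon fiber based on carbon nanotube using gel spinning"
doc_b = "carbon nanotube fusion creates high performance carbon fiber for aerospace"

raw_sim = cosine(word_features(doc_a), word_features(doc_b))
sem_sim = cosine(chunk_features(doc_a), chunk_features(doc_b))

print(f"raw_sim={raw_sim:.2f} dims={len(word_features(doc_a) | word_features(doc_b))}")
print(f"sem_sim={sem_sim:.2f} dims={len(chunk_features(doc_a) | chunk_features(doc_b))}")
```

In this toy case both texts reduce to the same two chunks, so the chunk-based score is higher than the token-based one and the vectors are much shorter. Real texts would need a trained chunker and a weighting scheme such as TF-IDF.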

Cite this article:

Bai Rujiang,Leng Fuhai,Liao Junhua. An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature. Data Analysis and Knowledge Discovery, 2017, 1(6): 56-64.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.06.06     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I6/56

AwardNumber | Title | Program(s)
0933141 | Novel Catalyst Supports for Water Electrolysis: Experimental and Theoretical Studies | ENERGY FOR SUSTAINABILITY
0945004 | SBIR Phase I: Low Density Carbon Fibers Based on Gel Spun Polyacrylonitrile/Carbon Nanotube | SMALL BUSINESS PHASE I
1007793 | Materials World Network: Novel Catalyst Systems for Carbon Nanotube (CNT) Synthesis and their Underlying Mechanisms | SOLID STATE & MATERIALS CHEMIS, OFFICE OF SPECIAL PROGRAMS-DMR
1046519 | SBIR Phase I: Manufacturing of Double-Walled Carbon Nanotube/Rigid Rod Polymer Advanced Structural Fibers | SMALL BUSINESS PHASE I
1133117 | Collaborative Research: Experimental and Theoretical Investigations of Catalysis on Carbon Nanotube Surfaces For Selective Liquid Fuel Generation | CATALYSIS AND BIOCATALYSIS
1434824 | DMREF: Engineering Strong, Highly Conductive Nanotube Fibers Via Fusion | DMREF
Doc_id_1 Doc_id_2 Raw_sim Sem_sim Increase
‘0933141’ ‘0945004’ 0.39 0.51 12%
‘0933141’ ‘1007793’ 0.63 0.71 8%
‘0933141’ ‘1046519’ 0.68 0.73 5%
‘0933141’ ‘1133117’ 0.42 0.69 27%
‘0933141’ ‘1434824’ 0.46 0.68 22%
‘0945004’ ‘1007793’ 0.4 0.51 11%
‘0945004’ ‘1046519’ 0.52 0.63 11%
‘0945004’ ‘1133117’ 0.26 0.51 25%
‘0945004’ ‘1434824’ 0.46 0.58 12%
‘1007793’ ‘1046519’ 0.63 0.74 11%
‘1007793’ ‘1133117’ 0.49 0.74 25%
‘1007793’ ‘1434824’ 0.52 0.68 16%
‘1046519’ ‘1133117’ 0.42 0.67 25%
‘1046519’ ‘1434824’ 0.57 0.73 16%
‘1133117’ ‘1434824’ 0.4 0.66 26%
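
The Increase column appears to be the absolute difference Sem_sim - Raw_sim expressed in percentage points, not a relative change. The following Python sketch checks that reading against the rows above; the values are copied from the table, and the helper name `increase_points` is chosen here for illustration.

```python
# Rows copied from the table: (doc_id_1, doc_id_2, raw_sim, sem_sim, reported_increase)
rows = [
    ("0933141", "0945004", 0.39, 0.51, 12),
    ("0933141", "1007793", 0.63, 0.71, 8),
    ("0933141", "1046519", 0.68, 0.73, 5),
    ("0933141", "1133117", 0.42, 0.69, 27),
    ("0933141", "1434824", 0.46, 0.68, 22),
    ("0945004", "1007793", 0.40, 0.51, 11),
    ("0945004", "1046519", 0.52, 0.63, 11),
    ("0945004", "1133117", 0.26, 0.51, 25),
    ("0945004", "1434824", 0.46, 0.58, 12),
    ("1007793", "1046519", 0.63, 0.74, 11),
    ("1007793", "1133117", 0.49, 0.74, 25),
    ("1007793", "1434824", 0.52, 0.68, 16),
    ("1046519", "1133117", 0.42, 0.67, 25),
    ("1046519", "1434824", 0.57, 0.73, 16),
    ("1133117", "1434824", 0.40, 0.66, 26),
]


def increase_points(raw_sim: float, sem_sim: float) -> int:
    """Difference between the two scores, in whole percentage points."""
    return round((sem_sim - raw_sim) * 100)


# Rows whose reported increase differs from the recomputed difference.
mismatches = [
    (a, b) for a, b, raw, sem, reported in rows
    if increase_points(raw, sem) != reported
]

# Pair with the largest gain from the semantic chunk features.
best = max(rows, key=lambda r: increase_points(r[2], r[3]))

print("rows inconsistent with Sem_sim - Raw_sim:", mismatches)
print("largest increase:", best[0], best[1], increase_points(best[2], best[3]))
```

Every row matches the percentage-point reading, and the semantic score is higher than the raw score for all 15 pairs.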
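Project No. | Main research content
0933141 | Develops a new nanocrystalline mixed metal oxide catalyst that achieves the desired electrical conductivity and electrochemical properties. The project also helps to explain how the structure of nanomaterials affects electrochemical stability and activity during electrochemical processes.
0945004 | Uses gel spinning technology to develop a high-strength, low-density carbon fiber based on carbon nanotubes (CNT). The fiber has a tensile strength above 7 GPa, a tensile modulus above 450 GPa, and a density below 1.2 g/cm3. It can be widely used in satellites, aircraft fuselages and wings, and high-performance automobiles.
1007793 | Explores the mechanisms by which graphene and carbon nanotubes (CNT) influence oxide catalysts, and focuses on new growth variables in oxide catalysts.
1046519 | Uses highly crystalline double-walled carbon nanotubes (DWCNT) to produce a new generation of structural fibers with high strength and toughness. The fibers can provide stronger, lighter structural materials for vehicle ballistic protection, commercial aerospace, and related fields.
1133117 | Studies the role of carbon nanotubes (CNT) themselves as catalysts in heterogeneous catalysis, especially in FT catalytic reactions. Improves the catalytic efficiency of dispersed carbon fiber/carbon nanotube (CNT) particles.
1434824 | A new carbon nanostructure engineering process called nanotube fusion. The method can produce high-performance carbon fiber for applications in aerospace, high-power-density energy storage, and lightweight wiring.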