Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (6): 56-64    DOI: 10.11925/infotech.2096-3467.2017.06.06
An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature
Bai Rujiang1, Leng Fuhai2, Liao Junhua1
1Institute of Scientific and Technical Information, Shandong University of Technology, Zibo 255049, China
2Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100190, China
[Objective] This paper aims to improve the performance of the cosine text similarity method by using semantic chunk features of the text. [Methods] First, we retrieved project data on carbon nanotube studies and pre-processed them with stemming and part-of-speech (POS) tagging. Second, we identified the semantic chunks of the text content with a conditional random field model. Third, we calculated text similarity based on the semantic chunk features. Finally, we compared our results with those generated from the unlabeled data. [Results] The proposed method improved the performance of cosine similarity calculation by up to 26%. [Limitations] The computing performance of our method relies on the annotation of semantic chunks. [Conclusions] The proposed method can effectively identify similar texts and reduces the dimensionality of the vector space model, which improves computing efficiency. The new method is robust and could be transferred to other fields.
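The comparison described in the abstract, cosine similarity over raw terms versus cosine similarity over semantic chunk features, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cosine_similarity`, the plain term-frequency weighting, and the token and chunk lists are all assumptions made for illustration, and the conditional random field chunking step is replaced here by hand-written chunk labels.

```python
import math
from collections import Counter


def cosine_similarity(features_a, features_b):
    """Cosine similarity between two feature lists, using raw term frequencies."""
    vec_a, vec_b = Counter(features_a), Counter(features_b)
    # Only features present in both documents contribute to the dot product.
    dot = sum(vec_a[f] * vec_b[f] for f in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical stemmed tokens for two project descriptions (illustrative only).
doc_a_tokens = ["develop", "high", "strength", "carbon", "nanotub", "carbon", "fiber"]
doc_b_tokens = ["creat", "carbon", "fiber", "nanotub", "fusion", "aerospac"]

# Hypothetical semantic chunks for the same texts; in the paper these would come
# from a conditional random field model, here they are written by hand.
doc_a_chunks = ["carbon_nanotube", "carbon_fiber"]
doc_b_chunks = ["carbon_nanotube", "carbon_fiber", "nanotube_fusion"]

raw_sim = cosine_similarity(doc_a_tokens, doc_b_tokens)
sem_sim = cosine_similarity(doc_a_chunks, doc_b_chunks)
print(f"raw_sim={raw_sim:.2f} sem_sim={sem_sim:.2f}")
```

The chunk-based vectors have fewer dimensions than the token-based ones, which is the dimension reduction the abstract refers to. Whether the chunk-based score is higher or lower depends entirely on the chunks chosen, so the numbers printed here say nothing about the paper's results.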

Key words: Text Similarity; Semantic Chunks; Vector Space Model; Ontology
Received: 27 April 2017      Published: 25 August 2017
ZTFLH:  G250  

Cite this article:

Bai Rujiang,Leng Fuhai,Liao Junhua. An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature. Data Analysis and Knowledge Discovery, 2017, 1(6): 56-64.

AwardNumber  Title  Program(s)
0933141  Novel Catalyst Supports for Water Electrolysis: Experimental and Theoretical Studies  ENERGY FOR
0945004  SBIR Phase I: Low Density Carbon Fibers Based on Gel Spun Polyacrylonitrile/Carbon Nanotube
1007793  Materials World Network: Novel Catalyst Systems for Carbon Nanotube (CNT) Synthesis and their Underlying Mechanisms
1046519  SBIR Phase I: Manufacturing of Double-Walled Carbon Nanotube/Rigid Rod Polymer Advanced Structural Fibers
1133117  Collaborative Research: Experimental and Theoretical Investigations of Catalysis on Carbon Nanotube Surfaces For Selective Liquid Fuel Generation
1434824  DMREF: Engineering Strong, Highly Conductive Nanotube Fibers Via Fusion  DMREF
Doc_id_1  Doc_id_2  Raw_sim  Sem_sim  Increase
0933141   0945004   0.39     0.51     12%
0933141   1007793   0.63     0.71     8%
0933141   1046519   0.68     0.73     5%
0933141   1133117   0.42     0.69     27%
0933141   1434824   0.46     0.68     22%
0945004   1007793   0.40     0.51     11%
0945004   1046519   0.52     0.63     11%
0945004   1133117   0.26     0.51     25%
0945004   1434824   0.46     0.58     12%
1007793   1046519   0.63     0.74     11%
1007793   1133117   0.49     0.74     25%
1007793   1434824   0.52     0.68     16%
1046519   1133117   0.42     0.67     25%
1046519   1434824   0.57     0.73     16%
1133117   1434824   0.40     0.66     26%
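The Increase column in the similarity table above appears to be the absolute difference between Sem_sim and Raw_sim, expressed in percentage points rather than as a relative change. The sketch below checks that reading against the listed values; the helper name `increase_points` is hypothetical, and the rows are copied from the table.

```python
# (doc_id_1, doc_id_2, raw_sim, sem_sim, listed increase in percentage points),
# copied from the similarity table.
rows = [
    ("0933141", "0945004", 0.39, 0.51, 12),
    ("0933141", "1007793", 0.63, 0.71, 8),
    ("0933141", "1046519", 0.68, 0.73, 5),
    ("0933141", "1133117", 0.42, 0.69, 27),
    ("0933141", "1434824", 0.46, 0.68, 22),
    ("0945004", "1007793", 0.40, 0.51, 11),
    ("0945004", "1046519", 0.52, 0.63, 11),
    ("0945004", "1133117", 0.26, 0.51, 25),
    ("0945004", "1434824", 0.46, 0.58, 12),
    ("1007793", "1046519", 0.63, 0.74, 11),
    ("1007793", "1133117", 0.49, 0.74, 25),
    ("1007793", "1434824", 0.52, 0.68, 16),
    ("1046519", "1133117", 0.42, 0.67, 25),
    ("1046519", "1434824", 0.57, 0.73, 16),
    ("1133117", "1434824", 0.40, 0.66, 26),
]


def increase_points(raw_sim, sem_sim):
    """Absolute gain of sem_sim over raw_sim, in whole percentage points."""
    return round((sem_sim - raw_sim) * 100)


recomputed = [increase_points(raw, sem) for _, _, raw, sem, _ in rows]
listed = [inc for *_, inc in rows]
best = max(rows, key=lambda row: increase_points(row[2], row[3]))

print("matches listed column:", recomputed == listed)
print("largest gain:", best[0], best[1], increase_points(best[2], best[3]), "points")
```

Under this reading, every listed value equals the recomputed difference, and the largest gain in the table is 27 points, for the pair 0933141 and 1133117.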
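Award number  Main research content of the project
0933141  Develop a new nanocrystalline mixed metal oxide catalyst that achieves the desired electrical conductivity and electrochemical properties. The project will also help explain how the structure of nanomaterials affects electrochemical stability and activity during electrochemical processes.
0945004  Use gel spinning technology to develop a high-strength, low-density carbon fiber based on carbon nanotubes (CNT). The fiber has a tensile strength above 7 GPa, a tensile modulus above 450 GPa, and a density below 1.2 g/cm3. It can be widely used in satellites, aircraft fuselages and wings, and high-performance automobiles.
1007793  Explore the mechanism by which graphene and carbon nanotubes (CNT) affect oxide catalysts, with attention to new growth variables for oxide catalysts.
1046519  Use highly crystalline double-wall carbon nanotubes (DWCNT) to prepare a new generation of structural fibers with high strength and toughness. The fiber can provide stronger, lighter structural fiber materials for fields such as vehicle ballistic protection and commercial aerospace.
1133117  Study the role of carbon nanotubes (CNT) themselves as catalysts in heterogeneous catalysis, especially in FT catalytic reactions. Enhance the catalytic efficiency of dispersed carbon fiber/carbon nanotube (CNT) particles.
1434824  A new carbon nanostructure engineering process called nanotube fusion. The method can create high-performance carbon fibers for use in aerospace, high-power-density energy storage, lightweight wiring, and related fields.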
