Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (6): 56-64     https://doi.org/10.11925/infotech.2096-3467.2017.06.06
Research Paper
An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature*
Bai Rujiang1, Leng Fuhai2, Liao Junhua1
1 Institute of Scientific and Technical Information, Shandong University of Technology, Zibo 255049, China
2 Institute of Policy and Management, Chinese Academy of Sciences, Beijing 100190, China
Abstract

[Objective] This paper uses the semantic chunk features of text to improve the performance of Cosine text similarity calculation. [Methods] We retrieved NSF-funded project data on carbon nanotube research and pre-processed it with stemming and part-of-speech (POS) tagging. We then annotated the semantic chunks of the text with a conditional random field model, computed an improved Cosine text similarity based on the semantic chunk features, and compared the results with those obtained from the unannotated data. [Results] For every document pair, the improved Cosine similarity based on semantic chunk features was higher than the Cosine similarity of the original text, to varying degrees; the largest increase in the experimental data was 26%. [Limitations] The method depends on the performance of semantic chunk annotation. [Conclusions] The proposed method effectively raises the semantic similarity between texts, reduces the dimensionality of the vector space model, and improves computing efficiency. It also shows good generalization ability and robustness.

Key words: Text Similarity    Semantic Chunks    Vector Space Model    Ontology
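The abstract describes the pipeline only at a high level, so the following is a minimal Python sketch of the general idea, not the authors' implementation. It assumes, hypothetically, that the chunk annotator emits BIO tags (B = chunk start, I = chunk continuation, O = outside any chunk), that each chunk is collapsed into a single feature, that tokens outside chunks are discarded, and that plain term counts serve as weights. None of these details are stated on this page. The function names `cosine` and `merge_chunks` and the two hand-tagged example documents are illustrative only.

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as dissimilar to everything
    return dot / (norm_a * norm_b)


def merge_chunks(tagged):
    """Collapse BIO-tagged tokens into one feature per chunk (assumed scheme).

    `tagged` is a list of (token, tag) pairs with tag in {'B', 'I', 'O'}.
    Tokens tagged 'O' are dropped; so is an 'I' that has no open chunk.
    """
    chunks, current = [], []
    for token, tag in tagged:
        if tag == "B":
            if current:
                chunks.append("_".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                chunks.append("_".join(current))
            current = []
    if current:
        chunks.append("_".join(current))
    return chunks


# Hypothetical, hand-tagged toy documents (not taken from the paper's data).
doc_a = [("novel", "O"), ("oxide", "B"), ("catalyst", "I"), ("for", "O"),
         ("carbon", "B"), ("nanotube", "I"), ("synthesis", "O")]
doc_b = [("carbon", "B"), ("nanotube", "I"), ("fiber", "O"), ("with", "O"),
         ("oxide", "B"), ("catalyst", "I"), ("coating", "O")]

raw_sim = cosine([t for t, _ in doc_a], [t for t, _ in doc_b])
chunk_sim = cosine(merge_chunks(doc_a), merge_chunks(doc_b))
print(f"raw={raw_sim:.2f} chunk={chunk_sim:.2f}")  # raw=0.57 chunk=1.00
```

In this toy case the two documents share both chunks, so the chunk-based vectors span two dimensions instead of the ten distinct raw tokens, and the cosine rises from 4/7 to 1. Real gains depend on annotation quality, as the Limitations note in the abstract points out.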
Received: 2017-04-27      Published: 2017-08-25
Chinese Library Classification (ZTFLH):  G250
Funding: *This work is one of the research outcomes of the National Social Science Fund of China project "Research on Identifying Future Emerging Scientific Research Fronts" (Grant No. 16BTQ083).
Cite this article:
Bai Rujiang, Leng Fuhai, Liao Junhua. An Improved Cosine Text Similarity Computing Method Based on Semantic Chunk Feature[J]. Data Analysis and Knowledge Discovery, 2017, 1(6): 56-64.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.06.06      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2017/V1/I6/56
  Semantic chunks for different research purposes
  Semantic chunks with different semantic roles
AwardNumber | Title | Program(s)
0933141 | Novel Catalyst Supports for Water Electrolysis: Experimental and Theoretical Studies | ENERGY FOR SUSTAINABILITY
0945004 | SBIR Phase I: Low Density Carbon Fibers Based on Gel Spun Polyacrylonitrile/Carbon Nanotube | SMALL BUSINESS PHASE I
1007793 | Materials World Network: Novel Catalyst Systems for Carbon Nanotube (CNT) Synthesis and their Underlying Mechanisms | SOLID STATE & MATERIALS CHEMIS, OFFICE OF SPECIAL PROGRAMS-DMR
1046519 | SBIR Phase I: Manufacturing of Double-Walled Carbon Nanotube/Rigid Rod Polymer Advanced Structural Fibers | SMALL BUSINESS PHASE I
1133117 | Collaborative Research: Experimental and Theoretical Investigations of Catalysis on Carbon Nanotube Surfaces For Selective Liquid Fuel Generation | CATALYSIS AND BIOCATALYSIS
1434824 | DMREF: Engineering Strong, Highly Conductive Nanotube Fibers Via Fusion | DMREF
  Basic information on the NSF projects
  Data preprocessing
  Semantic chunk annotation results (partial)
Doc_id_1 | Doc_id_2 | Raw_sim | Sem_sim | Increase
0933141 | 0945004 | 0.39 | 0.51 | 12%
0933141 | 1007793 | 0.63 | 0.71 | 8%
0933141 | 1046519 | 0.68 | 0.73 | 5%
0933141 | 1133117 | 0.42 | 0.69 | 27%
0933141 | 1434824 | 0.46 | 0.68 | 22%
0945004 | 1007793 | 0.40 | 0.51 | 11%
0945004 | 1046519 | 0.52 | 0.63 | 11%
0945004 | 1133117 | 0.26 | 0.51 | 25%
0945004 | 1434824 | 0.46 | 0.58 | 12%
1007793 | 1046519 | 0.63 | 0.74 | 11%
1007793 | 1133117 | 0.49 | 0.74 | 25%
1007793 | 1434824 | 0.52 | 0.68 | 16%
1046519 | 1133117 | 0.42 | 0.67 | 25%
1046519 | 1434824 | 0.57 | 0.73 | 16%
1133117 | 1434824 | 0.40 | 0.66 | 26%
  Similarity calculation experiment results
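Judging from the rows themselves, the Increase column appears to be the absolute difference Sem_sim - Raw_sim expressed in percentage points (for example, 0.51 - 0.39 = 0.12, reported as 12%), not a gain relative to Raw_sim. The sketch below recomputes the column under that reading from the values in the table. The names `RESULTS`, `increase_points`, and `relative_increase` are illustrative and do not come from the paper.

```python
# (doc_id_1, doc_id_2, raw_sim, sem_sim), copied from the results table.
RESULTS = [
    ("0933141", "0945004", 0.39, 0.51),
    ("0933141", "1007793", 0.63, 0.71),
    ("0933141", "1046519", 0.68, 0.73),
    ("0933141", "1133117", 0.42, 0.69),
    ("0933141", "1434824", 0.46, 0.68),
    ("0945004", "1007793", 0.40, 0.51),
    ("0945004", "1046519", 0.52, 0.63),
    ("0945004", "1133117", 0.26, 0.51),
    ("0945004", "1434824", 0.46, 0.58),
    ("1007793", "1046519", 0.63, 0.74),
    ("1007793", "1133117", 0.49, 0.74),
    ("1007793", "1434824", 0.52, 0.68),
    ("1046519", "1133117", 0.42, 0.67),
    ("1046519", "1434824", 0.57, 0.73),
    ("1133117", "1434824", 0.40, 0.66),
]


def increase_points(raw_sim, sem_sim):
    """Absolute gain in percentage points, rounded to the nearest integer."""
    return round((sem_sim - raw_sim) * 100)


def relative_increase(raw_sim, sem_sim):
    """Gain relative to the raw similarity (a different, larger figure)."""
    return (sem_sim - raw_sim) / raw_sim


increases = {(a, b): increase_points(raw, sem) for a, b, raw, sem in RESULTS}
best_pair = max(increases, key=increases.get)
mean_increase = sum(increases.values()) / len(increases)

print("largest gain:", best_pair, increases[best_pair], "points")
print("mean gain:", round(mean_increase, 1), "points")
```

The distinction matters when comparing against other studies: for the pair 0945004 and 1133117, the similarity moves from 0.26 to 0.51, which is 25 percentage points but roughly a 96% relative gain.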
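Award number | Main research content
0933141 | Develop a new nano-crystalline mixed metal oxide catalyst (Oxide Catalyst) with desirable electrical conductivity and electrochemical properties. The project also helps in understanding how nanomaterial structure affects electrochemical stability and activity during electrochemical processes.
0945004 | Use gel spinning technology (Gel Spun Technology) to develop a high-strength, low-density carbon fiber (Carbon Fiber) based on carbon nanotubes (Carbon NanoTube, CNT). The fiber has a tensile strength above 7 GPa, a tensile modulus above 450 GPa, and a density below 1.2 g/cm3. It can be widely used in satellites, aircraft fuselages, wings, and high-performance automobiles.
1007793 | Explore the mechanisms by which graphene (Graphene) and carbon nanotubes (Carbon NanoTube, CNT) affect oxide catalysts (Oxide Catalyst), with attention to new growth variables for oxide catalysts.
1046519 | Use highly crystalline double-walled carbon nanotubes (Double Wall Carbon Nanotube, DWCNT) to produce a new generation of structural fibers with high strength and toughness. The fibers can provide stronger, lighter structural materials for vehicle ballistic protection, commercial aerospace, and related fields.
1133117 | Study the role of carbon nanotubes (Carbon NanoTube, CNT) themselves as catalysts (Catalysts) in heterogeneous catalysis (Heterogeneous Catalysis), especially in FT catalytic reactions. Improve the catalytic efficiency of dispersed carbon fiber (Carbon Fiber)/carbon nanotube (CNT) particles.
1434824 | A new carbon nanostructure (Carbon Nanostructure) engineering process called nanotube fusion (NanoTube Fusion). The method can create high-performance carbon fibers (Carbon Fiber) for use in aerospace, high-power-density energy storage, and lightweight wiring.
  Manual interpretation results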
  Comparison of semantic similarity distances
  Pairwise text similarity comparison results