Data Analysis and Knowledge Discovery  2023, Vol. 7 Issue (12): 155-163     https://doi.org/10.11925/infotech.2096-3467.2022.1099
Research Paper
GKTR Retrieval Model for Engineering Consulting Reports with Graph Convolution Topological and Keyword Features*
Lyu Xueqiang, Du Yifan, Zhang Le, Pan Huiping, Tian Chi
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract

[Objective] To address insufficient semantic feature extraction in existing retrieval methods, this paper proposes GKTR, a retrieval model for engineering consulting reports that fuses graph convolution topological features with keyword features. [Methods] We built a text retrieval corpus of engineering consulting reports. The corpus is fed into a BERT model to obtain contextual vectors, which pass through a graph convolutional network and a deep interaction matching model to produce the first matching score. In parallel, paragraph keywords are mapped to vectors with a Word2Vec model, and their similarity to the title gives the second matching score. The final matching score is the average of the two. [Results] Combined with several text interaction matching models, GKTR improves P@20 by up to 3.06 percentage points over the joint ranking model CEDR. [Limitations] The experimental data come mainly from the engineering consulting reports of a large state-owned engineering consulting company, so performance in other domains remains to be verified. [Conclusions] On the engineering consulting report retrieval corpus, the GKTR model effectively improves text retrieval.

Keywords: Text Retrieval; Graph Convolution Network; Keywords; BERT; Joint Ranking
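The score fusion described in the abstract can be illustrated with a minimal sketch. The abstract does not say how keyword vectors are pooled or which similarity measure is used, so mean pooling and cosine similarity are assumptions here, as are the function names and the toy two-dimensional vectors. The first score is taken as already produced by the BERT, graph convolution, and interaction matching branch.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors (assumed pooling choice)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def keyword_score(keyword_vectors, title_vectors):
    """Second matching score: similarity between pooled keyword and title vectors."""
    return cosine(mean_vector(keyword_vectors), mean_vector(title_vectors))


def fuse(first_score, second_score):
    """Final matching score: the average of the two scores, as the abstract states."""
    return (first_score + second_score) / 2.0


# Toy stand-ins for Word2Vec embeddings (hypothetical values).
keyword_vectors = [[1.0, 0.0], [0.0, 1.0]]
title_vectors = [[1.0, 1.0]]

second = keyword_score(keyword_vectors, title_vectors)
final = fuse(0.6, second)  # 0.6 is a made-up first-branch score
print(round(second, 4), round(final, 4))
```

A plain average only makes sense when both scores lie on comparable scales; how the paper normalizes the first score is not stated on this page.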
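Received: 2022-10-21      Published: 2023-05-16
CLC number:  TP391; G35
Funding: *National Natural Science Foundation of China (62171043); Key Project of the State Language Commission (ZDI145-10)
Corresponding author: Zhang Le, ORCID: 0000-0002-9620-511X, E-mail: zhangle@bistu.edu.cn.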
Cite this article:
Lyu Xueqiang, Du Yifan, Zhang Le, Pan Huiping, Tian Chi. GKTR Retrieval Model for Engineering Consulting Reports with Graph Convolution Topological and Keyword Features. Data Analysis and Knowledge Discovery, 2023, 7(12): 155-163.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.1099      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I12/155
Fig. 1  Engineering consulting report retrieval model fusing graph convolution topological features and keyword features
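Paragraph text | Annotated keywords
The First Central Primary School of Jiugong Town, Daxing District, Beijing was founded in 1948, relocated and designated a central primary school in 1978, and renamed the First Central Primary School in 2001. To implement Daxing District's education plan and the requirement to house schools in multi-storey buildings, and to integrate Jiugong Town's educational resources…… | primary school (小学), educational resources (教育资源), layout (布局), education quality (教育质量)
Over the past two decades China has made great strides in all aspects of society and the economy. Vocational education, as an important component of the lifelong education system, has trained large numbers of high-quality workers and practical talent for the country's modernization…… | vocational education (职业教育), teaching quality (教学质量), skills (技能)
Table 1  Keyword annotation samples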
Title ID | Paragraph ID | Match rank | Similarity score
q1 | d18 | 1 | 25.20
q1 | d38 | 2 | 21.00
q1 | d35 | 3 | 20.74
q1 | d36 | 4 | 18.98
q1 | d657 | 5 | 18.98
Table 2  Training data labeling samples
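Topic word | Keywords
educational resources (教育资源) | educational resources (教育资源), supporting facilities (配套设施), resource allocation (资源配置), high-quality resources (优质资源)…
education development (教育发展) | education development (教育发展), high-quality development (高质量发展), diversified development (多元化发展), all-round development (全面发展)…
education layout (教育布局) | education layout (教育布局), resource layout (资源布局), educational structure layout (教育结构布局), coordinated layout (统筹布局)…
school places (学位) | school places (学位), enrollment rate (入学率), shortage of school places (学位缺口), insufficient school places (学位不足), enrollment pressure (入学压力)…
balance (均衡) | balance (均衡), balanced allocation (均衡配置), balanced development (均衡发展), high-quality balance (优质均衡), balanced resource allocation (均衡资源配置)…
Table 3  Sample entries of the topic word conversion dictionary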
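One way a dictionary shaped like Table 3 could be used is sketched below. This page does not describe how the dictionary enters the model, so the direction of conversion (keyword variant to topic word) is an assumption, and the function names are hypothetical. The entries are copied from two rows of Table 3, with the elided items left out.

```python
# Two rows from Table 3; the trailing "…" items in the source are not filled in.
TOPIC_DICT = {
    "教育资源": ["教育资源", "配套设施", "资源配置", "优质资源"],
    "学位": ["学位", "入学率", "学位缺口", "学位不足", "入学压力"],
}


def build_reverse_index(topic_dict):
    """Map every keyword variant to its topic word (first topic wins on conflicts)."""
    reverse = {}
    for topic, keywords in topic_dict.items():
        for keyword in keywords:
            reverse.setdefault(keyword, topic)
    return reverse


def to_topic_words(keywords, reverse_index):
    """Replace known variants with their topic word, keep unknown keywords as-is,
    and drop repeats while preserving first-seen order."""
    result = []
    for keyword in keywords:
        topic = reverse_index.get(keyword, keyword)
        if topic not in result:
            result.append(topic)
    return result


reverse_index = build_reverse_index(TOPIC_DICT)
print(to_topic_words(["学位缺口", "入学压力", "布局"], reverse_index))
```

Collapsing variants in this way reduces vocabulary sparsity before the keywords are embedded; whether the authors apply the dictionary at this stage is not confirmed by the page.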
Operating system | Linux
CPU | Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz
GPU | Tesla P4
Python | 3.6.9
PyTorch | 1.10.0
Table 4  Experimental environment
Ranking method | Model | P@20 (%)
Vanilla BERT | CEDR | 73.33
Vanilla BERT | GKTR | 76.39
DRMM | CEDR | 76.11
DRMM | GKTR | 78.24
KNRM | CEDR | 73.89
KNRM | GKTR | 75.34
PACRR | CEDR | 74.44
PACRR | GKTR | 75.97
Table 5  Experimental results
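The metric and the gains reported in Table 5 can be checked with a short sketch. `precision_at_k` follows the standard definition of P@k (relevant items among the top k, divided by k); the function names and the toy relevance list are illustrative, while the P@20 values are copied from Table 5.

```python
def precision_at_k(ranked_relevance, k=20):
    """P@k: share of relevant items among the top-k ranked results.

    ranked_relevance is a list of booleans in rank order; the denominator is k
    even when fewer than k results are returned.
    """
    top_k = ranked_relevance[:k]
    return sum(1 for is_relevant in top_k if is_relevant) / k


# P@20 (%) values from Table 5.
P_AT_20 = {
    "Vanilla BERT": {"CEDR": 73.33, "GKTR": 76.39},
    "DRMM": {"CEDR": 76.11, "GKTR": 78.24},
    "KNRM": {"CEDR": 73.89, "GKTR": 75.34},
    "PACRR": {"CEDR": 74.44, "GKTR": 75.97},
}


def gains_in_points(results):
    """GKTR minus CEDR per ranking method, in percentage points."""
    return {
        method: round(scores["GKTR"] - scores["CEDR"], 2)
        for method, scores in results.items()
    }


gains = gains_in_points(P_AT_20)
best_method = max(gains, key=gains.get)
print(gains)
print(best_method, gains[best_method])

# Toy relevance list (hypothetical): 15 relevant results in the top 20.
print(precision_at_k([True] * 15 + [False] * 5))
```

The differences are 3.06, 2.13, 1.45, and 1.53 percentage points, so the largest gain is 3.06 points with Vanilla BERT, matching the figure quoted in the abstract.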
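Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing  Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn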