GKTR Retrieval Model for Engineering Consulting Reports with Graph Convolution Topological and Keyword Features*

doi:10.11925/infotech.2096-3467.2022.1099

Data Analysis and Knowledge Discovery, 2023, Vol. 7, Issue (12): 155-163 https://doi.org/10.11925/infotech.2096-3467.2022.1099

Research Article

Lyu Xueqiang, Du Yifan, Zhang Le (corresponding author), Pan Huiping, Tian Chi

Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China


Abstract

[Objective] To address the insufficient semantic feature extraction of existing retrieval methods, this paper proposes GKTR, a retrieval model for engineering consulting reports that fuses graph convolution topological features with keyword features. [Methods] We built a text retrieval corpus of engineering consulting reports. The corpus is fed into a BERT model to obtain contextual vectors, which pass through a graph convolutional network and a deep interaction matching model to produce the first matching score. In parallel, paragraph keywords are mapped to vectors with a Word2Vec model, and their similarity to the title gives the second matching score. The final matching score is the average of the two scores. [Results] Combined with several text interaction matching models, GKTR improved P@20 by up to 3.06 percentage points over the joint ranking model CEDR. [Limitations] The experimental data came mainly from the engineering consulting reports of a large state-owned engineering consulting company, so the model's effectiveness in other domains remains to be verified. [Conclusions] On the engineering consulting report retrieval corpus, the GKTR model effectively improves text retrieval performance.
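The two-branch scoring described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy embedding table stands in for a trained Word2Vec model, the value passed as `interaction_score` is a placeholder for the output of the BERT, graph convolution, and interaction-matching branch, and mean-pooled vectors with cosine similarity are assumed because the abstract does not say how the keyword-title similarity is computed. All function names are hypothetical.

```python
import numpy as np

# Hypothetical toy embedding table standing in for a trained Word2Vec model.
TOY_VECTORS = {
    "bridge": np.array([1.0, 0.0, 0.0]),
    "design": np.array([0.0, 1.0, 0.0]),
    "cost": np.array([0.0, 0.0, 1.0]),
}


def mean_vector(tokens, vectors):
    """Average the vectors of known tokens; zero vector if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        dim = len(next(iter(vectors.values())))
        return np.zeros(dim)
    return np.mean(known, axis=0)


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def keyword_score(paragraph_keywords, title_tokens, vectors=TOY_VECTORS):
    """Second matching score: similarity of paragraph keywords to the title."""
    return cosine(mean_vector(paragraph_keywords, vectors),
                  mean_vector(title_tokens, vectors))


def final_score(interaction_score, kw_score):
    """Final matching score: plain average of the two branch scores."""
    return (interaction_score + kw_score) / 2.0


# 0.8 is an arbitrary placeholder for the first (interaction) matching score.
kw = keyword_score(["bridge", "design"], ["bridge", "cost"])
print(final_score(0.8, kw))
```

A plain average presumes that both scores lie on comparable scales; how the paper normalizes them is not stated in the abstract.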

Keywords: Text Retrieval, Graph Convolution Network, Keywords, BERT, Joint Ranking

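Received: 2022-10-21 Published: 2023-05-16

CLC number (ZTFLH): TP391; G35

Funding: *National Natural Science Foundation of China (62171043); Key Project of the State Language Commission (ZDI145-10)

Corresponding author: Zhang Le, ORCID: 0000-0002-9620-511X, E-mail: zhangle@bistu.edu.cn.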

Cite this article:

Lyu Xueqiang, Du Yifan, Zhang Le, Pan Huiping, Tian Chi. GKTR Retrieval Model for Engineering Consulting Reports with Graph Convolution Topological and Keyword Features. Data Analysis and Knowledge Discovery, 2023, 7(12): 155-163.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022.1099 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I12/155

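Fig.1 Engineering consulting report retrieval model fusing graph convolution topological features and keyword features

Table 1 Keyword annotation examples

Table 2 Training data labeling examples

Table 3 Examples from the subject-term conversion dictionary

Table 4 Experimental environment

Table 5 Experimental results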
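
The abstract reports results in P@20, that is, precision over the top 20 retrieved items. As an aid for reading the results table, here is a minimal sketch of the standard P@k definition. The function name and the toy relevance labels are illustrative assumptions and are not taken from the paper.

```python
def precision_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the top-k ranked items that are relevant.

    Divides by k even when fewer than k items were retrieved, which is a
    common convention; some toolkits divide by the number retrieved instead.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant)
    return hits / k


# Toy ranking: 3 of the top 5 paragraphs are labeled relevant.
ranking = ["p3", "p7", "p1", "p9", "p4", "p2"]
relevant = {"p3", "p1", "p4", "p8"}
print(precision_at_k(ranking, relevant, k=5))  # 0.6
```

Under this definition, a gain of 3.06 percentage points corresponds to, for example, P@20 rising from 0.5000 to 0.5306. These two values are illustrative only and are not figures from the paper.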

