(National Science Library, Chinese Academy of Sciences, Beijing 100190, China)
(Department of Library, Information and Archives Management, School of Economics and Management, University of Chinese Academy of Sciences, Beijing 100190, China)
[Objective] This paper identifies content semantically similar to the novelty points in preliminary search results, aiming to automatically retrieve the journal articles and patents needed for a novelty search. [Methods] First, we designed a deep multi-task hierarchical classification model based on Bi-GRU-ATT. Second, we trained several hierarchical classification models on patent data, using International Patent Classification (IPC) categories as labels. Third, we fine-tuned the model with a small amount of journal article data so that it handles both papers and patents. Finally, we compared the semantic categories of the novelty points with those of the candidate records and collected the matching ones. [Results] For the two-level classification of patents under IPC class E21B, the proposed model achieved precisions of 82.37% and 73.55% respectively, outperforming the benchmark models. On real novelty search point data, the precision of semantic matching reached 88.13%, which was 15.16% higher than that of TF-IDF. [Limitations] We only evaluated the model on a small number of IPC categories. [Conclusions] The proposed method improves the semantic matching of novelty search points.
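The abstract names, but does not detail, the Bi-GRU-ATT architecture. As a rough illustration only, and not the authors' implementation, the core idea can be sketched as follows: a bidirectional GRU encodes the input tokens, a feed-forward attention layer pools the hidden states into a document vector, and two softmax heads, one per IPC level, share that encoder, which is the multi-task component. A minimal PyTorch sketch, with all class names, dimensions, and the joint-loss setup assumed for illustration:

    import torch
    import torch.nn as nn

    class BiGRUAttClassifier(nn.Module):
        """Hypothetical Bi-GRU encoder with additive attention and two
        classification heads, one per IPC level (multi-task setup)."""

        def __init__(self, vocab_size, emb_dim=128, hid_dim=128,
                     n_level1=8, n_level2=64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
            self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True,
                              bidirectional=True)
            # Feed-forward attention over the Bi-GRU outputs.
            self.att = nn.Linear(2 * hid_dim, 1)
            # One head per IPC level; both share the encoder.
            self.head_l1 = nn.Linear(2 * hid_dim, n_level1)
            self.head_l2 = nn.Linear(2 * hid_dim, n_level2)

        def forward(self, token_ids):
            h, _ = self.gru(self.embed(token_ids))       # (B, T, 2H)
            weights = torch.softmax(self.att(h), dim=1)  # (B, T, 1)
            doc = (weights * h).sum(dim=1)               # pooled (B, 2H)
            return self.head_l1(doc), self.head_l2(doc)

    # Joint training signal over both levels; fine-tuning on paper data
    # would reuse the encoder weights and update (or retrain) the heads.
    model = BiGRUAttClassifier(vocab_size=30000)
    x = torch.randint(1, 30000, (4, 50))                 # toy batch of token ids
    logits1, logits2 = model(x)
    targets = torch.zeros(4, dtype=torch.long)           # dummy labels
    loss = (nn.functional.cross_entropy(logits1, targets)
            + nn.functional.cross_entropy(logits2, targets))

Sharing one encoder across both classification levels is what makes the setup multi-task: the level-1 and level-2 losses jointly shape the document representation, which the paper then compares between novelty points and candidate records.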