Patent Keyphrase Extraction Based on Patent Term and Layer Information

Data Analysis and Knowledge Discovery, 2023, Vol. 7, Issue (6): 99-112
https://doi.org/10.11925/infotech.2096-3467.2020.0577

Research Article

Yu Yan¹,², Wang Li¹, Zheng Siyu¹

¹Institute of Information Management and Technology, Nanjing Tech University, Nanjing 210009, China
²College of Electronic and Computer Engineering, Southeast University Chengxian College, Nanjing 210088, China

Abstract

[Objective] Graph-based models for patent keyphrase extraction tend to select long keyphrases and ignore where a phrase occurs in the document. This paper proposes a patent keyphrase extraction method that incorporates termhood and layer information to improve extraction accuracy. [Methods] Building on the traditional graph-based model, we propose a new termhood metric to measure the terminological information of candidate keyphrases. Based on the characteristics of patent documents, we divide each patent into several layers and propose a layer weight metric to measure the positional information of candidate keyphrases. [Results] Incorporating term information yields relative F-value improvements of 7.615% (nanotechnology), 11.515% (image recognition), 9.813% (chip), and 8.839% (LCD). Incorporating layer information yields relative F-value improvements of 9.880% (nanotechnology), 6.929% (image recognition), 6.099% (chip), and 5.576% (LCD). [Limitations] The candidate keyphrase selection method, which relies on part-of-speech rules, produces considerable noise. [Conclusions] Using termhood and layer information effectively improves the accuracy of patent keyphrase extraction.
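The abstract does not give the scoring formulas, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a plain PageRank over a sentence co-occurrence graph, a phrase score obtained by summing word scores (which is what favors long phrases), and a multiplicative combination with termhood and layer weight. The function names, the multiplicative combination, and all numeric values are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def word_pagerank(sentences, damping=0.85, iterations=50):
    """Score words with PageRank on an undirected sentence co-occurrence graph."""
    neighbors = defaultdict(set)
    for words in sentences:
        for a, b in combinations(set(words), 2):
            neighbors[a].add(b)
            neighbors[b].add(a)
    nodes = list(neighbors)
    scores = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's current score.
        scores = {
            w: (1 - damping) / len(nodes)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in nodes
        }
    return scores


def rank_phrases(candidates, sentences, termhood, layer_weight):
    """Rank candidate phrases (tuples of words), best first.

    termhood and layer_weight map a phrase to a multiplier (default 1.0).
    Multiplying the three factors is an assumption made for illustration.
    """
    word_scores = word_pagerank(sentences)
    ranked = {}
    for phrase in candidates:
        # Summing word scores grows with phrase length: the bias noted above.
        graph_score = sum(word_scores.get(w, 0.0) for w in phrase)
        ranked[phrase] = (
            graph_score
            * termhood.get(phrase, 1.0)
            * layer_weight.get(phrase, 1.0)
        )
    return sorted(ranked, key=ranked.get, reverse=True)


if __name__ == "__main__":
    # Toy, hand-made data; the weights below are invented for the example.
    sentences = [
        ["liquid", "crystal", "display", "panel"],
        ["display", "panel", "backlight"],
        ["liquid", "crystal", "layer"],
    ]
    candidates = [
        ("liquid", "crystal", "display", "panel"),
        ("display", "panel"),
        ("backlight",),
    ]
    termhood = {("display", "panel"): 1.5}      # hypothetical termhood values
    layer_weight = {("backlight",): 2.0}        # hypothetical layer weights
    for phrase in rank_phrases(candidates, sentences, termhood, layer_weight):
        print(" ".join(phrase))
```

With empty `termhood` and `layer_weight` dictionaries, the longest candidate always ranks first in this toy setup, because its score is a sum over more words. The two multipliers are the place where terminological and positional evidence can counteract that bias.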

Keywords: Patent, Keyphrase Extraction, Term, Layer

Received: 2020-06-17    Published: 2023-08-09

CLC number: G202

Corresponding author: Yu Yan, ORCID: 0000-0002-9654-8614, E-mail: yuyanyuyan2004@126.com

Cite this article:

Yu Yan, Wang Li, Zheng Siyu. Patent Keyphrase Extraction Based on Patent Term and Layer Information. Data Analysis and Knowledge Discovery, 2023, 7(6): 99-112.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2020.0577 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2023/V7/I6/99

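Table 1 Examples of problems with graph-based keyphrase extraction

Fig. 1 Research workflow of this paper

Table 2 Examples of termhood of candidate keyphrases

Fig. 2 Layers of patent text

Table 3 Examples of patent text layer weights

Table 4 Examples of candidate keyphrase scores incorporating termhood and layer weight

Table 5 Dataset information

Fig. 3 Evaluation results for the importance feature

Fig. 4 Evaluation results for the termhood feature

Fig. 5 Evaluation results for the layer weight feature

Fig. 6 Comparison with improved graph-based methods

Fig. 7 Comparison with typical methods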
