Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 308-317    DOI: 10.11925/infotech.2096-3467.2021.0972
Extracting Chinese Patent Keywords with LSTM and Logistic Regression
Wei Tingting1, Jiang Tao1, Zheng Shuling2, Zhang Jiantao1
1College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
2Patent Examination Cooperation Guangdong Center of the Patent Office, Guangzhou 510535, China
Abstract  

[Objective] This paper proposes a method for extracting keywords from Chinese patents based on LSTM and logistic regression, aiming to identify low-frequency and long-tail keywords effectively. [Methods] First, we combined an LSTM neural network with a logistic regression model to extract candidate keywords. Then, we applied recombination and filtering rules to the candidates to obtain the target keywords. [Results] The extraction accuracy for all keywords, low-frequency keywords, long-tail keywords, and low-frequency long-tail keywords was 5, 24, 11 and 26 percentage points higher, respectively, than that of the best-performing existing method. [Limitations] The proposed model classifies keywords by thresholds, so words that fall near a threshold may be classified imprecisely. [Conclusions] The new model can effectively discover low-frequency and long (many-character) key terms in texts, which benefits patent analysis and other services.

Key words: Extraction; LSTM Neural Network; Logistic Regression; Reorganization Filtering
Received: 01 September 2021      Published: 14 April 2022
ZTFLH:  TP393  
Fund: Young Talents Program of Colleges and Universities of Chinese Guangdong Province Office of Education (2019KQNCX012); Regional Joint Fund of Chinese Guangdong Province (2019A1515110396); Project of Humanities and Social Sciences, Ministry of Education (20YJC740067)
Corresponding Author: Zhang Jiantao, ORCID: 0000-0002-1646-2643, E-mail: zhangjiantao@yeah.net

Cite this article:

Wei Tingting, Jiang Tao, Zheng Shuling, Zhang Jiantao. Extracting Chinese Patent Keywords with LSTM and Logistic Regression. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 308-317.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0972     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I2/3/308

The LSTM Memory Block
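The memory block named in the caption above is the standard LSTM cell. As a rough illustration only (not the authors' implementation), one forward step of a single memory block can be sketched in plain Python. The weights are arbitrary placeholder values, and the names `lstm_step` and `make_params` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dot(weights, values):
    return sum(w * v for w, v in zip(weights, values))


def lstm_step(x, h_prev, c_prev, params):
    """One forward step of a standard LSTM memory block.

    params maps each gate name ("i", "f", "g", "o") to a tuple (W, U, b):
    input weights, recurrent weights, and bias, with one row per hidden unit.
    """
    def gate(name, activation):
        W, U, b = params[name]
        return [activation(dot(W[j], x) + dot(U[j], h_prev) + b[j])
                for j in range(len(b))]

    i = gate("i", sigmoid)    # input gate: how much new information to store
    f = gate("f", sigmoid)    # forget gate: how much old cell state to keep
    g = gate("g", math.tanh)  # candidate cell state
    o = gate("o", sigmoid)    # output gate: how much cell state to expose
    c = [f[j] * c_prev[j] + i[j] * g[j] for j in range(len(c_prev))]
    h = [o[j] * math.tanh(c[j]) for j in range(len(c))]
    return h, c


def make_params(input_dim, hidden_dim, value=0.1):
    """Placeholder weights (an assumption for illustration, not trained values)."""
    shared = ([[value] * input_dim for _ in range(hidden_dim)],
              [[value] * hidden_dim for _ in range(hidden_dim)],
              [0.0] * hidden_dim)
    return {name: shared for name in ("i", "f", "g", "o")}


params = make_params(input_dim=3, hidden_dim=2)
h, c = lstm_step([1.0, 0.5, -0.5], [0.0, 0.0], [0.0, 0.0], params)
```

Starting from zero hidden and cell states, the forget gate has nothing to retain, so the new cell state is simply the input gate times the candidate value.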
Keyword Recombination, Screening and Filtering Diagram
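Number of component words | Occurrences | Average number of Chinese characters | Share
1 | 2,329 | 2.28 | 40.27%
2 | 2,465 | 3.79 | 42.62%
3 | 706 | 5.72 | 12.21%
4 | 199 | 7.60 | 3.44%
5 | 60 | 9.85 | 1.04%
6 | 14 | 11.43 | 0.24%
7 | 7 | 13.71 | 0.12%
8 | 1 | 17 | 0.02%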
Patent Keyword Construction and Word Statistics
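Category | Keyword examples
Long-tail words | 蒜苔采集装置 (garlic sprout harvesting device); 封尾贴标机构 (tail-sealing and labeling mechanism); 余料剪切装置 (scrap cutting device); 升降装置 (lifting device); 活动手柄 (movable handle); 气动折入边装置 (pneumatic tuck-in selvedge device); 全自动水果采摘装置 (fully automatic fruit-picking device); 剑杆织机 (rapier loom); 可伸缩剪臂 (telescopic scissor arm); 动力组件 (power assembly); 摆臂连动组件 (swing-arm linkage assembly); 陶瓷茶叶罐 (ceramic tea canister)
Low-frequency words | 跷跷板 (seesaw); 铰接扣 (hinged buckle); 螺栓 (bolt); 剪刀车 (scissor lift); 摆动杆 (swing rod); 可伸缩剪臂 (telescopic scissor arm); 动力组件 (power assembly); 摆臂连动组件 (swing-arm linkage assembly); 陶瓷茶叶罐 (ceramic tea canister); 全自动水果采摘装置 (fully automatic fruit-picking device); 剑杆织机 (rapier loom); 气动折入边装置 (pneumatic tuck-in selvedge device)
Low-frequency long-tail words | 全自动水果采摘装置 (fully automatic fruit-picking device); 气动折入边装置 (pneumatic tuck-in selvedge device); 剑杆织机 (rapier loom); 可伸缩剪臂 (telescopic scissor arm); 动力组件 (power assembly); 摆臂连动组件 (swing-arm linkage assembly); 陶瓷茶叶罐 (ceramic tea canister)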
Examples of Various Keywords in the Patent Data
Parameter set | input-dim | hidden-dim | layer-dim | batch-size
arg1 | 3 | 10 | 2 | 50
arg2 | 5 | 8 | 1 | 60
arg3 | 4 | 5 | 2 | 80
arg4 | 2 | 4 | 2 | 90
arg5 | 3 | 7 | 1 | 73
Specific Parameter Settings
Parameter set | P | R | F1
arg1 | 64% | 44% | 52%
arg2 | 63% | 43% | 51%
arg3 | 65% | 43% | 52%
arg4 | 67% | 46% | 54%
arg5 | 69% | 48% | 57%
Performance of Different Parameters on the Datasets
Parameter | Value
Loss function | nn.CrossEntropyLoss()
optimizer | Adam
batch_size | 73
num_epoch | 100
lr | 0.001
input_dim | 3
hidden_dim | 7
layer_dim | 1
output_dim | 2
Final Parameter Settings of the Model
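To make the settings in the table concrete, the sketch below collects them in one place and counts the trainable parameters they would imply. It assumes the parameter layout of PyTorch's `nn.LSTM` (four gates, each with input weights, recurrent weights, and two bias vectors per layer) followed by a linear layer that maps the hidden state to `output_dim` classes. The linear head and the names `ModelConfig` and `count_parameters` are assumptions for illustration, not details confirmed by the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    # Values copied from the final parameter table
    input_dim: int = 3
    hidden_dim: int = 7
    layer_dim: int = 1
    output_dim: int = 2
    batch_size: int = 73
    num_epoch: int = 100
    lr: float = 0.001


def count_parameters(cfg):
    """Count trainable parameters for a stacked LSTM plus an assumed linear head."""
    total = 0
    layer_input = cfg.input_dim
    for _ in range(cfg.layer_dim):
        # 4 gates x hidden units x (input weights + recurrent weights + 2 biases)
        total += 4 * cfg.hidden_dim * (layer_input + cfg.hidden_dim + 2)
        layer_input = cfg.hidden_dim  # deeper layers read the previous layer's output
    # Assumed linear head: hidden_dim -> output_dim, with bias
    total += cfg.hidden_dim * cfg.output_dim + cfg.output_dim
    return total


cfg = ModelConfig()
n_params = count_parameters(cfg)
print(n_params)  # 352
```

Under these assumptions the network is very small (336 LSTM parameters plus 16 in the head), which is consistent with the low input and hidden dimensions in the table.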
Module | P | R | F1
With recombination filter layer | 69% | 48% | 57%
Without recombination filter layer | 34% | 22% | 27%
Extraction Performance of the Recombination Filter Layer on All Keywords
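Module | Long-tail words | Low-frequency words | Low-frequency long-tail words
With recombination filter layer | 48% | 43% | 46%
Without recombination filter layer | 15% | 17% | 18%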
Extraction Accuracy of the Recombination Filter Layers on Different Keywords
Algorithm | P | R | F1
LR | 69% | 48% | 57%
SVM | 61% | 40% | 48%
Extraction Performance of the Different Classification Models Across All Keywords
Method | P | R | F1
TF-IDF | 49% | 30% | 38%
LDA | 44% | 26% | 32%
TextRank | 47% | 29% | 36%
LSTM-LDA | 64% | 39% | 48%
LLWR | 69% | 48% | 57%
Extraction Performance of the Various Methods on All Keywords
Method | Long-tail words | Low-frequency words | Low-frequency long-tail words
TF-IDF | 32% | 18% | 18%
LDA | 20% | 19% | 18%
TextRank | 32% | 19% | 19%
LSTM-LDA | 37% | 19% | 20%
LLWR | 48% | 43% | 46%
Extraction Accuracy of Various Methods on Different Keyword Types
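The gaps between LLWR and the baselines in the two tables above can be tallied directly. The snippet below is a small worked check, not part of the original study: it copies the precision figures for all keywords and the accuracy figures for each keyword type from the tables, then computes LLWR's lead over the strongest baseline in each category, in percentage points. The dictionary layout and the name `gain_over_best_baseline` are illustrative.

```python
# Scores in percent, copied from the two comparison tables
scores = {
    "TF-IDF":   {"all": 49, "long_tail": 32, "low_freq": 18, "low_freq_long_tail": 18},
    "LDA":      {"all": 44, "long_tail": 20, "low_freq": 19, "low_freq_long_tail": 18},
    "TextRank": {"all": 47, "long_tail": 32, "low_freq": 19, "low_freq_long_tail": 19},
    "LSTM-LDA": {"all": 64, "long_tail": 37, "low_freq": 19, "low_freq_long_tail": 20},
    "LLWR":     {"all": 69, "long_tail": 48, "low_freq": 43, "low_freq_long_tail": 46},
}


def gain_over_best_baseline(scores, target="LLWR"):
    """Return the target's lead over the best other method, per category."""
    gains = {}
    for category, value in scores[target].items():
        best = max(s[category] for name, s in scores.items() if name != target)
        gains[category] = value - best
    return gains


gains = gain_over_best_baseline(scores)
print(gains)
# {'all': 5, 'long_tail': 11, 'low_freq': 24, 'low_freq_long_tail': 26}
```

These differences correspond to the 5, 24, 11 and 26 figures quoted in the abstract for all, low-frequency, long-tail, and low-frequency long-tail keywords.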
k | P | R | F1
2 | 58% | 28% | 37%
3 | 69% | 48% | 57%
4 | 62% | 60% | 61%
5 | 56% | 76% | 64%
6 | 53% | 82% | 64%
7 | 49% | 83% | 61%
Comparison of Different Numbers of Extracted Keywords (k)
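F1 is the harmonic mean of precision and recall, so the F1 column in the table above can be re-derived from the P and R columns. The check below recomputes F1 for each k. Because the published P and R values are already rounded, a recomputed value may differ from the table by one point. Variable names are illustrative.

```python
def f1(p, r):
    """Harmonic mean of precision and recall (same units in, same units out)."""
    return 2 * p * r / (p + r) if p + r else 0.0


# k -> (P, R, published F1), all in percent, copied from the table
rows = {
    2: (58, 28, 37),
    3: (69, 48, 57),
    4: (62, 60, 61),
    5: (56, 76, 64),
    6: (53, 82, 64),
    7: (49, 83, 61),
}

recomputed = {k: round(f1(p, r)) for k, (p, r, _) in rows.items()}

for k, (p, r, published) in rows.items():
    print(f"k={k}: P={p}% R={r}% published F1={published}% recomputed F1={recomputed[k]}%")
```

From k=3 upward, precision falls as k grows while recall keeps rising, which is why F1 peaks in the middle of the range rather than at either end.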
Number of component words | Occurrences (Machine-Paper) | Occurrences (NLP-Paper) | Average character length (Machine-Paper) | Average character length (NLP-Paper) | Share (Machine-Paper) | Share (NLP-Paper)
1 | 1,246 | 1,522 | 3.01 | 3.32 | 32.43% | 29.19%
2 | 1,908 | 2,785 | 4.25 | 4.37 | 49.66% | 53.41%
3 | 545 | 728 | 6.02 | 6.22 | 14.19% | 13.96%
4 | 114 | 154 | 7.95 | 8.22 | 2.97% | 2.95%
5 | 20 | 19 | 9.4 | 10.26 | 4.89% | 0.36%
6 | 5 | 5 | 11 | 11 | 1.43% | 0.09%
7 | 2 | 0 | 13 | 0 | 0.68% | 0
8 | 1 | 1 | 15 | 21 | 0.03% | 0.02%
9 | 1 | 0 | 20 | 0 | 0.03% | 0
Keyword Statistics for the Two Paper Datasets
Method | P (Machine-Paper) | P (NLP-Paper) | R (Machine-Paper) | R (NLP-Paper) | F1 (Machine-Paper) | F1 (NLP-Paper)
TF-IDF | 52% | 54% | 31% | 33% | 39% | 41%
LDA | 45% | 43% | 27% | 25% | 33% | 31%
TextRank | 56% | 54% | 32% | 33% | 41% | 40%
LSTM-LDA | 62% | 64% | 37% | 40% | 47% | 49%
LLWR | 67% | 66% | 41% | 42% | 50% | 51%
Extraction Performance of Various Methods on All Keywords in the Paper Dataset
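Method | Long-tail words (Machine-Paper) | Long-tail words (NLP-Paper) | Low-frequency words (Machine-Paper) | Low-frequency words (NLP-Paper) | Low-frequency long-tail words (Machine-Paper) | Low-frequency long-tail words (NLP-Paper)
TF-IDF | 23% | 21% | 19% | 31% | 17% | 32%
LDA | 19% | 23% | 11% | 6% | 12% | 5%
TextRank | 22% | 29% | 18% | 30% | 16% | 30%
LSTM-LDA | 27% | 35% | 20% | 32% | 16% | 33%
LLWR | 35% | 39% | 27% | 32% | 23% | 35%
Extraction Accuracy of Various Methods on Different Keyword Types in the Paper Datasets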