Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (5): 68-76    DOI: 10.11925/infotech.2096-3467.2018.0659
Extracting Titles from Scientific References in Patents with Fusion of Representation Learning and Machine Learning
Jinzhu Zhang, Yiming Hu
School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China

[Objective] This paper aims to automatically identify scientific references in patents (SRPs) and then extract their titles to support in-depth data mining. [Methods] First, we used the Doc2Vec method to generate vectors for the patent citations. Second, we identified the SRPs with a support vector machine (SVM). Third, we created vectors for the metadata (such as titles) of the SRPs and extracted the titles with an SVM. [Results] We evaluated the proposed method on patent citations from the genetics field. The accuracy of SRP recognition and title extraction reached 99.27% and 92.59%, respectively. The latter was 5.96% higher than that of traditional methods. [Limitations] Manually tagging the training set was very time-consuming, and the experimental data must meet certain format requirements. [Conclusions] The proposed method can effectively identify scientific references among patent citations and extract their titles.
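
The pipeline described in the abstract (turn each patent citation into a vector, then classify the vector with an SVM) can be illustrated with a minimal sketch. This is not the authors' implementation. To stay dependency-free, it swaps Doc2Vec for a normalized binary bag-of-words vector and trains a linear SVM without a bias term by batch subgradient descent on the hinge loss. All function names, hyperparameters, and citation strings below are invented for illustration.

```python
import math
import re


def tokenize(text):
    """Lowercase alphabetic tokens; digits and punctuation are dropped."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocab(texts):
    """Map every training token to a column index."""
    vocab = {}
    for text in texts:
        for tok in tokenize(text):
            vocab.setdefault(tok, len(vocab))
    return vocab


def vectorize(text, vocab):
    """Binary bag-of-words vector, L2-normalized (a stand-in for Doc2Vec)."""
    vec = [0.0] * len(vocab)
    for tok in set(tokenize(text)):
        if tok in vocab:  # tokens unseen in training are ignored
            vec[vocab[tok]] = 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Batch subgradient descent on the L2-regularized hinge loss."""
    w = [0.0] * len(X[0])
    n = len(X)
    for _ in range(epochs):
        grad = [lam * wi for wi in w]
        for xi, yi in zip(X, y):
            margin = yi * sum(wj * xj for wj, xj in zip(w, xi))
            if margin < 1:  # example violates the margin: hinge term is active
                for j, xj in enumerate(xi):
                    grad[j] -= yi * xj / n
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


def predict(w, x):
    """Return +1 (scientific reference) or -1 (other citation)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


# Hypothetical toy citations: +1 = scientific reference, -1 = patent reference.
train_texts = [
    "Smith J, Lee K. Gene editing in maize. Journal of Plant Science, 2001, 12(3): 45-67.",
    "Wang L, et al. Protein folding pathways. Proceedings of the Genome Conference, 1999: 10-20.",
    "Brown T. RNA interference in yeast. Journal of Genetics, 2005, 8(1): 1-9.",
    "US 5123456 A, Jones, patent issued 1992",
    "EP 0456789 B1, Mueller, patent application 1995",
    "US 6789012 B2, Tanaka, patent issued 2003",
]
train_labels = [1, 1, 1, -1, -1, -1]

vocab = build_vocab(train_texts)
X = [vectorize(t, vocab) for t in train_texts]
w = train_linear_svm(X, train_labels)

new_citation = "Garcia M. DNA repair in mice. Journal of Cell Biology, 2010, 5(2): 100-110."
print(predict(w, vectorize(new_citation, vocab)))
```

According to the abstract, a second vectorize-and-classify step is then applied to the metadata of each SRP to pick out the title; that stage is not shown here.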

Keywords: Scientific References in Patent; Metadata Extraction; Machine Learning; Representation Learning
Received: 20 June 2018      Published: 03 July 2019

Cite this article:

Jinzhu Zhang, Yiming Hu. Extracting Titles from Scientific References in Patents with Fusion of Representation Learning and Machine Learning. Data Analysis and Knowledge Discovery, 2019, 3(5): 68-76.

