New Technology of Library and Information Service  2016, Vol. 32 Issue (6): 28-36    DOI: 10.11925/infotech.1003-3513.2016.06.04
Original Article
Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields
Wang Miping, Wang Hao, Deng Sanhong, Wu Zhixiang
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This paper proposes a model for effectively extracting Chinese metallurgy patent terms. [Methods] We built a model based on conditional random fields (CRFs) to automatically identify Chinese metallurgy patent terms. The model was tested on an incomplete core corpus. We describe how the model was developed and compare the impact of various CRFs factors on this character-role-labeling model. [Results] The new model combines the character sequences, level features, areal features and temperature features of the patent terms. With the proximity window length and the parameters c and f set to 3, 1, and 1 respectively, its precision was 94.26%, its recall was 94.37%, and its F1 value was 94.5%. [Limitations] Some term labels were not accurate enough because the core corpus was incomplete. We did not compare our model with other methods to assess the reliability of CRFs. [Conclusions] The CRFs model can effectively identify Chinese metallurgy patent terms under appropriate working conditions.
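
The character-role labeling and proximity window mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the B/I/E/S/O tag set, the helper names `label_characters` and `window_features`, the example sentence, and the reading of "window length 3" as the previous, current, and next character are all assumptions made for illustration. The level, areal, and temperature features and the CRFs training step (including the parameters c and f) are not reproduced, since the abstract does not define them.

```python
from typing import Dict, List, Tuple


def label_characters(sentence: str, terms: List[str]) -> List[Tuple[str, str]]:
    """Tag each character with an assumed role: B/I/E inside a term,
    S for a single-character term, O outside any term."""
    tags = ["O"] * len(sentence)
    # Longer terms are matched first so they are not split by shorter ones.
    for term in sorted(terms, key=len, reverse=True):
        start = sentence.find(term)
        while start != -1:
            end = start + len(term)
            # Only label spans that no earlier term has claimed.
            if all(tag == "O" for tag in tags[start:end]):
                if len(term) == 1:
                    tags[start] = "S"
                else:
                    tags[start] = "B"
                    for i in range(start + 1, end - 1):
                        tags[i] = "I"
                    tags[end - 1] = "E"
            start = sentence.find(term, start + 1)
    return list(zip(sentence, tags))


def window_features(chars: List[str], i: int, window: int = 3) -> Dict[str, str]:
    """Collect the characters in a window centred on position i.
    Assumption: window=3 means offsets -1, 0, +1."""
    half = window // 2
    feats = {}
    for offset in range(-half, half + 1):
        j = i + offset
        feats[f"char[{offset}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


# Hypothetical example sentence and term list (not taken from the paper's corpus).
sentence = "一种高炉炼铁方法"
labeled = label_characters(sentence, ["高炉", "炼铁"])
features = [window_features(list(sentence), i) for i in range(len(sentence))]
```

Each (character, tag) pair in `labeled`, together with its entry in `features`, is the kind of row a sequence labeler would be trained on. Widening `window` adds more context characters per row.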

Keywords: Chinese patent terminology; CRFs; Terminology extraction; Sequence labeling
Received: 01 March 2016      Published: 18 July 2016

Cite this article:

Wang Miping, Wang Hao, Deng Sanhong, Wu Zhixiang. Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields. New Technology of Library and Information Service, 2016, 32(6): 28-36.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2016.06.04     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2016/V32/I6/28
