New Technology of Library and Information Service (现代图书情报技术), 2016, Vol. 32, Issue 6: 28-36. DOI: 10.11925/infotech.1003-3513.2016.06.04
Research Paper
基于CRFs的冶金领域中文专利术语抽取研究*
王密平, 王昊, 邓三鸿, 吴志祥
南京大学信息管理学院 南京 210023
江苏省数据工程与知识服务重点实验室 南京 210023
Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields
Wang Miping, Wang Hao, Deng Sanhong, Wu Zhixiang
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
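Abstract (translated from the Chinese)

[Objective] To explore the optimal conditions for a model that extracts Chinese patent terms in the metallurgy domain, so that such terms can be extracted effectively. [Methods] Using a core corpus that is still incomplete, and without any manual annotation, we use conditional random fields (CRFs) to build a character-role-labeling model that recognizes Chinese patent terms in the metallurgy domain. We describe the construction of the model in detail and compare how the individual CRFs factors (feature combinations, character window length, etc.) affect recognition performance. [Results] The experiments show that the combination of character sequence, level feature, domain feature and temperature feature, with a character window length of 3, c = 1 and f = 1, achieves a precision of 94.26%, a recall of 94.37% and an F1 value of 94.5%. [Limitations] The core dictionary is incomplete, so some words are labeled inaccurately; the model is not compared in detail with other methods, and the reliability of CRFs is not discussed in detail. [Conclusions] With an appropriate combination of roles, features and feature templates, CRFs can recognize Chinese patent terms in the metallurgy domain well.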
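The character-role labeling step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the B/I/E/S/O role set, the forward-maximum-matching rule, the sample lexicon, the function names, and the reading of "window length 3" as the current character plus one neighbor on each side are all assumptions made for illustration. It shows how a term dictionary could produce character-level training labels without manual annotation, and how windowed character features could be built for a CRF.

```python
from typing import Dict, List, Set


def tag_characters(sentence: str, lexicon: Set[str]) -> List[str]:
    """Label each character with a role by forward maximum matching.

    Roles (hypothetical): B = term start, I = term inside, E = term end,
    S = single-character term, O = outside any term.
    """
    tags = ["O"] * len(sentence)
    max_len = max((len(term) for term in lexicon), default=0)
    i = 0
    while i < len(sentence):
        matched = 0
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + n] in lexicon:
                matched = n
                break
        if matched == 0:
            i += 1
        elif matched == 1:
            tags[i] = "S"
            i += 1
        else:
            tags[i] = "B"
            for j in range(i + 1, i + matched - 1):
                tags[j] = "I"
            tags[i + matched - 1] = "E"
            i += matched
    return tags


def window_features(sentence: str, i: int, size: int = 3) -> Dict[str, str]:
    """Character features in a centered window (assumed: size 3 = offsets -1..1)."""
    half = size // 2
    feats = {}
    for off in range(-half, half + 1):
        j = i + off
        feats[f"char[{off}]"] = sentence[j] if 0 <= j < len(sentence) else "<PAD>"
    return feats


# Toy example with a hypothetical two-term lexicon.
sentence = "采用转炉炼钢工艺"
lexicon = {"转炉", "炼钢工艺"}
tags = tag_characters(sentence, lexicon)
```

In a full pipeline, each character's feature dictionary and role tag would form one training row for a CRF toolkit; domain-specific features such as the level, domain and temperature features named in the abstract would be added as extra columns.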

Keywords: Chinese patent terminology; conditional random fields; terminology extraction; sequence labeling
Abstract

[Objective] This paper proposes a model for effectively extracting Chinese patent terms in the metallurgy domain. [Methods] We built a model based on conditional random fields (CRFs) that automatically identifies Chinese metallurgy patent terms. The model was built from an incomplete core corpus, without manual annotation. We describe the development process and then compare the impact of various CRFs factors, such as feature combinations and window length, on this character-role-labeling model. [Results] The best model combined the character sequence, level features, domain features and temperature features of the patent terms. With a window length of 3 and the parameters c and f both set to 1, its precision was 94.26%, its recall was 94.37%, and its F1 value was 94.5%. [Limitations] Some term labels were not accurate enough because the core dictionary is incomplete. We did not compare our model in detail with other methods or discuss the reliability of CRFs in detail. [Conclusions] With an appropriate combination of roles, features and feature templates, the CRFs model can effectively identify Chinese metallurgy patent terms.
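For readers unfamiliar with the reported metrics, the following Python sketch shows how precision, recall and F1 are conventionally computed for term extraction from character-level labels. It is an illustration under stated assumptions, not the authors' evaluation script: the B/I/E/S/O tag scheme, the exact-span matching criterion, the function names and the toy tag sequences are all hypothetical.

```python
from typing import List, Set, Tuple


def decode_terms(tags: List[str]) -> Set[Tuple[int, int]]:
    """Turn a B/I/E/S/O tag sequence into a set of (start, end) term spans."""
    spans = set()
    start = None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.add((i, i + 1))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            spans.add((start, i + 1))
            start = None
        elif tag == "O":
            start = None  # an unfinished term is discarded
    return spans


def precision_recall_f1(
    gold: Set[Tuple[int, int]], pred: Set[Tuple[int, int]]
) -> Tuple[float, float, float]:
    """Exact-match scores: a predicted span counts only if it equals a gold span."""
    true_pos = len(gold & pred)
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example: the second predicted term ends one character too early.
gold_tags = ["O", "O", "B", "E", "B", "I", "I", "E"]
pred_tags = ["O", "O", "B", "E", "B", "I", "E", "O"]
scores = precision_recall_f1(decode_terms(gold_tags), decode_terms(pred_tags))
```

Exact-span matching is the stricter choice; a partial-overlap criterion would give credit for the truncated term in the toy example and raise all three scores.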

Key words: Chinese patent terminology; CRFs; Terminology extraction; Sequence labeling
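Received: 2016-03-01
Funding: *This work is one of the research outcomes of the Natural Science Foundation of Jiangsu Province project "Research on Chinese Ontology Learning for Patent Early Warning" (No. BK20130587) and the Jiangsu Province "333" Project "Research on Chinese Ontology Learning for Knowledge Services" (No. BRA2015401).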
Cite this article:
王密平, 王昊, 邓三鸿, 吴志祥. 基于CRFs的冶金领域中文专利术语抽取研究[J]. 现代图书情报技术, 2016, 32(6): 28-36.
Wang Miping, Wang Hao, Deng Sanhong, Wu Zhixiang. Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields[J]. New Technology of Library and Information Service, 2016, 32(6): 28-36. DOI: 10.11925/infotech.1003-3513.2016.06.04.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.06.04