Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields<sup>*</sup>

doi:10.11925/infotech.1003-3513.2016.06.04

New Technology of Library and Information Service (现代图书情报技术)

2016, Vol. 32, Issue (6): 28-36 https://doi.org/10.11925/infotech.1003-3513.2016.06.04

Research Paper


Wang Miping, Wang Hao, Deng Sanhong, Wu Zhixiang

School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China


Abstract

[Objective] To identify the optimal conditions for a model that extracts Chinese patent terms in the metallurgy domain, so that such terms can be extracted effectively. [Methods] Using a still-incomplete core corpus and no manual annotation, we built a character-role-labeling model based on conditional random fields (CRFs) to recognize Chinese metallurgy patent terms. We describe the construction of the model in detail and compare how individual CRFs factors (feature combinations, character window length, and so on) affect recognition performance. [Results] Combining character sequence, level features, domain features, and temperature features, with a character window length of 3, c = 1, and f = 1, the model reached a precision of 94.26%, a recall of 94.37%, and an F1 value of 94.5%. [Limitations] The core dictionary is incomplete, so some words are labeled inaccurately. The model was not compared in detail with other methods, and the reliability of CRFs was not examined in detail. [Conclusions] With an appropriate combination of roles, features, and feature templates, CRFs can identify Chinese patent terms in the metallurgy domain well.

Keywords: Chinese patent terminology, conditional random fields (CRFs), terminology extraction, sequence labeling
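The character-role labeling idea in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration, not the authors' setup: the B/I/E/S/O role set, the longest-match dictionary lookup used to label text without manual annotation, the reading of "window length 3" as one character on each side, and the sample sentence and terms. The level, domain, and temperature features and the CRFs training step are not modeled.

```python
from typing import Dict, List


def label_characters(sentence: str, terms: List[str]) -> List[str]:
    """Tag each character with a role (assumed set: B, I, E, S, O).

    Characters covered by a dictionary term get B/I/E (or S for a
    single-character term); all other characters get O.
    Longer terms are tried first at each position.
    """
    tags = ["O"] * len(sentence)
    ordered = sorted((t for t in terms if t), key=len, reverse=True)
    i = 0
    while i < len(sentence):
        match = next((t for t in ordered if sentence.startswith(t, i)), None)
        if match is None:
            i += 1
            continue
        n = len(match)
        if n == 1:
            tags[i] = "S"
        else:
            tags[i] = "B"
            for j in range(i + 1, i + n - 1):
                tags[j] = "I"
            tags[i + n - 1] = "E"
        i += n
    return tags


def window_features(chars: List[str], i: int, size: int = 3) -> Dict[str, str]:
    """Collect the characters in a window of `size` centred on position i.

    Positions outside the sentence are filled with a padding symbol.
    """
    half = size // 2
    feats = {}
    for off in range(-half, half + 1):
        j = i + off
        feats[f"c[{off}]"] = chars[j] if 0 <= j < len(chars) else "<PAD>"
    return feats


def decode_terms(sentence: str, tags: List[str]) -> List[str]:
    """Rebuild term strings from a role sequence (B...E spans and S)."""
    terms, start = [], None
    for i, tag in enumerate(tags):
        if tag == "S":
            terms.append(sentence[i])
            start = None
        elif tag == "B":
            start = i
        elif tag == "E" and start is not None:
            terms.append(sentence[start:i + 1])
            start = None
        elif tag == "O":
            start = None
    return terms


# Hypothetical sentence and dictionary entries, for illustration only.
sentence = "采用转炉炼钢工艺"
dictionary = ["转炉", "炼钢工艺"]
tags = label_characters(sentence, dictionary)
features = [window_features(list(sentence), i) for i in range(len(sentence))]
recovered = decode_terms(sentence, tags)
```

In a full pipeline, the per-character features and role tags would be the training input to a CRFs toolkit, and `decode_terms` would be applied to the predicted role sequence to read off candidate terms.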

Received: 2016-03-01 Published: 2016-07-18

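Funding: *This work is one of the outcomes of the Natural Science Foundation of Jiangsu Province project "Research on Chinese Ontology Learning for Patent Early Warning" (Grant No. BK20130587) and the Jiangsu Province "333" Project "Research on Chinese Ontology Learning for Knowledge Services" (Grant No. BRA2015401).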

Cite this article:

Wang Miping, Wang Hao, Deng Sanhong, Wu Zhixiang. Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields[J]. New Technology of Library and Information Service, 2016, 32(6): 28-36.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.06.04 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2016/V32/I6/28

