Recognizing Semantics of Continuous Strings in Chinese Patent Documents<sup>*</sup>

doi:10.11925/infotech.2096-3467.2017.1065

Data Analysis and Knowledge Discovery

2018, Vol. 2, Issue (5): 11-22 https://doi.org/10.11925/infotech.2096-3467.2017.1065

Research Paper

Wang Xueying, Wang Hao, Zhang Zixuan

School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210023, China

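Abstract

[Objective] To address the semantic recognition of continuous symbol strings in Chinese documents. [Methods] Symbol strings whose semantics had already been identified, taken from patent documents in the iron and steel metallurgy domain, served as the learning corpus. Models were tested with basic features, Chinese-character features, and symbol-string features, and the best model was chosen from the experimental results. The best model was then tested on symbol strings whose semantics the rules had failed to determine. [Results] Comparing the test results with the true roles assigned by human judgment, the P value ranged from 98.15% to 99.62% for Y and from 96.87% to 99.34% for N; the R value ranged from 96.56% to 99.04% for Y and from 98.73% to 99.67% for N; the F1 value ranged from 97.71% to 99.33% for Y and from 97.98% to 99.42% for N. These figures indicate that recognition works well. [Limitations] Owing to the size of the learning corpus and limited research time, the corpus with identified roles was not added to the samples for learning. [Conclusions] The model is feasible, effective, and portable for determining the semantics of continuous symbol strings in Chinese patent documents, and it offers ideas for determining the semantics of symbol strings in English documents.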

Keywords: Chinese patents, iron and steel metallurgy, continuous symbol strings, semantic recognition
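The P, R, and F1 values reported for the Y and N roles are per-class precision, recall, and F1. As a minimal sketch of how such figures are usually computed, the snippet below scores each label in turn as the positive class. The function name `per_class_prf` and the toy label sequences are assumptions made for illustration; they are not the authors' code or data, and Y and N are treated here as opaque labels because the abstract does not define them further.

```python
from typing import Dict, List


def per_class_prf(gold: List[str], pred: List[str], label: str) -> Dict[str, float]:
    """Precision, recall and F1 for one label, treated as the positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == label and p == label)  # correct hits
    fp = sum(1 for g, p in pairs if g != label and p == label)  # false alarms
    fn = sum(1 for g, p in pairs if g == label and p != label)  # misses
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"P": precision, "R": recall, "F1": f1}


# Toy labels invented for illustration; not data from the paper.
gold = ["Y", "Y", "Y", "N", "N", "N", "N", "Y"]
pred = ["Y", "Y", "N", "N", "N", "Y", "Y", "Y"]

scores = {label: per_class_prf(gold, pred, label) for label in ("Y", "N")}
for label, s in scores.items():
    print(label, {name: round(value, 4) for name, value in s.items()})
```

Scoring each label separately is what allows the abstract to quote different minimum and maximum values for Y and for N.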

Abstract:

[Objective] This paper aims to extract semantic information from continuous strings in Chinese patent documents in the field of iron and steel metallurgy. [Methods] First, we collected strings whose semantics had already been identified as the learning corpus. Then, we tested basic features, Chinese-character features, and string features on this corpus to establish the best model. Finally, we used this model to recognize the semantics of other strings. [Results] The proposed model effectively extracted the semantics of the continuous strings. [Limitations] We did not add the strings with identified roles to the training corpus. [Conclusions] The new model can identify the semantics of continuous strings in Chinese patent documents and could inform the study of continuous strings in English literature.

Key words: Chinese Patent Documents, Iron and Steel Metallurgy, Continuous Strings, Semantic Recognition

Received: 2017-10-26 Published: 2018-06-20

CLC number: G306

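Funding: *This work is one of the research outcomes of the Jiangsu Province "333 Project" project "Research on Chinese Ontology Learning for Knowledge Services" (No. BRA2015401) and the National Natural Science Foundation of China Youth Program project "Research on the Measurement and Analysis of TSD and TDC for Academic Resources" (No. 71503121).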

Cite this article:

Wang Xueying, Wang Hao, Zhang Zixuan. Recognizing Semantics of Continuous Strings in Chinese Patent Documents[J]. Data Analysis and Knowledge Discovery, 2018, 2(5): 11-22.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2017.1065 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2018/V2/I5/11

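Set of annotation roles for characters

Features and their values

Feature templates for CRFs

Experimental results based on basic features (Chinese characters: ABEMS)

Experimental results based on basic features (Chinese characters: A)

Experimental results based on Chinese-character features (Chinese characters: ABEMS)

Experimental results based on Chinese-character features (Chinese characters: A)

Recognition results for each role when using a single feature

Recognition results for each role when features are added cumulatively

P values of Y and N

R values of Y and N

F1 values of Y and N

Experimental results of applying the model

Misclassified entries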
