Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (5): 11-22    DOI: 10.11925/infotech.2096-3467.2017.1065
Original Article
Recognizing Semantics of Continuous Strings in Chinese Patent Documents
Wang Xueying, Wang Hao, Zhang Zixuan
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210023, China
Abstract  

[Objective] This paper aims to extract semantic information from continuous strings in Chinese patent documents in the field of iron and steel metallurgy. [Methods] First, we collected strings whose semantics had already been identified and used them as the learning corpus. Then, on this corpus, we examined the basic features together with the features of Chinese characters and of strings to establish the best model. Finally, we used this model to recognize the semantics of other strings. [Results] The proposed model could effectively extract the semantics of continuous strings. [Limitations] We did not add the identified characters to the training corpus. [Conclusions] The new model can identify the semantics of continuous strings in Chinese patent documents and could also be applied to the study of continuous strings in English literature.

Key words: Chinese Patent Documents; Iron and Steel Metallurgy; Continuous Strings; Semantic Recognition
Received: 26 October 2017      Published: 20 June 2018
ZTFLH:  G306  

Cite this article:

Wang Xueying,Wang Hao,Zhang Zixuan. Recognizing Semantics of Continuous Strings in Chinese Patent Documents. Data Analysis and Knowledge Discovery, 2018, 2(5): 11-22.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.1065     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I5/11

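Role (R) | Description | Example
B | First character of a term | "脱" in "脱氧剂" (deoxidizer)
M | Middle character of a term | "氧" in "脱氧剂"
E | Last character of a term | "剂" in "脱氧剂"
S | Single-character term | "铁" (iron), "锰" (manganese)
A | Character in a non-term word | "一" and "种" in "一种钢水脱氧剂" (a deoxidizer for molten steel)
T | Symbol string whose meaningfulness cannot be determined | "(63"
Y | Meaningful symbol string | "A356"
N | Meaningless symbol string | "-"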
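The role scheme above can be illustrated with a minimal Python sketch. It is not the authors' annotation tool: the tokenizer, the `tag_roles` helper, the term list, and the sets of meaningful and meaningless strings are all hypothetical, and the sketch assumes that any run of characters outside the basic CJK ideograph range forms one continuous symbol string.

```python
import re

# Assumption: a run of characters that are neither CJK ideographs nor
# whitespace is treated as one continuous symbol string.
SYMBOL_RUN = re.compile(r"[^\u4e00-\u9fff\s]+")


def is_chinese(token):
    """True for a single character in the basic CJK ideograph block."""
    return len(token) == 1 and "\u4e00" <= token <= "\u9fff"


def tokenize(sentence):
    """Split a sentence into single Chinese characters and symbol strings."""
    tokens = []
    i = 0
    while i < len(sentence):
        match = SYMBOL_RUN.match(sentence, i)
        if match:
            tokens.append(match.group())
            i = match.end()
        else:
            if not sentence[i].isspace():
                tokens.append(sentence[i])
            i += 1
    return tokens


def tag_roles(sentence, terms, meaningful=(), meaningless=()):
    """Assign one role per token: B/M/E/S/A for characters, Y/N/T for strings."""
    tokens = tokenize(sentence)
    roles = []
    for tok in tokens:
        if is_chinese(tok):
            roles.append("A")  # default: character outside any term
        elif tok in meaningful:
            roles.append("Y")
        elif tok in meaningless:
            roles.append("N")
        else:
            roles.append("T")  # meaningfulness not yet determined
    # Overwrite the default "A" wherever a known term occurs (longest first).
    for term in sorted(terms, key=len, reverse=True):
        n = len(term)
        for start in range(len(tokens) - n + 1):
            span = range(start, start + n)
            if "".join(tokens[start:start + n]) != term:
                continue
            if any(roles[k] != "A" for k in span):
                continue  # already part of a longer term
            if n == 1:
                roles[start] = "S"
            else:
                roles[start] = "B"
                for k in range(start + 1, start + n - 1):
                    roles[k] = "M"
                roles[start + n - 1] = "E"
    return tokens, roles


tokens, roles = tag_roles("一种钢水脱氧剂A356", ["脱氧剂"], meaningful={"A356"})
for tok, role in zip(tokens, roles):
    print(tok, role)
```

In this sketch a string that appears in neither set falls back to T, which mirrors the role reserved for strings whose meaningfulness cannot yet be determined.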
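Feature category | Feature | Values
Basic feature | Character sequence (Z) | A Chinese character or a continuous symbol string
Chinese character features [5] | Surname feature (X) | Y (surname character) / N (not a surname character)
Chinese character features [5] | Transliteration feature (Y) | Y (character used in transliterated loanwords) / N (not used in transliterated loanwords)
Chinese character features [5] | Domain feature (K) | Y (character common in the domain) / N (not common in the domain)
Chinese character features [5] | Level feature (G) | X (level-1 common character) / Y (level-2 common character) / Z (other)
Chinese character features [5] | Classification feature (C) | X (指事, indicative) / Y (象形, pictographic) / Z (形声, phono-semantic) / U (会意, compound ideographic) / V (other)
Symbol string features | String length (L) | Non-negative integer
Symbol string features | ASCII code of the first character (A) | Positive integer
Symbol string features | Number of digits in the string (M) | Non-negative integer
Symbol string features | Number of letters in the string (H) | Non-negative integer
Symbol string features | Contains special symbols such as “*+%: ℃Δθ×± ≤~≥λ<>>” (S) | Y (contains) / N (does not contain)
Symbol string features | Contains punctuation or numbering marks such as “-,,、——/“”() ii iv’[]$→①②③④⑤⑥⑦ ⑧⑨…?″” (P) | Y (contains) / N (does not contain)
Symbol string features | First character is a digit or a letter (F) | Y (yes) / N (no)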
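The seven symbol-string features (L, A, M, H, S, P, F) are simple enough to compute directly. The following is a minimal sketch, not the authors' code: the function name `symbol_string_features` is hypothetical, the two character sets are abbreviated subsets of the symbol lists quoted in the table, and digits and letters are assumed to mean ASCII digits and letters.

```python
import string

# Assumption: abbreviated subsets of the special-symbol and punctuation
# lists given in the feature table; the full lists are longer.
SPECIAL = set("*+%:℃Δθ×±≤~≥λ<>")
PUNCT = set("-,，、/“”()’[]$→①②③④⑤⑥⑦⑧⑨…?″")

ASCII_DIGITS = set(string.digits)
ASCII_LETTERS = set(string.ascii_letters)


def symbol_string_features(s):
    """Return the L, A, M, H, S, P, F features of one symbol string."""
    if not s:
        raise ValueError("symbol string must be non-empty")
    return {
        "L": len(s),                                   # string length
        "A": ord(s[0]),                                # code of first character
        "M": sum(ch in ASCII_DIGITS for ch in s),      # number of digits
        "H": sum(ch in ASCII_LETTERS for ch in s),     # number of letters
        "S": "Y" if any(ch in SPECIAL for ch in s) else "N",
        "P": "Y" if any(ch in PUNCT for ch in s) else "N",
        "F": "Y" if (s[0] in ASCII_DIGITS or s[0] in ASCII_LETTERS) else "N",
    }


for example in ["A356", "(63", "-200"]:
    print(example, symbol_string_features(example))
```

Note that `ord` returns the Unicode code point, which coincides with the ASCII code only for ASCII characters; how the original work encodes a non-ASCII first character is not stated in the source.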
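Template | Observed features | Character window size | Role set used to tag Chinese characters
TMPT0 | Z | 5 | ABEMS or A
TMPT1 | ZXYKGC | 5 | ABEMS or A
TMPT2 | ZXYKGCL | 5 | A
TMPT3 | ZXYKGCLA | 5 | A
TMPT4 | ZXYKGCLAM | 5 | A
TMPT5 | ZXYKGCLAMH | 5 | A
TMPT6 | ZXYKGCLAMHS | 5 | A
TMPT7 | ZXYKGCLAMHSP | 5 | A
TMPT8 | ZXYKGCLAMHSPF | 5 | A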
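Since the reference list cites CRF++, each template above plausibly corresponds to a CRF++ feature template file. The sketch below generates one under stated assumptions: the helper `build_template` is hypothetical, the window of 5 is taken to be centered (offsets -2 to +2), each observed feature is assumed to occupy one column of the training file in the order listed, and the closing `B` bigram line is an assumption rather than something the source confirms.

```python
def build_template(features, window=5, bigram=True):
    """Build CRF++ template text with one unigram line per (offset, column).

    features: string of feature letters, e.g. "ZXYKGC" (column order assumed).
    window:   odd window size; 5 gives row offsets -2..+2 (assumed centered).
    bigram:   append the "B" line for label-bigram features (assumption).
    """
    half = window // 2
    lines = []
    index = 0
    for column, name in enumerate(features):
        lines.append(f"# feature {name} (column {column})")
        for offset in range(-half, half + 1):
            # %x[row,col] refers to the token `row` positions away, column `col`
            lines.append(f"U{index:02d}:%x[{offset},{column}]")
            index += 1
    if bigram:
        lines.append("B")
    return "\n".join(lines)


# TMPT1 observes the character itself plus the five Chinese character features.
print(build_template("ZXYKGC"))
```

With this layout TMPT8 (13 features, window 5) yields 65 unigram lines; richer templates that combine columns are possible but are not described in the source.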
Predicted\Actual Y N A B E M S
Y 1400 14 14 0 6 0 2
N 1 1735 0 0 1 0 0
A 165 61 20350 138 131 31 18
B 2 0 30 4038 12 19 11
E 1 1 36 6 4032 22 14
M 0 0 3 6 9 704 0
S 0 0 0 7 4 1 623
Total 1569 1811 20433 4195 4195 777 668
Predicted\Actual Y N A
Y 1372 14 10
N 1 1718 0
A 196 79 30258
Total 1569 1811 30268
Predicted\Actual Y N A B E M S
Y 1547 27 7 0 0 0 1
N 20 1784 4 0 0 0 1
A 2 0 20281 130 139 30 6
B 0 0 64 4031 13 18 15
E 0 0 67 11 4026 26 11
M 0 0 7 13 10 701 0
S 0 0 3 10 7 2 634
Total 1569 1811 20433 4195 4195 777 668
Predicted\Actual Y N A
Y 1547 26 6
N 20 1785 4
A 2 0 30258
Total 1569 1811 30268
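Per-role precision, recall, and F1 follow directly from a confusion matrix laid out this way (rows are predicted roles, columns are actual roles). The sketch below uses the counts of the Y/N/A matrix directly above; the function name `per_class_metrics` is hypothetical, and the source does not state which metrics the authors report.

```python
def per_class_metrics(labels, matrix):
    """Compute (precision, recall, F1) per label.

    matrix[i][j] is the number of items predicted as labels[i]
    whose actual role is labels[j].
    """
    results = {}
    for k, label in enumerate(labels):
        true_positive = matrix[k][k]
        predicted_total = sum(matrix[k])               # row sum
        actual_total = sum(row[k] for row in matrix)   # column sum
        precision = true_positive / predicted_total if predicted_total else 0.0
        recall = true_positive / actual_total if actual_total else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results[label] = (precision, recall, f1)
    return results


LABELS = ["Y", "N", "A"]
MATRIX = [
    [1547, 26, 6],     # predicted Y
    [20, 1785, 4],     # predicted N
    [2, 0, 30258],     # predicted A
]

for label, (p, r, f1) in per_class_metrics(LABELS, MATRIX).items():
    print(f"{label}: precision={p:.4f} recall={r:.4f} F1={f1:.4f}")
```

The same function applies unchanged to the seven-role matrices, since it only relies on row and column sums.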
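Actual role | Predicted role | Feature used: L | A | M | H | S | P | F
Y | Y | 1548 | 1541 | 1537 | 1550 | 1539 | 1538 | 1515
Y | N | 19 | 26 | 29 | 17 | 25 | 28 | 48
Y | A | 2 | 2 | 3 | 2 | 2 | 3 | 3
N | Y | 20 | 13 | 20 | 22 | 19 | 22 | 16
N | N | 1790 | 1798 | 1791 | 1789 | 1792 | 1788 | 1795
N | A | 1 | 0 | 0 | 0 | 0 | 1 | 0
A | Y | 1 | 6 | 6 | 6 | 6 | 7 | 1
A | N | 4 | 4 | 4 | 4 | 4 | 3 | 10
A | A | 30258 | 30258 | 30258 | 30258 | 30258 | 30258 | 30257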
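Actual role | Predicted role | Number of stacked features: 1 | 2 | 3 | 4 | 5 | 6 | 7
Y | Y | 1548 | 1556 | 1551 | 1554 | 1554 | 1554 | 1554
Y | N | 19 | 10 | 15 | 12 | 12 | 12 | 12
Y | A | 2 | 3 | 3 | 3 | 3 | 3 | 3
N | Y | 20 | 8 | 7 | 22 | 7 | 7 | 8
N | N | 1790 | 1802 | 1803 | 1789 | 1803 | 1803 | 1802
N | A | 1 | 1 | 1 | 0 | 1 | 1 | 1
A | Y | 1 | 1 | 1 | 6 | 1 | 1 | 0
A | N | 4 | 3 | 4 | 4 | 3 | 1 | 0
A | A | 30258 | 30264 | 30263 | 30258 | 30264 | 30266 | 30268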
Predicted\Actual Y N Total
Y 90 10 100
N 4 96 100
Total 94 106 200
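Error type | No. | Misjudged item | Sentence containing the item | Items of this type in the training sample
N judged as Y | 1 | 1 | 其特征在于塞棒头2垂直固定在球弧状体1的弧面中部 | 0
N judged as Y | 2 | 5 | 渣池内的水经过过滤器5过滤后 | 0
N judged as Y | 3 | 8 | 渣池内的水由过滤器8过滤后 | 0
N judged as Y | 4 | 10 | 吊挂螺杆10 | 0
N judged as Y | 5 | 11 | 通过将还原性气体11含有的硫分在还原炉4内移动到还原铁3上 | 0
N judged as Y | 6 | 18 | 2的加工区域MA附近设置的向导装置16自如移动的滑鞍18上 | 0
N judged as Y | 7 | 25 | 25抵接到轴向端面801 | 0
N judged as Y | 8 | A | 再根据阿仑尼乌斯关系式算出系数A和再结晶激活能 | 1
N judged as Y | 9 | 9) | 9)中至少部分焙烧菱镁矿 | 0
N judged as Y | 10 | a) | 重复步骤一a)过程 | 0
Y judged as N | 11 | -200 | 磨细到-200目60%以上 | 0
Y judged as N | 12 | B | B及Nb中的一种以上 | 0
Y judged as N | 13 | (LT—H | 该发明主要用特制的净化加热主包芯线(LT–H主线)和辅助料微调钢水成分副包芯线(LT–H副线)净化加热钢水 | 0
Y judged as N | 14 | “U” | 倒“U”字型管的前后内壁分别设有阶梯形施感板 | 0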
[1] Trappey C V, Wu H Y, Taghaboni-Dutta F, et al. Using Patent Data for Technology Forecasting: China RFID Patent Analysis[J]. Advanced Engineering Informatics, 2011, 25(1): 53-64.
doi: 10.1016/j.aei.2010.05.007
[2] 王密平. 汉语专利术语抽取及应用研究——以钢铁冶金领域为例[D]. 南京: 南京大学, 2017.
[2] (Wang Miping. A Study on Chinese Terms Extraction and Their Application: The Case of Iron and Steel Metallurgy[D]. Nanjing: Nanjing University, 2017.)
[3] 陈志雄, 曾辉. 中文专利文献自动分类[J]. 嘉应学院学报, 2010, 28(2): 24-29.
[3] (Chen Zhixiong, Zeng Hui. Chinese Patent Text Automatic Categorization System[J]. Journal of Jiaying University, 2010, 28(2): 24-29.)
doi: 10.3969/j.issn.1006-642X.2010.02.006
[4] 徐川, 施水才, 房祥, 等. 中文专利文献术语抽取[J]. 计算机工程与设计, 2013, 34(6): 2175-2179.
[4] (Xu Chuan, Shi Shuicai, Fang Xiang, et al. Chinese Patent Terminology Extraction[J]. Computer Engineering and Design, 2013, 34(6): 2175-2179.)
[5] 王密平, 王昊, 邓三鸿, 等. 基于CRFs的冶金领域中文专利术语抽取研究[J]. 现代图书情报技术, 2016(6): 28-36.
[5] (Wang Miping, Wang Hao, Deng Sanhong, et al. A Study on Chinese Terms Extraction and Their Application: The Case of Iron and Steel Metallurgy[J]. New Technology of Library and Information Service, 2016(6): 28-36.)
[6] 韩杰冰. 基于字角色标注的中文专利术语识别研究[D]. 南京: 南京大学, 2015.
[6] (Han Jiebing. The Research on Chinese Term Recognition of Patents Based on Word-Role Tagging[D]. Nanjing: Nanjing University, 2015.)
[7] 姜武. 模式识别技术在山茶属植物数值分类学和叶绿素含量预测中的应用研究[D]. 金华: 浙江师范大学, 2013.
[7] (Jiang Wu. Application of Pattern Recognition Techniques in Plant Numerical Taxonomy and Chlorophyll Content of Genus Camellia[D]. Jinhua: Zhejiang Normal University, 2013.)
[8] 罗俊, 王清丽, 张华, 等. 不同甘蔗基因型光合特性的数值分类[J]. 应用与环境生物学报, 2007, 13(4): 461-465.
[8] (Luo Jun, Wang Qingli, Zhang Hua, et al. Phenetic Classification for Photosynthetic Characters of Different Sugarcane Varieties[J]. Chinese Journal of Applied and Environmental Biology, 2007, 13(4): 461-465.)
doi: 10.3321/j.issn:1006-687x.2007.04.004
[9] 刘晓云, 陈文新. 三叶草、猪屎豆和含羞草植物根瘤菌16S rDNA PCR-RFLP分析和数值分类研究[J]. 中国农业大学学报, 2003, 8(3): 1-6.
[9] (Liu Xiaoyun, Chen Wenxin. 16S rDNA PCR-RFLP Analysis and Numerical Taxonomy for Rhizobia Isolated from Trifolium, Crotalaria and Mimosa[J]. Journal of China Agricultural University, 2003, 8(3): 1-6.)
doi: 10.3321/j.issn:1007-4333.2003.03.001
[10] 刘勇, 孙中海, 刘德春, 等. 部分柚类品种数值分类研究[J]. 果树学报, 2006, 23(1): 35-40.
[10] (Liu Yong, Sun Zhonghai, Liu Dechun, et al. Numerical Classification of Some Grapefruit Cultivars[J]. Journal of Fruit Science, 2006, 23(1): 35-40.)
[11] 杜琪珍, 李名君, 刘维华, 等. 茶组植物的化学分类及数值分类[J]. 茶叶科学, 1990, 10(2): 1-12.
[11] (Du Qizhen, Li Mingjun, Liu Weihua, et al. Chemical and Numerical Taxonomies of Tea Section Plants[J]. Journal of Tea Science, 1990, 10(2): 1-12.)
[12] 罗礼溥, 郭宪国. 云南医学革螨数值分类研究[J]. 热带医学杂志, 2007, 50(1): 172-177.
[12] (Luo Lipu, Guo Xianguo. Classification of a Medically Important Group of Gamasid Mites by Numerical Taxonomy in Yunnan, China[J]. Journal of Tropical Medicine, 2007, 50(1): 172-177.)
doi: 10.3321/j.issn:0454-6296.2007.02.011
[13] 陈晓琴, 陈强, 张世熔, 等. 流沙河流域土壤自生固氮菌数值分类及BOX-PCR研究[J]. 农业环境科学学报, 2006, 25(S): 528-532.
[13] (Chen Xiaoqin, Chen Qiang, Zhang Shirong, et al. Taxonomy and BOX-PCR Analysis of Free-Living Dizotrophs Isolated from Soils in Liusha River Valley[J]. Journal of Agro-Environment Science, 2006, 25(S): 528-532.)
[14] 孙家梅. 白蛉的数值分类和基于DNA条形码的分子系统学研究[D]. 广州: 暨南大学, 2010.
[14] (Sun Jiamei. The Numerical Taxonomy and Molecular Systematic Using Phlebotomus DNA Barcode of Phlebotomine Sandflies[D]. Guangzhou: Jinan University, 2010.)
[15] 么枕生. 用于数值分类的聚类分析[J]. 海洋湖沼通报, 1994(2): 1-12.
[15] (Yao Zhensheng. Cluster Analysis Used in Numerical Classification[J]. Transactions of Oceanology and Limnology, 1994(2): 1-12.)
[16] 李宏乔, 樊孝忠. 汉语文本中特殊符号串的自动识别技术[J]. 计算机工程, 2004, 30(12): 114-115.
[16] (Li Hongqiao, Fan Xiaozhong. Technique of Special Strings Automatic Recognition in Chinese Texts[J]. Computer Engineering, 2004, 30(12): 114-115.)
[17] 赵欣欣. 基于字符编码的文本隐藏算法及其攻击方法研究[D]. 合肥: 中国科学技术大学, 2009.
[17] (Zhao Xinxin. Research on Character Coding Based Text Stenographer and Its Attack Methods[D]. Hefei: University of Science and Technology of China, 2009.)
[18] 金花, 朱亚涛, 靳志强. 农业文献知识获取中斜体字符识别技术的应用研究[J]. 河北农业大学学报, 2015, 38(6): 124-128.
[18] (Jin Hua, Zhu Yatao, Jin Zhiqiang. Research on Detection Method of English Italic Characters in Agriculture Acquisition[J]. Journal of Agricultural University of Hebei, 2015, 38(6): 124-128.)
doi: 10.13320/j.cnki.jauh.2015.0148
[19] 汤青, 吕学强, 李卓, 等. 领域本体术语抽取研究[J]. 现代图书情报技术, 2014(1): 43-50.
[19] (Tang Qing, Lv Xueqiang, Li Zhuo, et al. Research on Domain Ontology Term Extraction[J]. New Technology of Library and Information Service, 2014(1): 43-50.)
[20] 屈鹏, 王惠临. 面向信息分析的专利术语抽取研究[J]. 图书情报工作, 2013, 57(1): 130-135.
[20] (Qu Peng, Wang Huilin. Patent Term Extraction for Information Analysis[J]. Library and Information Service, 2013, 57(1): 130-135.)
doi: 10.7536/j.jssn.0252-3116.2013.01.023
[21] 胡阿沛, 张静, 刘俊丽. 基于改进C-value方法的中文术语抽取[J]. 现代图书情报技术, 2013(2): 24-29.
[21] (Hu Apei, Zhang Jing, Liu Junli. Chinese Term Extraction Based on Improved C-value Method[J]. New Technology of Library and Information Service, 2013(2): 24-29.)
[22] 侯婷, 吕学强, 李卓. 专利术语抽取的层次过滤方法[J]. 现代图书情报技术, 2015(1): 24-30.
[22] (Hou Ting, Lv Xueqiang, Li Zhuo. Hierarchical Filtering Method for Patent Term Extraction[J]. New Technology of Library and Information Service, 2015(1): 24-30.)
[23] 何远标, 乐小虬, 张帆. 学术论文大纲中关键术语抽取方法研究[J]. 现代图书情报技术, 2014(3): 73-79.
[23] (He Yuanbiao, Le Xiaoqiu, Zhang Fan. Research on Keyphrase Extraction from Scholarly Article Outline[J]. New Technology of Library and Information Service, 2014(3): 73-79.)
[24] 杜丽萍, 李晓戈, 周元哲, 等. 互信息改进方法在术语抽取中的应用[J]. 计算机应用, 2015, 35(4): 996-1000, 1005.
[24] (Du Liping, Li Xiaoge, Zhou Yuanzhe, et al. Application of Mutual Information Improvement Method in Term Extraction[J]. Computer Applications, 2015, 35(4): 996-1000, 1005.)
doi: 10.11772/j.issn.1001-9081.2015.04.0996
[25] 谷俊, 王昊. 基于领域中文文本的术语抽取方法研究[J]. 现代图书情报技术, 2011(4): 29-34.
[25] (Gu Jun, Wang Hao. Study on Term Extraction on the Basis of Chinese Domain Texts[J]. New Technology of Library and Information Service, 2011(4): 29-34.)
[26] 屈鹏, 王惠临. 专利信息服务中的术语抽取[J]. 情报科学, 2015, 33(9): 66-71.
[26] (Qu Peng, Wang Huilin. Term Extraction in Patent Information Services[J]. Information Science, 2015, 33(9): 66-71.)
[27] 曾文, 徐硕, 张运良, 等. 科技文献术语的自动抽取技术研究与分析[J]. 现代图书情报技术, 2014(1): 51-55.
[27] (Zeng Wen, Xu Shuo, Zhang Yunliang, et al. The Research and Analysis on Automatic Extraction of Science and Technology Literature Terms[J]. New Technology of Library and Information Service, 2014(1): 51-55.)
[28] 化柏林. 针对中文学术文献的情报方法术语抽取[J]. 现代图书情报技术, 2013(6): 68-75.
[28] (Hua Bolin. Extracting Information Method Term from Chinese Academic Literature[J]. New Technology of Library and Information Service, 2013(6): 68-75.)
[29] 袁劲松, 张小明, 李舟军. 术语自动抽取方法研究综述[J]. 计算机科学, 2015, 42(8): 7-12.
[29] (Yuan Jinsong, Zhang Xiaoming, Li Zhoujun. Survey of Automatic Term Extraction Methodologies[J]. Computer Science, 2015, 42(8): 7-12.)
doi: 10.11896/j.issn.1002-137X.2015.8.002
[30] 张文静, 梁颖红. 术语抽取技术研究[J]. 信息技术, 2008, 32(3): 6-9.
[30] (Zhang Wenjing, Liang Yinghong. Research on Term Extraction Technology[J]. Information Technology, 2008, 32(3): 6-9.)
[31] 周浪. 中文术语抽取若干问题研究[D]. 南京: 南京理工大学, 2010.
[31] (Zhou Lang. A Study on the Chinese Term Extraction[D]. Nanjing: Nanjing University of Science and Technology, 2010.)
[32] 唐涛, 周俏丽, 张桂平. 统计与规则相结合的术语抽取[J]. 沈阳航空航天大学学报, 2011, 28(5): 71-74.
[32] (Tang Tao, Zhou Qiaoli, Zhang Guiping. Term Extraction Based on the Combination of Statistics and Rules[J]. Journal of Shenyang University of Aeronautics and Astronautics, 2011, 28(5): 71-74.)
[33] 陈锋, 翟羽佳, 王芳. 基于条件随机场的学术期刊中理论的自动识别方法[J]. 图书情报工作, 2016, 60(2): 122-128.
[33] (Chen Feng, Zhai Yujia, Wang Fang. Automatic Theory Recognition in Academic Journals Based on CRF[J]. Library and Information Service, 2016, 60(2): 122-128.)
doi: 10.13266/j.issn.0252-3116.2016.02.019
[34] 逯万辉, 马建霞. 基于CRFs的领域爆发词识别的研究与实现[J]. 情报科学, 2014, 32(1): 89-93.
[34] (Lu Wanhui, Ma Jianxia. Research and Implementation on the Domain Burst Word Recognition Based on CRFs[J]. Information Science, 2014, 32(1): 89-93.)
[35] 王荣洋, 鞠久朋, 李寿山, 等. 基于CRFs的评价对象抽取特征研究[J]. 中文信息学报, 2012, 26(2): 56-61.
[35] (Wang Rongyang, Ju Jiupeng, Li Shoushan, et al. Feature Engineering for CRFs Based Opinion Target Extraction[J]. Journal of Chinese Information Processing, 2012, 26(2): 56-61.)
[36] 侯立斌, 李培峰, 朱巧明. 基于CRFs和跨事件的事件识别研究[J]. 计算机工程, 2012, 38(24): 191-195.
[36] (Hou Libin, Li Peifeng, Zhu Qiaoming. Study of Event Recognition Based on CRFs and Cross-event[J]. Computer Engineering, 2012, 38(24): 191-195.)
doi: 10.3969/j.issn.1000-3428.2012.24.045
[37] 罗彦彦, 黄德根. 基于CRFs边缘概率的中文分词[J]. 中文信息学报, 2009, 23(5): 3-8.
[37] (Luo Yanyan, Huang Degen. Chinese Word Segmentation Based on the Marginal Probabilities Generated by CRFs[J]. Journal of Chinese Information Processing, 2009, 23(5): 3-8.)
[38] CRF++[EB/OL]. [2017-11-16]. .
[39] 周志华, 王珏. 机器学习及其应用[M]. 北京: 清华大学出版社, 2009.
[39] (Zhou Zhihua, Wang Jue. Machine Learning and Its Application[M]. Beijing: Tsinghua University Press, 2009.)