现代图书情报技术 (New Technology of Library and Information Service), 2015, Vol. 31, Issue 1: 24-30. DOI: 10.11925/infotech.1003-3513.2015.01.04
Research Paper
专利术语抽取的层次过滤方法
侯婷, 吕学强, 李卓
北京信息科技大学网络文化与数字传播重点实验室 北京 100101
Hierarchical Filtering Method for Patent Term Extraction
Hou Ting, Lv Xueqiang, Li Zhuo
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China

Abstract

[Objective] Patent terms are the core content of patent documents, and extracting them is a foundational task for patent research. [Methods] A hierarchical filtering method is proposed for extracting patent terms. Repeated strings are obtained with a suffix array and taken as candidate words. According to their characteristics, the invalid strings in the candidate set are divided into three classes: broken strings, redundant strings and common words. Patent terms are obtained by identifying and filtering out these three classes. An independence calculation algorithm is proposed to filter broken strings, and a relative activity calculation method and a word segmentation error correction method are proposed to filter redundant strings. [Results] Experiments show that the method performs well on Chinese patent term extraction, with an average precision of 90.54% and an average recall of 87.33%. [Limitations] The method handles only repeated strings and cannot identify patent terms that occur only once in a document. [Conclusions] The method is effective for patent term extraction.
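
The candidate-generation step described in the abstract (collecting repeated strings with a suffix array) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the length bounds `min_len` and `max_len`, and the naive suffix sort are assumptions made for illustration only, and none of the three filtering stages are reproduced here.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Adequate for a short illustration, too slow for a real corpus.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[k] is the length of the common prefix of the suffixes that
    # start at sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def repeated_strings(text, min_len=2, max_len=8, min_freq=2):
    # Hypothetical helper: returns {substring: frequency} for every substring
    # of length min_len..max_len that occurs at least min_freq times.
    sa = build_suffix_array(text)
    lcp = lcp_array(text, sa)
    counts = {}
    for length in range(min_len, max_len + 1):
        k = 1
        while k < len(sa):
            if lcp[k] >= length:
                # Adjacent suffixes sharing a prefix of this length form a run.
                start = k - 1
                while k < len(sa) and lcp[k] >= length:
                    k += 1
                freq = k - start
                if freq >= min_freq:
                    counts[text[sa[start]:sa[start] + length]] = freq
            else:
                k += 1
    return counts


candidates = repeated_strings("数据存储装置和数据存储方法")
for string, freq in sorted(candidates.items()):
    print(string, freq)
```

On this toy input the candidate set contains 数据存储 alongside fragments such as 据存 and 数据存. Fragments of this kind are presumably what the abstract calls invalid strings, which is why the extraction is followed by filtering stages.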

Key words: Patent terms; Hierarchical filtering method; Independence calculation; Relative active degree
Received: 2014-06-11
CLC number: TP391.1
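Funding:

This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Ontology-Based Automatic Patent Indexing" (No. 61271304), the Key Project of the Beijing Municipal Education Commission Science and Technology Development Plan and Class B Key Project of the Beijing Natural Science Foundation "Research on Domain-Oriented Precise Search Methods for Multimodal Internet Information" (No. KZ201311232037), and the Beijing Municipal Universities Innovation Team Building and Teacher Career Development Program project "Theoretical Foundations and Intelligent Processing Technology for Big Data Content Understanding" (No. IDHT20130519).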

Corresponding author: Hou Ting, ORCID: 0000-0001-6599-1106, E-mail: houtingting163@126.com
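Author contributions: Lv Xueqiang proposed the research topic; Hou Ting designed the experimental plan, carried out the experiments and wrote the paper; Li Zhuo processed and analyzed the data and revised the final version of the paper.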
Cite this article:
侯婷, 吕学强, 李卓. 专利术语抽取的层次过滤方法[J]. 现代图书情报技术, 2015, 31(1): 24-30.
Hou Ting, Lv Xueqiang, Li Zhuo. Hierarchical Filtering Method for Patent Term Extraction. New Technology of Library and Information Service, 2015, 31(1): 24-30. DOI:10.11925/infotech.1003-3513.2015.01.04.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.01.04

