New Technology of Library and Information Service  2015, Vol. 31 Issue (1): 24-30    DOI: 10.11925/infotech.1003-3513.2015.01.04
Hierarchical Filtering Method for Patent Term Extraction
Hou Ting, Lv Xueqiang, Li Zhuo
Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract  

[Objective] Patent terms are the core content of patent documents, and extracting them is the basis of research on patents. [Methods] A hierarchical filtering method is presented to extract terms. Based on a suffix array, the method takes repeated strings as candidate words and divides the invalid strings among them into three classes (broken strings, redundant strings and common words) according to their features in the candidate set. Patent terms are then obtained by removing these invalid strings. The authors propose an independence calculation method, a relative activity calculation method and a word segmentation error correction method to filter broken strings and redundant strings. [Results] Experimental results show that the proposed method performs well on Chinese patent term extraction, with an average precision of 90.54% and an average recall of 87.33%. [Limitations] The method applies only to repeated strings and cannot identify terms that occur only once. [Conclusions] The method is effective for patent term extraction.
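
The candidate-generation step described in the abstract (repeated strings found through a suffix array) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the `min_len` and `max_len` parameters, the naive suffix-array construction and the sample sentence are all assumptions made for illustration, and none of the filtering steps (independence, relative activity, segmentation error correction) is implemented.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    # Fine for a short example; real corpora need a faster algorithm.
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    # Length of the longest common prefix of two strings.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def count_occurrences(text, s):
    # Count occurrences of s in text, including overlapping ones.
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def repeated_strings(text, min_len=2, max_len=8):
    # Hypothetical helper: return {string: frequency} for every substring
    # of length min_len..max_len that occurs at least twice in text.
    sa = build_suffix_array(text)
    candidates = set()
    # Suffixes sharing a prefix are adjacent in the suffix array, so every
    # repeated substring is a prefix shared by some adjacent pair.
    for prev, cur in zip(sa, sa[1:]):
        lcp = common_prefix_len(text[prev:], text[cur:])
        for k in range(min_len, min(lcp, max_len) + 1):
            candidates.add(text[cur:cur + k])
    return {s: count_occurrences(text, s) for s in candidates}


# Assumed sample sentence, not taken from the paper's corpus.
sample = "数据挖掘方法和数据挖掘系统"
for s, freq in sorted(repeated_strings(sample).items()):
    print(s, freq)
```

On this sample the candidate set holds the full repeated string 数据挖掘 together with fragments such as 据挖. Fragments of this kind are what a later filtering stage would have to remove. The sketch also makes the stated limitation concrete: a string that occurs only once never shares a prefix with a neighbouring suffix, so it is never proposed as a candidate.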

Key words: Patent terms, Hierarchical filtering method, Independence calculation, Relative Active Degree
Received: 11 June 2014      Published: 12 February 2015
CLC Number: TP391.1

Cite this article:

Hou Ting, Lv Xueqiang, Li Zhuo. Hierarchical Filtering Method for Patent Term Extraction. New Technology of Library and Information Service, 2015, 31(1): 24-30.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2015.01.04     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2015/V31/I1/24

