New Technology of Library and Information Service  2014, Vol. 30 Issue (9): 91-98    DOI: 10.11925/infotech.1003-3513.2014.09.12
Research of the Word Segmentation for Chinese Patent Claims
Zhang Jie, Zhang Haichao, Zhai Dongsheng
School of Economics and Management, Beijing University of Technology, Beijing 100124, China
Abstract  

[Objective] To segment Chinese patent claims in support of research on patent similarity. [Methods] This paper summarizes the segmentation words, the rules for substring segmentation, and the rules for domain term extraction, and constructs a domain dictionary. On this basis, it presents a method for segmenting Chinese patent claims based on domain dictionaries and rules. [Results] The experimental results show a precision of 90%, a recall of 95%, and an F-score of 92%. [Limitations] The large domain dictionary reduces the efficiency of large-scale segmentation. [Conclusions] The proposed method improves the effectiveness and efficiency of Chinese patent claim segmentation.
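
As a rough illustration of dictionary-based segmentation and of how the reported scores relate to each other, the following is a minimal Python sketch. It rests on assumptions: the abstract does not name the matching algorithm or the unit of evaluation, so forward maximum matching is used here only as a common stand-in, scores are computed over word spans, and the dictionary, sample text, and function names (`fmm_segment`, `to_spans`, `prf`, `f_score`) are hypothetical. It is not the authors' implementation.

```python
def fmm_segment(text, dictionary, max_len=6):
    """Forward maximum matching (assumed stand-in, not the paper's method):
    at each position, take the longest substring found in the dictionary."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, pos = set(), 0
    for word in words:
        spans.add((pos, pos + len(word)))
        pos += len(word)
    return spans


def f_score(precision, recall):
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def prf(predicted, gold):
    """Precision, recall, and F-score over matching word spans."""
    p_spans, g_spans = to_spans(predicted), to_spans(gold)
    correct = len(p_spans & g_spans)
    precision = correct / len(p_spans)
    recall = correct / len(g_spans)
    return precision, recall, f_score(precision, recall)


# Hypothetical toy domain dictionary and claim fragment.
toy_dictionary = {"一种", "数据", "处理", "数据处理", "装置"}
segmented = fmm_segment("一种数据处理装置", toy_dictionary)
print(segmented)

# Compare an over-segmented output against a gold segmentation.
print(prf(["一种", "数据", "处理", "装置"], ["一种", "数据处理", "装置"]))

# Harmonic mean of the precision and recall reported in the abstract.
print(round(f_score(0.90, 0.95), 2))
```

With a precision of 0.90 and a recall of 0.95, the harmonic mean is 1.71 / 1.85, or about 0.92, which agrees with the F-score reported in the abstract.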

Key words: Chinese patent claim; Chinese word segmentation; Domain dictionary; Terms extraction
Received: 21 February 2014      Published: 20 October 2014
CLC number: TP391

Cite this article:

Zhang Jie, Zhang Haichao, Zhai Dongsheng. Research of the Word Segmentation for Chinese Patent Claims. New Technology of Library and Information Service, 2014, 30(9): 91-98.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.09.12     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I9/91

