New Technology of Library and Information Service  2013, Vol. 29 Issue (7/8): 55-62    DOI: 10.11925/infotech.1003-3513.2013.07-08.08
Model Construction and Experiment Analysis of Automatic Indexing for Chinese Books
Wang Hao, Zou Jieli, Deng Sanhong
School of Information Management, Nanjing University, Nanjing 210093, China
Abstract  To address automatic keyword indexing for Chinese books, this paper applies the machine learning algorithm Conditional Random Fields (CRFs). The method trains on a large body of existing, manually indexed keyword data for Chinese books to generate an annotation model that captures semantic relations and rule features among sequence entities, then uses that model for machine prediction to extract a book's keywords automatically. The paper addresses two main problems. First, because the choice of CRFs parameters affects indexing performance, the authors run comparative tests from several angles to identify the optimal CRFs parameter set for keyword indexing of Chinese books. Second, the authors discuss the effect of different observed features on keyword indexing and show experimentally that four observed features effectively improve indexing performance. Finally, the optimal keyword indexing model for Chinese books is constructed.
Key words  Conditional Random Fields      Keywords indexing      Feature template      Word length of window      Feature function      Soft boundary parameter      Observed feature roles
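
The sequence-annotation idea described in the abstract can be illustrated with a minimal sketch. It assumes a simple character-level B/I/O tag set (B = first character of a keyword, I = inside a keyword, O = outside); the abstract does not state the role tags actually used, so the tag set, the function names `label_sequence` and `decode_keywords`, and the example keyword are all hypothetical. The sketch only shows how manually indexed keywords could be turned into a labeled training sequence and how predicted tags could be turned back into keywords; it does not train a CRFs model.

```python
from typing import List, Tuple


def label_sequence(text: str, keywords: List[str]) -> List[Tuple[str, str]]:
    """Tag each character of `text` as B, I or O given manually indexed keywords.

    Assumption: a B/I/O scheme, chosen here for illustration only.
    """
    tags = ["O"] * len(text)
    # Process longer keywords first so a short keyword cannot claim
    # characters that belong to a longer one.
    for kw in sorted(keywords, key=len, reverse=True):
        if not kw:
            continue
        start = text.find(kw)
        while start != -1:
            end = start + len(kw)
            # Only label spans that are still entirely unlabeled.
            if all(t == "O" for t in tags[start:end]):
                tags[start] = "B"
                for i in range(start + 1, end):
                    tags[i] = "I"
            start = text.find(kw, start + 1)
    return list(zip(text, tags))


def decode_keywords(chars: List[str], tags: List[str]) -> List[str]:
    """Rebuild keyword strings from a (predicted) tag sequence."""
    keywords: List[str] = []
    current = ""
    for ch, tag in zip(chars, tags):
        if tag == "B":
            if current:
                keywords.append(current)
            current = ch
        elif tag == "I" and current:
            current += ch
        else:
            if current:
                keywords.append(current)
            current = ""
    if current:
        keywords.append(current)
    return keywords


# Hypothetical example: a book title and one manually assigned keyword.
title = "统计自然语言处理"
labeled = label_sequence(title, ["自然语言处理"])
chars = [ch for ch, _ in labeled]
tags = [tag for _, tag in labeled]
recovered = decode_keywords(chars, tags)
```

In a setup like this, each (character, tag) pair would become one row of the training data, and the tags predicted by the trained model for an unseen title would be passed to `decode_keywords` to obtain candidate keywords.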
Received: 27 May 2013      Published: 02 September 2013
CLC Number: TP391

Cite this article:

Wang Hao, Zou Jieli, Deng Sanhong. Model Construction and Experiment Analysis of Automatic Indexing for Chinese Books. New Technology of Library and Information Service, 2013, 29(7/8): 55-62.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2013.07-08.08     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2013/V29/I7/8/55
