Chinese High-frequency Words Extraction Algorithm Without Thesaurus
Jiang Hua, Su Xiaoguang
Department of Equipment Economics and Management, Naval University of Engineering, Wuhan 430033, China
Abstract Building on the PAT array, this paper introduces an LCP array that records the lengths of the common prefixes of text suffixes, and presents a new thesaurus-free algorithm that extracts high-frequency words from Chinese text by scanning the LCP array. The algorithm does not depend on a segmentation dictionary and can extract any repeated string, in particular new words and combined words. Experimental results show that the high-frequency words extracted by the algorithm achieve a high acceptance rate, and that the algorithm is more effective than ICTCLAS at extracting combined words.
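The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the PAT array can be treated as a plain suffix array, builds it by naive sorting, and takes the LCP array to hold the common-prefix length of each pair of neighbouring suffixes in sorted order. The function names and the parameters `min_len`, `max_len`, and `min_freq` are hypothetical, and any filtering the paper applies to decide which repeated strings count as words is not reproduced here.

```python
def build_suffix_array(text):
    # Naive construction (assumption: stands in for the PAT array):
    # sort suffix start positions by the suffix they point to.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text, sa):
    # lcp[k] = length of the common prefix of the suffixes starting at
    # sa[k - 1] and sa[k]; lcp[0] is 0 by convention.
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        lcp[k] = n
    return lcp


def frequent_strings(text, min_len=2, max_len=4, min_freq=2):
    # Scan the LCP array once per candidate length. A maximal run of
    # neighbouring suffixes whose LCP is >= length shares one prefix of
    # that length, and the run size is that prefix's occurrence count.
    # min_freq is expected to be at least 2 (hypothetical parameter).
    sa = build_suffix_array(text)
    lcp = build_lcp_array(text, sa)
    counts = {}
    for length in range(min_len, max_len + 1):
        run = 1
        for k in range(1, len(sa) + 1):
            if k < len(sa) and lcp[k] >= length:
                run += 1
            else:
                if run >= min_freq:
                    start = sa[k - 1]
                    counts[text[start:start + length]] = run
                run = 1
    return counts
```

For instance, on the 12-character string `中文分词中文分词算法中文` this sketch reports `中文` three times and `中文分词` twice, without consulting any dictionary. It also reports fragments such as `文分`, which is why a real extractor needs a further acceptance step.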
Received: 27 March 2012
Published: 30 August 2012