New Technology of Library and Information Service  2012, Vol. 28 Issue (6): 50-53    DOI: 10.11925/infotech.1003-3513.2012.06.08
Chinese High-frequency Words Extraction Algorithm Without Thesaurus
Jiang Hua, Su Xiaoguang
Department of Equipment Economics and Management, Naval University of Engineering, Wuhan 430033, China
Abstract  Building on the PAT array, this paper introduces an LCP array that records the lengths of the common prefixes of text suffixes, and presents a new thesaurus-free algorithm that extracts high-frequency words from Chinese text by scanning the LCP array. The algorithm does not depend on a segmentation dictionary and can extract any repeated string, especially new words and combined words. Experimental results show that the high-frequency words extracted by the algorithm achieve a high acceptance rate, and that the algorithm is more effective than ICTCLAS at extracting combined words.
Key words: Chinese information processing; High-frequency word extraction; PAT array; Chinese word segmentation; Keyword detection
Received: 27 March 2012      Published: 30 August 2012
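
The abstract describes the approach only in outline: build a PAT (suffix) array, derive an LCP array from it, then scan the LCP array for repeated strings. The following is a minimal Python sketch of that general idea, not the authors' implementation. The function names and the `min_len`, `max_len`, and `min_freq` parameters are hypothetical, the suffix array is built by naive sorting for brevity, and the LCP array uses a standard linear-time scan (often called Kasai's algorithm). Any filtering the paper applies to decide which repeated strings count as acceptable words is not reproduced here.

```python
from typing import Dict, List


def build_suffix_array(text: str) -> List[int]:
    # Naive construction: sort suffix start positions by the suffix text.
    # Simple but slow; an assumption for illustration only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def build_lcp_array(text: str, sa: List[int]) -> List[int]:
    # lcp[k] = length of the common prefix of suffixes sa[k-1] and sa[k];
    # lcp[0] is 0 because the first suffix has no predecessor.
    n = len(text)
    rank = [0] * n
    for pos, start in enumerate(sa):
        rank[start] = pos
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] == 0:
            h = 0
            continue
        j = sa[rank[i] - 1]
        while i + h < n and j + h < n and text[i + h] == text[j + h]:
            h += 1
        lcp[rank[i]] = h
        if h > 0:
            h -= 1
    return lcp


def repeated_strings(text: str, min_len: int = 2, max_len: int = 4,
                     min_freq: int = 2) -> Dict[str, int]:
    # Hypothetical parameters: candidate lengths and a frequency threshold.
    sa = build_suffix_array(text)
    lcp = build_lcp_array(text, sa)
    n = len(text)
    found: Dict[str, int] = {}
    for length in range(min_len, max_len + 1):
        run = 0  # number of consecutive lcp entries >= length
        for k in range(1, n + 1):
            if k < n and lcp[k] >= length:
                run += 1
                continue
            # A run of r entries means r + 1 adjacent suffixes share
            # a prefix of this length, i.e. the string occurs r + 1 times.
            if run and run + 1 >= min_freq:
                start = sa[k - 1]
                found[text[start:start + length]] = run + 1
            run = 0
    return found


if __name__ == "__main__":
    print(repeated_strings("高频词抽取与高频词统计"))
```

On the sample string, the sketch reports 高频, 频词, and 高频词, each occurring twice. Note that 频词 is a fragment rather than a word: scanning the LCP array yields every repeated string, so a practical extractor would still need a filtering step, whose details the abstract does not give. Counts include overlapping occurrences.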
CLC number: TP391

Cite this article:

Jiang Hua, Su Xiaoguang. Chinese High-frequency Words Extraction Algorithm Without Thesaurus. New Technology of Library and Information Service, 2012, 28(6): 50-53.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2012.06.08     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2012/V28/I6/50