Chinese High-frequency Words Extraction Algorithm Without Thesaurus

doi:10.11925/infotech.1003-3513.2012.06.08

New Technology of Library and Information Service

2012, Vol. 28

Issue (6): 50-53 https://doi.org/10.11925/infotech.1003-3513.2012.06.08

Knowledge Organization and Knowledge Management


Jiang Hua (江华), Su Xiaoguang (苏晓光)

Department of Equipment Economics and Management, Naval University of Engineering, Wuhan 430033, China


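Abstract: Building on the PAT array, this paper introduces an LCP array that records the length of the common prefix shared by text suffixes, and extracts the high-frequency words of a text quickly by scanning the LCP array. The algorithm does not depend on a segmentation dictionary: it extracts high-frequency words by detecting repeated strings, can extract any repeated character string, and is particularly effective for new words and compound words. Experimental results show that the extracted high-frequency words reach a high acceptance rate and agree closely with the keywords extracted by the ICTCLAS system, and that the algorithm has an advantage in discovering compound words.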
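The idea of finding repeated strings by scanning an LCP array can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: a naive sort-based suffix array stands in for the PAT array, the LCP values are computed by direct comparison, and the function names (`suffix_array`, `lcp_array`, `repeated_strings`) and the `min_len` / `min_freq` thresholds are hypothetical. The paper's actual scanning and filtering procedure may differ.

```python
def suffix_array(text):
    # Naive construction (assumption): sort suffix start positions by the
    # suffix text itself. Adequate for short texts only.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[k] = length of the common prefix of the suffixes starting at
    # sa[k-1] and sa[k]; lcp[0] is 0 by convention.
    def common(a, b):
        n = 0
        while a + n < len(text) and b + n < len(text) and text[a + n] == text[b + n]:
            n += 1
        return n

    return [0] + [common(sa[k - 1], sa[k]) for k in range(1, len(sa))]


def repeated_strings(text, min_len=2, min_freq=2):
    # For each candidate length, a run of r consecutive LCP values >= length
    # means r + 1 sorted suffixes share the same prefix of that length,
    # i.e. the string occurs r + 1 times (overlapping occurrences included).
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    counts = {}
    for length in range(min_len, max(lcp, default=0) + 1):
        run = 0
        for k in range(1, len(sa) + 1):
            if k < len(sa) and lcp[k] >= length:
                run += 1
            else:
                if run > 0 and run + 1 >= min_freq:
                    start = sa[k - 1]  # last suffix of the run just closed
                    counts[text[start:start + length]] = run + 1
                run = 0
    return counts


if __name__ == "__main__":
    sample = "高频词抽取算法抽取高频词，高频词很多"
    for word, freq in sorted(repeated_strings(sample).items(), key=lambda kv: -kv[1]):
        print(word, freq)
```

No dictionary is consulted at any point: candidates come purely from repetition, which is why a scheme of this kind can surface new or compound words. The sketch rescans the LCP array once per candidate length, so it favors clarity over speed.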


Abstract: Based on the PAT array, this paper introduces an LCP array to record the lengths of the common prefixes of text suffixes and presents a new thesaurus-free algorithm that extracts the high-frequency words of Chinese text by scanning the LCP array. The algorithm does not depend on a segmentation dictionary and can extract any repeated string, especially new words and combined words. Experimental results show that the high-frequency words extracted by the algorithm achieve a high acceptance rate, and that the algorithm is more effective than ICTCLAS at extracting combined words.

Key words: Chinese information processing, High-frequency word extraction, PAT array, Chinese word segmentation, Keyword detection

Received: 2012-03-27; Published: 2012-08-30

TP391

Cite this article:

Jiang Hua, Su Xiaoguang. Chinese High-frequency Words Extraction Algorithm Without Thesaurus[J]. New Technology of Library and Information Service, 2012, 28(6): 50-53.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2012.06.08 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2012/V28/I6/50

