Experimental Analysis of Cross-Language Text Clustering Based on Feature Translation and Latent Semantic Indexing<sup>*</sup>

doi:10.11925/infotech.1003-3513.2014.01.05

New Technology of Library and Information Service

2014, Vol. 30

Issue (1): 28-35 https://doi.org/10.11925/infotech.1003-3513.2014.01.05

Knowledge Organization and Knowledge Management


Experimental Study of Multilingual Text Clustering

Deng Sanhong, Wan Jiexi, Wang Hao, Liu Xiwen

School of Information Management, Nanjing University, Nanjing 210093, China


Abstract: [Objective] To analyze, through several groups of experiments, the performance, caveats, and future directions of feature translation and latent semantic indexing (LSI) in cross-language text clustering. [Methods] A corpus of 2,736 aligned Chinese-English bilingual news articles was collected from bilingual websites. Text clustering experiments were run with each of the two methods, feature translation and LSI, and their recall, precision, and F-measure were compared. [Results] The feature translation method is relatively simple to apply and clearly improves the clustering of multilingual text. The LSI method, because of its time and space complexity and other inherent shortcomings, produced less than satisfactory final results. [Limitations] The sample needs to be broadened, and more comprehensive LSI experiments in a high-performance computing environment are still needed. [Conclusions] The feature translation method depends on further improving the performance of the translation system, while the LSI method needs to resolve issues such as computational complexity and the choice of the K value.

Keywords: cross-language text clustering, feature translation, latent semantic indexing (LSI)
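To illustrate the feature translation idea summarized in the abstract, here is a minimal Python sketch. It is not the authors' implementation. It assumes documents are already tokenized, uses a hypothetical toy lexicon `ZH_EN` in place of a real translation system, and compares documents by cosine similarity over raw term counts. All function names and sample data are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy Chinese-to-English lexicon (an assumption for illustration;
# a real system would rely on a dictionary or machine translation service).
ZH_EN = {"经济": "economy", "增长": "growth", "足球": "football", "比赛": "match"}


def translate_features(tokens, lexicon):
    """Map each source-language token to a target-language term.

    Tokens without a lexicon entry are dropped.
    """
    return [lexicon[t] for t in tokens if t in lexicon]


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# One pre-segmented Chinese document and two English documents (toy data).
zh_doc = ["经济", "增长", "经济"]
en_econ = ["economy", "growth", "report"]
en_sport = ["football", "match", "report"]

# After translation, all documents share one English feature space.
zh_vec = Counter(translate_features(zh_doc, ZH_EN))
sim_econ = cosine(zh_vec, Counter(en_econ))
sim_sport = cosine(zh_vec, Counter(en_sport))
```

Once every document lives in the same feature space, any monolingual clustering algorithm can be applied to the combined collection. The quality of the lexicon or translation system directly limits how many features survive the mapping, which is consistent with the abstract's conclusion that translation performance is the main lever for this method.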

Received: 2014-02-14; Published: 2014-02-14

TP391

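Funding: This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Knowledge Organization Models and Applications for Knowledge Services" (No. 71273126) and the Key Project of the National Social Science Fund of China "Research on Semantics-Based Deep Aggregation and Visual Presentation of Library Collection Resources" (No. 11AZD090).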

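Author contributions: Deng Sanhong proposed the research idea; Wang Hao designed the research plan; Wan Jiexi collected the data, ran the experiments, and drafted the paper; Deng Sanhong and Liu Xiwen revised the final version.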

Cite this article:

Deng Sanhong, Wan Jiexi, Wang Hao, Liu Xiwen. Experimental Study of Multilingual Text Clustering[J]. New Technology of Library and Information Service, 2014, 30(1): 28-35.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.01.05 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I1/28

