New Technology of Library and Information Service  2014, Vol. 30 Issue (1): 28-35    DOI: 10.11925/infotech.1003-3513.2014.01.05
KNOWLEDGE ORGANIZATION AND KNOWLEDGE MANAGEMENT
Experimental Study of Multilingual Text Clustering
Deng Sanhong, Wan Jiexi, Wang Hao, Liu Xiwen
School of Information Management, Nanjing University, Nanjing 210093, China
Abstract  [Objective] To analyze the performance, key issues, and future directions of characteristics translation and latent semantic indexing (LSI) in cross-language text clustering. [Methods] A total of 2736 Chinese-English bilingual news texts were selected from several bilingual websites; clustering experiments were carried out with both methods, and the results were compared on measures such as recall rate, accuracy, and F value. [Results] The characteristics translation method improves clustering, whereas the LSI method does not achieve good results because of its time and space complexity. [Limitations] The sample needs to be expanded, and the LSI experiment needs to be repeated in a high-performance computing environment. [Conclusions] The characteristics translation method needs a more effective translation system, and the LSI method needs to address its computational complexity and the selection of the K value, among other issues.
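The abstract does not say how the recall rate, accuracy, and F value were computed for clusters. As an illustration only, the sketch below uses one common convention: each reference class is matched to the cluster that gives it the highest F value, and the per-class scores are averaged with weights proportional to class size. It assumes that "accuracy" here means precision in the usual sense; the function name `clustering_f_measure` and the toy labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def clustering_f_measure(labels, clusters):
    """Weighted best-match F value of a clustering (illustrative convention).

    labels[i]   -- reference class of document i
    clusters[i] -- cluster assigned to document i
    """
    if len(labels) != len(clusters):
        raise ValueError("labels and clusters must have the same length")

    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    overlap = Counter(zip(labels, clusters))  # documents shared by a class and a cluster

    total = 0.0
    for cls, cls_size in class_sizes.items():
        best = 0.0
        for clu, clu_size in cluster_sizes.items():
            hits = overlap[(cls, clu)]
            if hits == 0:
                continue
            precision = hits / clu_size  # share of the cluster that belongs to the class
            recall = hits / cls_size     # share of the class captured by the cluster
            f_value = 2 * precision * recall / (precision + recall)
            best = max(best, f_value)
        total += (cls_size / n) * best   # weight each class by its size
    return total


if __name__ == "__main__":
    # Toy data: four documents, two reference classes, two clusters.
    reference = ["sports", "sports", "politics", "politics"]
    assigned = [0, 0, 0, 1]
    print(round(clustering_f_measure(reference, assigned), 4))
```

Other conventions, such as pairwise precision and recall or purity-based scores, would give different numbers on the same clustering, so figures computed this way are not directly comparable with those reported in the paper.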
Key words: Cross-language text clustering; Characteristics translation; LSI
Received: 14 February 2014      Published: 14 February 2014
CLC Number: TP391

Cite this article:

Deng Sanhong,Wan Jiexi,Wang Hao,Liu Xiwen. Experimental Study of Multilingual Text Clustering. New Technology of Library and Information Service, 2014, 30(1): 28-35.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.01.05     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I1/28
