Abstract: [Objective] To analyze the performance, key issues, and future directions of feature translation and latent semantic indexing (LSI) in cross-language text clustering. [Methods] A total of 2736 Chinese-English bilingual news texts were selected from several bilingual websites. Clustering experiments were run with both methods, and the results were compared on recall, accuracy, and F value. [Results] The feature translation method improves clustering, while the LSI method does not achieve good results because of its time and space complexity. [Limitations] The sample needs to be expanded, and the LSI experiment needs to be repeated in a high-performance computing environment. [Conclusions] The feature translation method needs a more effective translation system, and the LSI method needs to address its computational complexity and the selection of the K value, among other issues.
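The abstract compares the two methods on recall, accuracy, and F value but does not say how these are computed for a clustering. As an illustration only, the sketch below uses one common convention, the class-weighted best-match F value: each true category is matched to the cluster that gives it the highest F, and the per-category scores are averaged by category size. The function name `clustering_f_measure` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def clustering_f_measure(true_labels, cluster_ids):
    """Class-weighted best-match F value for a flat clustering.

    Assumption: this is one common definition, not necessarily the
    one used in the paper.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label lists must have the same length")
    n = len(true_labels)
    if n == 0:
        raise ValueError("no documents to evaluate")

    class_sizes = Counter(true_labels)                # documents per true category
    cluster_sizes = Counter(cluster_ids)              # documents per cluster
    overlap = Counter(zip(true_labels, cluster_ids))  # documents per (category, cluster)

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap[(cls, clu)]
            if n_both == 0:
                continue
            precision = n_both / n_clu  # share of the cluster that belongs to the category
            recall = n_both / n_cls     # share of the category captured by the cluster
            best = max(best, 2 * precision * recall / (precision + recall))
        total += (n_cls / n) * best     # weight each category by its size
    return total


# Toy example: four documents, two true categories, two clusterings.
truth = ["sports", "sports", "politics", "politics"]
perfect = clustering_f_measure(truth, [0, 0, 1, 1])
skewed = clustering_f_measure(truth, [0, 0, 0, 1])
```

For bilingual data, a Chinese document and an English document on the same topic would carry the same true label, so a clustering that splits documents by language rather than by topic scores lower under this measure.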
Deng Sanhong, Wan Jiexi, Wang Hao, Liu Xiwen. Experimental Study of Multilingual Text Clustering [J]. New Technology of Library and Information Service, 2014, 30(1): 28-35. (In Chinese; original title: 基于特征翻译和潜在语义标引的跨语言文本聚类实验分析, "Experimental Analysis of Cross-Language Text Clustering Based on Feature Translation and Latent Semantic Indexing".)