Research of Mining the Category Knowledge Based on English-Chinese Humanities and Social Sciences Parallel Corpus in Phrase Level
Wang Dongbo1, Han Pu2, Shen Si2, Wei Xiangqing3
1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China; 2. School of Information Management, Nanjing University, Nanjing 210093, China; 3. Bilingual Dictionary Research Center, Nanjing University, Nanjing 210093, China
Abstract Using an established clustering algorithm, this study mines category knowledge at the phrase level from an English-Chinese parallel corpus in the humanities and social sciences. The clustering and morphological conversion algorithms are selected according to the experimental data and the specific research needs. A comparison of word-level knowledge clustering on Chinese, English, and combined English-Chinese features shows that bilingual word features outperform monolingual ones. The resulting category knowledge can be applied directly to knowledge bases and machine translation systems, and the study also examines how English and Chinese words are expressed in the course of mining it.
Received: 09 October 2012
Published: 06 February 2013