New Technology of Library and Information Service, 2012, (11): 40-46     https://doi.org/10.11925/infotech.1003-3513.2012.11.07
Knowledge Organization and Knowledge Management
Research of Mining the Category Knowledge Based on English-Chinese Humanities and Social Sciences Parallel Corpus in Phrase Level
Wang Dongbo¹, Han Pu², Shen Si², Wei Xiangqing³
1. College of Information Science and Technology, Nanjing Agricultural University, Nanjing 210095, China;
2. School of Information Management, Nanjing University, Nanjing 210093, China;
3. Bilingual Dictionary Research Center, Nanjing University, Nanjing 210093, China
Abstract: Building on established clustering algorithms, this paper reports experiments on mining category knowledge from a phrase-level English-Chinese parallel corpus in the humanities and social sciences. The clustering algorithm and the English morphological conversion algorithm are selected according to the experimental data and the specific research needs. A comparison of word-level knowledge clustering with Chinese, English, and English-Chinese bilingual features shows that bilingual word features outperform monolingual ones. The category knowledge obtained can be applied directly to the construction of knowledge bases and machine translation models, and the study also examines how English and Chinese words each behave in the process of acquiring category knowledge.
Key words: CSSCI; English-Chinese parallel corpus in phrase level; Bisecting K-means clustering algorithm; Category knowledge
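The abstract names bisecting K-means clustering over bilingual word features but does not spell out the procedure on this page. As a rough illustration only, the Python sketch below clusters a few made-up English-Chinese phrase pairs. Everything in it is an assumption for the example and not the authors' implementation: the toy phrase pairs, the helper names `features`, `two_means` and `bisecting_kmeans`, the language-tagged bag-of-words representation, cosine similarity, and the deterministic seeding rule. Chinese is assumed to be pre-segmented with spaces, and no stemming or lemmatization is applied.

```python
import math
from collections import Counter

def features(english, chinese_segmented):
    """Bag of words over both sides of a phrase pair; tokens are tagged by language."""
    vec = Counter("en:" + w.lower() for w in english.split())
    vec.update("zh:" + w for w in chinese_segmented.split())
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def centroid(vectors):
    """Sum of the vectors; cosine ignores scale, so no division is needed."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return total

def two_means(vectors, members, iterations=10):
    """Split one cluster (a list of indices) in two with cosine-based 2-means."""
    # Assumed seeding rule: the first member and the member least similar to it.
    seed_a = members[0]
    seed_b = min(members, key=lambda i: cosine(vectors[seed_a], vectors[i]))
    centers = [vectors[seed_a], vectors[seed_b]]
    left, right = [], []
    for _ in range(iterations):
        left = [i for i in members
                if cosine(vectors[i], centers[0]) >= cosine(vectors[i], centers[1])]
        right = [i for i in members if i not in left]
        if not left or not right:
            break
        centers = [centroid(vectors[i] for i in left),
                   centroid(vectors[i] for i in right)]
    return left, right

def bisecting_kmeans(vectors, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(range(len(vectors)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()
        left, right = two_means(vectors, largest) if len(largest) > 1 else (largest, [])
        if not left or not right:      # cannot split further
            clusters.append(largest)
            break
        clusters += [left, right]
    return clusters

# Hypothetical phrase pairs (English phrase, space-segmented Chinese phrase).
pairs = [
    ("market economy", "市场 经济"),
    ("planned economy", "计划 经济"),
    ("economic growth", "经济 增长"),
    ("machine translation", "机器 翻译"),
    ("literary translation", "文学 翻译"),
    ("translation theory", "翻译 理论"),
]
vectors = [features(en, zh) for en, zh in pairs]
clusters = bisecting_kmeans(vectors, k=2)
for members in clusters:
    print([pairs[i][0] for i in members])
```

In this toy data, "economic growth" shares no English token with the other economics phrases and is linked to them only through the Chinese token 经济, which hints at why combining both languages can help; it says nothing about the size of the effect on the real corpus.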
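Received: 2012-10-09      Published: 2013-02-06
CLC number: TP391
Funding: This work is one of the research outcomes of the National High Technology Research and Development Program of China (863 Program) project "Development of a Search Engine Oriented Mainly Toward Scientific and Technical Literature Services" (No. 2011AA01A206), the National Social Science Fund of China key project "Research on the Construction of a Chinese-English Dynamic Terminology Database for the Humanities and Social Sciences" (No. 11AYY002), and the Jiangsu Province Graduate Training Innovation Project "Research on Information Integration and Retrieval Based on Heterogeneous Social Network Data" (No. CXZZ12-0073).
Corresponding author: Wang Dongbo     E-mail: wangdongbo0102@gmail.com
Cite this article:
Wang Dongbo, Han Pu, Shen Si, Wei Xiangqing. Research of Mining the Category Knowledge Based on English-Chinese Humanities and Social Sciences Parallel Corpus in Phrase Level. New Technology of Library and Information Service, 2012, (11): 40-46.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2012.11.07      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2012/V/I11/40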