Automatic Acquisition of Domain Parallel Corpora from Internet

doi:10.11925/infotech.1003-3513.2014.12.05

New Technology of Library and Information Service

2014, Vol. 30, Issue (12): 36-43 https://doi.org/10.11925/infotech.1003-3513.2014.12.05

Knowledge Organization and Knowledge Management



Automatic Acquisition of Domain Parallel Corpora from Internet

Shao Jian, Zhang Chengzhi

School of Economics & Management, Nanjing University of Science & Technology, Nanjing 210094, China


Abstract

[Objective] To classify bilingual corpora acquired from the Internet and align the classified corpora at the sentence level, producing domain parallel corpora. [Methods] An SVM-based text classifier assigns the acquired Chinese-English bilingual corpora to domains. A sentence alignment tool that combines length-based and lexicon-based methods then aligns the sentences of each classified corpus. To improve alignment accuracy, the Chinese-English sentence length parameters are computed from manually aligned Chinese-English parallel corpora and used together with a Chinese-English bilingual dictionary. [Results] After sentence alignment of each domain corpus, the method achieves a sentence alignment accuracy of 95.45%. The computed mean sentence length ratio is 1.7777, with a variance of 1.2640. [Limitations] Because the bilingual corpora were already well aligned initially, the reported alignment accuracy may not be generally representative. [Conclusions] The experimental results indicate that the method is effective and can acquire domain parallel corpora of satisfactory quality.

Key words: Sentence alignment, Text classification, Parallel corpora, Machine translation
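The length parameters and the combined length-and-dictionary scoring mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' tool: it follows the general style of length-based aligners, and everything named here is an assumption for illustration, including measuring length in tokens, taking the ratio as target length over source length, the dictionary format (source token mapped to a set of target tokens), and the hypothetical `lexical_weight` parameter.

```python
import math


def estimate_length_params(aligned_pairs):
    """Estimate length parameters from manually aligned sentence pairs.

    aligned_pairs: list of (source_len, target_len) with source_len > 0.
    Returns (c, s2): c is the mean target/source length ratio, s2 is the
    variance of the target length per unit of source length.
    """
    total_src = sum(src for src, _ in aligned_pairs)
    total_tgt = sum(tgt for _, tgt in aligned_pairs)
    c = total_tgt / total_src
    s2 = sum((tgt - c * src) ** 2 / src for src, tgt in aligned_pairs)
    s2 /= len(aligned_pairs)
    return c, s2


def length_cost(src_len, tgt_len, c, s2):
    """Negative log of the two-tailed probability of the observed length gap."""
    # Guard against a zero variance estimate on tiny samples.
    delta = (tgt_len - c * src_len) / math.sqrt(src_len * max(s2, 1e-9))
    prob = math.erfc(abs(delta) / math.sqrt(2))
    # Floor the probability so the logarithm stays finite.
    return -math.log(max(prob, 1e-12))


def lexical_score(src_tokens, tgt_tokens, dictionary):
    """Fraction of source tokens with a dictionary translation in the target."""
    if not src_tokens:
        return 0.0
    tgt_set = set(tgt_tokens)
    hits = sum(1 for tok in src_tokens if tgt_set & dictionary.get(tok, set()))
    return hits / len(src_tokens)


def pair_cost(src_tokens, tgt_tokens, c, s2, dictionary, lexical_weight=2.0):
    """Lower is better: length cost minus a bonus for dictionary matches."""
    base = length_cost(len(src_tokens), len(tgt_tokens), c, s2)
    bonus = lexical_weight * lexical_score(src_tokens, tgt_tokens, dictionary)
    return base - bonus


# Toy data: (source_len, target_len) from hypothetical hand-aligned pairs.
training_pairs = [(10, 18), (5, 9), (8, 13)]
c, s2 = estimate_length_params(training_pairs)

toy_dictionary = {"猫": {"cat", "cats"}, "鱼": {"fish"}}
candidate_cost = pair_cost(
    ["猫", "喜欢", "鱼"], ["cats", "like", "fish"], c, s2, toy_dictionary
)
print(c, s2, candidate_cost)
```

In an aligner of this kind, a dynamic-programming search would call a function like `pair_cost` for each candidate sentence pairing and keep the lowest-cost path; the sketch stops at the scoring step because the abstract does not describe the search itself.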

Received: 2014-06-30 Published: 2015-01-20

TP361

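Corresponding author: Zhang Chengzhi E-mail: zhangcz@njust.edu.cn

Author contributions: Shao Jian: designed the research plan, designed and carried out the experiments, cleaned and analyzed the data, and drafted the paper; Zhang Chengzhi: proposed the research idea, discussed the research plan, collected and analyzed the data, and revised the final version.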

Cite this article:

Shao Jian, Zhang Chengzhi. Automatic Acquisition of Domain Parallel Corpora from Internet [J]. New Technology of Library and Information Service, 2014, 30(12): 36-43.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.12.05 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I12/36

