Optimal Context Window for Chinese Word Sense Disambiguation*

doi:10.11925/infotech.1003-3513.2009.07-08.10

New Technology of Library and Information Service (现代图书情报技术)

2009, Vol. 25, Issue (7-8): 49-53

Knowledge Organization and Knowledge Management

Li Gang¹, Kou Guangzeng¹, Xia Chenxi², Quan Ji³, Jang Donghyok⁴

¹ (School of Information Management, Wuhan University, Wuhan 430072, China)
² (Beijing Science and Technology Information Institute, Beijing 100048, China)
³ (Institute of Systems Engineering, Wuhan University, Wuhan 430072, China)
⁴ (JengJunTaek WonSan Economic College, WonSan, North Korea)


Abstract

To determine the optimal context window for an ambiguous word, the paper applies cross-validation to a set of candidate windows and selects the window with the lowest error rate as the optimal one. Applied to the SemEval-2007 Chinese data set, this method yields an optimal context window of [-2, +2]. To verify the result, a further disambiguation test is run on the SemEval-2007 test set; it shows that the window chosen by cross-validation improves the accuracy of Chinese word sense disambiguation (WSD) to some extent. The optimal context windows for ambiguous words of different parts of speech are also discussed.

Key words: Word sense disambiguation, Context window, Feature selection, Chinese
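
The selection procedure summarized in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the classifier or the candidate windows, so the multinomial Naive Bayes model, the helper names (`window_features`, `cv_error`, `best_window`), the candidate range and the toy data are all assumptions made for the example. A window [-l, +r] is represented as the pair `(l, r)`.

```python
import math
from collections import Counter, defaultdict
from itertools import product

def window_features(tokens, target_idx, left, right):
    """Words inside the window [-left, +right] around the target (target excluded)."""
    lo = max(0, target_idx - left)
    hi = min(len(tokens), target_idx + right + 1)
    return [tok for i, tok in enumerate(tokens[lo:hi], start=lo) if i != target_idx]

def train_nb(samples):
    """Collect counts for a multinomial Naive Bayes model (assumed classifier)."""
    sense_counts, word_counts, vocab = Counter(), defaultdict(Counter), set()
    for feats, sense in samples:
        sense_counts[sense] += 1
        word_counts[sense].update(feats)
        vocab.update(feats)
    return sense_counts, word_counts, vocab

def predict_nb(model, feats):
    """Return the sense with the highest add-one-smoothed log probability."""
    sense_counts, word_counts, vocab = model
    total = sum(sense_counts.values())
    v = len(vocab) + 1  # one extra slot for words unseen in training
    best, best_score = None, -math.inf
    for sense in sorted(sense_counts):
        n = sum(word_counts[sense].values())
        score = math.log(sense_counts[sense] / total)
        score += sum(math.log((word_counts[sense][w] + 1) / (n + v)) for w in feats)
        if score > best_score:
            best, best_score = sense, score
    return best

def cv_error(instances, left, right, k=5):
    """k-fold cross-validated error rate for the window [-left, +right]."""
    data = [(window_features(t, i, left, right), s) for t, i, s in instances]
    errors = 0
    for fold in range(k):
        train = [d for j, d in enumerate(data) if j % k != fold]
        test = [d for j, d in enumerate(data) if j % k == fold]
        model = train_nb(train)
        errors += sum(predict_nb(model, f) != s for f, s in test)
    return errors / len(data)

def best_window(instances, max_size=3, k=5):
    """Pick the candidate window with the lowest error; prefer smaller windows on ties."""
    candidates = [(l, r) for l, r in product(range(max_size + 1), repeat=2) if l + r > 0]
    return min(candidates, key=lambda w: (cv_error(instances, *w, k=k), w[0] + w[1]))

# Hypothetical toy data: the sense of the target "打" depends only on the next word.
noise = ["a", "b", "c", "d"]
instances = []
for j in range(20):
    sense, cue = ("play", "球") if j % 2 == 0 else ("call", "电话")
    n = j // 2
    tokens = [noise[n % 4], noise[(n + 1) % 4], "打", cue, noise[(n + 2) % 4]]
    instances.append((tokens, 2, sense))

print(best_window(instances))  # (0, 1)
```

On this toy data the only informative position is +1, so the search returns `(0, 1)`; on the SemEval-2007 Chinese data set the paper reports [-2, +2], which this sketch does not attempt to reproduce.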

Received: 2009-07-04 Published: 2009-08-25

TP391

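Funding:

* This work is one of the research outcomes of the National Natural Science Foundation of China project "Research on Feature Extraction Methods for Text Collections and Their Applications" (Grant No. 70673070).

Corresponding author: Kou Guangzeng E-mail: kouguangzeng@yahoo.com.cn

About the authors: Li Gang, Kou Guangzeng, Xia Chenxi, Quan Ji, Jang Donghyok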

Cite this article:

Li Gang, Kou Guangzeng, Xia Chenxi, Quan Ji, Jang Donghyok. Optimal Context Window for Chinese Word Sense Disambiguation[J]. New Technology of Library and Information Service, 2009, 25(7-8): 49-53.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2009.07-08.10 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2009/V25/I7-8/49


