面向汉语句法功能分布知识库的词汇类别知识挖掘研究*

引用本文

王东波, 朱丹浩. 面向汉语句法功能分布知识库的词汇类别知识挖掘研究&^*. 现代图书情报技术, 2013, 29(3): 33-37
Wang Dongbo, Zhu Danhao. Research of Mining the Word Category Knowledge for Chinese Syntactic Function Distribution Knowledge Base. New Technology of Library and Information Service, 2013, 29(3): 33-37 复制到剪切板

Permissions

面向汉语句法功能分布知识库的词汇类别知识挖掘研究^*

王东波¹, 朱丹浩²

1(南京农业大学信息科学技术学院南京 210095)

2(联合国大学国际软件技术研究所澳门 3058)

修回日期:2013-03-01

基金:本文系国家自然科学基金面上项目“面向知识服务的知识组织模式与应用研究”(项目编号:71273126)、高技术研究发展计划(863计划)“以科技文献服务为主的搜索引擎研制”(项目编号:2011AA01A206)和国家社会科学基金项目“基于语言特征的中文意见挖掘研究”(项目编号:11CYY031)的研究成果之一。

摘要

基于清华大学汉语树库,通过多叉树存储结构,构建汉语词汇的句法功能分布知识库。在构建的句法功能分布知识库基础上,利用Sparse Feature Clustering(SFC)中的K-medoids聚类算法,对汉语词汇的类别知识进行挖掘探究。

关键词: 树库; 词汇句法功能; 知识库; SFC

Research of Mining the Word Category Knowledge for Chinese Syntactic Function Distribution Knowledge Base

Wang Dongbo¹, Zhu Danhao²

1(College of Information and Technology Science, Nanjing Agricultural University, Nanjing 210095, China)

2(International Institute for Software Technology, United Nations University, Macao 3058, China)

Abstract

According to the Chinese word syntactic function distribution, the paper constructs syntactic function distribution knowledge in multi-way tree storage structure base based on Tsinghua treebank. The Chinese word category knowledge is mined by using the K-medoids clustering algorithm of Sparse Feature Clustering based on syntactic function distribution knowledge base.

Keyword: Treebank; Word syntactic function; Knowledge base; SFC

Show Figures

1 引言

随着自然语言处理和文本挖掘技术的发展,从非结构化文本中挖掘和抽取相应的专门或通用知识以便更好地服务于基础和应用研究日益成为一种趋势。本文基于清华大学树库,通过SFC(Sparse Feature Clustering)中的K-medoids聚类算法,在汉语词汇句法功能知识库的基础上,挖掘词汇具体类别知识;挖掘的类别知识不仅可以应用到中文信息处理的汉语句法结构歧义消解上,而且对于构建大规模精确度更高的汉语树库具有重要的促进作用。

关于依据词汇句法功能对整个汉语词汇进行分类,相关的研究者进行了多方面的探究。在朱德熙提出利用词汇的句法功能进行分类的设想基础上,从词汇句法功能要为汉语完全自动句法分析服务这一目的出发,陈小荷^{[ 1]}给出了主要依据汉语词汇的句法功能分布对汉语主要实词词汇进行分类的理念。从上述词汇句法功能的设想和理念出发,基于人民日报大规模语料,徐艳华^{[ 2]}在计算机的辅助下,通过手工逐一分析一般名词、动词、形容词以及高频副词的语法功能,并完成3 514个常用汉语词汇的简单语法功能数据库的构建。陈锋等^{[ 3]}基于清华树库的短语结构在句法库中所充当的句法功能,详细考察介宾结构、主谓结构、定中结构、述宾结构、述补结构和状中结构的句法功能。卢俊之等^{[ 4]}在获取词汇和短语结构句法功能分布的基础上,基于句法功能匹配的理论,主要使用消歧算法,构建汉语自动句法分析器。上述研究者在基于汉语词汇句法功能的基础上,从理念、人工分类和应用的角度,对整个汉语词汇的分类问题进行探索和研究,本文正是在上述研究基础上的一个延续。在具体的研究过程中,使用相应的聚类算法从词汇句法功能知识库中挖掘出相应的词汇知识。关于聚类算法,国内外研究者在单语和双语上都进行了多方面的探究。崔尚卿等^{[ 5]}从基于密度的聚类算法本身存在的问题出发,提出不均匀密度的自动聚类算法。王伟^{[ 6]}从各个角度对文本聚类的技术进行综述,关注文本聚类中的文本聚类定义、文本预处理、文本特征抽取和文本聚类的各种算法中的各方面问题。王舵等^{[ 7]}针对聚类速度慢、初值对结果的影响大、易陷入局部最优等问题,提出基于相似度测度和覆盖方法来解决上述问题。潘章明^{[ 8]}将基于差分进化的自动聚类算法和半监督方法结合起来,在具体的实验中,启发式地引入部分已经获得了带标签的数据,聚类的精确度也得到了有效的提高。于洪等^{[ 9]}从面向知识的聚类框架出发,使用决策粗糙集理论对聚类算法进行研究,并且考虑了一种特定的损失函数。在单语文本的聚类研究中,国外的相关学者^{[ 10, 11, 12]}给出了各种文本聚类的方法以及相应的改进算法,而双语文本或多语文本的聚类,则相对比较复杂。采用“先聚类后合并”的策略,Chen等^{[ 13]}通过计算簇间相似度,把中文的名词、动词主要实词和命名实体译成对照的英语,在单语简单聚类和合并聚类簇的方法前提下完成了对英汉双语新闻的自动聚类。在机器翻译系统的基础上,结合双语词典,Leftin^{[ 14]}在把Russian和English两种文本中的词汇完成全部对照的前提下,开发了面向这两种语言的文本自动聚类系统。在把目标语言翻译成英语的基础上,Evans等^{[ 15]}设计了Columabia’s Newsblaster系统,用来进行新闻摘要和话题的自动生成。通过借助语言分析,Mathieu等^{[ 16]}从文本中抽取相应的特征词,在相对应的双语词典基础上,完成了英文、法文和西班牙文混合文本的自动聚类。

2 汉语句法功能知识库构建

本文有关汉语词汇类别知识挖掘的研究是在清华汉语树库^{[ 17]}上测试、验证的,同时研究构建了清华汉语句法功能知识库。

2.1 数据源简介

按照完整的层次结构描述方法,在16种汉语短语成分标记和27种句法结构关系的前提下,清华大学计算机系智能技术与系统国家重点实验室在以文学、学术、新闻、应用4种体裁构成的汉语平衡语料基础上构建了一个100万字的汉语树库。其短语句法结构分布如表1所示:

表1 清华汉语树库短语句法结构分布

2.2 汉语词汇句法功能获取和汉语句法功能知识库构建

基于词汇句法功能统计的需要,结合清华汉语句子树的特点,本文中多叉树被用来存放整个句法树,并使用树遍历算法遍历每一棵句法树,从而可以便捷地对句法树的短语结构或词汇功能进行存储、统计、编辑和删节等。

清华汉语句子树使用对称的标记符号“[ ]”标记了句法结构的起始和结尾。从多叉树数据结构特点出发,结合语料库标记,构建句法多叉树的流程如下:汉语句子构成一棵多叉树;每个短语结构标记构成多叉树中的非叶节点;每个词汇或者标点符号构成叶节点。基于已有的清华汉语句子树,在多叉树存储结构上,通过设计程序完成清华汉语句子多叉树的构建工作,算法流程如图1所示:

	Figure Option View Download New Window
	图1 汉语句子树生成流程

根据图1所示的流程和句法树构造算法,完成对汉语句子的树构造,具体例子如下:“ZJ [ZW 地理学/n [PO 是/vC [DZ [DZ 一/m 门/qN ] [DZ [LH [ZZ 很/dD 古老/a ] [很 /dD 年轻/a ] ] 的/u 学科/n ] ] ] ] 。/。 ]”,句法生成的树样例如图2所示:

	Figure Option View Download New Window
	图2 句法树样例

基于汉语句子的句法生成树,使用根节点深度优先的方法,通过树检索遍历整个树结构,当遍历到叶子节点的时候,抽取词汇,并且记录其父节点所在的句法结构,如图2所示,当遍历到“一”这一词汇的时候,记录它充当句法功能的句法结构“DZ”。在上述算法的基础上,获取了清华大学汉语词汇句法功能,并构建了词汇句法功能知识库,清华大学汉语词汇句法功能知识库的部分样例如表2所示。

表2 清华大学部分汉语词汇句法功能知识库样例

3 聚类算法简介

对汉语语法功能分布知识库进行聚类,可以将汉语词语按照其语法功能特征自然归类,从而实现基于句法功能知识库的类别知识挖掘。对建立的知识库进行观察可以发现,数据表中维度为40,共有数据51 455条,其中不为0的数据项共有104 035个。而表中总共拥有数据项40×51 455=2 058 200个,也就是说不为空的数据占总数据的5.05%。可以得出结论,该聚类问题是一个高维属性稀疏聚类问题,无法直接通过传统的低维聚类方法处理。

SFC^{[ 18]}应用在各个维度的属性取值是区间变量的高维聚类问题上,SFC中使用K-medoids算法来聚类,在具体进行词汇类别知识挖掘的过程中,本文自定义并改进了获取medoids的算法:在每一个类U中,计算每一个对象i到其他任意对象j的距离之和s(i),选取拥有最小的距离和的节点作为新的medoid。距离和公式如下:
s(i)=j∈U,j≠id(i,j)(1).

由于在实际数据中可能存在这种情况:数据对象m和n所有值皆相同,按照稀疏相似性应该被分到一类中,然而如果初次随机抽取的medoids中既包括m又包括n,那么就会出现m和n被分到两类的情况,不符合实际情况。针对这个问题,本文在算法中改进了初始随机medoids的获取算法:设定一个阈值μ,在随机获得medoids时,如果获取的任意两个medoids的稀疏差异度小于μ,则删去一个medoid,并重新获取一个,直至任意两个medoids的稀疏差异度大于μ为止。

4 词汇类别挖掘过程和结果

4.1 词汇类别知识挖掘的过程

K-medoids算法按一定的策略随机选择初始聚类中心medoids,聚类的结果受medoids随机选取的影响较大。因此,本文做了5组试验,设定聚类数为40。在平均经过迭代后,得到5组聚类结果。将结果按照其包含词条数目大小降序排列后,部分聚类结果统计如表3所示:

表3 SFC部分聚类结果统计

从表3的统计结果可以发现,最大分类的词条数目分别为5 805、5 611、5 466、6 010和6 425,误差在10%之内;最小分类的词条数目分别为59、98、97、204、50,最大误差为73.54。5次试验的标准差最低为1 313.368,最高为1 461.444,误差为10.13%,这说明分类结果的离散程度非常接近。图3给出了5次试验结果中分类所含词数的散点图,横轴表示分类序号,纵轴表示分类中所包含的词数。

	Figure Option View Download New Window
	图3 5次试验结果分布

从图3中可以看出,5次试验结果中,每个分类所含词数的分布非常接近。这说明虽然medoids选取的随机性对于聚类结果有一定影响,但聚类所获得的分布还是很稳定的。

4.2 词汇类别知识挖掘的结果

在上述聚类实验的基础上,本文给出了基于词汇句法功能知识库挖掘出来的具体的词汇各类别下的前10个词汇样例,部分词汇类别知识挖掘的结果如表4所示:

表4 部分词汇类别知识挖掘的结果

通过表4所给出的10类词汇聚类结果样例可以看出,依据词汇句法功能挖掘出来的词汇类别虽然从具体的已有词性标注判断有些不属于同一类别,但从整体的分布上看,还是基本上符合句法功能分布的。如果从语义的角度考虑,通过本文的资源和方法挖掘出来的词汇类别有些确实不属于同一类别,这也从一定程度上说明仅仅依据词汇句法功能获取词汇类别知识的精确率有待提升。通过聚类方法挖掘出来的词汇类别知识是否符合汉语词汇的真实分布,需要在具体的消歧任务上进行测试,这也是下一步需要完成的工作。

5 结语

本文基于清华汉语树库,通过多叉树存储结构,统计了汉语词汇的句法功能,并在此基础上构建汉语词汇句法功能知识库。基于SFC高维聚类下的K-medoids算法,在汉语词汇句法功能知识库的基础上,挖掘出清华汉语所有词汇的类别知识。在未来研究中,一方面将扩大树库的规模,比如把宾州汉语树库^{[ 19]}通过结构转换增加到清华汉语树库中,同时把本文获取类别知识的方法,应用到以印欧语言为代表的英语当中;另一个方面将会把通过聚类方法挖掘到的类别知识应用到具体的句法结构消歧中,从而验证通过该方法获取的词汇类别知识的优越性。

参考文献

View Option

[1]	陈小荷. 从自动句法分析角度看汉语词类问题[J]. 语言教学与研究, 1999（3）: 63-72. （Chen Xiaohe. Chinese Words’Classes from the Perspective of Automatic Syntactic Analysis[J]. Language Teaching and Research, 1999（3）: 63-72. ） [本文引用:1]
[2]	徐艳华. 现代汉语实词语法功能考察及词类体系重构[D]. 南京: 南京师范大学, 2006. (Xu Yanhua. Survey on Modern Chinese Notional Word Grammar Function and Reconstructing the POS System[D]. Nanjing: Nanjing Normal University, 2006. ) [本文引用:1] [CJCR: 0.3509]
[3]	陈锋, 陈小荷. 基于树库的现代汉语短语分布考察[J]. 语言科学, 2008, 7（1）: 12-17. （Chen Feng, Chen Xiaohe. A Study on Grammartical Functions of Phrases in Mand arin Chinese Based on Chinese TreeBank[J]. Linguistic Sciences, 2008, 7（1）: 12-17. ） [本文引用:1] [CJCR: 0.6043]
[4]	卢俊之, 陈小荷, 王东波, 等. 基于语法功能匹配的汉语句法分析算法[J]. 计算机工程与应用, 2008, 44（16）: 151-153, 159. （Lu Junzhi, Chen Xiaohe, Wang Dongbo, et al. Chinese Parsing Algorithm Based on Grammar Function Match[J]. Computer Engineering and Applications, 2008, 44（16）: 151-153, 159. ） [本文引用:1] [CJCR: 0.457]
[5]	崔尚卿, 马秀莉, 唐世渭, 等. 基于不均匀密度的自动聚类算法[J]. 计算机工程, 2008, 34（23）: 86-88. （Cui Shangqing, Ma Xiuli, Tang Shiwei, et al. Auto-clustering Algorithm Based on Non-uniform Density[J]. Computer Engineering, 2008, 34（23）: 86-88. ） [本文引用:1] [CJCR: 0.492]
[6]	王伟. 文本自动聚类技术研究[J]. 情报杂志, 2009, 28（2）: 94-96. （Wang Wei. Research on Text Automatic Clustering[J]. Journal of Intelligence, 2009, 28（2）: 94-96. ） [本文引用:1] [CJCR: 0.951]
[7]	王舵, 郄君, 张娟, 等. 一种快速词自动聚类算法[J]. 计算机应用与软件, 2010, 27（8）: 277-278. （Wang Duo, Qie Jun, Zhang Juan, et al. A New Algorithm of Words Automatic Clustering[J]. Computer Applications and Software, 2010, 27（8）: 277-278. ） [本文引用:1] [CJCR: 0.515]
[8]	潘章明. 半监督的自动聚类[J]. 计算机应用, 2010, 30（10）: 2614-2617. （Pan Zhangming. Semi-supervised Automatic Clustering[J]. Journal of Computer Applications, 2010, 30（10）: 2614-2617. ） [本文引用:1] [CJCR: 0.646]
[9]	于洪, 储双双. 一种基于决策粗糙集的自动聚类方法[J]. 计算机科学, 2011, 38（1）: 221-224. （Yu Hong, Chu Shuangshuang. Novel Autonomous Clustering Method Based on Decision-theoretic Rough Set[J]. Computer Science, 2011, 38（1）: 221-224. ） [本文引用:1] [CJCR: 0.61]
[10]	Boley D, Gini M, Gross R, et al. Partitioning-based Clustering for Web Document Categorization[J]. Decision Support Systems, 1999, 27（3）: 329-341. [本文引用:1] [JCR: 2.201]
[11]	Mao J, Jain A K. A Self-organizing Network for Hyperellipsoidal Clustering[J]. IEEE Transactions on Neural Networks, 1996, 7（1）: 16-29. [本文引用:1] [JCR: 2.952]
[12]	Cai W, Chen S, Zhang D. Fast and Robust Fuzzy C-means Clustering Algorithms Incorporating Local Information for Image Segmentation[J]. Pattern Recognition, 2007, 40（3）: 825-838. [本文引用:1] [JCR: 2.632]
[13]	Chen H H, Lin C J. A Multilingual News Summarizer[C]. In: Proceedings of the 18th International Conference on Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2000: 159-165. [本文引用:1]
[14]	Leftin L J. Newsblaster Russian-English Clustering Performance Analysis[R]. Columbia Computer Science Technical Reports, 2003. [本文引用:1]
[15]	Evans D K, Klavans J L, McKeown K R. Columbia Newsblaster: Multilingual News Summarization on the Web Demonstration[C]. In: Proceedings of HLT-NAACL 2004. Stroudsburg: Association for Computational Linguistics, 2004: 1-4. [本文引用:1]
[16]	Mathieu B, Besancon R, Fluhr C. Multilingual Document Clusters Discovery[C]. In: Proceedings of RIAO 2004. 2004: 116-125. [本文引用:1]
[17]	周强, 张伟, 俞士汶. 汉语树库的构建[J]. 中文信息学报, 1997（4）: 42-51. （Zhou Qiang, Zhang Wei, Yu Shiwen. Building a Chinese Treebank[J]. Journal of Chinese Information Processing, 1997（4）: 42-51. ） [本文引用:1] [CJCR: 1.13]
[18]	Dhillon I S, Mallela S, Kumar R. A Divisive Information Theoretic Feature Clustering Algorithm for Text Classification[J]. The Journal of Machine Learning Research, 2003, 3（1）: 1265-1287. [本文引用:1]
[19]	Marcus M P, Marcinkiewicz M A, Santorini B. Building a Large Annotated Corpus of English: The Penn Treebank[J]. Computational Linguistics, 1993, 19（2）: 313-330. [本文引用:1] [JCR: 0.94]

1999

0.0

... 在朱德熙提出利用词汇的句法功能进行分类的设想基础上,从词汇句法功能要为汉语完全自动句法分析服务这一目的出发,陈小荷[<xref ref-type="bibr" rid="R1">1</xref>]给出了主要依据汉语词汇的句法功能分布对汉语主要实词词汇进行分类的理念 ...

2006

0.0

0.3509

... 从上述词汇句法功能的设想和理念出发,基于人民日报大规模语料,徐艳华[<xref ref-type="bibr" rid="R2">2</xref>]在计算机的辅助下,通过手工逐一分析一般名词、动词、形容词以及高频副词的语法功能,并完成3 514个常用汉语词汇的简单语法功能数据库的构建 ...

2008

0.0

0.6043

Linguistic Sci. 2008, 7(1):12 - 17

A Study on Grammartical functions of Phrases in Mandarin Chinese Based on Chinese TreeBank

基于树库的现代汉语短语分布考察

<name MTML-type="CN">(陈锋)</e>,<name MTML-type="CN">(陈小荷)</e>.<article-title MTML-type="CN">(基于树库的现代汉语短语分布考察)</e>[J].<source MTML-type="CN">(语言科学)</e>,<year>2008</year>,<volumn>7</volumn>（<issue>1</issue>）:<fpage>12</fpage>-<lpage>17</lpage>.（<name>Chen Feng</name>,<name>Chen Xiaohe</name>.<article-title>A Study on Grammartical Functions of Phrases in Mandarin Chinese Based on Chinese TreeBank</article-title>[J].<source>Linguistic Sciences</source>,<year>2008</year>, <volumn>7</volumn>（<issue>1</issue>）:<fpage>12</fpage>-<lpage>17</lpage>.）

School of Chinese Language and Literature of Nanjing Normal University， Nanjing Jiangsu 210097

The grammatical function of a phrase instance should be determined while parsing. Based on the largescale treebank corpus, the paper makes an investigation into the grammatical functions of Phrases in Mandarin Chinese. The paper first describes the distribution of the grammatical functions of six kinds of phrases categorized by structure with statistical data based largescale tree corpus. And then the distribution of the grammatical functions of 11 kinds of phrases categorized by function is explored.Finally, The paper compares several methods which are used to distinguish the grammatical function of an instance of npDZ phrase, [JP2]and the conclusion that a parser would hardly performance well if it distinguishes the grammatical function of an instance of phrase only by category.

动句法分析中需要确定短语实例的语法功能。文章试图在大规模汉语树库统计的基础上对汉语短语的语法功能分布进行定量分析,以此评估确定短语实例语法功能的三种方法。首先描述了汉语短语结构类充当11种语法功能的情况，然后对短语功能类充当各种语法功能进行统计与分析，最后使用了核心词来估计定中短语的语法功能。在比较分析了不同方法估计短语实例的语法功能的效果后得出结论：汉语短语的语法功能表现出一定的聚合性，但自动句法分析中以类标记来估计短语语法功能效果欠佳。

... 陈锋等[<xref ref-type="bibr" rid="R3">3</xref>]基于清华树库的短语结构在句法库中所充当的句法功能,详细考察介宾结构、主谓结构、定中结构、述宾结构、述补结构和状中结构的句法功能 ...

2008

0.0

0.457

Comput Eng Appl. 2008, 44(16):151 - 153

Chinese parsing algorithm based on grammar function match

基于语法功能匹配的汉语句法分析算法

LU Jun-zhi,CHEN Xiao-he,WANG Dong-bo,CHEN Feng

卢俊之,陈小荷,王东波,陈锋

School of Chinese Language and Literature，Nanjing Normal University，Nanjing 210097，China

Based on the primary method of Grammar Function Match as the syntactic analysis，this paper realizes a kind of parsing algorithm，which views the TCT 973 as principal resource to survey the grammatical function.This algorithm not only efficiently reduces the fake ambiguities，but also has a favorably analyzed efficiency，which analyzes results including abundant and accurate grammatical information.The experiment indicates that the rate of phrase precision and recall reaches 75.17%，73.69% and 65.06%，56.55% respectively in close test and open test.

以语法功能匹配作为句法分析的基本方法，以100万词清华973树库作为语法功能调查的主要资源，实现了一种基于语法功能匹配的句法分析算法。该算法能有效减少伪歧义，具有良好的分析效率，其分析结果包含了丰富而准确的语法信息。实验表明，短语正确率与召回率在封闭测试和开放测试中分别达到75.17%、73.69%和65.06%、56.55%。

... 卢俊之等[<xref ref-type="bibr" rid="R4">4</xref>]在获取词汇和短语结构句法功能分布的基础上,基于句法功能匹配的理论,主要使用消歧算法,构建汉语自动句法分析器 ...

2008

0.0

0.492

Comput Eng. 2008, 34(23):86 - 88

Auto-clustering Algorithm Based on Non-uniform Density

基于不均匀密度的自动聚类算法

CUI Shang-qing1,2, MA Xiu-li1, TANG Shi-wei1, WANG Wen-qing2

崔尚卿1,2，马秀莉1，唐世渭1，王文清2

(1. State Key Lab of Optical & Auditive Information Processing, Peking University, Beijing 100871; 2. Institute of Digital Library, Peking University, Beijing 100871)

According to the fact that few density-based clustering algorithm can automatically process the data with non-uniform density, an auto-clustering algorithm based on non-uniform density is proposed. This algorithm has all the merits of the existed density-based clustering algorithm and can deal with the date effectively. Experimental results show its efficiency.

针对基于密度的聚类算法不能自动处理密度分布不均匀的数据问题，提出一种基于不均匀密度的自动聚类算法。该算法既保持了一般基于密度算法的优点，也能有效地处理分布不均匀的数据。实验结果表明，该算法是有效的。

... 崔尚卿等[<xref ref-type="bibr" rid="R5">5</xref>]从基于密度的聚类算法本身存在的问题出发,提出不均匀密度的自动聚类算法 ...

2009

0.0

0.951

J Intell. 2009, (2):94 - 97

Research on Text Automatic Clustering

文本自动聚类技术研究

<name MTML-type="CN">(王伟)</e>.<article-title MTML-type="CN">(文本自动聚类技术研究)</e>[J].<source MTML-type="CN">(情报杂志)</e>,<year>2009</year>,<volumn>28</volumn>（<issue>2</issue>）:<fpage>94</fpage>-<lpage>96</lpage>.（<name>Wang Wei</name>.<article-title>Research on Text Automatic Clustering</article-title>[J].<source>Journal of Intelligence</source>,<year>2009</year>,<volumn>28</volumn>（<issue>2</issue>）:<fpage>94</fpage>-<lpage>96</lpage>.）

王伟

自动聚类作为一种自动化程度较高的无监督机器学习技术,在信息检索和数据挖掘领域得到了广泛的应用.探讨了文本聚类的定义和步骤,依据文本自动聚类的步骤分别对文本的处理、自动聚类算法以及文本聚类结果的评价进行了阐述.

... 王伟[<xref ref-type="bibr" rid="R6">6</xref>]从各个角度对文本聚类的技术进行综述,关注文本聚类中的文本聚类定义、文本预处理、文本特征抽取和文本聚类的各种算法中的各方面问题 ...

2010

0.0

0.515

... 王舵等[<xref ref-type="bibr" rid="R7">7</xref>]针对聚类速度慢、初值对结果的影响大、易陷入局部最优等问题,提出基于相似度测度和覆盖方法来解决上述问题 ...

2010

0.0

0.646

J Comput Appl. 2010, (10):2614 - 2617

Semi-supervised automatic clustering

半监督的自动聚类

PANZhang-ming

潘章明

基于进化算法的自动聚类方法在处理聚类结构比较松散的数据集时,存在聚美准确性不高、收敛速度慢的缺陷,为此提出一种半监督的自动聚类算法.该算法从调整染色体的解码过程入手,首先从染色体中分离出聚类数和所有的质心,然后使用最近邻规则滤去部分偏离数据集分布区域的无效质心,最后嵌入先验信息辅助K-均值方法对剩余的质心聚类,进一步优化染色体的解码结果.实验结果表明,该算法对聚类结构紧密或松散的数据集均可给出较精确的聚类结果.

... 潘章明[<xref ref-type="bibr" rid="R8">8</xref>]将基于差分进化的自动聚类算法和半监督方法结合起来,在具体的实验中,启发式地引入部分已经获得了带标签的数据,聚类的精确度也得到了有效的提高 ...

2011

0.0

0.61

Comput Sci. 2011, (1):221 - 224

Novel Autonomous Clustering Method Based on Decision-theoretic Rough Set

一种基于决策粗糙集的自动聚类方法

YUHong;CHUShuang-shuang

于洪;储双双

提出了一种基于决策粗糙集的面向知识的自动聚类方法.在面向知识的聚类算法中,获取初始聚类结果依赖人工阈值的设置.为此,首先根据物理学知识提出了一种差值排序方法来自动得到阈值.另外,讨论了决策粗糙集模型的损失函数,提出了一种聚类评估方法;通过对聚类结果的评估来实现自动聚类.实验结果表明新方法是有效的.

... 于洪等[<xref ref-type="bibr" rid="R9">9</xref>]从面向知识的聚类框架出发,使用决策粗糙集理论对聚类算法进行研究,并且考虑了一种特定的损失函数 ...

1999

2.201

0.0

... 在单语文本的聚类研究中,国外的相关学者[<xref ref-type="bibr" rid="R10">10</xref>,<xref ref-type="bibr" rid="R11">11</xref>,<xref ref-type="bibr" rid="R12">12</xref>]给出了各种文本聚类的方法以及相应的改进算法,而双语文本或多语文本的聚类,则相对比较复杂 ...

1996

2.952

0.0

2007

2.632

0.0

Pattern Recogn. 2007, 40(3):825 - 838 DOI:10.1016/j.patcog.2006.07.011

Fast and robust fuzzy c-means clustering algorithms incorporating local information for image segmentation

<name>Cai W</name>,<name>Chen S</name>,<name>Zhang D</name>.<article-title>Fast and Robust Fuzzy C-means Clustering Algorithms Incorporating Local Information for Image Segmentation</article-title>[J]. <source>Pattern Recognition</source>, <year>2007</year>,<volumn>40</volumn>（<issue>3</issue>）:<fpage>825</fpage>-<lpage>838</lpage>.

<h2 class="secHeading" id="section_abstract">Abstract</h2>Fuzzy c-means (FCM) algorithms with spatial constraints (FCM_S) have been proven effective for image segmentation. However, they still have the following disadvantages: (1) although the introduction of local spatial information to the corresponding objective functions enhances their insensitiveness to noise to some extent, they still lack enough robustness to noise and outliers, especially in absence of prior knowledge of the noise; (2) in their objective functions, there exists a crucial parameter <span id="mmlsi17" onclick="submitCitation('/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0031320306003451&_mathId=si17.gif&_pii=S0031320306003451&_issn=00313203&_acct=C000228598&_version=1&_userid=10&md5=91923fe7749ae216ae2ef927db31935d')" style="cursor:pointer;" alt="Click to view the MathML source" title="Click to view the MathML source">α used to balance between robustness to noise and effectiveness of preserving the details of the image, it is selected generally through experience; and (3) the time of segmenting an image is dependent on the image size, and hence the larger the size of the image, the more the segmentation time. In this paper, by incorporating local spatial and gray information together, a novel fast and robust FCM framework for image segmentation, i.e., fast generalized fuzzy c-means (FGFCM) clustering algorithms, is proposed. FGFCM can mitigate the disadvantages of FCM_S and at the same time enhances the clustering performance. Furthermore, FGFCM not only includes many existing algorithms, such as fast FCM and enhanced FCM as its special cases, but also can derive other new algorithms such as FGFCM_S1 and FGFCM_S2 proposed in the rest of this paper. The major characteristics of FGFCM are: (1) to use a new factor <span id="mmlsi18" onclick="submitCitation('/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0031320306003451&_mathId=si18.gif&_pii=S0031320306003451&_issn=00313203&_acct=C000228598&_version=1&_userid=10&md5=be475f415c6d1fb85ed577539a63889f')" style="cursor:pointer;" alt="Click to view the MathML source" title="Click to view the MathML source">Sij as a local (both spatial and gray) similarity measure aiming to guarantee both noise-immunity and detail-preserving for image, and meanwhile remove the empirically-adjusted parameter <span id="mmlsi19" onclick="submitCitation('/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0031320306003451&_mathId=si19.gif&_pii=S0031320306003451&_issn=00313203&_acct=C000228598&_version=1&_userid=10&md5=e2e4ba19080011b3570a12c72c203da8')" style="cursor:pointer;" alt="Click to view the MathML source" title="Click to view the MathML source">α; (2) fast clustering or segmenting image, the segmenting time is only dependent on the number of the gray-levels q rather than the size <span id="mmlsi20" onclick="submitCitation('/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0031320306003451&_mathId=si20.gif&_pii=S0031320306003451&_issn=00313203&_acct=C000228598&_version=1&_userid=10&md5=94d1232019695055d0f8a5068e7f3c91')" style="cursor:pointer;" alt="Click to view the MathML source" title="Click to view the MathML source">N(?q) of the image, and consequently its computational complexity is reduced from <span id="mmlsi21" onclick="submitCitation('/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0031320306003451&_mathId=si21.gif&_pii=S0031320306003451&_issn=00313203&_acct=C000228598&_version=1&_userid=10&md5=4a507097659bc921ac94e47012d0e6b3')" style="cursor:pointer;" alt="Click to view the MathML source" title="Click to view the MathML source">O(NcI1) to <span id="mmlsi22" onclick="submitCitation('/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0031320306003451&_mathId=si22.gif&_pii=S0031320306003451&_issn=00313203&_acct=C000228598&_version=1&_userid=10&md5=5cb6e66c34d42280212decb88f640501')" style="cursor:pointer;" alt="Click to view the MathML source" title="Click to view the MathML source">O(qcI2), where c is the number of the clusters, <span id="mmlsi23" onclick="submitCitation('/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0031320306003451&_mathId=si23.gif&_pii=S0031320306003451&_issn=00313203&_acct=C000228598&_version=1&_userid=10&md5=e822c283eac9c8ffbf3ff85312d98c3f')" style="cursor:pointer;" alt="Click to view the MathML source" title="Click to view the MathML source">I1 and <span id="mmlsi24" onclick="submitCitation('/science?_ob=MathURL&_method=retrieve&_eid=1-s2.0-S0031320306003451&_mathId=si24.gif&_pii=S0031320306003451&_issn=00313203&_acct=C000228598&_version=1&_userid=10&md5=5b92146d16a87749cff46cbd58d457ea')" style="cursor:pointer;" alt="Click to view the MathML source" title="Click to view the MathML source"><img height="16" border="0" style="vertical-align:bottom" width="138" alt="View the MathML source" title="View the MathML source" src="http://ars.els-cdn.com/content/image/1-s2.0-S0031320306003451-si24.gif"> are the numbers of iterations, respectively, in the standard FCM and our proposed fast segmentation method. The experiments on the synthetic and real-world images show that FGFCM algorithm is effective and efficient.

2000

0.0

... 采用“先聚类后合并”的策略,Chen等[<xref ref-type="bibr" rid="R13">13</xref>]通过计算簇间相似度,把中文的名词、动词主要实词和命名实体译成对照的英语,在单语简单聚类和合并聚类簇的方法前提下完成了对英汉双语新闻的自动聚类 ...

2003

0.0

... 在机器翻译系统的基础上,结合双语词典,Leftin[<xref ref-type="bibr" rid="R14">14</xref>]在把Russian和English两种文本中的词汇完成全部对照的前提下,开发了面向这两种语言的文本自动聚类系统 ...

2004

0.0

... 在把目标语言翻译成英语的基础上,Evans等[<xref ref-type="bibr" rid="R15">15</xref>]设计了Columabia’s Newsblaster系统,用来进行新闻摘要和话题的自动生成 ...

2004

0.0

... 通过借助语言分析,Mathieu等[<xref ref-type="bibr" rid="R16">16</xref>]从文本中抽取相应的特征词,在相对应的双语词典基础上,完成了英文、法文和西班牙文混合文本的自动聚类 ...

1997

0.0

1.13

... 2 汉语句法功能知识库构建本文有关汉语词汇类别知识挖掘的研究是在清华汉语树库[<xref ref-type="bibr" rid="R17">17</xref>]上测试、验证的,同时研究构建了清华汉语句法功能知识库 ...

2003

0.0

... SFC[<xref ref-type="bibr" rid="R18">18</xref>]应用在各个维度的属性取值是区间变量的高维聚类问题上,SFC中使用K-medoids算法来聚类,在具体进行词汇类别知识挖掘的过程中,本文自定义并改进了获取medoids的算法:在每一个类U中,计算每一个对象i到其他任意对象j的距离之和s(i),选取拥有最小的距离和的节点作为新的medoid ...

1993

0.94

0.0

... 在未来研究中,一方面将扩大树库的规模,比如把宾州汉语树库[<xref ref-type="bibr" rid="R19">19</xref>]通过结构转换增加到清华汉语树库中,同时把本文获取类别知识的方法,应用到以印欧语言为代表的英语当中 ...