New Technology of Library and Information Service (现代图书情报技术), 2014, Vol. 30, Issue 10: 49-55. DOI: 10.11925/infotech.1003-3513.2014.10.08
Knowledge Organization and Knowledge Management
Semantic Incremental Improvement on Vector Space Model for Text Modeling
Hu Jiming (胡吉明), Xiao Lu (肖璐)
Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This paper improves text classification based on the vector space model (VSM) by introducing semantic increments, and verifies the improvement experimentally. [Methods] After reviewing existing work on introducing and improving semantic vectors in text representation, the paper proposes a framework for representing texts as semantic vectors. Using the mappings from topic words and from ordinary words to concepts in a domain ontology, it builds a concept hierarchy tree, locates words within it, computes the semantic similarity of concepts, and combines this with semantic increments to construct the semantic vector of a text. [Results] Comparative text classification experiments show that the proposed method is feasible and effective, and that it outperforms the other methods tested in macro-averaged precision, macro-averaged recall, and macro-averaged F1. [Limitations] Because the method is still an improvement built on the VSM, it does not fully express semantic information, and methods that model text in a genuinely semantic way remain to be explored. Experiments on more types of data are also needed to improve the method's applicability. [Conclusions] Exploring the semantic enrichment of the original VSM is of practical value for current research on text classification and semantic association.
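
The pipeline outlined in the abstract (concept hierarchy tree, concept similarity, semantic increment) can be illustrated with a minimal sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the toy ontology `PARENT`, the use of Wu-Palmer similarity as a stand-in for the paper's similarity measure, the rule that a term's weight is raised by the similarity-weighted weights of related terms, and the `threshold` parameter. It is not the authors' method.

```python
from typing import Dict, List, Optional

# Hypothetical toy ontology: child concept -> parent concept (root maps to None).
# Each term is assumed to map to the concept with the same name.
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "vehicle": "entity",
    "car": "vehicle",
    "truck": "vehicle",
    "fruit": "entity",
    "apple": "fruit",
}


def path_to_root(concept: str) -> List[str]:
    """Return the concept followed by its ancestors, ending at the root."""
    path: List[str] = []
    node: Optional[str] = concept
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(concept: str) -> int:
    """Depth in the hierarchy tree, counting the root as depth 1."""
    return len(path_to_root(concept))


def wu_palmer(a: str, b: str) -> float:
    """Wu-Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_a = set(path_to_root(a))
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(node for node in path_to_root(b) if node in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))


def semantic_increment(weights: Dict[str, float], threshold: float = 0.5) -> Dict[str, float]:
    """Assumed increment rule: add sim(t, u) * w_u for every other term u
    whose concept similarity to t reaches the threshold."""
    boosted: Dict[str, float] = {}
    for term, weight in weights.items():
        increment = 0.0
        for other, other_weight in weights.items():
            if other == term:
                continue
            sim = wu_palmer(term, other)
            if sim >= threshold:
                increment += sim * other_weight
        boosted[term] = weight + increment
    return boosted


# Term weights of one document (for example TF-IDF values; made up here).
doc = {"car": 0.5, "truck": 0.25, "apple": 0.25}
semantic_doc = semantic_increment(doc)
```

In this toy document, "car" and "truck" share the parent "vehicle" and so reinforce each other, while "apple" is related to them only through the root and keeps its original weight.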

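The macro-averaged measures named in the results can be computed as in the following sketch. The function name `macro_prf` and the sample labels are hypothetical. Macro-F1 is taken here as the mean of the per-class F1 scores; the paper may instead derive it from macro-precision and macro-recall, which the abstract does not specify.

```python
from typing import List, Tuple


def macro_prf(y_true: List[str], y_pred: List[str]) -> Tuple[float, float, float]:
    """Macro-averaged precision, recall, and F1 over all classes seen."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Undefined ratios (zero denominators) are treated as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    # Every class counts equally, regardless of how many documents it has.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Made-up labels for four documents in two classes.
macro_p, macro_r, macro_f1 = macro_prf(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```

Because each class contributes equally to the average, macro-averaged scores are sensitive to performance on small classes, which matters for corpora with uneven category sizes.
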
Key words: Text modeling; Semantic Vector Space Model; Semantic increment; Semantic similarity
Received: 2014-03-17
CLC number: TP391
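Funding:

This work is one of the research outcomes of the National Natural Science Foundation of China Young Scientists project "Research on Information Recommendation Based on User-Resource Association in the Social Network Environment" (Grant No. 71303178) and the Wuhan University Humanities and Social Sciences Research Project "Research on User Modeling Based on Relational Community Discovery in the Social Network Environment" (Grant No. 274013).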

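Corresponding author: Hu Jiming, E-mail: whuhujiming@qq.com
Author contributions: Hu Jiming proposed the research idea, designed the research plan, carried out the research, and wrote and revised the paper; Xiao Lu collected, cleaned, and analyzed the data and ran the comparative experiments.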
Cite this article:
Hu Jiming, Xiao Lu. Semantic Incremental Improvement on Vector Space Model for Text Modeling [J]. New Technology of Library and Information Service, 2014, 30(10): 49-55. DOI: 10.11925/infotech.1003-3513.2014.10.08.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.10.08
