New Technology of Library and Information Service (现代图书情报技术), 2016, Vol. 32, Issue 2: 59-66    DOI: 10.11925/infotech.1003-3513.2016.02.08
Research Paper
A New Automatic Categorization Method with Documents Based on HowNet*
Chinese title: 基于《知网》的多种类型文献混合自动分类研究 (Mixed Automatic Classification of Multiple Document Types Based on HowNet)
Li Xiangdong (李湘东)1,2, Liu Kang (刘康)1, Ding Cong (丁丛)1, Gao Fan (高凡)1
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract

[Objective] This paper addresses the feature mismatch problem caused by differences among document types, with the aim of improving automatic classification performance. [Methods] The training set is built from documents whose type differs from that of the documents to be classified, and semantic features are extended using the third-party resource HowNet. [Results] In classification experiments on four document types (webpages, books, non-academic periodicals and academic journals), the proposed method increased the F-measure by 1.2% to 11.0% compared with classification without feature extension. [Limitations] Publicly accessible corpora were not used for every document type, so the generality of the method and the objectivity of the results need further testing. [Conclusions] The results indicate that the method is feasible and practical: it reduces, to varying degrees, the semantic differences among document types and improves automatic text classification through both corpus construction and feature extension.

Keywords: Third-party resource; HowNet; Feature extension; Semantic difference
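
The abstract does not spell out how the semantic feature extension is computed, so the following is only a minimal sketch of the general idea, not the authors' procedure. It assumes a word-to-word similarity score is available (here a small hypothetical table, `TOY_SIMILARITY`, stands in for a HowNet-based measure) and adds to a document's feature vector any vocabulary word that is sufficiently similar to a word the document already contains. The function names, the `threshold` parameter and the weighting rule (count times similarity) are all illustrative assumptions.

```python
from collections import Counter

# Hypothetical similarity scores standing in for a HowNet-based word
# similarity measure; a real system would compute these from HowNet.
TOY_SIMILARITY = {
    ("computer", "software"): 0.8,
    ("computer", "network"): 0.7,
    ("novel", "fiction"): 0.9,
}


def similarity(a, b):
    """Symmetric lookup: identical words score 1.0, unknown pairs 0.0."""
    if a == b:
        return 1.0
    return TOY_SIMILARITY.get((a, b), TOY_SIMILARITY.get((b, a), 0.0))


def expand_features(tokens, vocabulary, threshold=0.75):
    """Return term weights for one document after semantic expansion.

    Words already in the document keep their raw counts. A vocabulary word
    absent from the document is added when its similarity to some document
    word reaches `threshold`; its weight is count * similarity (an assumed
    weighting rule), taking the largest such value.
    """
    counts = Counter(tokens)
    expanded = {word: float(count) for word, count in counts.items()}
    for word, count in counts.items():
        for candidate in vocabulary:
            if candidate in counts:
                continue  # already a feature of this document
            score = similarity(word, candidate)
            if score >= threshold:
                weight = count * score
                expanded[candidate] = max(expanded.get(candidate, 0.0), weight)
    return expanded


if __name__ == "__main__":
    # Vocabulary drawn from a training set of a different document type.
    vocab = ["software", "network", "fiction", "computer"]
    print(expand_features(["computer", "computer", "novel"], vocab))
```

In this sketch the vocabulary plays the role of the training-set features from another document type: expansion lets a test document share features with training documents even when the two use different surface words.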
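Received: 2015-08-12
Funding: *This work is one of the research outcomes of the National Social Science Fund of China project "Automatic Classification of Multiple Types of Digital Text Resources" (No. 15BTQ066).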
Cite this article:
Li Xiangdong, Liu Kang, Ding Cong, Gao Fan. A New Automatic Categorization Method with Documents Based on HowNet[J]. New Technology of Library and Information Service, 2016, 32(2): 59-66. DOI: 10.11925/infotech.1003-3513.2016.02.08.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.02.08
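Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery (《数据分析与知识发现》)
Address: 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing    Postal code: 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn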