New Technology of Library and Information Service, 2016, Vol. 32, Issue 9: 17-26. DOI: 10.11925/infotech.1003-3513.2016.09.02
Section: Review
A Review and Outlook of Research on Feature Semantic Association and Weighting Strategies in the Automatic Classification of Digital Texts*
(Published English title: Review of Digital Documents Automatic Classification Research)
Li Xiangdong 1,2, Ba Zhichao 1,3, Gao Fan 1
1 School of Information Management, Wuhan University, Wuhan 430072, China
2 Center for the Studies of Information Resources, Wuhan University, Wuhan 430072, China
3 Information Research Institute of Shandong Academy of Sciences, Ji'nan 250014, China
Abstract

[Objective] This paper examines the main problems, and possible solutions, in research on the automatic classification of digital texts, covering bibliographic records and citation entries as well as newer media such as news web pages and blogs. [Coverage] We reviewed the principal results and literature on feature semantic conversion, feature expansion and weighting strategies in machine-learning-based automatic classification. [Methods] We analyzed and summarized the literature in terms of leading studies, key technologies, the current level of achievement, and future directions. [Results] For feature semantic conversion, feature expansion and weighting strategies, we analyzed the symptoms and causes of the open problems and identified shortcomings of current research in the semantic representation of texts and in the use of knowledge bases. [Limitations] We did not cover classification algorithms or other comparatively mature parts of the classification process. [Conclusions] To improve the performance of automatic classification of digital texts, future research could combine the Vector Space Model with probabilistic topic models, draw on external knowledge bases while improving concept similarity computation, and combine multiple weighting strategies into a composite weighted representation model.

Keywords: Automatic classification; Feature semantic association; Feature semantic conversion; Feature expansion; Weighting strategy
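The abstract refers to weighting strategies within the Vector Space Model without stating a formula. As a minimal sketch, assuming the common textbook TF-IDF scheme as a baseline weight (an assumption for illustration, not a method taken from the review), term weights and document similarity might be computed as follows. The function names `tf_idf_vectors` and `cosine` and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one {term: weight} dict per tokenized document.

    Weight = (term count / document length) * log(N / document frequency).
    This is one common TF-IDF variant, assumed here for illustration only.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus (already tokenized); not data from the review.
docs = [
    ["library", "catalog", "classification"],
    ["library", "blog", "news"],
    ["news", "blog", "classification"],
]
vectors = tf_idf_vectors(docs)
```

In this sketch a term that occurs in fewer documents ("catalog") receives a higher weight than one shared across documents ("library"). A composite strategy of the kind the abstract proposes would presumably combine such a weight with further signals, for example topic-model probabilities, which this sketch does not attempt.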
Received: 2016-01-22
Funding: *This work is one of the research outcomes of the National Social Science Fund of China project "Research on Automatic Classification of Multiple Types of Digital Text Resources" (No. 15BTQ066).
Cite this article:
Li Xiangdong, Ba Zhichao, Gao Fan. Review of Digital Documents Automatic Classification Research[J]. New Technology of Library and Information Service, 2016, 32(9): 17-26. DOI: 10.11925/infotech.1003-3513.2016.09.02.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2016.09.02