Please wait a minute...
Advanced Search
现代图书情报技术  2014, Vol. 30 Issue (7): 101-106    DOI: 10.11925/infotech.1003-3513.2014.07.14
  情报分析与研究 本期目录 | 过刊浏览 | 高级检索 |
中文短信文本信息流中多话题的分类抽取
张永军, 刘金岭, 马甲林
淮阴工学院中文信息处理研究室, 淮安223003
Classification of Multi Topic Extraction Based on Chinese Short Information Text Message Flow
Zhang Yongjun, Liu Jinling, Ma Jialin
Chinese Information Processing Laboratory, Huaiyin Institute of Technology, Huai'an 223003, China
全文: PDF(454 KB)   HTML  
输出: BibTeX | EndNote (RIS)      
摘要 

[目的]为更有效地在中文短信文本信息流((SMS Text Message Flow, SM F)中进行多话题的分类提取,提出一种基于SM_ F特点的话题分类抽取方法SM_F_ HTo[方法]将SM_F分割成多个短信文本子集SM_Fi,通过层次的狄利克雷过程信息抽取与TF-IDF相结合,建立短信文本向量集上多个概率分布,采用吉布斯抽样并结合特征同属于临时话题的概率进行SM_F话题分类抽取。[结果]实验结果表明,SM_ F_ HT在困惑度和对数似然比方面优越于模型CCLDA和CCMixo[局限]在短信文本预处理和特征同的抽取方面,还需进一步优化算法和提高数据质量。[结论]提出的SM_F_HT方法对SM_F的多话题分类抽取是有效的。

服务
把本文推荐给朋友
加入引用管理器
E-mail Alert
RSS
作者相关文章
刘金岭
张永军
马甲林
Abstract

[Objective] A topic classification extraction model named SM_ F_ HT is proposed to find multiple topics more effectively in Chinese SMS text message Flow (SM少).[Methods] In this model, SM_ F is divided into SMS text subsets TF-IDF combined with the hierarchical Dirichlet processes of information extraction are used to build multiple probability distributions of the SMS text vector set. Finally topic classification on SM_ F is extracted using Gibbs sampling in conjunction with the probability of the characteristic words which belong to local topic.[Results]Experimental results show that SM_F_HT is superior to CCLDA and CCMix models in perplexity and log like lihood ratio.[Limitations] In fields of SMS text pre processing and keyword extraction, this algorithm still needs further optimization.[Conclusions] The SM_ F_ HT scheme is effective for multiple topics classification extraction of SM_F.

Key wordsShort message text    Message flow    Topic extract    Dirichlet    Gibbs sample
收稿日期: 2014-01-05     
:  TP391.1  
基金资助:

国家级星火计划项目“农村民生建设信息反馈平台建设”(项目编号:2011GA690190)的研究成果之一

通讯作者: 刘金岭E-mail:liujinlingg@126.com     E-mail: liujinlingg@126.com
作者简介: 作者贡献声明:张永军:设计研究方案;论文起草;编程井进行实验刘金岭:提出研究思路,论文最终版本修订;马甲林:采集、清洗和分析数据。
引用本文:   
张永军, 刘金岭, 马甲林. 中文短信文本信息流中多话题的分类抽取[J]. 现代图书情报技术, 2014, 30(7): 101-106.
Zhang Yongjun, Liu Jinling, Ma Jialin. Classification of Multi Topic Extraction Based on Chinese Short Information Text Message Flow. New Technology of Library and Information Service, DOI:10.11925/infotech.1003-3513.2014.07.14.
链接本文:  
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.07.14

[1] 刘金岭, 倪晓红, 王新功. 手机短信文本信息流的自动文摘生成[J]. 现代图书情报技术, 2013(2): 43-49. (Liu Jinling,Ni Xiaohong, Wang Xingong. Automatic Abstracting Generating Based on Mobile Short Message Text Information Flow[J]. New Technology of Library and Information Service, 2013(2): 43-49.)
[2] Allan J, Papka R, Lavrenko V. On-line New Event Detection and Tracking[C]. In: Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR’98). New York: ACM, 1998: 37-45.
[3] 蔡淑琴, 张静, 王旸, 等. 基于中心化的微博热点发现方法[J]. 管理学报, 2012, 9(6): 874-879. (Cai Shuqin, Zhang Jing, Wang Yang, et al. Micro-blogging Hotspot Discovery Method Based on Centralization[J]. Chinese Journal of Management, 2012, 9(6): 874-879.)
[4] 马斌, 洪宇, 陆剑江, 等. 基于线索树双层聚类的微博话题检测[J]. 中文信息学报, 2012, 26 (6): 121-128. (Ma Bin, Hong Yu, Lu Jianjiang, et al. A Thread-based Two-stage Clustering Method of Microblog Topic Detection[J]. Journal of Chinese Information Processing, 2012, 26(6): 121-128.)
[5] Hofmann T. Probabilistic Latent Semantic Indexing[C]. In: Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval(SIGIR’99). New York: ACM, 1999: 50-57.
[6] Zhai C, Velivelli A, Yu B.A Cross-collection Mixture Model for Comparative Text Mining[C]. In: Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’04), Seattle, Washington, USA. New York: ACM, 2004: 743-748.
[7] Yin Z, Cao L, Han J, et al. Geographical Topic Discovery and Comparison[C]. In: Proceedings of the 20th International Conference on World Wide Web (WWW’11). New York: ACM, 2011: 247-256.
[8] 赵华, 赵铁军, 张姝, 等. 基于内容分析的话题检测研究[J]. 哈尔滨工业大学学报, 2006, 38(10): 1740-1743. (Zhao Hua, Zhao Tiejun, Zhang Shu, et al. Topic Detection Research Based on Content Analysis[J]. Journal of Harbin Institute of Technology, 2006, 38(10): 1740-1743.)
[9] Paul M J, Girju R. Cross-cultural Analysis of Blogs and Forums with Mixed-collection Topic Model[C]. In: Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing (EMNLP’09). Stroudsburg: Association for Computational Linguistics, 2009: 1408-1417.
[10] Paul M J, Girju R. Comparative Scientific Research Analysis with a Language-Independent Cross-Collection Model[J]. Procesamiento del Lenguaje Natural, 2010, 45: 153-160.
[11] 姚全珠, 宋志理, 彭程. 基于LDA模型的文本分类研究[J].计算机工程与应用, 2011, 47(13): 150-153. (Yao Quanzhu, Song Zhili, Peng Cheng. Research on Text Categorization Based on LDA[J]. Computer Engineering and Applications, 2011, 47(13): 150-153.)
[12] 何建民, 张义. 基于类熵距离测量的热点话题识别方法研究[J]. 情报科学, 2012, 30(8): 1147-1150, 1166. (He Jianmin, Zhang Yi. Research on Identifying Method for the Hot Topics Based on Class Entropy Distance Measurement[J]. Information Science, 2012, 30(8): 1147-1150, 1166.)
[13] 刘金岭. 基于降维的短信文本语义分类及主题提取[J]. 计算机工程与应用, 2010, 46(23): 159-161, 174. (Liu Jinling. Dimensionality Reduction of Short Message Text Classification and Thematic Extraction of Semantic[J]. Computer Engineering and Applications, 2010, 46(23): 159-161, 174.)
[14] 尤芳. Gibbs抽样在正态混合模型中的参数估计[J]. 统计与决策, 2009(15): 150-151. (You Fang. Gibbs Sampling in The Gaussian Mixture Model Parameter Estimation[J]. Statistics and Decision, 2009(15): 150-151.)
[15] 赵爱华, 刘培玉, 郑燕. 基于LDA的新闻话题子话题划分方法[J]. 小型微型计算机系统, 2013, 34(4): 732-737. (Zhao Aihua, Liu Peiyu, Zheng Yan. Subtopic Division in News Topic Based on Latent Dirichlet Allocation[J]. Journal of Chinese Computer Systems, 2013, 34(4): 732-737.)
[16] 黄永文, 何中市. 基于互信息的统计语言模型平滑技术[J]. 中文信息学报, 2005, 19(4): 46-51. (Huang Yongwen, He Zhongshi. A Smoothing Technique for Statistical Language Model Based on Mutual Information[J]. Journal of Chinese Information Processing, 2005, 19(4): 46-51.)
[17] 李国臣. 文本分类中基于对数似然比测试的特征词选择方法[J]. 中文信息学报, 1999, 13(4): 16-21. (Li Guochen. A Log-Likelihood-Ratio-Test-Based Feature Word Selection Approach in Text Categorization[J]. Journal of Chinese Information Processing, 1999, 13(4): 16-21.)

[1] 李晓峰,马静,李驰,朱恒民. 基于XGBoost模型的电商商品品名识别算法研究 *[J]. 数据分析与知识发现, 2019, 3(7): 34-41.
[2] 许德山, 李辉, 张运良. 文献关键词链接标引方法研究[J]. 现代图书情报技术, 2015, 31(9): 31-37.
[3] 陈诗琴, 李文江. WebSocket在图书馆移动信息服务中的应用[J]. 现代图书情报技术, 2015, 31(9): 90-96.
[4] 胡菊香, 吕学强, 刘克会. 利用类别引导词的投诉文本分类[J]. 现代图书情报技术, 2015, 31(7-8): 97-103.
[5] 段宇锋, 朱雯晶, 陈巧, 刘伟, 刘凤红. 条件随机场与领域本体元素集相结合的未登录词识别研究[J]. 现代图书情报技术, 2015, 31(4): 41-49.
[6] 李军锋, 吕学强, 周绍钧. 带权复杂图模型的专利关键词标引研究[J]. 现代图书情报技术, 2015, 31(3): 26-32.
[7] 马宾, 殷立峰. 一种基于Hadoop平台的并行朴素贝叶斯网络舆情快速分类算法[J]. 现代图书情报技术, 2015, 31(2): 78-84.
[8] 侯婷, 吕学强, 李卓. 专利术语抽取的层次过滤方法[J]. 现代图书情报技术, 2015, 31(1): 24-30.
[9] 唐守利, 徐宝祥. 基于本体的云服务语义检索系统研究[J]. 现代图书情报技术, 2014, 30(12): 27-35.
[10] 唐晓波, 肖璐. 基于依存句法网络的文本特征提取研究[J]. 现代图书情报技术, 2014, 30(11): 31-37.
[11] 石翠, 王杨, 杨彬, 姚晔. 面向中文专利文献的单层并列结构识别[J]. 现代图书情报技术, 2014, 30(10): 76-83.
[12] 李文江, 陈诗琴. 微信作为APP客户端的图书馆公共服务平台[J]. 现代图书情报技术, 2014, 30(7): 133-138.
[13] 汤青,吕学强,李卓,施水才,. 领域本体术语抽取研究*[J]. 现代图书情报技术, 2014, 30(1): 43-50.
[14] 李文江, 陈诗琴. 基于Android GCM服务的图书馆信息推送系统设计[J]. 现代图书情报技术, 2013, 29(11): 91-96.
[15] 熊李艳, 谭龙, 钟茂生. 基于有效词频的改进C-value自动术语抽取方法[J]. 现代图书情报技术, 2013, 29(9): 54-59.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
版权所有 © 2015 《数据分析与知识发现》编辑部
地址:北京市海淀区中关村北四环西路33号 邮编:100190
电话/传真:(010)82626611-6626,82624938
E-mail:jishu@mail.las.ac.cn