[目的]为更有效地在中文短信文本信息流(SMS Text Message Flow, SM_F)中进行多话题的分类提取,提出一种基于SM_F特点的话题分类抽取方法SM_F_HT。[方法]将SM_F分割成多个短信文本子集SM_Fi,通过层次的狄利克雷过程信息抽取与TF-IDF相结合,建立短信文本向量集上多个概率分布,采用吉布斯抽样并结合特征词属于临时话题的概率进行SM_F话题分类抽取。[结果]实验结果表明,SM_F_HT在困惑度和对数似然比方面优越于模型CCLDA和CCMix。[局限]在短信文本预处理和特征词的抽取方面,还需进一步优化算法和提高数据质量。[结论]提出的SM_F_HT方法对SM_F的多话题分类抽取是有效的。
[Objective] This paper proposes SM_F_HT, a topic classification and extraction method designed to find multiple topics more effectively in a Chinese SMS text message flow (SM_F). [Methods] SM_F is divided into multiple SMS text subsets. Hierarchical Dirichlet process information extraction is combined with TF-IDF to build multiple probability distributions over the SMS text vector set. Topics in SM_F are then classified and extracted by Gibbs sampling, combined with the probability that a feature word belongs to a temporary topic. [Results] Experiments show that SM_F_HT outperforms the CCLDA and CCMix models in perplexity and log-likelihood ratio. [Limitations] SMS text preprocessing and feature word extraction still need further algorithmic optimization and better data quality. [Conclusions] The SM_F_HT method is effective for multi-topic classification and extraction in SM_F.
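The first two steps of the method (splitting the flow into subsets and weighting terms with TF-IDF) can be illustrated with a minimal sketch. It is not the authors' implementation: the fixed-size splitting rule, the plain `tf * log(N/df)` weighting, the pre-tokenized toy messages, and the helper names `split_flow` and `tfidf_vectors` are all assumptions made for illustration, and the hierarchical Dirichlet process and Gibbs sampling stages are not shown.

```python
import math
from collections import Counter


def split_flow(messages, subset_size):
    """Split a message flow into consecutive subsets of at most subset_size messages.

    Fixed-size splitting is an assumption; the abstract does not state the rule.
    """
    return [messages[i:i + subset_size] for i in range(0, len(messages), subset_size)]


def tfidf_vectors(subset):
    """Return one {term: weight} dict per tokenized message in the subset.

    Weight = (term count / message length) * log(N / document frequency),
    with N and document frequency computed inside the subset only.
    """
    n = len(subset)
    # Document frequency: number of messages in the subset containing each term.
    df = Counter(term for msg in subset for term in set(msg))
    vectors = []
    for msg in subset:
        tf = Counter(msg)
        vectors.append({t: (c / len(msg)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors


# Hypothetical, already-tokenized messages standing in for segmented Chinese SMS text.
flow = [
    ["meeting", "tomorrow", "office"],
    ["meeting", "cancelled"],
    ["dinner", "tonight"],
    ["dinner", "tomorrow", "restaurant"],
]
subsets = split_flow(flow, 2)
vectors = [tfidf_vectors(s) for s in subsets]
```

Terms that occur in every message of a subset receive weight zero under this weighting, so only terms that distinguish messages within the subset contribute to the vectors passed on to a topic model.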
张永军, 刘金岭, 马甲林. 中文短信文本信息流中多话题的分类抽取[J]. 现代图书情报技术, 2014, 30(7): 101-106.
Zhang Yongjun, Liu Jinling, Ma Jialin. Classification of Multi Topic Extraction Based on Chinese Short Information Text Message Flow[J]. New Technology of Library and Information Service, 2014, 30(7): 101-106.