New Technology of Library and Information Service  2014, Vol. 30 Issue (7): 101-106    DOI: 10.11925/infotech.1003-3513.2014.07.14
Classification of Multi Topic Extraction Based on Chinese Short Information Text Message Flow
Zhang Yongjun, Liu Jinling, Ma Jialin
Chinese Information Processing Laboratory, Huaiyin Institute of Technology, Huai'an 223003, China
Abstract  

[Objective] This paper proposes a topic classification and extraction model, SM_F_HT, to find multiple topics more effectively in Chinese SMS text message flow (SM_F). [Methods] The model divides SM_F into SMS text subsets. TF-IDF, combined with hierarchical Dirichlet processes for information extraction, is used to build multiple probability distributions over the SMS text vector set. Topic classifications of SM_F are then extracted using Gibbs sampling together with the probability that a characteristic word belongs to a local topic. [Results] Experimental results show that SM_F_HT outperforms the CCLDA and CCMix models in perplexity and log-likelihood ratio. [Limitations] The algorithm still needs further optimization in SMS text preprocessing and keyword extraction. [Conclusions] The SM_F_HT scheme is effective for multi-topic classification and extraction on SM_F.
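
Two quantities named in the abstract, TF-IDF weighting and perplexity, can be illustrated with a minimal sketch. This is not the SM_F_HT implementation: the hierarchical Dirichlet process and the Gibbs sampler are omitted, the function names `tf_idf` and `perplexity` are hypothetical, the sample messages are invented, and the perplexity shown uses a single word distribution rather than per-message topic mixtures. Chinese messages are assumed to have been segmented into tokens beforehand.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized message.

    Uses relative term frequency times log(N / document frequency);
    this is one common TF-IDF variant, assumed here for illustration.
    """
    n_docs = len(docs)
    # Document frequency: number of messages that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def perplexity(docs, word_prob):
    """Compute exp(-(sum of log p(w)) / token count); lower is better.

    word_prob is any callable mapping a term to a probability > 0.
    """
    log_lik = 0.0
    n_tokens = 0
    for doc in docs:
        for term in doc:
            log_lik += math.log(word_prob(term))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


# Hypothetical, already-segmented messages (illustrative data only).
messages = [
    ["meeting", "tomorrow", "office"],
    ["meeting", "cancelled"],
    ["discount", "sale", "tomorrow"],
]

weights = tf_idf(messages)
# A uniform distribution over the six distinct terms as a baseline model.
uniform = perplexity(messages, lambda term: 1.0 / 6)
```

Under the uniform baseline the perplexity equals the vocabulary size, which gives a reference point: a model that assigns higher probability to the observed words scores lower.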

Key words: Short message text; Message flow; Topic extraction; Dirichlet; Gibbs sampling
Received: 05 January 2014      Published: 20 October 2014
CLC Number: TP391.1

Cite this article:

Zhang Yongjun, Liu Jinling, Ma Jialin. Classification of Multi Topic Extraction Based on Chinese Short Information Text Message Flow. New Technology of Library and Information Service, 2014, 30(7): 101-106.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.1003-3513.2014.07.14     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2014/V30/I7/101


  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn