Classification of Multi Topic Extraction Based on Chinese Short Information Text Message Flow

doi:10.11925/infotech.1003-3513.2014.07.14

New Technology of Library and Information Service, 2014, Vol. 30, Issue (7): 101-106

Section: Information Analysis and Research

Zhang Yongjun, Liu Jinling, Ma Jialin

Chinese Information Processing Laboratory, Huaiyin Institute of Technology, Huai'an 223003, China

Abstract

[Objective] A topic classification extraction method named SM_F_HT is proposed to extract multiple topics more effectively from Chinese SMS text message flow (SM_F). [Methods] SM_F is divided into multiple SMS text subsets SM_F_i. TF-IDF is combined with hierarchical Dirichlet process information extraction to build multiple probability distributions over the SMS text vector set. Topics in SM_F are then classified and extracted using Gibbs sampling together with the probability that each feature word belongs to a local topic. [Results] Experimental results show that SM_F_HT outperforms the CCLDA and CCMix models in perplexity and log-likelihood ratio. [Limitations] SMS text preprocessing and feature word extraction still need further algorithmic optimization, and data quality needs to be improved. [Conclusions] The SM_F_HT method is effective for multi-topic classification extraction from SM_F.
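As a rough illustration of two ingredients named in the abstract, the sketch below splits a tokenized message flow into subsets, computes TF-IDF weights inside each subset, and evaluates perplexity from per-token log-probabilities. It is not the authors' SM_F_HT implementation: the fixed-size split, the plain tf × log(N/df) weighting, and the names split_flow, tfidf_vectors and perplexity are all assumptions made for this example, and the hierarchical Dirichlet process and Gibbs sampling steps are not shown.

```python
import math
from collections import Counter


def split_flow(messages, subset_size):
    """Split the flow into consecutive subsets (fixed size is an assumption)."""
    return [messages[i:i + subset_size]
            for i in range(0, len(messages), subset_size)]


def tfidf_vectors(subset):
    """Return one {term: weight} dict per tokenized message in the subset."""
    n_docs = len(subset)
    # Document frequency: number of messages in the subset containing the term.
    doc_freq = Counter(term for msg in subset for term in set(msg))
    vectors = []
    for msg in subset:
        counts = Counter(msg)
        total = len(msg)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def perplexity(token_log_probs):
    """Perplexity = exp of the negative mean per-token log-likelihood."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Toy, already-tokenized messages (hypothetical data).
flow = [
    ["rain", "flood", "road"],
    ["flood", "bridge"],
    ["price", "rice"],
    ["rice", "market", "price"],
]
subsets = split_flow(flow, 2)
vectors = [tfidf_vectors(s) for s in subsets]
```

A term that occurs in every message of a subset gets weight zero under this weighting, so only terms that distinguish messages within the subset contribute to the vectors. Lower perplexity on held-out text indicates a better fit, which is the sense in which the abstract compares SM_F_HT with CCLDA and CCMix.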

Key words: Short message text; Message flow; Topic extraction; Dirichlet; Gibbs sampling

Received: 2014-01-05    Published: 2014-10-20

CLC number: TP391.1

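Funding:

This work is one of the research outcomes of the National Spark Program project "Construction of an Information Feedback Platform for Rural Livelihood Development" (No. 2011GA690190).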

Corresponding author: Liu Jinling, E-mail: liujinlingg@126.com

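Author contributions: Zhang Yongjun designed the research plan, drafted the paper, and programmed and ran the experiments; Liu Jinling proposed the research ideas and revised the final version of the paper; Ma Jialin collected, cleaned, and analyzed the data.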

Cite this article:

Zhang Yongjun, Liu Jinling, Ma Jialin. Classification of Multi Topic Extraction Based on Chinese Short Information Text Message Flow[J]. New Technology of Library and Information Service, 2014, 30(7): 101-106.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2014.07.14 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2014/V30/I7/101
