Research on the selection of LDA optimal topic number in news topic identification

doi:10.11925/infotech.2096-3467.2022-0115

Data Analysis and Knowledge Discovery

Yang Yang, Jiang Kaizhong, Yuan Mingjun, Hui Lanxin

(School of Mathematics and Statistics, Shanghai University of Engineering Science, Shanghai 201620, China)

Abstract:

[Objective] The traditional LDA model requires the number of topics to be specified in advance. To address this, the paper proposes an adaptive method for news topic identification that determines the number of topics by fusing semantic and temporal information.

[Methods] The paper treats semantics and time as two views of the news data and extracts a feature vector for each view. The Co-DPSC algorithm then co-trains the two views to obtain a semantic feature matrix that incorporates temporal influence. Finally, the matrix is reduced in dimensionality and its rows are clustered by density peak clustering; the resulting number of clusters is taken as the optimal number of topics.
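As a rough illustration of the last step described in [Methods] (reduce the fused feature matrix, cluster its rows by density peaks, and read off the number of clusters), the following Python sketch runs on synthetic data. It is not the paper's implementation: the abstract gives no details of Co-DPSC, so the co-training step is replaced here by a naive fusion that appends a weighted time column to the semantic vectors, and the dimensionality reduction is plain PCA. The function names, `time_weight`, `dc_quantile`, the largest-gap rule for picking centres, and the synthetic data are all assumptions made for this example.

```python
import numpy as np


def pca_reduce(X, k):
    """Project the rows of X onto their first k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T


def density_peak_count(X, dc_quantile=0.05):
    """Estimate the number of clusters among the rows of X (needs >= 3 distinct rows)."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    n = len(X)
    # Cutoff distance d_c: a low quantile of all pairwise distances (assumed rule).
    dc = np.quantile(dist[np.triu_indices(n, k=1)], dc_quantile)
    # Local density rho_i with a Gaussian kernel; subtract the self term exp(0) = 1.
    rho = np.exp(-((dist / dc) ** 2)).sum(axis=1) - 1.0
    # delta_i: distance to the nearest point of strictly higher density.
    delta = np.full(n, np.nan)
    for i in range(n):
        higher = rho > rho[i]
        if higher.any():
            delta[i] = dist[i, higher].min()
    # The densest point has no denser neighbour; give it the largest delta seen.
    delta[np.isnan(delta)] = np.nanmax(delta)
    # Assumed heuristic: cluster centres are the points before the largest gap
    # in the descending list of delta values.
    ordered = np.sort(delta)[::-1]
    return int(np.argmax(ordered[:-1] - ordered[1:])) + 1


def estimate_topic_count(semantic, days, time_weight=0.2, k=2):
    """Naively fuse the two views, reduce dimensionality, then count density peaks."""
    time_col = time_weight * np.asarray(days, dtype=float)[:, None]
    fused = np.hstack([semantic, time_col])  # stand-in for Co-DPSC co-training
    return density_peak_count(pca_reduce(fused, k))


# Synthetic demo: 3 hypothetical topics, 40 articles each.
rng = np.random.default_rng(0)
centres = 10.0 * np.eye(3, 6)  # one well-separated semantic centre per topic
semantic = np.vstack([c + 0.3 * rng.standard_normal((40, 6)) for c in centres])
# Each topic is reported in its own time window (day 0, 30, 60).
days = np.concatenate([d + rng.standard_normal(40) for d in (0.0, 30.0, 60.0)])

n_topics = estimate_topic_count(semantic, days)
print(n_topics)
```

On this well-separated toy data the estimate should recover the three planted topics. On real news data the cutoff quantile, the time weight, and the centre-selection rule would all need tuning, and the estimated count would then be passed to LDA as its number of topics.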

[Results] Experiments show that determining the optimal number of topics from both semantic and temporal factors improves precision and the F value: precision increases by 35.09% and the F value by 15.39%.
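For readers unfamiliar with the two reported metrics, the sketch below computes precision, recall, and the F value from raw counts using the standard definitions. Whether the paper uses exactly this F-measure (with equal weight on precision and recall) is an assumption, since the abstract does not define it; the function name and the counts in the usage line are made up for illustration and are not the paper's data.

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta ** 2
    f_value = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_value


# Hypothetical counts: 8 correct assignments, 2 wrong ones, 8 missed ones.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

Note that the abstract does not say whether the reported gains are relative improvements or differences in percentage points.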

[Limitations] Because clustering is performed on a keyword set, the keyword extraction method affects both the clustering quality and the running time to some extent. The method also needs both text and time information, as news data provides, which limits its applicability to other types of data.

[Conclusions] Experiments show that considering the timeliness and the content of news data together when judging news categories improves, to some extent, the accuracy of selecting the optimal number of topics.

Keywords: LDA model, news topics, multi-view clustering

Publication date: 2022-07-11

CLC number (ZTFLH): TP393, G250

Cite this article:

Yang Yang, Jiang Kaizhong, Yuan Mingjun, Hui Lanxin. Research on the selection of LDA optimal topic number in news topic identification [J]. Data Analysis and Knowledge Discovery, doi:10.11925/infotech.2096-3467.2022-0115.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2022-0115 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y0/V/I/1

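[1]	Yang Yang, Jiang Kaizhong, Yuan Mingjun, Hui Lanxin. Research on the selection of LDA optimal topic number in news topic identification [J]. Data Analysis and Knowledge Discovery, 2022, 6(11): 72-78.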
