1 School of Information Science & Engineering, Yunnan University, Kunming 650500, China
2 School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
3 Sichuan Key Laboratory of Software Automatic Generation and Intelligent Service, Chengdu University of Information Technology, Chengdu 610225, China
[Objective] This paper adapts a topic model to improve the quality of extracted news clues. [Methods] We constructed News-IBTM, a model based on the Incremental Biterm Topic Model (IBTM) with a dynamic sliding window that narrows the scope from which biterms (unordered word pairs) are extracted. We then used this model to extract topics and topic-word distributions from news and to infer the document-topic distributions. Finally, we used the Jensen-Shannon (JS) divergence to measure the difference between document-topic distributions and generate news clues. [Results] We evaluated News-IBTM on news from People's Daily Online and Weibo. The proposed model outperformed existing models in perplexity, accuracy, and efficiency. [Limitations] The accuracy of the News-IBTM algorithm needs further improvement. [Conclusions] The proposed method can effectively extract high-quality news topics and clues.
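The two building blocks named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed `window` parameter (the paper uses a dynamic window whose rule is not given here), and the choice of base-2 logarithms are all assumptions made for illustration. It shows how a sliding window limits which word pairs become biterms, and how the JS divergence compares two document-topic distributions.

```python
import math


def extract_biterms(tokens, window=3):
    """Return unordered word pairs whose positions are less than `window` apart.

    `window` is a fixed, hypothetical parameter; the paper's dynamic rule
    for choosing it is not reproduced here.
    """
    biterms = []
    for i, left in enumerate(tokens):
        # Pair each word only with the next (window - 1) words.
        for right in tokens[i + 1 : i + window]:
            if left != right:
                biterms.append(tuple(sorted((left, right))))
    return biterms


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; symmetric and bounded in [0, 1]."""
    # Mixture distribution; m_i > 0 wherever p_i > 0 or q_i > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


if __name__ == "__main__":
    # Toy token list and toy document-topic distributions (made-up values).
    print(extract_biterms(["flood", "relief", "fund", "city"], window=2))
    doc_a = [0.7, 0.2, 0.1]
    doc_b = [0.1, 0.2, 0.7]
    print(js_divergence(doc_a, doc_b))
```

A smaller window yields fewer biterms per document, which is the sense in which the extraction scope is reduced. A JS divergence near 0 indicates two documents with nearly identical topic mixtures, while a value near 1 indicates almost disjoint mixtures; how these values are thresholded or grouped into news clues is not specified in the abstract.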