1 School of Economics and Management, Nanjing University of Science and Technology, Nanjing 210094, China
2 Alibaba Research Center for Complex Sciences, Hangzhou Normal University, Hangzhou 311121, China
3 Jiangsu Key Laboratory of Data Engineering and Knowledge Service (Nanjing University), Nanjing 210093, China
[Objective] This study identifies topics in microblog posts and automatically extracts the high-quality ones using text clustering. [Methods] We collected food-related microblog posts from Sina Weibo as raw data and applied text clustering and deep learning to detect the target topics. First, we grouped the posts into the four seasons according to their publishing dates. Second, we built a vector space model and used text clustering to retrieve candidate topics. Finally, we automatically identified the high-quality topics using deep learning. [Results] The method automatically identified the high-quality topics that researchers had found manually, and the topic coverage values were all higher than 0.5. [Limitations] Topic quality was judged on the basis of qualitative data. [Conclusions] The proposed method extracts high-quality topics effectively, and the retrieved topics reflect how food-related microblog posts are distributed across the four seasons.
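The first two steps of the method (grouping posts by season, then clustering them in a vector space model) can be illustrated with a minimal sketch. It is not the authors' implementation, and several details are assumptions that the abstract does not specify: the month boundaries in `season_of`, the smoothed TF-IDF weighting, the cosine-based k-means, and the farthest-first seeding. The sample posts are invented, posts are assumed to be already tokenized and non-empty, and the deep learning step that selects high-quality topics is not sketched.

```python
import math
from collections import Counter, defaultdict
from datetime import date


def season_of(day):
    """Map a publishing date to a season (assumed month boundaries)."""
    if day.month in (3, 4, 5):
        return "spring"
    if day.month in (6, 7, 8):
        return "summer"
    if day.month in (9, 10, 11):
        return "autumn"
    return "winter"


def normalize(vec):
    """Scale a sparse vector (term -> weight) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else dict(vec)


def tfidf_vectors(docs):
    """Unit-length TF-IDF vectors for tokenized, non-empty documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        # Smoothed IDF (an assumption) keeps every weight positive.
        vec = {t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1.0)
               for t, c in tf.items()}
        vectors.append(normalize(vec))
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, max_iter=20):
    """Cosine k-means with deterministic farthest-first seeding."""
    centroids = [vectors[0]]
    while len(centroids) < k:
        # Next seed: the vector least similar to all seeds chosen so far.
        centroids.append(
            min(vectors, key=lambda v: max(cosine(v, c) for c in centroids)))
    labels = None
    for _ in range(max_iter):
        new_labels = [max(range(k), key=lambda j: cosine(v, centroids[j]))
                      for v in vectors]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            total = defaultdict(float)
            for v, lab in zip(vectors, labels):
                if lab == j:
                    for t, w in v.items():
                        total[t] += w
            if total:  # keep the old centroid if a cluster is empty
                centroids[j] = normalize(total)
    return labels


# Invented sample posts: (publishing date, tokens).
posts = [
    (date(2015, 1, 10), ["hotpot", "lamb", "spicy"]),
    (date(2015, 1, 12), ["hotpot", "spicy", "beef"]),
    (date(2015, 12, 24), ["dumpling", "vinegar", "pork"]),
    (date(2015, 2, 3), ["dumpling", "pork", "cabbage"]),
    (date(2015, 7, 3), ["watermelon", "ice", "juice"]),
    (date(2015, 8, 15), ["crayfish", "beer", "spicy"]),
]

by_season = defaultdict(list)
for day, tokens in posts:
    by_season[season_of(day)].append(tokens)

winter_labels = kmeans(tfidf_vectors(by_season["winter"]), k=2)
print(winter_labels)  # [0, 0, 1, 1]
```

In this toy data the four winter posts split into a hotpot cluster and a dumpling cluster, each of which would serve as a candidate topic for the later quality-identification step.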
Zhang Xiaoyong, Zhou Qingqing, Zhang Chengzhi. Identifying Food Topics from User-Generated Contents in Microblogs[J]. New Technology of Library and Information Service, 2016, 32(10): 70-80.