News Hotspots Discovery Method Based on Multi Factor Feature Selection and AFOA/K-means
Tingxin Wen1, Yangzi Li1, Jingshuang Sun2
1Institute of Systems Engineering, Liaoning Technical University, Huludao 125105, China
2College of Business Administration, Liaoning Technical University, Huludao 125105, China
[Objective] This paper aims to improve the efficiency and accuracy of hot topic discovery by studying feature reduction methods and clustering algorithms for news text. [Methods] Starting from the traditional TF-IDF formula, four weighting factors (symbol, part of speech, position, and length) are introduced to realize multi-factor feature selection. The Ameliorated Fruit Fly Optimization Algorithm (AFOA) is constructed from four aspects: coding, fitness function, adaptive step length, and population fitness variance. AFOA is used to optimize the initial cluster centers of K-means, and the optimized K-means is used to find hot topics. Multi-factor feature selection is used to identify hot topics, and the hot topics are ranked using TOPSIS. [Results] Experiments show that multi-factor feature selection and the AFOA/K-means algorithm each significantly improve the clustering effect, and they verify the overall effectiveness of the proposed method. [Limitations] The method is only applicable to Chinese news texts. [Conclusions] The proposed method offers a new approach for research on Chinese news hotspot discovery.
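The final ranking step relies on TOPSIS, which scores each alternative by its relative closeness to an ideal solution. The following is a minimal sketch of standard TOPSIS, not the authors' implementation: the abstract does not state which hotness indicators or weights are used, so the function name `topsis`, the two criteria (report count, mean feature weight), the weights, and the sample values are all hypothetical.

```python
import math


def topsis(matrix, weights, benefit):
    """Return TOPSIS closeness scores in [0, 1], one per row of `matrix`.

    matrix  : list of rows, one row per topic, one column per criterion
    weights : criterion weights (hypothetical values, assumed to sum to 1)
    benefit : True where larger is better, False where smaller is better
    """
    n_criteria = len(weights)
    # Vector-normalise each column; fall back to 1.0 for an all-zero column.
    norms = [
        math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
        for j in range(n_criteria)
    ]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_criteria)]
        for row in matrix
    ]
    # Ideal and anti-ideal points, taken column by column.
    columns = list(zip(*weighted))
    ideal = [max(c) if benefit[j] else min(c) for j, c in enumerate(columns)]
    anti = [min(c) if benefit[j] else max(c) for j, c in enumerate(columns)]
    scores = []
    for row in weighted:
        d_pos = math.dist(row, ideal)  # distance to the ideal point
        d_neg = math.dist(row, anti)   # distance to the anti-ideal point
        total = d_pos + d_neg
        scores.append(d_neg / total if total else 0.0)
    return scores


# Hypothetical topics described by (report count, mean feature weight).
topic_names = ["A", "B", "C"]
topic_matrix = [[120, 0.8], [40, 0.5], [90, 0.9]]
topic_scores = topsis(topic_matrix, weights=[0.5, 0.5], benefit=[True, True])
ranking = sorted(zip(topic_names, topic_scores), key=lambda p: p[1], reverse=True)
```

Sorting by the closeness score gives the topic ranking. In this made-up data, topic B is worst on both criteria, so it coincides with the anti-ideal point and ranks last.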
Tingxin Wen, Yangzi Li, Jingshuang Sun. News Hotspots Discovery Method Based on Multi Factor Feature Selection and AFOA/K-means[J]. Data Analysis and Knowledge Discovery (数据分析与知识发现), 2019, 3(4): 97-106.