Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (4): 97-106    DOI: 10.11925/infotech.2096-3467.2018.0757
News Hotspots Discovery Method Based on Multi Factor Feature Selection and AFOA/K-means
Tingxin Wen1, Yangzi Li1, Jingshuang Sun2
1Institute of Systems Engineering, Liaoning Technical University, Huludao 125105, China
2College of Business Administration, Liaoning Technical University, Huludao 125105, China
Abstract  

[Objective] This paper aims to improve the efficiency and accuracy of hot topic discovery by studying feature reduction methods and clustering algorithms for news text. [Methods] Building on the traditional TF-IDF formula, four weighting factors (symbol, part of speech, position, and length) are introduced to realize multi-factor feature selection. An Ameliorated Fruit Fly Optimization Algorithm (AFOA) is constructed from four aspects: coding, fitness function, adaptive step length, and population fitness variance. AFOA is used to optimize the initial cluster centers of K-means, and the optimized K-means is used to find hot topics. Multi-factor feature selection is used to identify hot topics, and the hot topics are ranked using TOPSIS. [Results] Experiments show that multi-factor feature selection and the AFOA/K-means algorithm each significantly improve the clustering effect, and they verify the overall effectiveness of the proposed method. [Limitations] The method is applicable only to Chinese news texts. [Conclusions] The proposed method offers a new approach to research on Chinese news hotspot discovery.
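
The abstract names TOPSIS as the ranking step but does not give the criteria or weights used. As a rough illustration only, the following is a minimal sketch of the generic TOPSIS procedure (vector normalization, weighting, distance to the ideal and anti-ideal solutions, relative closeness). The function name `topsis_rank`, the example criteria, and the equal weights are assumptions made for this sketch and are not taken from the paper.

```python
import math


def _euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def topsis_rank(matrix, weights, benefit):
    """Generic TOPSIS; returns (closeness scores, indices sorted best first).

    matrix  : one row per alternative (here: a candidate topic),
              one column per criterion (hypothetical criteria).
    weights : one weight per criterion, assumed to sum to 1.
    benefit : one bool per criterion, True if larger values are better.
    """
    n_crit = len(weights)

    # Step 1: vector-normalize each column, then apply the criterion weight.
    norms = [
        math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0
        for j in range(n_crit)
    ]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_crit)]
        for row in matrix
    ]

    # Step 2: ideal and anti-ideal solutions, per criterion.
    columns = list(zip(*weighted))
    ideal = [max(c) if b else min(c) for c, b in zip(columns, benefit)]
    anti = [min(c) if b else max(c) for c, b in zip(columns, benefit)]

    # Step 3: relative closeness to the ideal solution, in [0, 1].
    scores = []
    for row in weighted:
        d_pos = _euclidean(row, ideal)
        d_neg = _euclidean(row, anti)
        total = d_pos + d_neg
        scores.append(d_neg / total if total else 0.0)

    ranking = sorted(range(len(matrix)), key=lambda i: scores[i], reverse=True)
    return scores, ranking


# Hypothetical example: three topics scored on two "larger is better" criteria.
example_scores, example_ranking = topsis_rank(
    [[10, 10], [1, 1], [5, 5]], weights=[0.5, 0.5], benefit=[True, True]
)
```

In this toy input the first topic dominates on both criteria and the second is worst on both, so the first is ranked ahead of the third, and the third ahead of the second. With real data, the choice of criteria and weights determines the ranking and would need to follow the paper's own definitions.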

Key words: Network News; Hot Topic Discovery; Multi Factor Feature Selection; AFOA/K-means Algorithm; TOPSIS Model
Received: 15 July 2018      Published: 29 May 2019

Cite this article:

Tingxin Wen,Yangzi Li,Jingshuang Sun. News Hotspots Discovery Method Based on Multi Factor Feature Selection and AFOA/K-means. Data Analysis and Knowledge Discovery, 2019, 3(4): 97-106.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0757     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I4/97
