Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (4): 97-106    DOI: 10.11925/infotech.2096-3467.2018.0757
Research Paper
News Hotspots Discovery Method Based on Multi Factor Feature Selection and AFOA/K-means*
Tingxin Wen1, Yangzi Li1, Jingshuang Sun2
1Institute of Systems Engineering, Liaoning Technical University, Huludao 125105, China
2College of Business Administration, Liaoning Technical University, Huludao 125105, China
Abstract

[Objective] This paper studies feature dimensionality reduction and clustering for news text, aiming to improve the efficiency and accuracy of hot topic discovery. [Methods] Starting from the traditional TF-IDF feature weighting formula, four additional weighting factors (symbol, part of speech, position, and length) are introduced to realize multi-factor feature selection. An Ameliorated Fruit Fly Optimization Algorithm (AFOA) is constructed by modifying four aspects: the encoding scheme, the fitness function, the adaptive step length, and the population fitness variance. AFOA is used to select the initial cluster centers for K-means, and the optimized K-means is used to discover hot topics. Multi-factor feature selection is then used to identify the hot topics, and TOPSIS is used to rank them. [Results] Experiments show that multi-factor feature selection and the AFOA/K-means algorithm each significantly improve the clustering results, which verifies the overall effectiveness of the proposed method. [Limitations] The method applies only to Chinese news text. [Conclusions] The proposed method offers a new approach to research on Chinese news hotspot discovery.
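
The multi-factor weighting named in the abstract can be illustrated with a minimal Python sketch. The abstract lists the four factors (symbol, part of speech, position, length) but gives neither their formulas nor their values, so every constant and function name below (`POS_WEIGHT`, `TITLE_WEIGHT`, `QUOTED_WEIGHT`, `length_weight`, `multi_factor_tfidf`, `select_features`) is a hypothetical placeholder rather than the authors' definition. Word segmentation and part-of-speech tagging are assumed to have been done upstream.

```python
import math
from collections import Counter

# Hypothetical factor values; the abstract does not specify them.
POS_WEIGHT = {"n": 1.5, "v": 1.2}   # part-of-speech factor (nouns, verbs)
TITLE_WEIGHT = 2.0                   # position factor: term occurs in the title
QUOTED_WEIGHT = 1.3                  # symbol factor: term occurs inside quotation marks


def length_weight(term):
    """Length factor: longer terms get a mildly larger weight (assumed form)."""
    return 1.0 + math.log(len(term))


def multi_factor_tfidf(docs):
    """Return one {term: weight} dict per document.

    Each document is a dict with 'title' (set of terms), 'quoted' (set of
    terms) and 'tokens' (list of (term, pos_tag) pairs).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for doc in docs:
        df.update({term for term, _ in doc["tokens"]})

    result = []
    for doc in docs:
        tf = Counter(term for term, _ in doc["tokens"])
        pos_of = dict(doc["tokens"])
        total = sum(tf.values())
        weights = {}
        for term, count in tf.items():
            # Plain TF-IDF, with +1 so terms found in every document keep a positive weight.
            tfidf = (count / total) * (math.log(n_docs / df[term]) + 1.0)
            # Multiply in the four extra factors.
            factor = POS_WEIGHT.get(pos_of[term], 1.0)
            if term in doc["title"]:
                factor *= TITLE_WEIGHT
            if term in doc["quoted"]:
                factor *= QUOTED_WEIGHT
            factor *= length_weight(term)
            weights[term] = tfidf * factor
        result.append(weights)
    return result


def select_features(weights, k):
    """Keep the k highest-weighted terms of one document."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:k]]
```

Combining the factors by multiplication is one possible design choice; an additive or otherwise normalized combination would fit the abstract's description equally well.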

Key words: Network News; Hot Topic Discovery; Multi Factor Feature Selection; AFOA/K-means Algorithm; TOPSIS Model
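
The TOPSIS ranking step mentioned in the abstract follows a standard procedure: normalize the decision matrix, apply criterion weights, and score each alternative by its relative closeness to the ideal solution. The sketch below implements that textbook procedure in Python. The abstract does not say which indicators or weights the authors use, so the function names `topsis` and `rank` and their argument conventions are illustrative assumptions.

```python
import math


def _distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def topsis(matrix, weights, benefit):
    """Return the TOPSIS closeness score (0..1) of each row of `matrix`.

    matrix  : list of rows, one per alternative (e.g. one per topic)
    weights : one weight per criterion
    benefit : one bool per criterion, True if larger values are better
    Assumes no all-zero column and at least two distinct rows.
    """
    n_crit = len(weights)
    # Step 1: vector-normalize each column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_crit)]
    v = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in matrix]
    # Step 2: positive and negative ideal solutions, criterion by criterion.
    columns = list(zip(*v))
    best = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    worst = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]
    # Step 3: relative closeness to the positive ideal solution.
    scores = []
    for row in v:
        d_best = _distance(row, best)
        d_worst = _distance(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


def rank(scores):
    """Indices of the alternatives, highest closeness score first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

In a hot-topic setting, each row would hold the indicator values of one discovered topic, and `rank(topsis(...))` would give the topic order; which indicators to include is left open here because the abstract does not list them.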
Received: 2018-07-15
Funding: *This work is one of the research outcomes of the Liaoning Province Social Science Planning Fund project "Research on the Evaluation Index System for New-Type Urbanization in Liaoning" (Project No. L14BTJ004).
Cite this article:
Tingxin Wen, Yangzi Li, Jingshuang Sun. News Hotspots Discovery Method Based on Multi Factor Feature Selection and AFOA/K-means[J]. Data Analysis and Knowledge Discovery, 2019, 3(4): 97-106. DOI: 10.11925/infotech.2096-3467.2018.0757.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0757