Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (7): 55-62     https://doi.org/10.11925/infotech.2096-3467.2018.0003
Research Paper
Classifying Topics of Internet Public Opinion from College Students: Case Study of Sina Weibo*
Jia Longjia 1,2, Zhang Bangzuo 3
1 School of Mathematics and Statistics, Northeast Normal University, Changchun 130024, China
2 Department of Planning and Development, Northeast Normal University, Changchun 130024, China
3 School of Computer Science and Information Technology, Northeast Normal University, Changchun 130024, China
Abstract

[Objective] This paper proposes a term weighting method to address the high dimensionality and sparsity that arise when classifying the topics of Sina Weibo posts by college students. [Methods] We first compute the probability that each term belongs to each category, then use these values to predict the probability that a document belongs to each category. This converts the word-based feature representation into a category-based one, and the converted feature matrix is then classified with a support vector machine. [Results] When combined with the proposed method, the traditional tf, tf×idf and tf×rf schemes improved their Micro-F1/Macro-F1 values by 7.2%/7.8%, 7.5%/7.9% and 6.4%/5.7%, respectively. [Limitations] The study explores only the term weighting step of topic classification and does not examine the other components. [Conclusions] For topic classification of Internet public opinion in colleges and universities, the proposed method effectively reduces the dimension of the feature matrix while improving both classification performance and classification efficiency.

Key words: Internet Public Opinion Security; Theme Classification; Term Weighting; Machine Learning
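The conversion from word-based to category-based features described in the abstract can be illustrated with a small sketch. The paper's exact probability estimates are not given on this page, so everything below is an assumption: `term_class_probs`, `to_class_features`, the smoothing constant `alpha`, and the toy documents are hypothetical, and P(category | term) is estimated simply as the smoothed share of a term's total weight that falls in each category.

```python
from collections import defaultdict

def term_class_probs(docs, labels, classes, alpha=1.0):
    """Estimate P(category | term) from weighted training documents.

    docs: list of {term: weight} dicts; labels: one category per document.
    alpha is a hypothetical Laplace-style smoothing constant (an assumption).
    """
    mass = defaultdict(lambda: defaultdict(float))  # term -> category -> summed weight
    for doc, label in zip(docs, labels):
        for term, weight in doc.items():
            mass[term][label] += weight
    probs = {}
    for term, per_class in mass.items():
        total = sum(per_class.values()) + alpha * len(classes)
        probs[term] = {c: (per_class.get(c, 0.0) + alpha) / total for c in classes}
    return probs

def to_class_features(doc, probs, classes):
    """Map one {term: weight} document to a |C|-dimensional vector of category scores."""
    scores = [0.0] * len(classes)
    for term, weight in doc.items():
        if term not in probs:  # unseen term: contributes no category evidence
            continue
        for k, c in enumerate(classes):
            scores[k] += weight * probs[term][c]
    total = sum(scores)
    if total == 0:  # no known terms: fall back to a uniform vector
        return [1.0 / len(classes)] * len(classes)
    return [s / total for s in scores]

# Toy data, made up for illustration only.
classes = ["campus", "sports"]
train_docs = [
    {"exam": 2.0, "library": 1.0},
    {"match": 3.0, "team": 1.0},
    {"exam": 1.0, "team": 1.0},
]
train_labels = ["campus", "sports", "campus"]

probs = term_class_probs(train_docs, train_labels, classes)
features = [to_class_features(d, probs, classes) for d in train_docs]
```

Each document is reduced from one dimension per vocabulary term to one dimension per category, which is where the reduction in dimensionality and sparsity comes from. In a full pipeline the rows of `features` would be passed to a support vector machine for training and prediction.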
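Received: 2018-01-02      Published: 2018-08-15
CLC number (ZTFLH):  TP391.1
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "Community Knowledge Organization and Knowledge Emergence in the Folksonomy Model Based on Network Structure Evolution" (No. 71473035), the National Natural Science Foundation of China Young Scientists Fund project "Statistical Inference for Massive Short-Text Data Based on Bayesian Graphical Models" (No. 11501095), and the Jilin Provincial Department of Science and Technology Key Science and Technology Research project "Research and Development of Key Technologies for E-commerce Recommender Systems Integrating Social Relations Based on Heterogeneous Information Networks" (No. 20150204040GX).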
Cite this article:
Jia Longjia, Zhang Bangzuo. Classifying Topics of Internet Public Opinion from College Students: Case Study of Sina Weibo[J]. Data Analysis and Knowledge Discovery, 2018, 2(7): 55-62.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.0003      or      http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2018/V2/I7/55
  Example distribution of the number of documents in the dataset that contain a given feature
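Per-category document counts of this kind are what supervised weights such as tf×rf are built from. The sketch below assumes commonly cited definitions, idf = log(N / df) and rf = log2(2 + a / max(1, c)), where a and c are the numbers of positive-category and negative-category documents containing the term. The exact variants used in the paper are not shown on this page, so the formulas, function names, and counts here are assumptions for illustration.

```python
import math

def idf(n_docs, doc_freq):
    """Inverse document frequency, log(N / df); one common variant (assumed)."""
    return math.log(n_docs / doc_freq)

def rf(a, c):
    """Relevance frequency, log2(2 + a / max(1, c)) (assumed definition)."""
    return math.log2(2 + a / max(1, c))

def weights(tf, n_docs, a, c):
    """Return (tf, tf*idf, tf*rf) for one term in one document and one category.

    a: positive-category documents containing the term
    c: negative-category documents containing the term
    """
    return tf, tf * idf(n_docs, a + c), tf * rf(a, c)

# Toy counts, made up: 100 documents, term found in 30 positive and 10 negative ones.
w_tf, w_tfidf, w_tfrf = weights(tf=3, n_docs=100, a=30, c=10)
```

Unlike idf, rf uses the category labels: a term concentrated in the positive category gets a larger weight than one spread evenly across categories, even when both occur in the same total number of documents.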
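                                             | Actual: belongs to c_i | Actual: does not belong to c_i
Classifier predicts: belongs to c_i          | TP                     | FP
Classifier predicts: does not belong to c_i  | FN                     | TN
  Contingency table for category c_i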
Category set: $C=\{c_1, c_2, \ldots, c_i, \ldots, c_{|C|}\}$
                                           | Actual: belongs to C              | Actual: does not belong to C
Classifier predicts: belongs to C          | $TP=\sum_{i=1}^{|C|} TP_i$        | $FP=\sum_{i=1}^{|C|} FP_i$
Classifier predicts: does not belong to C  | $FN=\sum_{i=1}^{|C|} FN_i$        | $TN=\sum_{i=1}^{|C|} TN_i$
  Global contingency table
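The two contingency tables correspond to the two averaging schemes reported in the abstract: macro-averaged F1 averages the F1 score computed from each category's own table, while micro-averaged F1 is computed once from the summed counts of the global table. A minimal sketch, with hypothetical function names and made-up counts:

```python
def f1(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); taken as 0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_macro_f1(tables):
    """tables: list of per-category (TP, FP, FN) tuples.

    Returns (micro_f1, macro_f1).
    """
    macro = sum(f1(*t) for t in tables) / len(tables)
    # Global table: sum each count over all categories, then compute F1 once.
    tp, fp, fn = (sum(col) for col in zip(*tables))
    return f1(tp, fp, fn), macro

# Toy per-category counts, made up for illustration.
micro, macro = micro_macro_f1([(8, 2, 2), (1, 1, 3)])
```

Because the micro average pools the counts, large categories dominate it, whereas the macro average gives every category equal weight; reporting both shows whether gains come only from the frequent categories.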
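  Micro-averaged F1 performance of six term weighting methods with the support vector machine classifier
  Macro-averaged F1 performance of six term weighting methods with the support vector machine classifier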