Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (8): 1-9    DOI: 10.11925/infotech.2096-3467.2018.1207
Classifying Social Media Users with Machine Learning
Gang Li, Huayang Zhou, Jin Mao, Sijing Chen
Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper uses multi-dimensional information about social media users to classify them automatically. [Methods] First, we categorized social media users into four types: individual, media, government, and organization. Second, we extracted three groups of features from user profiles: demographic characteristics, naming patterns, and self-descriptions. Third, we built user classification models based on machine learning algorithms and evaluated their performance on a real Twitter dataset. [Results] Both the precision and the recall of the proposed model exceeded 83%. The naming, demographic, and self-description features contributed to the classification model in increasing order. [Limitations] The sample size needs to be expanded, which would help us better analyze the characteristics of different user types. [Conclusions] The proposed method can accurately identify the four types of users, which benefits future research on social media user classification.

Key words: SVM; User Classification; Machine Learning; Feature Extraction
Received: 31 October 2018      Published: 29 September 2019
ZTFLH:  TP393 G35  
Corresponding Authors: Jin Mao     E-mail: danveno@163.com

Cite this article:

Gang Li, Huayang Zhou, Jin Mao, Sijing Chen. Classifying Social Media Users with Machine Learning. Data Analysis and Knowledge Discovery, 2019, 3(8): 1-9.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1207     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I8/1

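| Feature group | Feature ID | Feature description | Notes |
|---|---|---|---|
| Demographic features | F1 | Number of followers | Values range from 0 to 9 |
| | F2 | Number of accounts followed | |
| | F3 | Number of times the user has been listed | |
| | F4 | Whether the account is verified | 1 if verified, otherwise 0 |
| Naming features | F5 | Naming pattern of the username | Combination pattern of English letters |
| | F6 | Naming pattern of the nickname | |
| | F7 | Similarity between nickname and username | |
| Self-description features | F8-F2395 | TF-IDF (term frequency-inverse document frequency) of terms | |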
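
The naming-similarity feature (F7) and the TF-IDF self-description features (F8-F2395) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the page does not say which similarity measure or TF-IDF variant was used, so `difflib.SequenceMatcher` and the plain `tf * log(N / df)` weighting are assumptions, and the function names and sample profiles are hypothetical.

```python
import math
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, List


def name_similarity(username: str, nickname: str) -> float:
    """Case-insensitive similarity in [0, 1]; an assumed stand-in for F7."""
    return SequenceMatcher(None, username.lower(), nickname.lower()).ratio()


def tfidf(descriptions: List[str]) -> List[Dict[str, float]]:
    """Return one {term: weight} dict per self-description.

    Assumed weighting: relative term frequency times log(N / document frequency).
    """
    docs = [d.lower().split() for d in descriptions]
    n_docs = len(docs)
    # Document frequency: how many descriptions contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero for empty descriptions
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical self-descriptions, made up for illustration only.
profiles = [
    "breaking news and weather updates",
    "official account of the city government",
    "coffee lover and weather nerd",
]
vectors = tfidf(profiles)
```

Terms that appear in many descriptions (such as "weather" here) receive lower weights than terms unique to one description, which is why the resulting vectors are useful for telling user types apart.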
| | Predicted as media user | Predicted as non-media user |
|---|---|---|
| Actually a media user | TP | FN |
| Actually not a media user | FP | TN |
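
Precision and recall follow directly from the four cells of the confusion matrix (TP, FN, FP, TN). The sketch below uses the standard definitions; the function name and the counts are hypothetical and are not results from the paper.

```python
from typing import Dict


def binary_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Compute precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total if total else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical counts for the "media" class, made up for illustration only.
metrics = binary_metrics(tp=80, fn=20, fp=10, tn=90)
```

For a four-class problem such as this one, the same computation would be repeated per class in a one-vs-rest fashion, treating each user type in turn as the positive class.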
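| Algorithms compared | p-value |
|---|---|
| SVM vs. stochastic gradient descent | 0.098 |
| SVM vs. decision tree | 0.011** |
| SVM vs. k-nearest neighbors | 0.032** |
| SVM vs. naive Bayes | 0.000** |
| SVM vs. artificial neural network | 0.000** |
| Stochastic gradient descent vs. decision tree | 0.000** |
| Stochastic gradient descent vs. k-nearest neighbors | 0.002** |
| Stochastic gradient descent vs. naive Bayes | 0.007** |
| Stochastic gradient descent vs. artificial neural network | 0.018** |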
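
This page does not state which significance test produced the p-values above. As one possible way to compare two classifiers on paired scores (for example, per-fold results from cross-validation), the sketch below runs an exact two-sided sign-flip permutation test. The choice of test, the function name, and the fold scores are all assumptions made for illustration.

```python
from itertools import product
from typing import Sequence


def paired_permutation_p(scores_a: Sequence[float],
                         scores_b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired scores.

    Under the null hypothesis the sign of each paired difference is arbitrary,
    so every sign assignment is enumerated and compared with the observed sum.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return hits / total


# Hypothetical per-fold scores for two classifiers; not data from the paper.
fold_scores_a = [0.86, 0.84, 0.88, 0.85, 0.87, 0.83, 0.86, 0.85]
fold_scores_b = [0.80, 0.79, 0.83, 0.81, 0.80, 0.78, 0.82, 0.79]
p_value = paired_permutation_p(fold_scores_a, fold_scores_b)
```

Full enumeration costs 2^n evaluations for n pairs, so this form only suits a small number of folds; with more pairs, random sampling of sign assignments would be the usual substitute.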