Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (8): 1-9    DOI: 10.11925/infotech.2096-3467.2018.1207
Classifying Social Media Users with Machine Learning
Gang Li, Huayang Zhou, Jin Mao, Sijing Chen
Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract  

[Objective] This paper uses multi-dimensional information about social media users to classify them automatically. [Methods] First, we divided social media users into four types: individual, media, government, and organization. Second, we extracted three groups of features from user profiles: demographic characteristics, naming patterns, and self-descriptions. Third, we built user classification models based on machine learning algorithms and evaluated their performance on a real Twitter dataset. [Results] Both the precision and the recall of the proposed model exceeded 83%. The naming, demographic, and self-description features contributed to the classification model in increasing order. [Limitations] The sample size needs to be expanded to better analyze the characteristics of different user types. [Conclusions] The proposed method can accurately identify the four types of users, which benefits future research on social media user classification.

Key words: SVM, User Classification, Machine Learning, Feature Extraction
Received: 31 October 2018      Published: 29 September 2019
ZTFLH:  TP393 G35  
Corresponding Authors: Jin Mao     E-mail: danveno@163.com

Cite this article:

Gang Li, Huayang Zhou, Jin Mao, Sijing Chen. Classifying Social Media Users with Machine Learning. Data Analysis and Knowledge Discovery, 2019, 3(8): 1-9.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1207     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I8/1

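Feature group | Feature ID | Description | Note
Demographic features | F1 | Number of followers | Value in 0-9
Demographic features | F2 | Number of accounts followed |
Demographic features | F3 | Number of times the user has been tagged |
Demographic features | F4 | Whether the account is verified | 1 if verified, otherwise 0
Naming features | F5 | Naming pattern of the username | Combination pattern of English letters
Naming features | F6 | Naming pattern of the nickname |
Naming features | F7 | Similarity between nickname and username |
Self-description features | F8-F2395 | Term frequency-inverse document frequency (TF-IDF) of words |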
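
The feature groups in the table above can be illustrated with a minimal sketch. The page does not say how counts are mapped to 0-9, how naming patterns are encoded, or which similarity measure is used, so the order-of-magnitude binning, the character-class pattern string, and the `difflib` ratio below are all assumptions chosen for illustration, as are the function names.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def bin_count(n: int) -> int:
    """F1-F3 (assumed binning): order of magnitude of the count, capped at 9."""
    if n <= 0:
        return 0
    return min(9, len(str(n)) - 1)


def naming_pattern(name: str) -> str:
    """F5/F6 (assumed encoding): collapse runs of character classes.

    'A' = uppercase, 'a' = lowercase, '9' = digit, '_' = anything else.
    """
    out = []
    for ch in name:
        if ch.isupper():
            c = "A"
        elif ch.islower():
            c = "a"
        elif ch.isdigit():
            c = "9"
        else:
            c = "_"
        if not out or out[-1] != c:
            out.append(c)
    return "".join(out)


def name_similarity(username: str, nickname: str) -> float:
    """F7 (assumed measure): case-insensitive difflib similarity in [0, 1]."""
    return SequenceMatcher(None, username.lower(), nickname.lower()).ratio()


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """F8+ : TF-IDF weight of each word in each self-description."""
    tokenized = [re.findall(r"[a-z]+", d.lower()) for d in docs]
    # Document frequency: number of descriptions containing each word.
    df = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append(
            {t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return weights


def profile_features(followers, following, tagged, verified, username, nickname):
    """Assemble F1-F7 for one (hypothetical) user profile."""
    return {
        "F1": bin_count(followers),
        "F2": bin_count(following),
        "F3": bin_count(tagged),
        "F4": 1 if verified else 0,
        "F5": naming_pattern(username),
        "F6": naming_pattern(nickname),
        "F7": name_similarity(username, nickname),
    }
```

A real pipeline would turn the pattern strings into numeric codes and concatenate F1-F7 with the TF-IDF vector before training; that step is omitted here.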
 | Predicted as media user | Predicted as non-media user
Actually a media user | TP | FN
Actually not a media user | FP | TN
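
The precision and recall reported in the abstract follow from the four cells of this confusion matrix. A minimal sketch of the standard definitions is given below; the function name is illustrative and the counts in the usage comment are hypothetical, not the paper's data.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Standard per-class metrics from confusion-matrix counts.

    precision = TP / (TP + FP): share of predicted media users that are media users.
    recall    = TP / (TP + FN): share of actual media users that were found.
    F1        = harmonic mean of precision and recall.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for the "media" class: TP=80, FP=20, FN=10.
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=10)
```

TN does not enter any of the three metrics, which is why they remain informative when one user type is much rarer than the others.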
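Algorithm comparison | p-value
SVM vs. stochastic gradient descent | 0.098
SVM vs. decision tree | 0.011**
SVM vs. K-nearest neighbors | 0.032**
SVM vs. naive Bayes | 0.000**
SVM vs. artificial neural network | 0.000**
Stochastic gradient descent vs. decision tree | 0.000**
Stochastic gradient descent vs. K-nearest neighbors | 0.002**
Stochastic gradient descent vs. naive Bayes | 0.007**
Stochastic gradient descent vs. artificial neural network | 0.018**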
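
This page does not state which statistical test produced the p-values above. As one way such a pairwise comparison could be run, the sketch below uses an exact paired sign-flip permutation test on per-fold scores of two classifiers. The choice of test, the function name, and the fold scores in the usage comment are assumptions for illustration, not the authors' procedure or data.

```python
from itertools import product


def paired_permutation_p(scores_a: list[float], scores_b: list[float]) -> float:
    """Two-sided exact sign-flip permutation test on paired scores.

    Under the null hypothesis that both classifiers perform equally, the sign
    of each per-fold difference is arbitrary, so every sign assignment is
    enumerated and the share with a total at least as extreme as the observed
    one is returned. Cost is 2**n, so keep n small (e.g. 10 folds).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must be paired and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            extreme += 1
    return extreme / total


# Hypothetical 10-fold F1 scores for two classifiers.
svm_scores = [0.86, 0.84, 0.85, 0.87, 0.83, 0.86, 0.85, 0.84, 0.86, 0.85]
tree_scores = [0.80, 0.81, 0.79, 0.82, 0.80, 0.81, 0.78, 0.80, 0.82, 0.79]
p_value = paired_permutation_p(svm_scores, tree_scores)
```

A pair would be called significantly different when the resulting p-value falls below the chosen threshold, for example 0.05.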