Classifying Social Media Users with Machine Learning
Gang Li, Huayang Zhou, Jin Mao, Sijing Chen
Center for Studies of Information Resources, Wuhan University, Wuhan 430072, China
Abstract [Objective] This paper uses multi-dimensional information about social media users to classify them automatically. [Methods] First, we divided social media users into four types: individual, media, government, and organization. Second, we extracted three groups of features from user profiles: demographic characteristics, naming, and self-descriptions. Third, we built a user classification model based on machine learning algorithms and evaluated its performance on a real Twitter dataset. [Results] Both the precision and the recall of the proposed model exceeded 83%. The naming, demographic, and self-description features contributed to the classification model in increasing order of importance. [Limitations] The sample size needs to be expanded, which would allow a better analysis of the characteristics of different user types. [Conclusions] The proposed method could accurately identify the four types of users, which benefits future research on social media user classification.
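As an illustration of how the reported metrics are defined, the sketch below computes per-class precision and recall for the four user types named in the abstract. It is not the authors' evaluation code: the toy labels are invented, the function names are hypothetical, and macro averaging is an assumption, since the abstract does not say how the scores were aggregated.

```python
from collections import Counter

# The four user types named in the abstract.
CLASSES = ["individual", "media", "government", "organization"]


def precision_recall(y_true, y_pred, classes=CLASSES):
    """Return {class: (precision, recall)} for a multi-class labelling."""
    true_count = Counter(y_true)   # how many users truly belong to each class
    pred_count = Counter(y_pred)   # how many users were assigned to each class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    scores = {}
    for c in classes:
        precision = true_pos[c] / pred_count[c] if pred_count[c] else 0.0
        recall = true_pos[c] / true_count[c] if true_count[c] else 0.0
        scores[c] = (precision, recall)
    return scores


def macro_average(scores):
    """Unweighted mean of per-class precision and recall (an assumed choice)."""
    n = len(scores)
    mean_p = sum(p for p, _ in scores.values()) / n
    mean_r = sum(r for _, r in scores.values()) / n
    return mean_p, mean_r


# Invented toy labels, two users per class; not data from the paper.
y_true = ["individual", "individual", "media", "media",
          "government", "government", "organization", "organization"]
y_pred = ["individual", "media", "media", "media",
          "government", "government", "organization", "individual"]

scores = precision_recall(y_true, y_pred)
macro_p, macro_r = macro_average(scores)
for name in CLASSES:
    p, r = scores[name]
    print(f"{name:12s} precision={p:.2f} recall={r:.2f}")
print(f"macro        precision={macro_p:.2f} recall={macro_r:.2f}")
```

Reporting both metrics per class matters here because the four user types are unlikely to be equally frequent on Twitter, so a single accuracy figure could hide weak performance on the rarer types.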
Received: 31 October 2018
Published: 29 September 2019
Corresponding Authors:
Jin Mao
E-mail: danveno@163.com