Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (12): 32-40    DOI: 10.11925/infotech.2096-3467.2017.0817
Original Article
Research on Text Clustering Based on Requirements of Big Data Jobs
Liu Ruilun, Ye Wenhao, Gao Ruiqing, Tang Mengjia, Wang Dongbo
College of Information and Technology, Nanjing Agricultural University, Nanjing 210095, China
Abstract  

[Objective] This study analyzes the requirements listed for big data-related positions to help companies identify high-quality candidates. [Methods] We retrieved big data job postings from major recruitment websites during the first quarter of 2017. We then clustered the texts semantically using TF-IDF, word2vec, and k-means, and used the silhouette coefficient to optimize the clustering. [Results] The clustering results were good and divided the job requirements into three categories: capability, educational background, and work experience. [Limitations] First, job postings on different websites did not follow a unified format, which affected data cleaning and clustering. Second, the training set for word2vec was small because insufficient data could be retrieved from the Web. [Conclusions] Big data-related jobs do not require advanced degrees, and companies prefer experienced candidates, although applicants with no relevant experience are also considered. Candidates' professionalism is more important than their computer skills.
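
The clustering step named in the abstract can be illustrated with a minimal, standard-library sketch: run k-means for several values of k and keep the k with the highest mean silhouette coefficient. Everything here is an assumption made for illustration: the toy 2-D points stand in for the word vectors, the deterministic farthest-point seeding replaces whatever initialization the authors used, and the function names are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption of this sketch, not the paper's)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(euclidean(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean of s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(euclidean(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy stand-in for word vectors: three well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
scores = {k: silhouette(points, kmeans(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k)
```

On real data the points would be keyword vectors produced by word2vec rather than hand-written coordinates, and a library implementation of k-means and the silhouette score would normally replace these hand-rolled functions.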

Key words: Big Data Jobs; Word2Vec; K-means; Silhouette Coefficient
Received: 15 August 2017      Published: 29 December 2017
ZTFLH:  G351  

Cite this article:

Liu Ruilun, Ye Wenhao, Gao Ruiqing, Tang Mengjia, Wang Dongbo. Research on Text Clustering Based on Requirements of Big Data Jobs. Data Analysis and Knowledge Discovery, 2017, 1(12): 32-40.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2017.0817     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2017/V1/I12/32

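No. | Type | Examples
1 | Big data technical terms | Python, PostgreSQL, data mining, data analysis
2 | Work experience | 3 years; 1-3 years; 5 years of database administration experience; no experience required
3 | Education requirements | Bachelor's, Master's, Doctorate
4 | Preferred qualifications | Experience writing open-source projects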

Size \ k | 3 | 4 | 5 | 6
2 | 0.735 | 0.726 | 0.622 | 0.597
25 | 0.784 | 0.779 | 0.701 | 0.690
50 | 0.792 | 0.787 | 0.712 | 0.711
100 | 0.797 | 0.792 | 0.722 | 0.719
250 | 0.802 | 0.795 | 0.727 | 0.728
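
Reading the grid above can be sketched in a few lines. The sketch assumes, since the caption is not preserved here, that each row is a Size setting, each column a value of k, and each cell a score where higher is better (for example a silhouette coefficient). The names `SCORES` and `best_setting` are hypothetical.

```python
# Scores transcribed from the table above: rows are Size, columns are k.
K_VALUES = (3, 4, 5, 6)
SCORES = {
    2: (0.735, 0.726, 0.622, 0.597),
    25: (0.784, 0.779, 0.701, 0.690),
    50: (0.792, 0.787, 0.712, 0.711),
    100: (0.797, 0.792, 0.722, 0.719),
    250: (0.802, 0.795, 0.727, 0.728),
}


def best_setting(scores, k_values):
    """Return the (size, k, score) triple with the highest score."""
    cells = (
        (size, k, score)
        for size, row in scores.items()
        for k, score in zip(k_values, row)
    )
    return max(cells, key=lambda cell: cell[2])


def best_k_per_size(scores, k_values):
    """For every Size, return the k whose score is highest."""
    return {
        size: k_values[row.index(max(row))] for size, row in scores.items()
    }


print(best_setting(SCORES, K_VALUES))     # (250, 3, 0.802)
print(best_k_per_size(SCORES, K_VALUES))  # k = 3 for every Size
```

Under that reading, k = 3 scores highest at every Size, and the score at k = 3 rises from 0.735 to 0.802 as Size grows from 2 to 250.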
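
No. | Keyword | Frequency | No. | Keyword | Frequency
1 | Bachelor's degree or above (本科及以上) | 1,529 | 16 | Good communication skills (良好的沟通能力) | 416
2 | Computer-related major (计算机相关专业) | 1,434 | 17 | Strong sense of responsibility (责任心强) | 371
3 | Experience preferred (有经验者优先) | 1,408 | 18 | excel | 368
4 | Database (数据库) | 1,131 | 19 | Data warehouse (数据仓库) | 367
5 | Data mining (数据挖掘) | 874 | 20 | Office software (办公软件) | 359
6 | Statistics (统计学) | 868 | 21 | Team spirit (团队合作精神) | 357
7 | More than three years (三年以上) | 723 | 22 | Business requirements (业务需求) | 351
8 | More than two years (二年以上) | 564 | 23 | Machine learning (机器学习) | 349
9 | More than one year (一年以上) | 551 | 24 | hadoop | 341
10 | Relevant work experience (相关工作经验) | 538 | 25 | Able to work independently (独立完成) | 340
11 | Database engineer (数据库工程师) | 518 | 26 | Sensitive to data (对数据敏感) | 330
12 | Big data (大数据) | 466 | 27 | Learning ability (学习能力) | 324
13 | Logical thinking ability (逻辑思维能力) | 428 | 28 | Junior college degree or above (大专及以上) | 306
14 | Communication skills (沟通能力) | 422 | 29 | Data processing (数据处理) | 296
15 | Development experience (开发经验) | 417 | 30 | Logical analysis ability (逻辑分析能力) | 295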
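
Cluster | Keyword | Frequency
#1 | Experience (经验) | 34
#1 | Massive data (海量数据) | 20
#1 | Experienced candidates preferred (经验者优先) | 18
#1 | Experienced candidates (有经验者) | 7
#1 | Design experience (设计经验) | 6
#2 | Good communication skills (良好的沟通能力) | 128
#2 | Team spirit (团队合作精神) | 116
#2 | Strong sense of responsibility (责任心强) | 90
#2 | Communication skills (沟通能力) | 59
#2 | And team spirit (和团队合作精神) | 55
#3 | Major (专业) | 21
#3 | Bachelor's degree or above (本科及以上) | 16
#3 | Two-day weekends (双休) | 7
#3 | Above bachelor's degree (本科以上) | 6
#3 | Junior college degree or above (大专及以上) | 4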
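
A per-cluster keyword table like the one above can be produced by counting keywords within each cluster and keeping the most frequent ones. The sketch below shows one way to do that, assuming the clustering output is available as (cluster id, keyword) pairs. The input is a hypothetical toy sample, not the paper's data, and `top_keywords` is an illustrative name.

```python
from collections import Counter, defaultdict


def top_keywords(assignments, n=5):
    """Map each cluster id to its n most frequent keywords with their counts."""
    per_cluster = defaultdict(Counter)
    for cluster_id, keyword in assignments:
        per_cluster[cluster_id][keyword] += 1
    return {cid: counts.most_common(n) for cid, counts in per_cluster.items()}


# Hypothetical toy input: one (cluster id, keyword) pair per keyword occurrence.
sample = [
    (1, "experience"), (1, "experience"), (1, "massive data"),
    (2, "teamwork"), (2, "teamwork"), (2, "teamwork"), (2, "communication"),
]
table = top_keywords(sample, n=2)
for cluster_id, rows in sorted(table.items()):
    for keyword, frequency in rows:
        print(f"#{cluster_id} | {keyword} | {frequency}")
```

Near-duplicate keywords, such as the two team spirit entries in cluster #2, are counted separately by this approach unless they are normalized before counting.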