[Objective] This study analyzes the requirements of big data-related positions to help companies identify high-quality candidates. [Methods] We collected big data job postings from major recruitment websites during the first quarter of 2017. We then clustered the texts semantically using TF-IDF, word2vec, and k-means, and used the silhouette coefficient to optimize the clustering. [Results] The clustering performed well and divided the job requirements into three categories: capability, educational background, and work experience. [Limitations] First, job postings on different websites did not follow a unified format, which affected data cleaning and clustering. Second, the word2vec training set was small because insufficient data could be retrieved from the Web. [Conclusions] Big data-related jobs do not require advanced degrees, and companies prefer experienced candidates, although applicants without relevant experience are also considered. Candidates’ professionalism matters more than their computer skills.
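The clustering step named in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: it uses only the Python standard library, leaves out the word2vec step entirely, and runs on a handful of invented, pre-tokenized toy postings. The function names (`tfidf_vectors`, `kmeans`, `silhouette`), the smoothed IDF variant, the random seed, and the candidate values of k are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized TF-IDF vectors."""
    vocab = sorted({t for d in docs for t in d})
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    # Smoothed IDF (an assumed variant; the paper does not state its formula).
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        v = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        vectors.append([x / norm for x in v])
    return vectors


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iters=50, seed=0):
    """Plain k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    if len(clusters) < 2:
        return 0.0  # not defined for a single cluster; 0 by convention here
    total = 0.0
    for i, lab in enumerate(labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(sum(dist(points[i], points[j]) for j in other) / len(other)
                for m, other in clusters.items() if m != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Invented toy postings (hypothetical data, not taken from the study).
docs = [
    ["python", "hadoop", "spark"], ["hadoop", "spark", "sql"],
    ["python", "sql", "spark"], ["bachelor", "degree", "computer"],
    ["master", "degree", "computer"], ["bachelor", "degree", "statistics"],
    ["years", "experience", "project"], ["years", "experience", "team"],
    ["experience", "project", "team"],
]
vectors = tfidf_vectors(docs)
# Score each candidate k and keep the one with the highest silhouette.
scores = {k: silhouette(vectors, kmeans(vectors, k)) for k in (2, 3, 4)}
k = max(scores, key=scores.get)
print(k, scores)
```

Choosing k by the highest mean silhouette is one common way to apply the coefficient. The abstract does not say how the authors used it, so the selection rule here is also an assumption.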
Liu Ruilun, Ye Wenhao, Gao Ruiqing, Tang Mengjia, Wang Dongbo. Research on Text Clustering Based on Requirements of Big Data Jobs[J]. Data Analysis and Knowledge Discovery, 2017, 1(12): 32-40.