Research on Text Clustering Based on Requirements of Big Data Jobs
Liu Ruilun, Ye Wenhao, Gao Ruiqing, Tang Mengjia, Wang Dongbo
College of Information and Technology, Nanjing Agricultural University, Nanjing 210095, China

Abstract [Objective] This study analyzes the requirements stated in job postings for big data positions, with the aim of helping companies identify well-qualified candidates. [Methods] We collected big data job postings from major recruitment websites for the first quarter of 2017. We then clustered the posting texts semantically using TF-IDF, word2vec, and k-means, and tuned the clustering with the silhouette coefficient. [Results] The clustering produced good results and divided the job requirements into three categories: capability, educational background, and work experience. [Limitations] First, job postings on different websites did not share a uniform format, which affected data cleaning and clustering. Second, the word2vec training set was small because the data retrieved from the Web were limited. [Conclusions] Big data jobs do not require advanced degrees, and companies prefer experienced candidates, although applicants without relevant experience are also considered. Candidates' professionalism matters more than their computer skills.
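
The role of the silhouette coefficient in tuning k-means can be illustrated with a minimal sketch. This is not the authors' implementation: the 2-D tuples in `docs` are invented stand-ins for document vectors (which in the study would be derived from TF-IDF and word2vec), and the helper names `farthest_first`, `kmeans`, and `silhouette` are hypothetical. The sketch runs k-means for several values of k and keeps the k with the highest mean silhouette.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch): start at
    points[0], then repeatedly add the point farthest from chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=100):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centers[c])  # keep an empty cluster's center
        if updated == centers:
            break
        centers = updated
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two non-empty clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s = 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        total += (b - a) / max(a, b)
    return total / len(points)

# Invented toy "document vectors": three well-separated groups.
docs = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (20, 0), (20, 1), (21, 0), (21, 1)]

# Score each candidate k and keep the one with the highest silhouette.
scores = {k: silhouette(docs, kmeans(docs, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On this toy data the highest score is reached at k = 3, matching the three groups built into `docs`. Real job-posting vectors are high-dimensional and noisier, so the silhouette curve would likely be flatter and the choice of k less clear-cut.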
Received: 15 August 2017
Published: 29 December 2017