Data Analysis and Knowledge Discovery  2017, Vol. 1 Issue (12): 32-40    DOI: 10.11925/infotech.2096-3467.2017.0817
Original Article
Research on Text Clustering Based on Requirements of Big Data Jobs
Ruilun Liu, Wenhao Ye, Ruiqing Gao, Mengjia Tang, Dongbo Wang
College of Information and Technology, Nanjing Agricultural University, Nanjing 210095, China

[Objective] This study analyzes the requirements of big data-related positions to help companies identify high-quality candidates. [Methods] We retrieved big data job postings from major recruitment websites during the first quarter of 2017. We then clustered the texts semantically using TF-IDF, word2vec, and k-means, and optimized the clustering with the silhouette coefficient. [Results] The clustering performed well and divided the job requirements into three categories: capability, educational background, and work experience. [Limitations] First, job postings on different websites did not follow a unified format, which affected data cleaning and clustering. Second, the word2vec training set was small because the data retrieved from the Web was insufficient. [Conclusions] Big data-related jobs do not require advanced degrees, and companies prefer experienced candidates, although applicants without relevant experience are also considered. Candidates' professionalism is more important than their computer skills.
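
The Methods step of tuning k-means with the silhouette coefficient can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D points in `docs` are hypothetical stand-ins for document vectors, the range of candidate k values is arbitrary, and the deterministic farthest-point initialization is an assumption made only to keep the example reproducible. For each point, the silhouette is s = (b - a) / max(a, b), where a is the mean distance to points in its own cluster and b is the mean distance to the nearest other cluster.

```python
import math

def dist(p, q):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def init_centers(points, k):
    """Farthest-point initialization (an assumption, chosen for determinism)."""
    centers = [points[0]]
    while len(centers) < k:
        # Pick the point whose nearest existing center is farthest away.
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def kmeans(points, k, iters=100):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    centers = init_centers(points, k)
    labels = None
    for _ in range(iters):
        # Assignment step: each point joins its nearest center.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for i, l in enumerate(labels):
        clusters.setdefault(l, []).append(i)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, l in enumerate(labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        b = min(
            sum(dist(points[i], points[j]) for j in other) / len(other)
            for m, other in clusters.items() if m != l
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy "document vectors": three well-separated groups.
docs = [
    (0.0, 0.1), (0.2, 0.0), (0.1, 0.3),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.3),
    (10.0, 0.1), (10.2, 0.0), (9.9, 0.3),
]

# Try several values of k and keep the one with the highest mean silhouette.
best_k = max(range(2, 6), key=lambda k: silhouette(docs, kmeans(docs, k)))
print(best_k)  # selects k = 3 on this toy data
```

With real postings, the vectors might be built by averaging word2vec word vectors per document, possibly weighted by TF-IDF. The abstract does not say how the two were combined, so that step is left out of the sketch.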

Key words: Big Data Jobs; Word2Vec; K-means; Silhouette Coefficient
Received: 15 August 2017      Published: 29 December 2017

Cite this article:

Ruilun Liu, Wenhao Ye, Ruiqing Gao, Mengjia Tang, Dongbo Wang. Research on Text Clustering Based on Requirements of Big Data Jobs. Data Analysis and Knowledge Discovery, 2017, 1(12): 32-40.

