Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (2/3): 274-288     https://doi.org/10.11925/infotech.2096-3467.2021.0963
Cross-domain Transfer Learning for Recognizing Professional Skills from Chinese Job Postings
Yi Xinhe1, Yang Peng2, Wen Yimin2
1Library of Guilin University of Electronic Technology, Guilin 541004, China
2School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China

Abstract

[Objective] This paper analyzes online job postings to identify employers' demands accurately, providing technical support for addressing the mismatch between skill supply and demand in the labor market. [Methods] We propose CDTL-PSE, a cross-domain transfer learning model for recognizing professional skill words. CDTL-PSE treats the task as sequence tagging, akin to named entity recognition or term extraction. It decomposes the SIGHAN corpus into three source domains and inserts a domain adaptation layer between the Bi-LSTM and CRF layers, enabling transfer learning from each source domain to the target domain. Each sub-model is then trained with a parameter-transfer approach, and the final label sequence is obtained by majority vote over the sub-models' predictions. [Results] On a self-built online recruitment dataset, the alternately trained CDTL-PSE with a Bi-LSTM domain adaptation layer improved the F1 value over the baseline by 0.91%, while requiring about 50% fewer labeled samples. [Limitations] The interpretability of CDTL-PSE needs further improvement. [Conclusions] CDTL-PSE can automatically extract professional skill words and effectively alleviates the shortage of labeled samples in the target domain.
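The final step described in the abstract, combining the three source-domain sub-models' label sequences by majority vote, can be sketched as follows. This is a minimal illustration: the tag set (`B-SKILL`/`I-SKILL`/`O`) and the example predictions are hypothetical, not data from the paper.

```python
from collections import Counter

def majority_vote(label_seqs):
    """Combine label sequences from several sub-models by per-token majority vote."""
    voted = []
    for labels in zip(*label_seqs):  # one tuple of candidate labels per token
        label, _ = Counter(labels).most_common(1)[0]
        voted.append(label)
    return voted

# Hypothetical predictions from three sub-models (e.g., the sub-models
# transferred from the person/location/organization source domains)
preds = [
    ["B-SKILL", "I-SKILL", "O", "O"],
    ["B-SKILL", "I-SKILL", "O", "B-SKILL"],
    ["B-SKILL", "O",       "O", "O"],
]
print(majority_vote(preds))  # ['B-SKILL', 'I-SKILL', 'O', 'O']
```

`Counter.most_common` breaks three-way ties by first occurrence; with three voters and the usual two-way disagreements, a clear majority always exists.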

Key words: Professional Skill Words    Cross-Domain Transfer Learning    Domain Adaptation
Received: 2021-08-31      Published online: 2022-01-07
CLC number: TP393
Supported by: Humanities and Social Sciences Research Project of the Ministry of Education (17JDGC022); Guangxi Degree and Postgraduate Education Reform Project (JGY2017055); Natural Science Foundation of Guangxi (2018GXNSFDA138006)
Corresponding author: Wen Yimin, ORCID: 0000-0001-5017-3987, E-mail: ymwen@guet.edu.cn
Cite this article:
Yi Xinhe, Yang Peng, Wen Yimin. Cross-domain Transfer Learning for Recognizing Professional Skills from Chinese Job Postings. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 274-288.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0963      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I2/3/274
Fig. 1  An example of job requirements from a Chinese online job posting (skill words underlined)
Fig. 2  The Bi-LSTM framework
Fig. 3  The CDTL-PSE framework
Entity type    Total   After verb   After adj./adj. phrase   After noun/noun phrase   In coordination ("和", "或", "、", etc.)   Other positions
Skill word     3 930   1 309        764                      2                        1 512                                      343
Person         884     136          27                       204                      73                                         444
Location       1 951   649          106                      57                       87                                         1 052
Organization   984     278          54                       36                       128                                        488
Table 1  Statistics of entity occurrences at different positions in the corresponding corpora
Dataset         Split            Entity type   Skill words    Sentences
Target domain   Training set     Skill word    [5836, 6586]   725
                Validation set   Skill word    [593, 817]     80
                Test set         Skill word    [1577, 1903]   201
Table 2  Statistics of the target-domain dataset
Dataset         Split            Entity types                   Entities          Sentences
Source domain   Training set     Person/Location/Organization   8144/16571/9277   20 864
                Validation set   Person/Location/Organization   884/1951/984      2 318
                Test set         Person/Location/Organization   1864/3658/2185    4 636
Table 3  Statistics of the source-domain datasets
Fig. 4  Network structures of the compared methods
Method           Precision/%   Recall/%      F1/%          Improvement/%
No-Transfer 80.04(3.12) 82.11(2.63) 81.03(2.41) -
Per_Alter 79.78(3.43) 82.59(2.64) 81.15(2.93) +0.12
Loc_Alter 79.76(2.84) 82.00(2.76) 80.84(2.50) -0.19
Org_Alter 80.07(3.33) 82.28(2.26) 81.13(2.50) +0.10
Diff_Alter 79.81(3.22) 81.42(2.58) 80.58(2.57) -0.45
Same_Alter 79.26(3.54) 81.72(3.29) 80.44(3.04) -0.59
Alter_Ensemble 80.67(2.77) 82.38(2.70) 81.51(2.61) +0.48
Per_Tune 80.13(2.37) 82.09(2.42) 81.09(2.18) +0.06
Loc_Tune 79.85(2.30) 81.58(3.02) 80.68(2.19) -0.35
Org_Tune 80.00(2.42) 82.02(2.30) 80.96(2.00) -0.07
Diff_Tune 79.27(3.43) 81.79(1.77) 80.49(2.44) -0.54
Same_Tune 79.38(2.24) 81.73(2.12) 80.52(1.96) -0.51
Tune_Ensemble 80.37(2.23) 82.13(2.55) 81.22(1.97) +0.19
Per_Domain 80.30(2.43) 82.35(2.57) 81.29(2.07) +0.26
Loc_Domain 79.71(2.88) 82.35(2.33) 80.99(2.26) -0.04
Org_Domain 80.25(2.73) 82.26(2.06) 81.22(2.14) +0.19
Diff_Domain 79.71(4.17) 81.76(3.31) 80.69(3.46) -0.34
Same_Domain 79.61(3.04) 81.88(2.36) 80.70(2.32) -0.33
Domain_Ensemble 80.82(3.34) 82.41(3.19) 81.58(2.96) +0.55
Table 4  Results of transferring different source tasks to the target task (standard deviations in parentheses)
Transfer learning method   DiffLabel   SameLabel   Per     Loc     Org
Method B                   80.58●■★    80.44●■★    81.15   80.84   81.13
Method C                   80.49●■★    80.52●■★    81.09   80.68   80.96
Method D                   80.69●■★    80.70●■★    81.29   80.99   81.22
Table 5  F1 values of different transfer learning methods on the target dataset
Transfer learning method   Per      Loc      Org      Ensemble
Method B                   81.15●   80.84●   81.13●   81.51
Method C                   81.09●   80.68●   80.96●   81.22
Method D                   81.29●   80.99●   81.22●   81.58
Table 6  F1 values of different transfer learning methods combined with ensemble learning on the target dataset
Method                                  Precision/%   Recall/%      F1/%          Improvement/%
No-Transfer                             80.04(3.12)   82.11(2.63)   81.03(2.41)   -
Alter_Ensemble                          80.67(2.77)   82.38(2.70)   81.51(2.61)   +0.48
CDTL-PSE:Alter (feature augmentation)   80.71(2.73)   82.51(2.45)   81.59(2.41)   +0.56
CDTL-PSE:Alter (Bi-LSTM)                81.39(3.61)   82.58(2.29)   81.94(2.49)   +0.91
Tune_Ensemble                           80.37(2.23)   82.13(2.55)   81.22(1.97)   +0.19
CDTL-PSE:Tune (feature augmentation)    80.26(2.42)   82.41(2.64)   81.30(2.09)   +0.27
CDTL-PSE:Tune (Bi-LSTM)                 81.06(2.53)   82.19(1.88)   81.61(1.93)   +0.58
Domain_Ensemble                         80.82(3.34)   82.41(3.19)   81.58(2.96)   +0.55
BERT-Bi-LSTM-CRF                        80.03(3.09)   83.84(2.12)   81.86(2.14)   +0.83
Table 7  Results of different transfer learning methods (standard deviations in parentheses)
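The Precision/Recall/F1 figures reported in the tables are the entity-level scores standard for sequence-tagging evaluation: a predicted skill word counts as correct only when both its span boundaries and its type match the gold annotation exactly. A minimal sketch, assuming BIO tags; the function names and the example sequences are illustrative, not the paper's evaluation code:

```python
def bio_spans(tags):
    """Extract (start, end, type) entity spans from a BIO tag sequence."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing span
        if tag == "O" or tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, etype))
                start = None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]  # tolerate an I- tag without a preceding B-
    return set(spans)

def entity_f1(gold, pred):
    """Entity-level precision, recall, and F1 over exact span+type matches."""
    g, p = bio_spans(gold), bio_spans(pred)
    tp = len(g & p)
    prec = tp / len(p) if p else 0.0
    rec = tp / len(g) if g else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

gold = ["B-SKILL", "I-SKILL", "O", "B-SKILL"]
pred = ["B-SKILL", "I-SKILL", "O", "O"]
print(entity_f1(gold, pred))  # (1.0, 0.5, 0.6666666666666666)
```

Real evaluation scripts (e.g. the CoNLL `conlleval` family) add handling for malformed tag transitions; the simplification here only affects ill-formed sequences.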
Transfer learning method   Ensemble only   Feature augmentation   Bi-LSTM adaptation layer
Method B                   81.51●■         81.59                  81.94
Method C                   81.22●■         81.30                  81.61
Table 8  F1 values of different methods on the same dataset
Fig. 5  F1 values of CDTL-PSE and the baselines with different amounts of target-domain training data
Fig. 6  Top 50 skill words recognized by CDTL-PSE for the machinery manufacturing industry
Fig. 7  Top 50 skill words recognized by CDTL-PSE for the fashion design industry