Review of Keyword Extraction Studies
Hu Shaohu, Zhang Yingyi, Zhang Chengzhi
School of Economics and Management, Nanjing University of Science & Technology, Nanjing 210094, China

Abstract [Objective] This paper reviews the methods, features and evaluation procedures of keyword extraction research, aiming to provide a reference for future studies. [Coverage] We searched the Web of Science, DBLP, Engineering Index, Google Scholar, CNKI and Wanfang Data with queries such as “Keyword Extraction”, “Keyword Generation”, “Keyphrase Extraction”, and “Keyphrase Generation”. A total of 89 representative publications were retrieved. [Methods] First, we analyzed the development of keyword extraction techniques. Then, we summarized related studies from the perspectives of research methods, features and evaluation process. [Results] Keyword extraction methods have gradually shifted from feature-driven models to data-driven models with the development of machine learning, but they still face problems such as data labeling and evaluation criteria. [Limitations] We mainly examined mainstream methods for keyword extraction. [Conclusions] This paper summarizes the development trends of keyword extraction methods, as well as the disadvantages of existing evaluation mechanisms.

Received: 10 October 2020
Published: 24 November 2020
Fund: National Natural Science Foundation of China (72074113)
Corresponding Author: Zhang Chengzhi, E-mail: zhangcz@njust.edu.cn