Identifying Commodity Names Based on XGBoost Model
Xiaofeng Li1, Jing Ma1, Chi Li2, Hengmin Zhu3
1(College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China) 2(Alibaba Zhejiang Rookie Supply Chain Management Co., Ltd., Hangzhou 311100, China) 3(College of Economics and Management, Nanjing University of Posts and Telecommunications, Nanjing 210046, China)
[Objective] This paper aims to automatically identify commodity names in product descriptions so that items sold on Taobao can be classified. [Methods] First, we retrieved a large number of transaction records from Taobao. Second, we built an e-commerce commodity description dataset and labeled it manually. Third, we developed a supervised machine learning algorithm based on the XGBoost model to extract commodity names from product descriptions. [Results] For 816 different items drawn from 20,059 records, the algorithm achieved a precision of 85% and a recall of 87%. [Limitations] The range of commodity categories in the test corpus needs to be expanded. [Conclusions] Machine learning algorithms are an effective way to identify product names.
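As a reading aid for the reported precision and recall, the following minimal Python sketch shows one way such figures could be computed for name extraction. It assumes exact-match scoring with one labeled name per record, which the abstract does not confirm. The function name `precision_recall` and the toy records are hypothetical and are not taken from the paper's corpus.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over per-record name predictions.

    predicted, gold: lists of equal length. Each element is the extracted
    (or manually labeled) commodity name for one record, or None if absent.
    Assumption: a prediction counts as correct only if it equals the label.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    # Correct extractions: a name was predicted and it matches the label.
    true_pos = sum(1 for p, g in zip(predicted, gold) if p is not None and p == g)
    n_pred = sum(1 for p in predicted if p is not None)  # names the model returned
    n_gold = sum(1 for g in gold if g is not None)       # names in the labeled data
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall


# Hypothetical toy records for illustration only.
gold = ["phone case", "desk lamp", "running shoes", "water bottle"]
predicted = ["phone case", "lamp", "running shoes", None]
precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Under this scoring, precision penalizes wrong names that the model returns, while recall penalizes labeled names that the model misses or gets wrong. A different matching rule, such as partial or token-level matching, would give different figures.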