Identifying Commodity Names Based on XGBoost Model
Xiaofeng Li1, Jing Ma1, Chi Li2, Hengmin Zhu3
1(College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
2(Alibaba Zhejiang Rookie Supply Chain Management Co., Ltd., Hangzhou 311100, China)
3(College of Economics and Management, Nanjing University of Posts and Telecommunications, Nanjing 210046, China)
Abstract [Objective] This paper aims to automatically identify commodity names from product descriptions in order to classify items sold on Taobao. [Methods] First, we retrieved a large number of transaction records from Taobao. Second, we built an e-commerce commodity description dataset and labeled it manually. Third, we developed a supervised machine learning algorithm based on the XGBoost model to extract commodity names from product descriptions. [Results] The precision and recall of the algorithm were 85% and 87%, respectively, for 816 different items from 20,059 records. [Limitations] The categories of commodities in the test corpus need to be expanded. [Conclusions] Machine learning algorithms are an effective way to identify product names.
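The abstract reports precision and recall but does not state how they were computed. As a minimal sketch, assuming one extracted name per record and exact-match scoring against the manual label (both assumptions, not details from the paper), the two metrics could be calculated as follows. The function name `precision_recall` and the sample records are hypothetical.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for per-record name extraction.

    predicted[i] is the name extracted for record i, or None when the
    extractor returned nothing; gold[i] is the manually labeled name.
    Exact string match is an assumption made for this illustration.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")

    # Records for which the extractor produced some name.
    extracted = sum(1 for p in predicted if p is not None)
    # Records for which the extracted name equals the manual label.
    correct = sum(
        1 for p, g in zip(predicted, gold) if p is not None and p == g
    )
    # Records that carry a manual label.
    labeled = sum(1 for g in gold if g is not None)

    precision = correct / extracted if extracted else 0.0
    recall = correct / labeled if labeled else 0.0
    return precision, recall


# Hypothetical toy data, not taken from the paper's corpus.
gold = ["dress", "phone case", "kettle", "sneakers"]
predicted = ["dress", "phone", None, "sneakers"]

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

In this toy run, three names are extracted, two of them match the labels, and four records are labeled, so precision is 2/3 and recall is 1/2.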
Received: 20 September 2018
Published: 06 September 2019
Corresponding Authors:
Jing Ma
E-mail: majing5525@126.com