Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (7): 34-41    DOI: 10.11925/infotech.2096-3467.2018.1048
Identifying Commodity Names Based on XGBoost Model
Xiaofeng Li1, Jing Ma1, Chi Li2, Hengmin Zhu3
1(College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
2(Alibaba Zhejiang Rookie Supply Chain Management Co., Ltd., Hangzhou 311100, China)
3(College of Economics and Management, Nanjing University of Posts and Telecommunications, Nanjing 210046, China)

[Objective] This paper aims to automatically identify commodity names in product descriptions in order to classify items sold on Taobao. [Methods] First, we retrieved a large number of transaction records from Taobao. Second, we built an e-commerce commodity description dataset and labeled it manually. Third, we developed a supervised machine learning algorithm based on the XGBoost model to extract commodity names from product descriptions. [Results] The precision and recall of the algorithm were 85% and 87%, respectively, for 816 different items from 20,059 records. [Limitations] The commodity categories in the test corpus need to be expanded. [Conclusions] The proposed machine learning algorithm is an effective way to identify product names.
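
The abstract reports precision and recall without showing how the two are computed or combined. The minimal sketch below illustrates the standard definitions. The function names and the counts in the first example are hypothetical, and the F1 value in the second example is derived here from the two reported figures; it is not a number stated in the abstract.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only (not taken from the paper).
p, r = precision_recall(tp=8, fp=2, fn=2)
f1_counts = f1_score(p, r)

# F1 implied by the precision (85%) and recall (87%) reported in the abstract.
f1_reported = f1_score(0.85, 0.87)
print(round(f1_reported, 4))
```

Precision measures how many of the extracted names are correct, while recall measures how many of the true names were found, so a single combined score is only meaningful when both are reported, as they are here.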

Key words: E-Commerce; Product Description; Product Name Recognition; XGBoost; Feature Extraction
Received: 20 September 2018      Published: 06 September 2019
CLC Number: TP391.1 G35
Corresponding Author: Jing Ma

Cite this article:

Xiaofeng Li, Jing Ma, Chi Li, Hengmin Zhu. Identifying Commodity Names Based on XGBoost Model. Data Analysis and Knowledge Discovery, 2019, 3(7): 34-41.


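Parameter    Value    Description
sentence     /        Training corpus, supplied as a list
size         50       Dimensionality of the learned feature vectors
window       30       Maximum distance between the current word and the predicted word within a sentence
sg           0        sg=0 uses the CBOW model; sg=1 uses the skip-gram model
Min_count    2        Words with a frequency lower than Min_count are discarded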
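
To make the roles of window, sg and Min_count concrete, the sketch below builds the (context, target) pairs a CBOW model (sg=0) would be trained on. It is a simplified, standard-library illustration and not the authors' code: the function name and the toy corpus are hypothetical, and it always uses the full window, whereas Word2vec implementations also shrink the window at random for each position. If the table refers to gensim's Word2Vec class (an assumption), the corresponding arguments are spelled `sentences` and `min_count`, and `size` is named `vector_size` from gensim 4.0 onward.

```python
from collections import Counter


def cbow_pairs(sentences, window, min_count):
    """Return (context_words, target_word) pairs as a CBOW model would see them."""
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        # Words rarer than min_count are dropped before any windows are formed,
        # so they appear neither as targets nor as context.
        kept = [w for w in sentence if counts[w] >= min_count]
        for i, target in enumerate(kept):
            left = kept[max(0, i - window):i]
            right = kept[i + 1:i + 1 + window]
            context = left + right
            if context:
                pairs.append((context, target))
    return pairs


# Hypothetical, already-segmented product descriptions (not from the paper's dataset).
corpus = [
    ["summer", "cotton", "dress", "women"],
    ["cotton", "dress", "long", "sleeve"],
]

pairs = cbow_pairs(corpus, window=30, min_count=2)
for context, target in pairs:
    print(context, "->", target)
```

With a window of 30, which exceeds the length of most product descriptions, every remaining word in a description effectively serves as context for every other word, so the window limit rarely has any effect on short texts.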