Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (7): 34-41    DOI: 10.11925/infotech.2096-3467.2018.1048
Identifying Commodity Names Based on XGBoost Model
Xiaofeng Li1, Jing Ma1, Chi Li2, Hengmin Zhu3
1(College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
2(Alibaba Zhejiang Rookie Supply Chain Management Co., Ltd., Hangzhou 311100, China)
3(College of Economics and Management, Nanjing University of Posts and Telecommunications, Nanjing 210046, China)
[Objective] This paper tries to automatically identify commodity names from product descriptions, aiming to classify items sold on Taobao. [Methods] First, we retrieved a large number of transaction records from Taobao. Second, we built an e-commerce commodity description dataset and labeled it manually. Third, we developed a supervised machine learning algorithm based on the XGBoost model to extract names from product descriptions. [Results] The precision and recall of the algorithm were 85% and 87%, respectively, for 816 different items from 20,059 records. [Limitations] The categories of commodities in the test corpus need to be expanded. [Conclusions] The machine learning algorithm is an effective way to identify product names.

Key words: E-Commerce, Product Description, Product Name Recognition, XGBoost, Feature Extraction
Received: 20 September 2018      Published: 06 September 2019
Xiaofeng Li,Jing Ma,Chi Li,Hengmin Zhu. Identifying Commodity Names Based on XGBoost Model. Data Analysis and Knowledge Discovery, 2019, 3(7): 34-41.

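Parameter   Value   Description
sentence    /       Corpus to train on, supplied as a list
size        50      Dimensionality of the trained feature vectors
window      30      Maximum distance between the current word and the predicted word within a sentence
sg          0       sg=0 uses the CBOW model; sg=1 uses the skip-gram model
Min_count   2       Words whose frequency is below Min_count are discarded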
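
The window, sg and Min_count settings in the table above resemble the parameters of a Word2Vec-style embedding model; the source does not name the library, so that reading is an assumption. Under it, the sketch below illustrates in plain Python how the window and Min_count values would shape the (context, target) examples a CBOW model (sg=0) trains on. The function name `cbow_pairs` and the toy corpus are hypothetical, and the sketch omits the vector training itself as well as refinements such as frequent-word subsampling.

```python
from collections import Counter


def cbow_pairs(sentences, window=30, min_count=2):
    """Build (context_words, target_word) pairs as a CBOW model would consume them.

    Illustrative only: no vectors are trained here.
    """
    # Count how often each word occurs across the whole corpus.
    freq = Counter(word for sent in sentences for word in sent)
    pairs = []
    for sent in sentences:
        # Discard words whose frequency is below min_count.
        kept = [w for w in sent if freq[w] >= min_count]
        for i, target in enumerate(kept):
            # Context is up to `window` words on each side of the target.
            left = kept[max(0, i - window):i]
            right = kept[i + 1:i + 1 + window]
            context = left + right
            if context:
                pairs.append((context, target))
    return pairs


# Toy, made-up tokenized product descriptions (not from the paper's dataset).
corpus = [
    ["new", "cotton", "shirt", "men"],
    ["cotton", "shirt", "women", "summer"],
]
pairs = cbow_pairs(corpus, window=30, min_count=2)
for context, target in pairs:
    print(context, "->", target)
```

With a window of 30, a short product description would likely fit entirely inside the context of each of its words, so every retained word serves as context for every other word in the same description.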
