Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (7): 34-41    DOI: 10.11925/infotech.2096-3467.2018.1048
Research Paper
Identifying Commodity Names Based on XGBoost Model *
Xiaofeng Li 1, Jing Ma 1 (corresponding author), Chi Li 2, Hengmin Zhu 3
1 (College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
2 (Alibaba Zhejiang Cainiao Supply Chain Management Co., Ltd., Hangzhou 311100, China)
3 (College of Economics and Management, Nanjing University of Posts and Telecommunications, Nanjing 210046, China)
Abstract

[Objective] To meet the need for automatic category identification when items are listed on Taobao, this paper formulates the problem of commodity name recognition in the e-commerce domain. [Methods] Using a large volume of commodity transaction data obtained through a partner, we built an e-commerce commodity description dataset and labeled it manually. We then applied a supervised machine learning algorithm based on the XGBoost model to recognize commodity names in short commodity-description texts. [Results] On the final dataset of 20,059 records covering 816 kinds of commodities, the algorithm achieved a precision of 85% and a recall of 87%. [Limitations] Commodity coverage is incomplete; the range of commodity types and the number of descriptions in the corpus can be further enriched. [Conclusions] This study attempts to solve commodity name recognition in e-commerce with a machine learning algorithm. The experiments indicate that the algorithm is sound and effective and has practical application value.

Key words: E-Commerce; Product Description; Product Name Recognition; XGBoost; Feature Extraction
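Received: 2018-09-20
CLC number: TP391.1; G35
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China General Program project "Research on Adaptive Topic Tracking Methods for Online Public Opinion Based on Evolutionary Ontology" (71373123); the Fundamental Research Funds for the Central Universities, Forward-Looking Development Strategy Research Funding Project "Research on Government Management Paradigms for Cross-Border E-Commerce Based on Big Data Technology" (NW2018004); and the National Natural Science Foundation of China project "Research on Prediction and Intervention of Public Opinion Propagation Trends Based on Main Path Networks: Taking Public Opinion in Social Media as the Object" (71874088).
Corresponding author: Jing Ma     E-mail: majing5525@126.com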
Cite this article:
Xiaofeng Li, Jing Ma, Chi Li, Hengmin Zhu. Identifying Commodity Names Based on XGBoost Model *[J]. Data Analysis and Knowledge Discovery, 2019, 3(7): 34-41. DOI: 10.11925/infotech.2096-3467.2018.1048.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1048
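Figure 1  Network structure of the CBOW model
Figure 2  A simple example of the boosting tree algorithm
Figure 3  E-commerce commodity description dataset after preprocessing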
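The figure for the boosting tree example is not reproduced on this page, so the following is only a minimal sketch of the general idea behind tree boosting in the XGBoost style: each round fits a small tree to the gradients of the current predictions and adds it to the ensemble. It assumes a one-dimensional feature, squared-error loss, depth-1 trees (stumps), and the regularised leaf weight w = -G / (H + lambda). The function names, toy data, and hyperparameter values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Stump:
    threshold: float
    left: float    # contribution added when x <= threshold
    right: float   # contribution added when x > threshold

    def predict(self, x):
        return self.left if x <= self.threshold else self.right


def leaf_weight(g_sum, h_sum, lam):
    # Regularised optimal leaf weight: w = -G / (H + lambda)
    return -g_sum / (h_sum + lam)


def fit_stump(xs, grads, hess, lam, eta):
    """Choose the threshold with the largest second-order split gain.

    Assumes xs contains at least two distinct values.
    """
    def score(g_sum, h_sum):
        return g_sum ** 2 / (h_sum + lam)

    g_all, h_all = sum(grads), sum(hess)
    best_gain, best_stump = None, None
    for t in sorted(set(xs))[:-1]:          # the largest value cannot split
        g_left = sum(g for x, g in zip(xs, grads) if x <= t)
        h_left = sum(h for x, h in zip(xs, hess) if x <= t)
        g_right, h_right = g_all - g_left, h_all - h_left
        gain = score(g_left, h_left) + score(g_right, h_right) - score(g_all, h_all)
        if best_gain is None or gain > best_gain:
            best_gain = gain
            # Leaf weights are stored already scaled by the learning rate eta.
            best_stump = Stump(t,
                               eta * leaf_weight(g_left, h_left, lam),
                               eta * leaf_weight(g_right, h_right, lam))
    return best_stump


def boost(xs, ys, rounds=20, eta=0.3, lam=1.0):
    """Additive training: each new stump corrects the current predictions."""
    trees, preds = [], [0.0] * len(xs)
    for _ in range(rounds):
        grads = [p - y for p, y in zip(preds, ys)]   # gradient of 0.5 * (p - y)^2
        hess = [1.0] * len(xs)                       # second derivative of that loss
        stump = fit_stump(xs, grads, hess, lam, eta)
        trees.append(stump)
        preds = [p + stump.predict(x) for p, x in zip(preds, xs)]
    return trees


def predict(trees, x):
    # The ensemble prediction is the sum of all stump contributions.
    return sum(tree.predict(x) for tree in trees)


# Toy data (hypothetical): two plateaus separated between x = 3 and x = 4.
toy_xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
toy_ys = [1.0, 1.0, 1.0, 5.0, 5.0, 5.0]
toy_model = boost(toy_xs, toy_ys)
```

A classifier over hundreds of commodity names would use a multi-class loss and deeper trees; the sketch only shows the additive, gradient-driven structure that the boosting family shares.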
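Parameter   Value   Description
sentence    /       Corpus to be trained on, supplied as a list
size        50      Dimensionality of the learned feature vectors
window      30      Maximum distance between the current word and the predicted word within a sentence
sg          0       sg=0 selects the CBOW model; sg=1 selects the skip-gram model
Min_count   2       Words occurring fewer than Min_count times are discarded
Table 1  Word2Vec parameter settings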
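To make the roles of window and Min_count in Table 1 concrete, here is a small standard-library sketch of how CBOW training examples could be assembled: rare words are dropped first, then every remaining word becomes a prediction target whose context is the set of neighbours within the window. This is an illustration under simplifying assumptions, not the authors' code; the function name cbow_pairs and the toy sentences are hypothetical, and real Word2Vec implementations add details omitted here, such as subsampling of frequent words and random window shrinking.

```python
from collections import Counter


def cbow_pairs(sentences, window=30, min_count=2):
    """Build (context words, target word) pairs as used by CBOW.

    sentences: list of tokenised sentences (each a list of words).
    window:    maximum distance between the target and a context word.
    min_count: words seen fewer than this many times are discarded.
    """
    counts = Counter(word for sentence in sentences for word in sentence)
    pairs = []
    for sentence in sentences:
        # Drop rare words before forming contexts.
        kept = [word for word in sentence if counts[word] >= min_count]
        for i, target in enumerate(kept):
            context = kept[max(0, i - window):i] + kept[i + 1:i + 1 + window]
            if context:                     # a lone word has nothing to predict from
                pairs.append((context, target))
    return pairs


# Toy corpus (hypothetical): three tokenised product descriptions.
toy_sentences = [
    ["red", "cotton", "shirt"],
    ["blue", "cotton", "shirt"],
    ["cotton", "shirt", "sale"],
]
toy_pairs = cbow_pairs(toy_sentences, window=1, min_count=2)
```

Because product descriptions are short, a window of 30 as in Table 1 would in most cases cover the entire description, so every other word in the description serves as context for the target word.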