Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (5): 34-43     https://doi.org/10.11925/infotech.2096-3467.2021.0958
Item Categorization Algorithm Based on Improved Text Representation
Tu Zhenchao,Ma Jing()
College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China

Abstract

[Objective] This paper proposes a new model to improve traditional text classifiers, which tend to misclassify commodity titles that belong to different labels but share many similar modifiers. [Methods] First, we designed a text discriminator as an auxiliary task, whose loss function is the normalized Euclidean distance between text vectors of different labels. Then, we combined it with the cross-entropy loss function of the main text classification task, pushing the text encoder to generate sufficiently discriminative representations for commodity texts of different categories, and constructed the ITR-BiLSTM-Attention model. [Results] Compared with the BiLSTM-Attention baseline without the text discriminator, the proposed model improved accuracy, precision, recall, and F1 by 1.84, 2.31, 2.88, and 2.82 percentage points, respectively. Compared with the Cos-BiLSTM-Attention model, whose text discriminator uses a cosine-similarity loss function, it improved the four metrics by 0.53, 0.54, 1.21, and 1.01 percentage points, respectively. [Limitations] The impact of different sampling strategies on the model was not tested, and experiments were not conducted on a broader range of datasets. [Conclusions] The text discriminator auxiliary task designed in this paper does improve the text representations generated by the text encoder, and the resulting item categorization model based on improved text representation outperforms traditional item categorization algorithms.
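The combined objective can be sketched compactly. Below is a minimal PyTorch sketch of the multi-task loss described above, assuming the auxiliary term takes a margin-based hinge form over the normalized Euclidean distance between text vectors of different labels; the weighting factor alpha, the hinge form, and all function names are illustrative assumptions rather than the paper's published code, and the margin is set to the δ = 0.35 listed in Table 2 (which we take to be the discriminator margin).

```python
import torch
import torch.nn.functional as F

def discriminator_loss(z_a, z_b, delta=0.35):
    """Auxiliary text-discriminator loss: pushes apart text vectors
    whose labels differ. Both vectors are L2-normalized first, so their
    Euclidean distance (the "normalized Euclidean distance") lies in
    [0, 2]; pairs closer than the margin `delta` are penalized."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    dist = torch.norm(z_a - z_b, dim=-1)
    return torch.clamp(delta - dist, min=0).pow(2).mean()

def multitask_loss(logits, labels, z_a, z_b, alpha=0.5, delta=0.35):
    """Main-task cross entropy plus the weighted auxiliary term.
    z_a and z_b are encoder outputs for pairs of titles known to
    carry different labels."""
    ce = F.cross_entropy(logits, labels)         # main classification task
    aux = discriminator_loss(z_a, z_b, delta)    # text-discriminator task
    return ce + alpha * aux
```

Because the auxiliary gradient flows back through the shared encoder, titles from different categories are driven at least `delta` apart in the representation space even when they share many modifiers.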

Key words: Text Classification    Text Representation    Multi-task Learning    Metric Learning    Item Categorization
Received: 2021-07-07      Published: 2022-06-21
CLC Number: TP391
Funding: Supported by the General Program of the National Natural Science Foundation of China (Grant No. 72174086)
Corresponding author: Ma Jing, ORCID: 0000-0001-8472-2581, E-mail: majing5525@126.com
Cite this article:
Tu Zhenchao, Ma Jing. Item Categorization Algorithm Based on Improved Text Representation. Data Analysis and Knowledge Discovery, 2022, 6(5): 34-43.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2021.0958      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2022/V6/I5/34
Fig. 1  Framework of the ITR-BiLSTM-Attention model
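The figure itself is not reproduced on this page. As a rough sketch of the encoder the framework builds on, the following PyTorch module implements a standard BiLSTM encoder with additive attention pooling, using the hidden_size of 64 and the dropout of 0.25 reported in Table 2; the embedding dimension, the attention parameterization, and all names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BiLSTMAttentionEncoder(nn.Module):
    """Encodes a sequence of token ids into a single text vector."""

    def __init__(self, vocab_size, embed_dim=128, hidden_size=64, dropout=0.25):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden_size,
                              batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_size, 1)   # scores each time step
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids):                      # (B, T)
        h, _ = self.bilstm(self.embedding(token_ids))  # (B, T, 2H)
        weights = torch.softmax(self.attn(h), dim=1)   # attention over time
        text_vec = (weights * h).sum(dim=1)            # pooled vector (B, 2H)
        return self.dropout(text_vec)
```

The pooled text vector is what the classification head consumes for the main task and what the text discriminator compares for the auxiliary task.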
Commodity Text | Category
baby first-birthday hundred-days birthday children birthday-party aluminum-foil balloon decorations | Balloon
summer cool slippers home couple floor anti-slip soft-sole bathing bathroom slippers | Slippers
polka-dot print bow-tie lace-up long-sleeve shirt women straight-cut slimming pullover blouse | Blouse
natural malachite bracelet prayer-beads bracelet new-style | Bracelet
custom Western fashion versatile wear necklace collarbone chain | Necklace
new-arrival autumn new-style fashion commuter straight-cut streamer collar long-sleeve ladies white-shirt | Shirt
rainstorm-proof tent outdoor person family person couple two-person camping tent | Tent
new-style womenswear ethnic style artsy embroidered flowing wide trouser hem culottes wide-leg-pants | Wide-leg pants
influencer women's-style new-arrival women sequin sexy cinched-waist outfit one-piece pants | Jumpsuit
concert black sunglasses sun-glasses polygonal metal hollow-out unisex myopia | Sunglasses
Table 1  A sample of the preprocessed data (each title is a sequence of segmented tokens)
Parameter | Value
LSTM hidden_size | 64
Batch size | 128
Epochs | 20
Optimizer | Adam
Dropout | 0.25
Learning rate | 0.001
δ | 0.35
Table 2  Model parameter settings
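For concreteness, the sketch below shows how the settings in Table 2 would plug into one training step, reusing the encoder and loss sketches above; the vocabulary size, class count, sequence length, and random stand-in batch are hypothetical.

```python
import torch

VOCAB, NUM_CLASSES, SEQ_LEN = 50_000, 12, 16   # hypothetical sizes

encoder = BiLSTMAttentionEncoder(vocab_size=VOCAB, hidden_size=64, dropout=0.25)
classifier = torch.nn.Linear(2 * 64, NUM_CLASSES)   # text vector -> label logits
optimizer = torch.optim.Adam(                       # Adam, lr = 0.001 (Table 2)
    list(encoder.parameters()) + list(classifier.parameters()), lr=0.001)

# One illustrative step on random ids standing in for a batch of 128 titles;
# training would loop over the 20 epochs given in Table 2.
ids = torch.randint(0, VOCAB, (128, SEQ_LEN))    # anchor titles
diff = torch.randint(0, VOCAB, (128, SEQ_LEN))   # titles sampled with different labels
labels = torch.randint(0, NUM_CLASSES, (128,))

logits = classifier(encoder(ids))
loss = multitask_loss(logits, labels, encoder(ids), encoder(diff), delta=0.35)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```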
Model | Acc | P | R | F1
TextCNN | 67.85% | 61.10% | 58.04% | 57.98%
TextRCNN | 92.61% | 91.49% | 89.89% | 90.09%
BiLSTM-Attention | 92.31% | 90.59% | 89.50% | 89.32%
ITR-TextCNN | 68.01% | 61.69% | 58.45% | 58.32%
ITR-TextRCNN | 92.73% | 91.39% | 89.87% | 89.91%
ITR-BiLSTM-Attention | 94.15% | 92.90% | 92.38% | 92.14%
Cos-BiLSTM-Attention | 93.62% | 92.36% | 91.17% | 91.13%
Table 3  Performance comparison of the algorithms
Related Articles
[1] Ye Han, Sun Haichun, Li Xin, Jiao Kainan. Long Text Classification Model Combining Attention Mechanism and Sentence Vector Compression[J]. Data Analysis and Knowledge Discovery, 2022, 6(6): 84-94.
[2] Chen Guo, Ye Chao. News Classification in Fine-Grained Domains Combining Semi-Supervised Learning and Active Learning[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 28-38.
[3] Xiao Yuejun, Li Honglian, Zhang Le, Lyu Xueqiang, You Xindong. Classifying Chinese Patent Texts with Feature Fusion[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 49-59.
[4] Yang Lin, Huang Xiaoshuo, Wang Jiayang, Ding Lingling, Li Zixiao, Li Jiao. Identifying Disease Subtypes in Clinical Trials Based on BERT-TextCNN[J]. Data Analysis and Knowledge Discovery, 2022, 6(4): 69-81.
[5] Xu Yuemei, Fan Zuwei, Cao Han. Multi-Task Text Classification Model Based on a Label Embedding Attention Mechanism[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 105-116.
[6] Yu Chuanming, Lin Hongjun, Zhang Zhengang. Joint Extraction Model for Entities and Events Based on Multi-Task Deep Learning[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 117-128.
[7] Tong Xinyu, Zhao Ruijie, Lu Yonghe. Multi-Label Patent Classification Based on Pre-trained Models[J]. Data Analysis and Knowledge Discovery, 2022, 6(2/3): 129-137.
[8] Huang Xuejian, Liu Yuyang, Ma Tinghuai. Academic Paper Classification Model Based on an Improved Graph Neural Network[J]. Data Analysis and Knowledge Discovery, 2022, 6(10): 93-102.
[9] Xie Xingyu, Yu Bengong. Classifying E-Commerce Review Texts Based on MFFMB[J]. Data Analysis and Knowledge Discovery, 2022, 6(1): 101-112.
[10] Chen Jie, Ma Jing, Li Xiaofeng. Short Text Classification Incorporating Text Features from Pre-trained Models[J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 21-30.
[11] Zhou Zeyu, Wang Hao, Zhao Zibo, Li Yueyan, Zhang Xiaoqin. Construction and Application of a GCN Text Classification Model Incorporating Association Information[J]. Data Analysis and Knowledge Discovery, 2021, 5(9): 31-41.
[12] Yang Hanxun, Zhou Dequn, Ma Jing, Luo Yongcong. Multi-Task Rumor Detection Based on an Uncertainty Loss Function and Task-Level Attention Mechanism[J]. Data Analysis and Knowledge Discovery, 2021, 5(7): 101-110.
[13] Yu Bengong, Zhu Xiaojie, Zhang Ziwei. Capsule Network Text Classification Based on Multi-Level Feature Extraction[J]. Data Analysis and Knowledge Discovery, 2021, 5(6): 93-102.
[14] Zhou Zhichao. Review of Automatic Citation Classification Based on Machine Learning[J]. Data Analysis and Knowledge Discovery, 2021, 5(12): 14-24.
[15] Wang Yan, Wang Huyan, Yu Bengong. Chinese Text Classification Based on Multi-Feature Fusion[J]. Data Analysis and Knowledge Discovery, 2021, 5(10): 1-14.