Data Analysis and Knowledge Discovery  2022, Vol. 6 Issue (5): 34-43    DOI: 10.11925/infotech.2096-3467.2021.0958
Item Categorization Algorithm Based on Improved Text Representation
Tu Zhenchao, Ma Jing
College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Abstract  

[Objective] This paper proposes a new model to address a weakness of traditional text classifiers, which tend to misclassify commodity titles that carry different labels but share similar modifiers. [Methods] First, we designed a text discriminator as an auxiliary task, using the normalized Euclidean distance between text vectors with different labels as its loss function. Then, we applied the cross-entropy loss function of traditional text classification to the new text encoder. Finally, we generated text representations that discriminate sufficiently between different categories of commodity texts, and constructed the ITR-BiLSTM-Attention model. [Results] Compared with the BiLSTM-Attention model without the text discriminator, the proposed model improved accuracy, precision, recall and F1 by 1.84, 2.31, 2.88 and 2.82 percentage points, respectively. Compared with the Cos-BiLSTM-Attention model, it improved accuracy, precision, recall and F1 by 0.53, 0.54, 1.21 and 1.01 percentage points, respectively. [Limitations] The impact of different sampling methods on the model was not tested, and no experiments were conducted on a larger data set. [Conclusions] The text discriminator auxiliary task designed in this paper improves the text representation generated by the text encoder. The item categorization model based on the improved text representation is more effective than the traditional models.

Key words: Text Classification; Text Representation; Multitasking Learning; Metric Learning; Item Categorization
Received: 07 July 2021      Published: 21 June 2022
CLC Number: TP391
Fund: National Natural Science Foundation of China (72174086)
Corresponding Author: Ma Jing, ORCID: 0000-0001-8472-2581, E-mail: majing5525@126.com

Cite this article:

Tu Zhenchao, Ma Jing. Item Categorization Algorithm Based on Improved Text Representation. Data Analysis and Knowledge Discovery, 2022, 6(5): 34-43.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2021.0958     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2022/V6/I5/34

Framework of ITR-BiLSTM-Attention
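
The abstract describes training a text encoder with two objectives: a text discriminator that scores the normalized Euclidean distance between text vectors with different labels, and the usual cross-entropy classification loss. The following is a minimal sketch of how such a joint loss could be computed, not the authors' implementation. Several things are assumptions: "normalized Euclidean distance" is read here as the Euclidean distance between L2-normalized vectors, the hinge form of the discriminator loss is one common metric-learning choice, and the names `margin`, `aux_weight` and all function names are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def normalized_euclidean(u, v):
    """Euclidean distance between the L2-normalized vectors; lies in [0, 2]."""
    u, v = l2_normalize(u), l2_normalize(v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def discriminator_loss(u, v, margin=1.0):
    """Assumed hinge form: penalize different-label vectors closer than `margin`."""
    return max(0.0, margin - normalized_euclidean(u, v))


def cross_entropy(probs, label):
    """Negative log-probability assigned to the true class."""
    return -math.log(probs[label])


def joint_loss(probs, label, vec, other_label_vec, aux_weight=0.5, margin=1.0):
    """Classification loss plus a weighted auxiliary discriminator loss.

    `aux_weight` and `margin` are hypothetical hyperparameters for this sketch.
    """
    aux = discriminator_loss(vec, other_label_vec, margin)
    return cross_entropy(probs, label) + aux_weight * aux


if __name__ == "__main__":
    # Two text vectors with different labels that point in similar directions
    # are penalized; vectors that are far apart are not.
    print(discriminator_loss([1.0, 0.0], [0.9, 0.1]))
    print(discriminator_loss([1.0, 0.0], [-1.0, 0.0]))
```

In this reading, the auxiliary term only contributes while vectors from different categories remain close, which matches the stated goal of separating titles that share modifiers but belong to different categories.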
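Commodity text (segmented) | Category
baby first-birthday hundred-days birthday children birthday-party foil balloon decorations | Balloon
summer cool slippers home couple floor non-slip soft-sole bath bathroom slippers | Slippers
polka-dot print bow-tie lace-up long-sleeve shirt women straight slimming pullover blouse | Blouse
natural malachite bracelet Buddhist-beads bracelet new-style | Bracelet
custom European-American fashion versatile wear necklace clavicle chain | Necklace
new-arrival autumn new-style fashion commuter straight-cut flowing tie-collar long-sleeve ladies white-shirt | Shirt
anti rainstorm tent outdoor person family person couple two-person camping tent | Tent
new-style womenswear ethnic style artsy embroidered flowing wide-trouser hem culottes wide-leg-pants | Wide-leg pants
trendsetter women's new-arrival women sequin sexy waist-cinched clothes-trousers one-piece trousers | Jumpsuit
concert black sunglasses sun-glasses polygonal metal hollow-out men-women myopia | Sunglasses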
Part of the Data After Preprocessing
Parameter | Value
LSTM hidden_size | 64
Batch_size | 128
Epoch | 20
Optimizer | Adam
Dropout | 0.25
Learning Rate | 0.001
δ | 0.35
Parameter Setting
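
For readers who want to reuse the settings above, they can be collected in a small configuration object. This is only an illustrative sketch: the class name `TrainConfig` and the field names are hypothetical, and the role of δ is not stated on this page, so it is stored under the neutral name `delta` without interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values copied from the parameter table; names are illustrative."""
    hidden_size: int = 64          # LSTM hidden_size
    batch_size: int = 128
    epochs: int = 20
    optimizer: str = "Adam"
    dropout: float = 0.25
    learning_rate: float = 0.001
    delta: float = 0.35            # δ in the table; its role is not given here


config = TrainConfig()
print(config)
```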
Model | Acc | P | R | F1
TextCNN | 67.85% | 61.10% | 58.04% | 57.98%
TextRCNN | 92.61% | 91.49% | 89.89% | 90.09%
BiLSTM-Attention | 92.31% | 90.59% | 89.50% | 89.32%
ITR-TextCNN | 68.01% | 61.69% | 58.45% | 58.32%
ITR-TextRCNN | 92.73% | 91.39% | 89.87% | 89.91%
ITR-BiLSTM-Attention | 94.15% | 92.90% | 92.38% | 92.14%
Cos-BiLSTM-Attention | 93.62% | 92.36% | 91.17% | 91.13%
Algorithm Performance
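
The improvements quoted in the abstract can be checked against the table above as plain differences between rows. The short sketch below recomputes them; the dictionary `RESULTS` and the helper `gain` are hypothetical names, and the scores are copied from the table.

```python
# Scores in percent (Acc, P, R, F1), copied from the performance table.
RESULTS = {
    "BiLSTM-Attention": (92.31, 90.59, 89.50, 89.32),
    "ITR-BiLSTM-Attention": (94.15, 92.90, 92.38, 92.14),
    "Cos-BiLSTM-Attention": (93.62, 92.36, 91.17, 91.13),
}


def gain(model, baseline):
    """Absolute difference per metric, in percentage points."""
    pairs = zip(RESULTS[model], RESULTS[baseline])
    return tuple(round(m - b, 2) for m, b in pairs)


print(gain("ITR-BiLSTM-Attention", "BiLSTM-Attention"))      # (1.84, 2.31, 2.88, 2.82)
print(gain("ITR-BiLSTM-Attention", "Cos-BiLSTM-Attention"))  # (0.53, 0.54, 1.21, 1.01)
```

The differences are absolute (percentage points), not relative changes, and they match the figures reported in the abstract.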