Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 89-100    DOI: 10.11925/infotech.2096-3467.2019.0613
Reducing Dimensions of Custom Declaration Texts with Word2Vec
Gong Lijuan,Wang Hao(),Zhang Zixuan,Zhu Liping
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
Abstract  

[Objective] This study reduces the feature dimension of customs declaration texts, aiming to improve the efficiency of customs platforms. [Methods] We collected four months of declaration texts from a Chinese customs office as the corpus. We then evaluated the quality of the Word2Vec word vectors at the microscopic level, in terms of word similarity and word relevance. As baselines, we combined the traditional 0-1 matrix, frequency-based reduction and information gain with the SVM algorithm. Finally, we compared the results of these methods with the performance of the Word2Vec word vectors. [Results] Word2Vec word vectors are an effective dimension reduction method for customs declaration texts. Classification performance was highest when the word vector dimension reached 500, with an accuracy of 93.01%. [Limitations] We only studied the five categories with larger data volumes. [Conclusions] The proposed method significantly reduces feature dimensions while preserving data accuracy and integrity.
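
As a rough illustration of how word vectors shrink the feature space, a declaration text can be mapped to one fixed-length vector by averaging the vectors of its tokens. The abstract does not say how the authors aggregate word vectors, so the mean pooling below, the function name `mean_document_vector`, and the toy 3-dimensional vectors are all assumptions made for illustration.

```python
from typing import Dict, List, Sequence


def mean_document_vector(tokens: Sequence[str],
                         vectors: Dict[str, List[float]],
                         dim: int) -> List[float]:
    """Average the vectors of all known tokens; return zeros if none are known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


# Toy 3-dimensional vectors (hypothetical values, not taken from the paper).
toy_vectors = {
    "plywood": [1.0, 0.0, 2.0],
    "poplar": [3.0, 2.0, 0.0],
}

doc = ["plywood", "poplar", "unseen_token"]  # unknown tokens are skipped
doc_vec = mean_document_vector(doc, toy_vectors, dim=3)
print(doc_vec)
```

Under this assumption, 500-dimensional word vectors would turn every text into a 500-dimensional classifier input, regardless of vocabulary size.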

Key words: Word2Vec; SVM; Automatic Classification; Feature Reduction
Received: 05 June 2019      Published: 26 April 2020
CLC Number: TP393
Corresponding Authors: Hao Wang     E-mail: ywhaowang@nju.edu.cn

Cite this article:

Gong Lijuan,Wang Hao,Zhang Zixuan,Zhu Liping. Reducing Dimensions of Custom Declaration Texts with Word2Vec. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 89-100.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0613     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I2/3/89

[Figure: General Framework]
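Table: Data for Experiments
Field name | Field identifier | Main content
Product name | Goods | Usually the product name or a direct description of the product; must not be empty
Product description | Description | Usually a specific description of the product, such as size, raw materials, composition, or use; may be empty
HS code | HS_id | 10-digit code; the first two digits denote the "chapter"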
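Table: Category Codes and Corresponding Product Types
No. | Category code | Product type
1 | 85 | Electrical machinery and equipment and parts thereof; sound recorders and reproducers, television image and sound recorders and reproducers, and parts and accessories thereof
2 | 84 | Nuclear reactors, boilers, machinery and mechanical appliances, and parts thereof
3 | 39 | Plastics and articles thereof
4 | 90 | Optical, photographic, cinematographic, measuring, checking, medical or surgical instruments and apparatus, precision instruments and apparatus; parts and accessories thereof
5 | 73 | Articles of iron or steel
6 | Other | All product categories other than the five above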
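
The class label described in the two tables above can be derived from the HS code by taking its first two digits (the chapter) and collapsing every chapter outside the five studied ones into "other". A minimal sketch; the function name `category_label` and the sample codes are hypothetical and serve only as illustration:

```python
# Chapters kept as separate classes, per the category table.
TOP_CHAPTERS = {"85", "84", "39", "90", "73"}


def category_label(hs_id: str) -> str:
    """Map a 10-digit HS code to one of the six class labels."""
    code = hs_id.strip()
    if len(code) != 10 or not code.isdigit():
        raise ValueError(f"expected a 10-digit HS code, got {hs_id!r}")
    chapter = code[:2]  # the first two digits denote the chapter
    return chapter if chapter in TOP_CHAPTERS else "other"


# Illustrative codes only; they are not drawn from the paper's corpus.
print(category_label("8517120000"))
print(category_label("4412330000"))
```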
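Table: Data and Results in the Experiment
No. | Category code | Training | Test | Total | Feature dimension | P
1 | 85 | 2,447 | 562 | 3,009 | 7,824 | 92.90%
2 | 84 | 2,363 | 625 | 2,988 | |
3 | 39 | 2,555 | 453 | 3,008 | |
4 | 90 | 2,540 | 447 | 2,987 | |
5 | 73 | 2,626 | 371 | 2,997 | |
6 | Other | 2,469 | 542 | 3,011 | |
Total | | 15,000 | 3,000 | 18,000 | |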
[Figure: Experimental Result of HS Code 1-2 Bits as Classification Mark]
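Table: Similar Words to “Plywood”
Similar word | Similarity | Description
杨木 (poplar wood) | 0.913592 | Mostly co-occurs with "plywood", "wooden", "POPULUS", and "multi-layer"
(blank) | 0.894916 | Mostly co-occurs with "plywood", "wooden", "multi-layer", and "film-faced"
木制 (wooden) | 0.877032 | Mostly co-occurs with "plywood" and "wooden"
桦木 (birch wood) | 0.820524 | Mostly co-occurs with "plywood", "poplar wood", and "log"
覆膜 (film-faced) | 0.818699 | Mostly co-occurs with "plywood"
白杨木 (white poplar wood) | 0.782977 | Mostly co-occurs with "plywood", "multi-layer board", and "poplar wood"
Poplar | 0.777427 | Mostly co-occurs with "poplar wood" and "wooden"
木托盘 (wooden pallet) | 0.775225 | Mostly co-occurs with "plywood", "wooden", and "poplar wood"
松木 (pine wood) | 0.774623 | Mostly co-occurs with "plywood" and "wooden"
白杨树 (white poplar tree) | 0.761025 | Mostly co-occurs with "plywood"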
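
Similarity scores such as those in the table are typically obtained by ranking vocabulary words by the cosine of the angle between their vectors and the query vector. The page does not state which measure the authors used, so cosine similarity is an assumption here, and the 2-dimensional toy vectors are invented for illustration.

```python
import math
from typing import Dict, List, Tuple


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(query: str,
                 vectors: Dict[str, List[float]],
                 topn: int = 10) -> List[Tuple[str, float]]:
    """Return the topn words closest to `query`, best first."""
    q = vectors[query]
    scored = [(w, cosine(q, v)) for w, v in vectors.items() if w != query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:topn]


# Hypothetical toy vectors, not values from the trained model.
toy = {
    "plywood": [1.0, 0.0],
    "poplar": [0.9, 0.1],
    "glove": [0.0, 1.0],
}
for word, score in most_similar("plywood", toy, topn=2):
    print(word, round(score, 6))
```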
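Table: Similar Words to “CJBCO”
Similar word | Similarity | Description
日医 | 0.972201 | Mostly co-occurs with "PVC", "gloves", "for industrial use", "brand", "type", and "other"
MYECO | 0.934574 | Mostly co-occurs with "PVC", "gloves", "brand", "other", and "none"
超轻 (ultra-light) | 0.906915 | Mostly co-occurs with "PVC", "gloves", "for industrial use", "brand", "type", and "other"
褐黄 (brownish yellow) | 0.894948 | Mostly co-occurs with "PVC", "gloves", "none", "brand", and "type"
假花 (artificial flowers) | 0.833605 | Mostly co-occurs with "none" and "other"
SC55 | 0.826973 | Mostly co-occurs with "PVC" and "brand"
安全网 (safety net) | 0.824389 | Mostly co-occurs with "PVC", "none", and "other"
淋浴房 (shower enclosure) | 0.823842 | Mostly co-occurs with "none" and "other"
超薄 (ultra-thin) | 0.823199 | Mostly co-occurs with "PVC", "gloves", "for industrial use", "none", "brand", and "other"
鞋带 (shoelaces) | 0.820205 | 4 records in total; mostly co-occurs with "PVC", "none", "brand", and "other"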
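Table: Similar Words to “Apple”
Similar word | Similarity | Description
Touch | 0.666321 | Mostly co-occurs with "apple", "iPod", and "iOS"
苹果汁 (apple juice) | 0.630865 | Mostly co-occurs with "apple"
iPod | 0.629926 | Mostly co-occurs with "apple", "iPod", and "iOS"
Letv | 0.623023 | Mostly co-occurs with "mobile phone", "telephone set", "TD", "LTE", and "communication"
(blank) | 0.614639 | Mostly co-occurs with "apple"
果粒 (fruit pieces) | 0.611279 | Mostly co-occurs with "beverage" and other fruit names, such as "pineapple" and "grape"
MAX470 | 0.597105 | Mostly co-occurs with "Letv"
草莓 (strawberry) | 0.595512 | No obvious co-occurring words
西番莲 (passion fruit) | 0.591664 | Mostly co-occurs with "beverage" and some fruit names
X522 | 0.583301 | Mostly co-occurs with "Letv"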
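Table: Subordinate Relation Between Words
Relationship | Example1 | Example2
菜刀-厨具 (kitchen knife - kitchenware)
菜刀-厨具 (kitchen knife - kitchenware)
菜刀-厨具 (kitchen knife - kitchenware)
镊子-手术器械 (tweezers - surgical instruments)
镊子-手术器械 (tweezers - surgical instruments)
镊子-手术器械 (tweezers - surgical instruments)
支架-固定装置 (bracket - fixing device)
支架-固定装置 (bracket - fixing device)
支架-固定装置 (bracket - fixing device)
沙发-休息 (sofa - rest)
羽毛球-羽毛球拍 (badminton - badminton racket)
苹果-芒 (apple - mango)
玻璃杯-餐桌 (glass cup - dining table)
沙发-架子 (sofa - rack)
洗洁精-厨房 (dish detergent - kitchen)
哑铃-健身 (dumbbell - fitness)
椅子-沙发 (chair - sofa)
羽毛球-野营 (badminton - camping)
哑铃-铃片 (dumbbell - weight plate)
坐垫-椅 (seat cushion - chair)
手套-劳保 (gloves - labor protection)
羽毛球-羽毛球拍 (badminton - badminton racket)
轮胎-子午线 (tire - radial)
哑铃-健身 (dumbbell - fitness)
坐垫-座椅 (seat cushion - seat)
沙发-软垫 (sofa - soft cushion)
螺丝-螺钉 (screw - screw)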
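Table: “Commodity-Commodity Brand” Relationship
Relationship | Example1 | Example2
牙膏-高露洁 (toothpaste - Colgate)
牙膏-高露洁 (toothpaste - Colgate)
牙膏-高露洁 (toothpaste - Colgate)
手机-苹果 (mobile phone - Apple)
牙膏-狮王 (toothpaste - Lion)
漱口水-那氏 (mouthwash - 那氏)
纸尿裤-花王 (diapers - Kao)
纸尿裤-花王 (diapers - Kao)
纸尿裤-花王 (diapers - Kao)
卫生巾-花王 (sanitary pads - Kao)
麦克风-受话器 (microphone - receiver)
桌子-客厅 (table - living room)
电脑-笔记本电脑 (computer - laptop)
漱口水-口腔 (mouthwash - oral cavity)
牙膏-狮王 (toothpaste - Lion)
洗衣机-滚筒 (washing machine - drum)
牙膏-刷牙 (toothpaste - tooth brushing)
卫生巾-MERRIES (sanitary pads - MERRIES)
纸尿裤-王牌 (diapers - 王牌)
坐垫-座椅 (seat cushion - seat)
床垫-填充物 (mattress - filling)
床垫-垫子 (mattress - pad)
洗衣粉-漱口水 (laundry powder - mouthwash)
手机-移动电话 (mobile phone - mobile telephone)
牙刷-牙齿 (toothbrush - teeth)
毛巾-盥洗 (towel - washing)
电脑-计算机 (computer - computer)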
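Table: “Commodity-Commodity Use” Relationship
Relationship | Example1 | Example2
牙刷-刷牙 (toothbrush - tooth brushing)
牙刷-刷牙 (toothbrush - tooth brushing)
牙刷-刷牙 (toothbrush - tooth brushing)
毛巾-盥洗 (towel - washing)
毛巾-盥洗 (towel - washing)
毛巾-盥洗 (towel - washing)
沙发-休息 (sofa - rest)
手套-劳保 (gloves - labor protection)
手套-劳保 (gloves - labor protection)
毛巾-盥洗 (towel - washing)
纸尿裤-花王 (diapers - Kao)
支架-底座 (bracket - base)
洗洁精-洁厕 (dish detergent - toilet cleaning)
牙膏-狮王 (toothpaste - Lion)
灯-照明用 (lamp - for lighting)
日光灯-吊灯 (fluorescent lamp - pendant lamp)
洗洁精-除菌 (dish detergent - sterilization)
烤箱-蒸汽 (oven - steam)
洗洁精-厨房 (dish detergent - kitchen)
漱口水-清洁 (mouthwash - cleaning)
沙发-架子 (sofa - rack)
文具-厨具 (stationery - kitchenware)
手套-浸胶 (gloves - rubber-dipped)
坐垫-椅 (seat cushion - chair)
毛巾-盥洗 (towel - washing)
毛巾-健身 (towel - fitness)
灯具-照明 (lighting fixture - lighting)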
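Table: “Commodity-Commodity Composition / Material” Relationship
Relationship | Example1 | Example2
胶合板-杨木 (plywood - poplar wood)
胶合板-杨木 (plywood - poplar wood)
胶合板-杨木 (plywood - poplar wood)
胶合板-杨木 (plywood - poplar wood)
假花-塑料 (artificial flowers - plastic)
假花-塑料 (artificial flowers - plastic)
T恤衫-针织 (T-shirt - knitted)
T恤衫-针织 (T-shirt - knitted)
手套-乳胶 (gloves - latex)
裤子-马甲 (trousers - vest)
桌子-椅子 (table - chair)
杯子-餐具 (cup - tableware)
短袜-无袖 (socks - sleeveless)
胶合板-Paulownia (plywood - Paulownia)
胶合板-杨 (plywood - poplar)
棉签-硬管 (cotton swab - rigid tube)
拼板-南洋 (jointed board - 南洋)
坐垫-椅 (seat cushion - chair)
毛巾-盥洗 (towel - washing)
假花-KD53624B2 (artificial flowers - KD53624B2)
沙发-休息 (sofa - rest)
手套-雨衣 (gloves - raincoat)
拼板-楹 (jointed board - 楹)
床单-被套 (bed sheet - duvet cover)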
[Figure: Product Classification Results in 1,000 Dimensions after Word Vectorization]
[Figure: Word Vectors of Different Dimensions on Classification Accuracy]
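Table: Data and Results of TF Experiments
No. | Category code | Training | Test | Total | Feature dimension | Overall accuracy
1 | 85 | 2,162 | 498 | 2,660 | 506 | 84.33%
2 | 84 | 2,048 | 573 | 2,621 | |
3 | 39 | 2,356 | 425 | 2,781 | |
4 | 90 | 2,221 | 343 | 2,564 | |
5 | 73 | 2,181 | 407 | 2,588 | |
6 | Other | 2,268 | 504 | 2,772 | |
Total | | 13,236 | 2,750 | 15,986 | |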
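
Frequency-based reduction can be sketched as keeping only the terms whose corpus frequency reaches a threshold and then encoding each text as a 0-1 vector over the surviving terms. The exact rule the authors applied is not given on this page; the threshold `min_count`, the helper names, and the toy documents below are assumptions.

```python
from collections import Counter
from typing import List, Sequence


def select_by_frequency(docs: Sequence[Sequence[str]], min_count: int) -> List[str]:
    """Return the sorted terms that occur at least `min_count` times in the corpus."""
    counts = Counter(token for doc in docs for token in doc)
    return sorted(term for term, c in counts.items() if c >= min_count)


def binary_vector(doc: Sequence[str], vocab: Sequence[str]) -> List[int]:
    """Encode one document as a 0-1 vector over the reduced vocabulary."""
    present = set(doc)
    return [1 if term in present else 0 for term in vocab]


# Hypothetical tokenized declaration texts.
toy_docs = [
    ["pvc", "glove"],
    ["pvc", "glove", "brand"],
    ["plywood", "poplar"],
    ["plywood", "pvc"],
]
vocab = select_by_frequency(toy_docs, min_count=2)
print(vocab)
print(binary_vector(toy_docs[2], vocab))
```

Raising `min_count` shrinks the vocabulary further, at the cost of dropping rare but possibly discriminative terms.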
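Table: Data and Results of IG Experiments
No. | Category code | Training | Test | Total | Feature dimension | Overall accuracy
1 | 85 | 2,159 | 497 | 2,656 | 500 | 84.77%
2 | 84 | 2,029 | 570 | 2,599 | |
3 | 39 | 2,356 | 425 | 2,781 | |
4 | 90 | 2,224 | 343 | 2,567 | |
5 | 73 | 2,181 | 407 | 2,588 | |
6 | Other | 2,262 | 502 | 2,764 | |
Total | | 13,211 | 2,744 | 15,955 | |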
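
Information gain scores a term by how much knowing its presence or absence reduces the entropy of the class label: IG(t) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]. The sketch below uses this standard definition for a binary term feature; the function names and the four toy documents are hypothetical, and the paper may have used a different variant.

```python
import math
from collections import Counter
from typing import List, Sequence, Set


def entropy(labels: Sequence[str]) -> float:
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(docs: Sequence[Set[str]], labels: Sequence[str], term: str) -> float:
    """Entropy reduction obtained by splitting the documents on presence of `term`."""
    n = len(labels)
    if n == 0:
        return 0.0
    with_term: List[str] = [y for d, y in zip(docs, labels) if term in d]
    without_term: List[str] = [y for d, y in zip(docs, labels) if term not in d]
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - conditional


# Hypothetical documents labeled with HS chapters 39 and 73.
toy_docs = [{"pvc", "glove"}, {"pvc", "pipe"}, {"steel", "pipe"}, {"steel", "bolt"}]
toy_labels = ["39", "39", "73", "73"]
for term in ("pvc", "pipe"):
    print(term, information_gain(toy_docs, toy_labels, term))
```

In this toy corpus "pvc" separates the two classes perfectly while "pipe" appears equally in both, so a selector that keeps the highest-scoring terms would retain "pvc" and drop "pipe".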
[Figure: Effect of Different Dimensionality Reduction Methods on Classification]