Reducing Dimensions of Customs Declaration Texts with Word2Vec
Gong Lijuan, Wang Hao, Zhang Zixuan, Zhu Liping
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
[Objective] This study reduces the dimensionality of customs declaration texts, aiming to improve the efficiency of customs platforms. [Methods] We collected declaration texts from a Chinese customs office over four months as the corpus. We then evaluated the quality of the word vectors from the microscopic perspectives of word similarity and relevance. We also combined the traditional 0-1 matrix, frequency-based reduction, and information gain with the SVM algorithm. Finally, we compared the results of these methods with the performance of Word2Vec word vectors. [Results] Word2Vec word vectors are an effective dimension reduction method for customs declaration texts: classification efficiency was highest when the word vector dimension reached 500, with an accuracy of 93.01%. [Limitations] We studied only the five categories with larger data volumes. [Conclusions] The proposed method significantly reduces feature dimensions while preserving data accuracy and integrity.
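As a rough illustration of two of the traditional methods named above, the following sketch builds a 0-1 document-term matrix and then reduces it by keeping the terms with the highest information gain. The toy texts, category labels, function names, and the cutoff `k` are all hypothetical and chosen only for illustration; this is not the authors' implementation, and the SVM and Word2Vec steps are not shown.

```python
import math
from collections import Counter

# Toy tokenized declaration texts with hypothetical category labels.
docs = [
    (["cotton", "shirt", "men"], "textile"),
    (["cotton", "fabric", "woven"], "textile"),
    (["steel", "pipe", "seamless"], "metal"),
    (["steel", "sheet", "cotton"], "metal"),
]


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(term, docs):
    """Reduction in label entropy from knowing whether `term` occurs in a text."""
    labels = [label for _, label in docs]
    present = [label for tokens, label in docs if term in tokens]
    absent = [label for tokens, label in docs if term not in tokens]
    conditional = sum(
        len(part) / len(docs) * entropy(part) for part in (present, absent) if part
    )
    return entropy(labels) - conditional


vocabulary = sorted({tok for tokens, _ in docs for tok in tokens})

# 0-1 matrix: one row per text, one column per vocabulary term.
binary_matrix = [
    [1 if term in tokens else 0 for term in vocabulary] for tokens, _ in docs
]

# Dimension reduction: keep only the k terms with the highest information gain.
k = 2  # assumed cutoff, for illustration only
ranked = sorted(vocabulary, key=lambda t: information_gain(t, docs), reverse=True)
selected = ranked[:k]
reduced_matrix = [
    [1 if term in tokens else 0 for term in selected] for tokens, _ in docs
]
```

In this toy corpus, "steel" separates the two categories perfectly, so it ranks first, while "cotton" appears in both categories and scores much lower. The rows of `reduced_matrix` could then be passed to any classifier in place of the full 0-1 matrix.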