Reducing Dimensions of Custom Declaration Texts with Word2Vec
Gong Lijuan, Wang Hao, Zhang Zixuan, Zhu Liping
School of Information Management, Nanjing University, Nanjing 210023, China
Jiangsu Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China

Abstract [Objective] This study reduces the dimensionality of customs declaration texts in order to improve the efficiency of customs platforms. [Methods] We collected four months of declaration texts from a Chinese customs office as the corpus. We then evaluated the quality of the word vectors at the word level, in terms of word similarity and relevance. We also combined three traditional feature representations (the 0-1 matrix, frequency-based reduction, and information gain) with the SVM algorithm. Finally, we compared the results of these methods with the performance of Word2Vec word vectors. [Results] Word2Vec word vectors are an effective dimension reduction method for customs declaration texts. Classification performance was highest when the word vector dimension reached 500, with an accuracy of 93.01%. [Limitations] We studied only the five categories with the largest data volume. [Conclusions] The proposed method significantly reduces feature dimensions while preserving data accuracy and integrity.
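The contrast between a 0-1 representation and a word-vector representation can be illustrated with a minimal sketch. The abstract does not state how word vectors are combined into a text-level feature, so the averaging step below is an assumption, and the tiny 3-dimensional embeddings are hand-made, hypothetical values rather than trained Word2Vec output.

```python
from typing import Dict, List


def binary_doc_vector(tokens: List[str], vocab: List[str]) -> List[int]:
    """0-1 representation: one dimension per vocabulary word."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocab]


def mean_word_vector(
    tokens: List[str], embeddings: Dict[str, List[float]], dim: int
) -> List[float]:
    """Average the vectors of known tokens (assumed composition step)."""
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        return [0.0] * dim  # no known token: fall back to a zero vector
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical, hand-made embeddings; a real model would be trained on the corpus.
EMBEDDINGS = {
    "cotton": [0.9, 0.1, 0.0],
    "shirt": [0.7, 0.3, 0.0],
    "steel": [0.0, 0.2, 0.9],
    "pipe": [0.1, 0.1, 0.8],
}
VOCAB = sorted(EMBEDDINGS)

doc = ["cotton", "shirt"]
binary = binary_doc_vector(doc, VOCAB)            # length grows with the vocabulary
dense = mean_word_vector(doc, EMBEDDINGS, dim=3)  # length fixed by the embedding size
print(len(binary), len(dense))
```

In this sketch the 0-1 vector has one entry per vocabulary word, so its length grows with the corpus, while the averaged vector keeps the fixed embedding dimension (500 in the best-performing setting reported above). Either vector could then be passed to a classifier such as an SVM.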
Received: 05 June 2019
Published: 26 April 2020
Corresponding Authors:
Hao Wang
E-mail: ywhaowang@nju.edu.cn