1 Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
2 China National Institute of Standardization, Beijing 100191, China
[Objective] This paper builds a domain dictionary for defective products, aiming to help users better understand the latest developments in specific domains. [Methods] First, we extracted domain-related phrases from the corpus using word frequency features. Then, we used the TF-IDF algorithm to reduce the manual labeling work. Finally, we proposed a Convolutional Neural Network (CNN) model that uses semantic and position information to generate the domain dictionary. [Results] Compared with the statistical learning method, our model improved the accuracy, recall, and F1 values by 6% to 9%. [Limitations] More research is needed to examine our method in other fields. [Conclusions] The proposed CNN-based method can effectively construct a dictionary for defective products.
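The TF-IDF step named in the Methods can be illustrated with a minimal sketch. It assumes that candidate phrases have already been extracted and that each document is a list of such phrases. The function name `tfidf_scores` and the toy data are hypothetical and not taken from the paper, and the plain tf × log(N/df) weighting shown here may differ from the variant the authors used.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {phrase: tf-idf} dict per document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each phrase.
    df = Counter(phrase for doc in docs for phrase in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            # Term frequency times inverse document frequency.
            phrase: (count / total) * math.log(n_docs / df[phrase])
            for phrase, count in counts.items()
        })
    return scores


# Toy, made-up candidate phrases (not from the paper's corpus).
docs = [
    ["brake failure", "recall notice", "brake failure"],
    ["recall notice", "airbag defect"],
    ["recall notice", "battery overheating"],
]
scores = tfidf_scores(docs)
# Highest-scoring phrase of the first document.
top_phrase = max(scores[0], key=scores[0].get)
```

In this toy data, a phrase that occurs in every document gets a score of zero, so only phrases that distinguish a document rank highly. Under the assumptions above, ranking candidates this way would let annotators review only the top-ranked ones.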