[Objective] This paper aims to reduce the amount of textual training data and shorten the training time of our models. [Methods] We proposed a new filtering algorithm for sample selection based on a covariance estimator, and then applied the data-forgetting property to the embedding algorithm. [Results] In training a model for Chinese reading comprehension, the two proposed algorithms reduced training time by more than 50%. Compared with the Term Frequency-Inverse Document Frequency algorithm, our new algorithms increased the recall rate and F-score evaluation index by 0.018 and 0.012, and by 0.017 and 0.029, respectively. [Limitations] More training data is needed to improve the accuracy evaluation index of the model. [Conclusions] Our algorithms reduce the model's training time and improve its evaluation indices. They are also suitable for parallel operations on large-scale data sets.
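The data-forgetting property mentioned in [Methods] builds on the notion of example forgetting from reference [8]: a sample is "forgotten" when it flips from correctly to incorrectly classified between consecutive training epochs, and rarely forgotten samples are candidates for removal. The sketch below illustrates that selection idea only; the function names, the threshold, and the keep-never-learned rule are our assumptions, not the paper's actual implementation.

```python
def count_forgetting_events(correct_history):
    """Count forgetting events for one sample.

    correct_history: list of 0/1 flags, one per epoch, where 1 means the
    sample was classified correctly at that epoch. A forgetting event is
    a transition from correct (1) to incorrect (0).
    """
    return sum(
        1
        for prev, curr in zip(correct_history, correct_history[1:])
        if prev == 1 and curr == 0
    )


def select_samples(histories, min_events=1):
    """Keep indices of samples forgotten at least `min_events` times.

    Samples that were never classified correctly ("never learned") are
    also kept, since they carry information the model has not absorbed.
    """
    keep = []
    for idx, hist in enumerate(histories):
        never_learned = all(flag == 0 for flag in hist)
        if never_learned or count_forgetting_events(hist) >= min_events:
            keep.append(idx)
    return keep


histories = [
    [0, 1, 1, 1],  # learned once, never forgotten -> dropped
    [1, 0, 1, 0],  # forgotten twice -> kept
    [0, 0, 0, 0],  # never learned -> kept
]
print(select_samples(histories))  # -> [1, 2]
```

Filtering out the rarely forgotten samples shrinks the training set, which is consistent with the more-than-50% training-time reduction reported in [Results].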
[1] Liu Yi, Cao Jianjun, Diao Xingchun, et al. Survey on Stability of Feature Selection[J]. Journal of Software, 2018, 29(9): 2559-2579. (in Chinese)
[2] Franc V, Hlaváč V. Greedy Algorithm for a Training Set Reduction in the Kernel Methods[C]// Proceedings of the 10th International Conference on Computer Analysis of Images and Patterns, Groningen, The Netherlands. Berlin, Heidelberg: Springer-Verlag, 2003: 426-433.
[3] Lü Y, Huang J, Liu Q. Improving Statistical Machine Translation Performance by Training Data Selection and Optimization[C]// Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic. 2007: 343-350.
[4] Yasuda K, Zhang R, Yamamoto H, et al. Method of Selecting Training Data to Build a Compact and Efficient Translation Model[C]// Proceedings of the 3rd International Joint Conference on Natural Language Processing, Hyderabad, India. 2008: 343-350.
[5] Axelrod A, He X, Gao J. Domain Adaptation via Pseudo In-domain Data Selection[C]// Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, United Kingdom. 2011: 355-362.
[6] Chang H S, Learned-Miller E, McCallum A. Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples[C]// Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA. 2017: 1002-1012.
[7] Katharopoulos A, Fleuret F. Not All Samples are Created Equal: Deep Learning with Importance Sampling[OL]. arXiv Preprint, arXiv: 1803.00942.
[8] Toneva M, Sordoni A, Combes R T, et al. An Empirical Study of Example Forgetting During Deep Neural Network Learning[OL]. arXiv Preprint, arXiv: 1812.05159.
[9] Yang M C, Duan N, Zhou M, et al. Joint Relational Embeddings for Knowledge-Based Question Answering[C]// Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. 2014: 645-650.
[10] Wang Y, Berant J, Liang P. Building a Semantic Parser Overnight[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China. 2015: 1332-1342.
[11] Pasupat P, Liang P. Compositional Semantic Parsing on Semi-Structured Tables[OL]. arXiv Preprint, arXiv: 1508.00305.
[12] Kwiatkowski T, Choi E, Artzi Y, et al. Scaling Semantic Parsers with On-the-fly Ontology Matching[C]// Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA. 2013: 1545-1556.
[13] Wang Yan, Zuo Chun, Zeng Lian. Analysis and Practice of Semantic Model in Tourism Auto-Answering System[J]. Computer Systems & Applications, 2017, 26(2): 18-24. (in Chinese)
[14] Bao J, Duan N, Zhou M, et al. Knowledge-based Question Answering as Machine Translation[C]// Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, USA. 2014: 967-976.
[15] Yih S W, Chang M, He X, et al. Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base[C]// Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing, Beijing, China. 2015: 1321-1331.
[16] Yao X. Lean Question Answering over Freebase from Scratch[C]// Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Denver, Colorado, USA. 2015: 66-70.
[17] Qiu Yu, Cheng Li, Daniyal Alghazzawi. Semantic Search on Non-Factoid Questions for Domain-Specific Question Answering Systems[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2019, 55(1): 55-64. (in Chinese)
[18] Dong L, Wei F, Zhou M, et al. Question Answering over Freebase with Multi-Column Convolutional Neural Networks[C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China. 2015: 260-269.
[19] Zhang Yue, Yang Muyun, Zheng Dequan, et al. Derive MAP-Like Metrics from Content Overlap for IR in QA System[J]. Intelligent Computer and Applications, 2019, 9(2): 262-268. (in Chinese)
[20] Anderson T W. An Introduction to Multivariate Statistical Analysis[M]. Translated by Zhang Runchu, Cheng Yi. The 3rd Edition. Beijing: The People's Posts and Telecommunications Press, 2010: 53-56. (in Chinese)
[21] Wang Yong, Hu Baogang. A Study on Integrated Evaluating Kernel Classification Performance Using Statistical Methods[J]. Chinese Journal of Computers, 2008, 31(6): 942-952. (in Chinese)