|
|
New Sample Selection Algorithm with Textual Data |
Liu Shurui,Tian Jidong(),Chen Puchun,Lai Li,Song Guojie |
School of Science, Southwest Petroleum University, Chengdu 610500, China |
|
|
Abstract [Objective] This paper aims to reduce the amount of textual training data and shorten the training time of our models.[Methods] We proposed a new filtering algorithm for sample selection based on covariance estimator and then applied the data forgettable property to the embedded algorithm.[Results] In the training of model for Chinese reading comprehension, the two proposed algorithms reduced more than 50% training time. Compared with the Term Frequency-Inverse Document Frequency algorithm, our new algorithms increased the recall rate and F-score evaluation index by 0.018 and 0.012, 0.017 and 0.029, respectively.[Limitations] More training data is needed to improve the accuracy evaluation index of the model.[Conclusions] Our algorithms reduce model’s training time and improve the evaluation index. They are also suitable for large-scale data set paralleloperations.
|
Received: 20 June 2019
Published: 26 April 2020
|
|
Corresponding Authors:
Jidong Tian
E-mail: 746602272@qq.com
|
[1] |
刘艺, 曹建军, 刁兴春 , 等. 特征选择稳定性研究综述[J]. 软件学报, 2018,29(9):2559-2579.
|
[1] |
( Liu Yi, Cao Jianjun, Diao Xingchun , et al. Survey on Stability of Feature Selection[J]. Journal of Software, 2018,29(9):2559-2579.)
|
[2] |
Franc V, Hlaváč V . Greedy Algorithm for a Training Set Reduction in the Kernel Methods [C]// Proceedings of the 10th International Conference on Computer Analysis of Images and Patterns, Groningen, The Netherlands. Berlin Heidelberg:Springer-Verlag, 2003: 426-433.
|
[3] |
Lü Y, Huang J, Liu Q . Improving Statistical Machine Translation Performance by Training Data Selection and Optimization [C]// Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL),Prague, Czech Republic. 2007: 343-350.
|
[4] |
Yasuda K, Zhang R, Yamamoto H , et al. Method of Selecting Training Data to Build a Compact and Efficient Translation Model [C]// Proceedings of the 3rd International Joint Conference on Natural Language Processing,Hyderabad, India. 2008: 343-350.
|
[5] |
Axelrod A, He X, Gao J . Domain Adaptation via Pseudo In-domain Data Selection [C]// Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Edinburgh, United Kingdom. 2011: 355-362.
|
[6] |
Chang H S, Learned-Miller E , McCallum A. Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA. 2017: 1002-1012.
|
[7] |
Katharopoulos A, Fleuret F . Not All Samples are Created Equal: Deep Learning with Importance Sampling[OL]. arXiv Preprint, arXiv:1803.00942.
|
[8] |
Toneva M, Sordoni A, Combes R T , et al. An Empirical Study of Example Forgetting During Deep Neural Network Learning[OL]. arXiv Preprint, arXiv: 1812.05159.
|
[9] |
Yang M C, Duan N, Zhou M , et al. Joint Relational Embeddings for Knowledge-Based Question Answering [C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar. 2014: 645-650.
|
[10] |
Wang Y, Berant J, Liang P . Building a Semantic Parser Overnight [C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing,China. 2015: 1332-1342.
|
[11] |
Pasupat P, Liang P . Compositional Semantic Parsing on Semi-Structured Tables[OL]. arXiv Preprint, arXiv: 1508.00305.
|
[12] |
Kwiatkowski T, Choi E, Artzi Y , et al. Scaling Semantic Parsers with On-the-fly Ontology Matching [C]// Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, WA, USA. 2013: 1545-1556.
|
[13] |
王彦, 左春, 曾炼 . 旅游自动应答语义模型分析与实践[J]. 计算机系统应用, 2017,26(2):18-24.
|
[13] |
( Wang Yan, Zuo Chun, Zeng Lian . Analysis and Practice of Semantic Model in Tourism Auto-Answering System[J]. Computer Systems & Applications, 2017,26(2):18-24.)
|
[14] |
Bao J, Duan N, Zhou M , et al. Knowledge-based Question Answering as Machine Translation [C]// Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, Maryland, USA. 2014: 967-976.
|
[15] |
Yih S W, Chang M, He X , et al. Semantic Parsing via Staged Query Graph Generation: Question Answering with Knowledge Base [C]// Proceedings of the Joint Conference of the 53rd Annual Meeting of the ACL and the 7th International Joint Conference on Natural Language Processing, Beijing, China. 2015: 1321-1331.
|
[16] |
Yao X . Lean Question Answering over Freebase from Scratch [C]// Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, Denver, Colorado, USA. 2015: 66-70.
|
[17] |
仇瑜, 程力 , Daniyal Alghazzawi . 特定领域问答系统中基于语义检索的非事实型问题研究[J]. 北京大学学报:自然科学版, 2019,55(1):55-64.
|
[17] |
( Qiu Yu, Cheng Li, Daniyal Alghazzawi . Semantic Search on Non-Factoid Questions for Domain-Specific Question Answering Systems[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2019,55(1):55-64.)
|
[18] |
Dong L, Wei F, Zhou M , et al. Question Answering over Freebase with Multi-Column Convolutional Neural Networks [C]// Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing, Beijing, China. 2015: 260-269.
|
[19] |
张越, 杨沐昀, 郑德权 , 等. 面向问答系统的信息检索自动评价方法[J]. 智能计算机与应用, 2019,9(2):262-268.
|
[19] |
( Zhang Yue, Yang Muyun, Zheng Dequan , et al. Derive MAP-Like Metrics from Content Overlap for IR in QA System[J]. Intelligent Computer and Applications, 2019,9(2):262-268.)
|
[20] |
周志华 . 机器学习[M]. 北京: 清华大学出版社, 2016: 247-252.
|
[20] |
( Zhou Zhihua. Machine Learning[M]. Beijing: Tsinghua University Press, 2016: 247-252.)
|
[21] |
安德森 . 多元特定领域问答系统中基于语义检索的非事实型问题研究分析导论[M]. 张润楚, 程轶译.第3版.北京:人民邮电出版社, 2010: 53-56.
|
[21] |
( Anderson T W. . An Introdution to Multivariate Statistical Analysis[M]. Translated by Zhang Runchu, Cheng Yi. The 3rd Edition. Beijing: The People’s Posts and Telecommunications Press, 2010: 53-56.)
|
[22] |
王泳, 胡包钢 . 应用统计方法综合评估核函数分类能力的研究[J]. 计算机学报, 2008,31(6):942-952.
|
[22] |
( Wang Yong, Hu Baogang . A Study on Integrated Evaluating Kernel Classification Performance Using Statistical Methods[J]. Chinese Journal of Computers, 2008,31(6):942-952.)
|
|
Viewed |
|
|
|
Full text
|
|
|
|
|
Abstract
|
|
|
|
|
Cited |
|
|
|
|
|
Shared |
|
|
|
|
|
Discussed |
|
|
|
|