Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 223-230    DOI: 10.11925/infotech.2096-3467.2019.0719
New Sample Selection Algorithm with Textual Data
Liu Shurui, Tian Jidong, Chen Puchun, Lai Li, Song Guojie
School of Science, Southwest Petroleum University, Chengdu 610500, China
Abstract  

[Objective] This paper aims to reduce the amount of textual training data and shorten model training time. [Methods] We propose a new filtering algorithm for sample selection based on a covariance estimator, and apply the forgettable property of data to an embedded sample selection algorithm. [Results] In training a model for Chinese reading comprehension, the two proposed algorithms reduced training time by more than 50%. Compared with the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm, the new algorithms increased the recall rate and F-score by 0.018 and 0.012, 0.017 and 0.029, respectively. [Limitations] More training data is needed to improve the model's accuracy. [Conclusions] The algorithms reduce the model's training time and improve its evaluation indexes. They are also suitable for parallel operations on large-scale datasets.

Key words: Sample Selection; Covariance Estimator; Forgettable Algorithm
Received: 20 June 2019      Published: 26 April 2020
CLC Number: TP391.1
Corresponding Authors: Jidong Tian     E-mail: 746602272@qq.com

Cite this article:

Liu Shurui,Tian Jidong,Chen Puchun,Lai Li,Song Guojie. New Sample Selection Algorithm with Textual Data. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 223-230.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0719     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I2/3/223

Example of Covariance Estimator Sample Selection Algorithm
Data source                          Total samples   Positive samples   Positive / total
Teddy Cup Problem C training data    477,019         127,328            26.69%
NLPCC training data                  181,882         9,198              5.06%
Baidu DuReader_v2.0                  80,485          80,485             100.00%
Total of the three datasets above    739,386         217,011            29.35%
Random number 1 dataset              175,042         51,507             29.43%
Random number 2 dataset              174,537         51,507             29.51%
Random number 3 dataset              174,955         52,003             29.72%
Description of the Dataset
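Parameter                                         Value
Forgetting algorithm pre-training epochs          10
Model training epochs                             30
Batch size / covariance estimation batch size     128
Optimization algorithm                            Mini-batch stochastic gradient descent
Dropout                                           1.0
Learning rate                                     0.1 × 0.5^k (k is the number of decays; k = 1, 2, 3)
Maximum words per sentence                        100
Hardware                                          4 × GTX 1080 GPUs; E5 8-core CPU, 64 GB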
Hyperparameters and Experimental Environment
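The learning-rate entry in the table above is a simple geometric decay. A minimal sketch of the arithmetic follows; the names `BASE_LR`, `DECAY` and `learning_rate` are illustrative, and the source does not state at which epochs each decay is triggered.

```python
# Illustrative sketch of the decay rule 0.1 x 0.5^k from the hyperparameter table.
# When each decay happens during training is not stated in the source.
BASE_LR = 0.1   # base rate from the table
DECAY = 0.5     # halving factor from the table


def learning_rate(k: int) -> float:
    """Learning rate after k decays (the table lists k = 1, 2, 3)."""
    return BASE_LR * DECAY ** k


for k in (1, 2, 3):
    print(f"k={k}: lr={learning_rate(k):.4f}")
```

Under this reading the rate is halved at each decay, giving 0.05, 0.025 and 0.0125 for k = 1, 2, 3.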
The Amounts of Forgot Events and Learned Events in Different Datasets
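For readers unfamiliar with the terms, a forgetting event is usually defined as a sample going from correctly to incorrectly classified between two consecutive evaluations, and a learning event as the reverse transition (the definition used by Toneva et al.). The sketch below counts both from a per-epoch correctness history. It is an illustration under assumptions, not the authors' implementation: the function names, the per-epoch granularity, the `min_forgets` threshold and the handling of never-learned samples are all hypothetical choices.

```python
from typing import Dict, List, Tuple


def count_events(correct_by_epoch: List[bool]) -> Tuple[int, int]:
    """Return (forgetting_events, learning_events) for one sample.

    Only transitions between consecutive recorded epochs are counted.
    """
    forgot = learned = 0
    for prev, curr in zip(correct_by_epoch, correct_by_epoch[1:]):
        if prev and not curr:
            forgot += 1      # correct -> incorrect
        elif curr and not prev:
            learned += 1     # incorrect -> correct
    return forgot, learned


def select_forgettable(
    history: Dict[str, List[bool]],
    min_forgets: int = 1,             # hypothetical threshold
    keep_never_learned: bool = True,  # assumption: treat never-learned samples as hard
) -> List[str]:
    """Keep samples that were forgotten at least `min_forgets` times."""
    kept = []
    for sample_id, flags in history.items():
        forgot, _ = count_events(flags)
        never_learned = not any(flags)
        if forgot >= min_forgets or (keep_never_learned and never_learned):
            kept.append(sample_id)
    return kept


# Toy correctness histories over four pre-training epochs (made-up data).
history = {
    "s1": [False, True, True, True],     # learned once, never forgotten
    "s2": [True, False, True, False],    # forgotten twice, learned once
    "s3": [False, False, False, False],  # never learned
}
print(select_forgettable(history))
```

Samples that are learned early and never forgotten (like `s1`) are the ones such a scheme would drop, which is how it shrinks the training set.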
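Algorithm                         Threshold   Accuracy   Recall   F1
TF-IDF                            0.4         0.793      0.725    0.615
TF-IDF                            0.5         0.795      0.740    0.628
TF-IDF                            0.6         0.802      0.740    0.638
TF-IDF                            0.7         0.802      0.700    0.571
Small-batch covariance estimator  0.4         0.796      0.733    0.623
Small-batch covariance estimator  0.5         0.793      0.743    0.638
Small-batch covariance estimator  0.6         0.805      0.735    0.607
Small-batch covariance estimator  0.7         0.803      0.700    0.572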
Influence of Different Thresholds on Evaluation Indexes of TF-IDF and the Small Batchsize Covariance Estimator Sample Selection Algorithm
Time Comparison Between the Sample Selection Algorithm and Model Training
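Algorithm                         Dataset           Accuracy   Recall   F1
No selection algorithm            Random number 1   0.802      0.750    0.653
No selection algorithm            Random number 2   0.815      0.745    0.644
No selection algorithm            Random number 3   0.819      0.755    0.660
Random perturbation               Random number 1   0.786      0.731    0.623
Random perturbation               Random number 2   0.805      0.714    0.597
Random perturbation               Random number 3   0.804      0.711    0.587
TF-IDF                            Random number 1   0.795      0.740    0.628
TF-IDF                            Random number 2   0.804      0.707    0.585
TF-IDF                            Random number 3   0.805      0.706    0.585
Forgetting algorithm              Random number 1   0.797      0.740    0.637
Forgetting algorithm              Random number 2   0.805      0.714    0.579
Forgetting algorithm              Random number 3   0.800      0.736    0.631
Small-batch covariance estimator  Random number 1   0.793      0.743    0.638
Small-batch covariance estimator  Random number 2   0.794      0.729    0.623
Small-batch covariance estimator  Random number 3   0.798      0.737    0.623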
Evaluation Indexes of Different Sample Selection Algorithms
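One way to relate the table above to the gains quoted in the abstract is to average each method's Recall and F1 over the three random datasets and subtract the TF-IDF average. The sketch below does this with the table's values. Averaging over the three datasets is an assumption about how the abstract's figures were derived; the source does not state the procedure, and the dictionary keys are illustrative.

```python
from statistics import mean

# Recall and F1 per dataset (random number 1, 2, 3), copied from the table above.
scores = {
    "tfidf":      {"recall": [0.740, 0.707, 0.706], "f1": [0.628, 0.585, 0.585]},
    "forgetting": {"recall": [0.740, 0.714, 0.736], "f1": [0.637, 0.579, 0.631]},
    "covariance": {"recall": [0.743, 0.729, 0.737], "f1": [0.638, 0.623, 0.623]},
}


def mean_gain(method: str, metric: str, baseline: str = "tfidf") -> float:
    """Difference between a method's mean score and the baseline's mean score."""
    return mean(scores[method][metric]) - mean(scores[baseline][metric])


for method in ("covariance", "forgetting"):
    for metric in ("recall", "f1"):
        print(f"{method:<10} {metric:<6} {mean_gain(method, metric):+.4f}")
```

Computed this way, both proposed algorithms improve on TF-IDF by between roughly 0.01 and 0.03 on each metric, in line with the magnitudes reported in the abstract; small differences in the third decimal place may come from rounding.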
F1 Scores of Different Sample Selection Algorithms