Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (2/3): 223-230    DOI: 10.11925/infotech.2096-3467.2019.0719
New Sample Selection Algorithm with Textual Data*
(Chinese title, translated: Filter and Embedded Sample Selection Algorithms Based on Text Data)
Liu Shurui, Tian Jidong (corresponding author), Chen Puchun, Lai Li, Song Guojie
School of Science, Southwest Petroleum University, Chengdu 610500, China
Abstract

[Objective] This paper aims to reduce the amount of training data needed for text models and to shorten model training time. [Methods] We propose a new filter-type sample selection algorithm based on covariance estimation, and we apply research findings on the forgettability of data to an embedded sample selection algorithm. [Results] In training a Chinese reading comprehension model, the proposed algorithms cut model training time by at least 50%. Compared with the classic Term Frequency-Inverse Document Frequency (TF-IDF) algorithm, the mini-batch covariance estimation algorithm and the forgetting algorithm raised recall by 0.018 and 0.012, and the F-score by 0.017 and 0.029. [Limitations] Reducing the training data has some effect on the model's accuracy. [Conclusions] The proposed algorithms reduce training time and improve the evaluation metrics. Because the computation depends only on the current batch, they are suitable for parallel processing of large-scale datasets.

Keywords: Sample Selection; Covariance Estimation; Forgetting Algorithm
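Received: 2019-06-20
CLC number: TP391.1
Funding: *This work is one of the research outcomes of the National Natural Science Foundation of China project "High-Precision Finite-Difference Algorithms for Fractional-Order Viscous Seismic Wavefields" (No. 41674141).
Corresponding author: Tian Jidong     E-mail: 746602272@qq.com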
Cite this article:
Liu Shurui, Tian Jidong, Chen Puchun, Lai Li, Song Guojie. New Sample Selection Algorithm with Textual Data[J]. Data Analysis and Knowledge Discovery, 2020, 4(2/3): 223-230. DOI: 10.11925/infotech.2096-3467.2019.0719.
Link to this article:
http://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2019.0719
Figure 1. Example application of the covariance-estimation sample selection algorithm
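| Data source | Total samples | Positive samples | Positive / total (%) |
|---|---|---|---|
| Teddy Cup Problem C training data | 477,019 | 127,328 | 26.69 |
| NLPCC training data | 181,882 | 9,198 | 5.06 |
| Baidu DuReader_v2.0 | 80,485 | 80,485 | 100.00 |
| Total of the three datasets above | 739,386 | 217,011 | 29.35 |
| Random number 1 dataset | 175,042 | 51,507 | 29.43 |
| Random number 2 dataset | 174,537 | 51,507 | 29.51 |
| Random number 3 dataset | 174,955 | 52,003 | 29.72 |

Table 1. Description of the datasets used in this paper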
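The combined row and the percentage column of Table 1 can be recomputed from the raw counts. The snippet below is a small arithmetic check rather than anything from the paper; the dictionary keys and the helper name `positive_ratio` are illustrative.

```python
# (total samples, positive samples) copied from the three source rows of Table 1.
sources = {
    "Teddy Cup Problem C": (477_019, 127_328),
    "NLPCC": (181_882, 9_198),
    "DuReader_v2.0": (80_485, 80_485),
}


def positive_ratio(total: int, positive: int) -> float:
    """Share of positive samples in percent, rounded to two decimals."""
    return round(100.0 * positive / total, 2)


# Sum the three sources to reproduce the combined row.
total_all = sum(total for total, _ in sources.values())
positive_all = sum(positive for _, positive in sources.values())

for name, (total, positive) in sources.items():
    print(f"{name}: {positive_ratio(total, positive)}%")
print(f"combined: {total_all} samples, {positive_all} positive, "
      f"{positive_ratio(total_all, positive_all)}%")
```

The recomputed totals and ratios agree with the values reported in the table, which shows how strongly the NLPCC source (about 5% positives) is imbalanced compared with the combined set (about 29%).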
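| Parameter | Value |
|---|---|
| Forgetting-algorithm pre-training epochs | 10 |
| Model training epochs | 30 |
| Batch size / covariance-estimation batch size | 128 |
| Optimizer | Mini-batch stochastic gradient descent |
| Dropout | 1.0 |
| Learning rate | 0.1 × 0.5^k (k is the number of decay steps; k = 1, 2, 3) |
| Maximum words per sentence | 100 |
| Hardware | 4 × GTX 1080 GPUs; E5 8-core CPU; 64 GB |

Table 2. Hyperparameters and experimental environment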
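As a worked example of the learning-rate entry in Table 2, the sketch below assumes the entry denotes 0.1 × 0.5^k, that is, the rate is halved at each decay step. The page does not say when each decay is triggered, so only the resulting values are shown; the names `BASE_LR`, `DECAY` and `learning_rate` are illustrative.

```python
BASE_LR = 0.1  # leading factor of the learning-rate entry in Table 2
DECAY = 0.5    # assumed multiplicative decay applied at each step


def learning_rate(k: int) -> float:
    """Learning rate after the k-th decay step, read as 0.1 * 0.5**k."""
    return BASE_LR * DECAY ** k


# Table 2 lists k = 1, 2, 3.
schedule = [learning_rate(k) for k in (1, 2, 3)]
print(schedule)  # [0.05, 0.025, 0.0125]
```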
Figure 2. Number of forgetting events versus learning events on different datasets
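The two quantities compared in Figure 2 can be illustrated with a short sketch. It follows the usual definition from the example-forgetting literature (Toneva et al.): a forgetting event is a transition from correctly classified to misclassified between two consecutive presentations of a sample, and a learning event is the reverse transition. The page does not give the authors' selection rule, so `select_samples`, its `min_forgetting` parameter and the toy histories are hypothetical.

```python
from typing import Dict, List, Tuple


def count_events(correct: List[bool]) -> Tuple[int, int]:
    """Return (forgetting events, learning events) for one sample.

    correct[t] is True when the sample was classified correctly the
    t-th time it was presented during training.
    """
    forgetting = learning = 0
    prev = False  # assumption: a sample counts as unlearned before it is first seen
    for curr in correct:
        if prev and not curr:
            forgetting += 1  # correct -> wrong
        elif curr and not prev:
            learning += 1    # wrong -> correct
        prev = curr
    return forgetting, learning


def select_samples(histories: Dict[str, List[bool]],
                   min_forgetting: int = 1) -> List[str]:
    """Hypothetical rule: keep samples that were forgotten at least
    `min_forgetting` times, plus samples that were never learned."""
    kept = []
    for sample_id, correct in histories.items():
        forgetting, learning = count_events(correct)
        if forgetting >= min_forgetting or learning == 0:
            kept.append(sample_id)
    return kept


# Toy correctness histories over 10 pre-training epochs (made-up data).
histories = {
    "easy": [True] * 10,
    "unstable": [False, True, False, True, True, False, True, True, True, True],
    "hard": [False] * 10,
}
print({name: count_events(h) for name, h in histories.items()})
print(select_samples(histories))
```

In this toy run the "easy" sample is learned once and never forgotten, so the hypothetical rule drops it, while the "unstable" and "hard" samples are kept for training.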
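| Algorithm | Threshold | Accuracy | Recall | F1 |
|---|---|---|---|---|
| TF-IDF | 0.4 | 0.793 | 0.725 | 0.615 |
| TF-IDF | 0.5 | 0.795 | 0.740 | 0.628 |
| TF-IDF | 0.6 | 0.802 | 0.740 | 0.638 |
| TF-IDF | 0.7 | 0.802 | 0.700 | 0.571 |
| Mini-batch covariance estimation | 0.4 | 0.796 | 0.733 | 0.623 |
| Mini-batch covariance estimation | 0.5 | 0.793 | 0.743 | 0.638 |
| Mini-batch covariance estimation | 0.6 | 0.805 | 0.735 | 0.607 |
| Mini-batch covariance estimation | 0.7 | 0.803 | 0.700 | 0.572 |

Table 3. Effect of the threshold on the evaluation metrics of the TF-IDF and mini-batch covariance estimation sample selection algorithms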
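Table 3 can be read as a small threshold sweep. As a minimal sketch, the snippet below copies the reported numbers and picks, for each algorithm, the threshold with the highest value of a chosen metric. The dictionary layout and the helper name `best_threshold` are illustrative and not part of the paper.

```python
# threshold -> (accuracy, recall, F1), copied from Table 3.
table3 = {
    "TF-IDF": {
        0.4: (0.793, 0.725, 0.615),
        0.5: (0.795, 0.740, 0.628),
        0.6: (0.802, 0.740, 0.638),
        0.7: (0.802, 0.700, 0.571),
    },
    "Mini-batch covariance estimation": {
        0.4: (0.796, 0.733, 0.623),
        0.5: (0.793, 0.743, 0.638),
        0.6: (0.805, 0.735, 0.607),
        0.7: (0.803, 0.700, 0.572),
    },
}


def best_threshold(rows: dict, metric_index: int = 2) -> float:
    """Threshold with the highest value of the chosen metric
    (0 = accuracy, 1 = recall, 2 = F1); ties go to the lower threshold."""
    return max(rows, key=lambda threshold: rows[threshold][metric_index])


for name, rows in table3.items():
    print(f"{name}: best F1 at threshold {best_threshold(rows)}")
```

On these numbers the F1-optimal threshold is 0.6 for TF-IDF and 0.5 for mini-batch covariance estimation. The 0.5 rows of both algorithms coincide with their random number 1 rows in Table 4, which suggests, without the page stating it, that the comparison there was run at a threshold of 0.5.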
Figure 3. Time spent on sample selection versus model training time
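| Algorithm | Dataset | Accuracy | Recall | F1 |
|---|---|---|---|---|
| No selection algorithm | Random number 1 | 0.802 | 0.750 | 0.653 |
| No selection algorithm | Random number 2 | 0.815 | 0.745 | 0.644 |
| No selection algorithm | Random number 3 | 0.819 | 0.755 | 0.660 |
| Random perturbation | Random number 1 | 0.786 | 0.731 | 0.623 |
| Random perturbation | Random number 2 | 0.805 | 0.714 | 0.597 |
| Random perturbation | Random number 3 | 0.804 | 0.711 | 0.587 |
| TF-IDF | Random number 1 | 0.795 | 0.740 | 0.628 |
| TF-IDF | Random number 2 | 0.804 | 0.707 | 0.585 |
| TF-IDF | Random number 3 | 0.805 | 0.706 | 0.585 |
| Forgetting algorithm | Random number 1 | 0.797 | 0.740 | 0.637 |
| Forgetting algorithm | Random number 2 | 0.805 | 0.714 | 0.579 |
| Forgetting algorithm | Random number 3 | 0.800 | 0.736 | 0.631 |
| Mini-batch covariance estimation | Random number 1 | 0.793 | 0.743 | 0.638 |
| Mini-batch covariance estimation | Random number 2 | 0.794 | 0.729 | 0.623 |
| Mini-batch covariance estimation | Random number 3 | 0.798 | 0.737 | 0.623 |

Table 4. Evaluation metrics of the different sample selection algorithms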
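The improvements over TF-IDF quoted in the abstract can be approximately recovered from Table 4. The sketch below assumes they are differences between per-algorithm means over the three random datasets; the page does not state how they were computed, and the helper name `averages` is illustrative.

```python
from statistics import mean

# (accuracy, recall, F1) on random number datasets 1-3, copied from Table 4.
table4 = {
    "TF-IDF": [(0.795, 0.740, 0.628), (0.804, 0.707, 0.585), (0.805, 0.706, 0.585)],
    "Forgetting algorithm": [(0.797, 0.740, 0.637), (0.805, 0.714, 0.579), (0.800, 0.736, 0.631)],
    "Mini-batch covariance estimation": [(0.793, 0.743, 0.638), (0.794, 0.729, 0.623), (0.798, 0.737, 0.623)],
}


def averages(rows):
    """Column-wise mean of (accuracy, recall, F1) rows."""
    return tuple(mean(column) for column in zip(*rows))


baseline = averages(table4["TF-IDF"])
for name in ("Forgetting algorithm", "Mini-batch covariance estimation"):
    _, recall, f1 = averages(table4[name])
    print(f"{name}: recall {recall - baseline[1]:+.4f}, F1 {f1 - baseline[2]:+.4f}")
```

Under this assumption the recomputed gains agree with the abstract's figures to within about 0.001, which is consistent with the table entries having been rounded to three decimals. By this reading, the recall gains of 0.018 and 0.012 and the F1 gain of 0.029 belong to mini-batch covariance estimation, forgetting, and mini-batch covariance estimation in that order, while the F1 gain of 0.017 belongs to the forgetting algorithm.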
Figure 4. F1 scores of the different sample selection algorithms