Automatically Grading Text Difficulty with Multiple Features <sup>*</sup>

doi:10.11925/infotech.2096-3467.2018.1089

Data Analysis and Knowledge Discovery, 2019, Vol. 3, Issue (7): 103-112 https://doi.org/10.11925/infotech.2096-3467.2018.1089

Application Paper


Yong Cheng¹, Dekuan Xu¹, Xueqiang Lv²

1 (School of Chinese Language and Literature, Ludong University, Yantai 264025, China)
2 (School of Computer Science, Beijing University of Information Technology, Beijing 100192, China)


Abstract

[Objective] To grade the reading difficulty of texts automatically. [Methods] A machine learning approach based on multiple features is used to analyze and classify text difficulty. The features include word-frequency, structural, topic, and deep features, which describe the content of a text from different perspectives. These features are then fused, and automatic grading experiments are run with several classifiers. [Results] Evaluated on reading comprehension texts from secondary-school English examinations, the method reaches an accuracy of 0.88 on the test set, a substantial improvement over traditional reading-grading methods. [Limitations] Because manual annotation is costly, existing reading difficulty datasets are limited in number, size, and granularity of difficulty labels, which restricts the applicability of the method to some extent. [Conclusions] The proposed multiple features improve the machine's ability to analyze and understand reading texts, allowing it to grade reading difficulty on the basis of the text content.

Key words: Multiple Features, Reading Difficulty, Automatic Grading
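
The fusion step described in the abstract can be pictured as concatenating per-text feature vectors from each feature family before they reach a classifier. The following is a minimal sketch under stated assumptions: the function names, the toy vocabulary, the two surface statistics, and the choice of plain concatenation are illustrative and not taken from the paper. Topic and deep features are passed in as precomputed placeholder vectors, since producing them would require a topic model and a neural network.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (a simplification, not the paper's tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequency_features(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in vocabulary]


def structural_features(text):
    """Two simple surface statistics: mean sentence length and mean word length."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = tokenize(text)
    n_sent = len(sentences) or 1
    n_tok = len(tokens) or 1
    return [len(tokens) / n_sent, sum(len(t) for t in tokens) / n_tok]


def fuse(*feature_groups):
    """Fuse feature groups by concatenation (one possible fusion strategy)."""
    fused = []
    for group in feature_groups:
        fused.extend(group)
    return fused


vocabulary = ["the", "cat", "photosynthesis"]  # hypothetical toy vocabulary
text = "The cat sat. The cat ran."
topic_vector = [0.7, 0.3]    # placeholder for topic-model output
deep_vector = [0.1, -0.2]    # placeholder for neural-network output

fused = fuse(
    word_frequency_features(text, vocabulary),
    structural_features(text),
    topic_vector,
    deep_vector,
)
print(len(fused))
```

The fused vector would then be fed to whichever classifier is being compared. In practice the feature groups usually need scaling to comparable ranges before concatenation, which this sketch omits.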

Received: 2018-09-30 Published: 2019-09-06

CLC number (ZTFLH): G353

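Funding: *This work is supported by the General Program of the National Natural Science Foundation of China, "Research on Automatic Detection of Chinese Patent Infringement" (61671070), and by the General Project of Humanities and Social Sciences Research of the Ministry of Education, "Research on Automatic Grading of the Reading Difficulty of Chinese Texts for Primary and Secondary Schools Based on Multi-Feature Fusion" (19YJCZH016).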

Corresponding author: Yong Cheng E-mail: chengokyong@126.com

Cite this article:

Yong Cheng, Dekuan Xu, Xueqiang Lv. Automatically Grading Text Difficulty with Multiple Features[J]. Data Analysis and Knowledge Discovery, 2019, 3(7): 103-112.

Link to this article:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.2096-3467.2018.1089 or https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2019/V3/I7/103

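Figures and tables:

Texts of different reading difficulty
Structural features based on readability formulas
Neural network architecture for extracting deep features
Multi-feature fusion method
Top 30 feature words at each level
Performance comparison after selecting different structural features
Topic distributions and corresponding topic words for middle-school and high-school texts
Effect of different hyperparameters on network performance
Comparison of single-type features across multiple classifiers
Results of the multi-feature fusion experiments
Comparison with existing methods
Text grading recognition results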
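
As an illustration of what structural features derived from readability formulas can look like, the sketch below computes two classic indices that need no syllable counting: the Automated Readability Index and the Coleman-Liau index. Which formulas the paper actually uses is not stated in this section, so the choice of these two is an assumption, as are the function names and the regex-based counting of words and sentences (alphanumeric characters stand in for both "characters" and "letters").

```python
import re


def text_counts(text):
    """Return (letters, words, sentences) using simple regex heuristics."""
    words = re.findall(r"[A-Za-z0-9]+", text)
    letters = sum(len(w) for w in words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return letters, len(words), len(sentences)


def automated_readability_index(text):
    """ARI = 4.71 * (chars / words) + 0.5 * (words / sentences) - 21.43."""
    letters, words, sentences = text_counts(text)
    if words == 0 or sentences == 0:
        return 0.0
    return 4.71 * letters / words + 0.5 * words / sentences - 21.43


def coleman_liau_index(text):
    """CLI = 0.0588 * L - 0.296 * S - 15.8, with L and S per 100 words."""
    letters, words, sentences = text_counts(text)
    if words == 0:
        return 0.0
    letters_per_100 = letters / words * 100
    sentences_per_100 = sentences / words * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


easy = "The cat sat. The dog ran."
hard = ("Photosynthesis converts atmospheric carbon dioxide into "
        "carbohydrates using electromagnetic radiation.")

for name, sample in (("easy", easy), ("hard", hard)):
    print(name,
          round(automated_readability_index(sample), 2),
          round(coleman_liau_index(sample), 2))
```

Both scores rise with longer words and longer sentences, so each can serve as one numeric component of a structural feature vector.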


