Data Analysis and Knowledge Discovery  2019, Vol. 3 Issue (7): 103-112    DOI: 10.11925/infotech.2096-3467.2018.1089
Automatically Grading Text Difficulty with Multiple Features
Yong Cheng1, Dekuan Xu1, Xueqiang Lv2
1(School of Chinese Language and Literature, Ludong University, Yantai 264025, China)
2(School of Computer Science, Beijing University of Information Technology, Beijing 100192, China)
Abstract  

[Objective] This paper aims to grade the reading difficulty of text documents automatically. [Methods] We used a machine learning method that combines multiple text features to assign difficulty levels automatically. The features, which cover word frequency, structure, topic, and depth, describe the text content from different perspectives. [Results] We evaluated the method on reading comprehension texts from high-school English exams and achieved an accuracy of 0.88, which is higher than that of traditional difficulty classification methods. [Limitations] Because manual annotation is costly, existing datasets could not be used to further improve the method. [Conclusions] The proposed method increased the effectiveness of machine-learning-based data analysis.

Key words: Multiple Features; Reading Difficulty; Automatic Grading
Received: 30 September 2018      Published: 06 September 2019
CLC number: G353
Corresponding Authors: Yong Cheng     E-mail: chengokyong@126.com

Cite this article:

Yong Cheng, Dekuan Xu, Xueqiang Lv. Automatically Grading Text Difficulty with Multiple Features. Data Analysis and Knowledge Discovery, 2019, 3(7): 103-112.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.1089     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2019/V3/I7/103

No. | Name | Description / Formula
1 | avg_sentence_len[7-8, 22-23] | Average sentence length
2 | avg_syllables[7] | Average number of syllables per word
3 | avg_letters[22] | Average number of letters per word
4 | total_polysyllables[9] | Total number of polysyllables
5 | total_syllables[7] | Total number of syllables
6 | total_words[22,23] | Total number of words
7 | total_sentences[9] | Total number of sentences
8 | total_difficult_words[23,24] | Total number of difficult (rare) words
9 | fun_smog_index[9] | $1.043\times \sqrt{polysyllables\times \frac{30}{sentences}}+3.129$
10 | fun_kincaid[7] | $0.39\times \left( \frac{words}{sentences} \right)+11.8\times \left( \frac{syllables}{words} \right)-15.59$
11 | fun_readability[22] | $4.71\times \left( \frac{letters}{words} \right)+0.5\times \left( \frac{words}{sentences} \right)-21.4$
12 | fun_coleman_liau[24] | $5.88\times \left( \frac{letters}{words} \right)+29.6\times \left( \frac{sentences}{words} \right)-15.8$
13 | fun_dale_chall[8] | $15.8\times \left( \frac{difficult\text{ }words}{words} \right)+0.0496\times \left( \frac{words}{sentences} \right)$
14 | fun_flesch[7] | $206.8-1.02\times \left( \frac{words}{sentences} \right)-84.6\times \left( \frac{syllables}{words} \right)$
15 | fun_gunning_fog[24] | $0.4\times \left( \frac{words}{sentences} \right)+100\times \left( \frac{difficult\text{ }words}{words} \right)$
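
The formula features in the table above are simple functions of a few raw counts. The following is a minimal sketch of four of them (features 9, 10, 11, and 14), with coefficients copied from the table as printed. The function names and the sample counts are illustrative assumptions, and how the counts themselves are obtained (tokenization, syllable counting) is not specified here.

```python
import math


def smog_index(polysyllables, sentences):
    """fun_smog_index: 1.043 * sqrt(polysyllables * 30 / sentences) + 3.129."""
    return 1.043 * math.sqrt(polysyllables * 30 / sentences) + 3.129


def flesch_kincaid(words, sentences, syllables):
    """fun_kincaid: 0.39 * words/sentences + 11.8 * syllables/words - 15.59."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability(letters, words, sentences):
    """fun_readability: 4.71 * letters/words + 0.5 * words/sentences - 21.4."""
    return 4.71 * (letters / words) + 0.5 * (words / sentences) - 21.4


def flesch_reading_ease(words, sentences, syllables):
    """fun_flesch: 206.8 - 1.02 * words/sentences - 84.6 * syllables/words."""
    return 206.8 - 1.02 * (words / sentences) - 84.6 * (syllables / words)


if __name__ == "__main__":
    # Hypothetical counts for one passage; these are not taken from the paper.
    counts = {"letters": 450, "words": 100, "sentences": 5,
              "syllables": 150, "polysyllables": 30}
    print("fun_smog_index :", smog_index(counts["polysyllables"], counts["sentences"]))
    print("fun_kincaid    :", flesch_kincaid(counts["words"], counts["sentences"], counts["syllables"]))
    print("fun_readability:", automated_readability(counts["letters"], counts["words"], counts["sentences"]))
    print("fun_flesch     :", flesch_reading_ease(counts["words"], counts["sentences"], counts["syllables"]))
```

Each function takes only the counts listed as features 4 to 7 (plus a letter count), so the formula features can be derived from the count features without reprocessing the text.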

Level | Feature words
Junior high school | school likes happy day nice friends teacher morning eat chinese boy english mother play father lot china afternoon beautiful playing girl homework green friend lunch class tv football breakfast sports
Senior high school | life people time university women study author researchers education college social health experience age person business human public company job american language national brain government body technology family scientists

No. | Removed feature | Grading accuracy | No. | Removed feature | Grading accuracy
13 | -fun_dale_chall | 0.821 | 4 | -total_polysyllables | 0.837
12 | -fun_coleman_liau | 0.829 | 14 | -fun_flesch | 0.838
8 | -total_difficult_words | 0.830 | 15 | -fun_gunning_fog | 0.838
3 | -avg_letters | 0.834 | 5 | -total_syllables | 0.839
11 | -fun_readability | 0.836 | 2 | -avg_syllables | 0.839
9 | -fun_smog_index | 0.837 | 10 | -fun_kincaid | 0.840
7 | -total_sentences | 0.837 | 6 | -total_words | 0.841
1 | -avg_sentence_len | 0.837 |  |  | 0.841
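
A table of this shape can be produced by a leave-one-feature-out loop, assuming each row reports the accuracy obtained when that single feature is removed. The sketch below shows the loop only. The `evaluate` callback stands in for training and scoring a classifier, and the toy weights are invented for illustration and are not results from the paper.

```python
def rank_by_ablation(features, evaluate):
    """Return (feature, accuracy_without_it) pairs, lowest accuracy first.

    `evaluate` is any callable that takes a list of feature names and
    returns an accuracy; here it is a placeholder for model training.
    """
    results = []
    for removed in features:
        kept = [name for name in features if name != removed]
        results.append((removed, evaluate(kept)))
    # A lower accuracy after removal means the feature mattered more.
    return sorted(results, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Toy contribution weights, NOT taken from the paper.
    toy_weights = {"fun_dale_chall": 0.20, "avg_letters": 0.10, "total_words": 0.05}

    def toy_evaluate(kept):
        # Pretend accuracy is a base rate plus the weights of the kept features.
        return 0.5 + sum(toy_weights[name] for name in kept)

    for name, accuracy in rank_by_ablation(list(toy_weights), toy_evaluate):
        print(f"-{name}: {accuracy:.3f}")
```

Sorting by the accuracy after removal puts the features whose absence hurts most at the top, which matches the ordering of the rows in the table.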
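
State vector dimension | Grading accuracy | Number of convolutional network windows | Grading accuracy
32 | 0.867 | 2 | 0.870
64 | 0.880 | 3 | 0.869
128 | 0.869 | 4 | 0.871
192 | 0.871 | 5 | 0.880
256 | 0.873 | 6 | 0.878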
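
Number of features | Features | Fusion before grading (dev set) | Fusion before grading (test set) | Fusion after grading (dev set) | Fusion after grading (test set)
Single feature | F | 0.850 | 0.833 | 0.850 | 0.833
Single feature | S | 0.843 | 0.845 | 0.843 | 0.845
Single feature | T | 0.815 | 0.816 | 0.815 | 0.816
Single feature | M | 0.880 | 0.870 | 0.880 | 0.870
Two-feature fusion | F&S | 0.881 | 0.874 | 0.875 | 0.860
Two-feature fusion | F&T | 0.866 | 0.843 | 0.846 | 0.837
Two-feature fusion | F&M | 0.886 | 0.877 | 0.886 | 0.878
Two-feature fusion | S&T | 0.871 | 0.856 | 0.859 | 0.848
Two-feature fusion | S&M | 0.883 | 0.878 | 0.886 | 0.879
Two-feature fusion | T&M | 0.882 | 0.876 | 0.883 | 0.873
Three-feature fusion | F&S&T | 0.884 | 0.877 | 0.874 | 0.846
Three-feature fusion | F&S&M | 0.887 | 0.878 | 0.881 | 0.875
Three-feature fusion | F&T&M | 0.885 | 0.878 | 0.878 | 0.873
Three-feature fusion | S&T&M | 0.884 | 0.877 | 0.884 | 0.875
Four-feature fusion | F&S&T&M | 0.888 | 0.880 | 0.888 | 0.871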
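
The two fusion settings compared above can be read as early fusion (combine the feature vectors, then grade once) and late fusion (grade with each feature type separately, then combine the outputs). The sketch below illustrates that distinction under stated assumptions: concatenation for fusion before grading and averaging of class probabilities for fusion after grading. The page does not say how the authors combine outputs, so the averaging rule and both function names are hypothetical.

```python
def fuse_before_grading(feature_vectors):
    """Early fusion: concatenate per-feature-type vectors into one input vector."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def fuse_after_grading(class_probabilities):
    """Late fusion: average the class probabilities predicted from each
    feature type and return (predicted_class_index, averaged_probabilities).

    Averaging is an assumption made for this sketch.
    """
    n_views = len(class_probabilities)
    n_classes = len(class_probabilities[0])
    averaged = [
        sum(probs[c] for probs in class_probabilities) / n_views
        for c in range(n_classes)
    ]
    predicted = max(range(n_classes), key=lambda c: averaged[c])
    return predicted, averaged


if __name__ == "__main__":
    # Hypothetical vectors for two feature types, e.g. F and S.
    print(fuse_before_grading([[0.2, 0.7], [1.5]]))
    # Hypothetical two-class probabilities from two separately trained graders.
    print(fuse_after_grading([[0.6, 0.4], [0.3, 0.7]]))
```

With early fusion a single classifier sees all features at once and can model interactions between them. With late fusion each classifier sees only one feature type, which is simpler to train but limits what the combination step can recover.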

Method | Accuracy (validation set) | Accuracy (test set)
Random Guess | 0.494 | 0.491
FKGL | 0.706 | 0.709
VM_KNN | 0.799 | 0.780
CNN_SC | 0.852 | 0.863
Our Model | 0.888 | 0.880

[1] Guo Limin. Study of Automatic Classification of Literature Based on Convolution Neural Network[J]. Library and Information, 2017(6):96-103. (in Chinese)
[2] Li Huizong, Hu Xuegang, Yang Hengyu, et al. A Comprehensive Clustering Method for Socialized Label Based on LDA[J]. Journal of the China Society for Scientific and Technical Information, 2015,34(2):146-155. (in Chinese)
[3] Xu Tongyang, Yin Kai. Semantic Retrieval of Microblogging in the Background of Large Data[J]. Journal of Intelligence, 2017,36(12):173-179. (in Chinese)
[4] Bear D, Dole J, Echevarria J, et al. Treasures, A Reading/Language Arts Program[M]. McGraw-Hill Education, 2009.
[5] Lester M, Neal S, Royster J, et al. Glencoe Writer’s Choice: Grammar and Composition[M]. McGraw-Hill Education, 2001.
[6] Li Xin. Research on the American Leveled Reading of K-12 Students[D]. Shanghai: East China Normal University, 2016. (in Chinese)
[7] Kincaid J P, Braby R, Mears J E. Electronic Authoring and Delivery of Technical Information[J]. Journal of Instructional Development, 1988,11(2):8-13.
[8] Dale E, Chall J S. A Formula for Predicting Readability[J]. Journal of Educational Research Bulletin, 1948,27(2):37-54.
[9] McLaughlin G H. SMOG Grading: A New Readability Formula[J]. Journal of Reading, 1969,12(8):639-646.
[10] Graesser A C, McNamara D S, Louwerse M M, et al. Coh-Metrix: Analysis of Text on Cohesion and Language[J]. Journal of Behavior Research Methods, Instruments, & Computers, 2004,36(2):193-202.
[11] Zhang Ningzhi. Quantitative Analysis of Corpora Difficulty in Chinese Textbooks[J]. Chinese Teaching in the World, 2000(3):83-88. (in Chinese)
[12] Guo Wanghao. Research on Readability Formula of Chinese Text for Foreign Students[D]. Shanghai: Shanghai Jiao Tong University, 2009. (in Chinese)
[13] Zuo Hong, Zhu Yong. Research on Chinese Readability Formula of Texts for Intermediate Level European and American Students[J]. Chinese Teaching in the World, 2014,28(2):263-276. (in Chinese)
[14] Salton G, Buckley C. Term-Weighting Approaches in Automatic Text Retrieval[J]. Information Processing & Management, 1988,24(5):513-523.
[15] Hofmann T. Unsupervised Learning by Probabilistic Latent Semantic Analysis[J]. Machine Learning, 2001,42(1-2):177-196.
[16] Blei D M, Ng A Y, Jordan M I, et al. Latent Dirichlet Allocation[J]. Journal of Machine Learning Research, 2003,3:993-1022.
[17] Walker S H, Duncan D B. Estimation of the Probability of an Event as a Function of Several Independent Variables[J]. Biometrika, 1967,54(1-2):167-179.
[18] Cortes C, Vapnik V. Support-Vector Networks[J]. Machine Learning, 1995,20(3):273-297.
[19] Ho T K. Random Decision Forests[C]// Proceedings of the 3rd International Conference on Document Analysis and Recognition. IEEE, 1995: 278-282.
[20] Liu P, Qiu X, Huang X, et al. Recurrent Neural Network for Text Classification with Multi-Task Learning[C]// Proceedings of the 25th International Joint Conferences on Artificial Intelligence. AAAI Press, 2016: 2873-2879.
[21] Kim Y. Convolutional Neural Networks for Sentence Classification[C]// Proceedings of the 2014 International Conference on Empirical Methods on Natural Language Processing. ACL, 2014: 1746-1751.
[22] Senter R J, Smith E A. Automated Readability Index[J]. Journal of Competitor New York, 1967,1:1-14.
[23] Gunning R. The Fog Index After Twenty Years[J]. Journal of Business Communication, 1969,6(2):3-13.
[24] Coleman M, Liau T L. A Computer Readability Formula Designed for Machine Scoring[J]. Journal of Applied Psychology, 1975,60(2):283-284.
[25] Graves A, Schmidhuber J. Framewise Phoneme Classification with Bidirectional LSTM and Other Neural Network Architectures[J]. Neural Networks, 2005,18(5):602-610.
[26] Lai G, Xie Q, Liu H, et al. Race: Large-Scale Reading Comprehension Dataset from Examinations[C]// Proceedings of the 2017 International Conference on Empirical Methods on Natural Language Processing. ACL, 2017: 785-794.
[27] TensorFlow[CP]. [2018-08-24]..
[28] Jiang Jingjing. Readability Analysis on CEPT Reading Texts and the Development of Lexical Checker[D]. Changsha: Hunan University, 2009. (in Chinese)
[29] Chen Yanlong, Zhang Zhiming. The English Text Difficulty Measurement Based Vector Space Model[J]. Computer Knowledge and Technology, 2010,6(12):2994-2996. (in Chinese)
[30] Maaten L, Hinton G. Visualizing Data Using t-SNE[J]. Journal of Machine Learning Research, 2008,9(11):2579-2605.