Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (9): 80-87    DOI: 10.11925/infotech.2096-3467.2018.0204
Generating HSK Writing Essays with LDA Model
Xu Yanhua1, Miao Yujie2, Miao Lin2, Lv Xueqiang2
1School of Chinese Language and Literature, Ludong University, Yantai 264025, China
2Beijing Key Laboratory of Internet Culture and Digital Dissemination Research, Beijing Information Science and Technology University, Beijing 100101, China
Abstract  

[Objective] This paper automatically generates sample essays for the writing section of the Chinese Proficiency Test (HSK) to help teachers and learners of Chinese prepare for the test. [Methods] First, we took the “HSK Dynamic Corpus” as the base corpus and trained an LDA model on it. Then, we used a cross-entropy strategy to select sentences containing the required keywords. Finally, we manually scored the generated texts against the evaluation criteria. [Results] The generated essays contained all the required keywords and were relevant to the topics of the writing tasks. [Limitations] Part of the training corpus consisted of revised HSK essays written by non-native speakers of Chinese. [Conclusions] The proposed method can effectively generate good-quality passages that contain the required keywords.
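The abstract does not say how the cross-entropy strategy is computed, so the following is only a minimal sketch of one plausible reading: each candidate sentence is compared with a topic's word distribution, and low-cross-entropy sentences are picked greedily until every required keyword is covered. The function names, the toy topic distribution, and the English tokens are all hypothetical; in the paper the word probabilities would come from the LDA model trained on the HSK corpus.

```python
import math
from collections import Counter


def empirical_distribution(tokens):
    """Relative frequency of each token in one sentence."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def cross_entropy(p, q, floor=1e-6):
    """H(p, q) = -sum_w p(w) * log q(w); words missing from q get a small floor."""
    return -sum(pw * math.log(q.get(word, floor)) for word, pw in p.items())


def select_sentences(keywords, topic_word_probs, candidates):
    """Greedily pick low-cross-entropy sentences until all keywords are covered.

    Returns the chosen sentences (in their original order) and the set of
    keywords that no candidate sentence could cover.
    """
    remaining = set(keywords)
    ranked = sorted(
        (cross_entropy(empirical_distribution(s), topic_word_probs), i)
        for i, s in enumerate(candidates)
        if s
    )
    chosen = []
    for _, i in ranked:
        hit = remaining & set(candidates[i])
        if hit:
            chosen.append(i)
            remaining -= hit
        if not remaining:
            break
    return [candidates[i] for i in sorted(chosen)], remaining


# Toy topic-word distribution (assumed values, standing in for one LDA topic).
topic = {"performance": 0.30, "wonderful": 0.25, "miss": 0.20,
         "ticket": 0.15, "friend": 0.10}
candidates = [
    ["the", "performance", "was", "wonderful"],
    ["performance", "wonderful"],
    ["friend", "miss", "ticket"],
    ["weather", "rain"],
]
chosen, missing = select_sentences(
    ["performance", "wonderful", "miss"], topic, candidates
)
print(chosen, missing)
```

Sentences whose words are unlikely under the topic receive a high cross-entropy and are considered last, so the selection favours on-topic sentences while still guaranteeing keyword coverage whenever the candidate pool allows it.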

Key words: Natural Language Generation; LDA Model; Artificial Evaluation
Received: 26 February 2018      Published: 25 October 2018
Chinese Library Classification (CLC) number: TP391 G35

Cite this article:

Xu Yanhua,Miao Yujie,Miao Lin,Lv Xueqiang. Generating HSK Writing Essays with LDA Model. Data Analysis and Knowledge Discovery, 2018, 2(9): 80-87.

URL:

https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2018.0204     OR     https://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2018/V2/I9/80

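Task keywords | Score
招聘, 工作, 发展, 英语, 毕业 (recruit, work, develop, English, graduate) | 2.21
回国, 帮助, 不得不, 遗憾, 祝福 (return to one's country, help, have to, pity, blessing) | 3.42
开车, 喝酒, 要是, 从来, 后悔 (drive, drink alcohol, if, ever/never, regret) | 4.19
无论, 努力, 获得, 坚持, 放弃 (no matter, work hard, obtain, persist, give up) | 4.01
奖金, 建筑, 围绕, 完美, 摄影 (bonus, building, revolve around, perfect, photography) | 2.68
年轻, 运动, 设施, 使, 精彩 (young, sports, facilities, make/cause, wonderful) | 3.72
进步, 提高, 即使…也…, 发展 (progress, improve, even if ... still ..., develop) | 2.43
护照, 找来了, 来不及, 祝福 (passport, found and brought, too late, blessing) | 3.93
大自然, 减少, 文明, 污染, 健康 (nature, reduce, civilization, pollution, health) | 4.45
季度, 早晚, 人员, 应聘, 信心 (quarter, sooner or later, personnel, apply for a job, confidence) | 4.17
演出, 顺利, 以前, 精彩, 错过 (performance, smoothly, before, wonderful, miss) | 4.64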
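As a quick way to read the score table, the short sketch below summarizes the eleven listed scores. The summary figures are derived here from the table and are not reported on this page; the cut-offs at 3 and 4 points are an assumption taken from the band edges of the scoring rubric.

```python
# Scores copied from the table, in row order.
scores = [2.21, 3.42, 4.19, 4.01, 2.68, 3.72, 2.43, 3.93, 4.45, 4.17, 4.64]

mean_score = round(sum(scores) / len(scores), 2)
high_count = sum(1 for s in scores if s >= 4)   # assumed cut-off: 4 points and above
low_count = sum(1 for s in scores if s < 3)     # assumed cut-off: below 3 points

print(mean_score, high_count, low_count)  # prints: 3.62 5 3
```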
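Band | Criteria | Score range
Blank | Completely blank | 0 points
Low | Not all 5 words are used; the content is incoherent; there are grammatical errors and many wrongly written characters | 1-3 points
Medium | The content is coherent and logical but has grammatical errors; or is coherent and logical but has a few wrongly written characters; or is coherent and logical but too short | 3-4 points
High | All 5 words are used; no wrongly written characters; no grammatical errors; the content is rich, coherent, and logical | 4-5 points
Specific requirements:
1. A blank answer scores 0 points.
2. Deduct 0.5 points for each grammatical error and 0.1 points for every two wrongly written characters (no deduction is made when the score is below 1 point); deduct 1 point if the essay is too short and 0.5 points for each missing keyword; the final score is left to the rater's discretion.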
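The deduction rules in the rubric can be illustrated with a small calculation. This is only a sketch of the arithmetic, not the authors' procedure: scoring in the paper was done manually and includes rater discretion, which is not modelled here. The function name, the starting score of 5 points, the reading of the "below 1 point" clause as a floor of 1 for non-blank essays, and the plain substring check for keywords are all assumptions.

```python
def rubric_score(text, keywords, grammar_errors, wrong_chars, too_short,
                 start=5.0):
    """Apply the rubric's fixed deductions to an assumed starting score.

    Assumptions: scoring starts at `start` (5 points) and a non-blank essay
    is never pushed below 1 point.
    """
    if not text.strip():
        return 0.0  # a blank answer scores 0

    # Substring check only; patterned keywords with a gap would need
    # dedicated matching.
    missing = sum(1 for k in keywords if k not in text)

    deduction = (
        0.5 * grammar_errors          # 0.5 per grammatical error
        + 0.1 * (wrong_chars // 2)    # 0.1 per two wrongly written characters
        + (1.0 if too_short else 0.0) # 1 point if the essay is too short
        + 0.5 * missing               # 0.5 per missing keyword
    )
    return round(max(1.0, start - deduction), 2)


example = rubric_score(
    "The performance went smoothly and was wonderful.",
    ["performance", "smoothly", "before", "wonderful", "miss"],
    grammar_errors=1,
    wrong_chars=3,
    too_short=False,
)
print(example)  # prints: 3.4
```

In the example, two keywords are missing (1.0 point), there is one grammatical error (0.5 points), and three wrongly written characters count as one pair (0.1 points), so the assumed starting score of 5 drops to 3.4.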