New Technology of Library and Information Service  2015, Vol. 31 Issue (4): 50-57    DOI: 10.11925/infotech.1003-3513.2015.04.07
Words and N-gram Models Analysis for “A Dream of Red Mansions”
Xiao Tianjiu, Liu Ying
Department of Chinese Language and Literature, Tsinghua University, Beijing 100084, China

[Objective] To examine the relationship between the first 80 chapters and the last 40 chapters of “A Dream of Red Mansions”. [Methods] Combining quantitative and qualitative methods, the first 40 chapters, the middle 40 chapters and the last 40 chapters are compared pairwise to calculate the ratio of unique words in each part. Clustering is then performed separately on function words, N-gram models of words and parts of speech, all content words, and word length, and the similarities among the three parts are computed from high-frequency words. [Results] The first 80 chapters differ from the last 40 chapters. The first 80 chapters contain fewer long words and are more readable and coherent than the last 40 chapters. The first 80 chapters pay more attention to the description of details, while the last 40 chapters focus more on the description of actions and scenes. [Limitations] Only words and N-gram models are considered; semantic and pragmatic features are not used. [Conclusions] According to these features, the first 80 chapters and the last 40 chapters were not written by the same author.
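
The measures named in [Methods] can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the definition of the unique-word ratio (word types found in one part and in none of the others), the choice of cosine similarity, and the cutoff `k` are all assumptions made for illustration, and the text is assumed to be already segmented into word tokens.

```python
from collections import Counter
from math import sqrt


def unique_word_ratio(part, others):
    """Share of word types in `part` that occur in none of `others`.

    Assumed definition; the abstract does not spell out the formula.
    """
    vocab = set(part)
    other_vocab = set().union(*others)
    if not vocab:
        return 0.0
    return len(vocab - other_vocab) / len(vocab)


def word_bigrams(tokens):
    """Word-level 2-grams, one possible N-gram feature."""
    return list(zip(tokens, tokens[1:]))


def cosine_on_top_words(a, b, k=100):
    """Cosine similarity of relative frequencies over the k most
    frequent words of the two parts combined (k is hypothetical)."""
    ca, cb = Counter(a), Counter(b)
    top = [w for w, _ in (ca + cb).most_common(k)]
    va = [ca[w] / len(a) for w in top]
    vb = [cb[w] / len(b) for w in top]
    dot = sum(x * y for x, y in zip(va, vb))
    norm = sqrt(sum(x * x for x in va)) * sqrt(sum(y * y for y in vb))
    return dot / norm if norm else 0.0
```

With three token lists for the first, middle and last 40 chapters, `unique_word_ratio(first, [middle, last])` would give the ratio for the first part, and `cosine_on_top_words(first, last)` one pairwise similarity. The same feature vectors could be passed to a hierarchical or K-means clustering routine.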

Key words: Stylistic analysis; Hierarchical clustering; K-means clustering; N-gram
Received: 20 August 2014      Published: 21 May 2015
:  P315.69  

Cite this article:

Xiao Tianjiu, Liu Ying. Words and N-gram Models Analysis for “A Dream of Red Mansions”. New Technology of Library and Information Service, 2015, 31(4): 50-57.


