New Technology of Library and Information Service  2015, Vol. 31 Issue (4): 50-57     https://doi.org/10.11925/infotech.1003-3513.2015.04.07
Research Paper
Words and N-gram Models Analysis for “A Dream of Red Mansions”
Xiao Tianjiu, Liu Ying
Department of Chinese Language and Literature, Tsinghua University, Beijing 100084, China
Abstract

[Objective] To study the relationship between the first 80 chapters and the last 40 chapters of “A Dream of Red Mansions”, and thereby judge whether the novel was written by a single author. [Methods] Combining quantitative statistics with qualitative analysis, the study compares the unique words of the first 40, middle 40, and last 40 chapters and calculates the ratio of unique words in each part; clusters the text using function words, N-gram models of words and parts of speech, content words, and word length; and computes the similarities among the three parts based on high-frequency words. [Results] The first 80 chapters differ from the last 40 chapters. The first 80 chapters use words more coherently, pay more attention to the description of details, contain fewer long words, and are more readable; the last 40 chapters focus more on the description of actions and scenes, contain more long words, and are slightly less readable. [Limitations] The analysis considers only words and N-gram models; semantic and discourse-level features are not examined. [Conclusions] Judging by words, parts of speech, phrase strings, and part-of-speech strings, the first 80 chapters and the last 40 chapters were very likely not written by the same author.

Keywords: Stylistic analysis; Hierarchical clustering; K-means clustering; N-gram
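
The comparisons named in the abstract (unique-word ratios, N-gram features, and similarity on high-frequency items) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy token lists, and the use of cosine similarity are assumptions made for illustration, since the abstract does not state which similarity measure was used. The sketch also assumes each part has already been segmented into a list of word tokens.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def unique_word_ratio(part, others):
    """Share of a part's vocabulary that appears in none of the other parts."""
    vocab = set(part)
    seen_elsewhere = set().union(*(set(o) for o in others))
    return len(vocab - seen_elsewhere) / len(vocab) if vocab else 0.0


def top_k_profile(tokens, n=1, k=100):
    """Counts of the k most frequent n-grams in a part (a sparse vector)."""
    return Counter(dict(Counter(ngrams(tokens, n)).most_common(k)))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(count * b[key] for key, count in a.items())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy stand-ins for the three 40-chapter parts (hypothetical data,
# not taken from the novel).
first = ["a", "b", "c", "a", "b"]
middle = ["a", "b", "d", "a", "b"]
last = ["a", "e", "f", "e", "a"]

print(unique_word_ratio(first, [middle, last]))
print(cosine_similarity(top_k_profile(first), top_k_profile(middle)))
print(cosine_similarity(top_k_profile(first), top_k_profile(last)))
```

Passing `n=2` or `n=3` to `top_k_profile` builds the profile from word bigrams or trigrams instead of single words; feeding it part-of-speech tags rather than words gives the corresponding part-of-speech N-gram profile.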
Received: 2014-08-20      Published: 2015-05-21
CLC number: P315.69
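Funding:

This work is one of the research outcomes of the National Natural Science Foundation of China project "Modeling of Interactive Behavior and Language Features Based on Pragmatic Information" (Grant No. 61171114) and the Ministry of Education independent research project "Construction of a Social Pragmatic Information Network Based on a Large-Scale Corpus" (Grant No. 20111081010).

Corresponding author: Xiao Tianjiu, ORCID: 0000-0002-5342-243X, E-mail: xtj1990@126.com
Author contributions: Xiao Tianjiu conducted the experiments, analyzed the data, and drafted the paper; Liu Ying proposed the research idea, designed the research plan, and revised the final version of the paper.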
Cite this article:
Xiao Tianjiu, Liu Ying. Words and N-gram Models Analysis for “A Dream of Red Mansions” [J]. New Technology of Library and Information Service, 2015, 31(4): 50-57.
Link to this article:
https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/10.11925/infotech.1003-3513.2015.04.07      or      https://manu44.magtech.com.cn/Jwk_infotech_wk3/CN/Y2015/V31/I4/50


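Copyright © 2015 Editorial Office of Data Analysis and Knowledge Discovery
Address: No. 33 Beisihuan Xilu, Zhongguancun, Haidian District, Beijing 100190
Tel/Fax: (010)82626611-6626, 82624938
E-mail: jishu@mail.las.ac.cn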