Data Analysis and Knowledge Discovery  2020, Vol. 4 Issue (4): 109-118    DOI: 10.11925/infotech.2096-3467.2019.0533
Automatic Summarization of User-Generated Content in Academic Q&A Community Based on Word2Vec and MMR
Tao Xing1, Zhang Xiangxian1, Guo Shunli2, Zhang Liman1
1 School of Management, Jilin University, Changchun 130022, China
2 School of Communication, Qufu Normal University, Qufu 276826, China
Abstract  

[Objective] To address the knowledge aggregation problem of user-generated content (UGC) in academic Q&A communities, this paper proposes an improved automatic summarization method that provides efficient and accurate knowledge aggregation services for the community's research users. [Methods] The proposed method, W2V-MMR, combines the idea of Maximal Marginal Relevance (MMR) with the Word2Vec model. First, Word2Vec is used in the score and similarity calculations to optimize the information quality of summary sentences. Then MMR is applied to extract the summary of UGC in the academic Q&A community. [Results] The information quality scores obtained by the proposed method on the four groups of experimental data are 1.4228, 1.4476, 1.5921 and 3.4168, all higher than those of MMR and TextRank in the comparative experiment. [Limitations] The effect of the number of summary sentences on the results is not considered, and summary quality under different numbers of summary sentences is not compared. [Conclusions] The proposed method provides a useful reference for knowledge aggregation services in academic Q&A communities.
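
To make the [Methods] step concrete, here is a minimal Python sketch of greedy MMR sentence selection over averaged word vectors. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the exact scoring formula, so the sketch uses the generic MMR criterion with cosine similarity, treats the centroid of all sentence vectors as the relevance target, and expects the caller to supply word vectors (in practice, vectors from a trained Word2Vec model). The names `cosine`, `sentence_vector`, `mmr_select`, `lam` and `dim` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of the tokens that have one (zero vector if none do)."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(column) / len(known) for column in zip(*known)]


def mmr_select(sentences, word_vectors, dim, k=2, lam=0.5):
    """Return the indices of up to k sentences chosen greedily by MMR.

    Assumption: relevance is similarity to the centroid of all sentence
    vectors; the paper may use a different relevance score.
    """
    vectors = [sentence_vector(s, word_vectors, dim) for s in sentences]
    if not vectors:
        return []
    centroid = [sum(column) / len(vectors) for column in zip(*vectors)]
    relevance = [cosine(v, centroid) for v in vectors]

    selected, candidates = [], list(range(len(sentences)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            # Redundancy: highest similarity to any sentence already chosen.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return lam * relevance[i] - (1 - lam) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy, hand-made 2-dimensional vectors (hypothetical, for illustration only).
toy_vectors = {
    "gravity": [1.0, 0.0],
    "wave": [0.9, 0.1],
    "einstein": [0.1, 1.0],
    "predicted": [0.0, 1.0],
}
toy_sentences = [
    ["gravity", "wave"],
    ["wave", "gravity"],        # near-duplicate of the first sentence
    ["einstein", "predicted"],
]
chosen = mmr_select(toy_sentences, toy_vectors, dim=2, k=2, lam=0.5)
```

With `lam=0.5` the redundancy penalty keeps the two near-duplicate sentences from both being selected, so the third sentence enters the summary. Raising `lam` toward 1 weights relevance more heavily and weakens that diversity effect.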

Key words: Academic Q&A Community; Automatic Summarization; Word2Vec; MMR
Received: 20 May 2019      Published: 01 June 2020
CLC number (ZTFLH): N99
Corresponding Authors: Tao Xing     E-mail: 459978415@qq.com

Cite this article:

Tao Xing, Zhang Xiangxian, Guo Shunli, Zhang Liman. Automatic Summarization of User-Generated Content in Academic Q&A Community Based on Word2Vec and MMR. Data Analysis and Knowledge Discovery, 2020, 4(4): 109-118.

URL:

http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/10.11925/infotech.2096-3467.2019.0533     OR     http://manu44.magtech.com.cn/Jwk_infotech_wk3/EN/Y2020/V4/I4/109

Model of CBOW and Skip-gram
Sample  MMR                                         TextRank                                    W2V-MMR
        Info. entropy  Similarity  Info. quality    Info. entropy  Similarity  Info. quality    Info. entropy  Similarity  Info. quality
A1      1.6835         0.7226      0.9610           1.7475         0.8820      0.8655           2.1415         0.7187      1.4228
A2      1.5563         0.8449      0.7113           1.8353         0.8088      1.0266           2.2072         0.7596      1.4476
A3      1.1259         0.4617      0.6641           2.2652         0.9116      1.3536           2.3038         0.7117      1.5921
A4      2.9537         0.7785      2.1752           4.0972         0.8902      3.2070           4.2757         0.8589      3.4168
Quantitative Index Comparison
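
Reading the table above, each information quality value appears to equal information entropy minus similarity, to within rounding in the last digit. That relationship is inferred from the reported numbers only, since this page does not state the formula. The short Python check below copies the twelve cells from the table and tests both that inferred relationship and the abstract's claim that W2V-MMR scores highest on every sample. The names `ROWS`, `max_quality_gap` and `best_method` are hypothetical.

```python
# (sample, method, information entropy, similarity, information quality),
# copied from the quantitative index comparison table.
ROWS = [
    ("A1", "MMR", 1.6835, 0.7226, 0.9610),
    ("A1", "TextRank", 1.7475, 0.8820, 0.8655),
    ("A1", "W2V-MMR", 2.1415, 0.7187, 1.4228),
    ("A2", "MMR", 1.5563, 0.8449, 0.7113),
    ("A2", "TextRank", 1.8353, 0.8088, 1.0266),
    ("A2", "W2V-MMR", 2.2072, 0.7596, 1.4476),
    ("A3", "MMR", 1.1259, 0.4617, 0.6641),
    ("A3", "TextRank", 2.2652, 0.9116, 1.3536),
    ("A3", "W2V-MMR", 2.3038, 0.7117, 1.5921),
    ("A4", "MMR", 2.9537, 0.7785, 2.1752),
    ("A4", "TextRank", 4.0972, 0.8902, 3.2070),
    ("A4", "W2V-MMR", 4.2757, 0.8589, 3.4168),
]


def max_quality_gap(rows):
    """Largest |(entropy - similarity) - quality| over all rows."""
    return max(abs((e - s) - q) for _, _, e, s, q in rows)


def best_method(rows, sample):
    """Method with the highest information quality for one sample."""
    candidates = [(q, method) for name, method, _, _, q in rows if name == sample]
    return max(candidates)[1]


# Assumed relationship holds to within rounding of the 4-decimal table values.
assert max_quality_gap(ROWS) < 2e-4
# W2V-MMR has the highest information quality on all four samples.
assert all(best_method(ROWS, s) == "W2V-MMR" for s in ("A1", "A2", "A3", "A4"))
```

If the inferred relationship is right, W2V-MMR's advantage comes mainly from higher information entropy. On sample A3, for example, plain MMR actually has the lower similarity (0.4617 against 0.7117).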
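Algorithm | Generated summary
MMR | Murphy's Law has struck! According to "Murphy's Law": first, nothing is as simple as it looks; second, everything takes longer than you expect; third, anything that can go wrong will go wrong; fourth, if you worry that something will happen, it becomes more likely to happen.
TextRank | This is Murphy's Law: what you worry about will always happen. If I were to explain Murphy's Law in the simplest single sentence, I would rather take a more psychological angle, as follows: "Murphy's Law: what you worry about will always happen."
W2V-MMR | "Murphy's Law", together with the "Peter Principle" and "Parkinson's Law", is regarded as one of the three great discoveries of twentieth-century Western culture. The main content of Murphy's Law can be summarized as follows: first, nothing is as simple as it looks; second, everything takes longer than you expect; third, anything that can go wrong will go wrong; fourth, if you worry that something will happen, it becomes more likely to happen. You have a feeling that your 12-year-old son will fail at pouring water, because you know that his mental maturity, physical strength, temperament, experience and sense of responsibility could each let him down and cause him to fail at pouring the hot water, and this kind of worry often comes true!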
Summarization of A1 by Three Methods
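Algorithm | Generated summary
MMR | The so-called "tilted barrel principle" and "long-stave theory" are really just wordplay, "discussing the barrel only as a barrel". It also seems to be called the short-stave effect: a barrel is made of staves of different heights, and people tend to think the tallest stave is the key, but the opposite is true, and the shortest stave determines how much water the barrel can hold.
TextRank | The "barrel principle", also called the "short-stave theory", says: "No matter how tall a bucket is, the height of the water it holds depends on its lowest stave." However, I recently came across a new barrel theory, which says that if you tilt the barrel, what determines the amount of water it holds is no longer the short stave but the long one.
W2V-MMR | Many people, when they bring up the "barrel principle", unconsciously interpret it in terms of themselves as individuals, whether as self-reflection, sudden insight or motivation... Personally, I feel the "barrel principle" probably applies mainly to the concept of a group. For example, in managing a group, one finds that a certain short stave really does cause the group's disadvantage, and a great deal of historical experience shows statistically that a single short stave led to the failure of the whole; take an army, where the mistake of a single soldier or general wastes all previous effort or even destroys the entire army. For instance: the barrel has a bottom, no water leaks between the staves, what we want is water, not oil or rice, the more water the barrel holds the better, and we need not consider whether the barrel has to be moved, as long as it can hold water and strike a pose... and so on. Of course, this abstraction gives the barrel principle strong portability: find any matching fact, work out a way to connect the interfaces with logic, and you have yet another supporting example.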
Summarization of A2 by Three Methods
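Algorithm | Generated summary
MMR | How are these "gravitational waves" produced? As for why spacetime curves, Einstein tells us in general relativity that the mass of an object curves spacetime; once an object moves or its mass changes, the degree of spacetime curvature changes, forming ripple-like waves that propagate outward.
TextRank | Spacetime distortion is merely a deformation; within the range of qualitative change, the impact of energy expands or shrinks space. In physics, gravitational waves are ripples in the curvature of spacetime that propagate outward from a radiation source in the form of waves, and they can carry energy in the form of gravitational radiation. Einstein predicted the existence of gravitational waves as early as 1916, yet only now have they truly been detected by humankind; one can imagine that it took us 100 years of research to confirm Einstein's inference. We can use gravitational waves to observe black holes that cannot be seen with electromagnetic waves, as well as dark matter and dark energy, which account for 95.1% of the mass of the universe, and so on. Gravity is a force that can cross different dimensions, and by studying gravitational waves, perhaps one day we will be able to achieve time travel.
W2V-MMR | In physics, gravitational waves are ripples in the curvature of spacetime that propagate outward from a radiation source in the form of waves, and they can carry energy in the form of gravitational radiation. Einstein predicted the existence of gravitational waves as early as 1916, yet only now have they truly been detected by humankind; one can imagine that it took us 100 years of research to confirm Einstein's inference. Our device for measuring gravitational waves is a giant "L" shape, like the two green arms in the figure below. At the end of each arm is what we can think of as a mirror; a laser is emitted and reflected back, which lets us measure the length of the two arms. In the ripples of spacetime that are produced, please look closely at how the squares change in the two figures above: spacetime is continually compressed, becoming tall and thin (if it is not obvious, see the first black rectangle in the figure below), and then stretched, becoming short and wide (if it is not obvious, see the second black rectangle in the figure below). In other words, when a gravitational wave passes, it squeezes or stretches the detector's two arms, stretching in one direction and compressing in the other, and using the "speed of light" as a ruler (that is, the time it takes light to be emitted and reflected back) we can measure very precisely in which direction the length increased and in which it was compressed.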
Summarization of A3 by Three Methods
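Algorithm | Summary
MMR | I am a patent manager at a company and also draft some cases myself. I think the first step is to search for the most relevant prior art based on the technical disclosure materials and reassess novelty and inventiveness... When drafting the description, pay particular attention to the number and quality of the specific embodiments, so that they fully support the claims. If the prospects for grant are good, the patent agent should propose a clear filing plan and the scope and content of protection, and begin preparing the formal application once the applicant has agreed.
TextRank | Actually, I personally feel that circuit cases are easy; put simply, it is a matter of modularization... Write in more detail as early as possible, and especially include more embodiments, so that during amendment some content from the description can be added to the claims. I am a patent manager at a company and also draft some cases myself. I think the first step is to search for the most relevant prior art based on the technical disclosure materials and reassess novelty and inventiveness... When drafting the description, pay particular attention to the number and quality of the specific embodiments, so that they fully support the claims.
W2V-MMR | I have been doing patent work for a year and a half and have drafted about 120 patent applications. I think it is important to do the following well: 1. Communicate with the inventor and fully understand the invention; ... 5. When drafting the description, introducing each component in a consistent logical order (for example top to bottom, left to right, inside to outside) keeps the whole text orderly. I only began patent work in 2006 and at first knew nothing at all about patents, ... First, in an independent claim I wrote an extra full stop, for example: "A certain preparation, characterized in that it is composed of the following in weight ratio"; second, when submitting the request for substantive examination and the request for fee reduction, I left the applicant field in the first box blank and only signed as the applicant in the sixth box.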
Summarization of A4 by Three Methods