Data Analysis and Knowledge Discovery  2018, Vol. 2 Issue (2): 86-95    DOI: 10.11925/infotech.2096-3467.2017.0626
Automatic Abstracting of Chinese Document with Doc2Vec and Improved Clustering Algorithm
Xiaoting Jia1, Mingyang Wang1, Yu Cao2
1 (College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China)
2 (Tongfang Knowledge Network, Beijing 100192, China)

[Objective] This paper aims to improve the performance of automatic abstracting by combining the Doc2Vec model with an improved K-means clustering algorithm. [Methods] First, we used the Doc2Vec model, which takes document context into account, to capture the semantics, grammar and word order of the sentences in Chinese documents. Second, we transformed these sentences into fixed-dimension vectors. Third, we identified the cluster centers for the improved K-means algorithm and clustered the sentence vectors. Finally, from each cluster we extracted the sentences with higher information entropy and higher similarity to the other sentences in that cluster. [Results] Compared with the PLSA method, the precision, recall and F value of the proposed model increased by 9.57%, 7.62% and 10.30%, respectively. [Limitations] The sentences extracted from the documents could not yet be used to generate high-quality abstracts. [Conclusions] The proposed method could improve the performance of automatic abstracting of Chinese documents.
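
The clustering and extraction steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the sentence vectors are taken as given (in the paper they would come from Doc2Vec), plain K-means with farthest-point seeding stands in for the improved K-means variant, and the score that adds a sentence's token entropy to its mean cosine similarity within the cluster is a hypothetical way of combining the two criteria. The function names `cosine`, `entropy`, `kmeans` and `summarize` are likewise illustrative.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def entropy(tokens):
    """Shannon entropy (in bits) of the token distribution in one sentence."""
    n = len(tokens)
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in Counter(tokens).values())


def kmeans(vectors, k, iters=20):
    """Plain K-means with farthest-point seeding.

    Assumption: this is a stand-in for the paper's improved K-means,
    whose details are not given in the abstract.
    """
    centers = [vectors[0]]
    while len(centers) < k:
        # Next seed: the vector farthest from its nearest existing center.
        centers.append(
            max(vectors, key=lambda v: min(math.dist(v, c) for c in centers))
        )
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [
            min(range(k), key=lambda j: math.dist(v, centers[j])) for v in vectors
        ]
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def summarize(sentences, vectors, k):
    """Return the indices of one representative sentence per cluster.

    Assumption: score = token entropy + mean cosine similarity to the
    other sentences in the same cluster (a hypothetical combination).
    """
    labels = kmeans(vectors, k)
    picked = []
    for j in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == j]
        if not idx:
            continue

        def score(i):
            sims = [cosine(vectors[i], vectors[m]) for m in idx if m != i]
            mean_sim = sum(sims) / len(sims) if sims else 0.0
            return entropy(sentences[i]) + mean_sim

        picked.append(max(idx, key=score))
    return sorted(picked)


# Toy input: pre-tokenized sentences and made-up 2-D sentence vectors.
sentences = [
    ["doc2vec", "learns", "sentence", "vectors"],
    ["doc2vec", "doc2vec", "vectors"],
    ["kmeans", "groups", "similar", "sentences"],
    ["kmeans", "kmeans", "groups"],
]
vectors = [[1.0, 0.1], [0.9, 0.0], [0.1, 1.0], [0.0, 0.9]]
print(summarize(sentences, vectors, k=2))  # prints [0, 2]
```

In this toy input the two sentences in each cluster are equally similar to each other, so the one with the more varied vocabulary (higher entropy) is selected. In practice the vectors would be inferred by a trained Doc2Vec model and the number of clusters chosen to match the desired abstract length.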

Key words: Automatic Abstracting, Doc2Vec, K-means Clustering, Information Entropy
Received: 29 June 2017      Published: 07 March 2018

Cite this article:

Xiaoting Jia, Mingyang Wang, Yu Cao. Automatic Abstracting of Chinese Document with Doc2Vec and Improved Clustering Algorithm. Data Analysis and Knowledge Discovery, 2018, 2(2): 86-95.

