Data Analysis and Knowledge Discovery

A Survey of Topic Evolution on Social Media

Liu Qian, Li Chenliang

Data Analysis and Knowledge Discovery. 2020, 4(8): 1-14. https://doi.org/10.11925/infotech.2096-3467.2020.0454

[Objective] This paper analyzes and summarizes recent research on topic evolution on social media, and mainly introduces the relevant analysis techniques. [Coverage] Relevant literature was collected from DBLP, Semantic Scholar and CNKI using the keywords "Social" and "Topic Evolution". A total of 83 representative papers were cited. [Methods] The topic evolution techniques are analyzed according to the research objects and the methods of topic extraction. [Results] The techniques are divided into two categories and six subcategories, and the prediction of topic trends is analyzed. [Limitations] We did not compare in detail how these techniques introduce time. [Conclusions] This paper analyzes and summarizes the techniques of topic evolution on social media, and identifies the challenges and future directions of this research.

Author Name Disambiguation Techniques for Academic Literature: A Review

Shen Zhe, Wang Yi, Yao Yifan, Cheng Ying

Data Analysis and Knowledge Discovery. 2020, 4(8): 15-27. https://doi.org/10.11925/infotech.2096-3467.2020.0384

[Objective] This paper reviews research on author name disambiguation techniques for the academic literature, aiming to provide references for future studies. [Coverage] A total of 51 papers published between January 1, 2016 and March 28, 2020 were retrieved from the Web of Science, Google Scholar, CNKI and Wanfang Database. [Methods] First, we explored findings from these papers based on the process of author name disambiguation. Then, we summarized techniques like feature extraction, feature representation, model training and prediction. Finally, we discussed common issues facing this research from multiple dimensions. [Results] Graph-based and probabilistic methods, as well as hybrid feature representation models, improved the calculation of complicated network features. The efficiency and generalization ability of machine-learning models need to be optimized to handle large databases and incremental disambiguation. Most research did not address issues like unbalanced training data, missing feature data, and authors using different names. [Limitations] Due to the differences in empirical data, we did not carry out a quantitative comparison among different methods. [Conclusions] Our study proposed multi-source data fusion, user intervention, and pre-trained models to improve author name disambiguation.

A Comparative Study of Word Representation Models Based on Deep Learning

Yu Chuanming, Wang Manyi, Lin Hongjun, Zhu Xingyu, Huang Tingting, An Lu

Data Analysis and Knowledge Discovery. 2020, 4(8): 28-40. https://doi.org/10.11925/infotech.2096-3467.2019.1222

[Objective] This study systematically explores the principles of traditional deep representation models and the latest pre-training ones, aiming to examine their performance in text mining tasks. [Methods] We compared these models’ data mining results from the model side and the experimental side. All tests were conducted with six datasets: CR, MR, MPQA, Subj, SST-2 and TREC. [Results] The XLNet model achieved the best average F1 value (0.9186), which was higher than those of ELMo (0.8090), BERT (0.8983), Word2Vec (0.7692), GloVe (0.7576) and FastText (0.7506). [Limitations] Our research focused on classification tasks in text mining and did not compare the performance of word representation methods in machine translation, Q&A and other tasks. [Conclusions] The traditional deep representation learning models and the latest pre-training ones yield different results in text mining tasks.

Classification of Chinese Medical Literature with BERT Model

Zhao Yang, Zhang Zhixiong, Liu Huan, Ding Liangping

Data Analysis and Knowledge Discovery. 2020, 4(8): 41-49. https://doi.org/10.11925/infotech.2096-3467.2019.1238

[Objective] This paper explores the classification results of Chinese medical literature based on the BERT-Base-Chinese model and the BERT Chinese medical pre-training model (BERT-Re-Pretraining-Med-Chi), aiming to analyze their differences. [Methods] We built a medical text pre-training corpus with 340,000 abstracts of Chinese medical literature. Then, we constructed training samples with 16,000 and 32,000 abstracts, and established a test sample with another 3,200 abstracts. Finally, we compared the performance of the two models, using the SVM method as a benchmark. [Results] The two BERT models yielded better results than the SVM one, with average F1-scores about 5% higher than that of the SVM model. The F1-scores of the BERT-Re-Pretraining-Med-Chi model reach 0.8390 and 0.8607, the best among the three. [Limitations] This study only examined research papers from 16 medical and health categories in the Chinese Library Classification, and the remaining four categories were not included in the classification system due to the small amount of data. [Conclusions] The BERT-Re-Pretraining-Med-Chi model improves the performance of medical literature classification, while the BERT-based deep learning method yields better results with a large-scale training set.

Question Classification Based on Bidirectional GRU with Hierarchical Attention and Multi-channel Convolution

Yu Bengong, Zhu Mengdi

Data Analysis and Knowledge Discovery. 2020, 4(8): 50-62. https://doi.org/10.11925/infotech.2096-3467.2019.1292

[Objective] This paper proposes a method to extract multi-level features from question texts, aiming to better understand their semantics and address the issues facing text classification. [Methods] First, we constructed multi-channel attention feature matrices based on the multi-feature attention mechanism at the word level. This enriched the semantic representation of the texts and fully utilized the interrogative words, properties and position features from the questions. Then, we convolved the new matrices to obtain phrase-level feature representations. Third, we rearranged the vector representation and fed the data to the bidirectional GRU (Gated Recurrent Unit) to access forward and backward semantic features respectively. Finally, we applied the latent topic attention to strengthen the topic information in the bidirectional contextual features, and generated the final text vector for the classification results. [Results] The accuracy rates of the proposed model with three Chinese question datasets were 93.89%, 94.47% and 94.23% respectively, which were 5.82% and 4.50% higher than those of the LSTM and CNN. [Limitations] We only examined our new model with three Chinese question corpora. [Conclusions] The proposed model fully understands the semantic features of question texts, and improves the performance of question classification.

Predicting Social Media Visibility of Scholarly Articles

Li Gang, Guan Weidong, Ma Yaxue, Mao Jin

Data Analysis and Knowledge Discovery. 2020, 4(8): 63-74. https://doi.org/10.11925/infotech.2096-3467.2020.0124

[Objective] This study tries to predict the visibility of research papers on Twitter from their multidimensional features, aiming to find important factors affecting social media visibility. [Methods] First, we determined each paper’s social media visibility by its total mentions on Twitter, and extracted features from paper contents, authorship and publishing journals. Then, we constructed a binary classification model to predict each paper’s Twitter visibility. Finally, we examined our model with papers on diabetes to evaluate the performance of different algorithms and the importance of all features. [Results] LightGBM had the best performance with an accuracy of 0.70. Features from contents, authorship and publishing journals all influenced an article’s visibility on social media, while a journal’s annual average impact factor was the most important one. [Limitations] We only examined the visibility of diabetes-related papers on Twitter. [Conclusions] Ensemble learning is an effective method to predict the social media visibility of scholarly articles, while features of the publishing journals are the key factors.

Expanding Scholar Labels with Research Similarity and Co-authorship Network

Sheng Jiaqi, Xu Xin

Data Analysis and Knowledge Discovery. 2020, 4(8): 75-85. https://doi.org/10.11925/infotech.2096-3467.2020.0002

[Objective] This paper tries to add more academic labels for researchers from scholarly abstracts, aiming to predict their future research interests. [Methods] First, we extracted the basic labels from abstracts with the TF-IDF method. Then, we identified researchers sharing similar academic interests and co-authorship. Finally, we expanded the basic labels with those from similar scholars and team members. [Results] Compared with existing methods, the proposed one increased the prediction recall by 8.33% on average. [Limitations] Our sample size was small, and we only examined scholarly articles in one language. [Conclusions] The proposed method could predict scholars’ future research interests.
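
The label extraction and expansion steps described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `tfidf_labels` and `expand_labels`, the plain `tf * log(N / df)` weighting, and the toy token lists are all assumptions, and the step that identifies similar scholars and co-authors is taken as given.

```python
import math
from collections import Counter

def tfidf_labels(docs, top_k=2):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    labels = []
    for doc in docs:
        tf = Counter(doc)
        scores = {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        labels.append(ranked[:top_k])
    return labels

def expand_labels(own, neighbours):
    """Append labels from similar scholars or co-authors, skipping duplicates."""
    expanded = list(own)
    for neighbour_labels in neighbours:
        for label in neighbour_labels:
            if label not in expanded:
                expanded.append(label)
    return expanded

# Hypothetical toy abstracts, already tokenized.
abstracts = [
    ["topic", "model", "topic", "text"],
    ["graph", "network", "graph", "text"],
]
basic = tfidf_labels(abstracts)
print(expand_labels(basic[0], [basic[1]]))
```

The shared term "text" scores zero because it appears in every toy document, so it never becomes a basic label.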

Automatic Recognition of Produce Entities from Local Chronicles with Deep Learning

Xu Chenfei, Ye Haiying, Bao Ping

Data Analysis and Knowledge Discovery. 2020, 4(8): 86-97. https://doi.org/10.11925/infotech.2096-3467.2020.0032

[Objective] This paper tries to automatically identify the produce aliases, related human figures, places of origin and cited books from ancient local chronicles, aiming to establish a knowledge base for traditional products. [Methods] First, we chose Local Chronicle of Yunnan: Produce as the basic corpus and preprocessed its texts to carry out corpus tagging. Then, we adopted four deep learning models (Bi-RNN, Bi-LSTM, Bi-LSTM-CRF and BERT) to identify the needed entities. Finally, we compared the outputs of these models. [Results] The P-value and F-value of the Bi-LSTM model were 5.54% and 3.51% higher than those of the Bi-LSTM-CRF model. The R-value of the BERT model reached 83.36%, which was the best among all models. The Bi-LSTM-CRF model yielded the best results with the entity recognition of cited books (F-value=89.71%), and the BERT model had the best performance on character entities with an F-value of 87.90%. [Limitations] Due to the linguistic characteristics of ancient local chronicles and the domain knowledge required for identifying related entities, there may be errors in tagging. [Conclusions] Deep learning could help us identify needed entities from ancient local chronicles effectively.

Extracting Emotion-Cause Pairs Based on Emotional Dilation Gated CNN

Dai Jianhua, Deng Yubin

Data Analysis and Knowledge Discovery. 2020, 4(8): 98-106. https://doi.org/10.11925/infotech.2096-3467.2019.1243

[Objective] This paper proposes an Emotional Dilation Gated CNN (EDGCNN) model, aiming to extract emotion-cause pairs for sentiment analysis. [Methods] First, we used the emotional discriminant model to identify sentiment sentences. Then, we input the encodings of these sentences to the EDGCNN model and located the corresponding reasons. Finally, we tagged keywords of reasons generated from the experimental dataset. [Results] The new model’s recall and F1 values reached 63.52% and 60.45% respectively on the training dataset, which were better than or very similar to those of the existing ones. The proposed model also extracted emotion-cause pairs at a finer granularity level. [Limitations] The experimental corpus size was small. [Conclusions] The proposed model can extract emotion-cause pairs effectively.

Analyzing & Clustering Enterprise Microblog Users with Supernetwork

Xi Yunjiang, Du Diedie, Liao Xiao, Zhang Xuehong

Data Analysis and Knowledge Discovery. 2020, 4(8): 107-118. https://doi.org/10.11925/infotech.2096-3467.2020.0091

[Objective] This paper proposes an integrated modeling method to process multi-dimensional user interest data, aiming to examine the spectral clustering method for analyzing user interests. [Methods] First, we retrieved Weibo (Microblog) data of "Three Squirrels" and used a supernetwork model to integrate the modeling of contents and user interaction data. Then, we constructed an interactive interest index and grouped the users with a spectral clustering algorithm. Finally, we evaluated the clustering results with the Silhouette Coefficient and Davies-Bouldin methods. [Results] We found that the clustering DB value reached 0.57 with k set at 15, and the clusters were evenly distributed. [Limitations] More research is needed to further explore user characteristic data and the impacts of different data dimensions on user interests. [Conclusions] This study proposes maintenance and marketing suggestions for enterprise Weibo profiles, which will help them identify user interests and improve marketing effectiveness.
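
For readers unfamiliar with the Davies-Bouldin (DB) value reported in this abstract, the sketch below computes the standard index with Euclidean distances; lower values indicate more compact, better separated clusters. It is only an illustration of the usual definition: the function name `davies_bouldin` and the toy points are assumptions, and the abstract does not state how the authors computed their value.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin index for points (tuples of floats) and their cluster labels.

    Assumes at least two clusters with distinct centroids.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    # Centroid of each cluster: coordinate-wise mean of its members.
    centroids = {
        label: tuple(sum(coord) / len(members) for coord in zip(*members))
        for label, members in clusters.items()
    }
    # Scatter of each cluster: mean distance from members to the centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }
    total = 0.0
    for i in clusters:
        # Worst-case similarity between cluster i and any other cluster.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)

# Hypothetical toy data: two tight, well separated clusters.
toy_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
toy_labels = [0, 0, 1, 1]
print(davies_bouldin(toy_points, toy_labels))
```

In the toy data each cluster has a scatter of 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.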

The Determinants of Continuance Intention to Pay: Empirical Research from Online Knowledge Payment Users

Wei Wu, Xie Xingzheng

Data Analysis and Knowledge Discovery. 2020, 4(8): 119-129. https://doi.org/10.11925/infotech.2096-3467.2020.0271

[Objective] The current study aims to investigate the relationship among the characteristics of online knowledge payment products, individual needs, and continuance intention to pay, offering guidelines for the industry. [Methods] Based on the Elaboration Likelihood Model and Uses and Gratifications Theory, a conceptual model of continuance intention to pay is constructed. Both structural equation modeling (SEM) and fuzzy-set qualitative comparative analysis (fsQCA) are used to analyze the collected data. [Results] According to the SEM results, argument quality has a positive effect on individual needs, which can further affect users’ continuance intention to pay. The fsQCA findings reveal three causal recipes of motivations that predict high continuance intention to pay. [Limitations] Most of the respondents are users of audio knowledge content, which limits the representativeness of the sample. Also, the conceptual model ignores moderators such as usage scenarios. [Conclusions] The current online knowledge payment products do not fully meet the individual needs of knowledge payment users. The knowledge content and the individual needs are the key factors in enhancing their continuance intention to pay.

Recommending Doctors Online Based on Combined Conditions

Li Yueyan, Xiong Huixiang, Li Xiaomin

Data Analysis and Knowledge Discovery. 2020, 4(8): 130-142. https://doi.org/10.11925/infotech.2096-3467.2019.1038

[Objective] This paper integrates multiple recommendation strategies to discover high-quality doctor services, aiming to improve the recommendation results from medical consultation websites. [Methods] We built a doctor recommendation model based on combined conditions, which included three models for similar patients, medical fields and doctor performance. Then, we used a linear weighted hybrid strategy to merge these results to create a final list. We retrieved data from "Good Doctor Online" to evaluate the proposed model. [Results] Up to 86% of the doctors seen by the patients were identified by our new model. [Limitations] The choice of users might be affected by random factors and the weight setting of each strategy needs to be improved. [Conclusions] The proposed model could effectively recommend high-quality doctors for patients.
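
A linear weighted hybrid strategy of the kind mentioned in this abstract can be sketched as below. This is an assumption-laden illustration rather than the authors' model: the function name `hybrid_rank`, the three score dictionaries, the doctor identifiers and the weights are all hypothetical, and the abstract does not report the weights actually used.

```python
def hybrid_rank(score_lists, weights):
    """Combine per-strategy scores with a weighted sum and rank doctors by the total."""
    combined = {}
    for scores, weight in zip(score_lists, weights):
        for doctor, score in scores.items():
            # A doctor missing from one strategy simply contributes 0 for it.
            combined[doctor] = combined.get(doctor, 0.0) + weight * score
    # Highest combined score first; ties broken by doctor identifier.
    return sorted(combined, key=lambda d: (-combined[d], d))

# Hypothetical scores from the three component models.
similar_patients = {"A": 0.9, "B": 0.2}
medical_field = {"A": 0.1, "B": 0.8, "C": 0.5}
performance = {"B": 0.5, "C": 1.0}

print(hybrid_rank([similar_patients, medical_field, performance], [0.5, 0.3, 0.2]))
# → ['A', 'B', 'C']
```

Shifting weight toward the performance scores, for example `[0.1, 0.3, 0.6]`, reverses the order to C, B, A, which is why the abstract lists the weight setting as something to improve.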

25 August 2020, Volume 4 Issue 8
