Data Analysis and Knowledge Discovery

Research on Deep Learning Based Topic Representation of Hot Events

Yu Chuanming,Yuan Sai,Zhu Xingyu,Lin Hongjun,Zhang Puliang,An Lu

Data Analysis and Knowledge Discovery. 2020, 4(4): 1-14. https://doi.org/10.11925/infotech.2096-3467.2019.0511

[Objective] This study explores how to learn topic representations for hot events and compares the performance of several topic representation models on topic classification and topic relevance modeling. [Methods] Building on the LDA2Vec method, we proposed W-LDA2Vec, a topic representation learning model. After jointly training the initial document and word vectors, we predicted the context vectors of the central words. This yielded word representations that carry topic information and topic representations that carry context information. [Results] On the hot-event topic classification task, our model achieved the highest F1 value of 0.893, which is 0.314, 0.057, 0.022 and 0.013 higher than those of the four baseline models LDA, Word2Vec, TEWV and Doc2Vec, respectively. On the hot-event topic relevance modeling task, with the number of topics set to 10, our model achieved a correlation score of 0.4625, which is 0.0678 higher than that of the LDA model. [Limitations] The experimental corpus is limited to Chinese and English. [Conclusions] By embedding topic information into word and document representations, our model can effectively improve the performance of topic classification and relevance modeling.
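
The context-vector idea behind the original LDA2Vec method can be illustrated with a minimal sketch. It assumes the commonly described composition in which a document vector is a softmax-weighted mixture of topic vectors and is added to the pivot word vector. The function names and toy values are hypothetical, and the sketch does not reproduce the W-LDA2Vec model proposed in the paper.

```python
import math

def softmax(weights):
    """Turn unnormalized weights into proportions that sum to 1."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def document_vector(doc_weights, topic_vectors):
    """Mix topic vectors using the document's topic proportions."""
    proportions = softmax(doc_weights)
    dim = len(topic_vectors[0])
    return [sum(p * t[i] for p, t in zip(proportions, topic_vectors)) for i in range(dim)]

def context_vector(word_vector, doc_weights, topic_vectors):
    """LDA2Vec-style context: pivot word vector plus document vector."""
    doc_vec = document_vector(doc_weights, topic_vectors)
    return [w + d for w, d in zip(word_vector, doc_vec)]

# Toy values (hypothetical): two topics in a 2-dimensional embedding space.
topics = [[1.0, 0.0], [0.0, 1.0]]
ctx = context_vector([0.5, 0.5], [0.0, 0.0], topics)
```

In a trained model the context vector would be used to predict surrounding words; here the values only show how the two vectors are combined.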

Analyzing Public Sentiments from the Perspective of City Profiles

Ye Guanghui,Zeng Jieyan,Hu Jinglan,Bi Chongwu

Data Analysis and Knowledge Discovery. 2020, 4(4): 15-26. https://doi.org/10.11925/infotech.2096-3467.2019.0500

[Objective] This study constructs an evolution model for social sentiment analysis from the perspective of city profiles, aiming to grasp city dynamics, guide public opinion, and identify and predict potential issues. [Methods] First, we used the LDA2Vec algorithm to extract city themes from each time window. Then, we applied a dictionary-based sentiment analysis method to assign fine-grained emotion categories to the city themes and calculated their emotional intensities. Finally, we used the TF-IDF algorithm to track the city events that caused changes in public sentiment, and built an ARMA model to predict social sentiment trends. [Results] Our model’s accuracy in predicting the emotional intensity of “like” reached 97%, while that for “dislike” was up to 90%. [Limitations] We did not include unexpected events as an influencing factor in the proposed model. [Conclusions] Our method could effectively identify city events and predict emotional changes in public opinion.
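
The TF-IDF step mentioned in the methods can be sketched as follows. This is a minimal illustration using the plain tf × log(N/df) weighting; the abstract does not state which TF-IDF variant the authors used, and the tokenized time windows below are hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Hypothetical tokens collected in three consecutive time windows.
windows = [
    ["metro", "delay", "metro"],
    ["metro", "festival"],
    ["festival", "fireworks"],
]
scores = tf_idf(windows)
top_term = max(scores[2], key=scores[2].get)
```

Terms that are frequent in one window but rare in the others receive the highest scores, which is what makes the measure useful for surfacing window-specific events.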

Recommending Microblogs Based on Emotion-Weighted Association Rules

Li Tiejun,Yan Duanwu,Yang Xiongfei

Data Analysis and Knowledge Discovery. 2020, 4(4): 27-33. https://doi.org/10.11925/infotech.2096-3467.2019.0765

[Objective] This study recommends microblogs based on readers’ browsing behaviors, aiming to improve users’ experience with Weibo services. [Methods] First, we used association rules to analyze users’ behaviors on Sina Weibo and retrieved all frequent 1-item sets for comments. Then, we calculated the emotional intensity of the comments and identified microblog posts whose emotional intensity exceeded the threshold. Finally, we generated a new frequent 1-item set to establish stronger association rules for the final recommendation list. [Results] Compared with the benchmark recommendation algorithms, the accuracy, recall and F values of the proposed algorithm all improved by 10%. [Limitations] The parameters in our experiment were relatively simple, which might not yield the best results. [Conclusions] The proposed method based on emotion-weighted association rules can effectively recommend microblogs.
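
The first two method steps can be sketched roughly as below. The data layout (one transaction per user, one intensity score per post), the support threshold and the intensity threshold are all assumptions made for illustration; the abstract does not specify them.

```python
from collections import Counter

def frequent_items(transactions, min_support):
    """Return {item: support} for items appearing in enough transactions."""
    counts = Counter(item for t in transactions for item in set(t))
    n = len(transactions)
    return {item: c / n for item, c in counts.items() if c / n >= min_support}

def emotion_filtered(frequent, intensity, threshold):
    """Keep frequent items whose emotional intensity exceeds the threshold."""
    return {item: s for item, s in frequent.items() if intensity.get(item, 0.0) > threshold}

# Hypothetical data: each transaction lists the posts one user commented on.
transactions = [["p1", "p2"], ["p1", "p3"], ["p1", "p2"], ["p2"]]
intensity = {"p1": 0.9, "p2": 0.3, "p3": 0.8}

frequent = frequent_items(transactions, min_support=0.5)
strong = emotion_filtered(frequent, intensity, threshold=0.5)
```

Here `p3` is dropped for low support and `p2` for low emotional intensity, leaving only `p1` as a candidate for rule generation.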

Recommending Online Medical Experts with Labeled-LDA Model

Pan Youneng,Ni Xiuli

Data Analysis and Knowledge Discovery. 2020, 4(4): 34-43. https://doi.org/10.11925/infotech.2096-3467.2019.0815

[Objective] This paper modifies the existing recommendation model for online medical experts, aiming to address health-related inquiries more effectively. [Methods] First, we identified the latent topics of online health questions with the Labeled-LDA model. Then, we defined the doctors’ specialties and matched them more closely with the questions. Finally, we evaluated the new model with data from http://www.39.net. [Results] The precision, recall and response adoption rates of the proposed method were 40.4%, 44.0% and 22.9%, respectively, which were much higher than those of existing methods. [Limitations] Our method did not include factors such as doctors’ response times and résumés, and it could not identify the expertise of newly joined doctors who had answered few questions. [Conclusions] The proposed model could effectively recommend physicians for patients asking questions online.

Computer-Assisted ICD-11 Coding Method Based on Chinese Semantic Analysis

Zhang Runtong,Chen Donghua,Zhao Hongmei,Zhu Xiaomin

Data Analysis and Knowledge Discovery. 2020, 4(4): 44-55. https://doi.org/10.11925/infotech.2096-3467.2019.0530

[Objective] This study proposes a computer-assisted coding method based on the 11th Revision of International Classification of Diseases (ICD-11) and Chinese semantic analysis, aiming to improve the efficiency of medical coding. [Methods] First, we constructed a new model for the entities and relations in ICD-11 based on traditional graphic models. Then, we used an improved measurement for semantic similarity to estimate the confidence of ICD-11 candidate codes. Finally, the proposed model generated candidate ICD codes. [Results] We examined our model with a coded hospital dataset, and found the proposed method outperformed existing ones. Our method achieved a success rate of 42% in assisted mode and 73% in precise mode. [Limitations] The Chinese version of ICD-11 does not allow us to leverage more Chinese semantics information to improve coding precision. [Conclusions] The proposed method improves the efficiency of coders and quality of medical records. It also promotes the development of Chinese medical informatics.

Identifying Authorship with Novelty Detection Method

Guo Xu,Qi Ruihua

Data Analysis and Knowledge Discovery. 2020, 4(4): 56-62. https://doi.org/10.11925/infotech.2096-3467.2019.0343

[Objective] This paper proposes a novelty detection method to identify authorship. [Methods] We built an algorithm that combines a one-class SVM or a multivariate Gaussian algorithm with a multi-layer stylistic feature model. Then, we proposed a threshold selection method based on the tolerance t. [Results] When the total number of sample characters exceeded 500, the accuracy, recall and F1 values were all above 0.9. Once the number of sample characters reached 2000, the accuracy, recall and F1 values were 0.978, 0.984 and 0.979, respectively. [Limitations] The model’s performance on short texts needs to be improved. [Conclusions] The proposed method could effectively address novelty detection for authorship identification of long texts.
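
Gaussian novelty detection of the kind named in the methods can be sketched as below. The sketch simplifies the multivariate Gaussian to independent features (a diagonal covariance), and it interprets the tolerance t as the fraction of the known author's samples allowed to fall below the cutoff. Both choices, and the two stylistic features, are assumptions; the abstract does not define them.

```python
import math

def fit_diagonal_gaussian(samples):
    """Estimate per-feature mean and variance from the known author's samples."""
    n, dim = len(samples), len(samples[0])
    means = [sum(s[i] for s in samples) / n for i in range(dim)]
    variances = [
        max(sum((s[i] - means[i]) ** 2 for s in samples) / n, 1e-9)
        for i in range(dim)
    ]
    return means, variances

def log_density(x, means, variances):
    """Log-density of x under independent Gaussians per feature."""
    return sum(
        -0.5 * math.log(2 * math.pi * v) - (xi - m) ** 2 / (2 * v)
        for xi, m, v in zip(x, means, variances)
    )

def threshold_for_tolerance(samples, means, variances, t):
    """Cutoff such that roughly a fraction t of training samples score below it."""
    scores = sorted(log_density(s, means, variances) for s in samples)
    return scores[min(int(t * len(scores)), len(scores) - 1)]

# Hypothetical stylistic features: (mean sentence length, function-word rate).
known = [[20.0, 0.30], [22.0, 0.32], [21.0, 0.31], [19.0, 0.29], [23.0, 0.33]]
means, variances = fit_diagonal_gaussian(known)
cutoff = threshold_for_tolerance(known, means, variances, t=0.0)

def is_same_author(x):
    """Accept a text whose features are at least as likely as the cutoff."""
    return log_density(x, means, variances) >= cutoff
```

A text whose features resemble the known author's samples passes the cutoff, while one far from them is flagged as novel, that is, as written by someone else.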

Mining User Reviews with PreLM-FT Fine-Grain Sentiment Analysis

Shen Zhuo,Li Yan

Data Analysis and Knowledge Discovery. 2020, 4(4): 63-71. https://doi.org/10.11925/infotech.2096-3467.2019.0146

[Objective] This paper identifies user preferences from their reviews of catering providers, aiming to find and improve unsatisfactory products or services. [Methods] First, we retrieved user reviews of the catering industry from the DianPing website as an unlabeled corpus for pre-training a language model. Then, we fine-tuned the pre-trained language model with a small amount of labeled data. Finally, we quantified the sentiment scores of attributes in user reviews and combined them with the KANO model to analyze user preferences for products or services. [Results] We successfully identified user preferences from their reviews. [Limitations] The KANO model might yield some inaccurate overall preference analyses. [Conclusions] The proposed method could effectively reveal user preferences with the help of reviews and some labeled data.

Predicting Community Numbers with Network Bayesian Information Criterion

Li Wenzheng,Gu Yijun,Yan Hongli

Data Analysis and Knowledge Discovery. 2020, 4(4): 72-82. https://doi.org/10.11925/infotech.2096-3467.2019.0561

[Objective] This paper proposes an algorithm to predict the number of communities, aiming to address the issues facing community detection algorithms. [Methods] First, we modified the Bayesian information criterion according to the characteristics of overlapping and non-overlapping community detection algorithms. Then, we constructed the Network Bayesian Information Criterion algorithm to predict the number of communities. [Results] The accuracy and stability of the proposed algorithm were better than those of the Silhouette and Modularity algorithms; its accuracy was at least 18% higher than theirs. [Limitations] Our new algorithm only considers network structure. [Conclusions] The proposed algorithm based on the Bayesian information criterion could effectively predict the number of network communities.
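
For orientation, model selection with the classical Bayesian information criterion, BIC = k·ln(n) − 2·ln(L), can be sketched as follows. This is the standard criterion, not the network-specific modification proposed in the paper, and the log-likelihoods and parameter counts are hypothetical.

```python
import math

def bic(log_likelihood, n_params, n_obs):
    """Classical Bayesian information criterion; lower is better."""
    return n_params * math.log(n_obs) - 2.0 * log_likelihood

def best_community_count(candidates, n_obs):
    """candidates maps k -> (log_likelihood, n_params); return the k with minimal BIC."""
    return min(candidates, key=lambda k: bic(*candidates[k], n_obs))

# Hypothetical fits of a community model for k = 2, 3, 4 on a 100-node network.
fits = {2: (-520.0, 4), 3: (-480.0, 6), 4: (-478.0, 8)}
k_best = best_community_count(fits, n_obs=100)
```

Going from three to four communities improves the likelihood only slightly, so the penalty term outweighs the gain and the criterion settles on three.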

Classifying Non-life Insurance Customers Based on Improved SOM and RFM Models

Yan Chun,Liu Lu

Data Analysis and Knowledge Discovery. 2020, 4(4): 83-90. https://doi.org/10.11925/infotech.2096-3467.2019.0715

[Objective] This paper uses neural network algorithms to classify non-life insurance customers, aiming to realize precision marketing. [Methods] We modified the RFM model from the macro and micro perspectives, and then introduced the index of claim amounts to establish the RFMC model. Then, we dynamically set the training speed and weight vector of the SOM neural network model. Finally, we improved the convergence speed of the proposed model and finished customer classification. [Results] We examined our model with information on non-life insurance customers. The proposed model was stable and its self-organization speed increased by 21.6%. [Limitations] All of the non-life insurance customers were from the same insurance company. [Conclusions] This paper divides non-life insurance customers into seven categories, and proposes different strategies for each type, which effectively improve the marketing decisions.
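
A self-organizing map with a learning rate that changes during training can be sketched as below. The exponential decay schedule, the one-dimensional two-neuron map, and the normalized RFMC feature values are assumptions chosen for illustration; the abstract does not describe the authors' actual schedule or network size.

```python
import math

def learning_rate(step, total_steps, start=0.5, end=0.01):
    """Exponentially decay the learning rate from start towards end."""
    return start * (end / start) ** (step / total_steps)

def best_matching_unit(weights, x):
    """Index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(weights[j], x))

def train_som(data, weights, total_steps, sigma=1.0):
    """Train a 1-D SOM in place with a Gaussian neighbourhood."""
    for step in range(total_steps):
        x = data[step % len(data)]
        bmu = best_matching_unit(weights, x)
        lr = learning_rate(step, total_steps)
        for j, w in enumerate(weights):
            # Neurons far from the winner on the map are updated less.
            h = math.exp(-((j - bmu) ** 2) / (2 * sigma ** 2))
            weights[j] = [wi + lr * h * (xi - wi) for wi, xi in zip(w, x)]
    return weights

# Hypothetical normalized RFMC features: (recency, frequency, monetary, claims).
customers = [
    [0.1, 0.9, 0.8, 0.1], [0.2, 0.8, 0.9, 0.2],
    [0.9, 0.1, 0.2, 0.9], [0.8, 0.2, 0.1, 0.8],
]
neurons = [[0.4, 0.6, 0.6, 0.4], [0.6, 0.4, 0.4, 0.6]]
train_som(customers, neurons, total_steps=200, sigma=0.3)
clusters = [best_matching_unit(neurons, c) for c in customers]
```

A large early learning rate lets the map organize quickly, and the decay stabilizes the final assignment of customers to neurons.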

Identifying Chinese / English Metaphors with Word Embedding and Recurrent Neural Network

Su Chuandong,Huang Xiaoxi,Wang Rongbo,Chen Zhiqun,Mao Junyu,Zhu Jiaying,Pan Yuhao

Data Analysis and Knowledge Discovery. 2020, 4(4): 91-99. https://doi.org/10.11925/infotech.2096-3467.2019.0828

[Objective] This paper proposes a method to recognize Chinese and English metaphors with word-vector combination and a recurrent neural network (RNN), aiming to identify the metaphors that are ubiquitous in natural language. [Methods] First, we mapped texts to word vectors with a word-embedding combination algorithm and used them as inputs to the neural network. Then, we used the RNN as the encoder, and the attention mechanism and pooling technique as the feature extractor. Finally, we used Softmax to calculate the probability that the text was a metaphor. [Results] Compared with the traditional method based on vanilla word embeddings, the accuracy and F1 of the proposed method on English texts improved by 11.8% and 6.3%, respectively. On Chinese tasks, the accuracy and F1 of the proposed method also improved by 8.9% and 7.8%. [Limitations] Due to the long-distance dependence issue, our method could not effectively recognize metaphors in long texts with complex sentences. [Conclusions] The proposed model significantly improves the neural network’s ability to recognize metaphors.
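
The last two method steps, attention-based feature extraction followed by a Softmax output, can be sketched as below. The dot-product scoring against a single query vector is one common form of attention pooling and is an assumption here; the encoder states and parameters are untrained toy values, not taken from the paper.

```python
import math

def softmax(values):
    """Normalize scores into probabilities that sum to 1."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_pool(hidden_states, query):
    """Weight each encoder state by its softmax-normalized score against a query."""
    weights = softmax([dot(h, query) for h in hidden_states])
    dim = len(hidden_states[0])
    return [sum(w * h[i] for w, h in zip(weights, hidden_states)) for i in range(dim)]

def classify(hidden_states, query, class_weights):
    """Return class probabilities (literal, metaphor) from the pooled sentence vector."""
    pooled = attention_pool(hidden_states, query)
    return softmax([dot(pooled, w) for w in class_weights])

# Hypothetical encoder outputs for a three-token sentence and toy parameters.
states = [[0.2, 0.1], [0.9, 0.4], [0.1, 0.3]]
probs = classify(states, query=[1.0, 0.0], class_weights=[[0.5, -0.5], [-0.5, 0.5]])
```

The second entry of `probs` plays the role of the metaphor probability; in a trained model the query and class weights would be learned jointly with the RNN encoder.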

Identifying Noun Metaphors with Transformer and BERT

Zhang Dongyu,Cui Zijuan,Li Yingxia,Zhang Wei,Lin Hongfei

Data Analysis and Knowledge Discovery. 2020, 4(4): 100-108. https://doi.org/10.11925/infotech.2096-3467.2019.0896

[Objective] This paper proposes a new method to address the issues facing semantic information and relationship representation, aiming to improve the recognition of noun metaphors. [Methods] First, we used the BERT model in place of word vectors and added positional relationships among words to the semantic representation. Then, we used the Transformer model to extract features. Finally, we identified noun metaphors with a neural network classifier. [Results] The proposed model achieved the highest scores in accuracy (0.9000), precision (0.8964), recall (0.8858) and F1 (0.8910). It covered multiple key points to improve the classification results of noun metaphors. [Limitations] The proposed method could not process ancient Chinese idioms or rare or dummy vocabulary. [Conclusions] The proposed model identifies noun metaphors more effectively than existing models based on hand-crafted features and deep learning.

Automatic Summarization of User-Generated Content in Academic Q&A Community Based on Word2Vec and MMR

Tao Xing,Zhang Xiangxian,Guo Shunli,Zhang Liman

Data Analysis and Knowledge Discovery. 2020, 4(4): 109-118. https://doi.org/10.11925/infotech.2096-3467.2019.0533

[Objective] To address the knowledge aggregation problem of user-generated content (UGC) in academic Q&A communities, this paper proposes an improved automatic summarization method that provides efficient and accurate knowledge aggregation services for research users in the community. [Methods] The proposed method, W2V-MMR, combines Maximal Marginal Relevance (MMR) with the Word2Vec model. First, Word2Vec was used in the score and similarity calculations to improve the information quality of summary sentences. Then, MMR was introduced to extract summaries of UGC in the academic Q&A community. [Results] The information quality scores obtained by the proposed method on the four groups of experimental data were 1.4228, 1.4476, 1.5921 and 3.4168, all higher than those of MMR and TextRank in the comparative experiment. [Limitations] The effect of the number of summary sentences on the results was not considered, and summary quality under different numbers of summary sentences was not compared. [Conclusions] The proposed method provides a useful reference for knowledge aggregation services in academic Q&A communities.
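
Greedy MMR selection over sentence embeddings can be sketched as below. The sketch uses the standard MMR trade-off between relevance to a query vector and similarity to already selected sentences; the embeddings, the query and the value of `lam` are hypothetical, and the scoring details of W2V-MMR itself are not given in the abstract.

```python
import math

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0

def mmr_select(sentence_vectors, query_vector, k, lam=0.7):
    """Greedily pick k sentence indices balancing relevance against redundancy."""
    selected = []
    remaining = list(range(len(sentence_vectors)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(sentence_vectors[i], query_vector)
            redundancy = max(
                (cosine(sentence_vectors[i], sentence_vectors[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical sentence embeddings, e.g. averaged Word2Vec vectors.
sentences = [[1.0, 0.0], [0.99, 0.05], [0.6, 0.8]]
summary = mmr_select(sentences, query_vector=[1.0, 0.2], k=2, lam=0.5)
```

The first two sentences are near-duplicates, so after one of them is chosen the redundancy penalty steers the second pick to the third sentence even though it is less relevant on its own.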

Synchronous Clustering Algorithm for Social Networks Based on Improved Vicsek Model

Yang Xu,Qian Xiaodong

Data Analysis and Knowledge Discovery. 2020, 4(4): 119-128. https://doi.org/10.11925/infotech.2096-3467.2019.0674

[Objective] This paper designs an algorithm based on an improved Vicsek model, aiming to study the synchronous evolution process and cluster structure of social networks. [Methods] First, we introduced a rate self-regulation rule to adjust the evolution rate of individuals in the original Vicsek model. Then, we used individual importance to control the direction of individual evolution in the Vicsek model. [Results] We examined our new algorithm with datasets of financial networks. The F1-Score of its clustering results was higher than those of the Sync algorithm and the clustering algorithm based on the original Vicsek model. [Limitations] The time complexity of clustering was high for large datasets. [Conclusions] The proposed algorithm could effectively describe the evolution and synchronization of complex social networks, and then accurately discover their cluster structures.
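
One update step of the original Vicsek model, on which the paper builds, can be sketched as below. The noise term is omitted to keep the example deterministic, and the rate self-regulation and individual-importance rules introduced by the authors are not reproduced; the coordinates and parameters are hypothetical.

```python
import math

def vicsek_step(positions, headings, radius, speed):
    """One noise-free Vicsek update: align with neighbours, then move."""
    new_headings = []
    for xi, yi in positions:
        sin_sum = cos_sum = 0.0
        for (xj, yj), theta in zip(positions, headings):
            if math.hypot(xi - xj, yi - yj) <= radius:  # includes the agent itself
                sin_sum += math.sin(theta)
                cos_sum += math.cos(theta)
        # Average direction of all agents within the interaction radius.
        new_headings.append(math.atan2(sin_sum, cos_sum))
    new_positions = [
        (x + speed * math.cos(t), y + speed * math.sin(t))
        for (x, y), t in zip(positions, new_headings)
    ]
    return new_positions, new_headings

# Two nearby agents and one distant agent (hypothetical coordinates).
positions = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)]
headings = [0.0, math.pi / 2, math.pi]
positions, headings = vicsek_step(positions, headings, radius=1.0, speed=0.1)
```

After one step the two neighbouring agents share a common heading while the isolated agent keeps its own, which is the synchronization behaviour that clustering algorithms of this family exploit.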

25 April 2020, Volume 4 Issue 4
