Data Analysis and Knowledge Discovery

Forecasting Developments of Core Topics in Science and Technology with Trend Analysis

Cui Ji, Zhang Jinpeng, Bao Zhou, Ding Shengchun

Data Analysis and Knowledge Discovery. 2022, 6(9): 1-13. https://doi.org/10.11925/infotech.2096-3467.2021.1451

[Objective] This study builds a predictive model based on trending topics and analyzes the related literature, aiming to forecast the development of core topics. [Methods] First, we analyzed the characteristics of research topics in scientific and technological literature. Then, we identified the core topics using strategic coordinates. Finally, we used the ARIMA model and the exponential smoothing method to predict the topics’ trending degrees. [Results] The mean absolute error and mean root mean square error of the exponential smoothing method were both smaller than those of the ARIMA model. [Limitations] The selection of initial model parameters, the distribution of coefficients, and the number of published papers affect the prediction performance. [Conclusions] The two proposed models could yield better prediction results for growing and emerging topics.
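As a rough illustration of the forecasting step described in this abstract, the sketch below applies simple exponential smoothing to a yearly series and scores it with MAE and RMSE. The abstract does not say which smoothing variant or parameters the authors used, so the single-parameter form, the `alpha` value, and the sample series are all assumptions made for illustration.

```python
import math

def exponential_smoothing(series, alpha):
    """One-step-ahead forecasts: forecasts[t] is the prediction for series[t]."""
    forecasts = [series[0]]  # assumption: seed with the first observation
    for t in range(1, len(series)):
        forecasts.append(alpha * series[t - 1] + (1 - alpha) * forecasts[t - 1])
    return forecasts

def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical yearly "trending degree" values for one topic.
trend = [1.0, 2.0, 3.0]
predicted = exponential_smoothing(trend, alpha=0.5)
errors = (mae(trend, predicted), rmse(trend, predicted))
```

Comparing these two error values against those of another model fitted to the same series is the kind of comparison the Results sentence reports.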

Analyzing Characteristics of ESI Discipline Distribution in China, U.S. and U.K. with Sub-Disciplines and Text Contents

Zhang Wanshu, Yao Haitao, Wang Xuefeng

Data Analysis and Knowledge Discovery. 2022, 6(9): 14-26. https://doi.org/10.11925/infotech.2096-3467.2021.1439

[Objective] This paper examines the highly cited papers from ESI, aiming to identify the characteristics of their discipline distributions in China, the United States and the United Kingdom. [Methods] First, we merged the sub-disciplines and text contents based on the general framework of biodiversity. Then, we constructed three indicators: discipline variety, discipline balance and discipline disparity. Finally, we analyzed the changes in these indicators over a five-year period. [Results] There is a gap between China and the United States or the United Kingdom in the diversity of Social Sciences and Biomedical Sciences, in the balance of Engineering, Mathematics, and Environment & Ecology, and in the disparity of Computer Sciences, Geosciences, and Botanic and Animal Sciences. However, some indicators showed an upward trend. [Limitations] More research is needed to examine the threshold of discipline coverage, as well as the contribution differences due to the order of authors’ nationalities. [Conclusions] Our study identifies differences among China, the United States and the United Kingdom in the distribution of research disciplines, which benefits discipline evaluation and future development.
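The three indicators named in this abstract can be sketched as follows. The abstract gives no formulas, so the definitions here are assumptions borrowed from common diversity measures: variety as the count of non-empty categories, balance as normalized Shannon entropy, and disparity as the mean pairwise distance between categories. The category names and distances are hypothetical.

```python
import math
from itertools import combinations

def variety(counts):
    """Number of sub-disciplines with at least one paper."""
    return sum(1 for c in counts.values() if c > 0)

def balance(counts):
    """Shannon evenness in [0, 1]; 1 means papers are spread evenly."""
    total = sum(counts.values())
    shares = [c / total for c in counts.values() if c > 0]
    if len(shares) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in shares)
    return entropy / math.log(len(shares))

def disparity(counts, distance):
    """Mean pairwise distance between the sub-disciplines that are present."""
    present = sorted(name for name, c in counts.items() if c > 0)
    pairs = list(combinations(present, 2))
    if not pairs:
        return 0.0
    return sum(distance[frozenset(pair)] for pair in pairs) / len(pairs)

# Hypothetical paper counts and a hypothetical distance between two categories.
papers = {"A": 5, "B": 5, "C": 0}
dist = {frozenset({"A", "B"}): 0.4}
indicators = (variety(papers), balance(papers), disparity(papers, dist))
```

Tracking these three values per country and per year would give the kind of five-year comparison the abstract describes.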

Poet’s Emotional Trajectory in Time and Space: Case Study of Li Bai for Digital Humanities

Gao Jinsong, Zhang Qiang, Li Shuaike, Sun Yanling, Zhou Shubin

Data Analysis and Knowledge Discovery. 2022, 6(9): 27-39. https://doi.org/10.11925/infotech.2096-3467.2021.1413

[Objective] This paper explores changes in poets’ time-space trajectories and emotions, aiming to provide a new research perspective for the humanities. [Context] With the help of ontology and GIS technology, we improve the visualization of current digital humanities research and the accessibility of its results. [Methods] We constructed a poet ontology model for Li Bai, a famous poet of China’s Tang Dynasty, and created a knowledge model for his related concepts and relationships. Then, we used GIS technology to present the changes in Li Bai’s temporal and spatial emotional trajectory, which helped us explore the tacit knowledge. [Results] Li Bai’s life trajectories spanned more than half of today’s China, with the most frequent trajectories near today’s Nanjing. Dangtu was Li Bai’s “sorrowful and joyful” place, while Nanjing was his “sorrowful” place. Li Bai was more “joyful” than “sorrowful” in his youth, while he became more “sorrowful” than “joyful” in middle age. He was “sorrowful and joyful” in his later years. [Conclusions] This paper provides practical guidelines for studying poets’ emotional trajectories in time and space, which benefits humanities research.

Detecting Multimodal Sarcasm Based on SC-Attention Mechanism

Chen Yuanyuan, Ma Jing

Data Analysis and Knowledge Discovery. 2022, 6(9): 40-51. https://doi.org/10.11925/infotech.2096-3467.2021.1362

[Objective] This paper designs an SC-Attention fusion mechanism, aiming to address the low prediction accuracy and difficult fusion of multimodal features in existing detection models for multimodal sarcasm. [Methods] First, we used the CLIP and RoBERTa models to extract features from pictures, picture attributes, and texts. Then, we combined the SC-Attention mechanism with SENet’s attention mechanism to establish the Co-Attention mechanism and fuse multimodal features. Third, we re-allocated attention feature weights by the original modals. Finally, we input the features to the fully connected layers to detect sarcasm. [Results] The accuracy and F1 of the proposed model reached 93.71% and 91.68%, which were 10.27 and 11.5 percentage points higher than the existing ones. [Limitations] We need to examine our model with more data sets. [Conclusions] The proposed model reduces information redundancy and feature loss, which effectively improves the accuracy of multimodal sarcasm detection.

News Recommendation with Latent Topic Distribution and Long and Short-Term User Representations

Tang Jiao, Zhang Lisheng, Sang Chunyan

Data Analysis and Knowledge Discovery. 2022, 6(9): 52-64. https://doi.org/10.11925/infotech.2096-3467.2021.1376

[Objective] This paper proposes a news recommendation model based on contents and additional information on users’ current preferences, aiming to improve the performance of the existing ones. [Methods] We established a news representation model integrating the titles, abstracts, full-texts, as well as explicit and potential topics. We also built a user representation model utilizing the long and short-term user interests as well as their current concerns and preferences. [Results] We examined the proposed model with two large-scale news recommendation datasets. It reached 69.51% on AUC, 34.09% on MRR, 37.25% on nDCG@5, and 43.01% on nDCG@10 with the first dataset. For the second one, it reached 66.05% on AUC, 30.93% on MRR, 34.30% on nDCG@5, and 40.46% on nDCG@10. All of these were higher than the seven baseline models. [Limitations] More research is needed to study users with few historical behaviors. [Conclusions] The proposed model could create vectors for news contents and user representations using advanced natural language processing techniques. It also effectively improves the performance of news recommendation models.
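For readers unfamiliar with the ranking metrics reported in this abstract, the sketch below shows one common way to compute MRR and nDCG@k from a ranked list of click labels. It uses the textbook definitions (reciprocal rank of the first relevant item, and a log2 discount for DCG); the paper's exact evaluation script may differ, and the sample labels are hypothetical.

```python
import math

def reciprocal_rank(ranked_labels):
    """1 / position of the first relevant item, or 0 if none is relevant."""
    for position, label in enumerate(ranked_labels, start=1):
        if label > 0:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(sessions):
    """Average reciprocal rank over several ranked lists."""
    return sum(reciprocal_rank(s) for s in sessions) / len(sessions)

def dcg_at_k(ranked_labels, k):
    """Discounted cumulative gain over the top k positions."""
    return sum(label / math.log2(position + 1)
               for position, label in enumerate(ranked_labels[:k], start=1))

def ndcg_at_k(ranked_labels, k):
    """DCG divided by the DCG of the ideal ordering of the same labels."""
    ideal = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    return dcg_at_k(ranked_labels, k) / ideal if ideal > 0 else 0.0

# Hypothetical impressions: 1 = clicked, 0 = not clicked, in model-ranked order.
impressions = [[0, 1, 0], [1, 0, 0]]
mrr = mean_reciprocal_rank(impressions)
ndcg5 = [ndcg_at_k(labels, 5) for labels in impressions]
```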

Classifying Reasons of Hotel Reviews with Domain ERNIE and BiLSTM Model

Zhang Zhipeng, Mao Yusheng, Zhang Liyi

Data Analysis and Knowledge Discovery. 2022, 6(9): 65-76. https://doi.org/10.11925/infotech.2096-3467.2021.1303

[Objective] This paper proposes a classification model to identify reasons in hotel reviews from online booking platforms. [Methods] First, we constructed a pretraining corpus with millions of online reviews and manually annotated the ORSC dataset for the proposed model. Then, we extracted the text features of the ORSC dataset by adding the constructed corpus to the ERNIE model. Finally, we used the BiLSTM model to merge all features and identify reviews with reasons. [Results] On the ORSC dataset, the DERNIE model’s accuracy was 91.33% while the F1 value was 91.20%. After adding BiLSTM features, the accuracy increased to 94.57% and the F1 value became 94.62%. [Limitations] The pre-trained language models require a large amount of data from the additional corpus, which might affect the computing speed and efficiency. [Conclusions] Our new model can effectively identify reason sentences from online reviews.

CNN-SM: Identifying Words on Defective Products with Sememe and Multi-features

You Xindong, Yuan Menglong, Zhang Le, Lv Xueqiang

Data Analysis and Knowledge Discovery. 2022, 6(9): 77-85. https://doi.org/10.11925/infotech.2096-3467.2021.1369

[Objective] This paper proposes a CNN model based on sememes and multi-features, aiming to improve the recognition accuracy of words on defective consumer products. [Methods] First, we created the model’s input with a distributed word vector fused with sememes. Then, we added part-of-speech features and randomly embedded word position vectors to the input. Finally, we removed the max pooling and increased the information contained in the depth vector output by the convolution kernel, which provided sufficient information for word classification. [Results] Compared with the CNN model that only adds word position vectors, the proposed method improved the precision, recall and F1 values by 0.021, 0.002 and 0.012, respectively. [Limitations] We need to improve the polarity recognition of the same expression in different scenarios. [Conclusions] The sememes, part-of-speech features, and the removal of the pooling layer could improve the performance of the model for domain word recognition.

Extracting Entities for Enterprise Risks Based on Stroke ELMo and IDCNN-CRF Model

Yang Meifang, Yang Bo

Data Analysis and Knowledge Discovery. 2022, 6(9): 86-99. https://doi.org/10.11925/infotech.2096-3467.2021.1308

[Objective] This paper proposes a new model to learn the text characteristics and contextual semantic relevance, aiming to extract entities for enterprise risks more effectively. [Methods] Our entity extraction model is based on stroke ELMo embedded in the IDCNN-CRF. First, we used the bidirectional language model to pre-train on large-scale unstructured data for enterprise risks and obtained the stroke ELMo vector as the input feature. Then, we sent it to the IDCNN network for training, and utilized the CRF to process the output layer of IDCNN. Finally, we obtained the optimal entity sequence labeling for the enterprise risks. [Results] The F value of the proposed model is 91.9%, which is 2.0% higher than the performance of BiLSTM-CRF deep neural network models. The running speed of our model is 2.36 times faster than the BiLSTM-CRF. [Limitations] More research is needed to examine this model in more fields. [Conclusions] The proposed model provides a reference for constructing an entity corpus of enterprise risks.

Entity Recognition and Labeling for Medical Literature Based on Neural Network

Zhao Ruijie, Tong Xinyu, Liu Xiaohua, Lu Yonghe

Data Analysis and Knowledge Discovery. 2022, 6(9): 100-112. https://doi.org/10.11925/infotech.2096-3467.2021.1414

[Objective] This paper proposes a new entity recognition model, aiming to find new knowledge effectively and improve the utilization of medical papers. [Methods] We constructed a pharmaceutical entity recognition model based on Attention-BiLSTM-CRF and examined it on the public datasets of GENIA Term Annotation Task and BioCreative II Gene Mention Tagging. We also used the model to annotate abstracts of biomedical scientific papers. [Results] The F1 values of our model on the two data sets were 81.57% and 84.23%, while the accuracy rates were 92.51% and 97.85%. These results are better than those of the benchmark ones. Moreover, our model has more advantages in processing the extremely unbalanced data. [Limitations] The volume of data and application of entity labeling experiments are relatively homogeneous. [Conclusions] The proposed model improves the effectiveness of entity recognition and mining of new medical knowledge.

Drug Recommendation Based on Graph Neural Network with Patient Signs and Medication Data

Cheng Quan, She Dexin

Data Analysis and Knowledge Discovery. 2022, 6(9): 113-124. https://doi.org/10.11925/infotech.2096-3467.2021.1452

[Objective] This paper proposes a new drug recommendation algorithm based on the graph neural network integrating patient signs and medication history, aiming to improve illness diagnosis and treatment. [Methods] First, we constructed a transitive relationship model for abnormal signs and drugs based on the Graph Neural Network (GNN). Then, we designed a precise drug recommendation plan with sign perception and built a heterogeneous graph for the “sign-patient-drug” relationship. Third, our model learned the node representation with sign perception using the R-GCN encoder. Finally, we designed a sign-aware interaction decoder, which integrated the abnormal signs to recommend drugs accurately. [Results] We examined the proposed model with diagnosis and treatment records of three types of diseases from the MIMIC-III dataset. Compared with the SVD, NeuMF and NGCF models, the proposed method’s Recall@20 value increased by 5.76, 5.33 and 0.91 percentage points, respectively. Meanwhile, it increased the NDCG@20 value by 5.03, 4.25 and 2.67 percentage points. [Limitations] Our method did not include the dynamic changes of patients’ drug use due to the development of diseases. [Conclusions] The proposed drug recommendation method is effective and feasible. The model could perceive the impacts of patient signs on medication, which lays foundations for precise drug recommendation algorithms integrating multi-dimensional information.

Analyzing Medical Semantic Association with Complex Network

Zhang Junliang, Fang Xuemei, Zhang Fan, Liu Xiwen, Zhu Peng

Data Analysis and Knowledge Discovery. 2022, 6(9): 125-137. https://doi.org/10.11925/infotech.2096-3467.2021.1178

[Objective] This paper aims to study medical semantic association using complex networks. [Methods] First, we constructed a medical semantic association network using the medical semantic concepts as nodes and semantic associations as edges. Then, we analyzed the network characteristics and semantic communities. Finally, we created vectors for the semantic concepts and conducted semantic clustering analysis with a neural network. [Results] We retrieved relevant literature on “coronavirus” from MEDLINE of PubMed and built a semantic association network with 43 nodes and 877 edges. Then, we visualized the network characteristics, semantic communities and semantic clusters. [Limitations] The experimental data size needs to be expanded. [Conclusions] The proposed network effectively describes the semantic association among medical concepts and benefits medical knowledge discovery services.
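The network construction in this abstract (concepts as nodes, associations as edges) can be sketched with a plain adjacency map. The sketch assumes a simple undirected graph without self-loops or duplicate edges, which the abstract does not state, and the concept names are hypothetical placeholders.

```python
from collections import defaultdict

def build_network(associations):
    """Undirected adjacency map from (concept, concept) association pairs."""
    adjacency = defaultdict(set)
    for a, b in associations:
        if a != b:  # assumption: ignore self-associations
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

def edge_count(adjacency):
    """Each undirected edge appears in two neighbor sets, so halve the total."""
    return sum(len(neighbors) for neighbors in adjacency.values()) // 2

def density(n_nodes, n_edges):
    """Share of possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0
    return 2 * n_edges / (n_nodes * (n_nodes - 1))

# Hypothetical associations; the repeated pair collapses into one edge.
network = build_network([("virus", "fever"), ("fever", "virus"), ("virus", "cough")])
summary = (len(network), edge_count(network), density(len(network), edge_count(network)))
```

Under the same simple-graph assumption, the reported 43 nodes and 877 edges would correspond to a density of 877/903, or roughly 0.97.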

Reader Preference Analysis and Book Recommendation Model with Attention Mechanism of Catalogs

Wang Dailin, Liu Lina, Liu Meiling, Liu Yaqiu

Data Analysis and Knowledge Discovery. 2022, 6(9): 138-152. https://doi.org/10.11925/infotech.2096-3467.2021.1317

[Objective] This paper proposes a new reader preference analysis method as well as a personalized book recommendation model (IABiLSTM), aiming to improve the accuracy of the existing algorithms. [Methods] First, we extracted the semantic features of books according to their titles and catalog contents. We used the BiLSTM network to capture the long-distance dependency of the texts and word order context information. We also utilized the Two-layer Self-Attention mechanism to enhance the deeper semantic expression of book catalog features. Then, we analyzed readers’ historical browsing behaviors, which were quantified with an interest function. Third, we combined the semantic features of books with readers’ interests to generate their preference vectors. Fourth, we calculated the similarity between the vectors of candidate books’ semantic features and readers’ preferences, and predicted the scores for personalized book recommendation. [Results] We examined our model on the Douban Reading and Amazon datasets, and set the N value as 50. The MSE, Precision and Recall reached 1.1%, 89.1%, and 85.2% on the Douban data, while they were 1.2%, 75.2%, and 72.8% on the Amazon data. These results were better than those of the comparison model. [Limitations] More research is needed to examine our model with other datasets. [Conclusions] The proposed model improves the accuracy of book recommendation, and benefits common NLP tasks.
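The final matching step in this abstract, comparing a reader's preference vector with candidate book vectors, can be sketched as a similarity ranking. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the vectors and titles are hypothetical stand-ins for the learned features.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors; 0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_books(preference, book_vectors, top_n):
    """Return the top_n (title, score) pairs, highest similarity first."""
    scored = [(title, cosine_similarity(preference, vector))
              for title, vector in book_vectors.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]

# Hypothetical 2-dimensional feature vectors.
reader = [1.0, 0.0]
books = {"book_a": [1.0, 0.0], "book_b": [0.0, 1.0], "book_c": [1.0, 1.0]}
recommended = rank_books(reader, books, top_n=2)
```

In the paper's setting `top_n` would play the role of the N value mentioned in the Results, with the vectors produced by the trained model rather than written by hand.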

25 September 2022, Volume 6 Issue 9
