[Objective] This article addresses the difficulty of relation extraction caused by overlapping entity relations in military texts. [Methods] We used the BERT model as the encoder for the input texts, and used a hierarchical reinforcement learning approach to decode relations and their corresponding entities. Then, we merged relational position features into the entity decoding process to construct a relation extraction model for the military domain. [Results] The F1 value reached 82.2% on the military weapon and equipment dataset, about 8% higher than those of other methods. On the publicly available NYT10 and NYT10-sub datasets, the F1 values reached 71.8% and 69.0%, about 7% and 9% higher than those of other methods, respectively. [Limitations] The new method’s extraction performance is better on manually annotated datasets. More research is needed to improve its performance on noisy, distantly supervised datasets. [Conclusions] The HBP method could effectively extract relations in the military domain, and has some generalization potential.
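As a toy illustration of the relational position features merged into entity decoding, the sketch below maps each token to its clipped relative distance from the position of the predicted relation trigger. The function name, clipping window, and offset scheme are illustrative assumptions, not details taken from the paper:

```python
def relation_position_features(seq_len, trigger_idx, max_dist=10):
    """For each token, compute its relative distance to the relation
    trigger, clipped to [-max_dist, max_dist] and shifted to be
    non-negative so it can index a position-embedding table."""
    feats = []
    for i in range(seq_len):
        d = max(-max_dist, min(max_dist, i - trigger_idx))
        feats.append(d + max_dist)
    return feats
```

In a full model, these indices would look up a learned embedding that is concatenated with the BERT token encodings before entity tagging.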
[Objective] This paper proposes a name disambiguation method for scientific literature, aiming to distinguish scholars with the same name. Existing solutions utilize document feature extraction or the relationships between documents and co-authors, which loses higher-order attributes. [Methods] First, we established a unified feature extraction framework, the Paper Embedding Network (PaperEmbNet), which combined content and relationship information to build an academic heterogeneous information network for each author. Then, we designed a clustering-parameter estimation method (AR4CPM) based on the Attentive Recurrent Neural Network to estimate the cluster number directly. Finally, we used the hierarchical agglomerative clustering (HAC) algorithm to disambiguate author names with the predicted number as the preset parameter. [Results] We examined the proposed model with the AMiner-AND dataset and found its macro-F1 score was up to 4.75% higher than that of the suboptimal model, while the average training time was 5-10 minutes shorter than those of the existing baselines. [Limitations] We need to evaluate the performance of the proposed method in multilingual environments. [Conclusions] The proposed approach could effectively conduct name disambiguation tasks.
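The final step, hierarchical agglomerative clustering with a preset cluster number, can be sketched in plain Python. This is a minimal single-linkage variant on raw points; the actual pipeline clusters PaperEmbNet paper embeddings with the AR4CPM-predicted cluster number as k:

```python
import math

def hac(points, k):
    """Single-linkage agglomerative clustering: repeatedly merge the
    two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

Each returned cluster would correspond to one real-world author among the same-name candidates.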
[Objective] This paper aims to accurately extract scientific citations and their context data, which could significantly improve the results of citation analysis. [Methods] We divided the citation extraction task into citation sentence extraction, citation context identification, and citation metadata extraction. Then, we proposed a coreference resolution-based method to identify and extract scientific citation contexts. [Results] We examined our method with Chinese sequentially coded periodicals and extracted the citation sentences and references correctly. The F1 value for identifying the citation context was between 0.780 and 0.849. [Limitations] Due to the limits of the Chinese scientific citation corpus and the small scale of the experimental data, the proposed method might not work effectively in other fields. [Conclusions] Our study optimizes the steps of citation content analysis and enlarges the data scope. It provides support for researchers of citation content analysis.
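For the first subtask, citation sentence extraction, a minimal sketch is to match numeric citation markers. The regular expression below assumes bracketed markers such as [3] or [1,2] and is purely illustrative of the subtask, not the paper's actual extraction rules:

```python
import re

# bracketed numeric citation markers: [3], [1,2], [10-12]
CITE = re.compile(r"\[\d+(?:[,-]\d+)*\]")

def citation_sentences(sentences):
    """Keep only sentences that contain a numeric citation marker."""
    return [s for s in sentences if CITE.search(s)]
```

The retained sentences would then feed the coreference-based context identification step.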
[Objective] This paper tries to reduce the manual annotation needed to extract tables with complicated headers from PDF documents. [Methods] First, we identified the table cell structure based on line segments and represented the cell contents with word vectors. Then, we calculated the word vector similarity of the table content in each line. Finally, we separated the table headers from the contents. [Results] We examined our method on a self-built PDF table dataset. The F1 value of table information extraction was 98.07%, and the F1 value of table content division exceeded 99%. These results are close to those of deep learning text classification models, which require a large amount of annotated corpus. [Limitations] Our method can only extract relational tables, and cannot be applied to scanned PDF documents. [Conclusions] The proposed method can automatically extract PDF tables with complicated headers.
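The idea of separating headers from contents by line-wise word-vector similarity can be sketched as follows. This simplified version cuts the table at the largest drop in adjacent-row cosine similarity, with each row vector standing in for the averaged word vectors of that line; the split heuristic is an assumption for illustration:

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def split_header(rows):
    """Cut the table at the largest drop in adjacent-row similarity:
    rows above the cut are treated as header, rows below as content."""
    sims = [cosine(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    cut = min(range(len(sims)), key=lambda i: sims[i]) + 1
    return rows[:cut], rows[cut:]
```

Header rows tend to be lexically similar to each other and dissimilar to the data rows, which is what the similarity drop detects.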
[Objective] This paper tries to recognize one-to-many entity relationship instances (such as inclusion and coordination relations) from sentences using a small number of samples, aiming to realize continual learning with new data. [Methods] First, we generated the one-to-many inclusion and coordination entities from sentences using LaserTagger. Then, with the help of position embedding and a weighted loss, our model captured more features from limited data. Finally, the model achieved continual learning through model compression and expansion. [Results] Our approach’s SARI was 1% better than those of the baseline models in all tests. Model compression and expansion effectively retained the knowledge learned from previous data, and the SARI was about 16.92% higher than those of the baseline models. [Limitations] More research is needed to examine the proposed method with more complex datasets. [Conclusions] Our study could effectively identify entity relationships with a small amount of training data from different categories.
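LaserTagger casts text generation as tagging each input token with an edit operation rather than generating from scratch, which is what makes it sample-efficient. A minimal sketch of applying such tags, with a tag vocabulary (KEEP, DELETE, ADD_<word>) simplified relative to the real model:

```python
def apply_edit_tags(tokens, tags):
    """Apply LaserTagger-style edit tags to a token sequence:
    KEEP copies the token, DELETE drops it, and ADD_<word>
    inserts <word> before copying the token."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(tok)
        elif tag == "DELETE":
            continue
        elif tag.startswith("ADD_"):
            out.append(tag[4:])
            out.append(tok)
    return " ".join(out)
```

Predicting tags instead of free-form text keeps the output space small, so far fewer training samples are needed.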
[Objective] This study uses a heterogeneous information network and author preferences to improve the performance of scientific literature recommendation. [Methods] We proposed a new method using various kinds of semantic information. Firstly, we weighted the meta paths in the heterogeneous information network of the scientific literature with the help of author preferences. Secondly, we used the DPRel algorithm to calculate the correlation between the author and the literature. Finally, we constructed the weighted author-literature matrix, and retrieved the recommendation list in descending order of correlation. [Results] We examined our model with datasets from the Web of Science. Compared with single meta-path methods, the average successful recommendation rate of the new algorithm was 6%, 8% and 6% higher on the three datasets, and the improvement rate of successful recommendation was 14.8%, 27.6% and 13.0%, respectively. [Limitations] In the data preprocessing stage, the keywords were unified manually, which is unrealistic for massive datasets. [Conclusions] The proposed method could effectively improve the quality of scientific literature recommendation.
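The last two steps, weighting per-meta-path similarities by author preference and ranking literature in descending order of correlation, can be sketched as below. The function names and the linear weighting are illustrative assumptions, not the published DPRel formula:

```python
def weighted_scores(meta_path_sims, weights):
    """Combine per-meta-path author-literature similarity rows with
    author-preference weights (an assumed linear stand-in for the
    DPRel correlation computation)."""
    n = len(meta_path_sims[0])
    return [sum(w * sims[j] for w, sims in zip(weights, meta_path_sims))
            for j in range(n)]

def recommend(scores, k):
    """Return the top-k literature indices in descending score order."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
```

One row of the weighted author-literature matrix is one author's score vector; the recommendation list is simply its top-k entries.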
[Objective] This paper proposes a prediction model for post-operative infection based on a combined machine learning algorithm, aiming to effectively reduce surgical site infection risks. [Methods] First, we used SMOTE, ADASYN, and random oversampling to reduce the imbalance of the original data. Then, we combined five commonly used predictive models: Lasso, SVM, GBDT, ANN and RF, to create a hybrid prediction method. Finally, we used an improved artificial bee colony algorithm to optimize the weights of the multiple combinations. [Results] The G-mean and F1 values of the ABC combination strategy reached 0.7912 and 0.6693 respectively, which were 15.15% and 23.62% higher than those of the existing methods. [Limitations] The sample size used in the study needs to be expanded. [Conclusions] The proposed model can effectively predict post-operative infections.
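Of the three resampling techniques, random oversampling is the simplest; a plain-Python sketch is below. SMOTE and ADASYN instead synthesize interpolated minority samples (e.g. via the imbalanced-learn library), which this sketch does not attempt:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate samples of the smaller classes at random until
    every class has as many samples as the largest one."""
    rng = random.Random(seed)
    by_cls = {}
    for xi, yi in zip(X, y):
        by_cls.setdefault(yi, []).append(xi)
    n_max = max(len(v) for v in by_cls.values())
    Xb, yb = [], []
    for cls, samples in by_cls.items():
        extra = [rng.choice(samples) for _ in range(n_max - len(samples))]
        for xi in samples + extra:
            Xb.append(xi)
            yb.append(cls)
    return Xb, yb
```

Balancing the classes this way prevents the downstream classifiers from simply predicting the majority (non-infected) class.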
[Objective] This study builds a prediction model for drugs’ ADMET properties (Absorption, Distribution, Metabolism, Excretion, Toxicity), aiming to evaluate drugs in virtual screening. [Methods] We constructed a drug ADMET prediction model based on the Graph Attention Network (GAT). Then, we used the drug ADMET properties from open access databases and scientific publications to create their molecular graphs and structures. Finally, we compared the GAT-based model with three machine learning models and two graph neural network models. [Results] We collected 9 datasets with 149,457 ADMET records. The proposed prediction model had an average accuracy of 0.825 and an average F1-score of 0.672 on the 9 datasets, which were 6.4% and 26.0% higher than those of the baseline models. [Limitations] The data cleansing process needs to be refined, and the prediction performance could be further improved with a pre-training architecture. [Conclusions] The proposed model could effectively predict a drug’s ADMET properties, which could help virtual drug screening and computer-aided drug development.
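A single-head graph-attention aggregation of the kind GAT layers perform over a molecular graph can be sketched in plain Python. Toy dimensions, no learned parameters or multi-head concatenation; in practice libraries such as PyTorch Geometric provide this layer:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(h, adj, W, a):
    """One single-head graph-attention aggregation.
    h: node (atom) features, adj: adjacency matrix with self-loops,
    W: projection matrix, a: attention vector of length 2 * out_dim."""
    n, out_dim = len(h), len(W[0])
    # project node features: z_i = W h_i
    z = [[sum(h[i][k] * W[k][j] for k in range(len(W)))
          for j in range(out_dim)] for i in range(n)]
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        # attention logits e_ij = LeakyReLU(a . [z_i || z_j])
        e = [leaky_relu(sum(c * v for c, v in zip(a, z[i] + z[j]))) for j in nbrs]
        m = max(e)
        w = [math.exp(x - m) for x in e]  # softmax over neighbors
        s = sum(w)
        out.append([sum(wx / s * z[j][d] for wx, j in zip(w, nbrs))
                    for d in range(out_dim)])
    return out
```

With a zero attention vector, the layer degenerates to a mean over neighbors, which is a useful sanity check.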
[Objective] This paper constructs a model to predict 5-year survival rates for gastric cancer based on the SEER database, aiming to provide support for the prognosis of gastric cancer, as well as to analyze factors affecting patients’ 5-year survival rates. [Methods] With the help of an ensemble learning algorithm, especially the idea of EasyEnsemble, we handled the data imbalance issue by combining the data layer and the model layer. Then, we integrated multiple GradientBoosting classifiers with Bagging, and built a prediction model using the imbalanced gastric cancer survival data. Finally, we identified factors affecting 5-year survival of gastric cancer using SHAP values. [Results] Our new model’s prediction accuracy reached 0.808, with an AUC of 0.883. The prediction accuracy for the surviving-patient subcategory was 0.835. Compared with traditional models, our method yielded better prediction rates. We also found that regional nodes positive, summary stage/grade, and age had higher SHAP values. [Limitations] The prognostic factors available from the SEER database were limited, which influenced our model’s performance. [Conclusions] The new model could effectively predict survival rates for gastric cancer, and identify factors influencing patients’ 5-year survival probability.
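The EasyEnsemble-style sampling at the data layer can be sketched as follows. This is a simplified version in which each subset keeps all minority samples plus an equal-size random draw from the majority class; in the paper, each subset would then train a GradientBoosting classifier whose predictions are bagged:

```python
import random

def easy_ensemble_subsets(X, y, n_subsets=3, seed=0):
    """EasyEnsemble-style sampling: build n_subsets balanced training
    sets, each with all minority samples (label 1) and an equal-size
    random undersample of the majority class (label 0)."""
    rng = random.Random(seed)
    minority = [i for i, yi in enumerate(y) if yi == 1]
    majority = [i for i, yi in enumerate(y) if yi == 0]
    subsets = []
    for _ in range(n_subsets):
        picked = rng.sample(majority, len(minority))
        idx = minority + picked
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets
```

Unlike plain undersampling, the ensemble of subsets lets the majority class contribute most of its samples across the bagged base learners.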
[Objective] This paper predicts the number of music playlist playbacks and explores the influencing factors, aiming to help online music platforms evaluate the quality of music playlists. [Methods] First, we used a web crawler to retrieve the numerical and text features of music playlists from NetEase Cloud Music. Then, we pre-trained the texts with Word2Vec and BERT. Third, we established RF, XGBoost and DNN models to predict the number of playbacks. [Results] We found the prediction accuracy of the DNN was higher than those of RF and XGBoost. The numbers of initial playbacks, comments, favorites and forwardings of a playlist had the most significant impacts on its number of playbacks. However, the text features reduced the prediction accuracy. [Limitations] NetEase Cloud Music updates every day; therefore, we only examined the playback data collected 12 hours after the updates. [Conclusions] This study could help online music websites preliminarily judge the popularity of their music playlists.
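Combining the numerical playlist features with pre-trained text embeddings can be sketched as below, with mean-pooled word vectors standing in for the Word2Vec/BERT text features. Function names and the pooling choice are illustrative, not taken from the paper:

```python
def avg_word_vector(tokens, w2v, dim):
    """Mean-pool pretrained word vectors for a playlist title/tags;
    fall back to a zero vector if no token is in the vocabulary."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def build_features(numeric, tokens, w2v, dim):
    """Concatenate numerical features (plays, comments, favorites,
    forwardings, ...) with the pooled text embedding."""
    return numeric + avg_word_vector(tokens, w2v, dim)
```

The combined vector is what the RF, XGBoost, or DNN regressor would consume.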
[Objective] This paper builds an automatic scoring system for subjective questions in the maritime competency assessment, aiming to reduce the heavy workload and human factors of subjective question scoring. [Methods] Firstly, we used a TextRank algorithm weighted by dependency syntax analysis to extract keywords. Then, we integrated sentence vectors, core words, syntactic components, and dependency structures to judge the similarity between student answers and the standard ones. Third, we constructed a set of special negative words for maritime affairs to judge semantic opposition between a student’s answer and the standard answer. Finally, we gave each answer an objective score. [Results] We examined our method with multiple sets of different subjective questions, and found the average difference between the automatic and manual scores was 0.21, with a deviation rate of 4.20%. [Limitations] More research is needed to improve the processing of long and complex sentences. [Conclusions] The proposed algorithm could effectively evaluate subjective questions in the maritime competency assessment.
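The unweighted core of TextRank keyword extraction can be sketched in plain Python; the paper additionally weights co-occurrence edges by dependency-syntax relations, which this sketch omits:

```python
def textrank(words, window=2, d=0.85, iters=50):
    """Plain TextRank: build a word co-occurrence graph over a
    sliding window, then run PageRank-style score propagation and
    return words ranked by final score (highest first)."""
    nodes = sorted(set(words))
    nbrs = {w: set() for w in nodes}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                nbrs[w].add(words[j])
                nbrs[words[j]].add(w)
    score = {w: 1.0 for w in nodes}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[u] / len(nbrs[u])
                                      for u in nbrs[w] if nbrs[u])
                 for w in nodes}
    return sorted(nodes, key=lambda w: -score[w])
```

The top-ranked words serve as the keywords against which student answers are compared.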
[Objective] This paper proposes a talent evaluation model with multi-dimensional indicators, as well as diversified standards and subjects. [Methods] We designed quantitative indicators from the perspectives of academic contribution and research potential based on scholarly achievements, research projects, peer cooperation, and practical applications. [Results] The proposed model can combine indicators and adjust their weights. We also designed a data-driven procedure for the multi-agent evaluation model. [Limitations] This research is still in the theoretical development stage and requires more experiments with large-scale data. [Conclusions] Our model provides multi-dimensional portraits and evaluation methods for talents, which improves the evaluation mechanism and fosters an academic ecosystem for innovation.
[Objective] This paper merges the artificial and machine features of scientific and technological literature with the help of deep learning methods, aiming to improve the efficiency of knowledge element extraction. [Methods] We constructed 26 artificial features based on the characteristics of the literature, mainly covering the text, sentence and word levels. Then, we combined these features with Word2Vec, one-hot and other machine features using LSTM, CNN and BERT models, and extracted the knowledge elements. [Results] The accuracy of vertical feature merging for knowledge element extraction reached 0.91, which was 6 percentage points higher than the performance of most traditional methods. [Limitations] The deep learning model needs to be optimized to process larger amounts of data. [Conclusions] The proposed method could effectively improve the results of knowledge element extraction.
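At the input level, the vertical merging of handcrafted and machine features reduces to concatenating feature vectors; a minimal sketch, with a one-hot encoding as a stand-in machine feature (real inputs would be Word2Vec or BERT embeddings):

```python
def one_hot(index, size):
    """One-hot machine feature for a categorical value."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def merge_features(machine_vec, artificial_vec):
    """'Vertical' merge: concatenate the machine-learned feature
    vector with the handcrafted (artificial) feature vector into
    one input for the downstream LSTM/CNN/BERT classifier."""
    return machine_vec + artificial_vec
```

The 26 artificial features would occupy the tail of each merged vector, alongside the learned embedding dimensions.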