[Objective] This article addresses the difficulty of relation extraction caused by overlapping entity relations in military texts. [Methods] We used the BERT model as the encoder for the input texts, and used a hierarchical reinforcement learning approach to decode relations and their corresponding entities. Then, we merged relational position features into the entity decoding process to construct a relation extraction model for the military domain. [Results] The F1 value reached 82.2% on the military weapon and equipment dataset, about 8% higher than those of other methods. On the publicly available NYT10 and NYT10-sub datasets, the F1 values reached 71.8% and 69.0%, about 7% and 9% higher than those of other methods, respectively. [Limitations] The new method’s extraction performance is better on manually annotated datasets. More research is needed to improve its performance on noisy, distantly supervised datasets. [Conclusions] The HBP method could effectively extract relations in the military domain, and has some generalization potential.
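As a toy illustration of the relational position features merged into entity decoding, the sketch below maps each token to its clipped relative distance from the position of the predicted relation trigger. The function name, clipping window, and offset scheme are illustrative assumptions, not details taken from the paper:

```python
def relation_position_features(seq_len, trigger_idx, max_dist=10):
    """For each token, compute its relative distance to the relation
    trigger, clipped to [-max_dist, max_dist] and shifted to be
    non-negative so it can index a position-embedding table."""
    feats = []
    for i in range(seq_len):
        d = max(-max_dist, min(max_dist, i - trigger_idx))
        feats.append(d + max_dist)
    return feats
```

In a full model, these indices would look up a learned embedding that is concatenated with the BERT token encodings before entity tagging.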
[Objective] This paper proposes a name disambiguation method for scientific literature, aiming to distinguish scholars with the same name. Existing solutions utilize document feature extraction or the relationships between documents and co-authors, which loses higher-order attributes. [Methods] First, we established a unified feature extraction framework, the Paper Embedding Network (PaperEmbNet), which combined content and relationship information to build an academic heterogeneous information network for each author. Then, we designed a clustering-parameter estimation method (AR4CPM) based on the Attentive Recurrent Neural Network to estimate the cluster number directly. Finally, we used the hierarchical agglomerative clustering (HAC) algorithm to disambiguate author names with the predicted number as the preset parameter. [Results] We examined the proposed model with the AMiner-AND dataset and found its macro-F1 score was up to 4.75% higher than that of the suboptimal model, while the average training time was 5-10 minutes shorter than those of the existing baselines. [Limitations] We need to evaluate the performance of the proposed method in multilingual environments. [Conclusions] The proposed approach could effectively conduct name disambiguation tasks.
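The final step, hierarchical agglomerative clustering with a preset cluster number, can be sketched in plain Python. This is a minimal single-linkage variant on raw points; the actual pipeline clusters PaperEmbNet paper embeddings with the AR4CPM-predicted cluster number as k:

```python
import math

def hac(points, k):
    """Single-linkage agglomerative clustering: repeatedly merge the
    two closest clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(math.dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters
```

Each returned cluster would correspond to one real-world author among the same-name candidates.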
[Objective] This paper aims to accurately extract scientific citations and their context data, which could significantly improve the results of citation analysis. [Methods] We divided the citation extraction task into citation sentence extraction, citation context identification, and citation metadata extraction. Then, we proposed a coreference resolution-based method to identify and extract scientific citation contexts. [Results] We examined our method with Chinese sequentially coded periodicals and extracted the citation sentences and references correctly. The F1 value for identifying the citation context was between 0.780 and 0.849. [Limitations] Due to the limits of the Chinese scientific citation corpus and the small scale of the experimental data, the proposed method might not work effectively in other fields. [Conclusions] Our study optimizes the steps of citation content analysis and enlarges the data scope. It provides support for researchers of citation content analysis.
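For the first subtask, citation sentence extraction, a minimal sketch is to match numeric citation markers. The regular expression below assumes bracketed markers such as [3] or [1,2] and is purely illustrative of the subtask, not the paper's actual extraction rules:

```python
import re

# bracketed numeric citation markers: [3], [1,2], [10-12]
CITE = re.compile(r"\[\d+(?:[,-]\d+)*\]")

def citation_sentences(sentences):
    """Keep only sentences that contain a numeric citation marker."""
    return [s for s in sentences if CITE.search(s)]
```

The retained sentences would then feed the coreference-based context identification step.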
[Objective] This paper tries to reduce the manual annotation needed to extract tables with complicated headers from PDF documents. [Methods] First, we identified the table cell structure based on line segments and represented the cell contents with word vectors. Then, we calculated the word vector similarity of the table content in each line. Finally, we separated the table headers from the contents. [Results] We examined our method on a self-built PDF table dataset. The F1 value of table information extraction was 98.07%, and the F1 value of table content division exceeded 99%. These results are close to those of deep learning text classification models, which require a large amount of annotated corpus. [Limitations] Our method can only extract relational tables, and cannot be applied to scanned PDF documents. [Conclusions] The proposed method can automatically extract PDF tables with complicated headers.
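The idea of separating headers from contents by line-wise word-vector similarity can be sketched as follows. This simplified version cuts the table at the largest drop in adjacent-row cosine similarity, with each row vector standing in for the averaged word vectors of that line; the split heuristic is an assumption for illustration:

```python
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def split_header(rows):
    """Cut the table at the largest drop in adjacent-row similarity:
    rows above the cut are treated as header, rows below as content."""
    sims = [cosine(rows[i], rows[i + 1]) for i in range(len(rows) - 1)]
    cut = min(range(len(sims)), key=lambda i: sims[i]) + 1
    return rows[:cut], rows[cut:]
```

Header rows tend to be lexically similar to each other and dissimilar to the data rows, which is what the similarity drop detects.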
[Objective] This paper tries to recognize one-to-many entity relationship instances (such as inclusion and coordination relations) from sentences using a small number of samples, aiming to realize continual learning with new data. [Methods] First, we generated the one-to-many inclusion and coordination entities from sentences using LaserTagger. Then, with the help of position embedding and a weighted loss, our model captured more features from limited data. Finally, the model achieved continual learning through model compression and expansion. [Results] Our approach’s SARI was 1% better than those of the baseline models in all tests. Model compression and expansion effectively retained the knowledge learned from previous data, and the SARI was about 16.92% higher than those of the baseline models. [Limitations] More research is needed to examine the proposed method with more complex datasets. [Conclusions] Our study could effectively identify entity relationships with a small amount of training data from different categories.
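LaserTagger casts text generation as tagging each input token with an edit operation rather than generating from scratch, which is what makes it sample-efficient. A minimal sketch of applying such tags, with a tag vocabulary (KEEP, DELETE, ADD_<word>) simplified relative to the real model:

```python
def apply_edit_tags(tokens, tags):
    """Apply LaserTagger-style edit tags to a token sequence:
    KEEP copies the token, DELETE drops it, and ADD_<word>
    inserts <word> before copying the token."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag == "KEEP":
            out.append(tok)
        elif tag == "DELETE":
            continue
        elif tag.startswith("ADD_"):
            out.append(tag[4:])
            out.append(tok)
    return " ".join(out)
```

Predicting tags instead of free-form text keeps the output space small, so far fewer training samples are needed.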
[Objective] This study uses a heterogeneous information network and author preferences to improve the performance of scientific literature recommendation. [Methods] We proposed a new method using various kinds of semantic information. Firstly, we weighted the meta paths in the heterogeneous information network of the scientific literature with the help of author preferences. Secondly, we used the DPRel algorithm to calculate the correlation between the author and the literature. Finally, we constructed the weighted author-literature matrix, and retrieved the recommendation list in descending order of correlation. [Results] We examined our model with datasets from the Web of Science. Compared with single meta-path methods, the average successful recommendation rate of the new algorithm was 6%, 8% and 6% higher on the three datasets, and the improvement rate of successful recommendation was 14.8%, 27.6% and 13.0%, respectively. [Limitations] In the data preprocessing stage, the keywords were unified manually, which is unrealistic for massive datasets. [Conclusions] The proposed method could effectively improve the quality of scientific literature recommendation.
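The last two steps, weighting per-meta-path similarities by author preference and ranking literature in descending order of correlation, can be sketched as below. The function names and the linear weighting are illustrative assumptions, not the published DPRel formula:

```python
def weighted_scores(meta_path_sims, weights):
    """Combine per-meta-path author-literature similarity rows with
    author-preference weights (an assumed linear stand-in for the
    DPRel correlation computation)."""
    n = len(meta_path_sims[0])
    return [sum(w * sims[j] for w, sims in zip(weights, meta_path_sims))
            for j in range(n)]

def recommend(scores, k):
    """Return the top-k literature indices in descending score order."""
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]
```

One row of the weighted author-literature matrix is one author's score vector; the recommendation list is simply its top-k entries.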
[Objective] This paper proposes a prediction model for post-operative infection based on a combined machine learning algorithm, aiming to effectively reduce surgical site infection risks. [Methods] First, we used SMOTE, ADASYN, and random oversampling to reduce the imbalance of the original data. Then, we combined five commonly used predictive models: Lasso, SVM, GBDT, ANN and RF, to create a hybrid prediction method. Finally, we used an improved artificial bee colony algorithm to optimize the weights of the multiple combinations. [Results] The G-mean and F1 values of the ABC combination strategy reached 0.7912 and 0.6693 respectively, which were 15.15% and 23.62% higher than those of the existing methods. [Limitations] The sample size used in the study needs to be expanded. [Conclusions] The proposed model can effectively predict post-operative infections.
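Of the three resampling techniques, random oversampling is the simplest; a plain-Python sketch is below. SMOTE and ADASYN instead synthesize interpolated minority samples (e.g. via the imbalanced-learn library), which this sketch does not attempt:

```python
import random

def random_oversample(X, y, seed=0):
    """Duplicate samples of the smaller classes at random until
    every class has as many samples as the largest one."""
    rng = random.Random(seed)
    by_cls = {}
    for xi, yi in zip(X, y):
        by_cls.setdefault(yi, []).append(xi)
    n_max = max(len(v) for v in by_cls.values())
    Xb, yb = [], []
    for cls, samples in by_cls.items():
        extra = [rng.choice(samples) for _ in range(n_max - len(samples))]
        for xi in samples + extra:
            Xb.append(xi)
            yb.append(cls)
    return Xb, yb
```

Balancing the classes this way prevents the downstream classifiers from simply predicting the majority (non-infected) class.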
[Objective] This study builds a prediction model for drugs’ ADMET properties (Absorption, Distribution, Metabolism, Excretion, Toxicity), aiming to evaluate drugs in virtual screening. [Methods] We constructed a drug ADMET prediction model based on the Graph Attention Network (GAT). Then, we used the drug ADMET properties from open access databases and scientific publications to create their molecular graphs and structures. Finally, we compared the GAT-based model with three machine learning models and two graph neural network models. [Results] We collected 9 datasets with 149,457 ADMET records. The proposed prediction model had an average accuracy of 0.825 and an average F1-score of 0.672 on the 9 datasets, which were 6.4% and 26.0% higher than those of the baseline models. [Limitations] The data cleansing process needs to be refined, and the prediction performance could be further improved with a pre-training architecture. [Conclusions] The proposed model could effectively predict a drug’s ADMET properties, which could help virtual drug screening and computer-aided drug development.
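A single-head graph-attention aggregation of the kind GAT layers perform over a molecular graph can be sketched in plain Python. Toy dimensions, no learned parameters or multi-head concatenation; in practice libraries such as PyTorch Geometric provide this layer:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_layer(h, adj, W, a):
    """One single-head graph-attention aggregation.
    h: node (atom) features, adj: adjacency matrix with self-loops,
    W: projection matrix, a: attention vector of length 2 * out_dim."""
    n, out_dim = len(h), len(W[0])
    # project node features: z_i = W h_i
    z = [[sum(h[i][k] * W[k][j] for k in range(len(W)))
          for j in range(out_dim)] for i in range(n)]
    out = []
    for i in range(n):
        nbrs = [j for j in range(n) if adj[i][j]]
        # attention logits e_ij = LeakyReLU(a . [z_i || z_j])
        e = [leaky_relu(sum(c * v for c, v in zip(a, z[i] + z[j]))) for j in nbrs]
        m = max(e)
        w = [math.exp(x - m) for x in e]  # softmax over neighbors
        s = sum(w)
        out.append([sum(wx / s * z[j][d] for wx, j in zip(w, nbrs))
                    for d in range(out_dim)])
    return out
```

With a zero attention vector, the layer degenerates to a mean over neighbors, which is a useful sanity check.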
[Objective] This paper constructs a model to predict 5-year survival rates for gastric cancer based on the SEER database, aiming to provide support for the prognosis of gastric cancer, as well as to analyze factors affecting patients’ 5-year survival rates. [Methods] With the help of an ensemble learning algorithm, especially the idea of EasyEnsemble, we handled the data imbalance issue by combining the data layer and the model layer. Then, we integrated multiple GradientBoosting classifiers with Bagging, and built a prediction model using the imbalanced gastric cancer survival data. Finally, we identified factors affecting 5-year survival of gastric cancer using SHAP values. [Results] Our new model’s prediction accuracy reached 0.808, with an AUC of 0.883. The prediction accuracy for the surviving-patient subcategory was 0.835. Compared with traditional models, our method yielded better prediction rates. We also found that regional nodes positive, summary stage/grade, and age had higher SHAP values. [Limitations] The prognostic factors available from the SEER database were limited, which influenced our model’s performance. [Conclusions] The new model could effectively predict survival rates for gastric cancer, and identify factors influencing patients’ 5-year survival probability.
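The EasyEnsemble-style sampling at the data layer can be sketched as follows. This is a simplified version in which each subset keeps all minority samples plus an equal-size random draw from the majority class; in the paper, each subset would then train a GradientBoosting classifier whose predictions are bagged:

```python
import random

def easy_ensemble_subsets(X, y, n_subsets=3, seed=0):
    """EasyEnsemble-style sampling: build n_subsets balanced training
    sets, each with all minority samples (label 1) and an equal-size
    random undersample of the majority class (label 0)."""
    rng = random.Random(seed)
    minority = [i for i, yi in enumerate(y) if yi == 1]
    majority = [i for i, yi in enumerate(y) if yi == 0]
    subsets = []
    for _ in range(n_subsets):
        picked = rng.sample(majority, len(minority))
        idx = minority + picked
        subsets.append(([X[i] for i in idx], [y[i] for i in idx]))
    return subsets
```

Unlike plain undersampling, the ensemble of subsets lets the majority class contribute most of its samples across the bagged base learners.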
[Objective] This paper predicts the number of music playlist playbacks and explores the influencing factors, aiming to help online music platforms evaluate the quality of music playlists. [Methods] First, we used a web crawler to retrieve the numerical and text features of music playlists from NetEase Cloud Music. Then, we pre-trained the texts with Word2Vec and BERT. Third, we established RF, XGBoost and DNN models to predict the number of playbacks. [Results] We found the prediction accuracy of the DNN was higher than those of RF and XGBoost. The numbers of initial playbacks, comments, favorites and forwardings of a playlist had the most significant impacts on its number of playbacks. However, the text features reduced the prediction accuracy. [Limitations] NetEase Cloud Music updates every day; therefore, we only examined the playback data collected 12 hours after the updates. [Conclusions] This study could help online music websites preliminarily judge the popularity of their music playlists.
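Combining the numerical playlist features with pre-trained text embeddings can be sketched as below, with mean-pooled word vectors standing in for the Word2Vec/BERT text features. Function names and the pooling choice are illustrative, not taken from the paper:

```python
def avg_word_vector(tokens, w2v, dim):
    """Mean-pool pretrained word vectors for a playlist title/tags;
    fall back to a zero vector if no token is in the vocabulary."""
    vecs = [w2v[t] for t in tokens if t in w2v]
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def build_features(numeric, tokens, w2v, dim):
    """Concatenate numerical features (plays, comments, favorites,
    forwardings, ...) with the pooled text embedding."""
    return numeric + avg_word_vector(tokens, w2v, dim)
```

The combined vector is what the RF, XGBoost, or DNN regressor would consume.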
[Objective] This paper builds an automatic scoring system for subjective questions in the maritime competency assessment, aiming to reduce the heavy workload and human factors of subjective question scoring. [Methods] Firstly, we used a TextRank algorithm weighted by dependency syntax analysis to extract keywords. Then, we integrated sentence vectors, core words, syntactic components, and dependency structures to judge the similarity between student answers and the standard ones. Third, we constructed a set of special negative words for maritime affairs to judge semantic opposition between a student’s answer and the standard answer. Finally, we gave each answer an objective score. [Results] We examined our method with multiple sets of different subjective questions, and found the average difference between the automatic and manual scores was 0.21, with a deviation rate of 4.20%. [Limitations] More research is needed to improve the processing of long and complex sentences. [Conclusions] The proposed algorithm could effectively evaluate subjective questions in the maritime competency assessment.
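The unweighted core of TextRank keyword extraction can be sketched in plain Python; the paper additionally weights co-occurrence edges by dependency-syntax relations, which this sketch omits:

```python
def textrank(words, window=2, d=0.85, iters=50):
    """Plain TextRank: build a word co-occurrence graph over a
    sliding window, then run PageRank-style score propagation and
    return words ranked by final score (highest first)."""
    nodes = sorted(set(words))
    nbrs = {w: set() for w in nodes}
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if words[j] != w:
                nbrs[w].add(words[j])
                nbrs[words[j]].add(w)
    score = {w: 1.0 for w in nodes}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[u] / len(nbrs[u])
                                      for u in nbrs[w] if nbrs[u])
                 for w in nodes}
    return sorted(nodes, key=lambda w: -score[w])
```

The top-ranked words serve as the keywords against which student answers are compared.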
[Objective] This paper proposes a talent evaluation model with multi-dimensional indicators, as well as diversified standards and subjects. [Methods] We designed quantitative indicators from the perspectives of academic contribution and research potential based on scholarly achievements, research projects, peer cooperation, and practical applications. [Results] The proposed model can combine indicators and adjust their weights. We also designed a data-driven procedure for the multi-agent evaluation model. [Limitations] This research is still in the theoretical development stage and requires more experiments with large-scale data. [Conclusions] Our model provides multi-dimensional portraits and evaluation methods for talents, which improves the evaluation mechanism and fosters an academic ecosystem for innovation.
[Objective] This paper merges the artificial and machine features of scientific and technological literature with the help of deep learning methods, aiming to improve the efficiency of knowledge element extraction. [Methods] We constructed 26 artificial features based on the characteristics of the literature, mainly covering the text, sentence and word levels. Then, we combined these features with Word2Vec, one-hot and other machine features using LSTM, CNN and BERT models, and extracted the knowledge elements. [Results] The accuracy of vertical feature merging for knowledge element extraction reached 0.91, which was 6 percentage points higher than the performance of most traditional methods. [Limitations] The deep learning model needs to be optimized to process larger amounts of data. [Conclusions] The proposed method could effectively improve the results of knowledge element extraction.
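At the input level, the vertical merging of handcrafted and machine features reduces to concatenating feature vectors; a minimal sketch, with a one-hot encoding as a stand-in machine feature (real inputs would be Word2Vec or BERT embeddings):

```python
def one_hot(index, size):
    """One-hot machine feature for a categorical value."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def merge_features(machine_vec, artificial_vec):
    """'Vertical' merge: concatenate the machine-learned feature
    vector with the handcrafted (artificial) feature vector into
    one input for the downstream LSTM/CNN/BERT classifier."""
    return machine_vec + artificial_vec
```

The 26 artificial features would occupy the tail of each merged vector, alongside the learned embedding dimensions.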