[Objective] This paper reviews construction methods for domain-specific event graphs, aiming to facilitate future research.[Coverage] We searched Web of Science and Google Scholar with the terms “Event Graph”, “Event Extraction” and “Event Relation”, and retrieved a total of 61 representative publications.[Methods] We summarized the definition, construction process and extraction methods through a literature review. Then, we discussed rule-based, feature-learning-based, and neural-network-based extraction techniques. Finally, we analyzed their feature selection procedures, model architectures and experimental results.[Results] Referring to general knowledge graph construction methods, we proposed a process model that includes trigger, argument and relation recognition. We briefly described construction standards covering structure, domain, event form, inference ability and temporal relations. In practice, we found that ontology reuse is necessary and that neural-network-based methods are the best choice.[Limitations] We did not use the same dataset to evaluate all methods.[Conclusions] We proposed knowledge-boosted methods, transfer learning and cognitive models for future studies.
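The rule-based extraction techniques discussed above can be illustrated with a minimal trigger-recognition sketch; the lexicon, event types and sentence below are invented for illustration and are not drawn from any surveyed system.

```python
import re

def extract_triggers(sentence, trigger_lexicon):
    """Rule-based event trigger recognition: match lexicon entries as whole
    words and return (trigger, event_type) pairs. The lexicon is illustrative."""
    found = []
    for trigger, event_type in trigger_lexicon.items():
        if re.search(r"\b" + re.escape(trigger) + r"\b", sentence):
            found.append((trigger, event_type))
    return found

# toy trigger lexicon mapping surface forms to event types
lexicon = {"acquired": "Merger", "resigned": "Personnel", "fired": "Personnel"}
events = extract_triggers("The CEO resigned after the firm acquired a rival.", lexicon)
```

A real system would add argument recognition (who resigned, what was acquired) and relation recognition between the extracted events, as in the process model above.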
[Objective] This article reviews the theory, research progress and potential applications of measuring the uncertainty of medical knowledge in scientific publications.[Coverage] We searched PubMed, Web of Science, Microsoft Academic, CNKI, and Wanfang Data for English and Chinese publications with 1) the keywords “uncertain* AND knowledge AND *medical” in titles, and 2) the cited reference “Representing Scientific Knowledge: The Role of Uncertainty”.[Methods] First, we categorized the retrieved literature into computational linguistics and informetrics studies. Then, we summarized their research designs, data analytics and conclusions.[Results] The notion of paradigm shift and Bayesian causal networks were the foundations for measuring the uncertainty of medical knowledge. The latest developments included: identifying uncertainty cues in biomedical literature; extracting structured knowledge from unstructured biomedical texts; and measuring the uncertainty level of scientific text represented as Subject-Predicate-Object (SPO) triples.[Limitations] Our discussion focused on Data-Information-Knowledge-Wisdom driven research, such as information science, knowledge engineering and artificial intelligence.[Conclusions] The uncertainty of scientific knowledge and its evolution over time indirectly reflect the strength of competing knowledge claims, their contribution to filling knowledge gaps, and the probability that a given knowledge claim is certain. Measuring this uncertainty will promote the development of informetrics and knowmetrics, as well as their applications in emerging fields such as detecting research fronts, evaluating academic contributions and improving the efficacy of computable-knowledge-driven decision support.
[Objective] This paper proposes a modified translation model (TransTopic) to predict research cooperation, aiming to promote exchanges among researchers and maximize efficiency.[Methods] We used TransTopic to uniformly map the nodes and edges of the scientific research cooperation network to low-dimensional vectors. First, we used the LDA model to extract topic distribution features from stem cell papers. Then, we converted topic features into edge vectors with a deep autoencoder and obtained node vectors based on the translation mechanism. Finally, we predicted scientific cooperation through semantic calculation between the vectors.[Results] TransTopic’s AUC (95.21%) and MeanRank (17.48) for link prediction were better than those of existing models, and its topic prediction accuracy reached 86.52%.[Limitations] The proposed method only considered a one-step translation path, and did not fully utilize information such as authors’ institutions, research interests, and publication levels.[Conclusions] The proposed translation-model-based method could effectively predict research cooperation in the field of stem cells.
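The translation mechanism above follows the TransE idea that head + relation ≈ tail for a plausible edge. A minimal sketch of the scoring used in link prediction, with toy three-dimensional vectors rather than the trained TransTopic embeddings:

```python
def translate_score(head, relation, tail):
    """L2 distance ||h + r - t||; smaller means a more plausible link."""
    return sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)) ** 0.5

def rank_candidates(head, relation, candidates):
    """Rank candidate tail nodes (e.g., potential collaborators) by score."""
    return sorted(candidates,
                  key=lambda name_vec: translate_score(head, relation, name_vec[1]))

h = [0.1, 0.2, 0.3]        # node vector of author A (illustrative)
r = [0.05, 0.05, 0.0]      # edge vector derived from topic features
cands = [("B", [0.15, 0.25, 0.3]), ("C", [0.9, 0.9, 0.9])]
best = rank_candidates(h, r, cands)[0][0]   # "B": h + r lands almost exactly on B
```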
[Objective] This paper develops a neural network model to improve online question answering services.[Methods] First, we retrieved and constructed our experimental dataset from the Yahoo Answers and Yahoo! L6 platforms. Then, we proposed a neural network model (CNMNN) based on a semantic matching matrix, a variable-size convolutional layer, and a multi-layer perceptron. Finally, we compared the results of our model with the MQ2QC, IBLM, DRMM and MatchPyramid methods.[Results] The proposed model was 45.0%, 38.7%, 33.4%, 34.8% and 52.9% higher than the best baseline results on the relevance metrics nDCG@5, nDCG@10, nDCG@20, MRR and MAP. It also gained 31.5%, 23.6%, 25.5%, 38.1%, 36.9% and 30.7% improvements on the diversity metrics α-nDCG@5, α-nDCG@10, α-nDCG@20, ERR-IA@5, ERR-IA@10 and ERR-IA@20.[Limitations] We did not include a new method to further diversify the results.[Conclusions] The new CNMNN model can effectively calculate the semantic relevance between queries and natural language questions at the phrase level. It also avoids the feature signal compression caused by hierarchical convolution operations.
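A semantic matching matrix of the kind CNMNN builds on can be sketched as a word-by-word cosine similarity grid between a query and a candidate question; the two-dimensional vectors below are toy stand-ins for real word embeddings.

```python
def cosine(u, v):
    """Cosine similarity of two word vectors; 0.0 if either is a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def matching_matrix(query_vecs, question_vecs):
    """Word-by-word similarity grid that the convolutional layers consume."""
    return [[cosine(q, w) for w in question_vecs] for q in query_vecs]

query = [[1.0, 0.0], [0.0, 1.0]]       # two query words (toy embeddings)
question = [[1.0, 0.0], [1.0, 1.0]]    # two question words
m = matching_matrix(query, question)
```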
[Objective] This paper constructs a topic graph for Weibo users, aiming to identify the characteristics of user groups and opinion leaders. It also tries to help guide online public opinion and reduce surveillance costs.[Methods] First, we built a processing model for the topic graph of Weibo users based on LDA. Then, we determined the optimal number and distribution of users’ topics with the perplexity index. Third, we used JS divergence to measure the similarity of user topics and constructed the topic graph. Finally, we used data on the “Egypt air disaster” to examine the proposed method.[Results] The topic graph generated by LDA clustered the user topics and identified the opinion leaders.[Limitations] More research is needed to determine the optimal number of LDA topics.[Conclusions] The proposed method could help us identify the characteristics of different topic groups and their opinion leaders.
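The JS divergence step can be sketched directly from its definition; the three-topic user distributions below are illustrative, not taken from the Weibo experiment.

```python
from math import log

def kl(p, q):
    """Kullback-Leibler divergence, skipping zero-probability terms in p."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log(2);
    0 means identical topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

user_a = [0.7, 0.2, 0.1]   # per-topic probabilities from LDA (toy values)
user_b = [0.6, 0.3, 0.1]
user_c = [0.1, 0.1, 0.8]
# a and b share interests, so their divergence is the smaller one
close = js_divergence(user_a, user_b) < js_divergence(user_a, user_c)
```

Low JS divergence between two users would put an edge between them in the topic graph; highly connected users are candidate opinion leaders.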
[Objective] This paper proposes a new semi-supervised method for text classification, aiming to efficiently process texts with only a small amount of annotation.[Methods] The proposed DW-TCI-based method used double-channel feature extraction to obtain two sets of feature input vectors for the base classifier group. Then, we introduced a divergence-based semi-supervised classification method and the idea of ensemble learning. Finally, we trained our model with unlabeled samples and obtained the classification results of the predicted texts with an equally weighted voting method.[Results] We examined our method with two different datasets having 20% labeled samples. The classification accuracy reached 92.32% and 87.01%, which were at least 5.54% and 5.65% higher than those of similar methods.[Limitations] The sample datasets need to be expanded.[Conclusions] The proposed method could reduce the labeling workload of training samples and provide effective support for better text classification.
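The final voting step can be sketched as each base classifier contributing one equal vote per sample; the class labels and classifier outputs below are invented for illustration.

```python
from collections import Counter

def weighted_vote(predictions):
    """Equally weighted voting across the base classifier group:
    every classifier contributes one vote per sample, majority wins."""
    results = []
    for sample_preds in zip(*predictions):   # transpose: per-sample votes
        results.append(Counter(sample_preds).most_common(1)[0][0])
    return results

# three base classifiers, four test samples (toy labels)
clf_outputs = [
    ["sports", "tech", "tech", "news"],
    ["sports", "tech", "news", "news"],
    ["news",   "tech", "tech", "tech"],
]
final = weighted_vote(clf_outputs)
```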
[Objective] This paper proposes a model to detect the topics of trending news stories, aiming to improve the user experience of news reading.[Methods] We modified the TF-IDF method with balanced paragraph weighting (WTF-IDF). We also improved the K-means clustering model with sub-topic vectors in hierarchical clustering. Finally, we extracted high-frequency words from titles with the new model.[Results] The F1 value of our model was 5.4% higher than that of the TF-IDF method (with three extracted keywords). The hierarchical clustering accuracy based on WTF-IDF and sub-topic vectors was 3.1% higher than that of single-layer K-means clustering.[Limitations] Our model does not include a phrase extraction method, and the hierarchical clustering method is complex.[Conclusions] The proposed method could effectively detect topics of trending news reports.
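One reading of paragraph-weighted TF-IDF is that a term's frequency is scaled by the weight of the paragraph it occurs in. The sketch below uses an invented weighting (the first, title-like paragraph weighted 2.0) rather than the paper's balanced paragraph weights.

```python
from math import log

def wtf_idf(docs, para_weight):
    """TF-IDF where term frequency is weighted per paragraph
    (para_weight maps paragraph index -> weight; default 1.0).
    The weighting scheme here is illustrative only."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in {t for para in doc for t in para}:
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        tf = {}
        for i, para in enumerate(doc):
            w = para_weight.get(i, 1.0)
            for t in para:
                tf[t] = tf.get(t, 0.0) + w
        scores.append({t: f * log(n / df[t]) for t, f in tf.items()})
    return scores

# two documents, each a list of tokenized paragraphs (paragraph 0 = title)
docs = [
    [["storm", "warning"], ["storm", "city"]],
    [["match", "result"], ["city", "match"]],
]
s = wtf_idf(docs, {0: 2.0})
```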
[Objective] This paper proposes a social network image privacy classifier based on transfer learning, which provides reasonable hints to prevent users from accidentally uploading private information.[Methods] A new standard image dataset was created by gathering and annotating images from the Weibo platform. Deep transfer learning and fine-tuning of various pre-trained image models were applied to automatically classify whether Weibo images contain private information.[Results] With the same amount of data, the accuracy of transfer learning improved by at least 30 percent compared with non-transfer-learning approaches. Most ResNet architectures achieved more than 88% accuracy with transfer learning. Among them, ResNet50 had the highest recall (94.31%), accuracy (90.80%) and F1 value (91.11%), and the shortest testing time (148s). After a comprehensive comparison of the above metrics, it was selected and recommended as the most suitable model for the current scenario.[Limitations] The amount of labeled data in this study is relatively small and may not cover all types of private information.[Conclusions] This study validates the feasibility and efficiency of deep transfer learning for classifying private Weibo images. The results can be applied to various social media platforms to warn users about the risk of privacy leaks. The annotated image dataset can serve as both a foundation and a benchmark for further research.
[Objective] This paper designs a cross-language patent recommendation model based on text semantic representation, aiming to reduce the reliance on bilingual dictionaries and large-scale corpora, as well as improve domain adaptation.[Methods] First, we designed a word vector mapping method with an unsupervised cross-language algorithm. Then, we mapped Chinese and English word vectors into a unified semantic vector space with a linear transformation, which established the semantic mapping between Chinese and English words. Third, we created semantic representations of patent texts from the cross-language word vectors with the smooth inverse frequency (SIF) reweighting method, which placed Chinese and English patent texts in the same vector space. Finally, we calculated the semantic similarity between patent texts and recommended cross-language patents.[Results] We examined the proposed method with patents on “wireless communication”; the recommendation accuracy of the top 1 and top 5 results reached 55.63% and 77.82%, which were 0.66% and 1.45% higher than those of weakly supervised cross-language recommendation, and 4.29% and 3.90% better than those of machine-translation-based methods.[Limitations] We only examined the proposed method with Chinese and English patents from one specific field.[Conclusions] The proposed method could recommend Chinese and English patents effectively, which helps future research on cross-language patent recommendation.
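The SIF step can be sketched from its standard definition, the word weight a/(a + p(w)); the shared-space vectors and probabilities below are toy values, and the removal of the first principal component that completes the full SIF method is omitted here.

```python
def sif_embedding(tokens, word_vecs, word_prob, a=1e-3):
    """Smooth inverse frequency sentence vector: weighted average of word
    vectors with weight a / (a + p(w)). Principal-component removal,
    part of full SIF, is omitted in this sketch."""
    dim = len(next(iter(word_vecs.values())))
    vec = [0.0] * dim
    for t in tokens:
        if t not in word_vecs:
            continue
        w = a / (a + word_prob.get(t, 1.0))
        for i, x in enumerate(word_vecs[t]):
            vec[i] += w * x
    return [x / max(len(tokens), 1) for x in vec]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(e * e for e in x) ** 0.5
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

# toy shared space: Chinese and English words already mapped together
vecs = {"antenna": [1.0, 0.0], "天线": [0.9, 0.1], "engine": [0.0, 1.0]}
probs = {"antenna": 0.01, "天线": 0.01, "engine": 0.01}
sim = cosine(sif_embedding(["antenna"], vecs, probs),
             sif_embedding(["天线"], vecs, probs))
```

Patent texts whose SIF vectors have high cosine similarity would be recommended across languages.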
[Objective] This study uses a patent knowledge graph to calculate similarities between patent terms, aiming to detect infringement cases from patent texts.[Methods] We calculated term similarities based on the knowledge graph of new energy vehicle patents. Other factors included the concept hierarchy of terms, the distance between terms in the knowledge graph, the semantic similarity of terms, and the attributes of terms.[Results] The accuracy and recall of patent term classification exceeded 80%, significantly higher than those of traditional methods.[Limitations] Manual construction of the concept hierarchy tree and annotation of term classifications might introduce errors.[Conclusions] It is feasible to compute similarities between patent terms based on a knowledge graph, which provides a good reference for future research.
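One hedged way to combine two of the listed factors is to blend a path-distance similarity over the knowledge graph with an attribute overlap; the graph, attributes, and blend weight below are invented for illustration and are not the paper's calibration.

```python
from collections import deque

def graph_distance(graph, a, b):
    """BFS shortest-path length between two terms; None if disconnected."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in graph.get(node, []):
            if nb == b:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None

def term_similarity(graph, attrs, a, b, alpha=0.5):
    """Blend of path similarity 1/(1+d) and attribute Jaccard overlap;
    alpha is an illustrative weight."""
    d = graph_distance(graph, a, b)
    path_sim = 1.0 / (1 + d) if d is not None else 0.0
    sa, sb = set(attrs.get(a, [])), set(attrs.get(b, []))
    attr_sim = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    return alpha * path_sim + (1 - alpha) * attr_sim

graph = {"battery": ["cell", "pack"], "cell": ["battery", "electrode"],
         "pack": ["battery"], "motor": ["rotor"], "rotor": ["motor"]}
attrs = {"battery": ["energy", "storage"], "cell": ["energy", "storage"],
         "motor": ["drive"]}
sim_close = term_similarity(graph, attrs, "battery", "cell")
sim_far = term_similarity(graph, attrs, "battery", "motor")
```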
[Objective] This paper proposes a classification model for few-shot texts, aiming to address the issues of data scarcity and low generalization performance.[Methods] First, we divided the text classification task into multiple subtasks based on the episode training mechanism in meta-learning. Then, we proposed a Bi-directional Temporal Convolutional Network (Bi-TCN) to capture the long-term contextual information of the text in each subtask. Third, we developed a Bi-directional Long-term Attention Network (BLAN) to capture more discriminative features based on Bi-TCN and a multi-head attention mechanism. Finally, we used a Neural Tensor Network to measure the correlation between query samples and the support set of each subtask to complete few-shot text classification.[Results] We examined our model on the ARSC dataset. The classification accuracy reached 86.80% in the few-shot setting, which was 3.68% and 1.17% better than those of the ROBUSTTC-FSL and Induction-Network-Routing models.[Limitations] The performance of BLAN on long texts is not satisfactory.[Conclusions] BLAN overcomes the issue of data scarcity and captures comprehensive text features, which effectively improves the performance of few-shot text classification.
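The Neural Tensor Network correlation can be sketched from its standard scoring function, u·tanh(qᵀW[1:k]s + V[q;s] + b), between a query vector q and a support-set (class prototype) vector s; all parameters below are toy values, not trained weights.

```python
from math import tanh

def ntn_score(q, s, W, V, u, b):
    """Neural Tensor Network score: for each tensor slice, a bilinear term
    q^T W_k s plus a linear term over [q; s], squashed by tanh and combined
    with the output weights u."""
    z = []
    for k in range(len(W)):                       # iterate tensor slices
        bilinear = sum(q[i] * W[k][i][j] * s[j]
                       for i in range(len(q)) for j in range(len(s)))
        linear = sum(v * x for v, x in zip(V[k], q + s))
        z.append(tanh(bilinear + linear + b[k]))
    return sum(ui * zi for ui, zi in zip(u, z))

# two slices, identity bilinear weights, zero linear weights (toy setup)
W = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
V = [[0.0] * 4, [0.0] * 4]
u, b = [1.0, 1.0], [0.0, 0.0]
score_same = ntn_score([1.0, 0.0], [1.0, 0.0], W, V, u, b)   # aligned vectors
score_diff = ntn_score([1.0, 0.0], [0.0, 1.0], W, V, u, b)   # orthogonal vectors
```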
[Objective] This paper proposes an automated scheme to remove personal information from clinical records based on the BiLSTM-CRF model, aiming to protect patient privacy and identify protected health information (PHI) in unstructured files.[Methods] We collected experimental data from the discharge summaries of a health information platform. According to the 18 PHI regulations specified by HIPAA, we determined 7 PHI categories and 15 PHI types. We used the BiLSTM-CRF model to effectively identify protected health information in unstructured clinical records.[Results] The precision, recall and F1 values over all entity categories were 98.66%, 99.36%, and 99.01% respectively, and the incorrect labels were summarized and analyzed.[Limitations] The corpus characteristics need to be improved, and the quality of the clinical texts after automatic PHI recognition was not evaluated.[Conclusions] The BiLSTM-CRF model could automatically recognize named entities without feature engineering, which promotes the sharing and utilization of clinical information.
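Decoding in the CRF layer of a BiLSTM-CRF is typically done with the Viterbi algorithm; a minimal sketch with invented emission and transition scores (in the full model, the BiLSTM supplies the real emissions):

```python
def viterbi(emissions, transitions, tags):
    """Best tag sequence under a linear-chain CRF: per-token emission scores
    plus transition scores between adjacent tags."""
    n = len(emissions)
    score = {t: emissions[0][t] for t in tags}
    back = []
    for i in range(1, n):
        new_score, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = score[prev] + transitions[(prev, t)] + emissions[i][t]
            ptr[t] = prev
        score, back = new_score, back + [ptr]
    best = max(tags, key=lambda t: score[t])
    path = [best]
    for ptr in reversed(back):          # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["O", "B-NAME", "I-NAME"]
# toy token-level scores for a three-token span containing a patient name
emissions = [
    {"O": 2.0, "B-NAME": 0.1, "I-NAME": 0.0},
    {"O": 0.1, "B-NAME": 2.0, "I-NAME": 0.5},
    {"O": 0.2, "B-NAME": 0.3, "I-NAME": 2.0},
]
transitions = {(a, b): 0.0 for a in tags for b in tags}
transitions[("O", "I-NAME")] = -5.0     # penalize I-NAME without a preceding B
path = viterbi(emissions, transitions, tags)
```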
[Objective] This paper explores methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the lack of annotated corpora.[Methods] We constructed our new model based on three representative active learning strategies (MARGIN, NSE, MNLP), one novel LWP strategy, and a neural network model (CNN-BiLSTM-CRF). Then, we extracted task- and method-related information from texts with far fewer annotations.[Results] We examined our model on scientific articles with only 10%–30% of the texts selectively annotated. The proposed model yielded the same results as models trained on 100% annotated texts, significantly reducing the labor cost of corpus construction.[Limitations] The number of scientific articles in our sample corpus was small, which led to low precision.[Conclusions] The proposed model significantly reduces reliance on a large annotated corpus. Among the active learning strategies, MNLP yielded better results and normalizes sentence length to improve the model’s stability. MARGIN performs well in the initial iterations at identifying low-value instances, while LWP is suitable for datasets with more semantic labels.
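The MARGIN strategy can be sketched as uncertainty sampling on the gap between the top two class probabilities, selecting the instances where the model is least decisive; the probability values below are invented.

```python
def margin_sampling(prob_dists, budget):
    """MARGIN active learning strategy: pick the unlabeled instances whose
    top two class probabilities are closest (smallest margin)."""
    margins = []
    for idx, probs in enumerate(prob_dists):
        top = sorted(probs, reverse=True)
        margins.append((top[0] - top[1], idx))
    return [idx for _, idx in sorted(margins)[:budget]]

# model confidence over three classes for four unlabeled sentences (toy values)
probs = [
    [0.90, 0.05, 0.05],   # confident -> not worth annotating
    [0.40, 0.35, 0.25],   # small margin -> send to annotator
    [0.34, 0.33, 0.33],   # smallest margin -> annotate first
    [0.70, 0.20, 0.10],
]
picked = margin_sampling(probs, 2)
```

MNLP differs mainly in scoring whole sequences and normalizing by sentence length, which is what stabilizes it on variable-length text.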