[Objective] This paper reviews construction methods for domain-specific event graphs, aiming to facilitate future research.[Coverage] We searched Web of Science and Google Scholar with the terms “Event Graph”, “Event Extraction” and “Event Relation”, and retrieved a total of 61 representative publications.[Methods] We summarized the definition, construction process and extraction methods through a literature review. Then, we discussed rule-based, feature-learning-based, and neural-network-based extraction techniques. Finally, we analyzed their feature selection procedures, model architectures and experimental results.[Results] Referring to general knowledge graph construction methods, we proposed a process model that includes trigger, argument and relation recognition. We briefly described construction standards covering structure, domain, event form, inference ability and temporal relations. In practice, we found that ontology reuse is necessary and that neural-network-based methods are the best choice.[Limitations] We did not use the same dataset to evaluate all methods.[Conclusions] We proposed knowledge-boosted methods, transfer learning and cognitive models for future studies.
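The rule-based extraction techniques discussed above can be illustrated with a minimal trigger-recognition sketch; the lexicon, event types and sentence below are invented for illustration and are not drawn from any surveyed system.

```python
import re

def extract_triggers(sentence, trigger_lexicon):
    """Rule-based event trigger recognition: match lexicon entries as whole
    words and return (trigger, event_type) pairs. The lexicon is illustrative."""
    found = []
    for trigger, event_type in trigger_lexicon.items():
        if re.search(r"\b" + re.escape(trigger) + r"\b", sentence):
            found.append((trigger, event_type))
    return found

# toy trigger lexicon mapping surface forms to event types
lexicon = {"acquired": "Merger", "resigned": "Personnel", "fired": "Personnel"}
events = extract_triggers("The CEO resigned after the firm acquired a rival.", lexicon)
```

A real system would add argument recognition (who resigned, what was acquired) and relation recognition between the extracted events, as in the process model above.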
[Objective] This article reviews the theory, research progress and potential applications of measuring the uncertainty of medical knowledge in scientific publications.[Coverage] We searched PubMed, Web of Science, Microsoft Academic, CNKI, and Wanfang Data for English and Chinese publications with 1) the keywords “uncertain* AND knowledge AND *medical” in titles, and 2) the cited reference “Representing Scientific Knowledge: The Role of Uncertainty”.[Methods] First, we categorized the retrieved literature into computational linguistics and informetrics studies. Then, we summarized their research designs, data analytics and conclusions.[Results] The notion of paradigm shift and Bayesian causal networks were the foundations for measuring the uncertainty of medical knowledge. The latest developments included: identifying uncertainty cues in biomedical literature; extracting structured knowledge from unstructured biomedical texts; and measuring the uncertainty level of scientific text represented as Subject-Predicate-Object (SPO) triples.[Limitations] Our discussion focused on Data-Information-Knowledge-Wisdom driven research, such as information science, knowledge engineering and artificial intelligence.[Conclusions] The uncertainty of scientific knowledge and its evolution over time indirectly reflect the strength of competing knowledge claims, their contribution to filling knowledge gaps, and the probability that a given knowledge claim is certain. Measuring this uncertainty will promote the development of informetrics and knowmetrics, as well as their applications in emerging fields such as detecting research fronts, evaluating academic contributions and improving the efficacy of computable-knowledge-driven decision support.
[Objective] This paper proposes a modified translation model (TransTopic) to predict research cooperation, aiming to promote exchanges among researchers and maximize efficiency.[Methods] We used TransTopic to uniformly map the nodes and edges of the scientific research cooperation network to low-dimensional vectors. First, we used the LDA model to extract topic distribution features from stem cell papers. Then, we converted topic features into edge vectors with a deep autoencoder and obtained node vectors based on the translation mechanism. Finally, we predicted scientific cooperation through semantic calculation between the vectors.[Results] TransTopic’s AUC (95.21%) and MeanRank (17.48) for link prediction were better than those of existing models, and its topic prediction accuracy reached 86.52%.[Limitations] The proposed method only considered a one-step translation path, and did not fully utilize information such as authors’ institutions, research interests, and publication levels.[Conclusions] The proposed translation-model-based method could effectively predict research cooperation in the field of stem cells.
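The translation mechanism above follows the TransE idea that head + relation ≈ tail for a plausible edge. A minimal sketch of the scoring used in link prediction, with toy three-dimensional vectors rather than the trained TransTopic embeddings:

```python
def translate_score(head, relation, tail):
    """L2 distance ||h + r - t||; smaller means a more plausible link."""
    return sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)) ** 0.5

def rank_candidates(head, relation, candidates):
    """Rank candidate tail nodes (e.g., potential collaborators) by score."""
    return sorted(candidates,
                  key=lambda name_vec: translate_score(head, relation, name_vec[1]))

h = [0.1, 0.2, 0.3]        # node vector of author A (illustrative)
r = [0.05, 0.05, 0.0]      # edge vector derived from topic features
cands = [("B", [0.15, 0.25, 0.3]), ("C", [0.9, 0.9, 0.9])]
best = rank_candidates(h, r, cands)[0][0]   # "B": h + r lands almost exactly on B
```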
[Objective] This paper develops a neural network model to improve online question answering services.[Methods] First, we retrieved and constructed our experimental dataset from the Yahoo Answers and Yahoo! L6 platforms. Then, we proposed a neural network model (CNMNN) based on a semantic matching matrix, a variable-size convolutional layer, and a multi-layer perceptron. Finally, we compared the results of our model with the MQ2QC, IBLM, DRMM and MatchPyramid methods.[Results] The proposed model was 45.0%, 38.7%, 33.4%, 34.8% and 52.9% higher than the best baseline results on the relevance metrics nDCG@5, nDCG@10, nDCG@20, MRR and MAP. It also gained 31.5%, 23.6%, 25.5%, 38.1%, 36.9% and 30.7% improvements on the diversity metrics α-nDCG@5, α-nDCG@10, α-nDCG@20, ERR-IA@5, ERR-IA@10 and ERR-IA@20.[Limitations] We did not include a new method to further diversify the results.[Conclusions] The new CNMNN model can effectively calculate the semantic relevance between queries and natural language questions at the phrase level. It also avoids the feature signal compression caused by hierarchical convolution operations.
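A semantic matching matrix of the kind CNMNN builds on can be sketched as a word-by-word cosine similarity grid between a query and a candidate question; the two-dimensional vectors below are toy stand-ins for real word embeddings.

```python
def cosine(u, v):
    """Cosine similarity of two word vectors; 0.0 if either is a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def matching_matrix(query_vecs, question_vecs):
    """Word-by-word similarity grid that the convolutional layers consume."""
    return [[cosine(q, w) for w in question_vecs] for q in query_vecs]

query = [[1.0, 0.0], [0.0, 1.0]]       # two query words (toy embeddings)
question = [[1.0, 0.0], [1.0, 1.0]]    # two question words
m = matching_matrix(query, question)
```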
[Objective] This paper constructs a topic graph for Weibo users, aiming to identify the characteristics of user groups and opinion leaders. It also tries to help guide online public opinion and reduce surveillance costs.[Methods] First, we built a processing model for the topic graph of Weibo users based on LDA. Then, we determined the optimal number and distribution of users’ topics with the perplexity index. Third, we used JS divergence to measure the similarity of user topics and constructed the topic graph. Finally, we used data on the “Egypt air disaster” to examine the proposed method.[Results] The topic graph generated by LDA clustered the user topics and identified the opinion leaders.[Limitations] More research is needed to determine the optimal number of LDA topics.[Conclusions] The proposed method could help us identify the characteristics of different topic groups and their opinion leaders.
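The JS divergence step can be sketched directly from its definition; the three-topic user distributions below are illustrative, not taken from the Weibo experiment.

```python
from math import log

def kl(p, q):
    """Kullback-Leibler divergence, skipping zero-probability terms in p."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by log(2);
    0 means identical topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

user_a = [0.7, 0.2, 0.1]   # per-topic probabilities from LDA (toy values)
user_b = [0.6, 0.3, 0.1]
user_c = [0.1, 0.1, 0.8]
# a and b share interests, so their divergence is the smaller one
close = js_divergence(user_a, user_b) < js_divergence(user_a, user_c)
```

Low JS divergence between two users would put an edge between them in the topic graph; highly connected users are candidate opinion leaders.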
[Objective] This paper proposes a new semi-supervised method for text classification, aiming to efficiently process texts with only a small amount of annotation.[Methods] The proposed DW-TCI-based method used double-channel feature extraction to obtain two sets of feature input vectors for the base classifier group. Then, we introduced a divergence-based semi-supervised classification method and the idea of ensemble learning. Finally, we trained our model with unlabeled samples and obtained the classification results of the predicted texts with an equally weighted voting method.[Results] We examined our method with two different datasets having 20% labeled samples. The classification accuracy reached 92.32% and 87.01%, which were at least 5.54% and 5.65% higher than those of similar methods.[Limitations] The sample datasets need to be expanded.[Conclusions] The proposed method could reduce the labeling workload of training samples and provide effective support for better text classification.
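The final voting step can be sketched as each base classifier contributing one equal vote per sample; the class labels and classifier outputs below are invented for illustration.

```python
from collections import Counter

def weighted_vote(predictions):
    """Equally weighted voting across the base classifier group:
    every classifier contributes one vote per sample, majority wins."""
    results = []
    for sample_preds in zip(*predictions):   # transpose: per-sample votes
        results.append(Counter(sample_preds).most_common(1)[0][0])
    return results

# three base classifiers, four test samples (toy labels)
clf_outputs = [
    ["sports", "tech", "tech", "news"],
    ["sports", "tech", "news", "news"],
    ["news",   "tech", "tech", "tech"],
]
final = weighted_vote(clf_outputs)
```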
[Objective] This paper proposes a model to detect the topics of trending news stories, aiming to improve the user experience of news reading.[Methods] We modified the TF-IDF method with balanced paragraph weighting (WTF-IDF). We also improved the K-means clustering model with sub-topic vectors in hierarchical clustering. Finally, we extracted high-frequency words from titles with the new model.[Results] The F1 value of our model was 5.4% higher than that of the TF-IDF method (with three extracted keywords). The hierarchical clustering accuracy based on WTF-IDF and sub-topic vectors was 3.1% higher than that of single-layer K-means clustering.[Limitations] Our model does not include a phrase extraction method, and the hierarchical clustering method is complex.[Conclusions] The proposed method could effectively detect topics of trending news reports.
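One reading of paragraph-weighted TF-IDF is that a term's frequency is scaled by the weight of the paragraph it occurs in. The sketch below uses an invented weighting (the first, title-like paragraph weighted 2.0) rather than the paper's balanced paragraph weights.

```python
from math import log

def wtf_idf(docs, para_weight):
    """TF-IDF where term frequency is weighted per paragraph
    (para_weight maps paragraph index -> weight; default 1.0).
    The weighting scheme here is illustrative only."""
    n = len(docs)
    df = {}
    for doc in docs:
        for term in {t for para in doc for t in para}:
            df[term] = df.get(term, 0) + 1
    scores = []
    for doc in docs:
        tf = {}
        for i, para in enumerate(doc):
            w = para_weight.get(i, 1.0)
            for t in para:
                tf[t] = tf.get(t, 0.0) + w
        scores.append({t: f * log(n / df[t]) for t, f in tf.items()})
    return scores

# two documents, each a list of tokenized paragraphs (paragraph 0 = title)
docs = [
    [["storm", "warning"], ["storm", "city"]],
    [["match", "result"], ["city", "match"]],
]
s = wtf_idf(docs, {0: 2.0})
```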
[Objective] This paper proposes a social network image privacy classifier based on transfer learning, which provides reasonable hints to prevent users from accidentally uploading private information.[Methods] A new standard image dataset was created by gathering and annotating images from the Weibo platform. Deep transfer learning and fine-tuning of various pre-trained image models were applied to automatically classify whether Weibo images contain private information.[Results] With the same amount of data, the accuracy of transfer learning improved by at least 30 percent compared with non-transfer-learning approaches. Most ResNet architectures achieved more than 88% accuracy with transfer learning. Among them, ResNet50 had the highest recall (94.31%), accuracy (90.80%) and F1 value (91.11%), and the shortest testing time (148s). After a comprehensive comparison of the above metrics, it was selected and recommended as the most suitable model for the current scenario.[Limitations] The amount of labeled data in this study is relatively small and may not cover all types of private information.[Conclusions] This study validates the feasibility and efficiency of deep transfer learning for classifying private Weibo images. The results can be applied to various social media platforms to warn users about the risk of privacy leaks. The annotated image dataset can serve as both a foundation and a benchmark for further research.
[Objective] This paper designs a cross-language patent recommendation model based on text semantic representation, aiming to reduce the reliance on bilingual dictionaries and large-scale corpora, as well as improve domain adaptation.[Methods] First, we designed a word vector mapping method with an unsupervised cross-language algorithm. Then, we mapped Chinese and English word vectors into a unified semantic vector space with a linear transformation, which established the semantic mapping between Chinese and English words. Third, we created semantic representations of patent texts from the cross-language word vectors with the smooth inverse frequency (SIF) reweighting method, which placed Chinese and English patent texts in the same vector space. Finally, we calculated the semantic similarity between patent texts and recommended cross-language patents.[Results] We examined the proposed method with patents on “wireless communication”; the recommendation accuracy of the top 1 and top 5 results reached 55.63% and 77.82%, which were 0.66% and 1.45% higher than those of weakly supervised cross-language recommendation, and 4.29% and 3.90% better than those of machine-translation-based methods.[Limitations] We only examined the proposed method with Chinese and English patents from one specific field.[Conclusions] The proposed method could recommend Chinese and English patents effectively, which helps future research on cross-language patent recommendation.
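The SIF step can be sketched from its standard definition, the word weight a/(a + p(w)); the shared-space vectors and probabilities below are toy values, and the removal of the first principal component that completes the full SIF method is omitted here.

```python
def sif_embedding(tokens, word_vecs, word_prob, a=1e-3):
    """Smooth inverse frequency sentence vector: weighted average of word
    vectors with weight a / (a + p(w)). Principal-component removal,
    part of full SIF, is omitted in this sketch."""
    dim = len(next(iter(word_vecs.values())))
    vec = [0.0] * dim
    for t in tokens:
        if t not in word_vecs:
            continue
        w = a / (a + word_prob.get(t, 1.0))
        for i, x in enumerate(word_vecs[t]):
            vec[i] += w * x
    return [x / max(len(tokens), 1) for x in vec]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(e * e for e in x) ** 0.5
    return dot / (norm(u) * norm(v)) if norm(u) and norm(v) else 0.0

# toy shared space: Chinese and English words already mapped together
vecs = {"antenna": [1.0, 0.0], "天线": [0.9, 0.1], "engine": [0.0, 1.0]}
probs = {"antenna": 0.01, "天线": 0.01, "engine": 0.01}
sim = cosine(sif_embedding(["antenna"], vecs, probs),
             sif_embedding(["天线"], vecs, probs))
```

Patent texts whose SIF vectors have high cosine similarity would be recommended across languages.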
[Objective] This study uses a patent knowledge graph to calculate similarities between patent terms, aiming to detect infringement cases from patent texts.[Methods] We calculated term similarities based on the knowledge graph of new energy vehicle patents. Other factors included the concept hierarchy of terms, the distance between terms in the knowledge graph, the semantic similarity of terms, and the attributes of terms.[Results] The accuracy and recall of patent term classification exceeded 80%, significantly higher than those of traditional methods.[Limitations] Manual construction of the concept hierarchy tree and annotation of term classifications might introduce errors.[Conclusions] It is feasible to compute similarities between patent terms based on a knowledge graph, which provides a good reference for future research.
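One hedged way to combine two of the listed factors is to blend a path-distance similarity over the knowledge graph with an attribute overlap; the graph, attributes, and blend weight below are invented for illustration and are not the paper's calibration.

```python
from collections import deque

def graph_distance(graph, a, b):
    """BFS shortest-path length between two terms; None if disconnected."""
    if a == b:
        return 0
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, d = queue.popleft()
        for nb in graph.get(node, []):
            if nb == b:
                return d + 1
            if nb not in seen:
                seen.add(nb)
                queue.append((nb, d + 1))
    return None

def term_similarity(graph, attrs, a, b, alpha=0.5):
    """Blend of path similarity 1/(1+d) and attribute Jaccard overlap;
    alpha is an illustrative weight."""
    d = graph_distance(graph, a, b)
    path_sim = 1.0 / (1 + d) if d is not None else 0.0
    sa, sb = set(attrs.get(a, [])), set(attrs.get(b, []))
    attr_sim = len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    return alpha * path_sim + (1 - alpha) * attr_sim

graph = {"battery": ["cell", "pack"], "cell": ["battery", "electrode"],
         "pack": ["battery"], "motor": ["rotor"], "rotor": ["motor"]}
attrs = {"battery": ["energy", "storage"], "cell": ["energy", "storage"],
         "motor": ["drive"]}
sim_close = term_similarity(graph, attrs, "battery", "cell")
sim_far = term_similarity(graph, attrs, "battery", "motor")
```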
[Objective] This paper proposes a classification model for few-shot texts, aiming to address the issues of data scarcity and low generalization performance.[Methods] First, we divided the text classification task into multiple subtasks based on the episode training mechanism in meta-learning. Then, we proposed a Bi-directional Temporal Convolutional Network (Bi-TCN) to capture the long-term contextual information of the text in each subtask. Third, we developed a Bi-directional Long-term Attention Network (BLAN) to capture more discriminative features based on Bi-TCN and a multi-head attention mechanism. Finally, we used a Neural Tensor Network to measure the correlation between query samples and the support set of each subtask to complete few-shot text classification.[Results] We examined our model on the ARSC dataset. The classification accuracy reached 86.80% in the few-shot setting, which was 3.68% and 1.17% better than those of the ROBUSTTC-FSL and Induction-Network-Routing models.[Limitations] The performance of BLAN on long texts is not satisfactory.[Conclusions] BLAN overcomes the issue of data scarcity and captures comprehensive text features, which effectively improves the performance of few-shot text classification.
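The Neural Tensor Network correlation can be sketched from its standard scoring function, u·tanh(qᵀW[1:k]s + V[q;s] + b), between a query vector q and a support-set (class prototype) vector s; all parameters below are toy values, not trained weights.

```python
from math import tanh

def ntn_score(q, s, W, V, u, b):
    """Neural Tensor Network score: for each tensor slice, a bilinear term
    q^T W_k s plus a linear term over [q; s], squashed by tanh and combined
    with the output weights u."""
    z = []
    for k in range(len(W)):                       # iterate tensor slices
        bilinear = sum(q[i] * W[k][i][j] * s[j]
                       for i in range(len(q)) for j in range(len(s)))
        linear = sum(v * x for v, x in zip(V[k], q + s))
        z.append(tanh(bilinear + linear + b[k]))
    return sum(ui * zi for ui, zi in zip(u, z))

# two slices, identity bilinear weights, zero linear weights (toy setup)
W = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]]
V = [[0.0] * 4, [0.0] * 4]
u, b = [1.0, 1.0], [0.0, 0.0]
score_same = ntn_score([1.0, 0.0], [1.0, 0.0], W, V, u, b)   # aligned vectors
score_diff = ntn_score([1.0, 0.0], [0.0, 1.0], W, V, u, b)   # orthogonal vectors
```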
[Objective] This paper proposes an automated scheme to remove personal information from clinical records based on the BiLSTM-CRF model, aiming to protect patient privacy and identify protected health information (PHI) in unstructured files.[Methods] We collected experimental data from the discharge summaries of a health information platform. According to the 18 PHI regulations specified by HIPAA, we determined 7 PHI categories and 15 PHI types. We used the BiLSTM-CRF model to effectively identify protected health information in unstructured clinical records.[Results] The precision, recall and F1 values over all entity categories were 98.66%, 99.36%, and 99.01% respectively, and the incorrect labels were summarized and analyzed.[Limitations] The corpus characteristics need to be improved, and the quality of the clinical texts after automatic PHI recognition was not evaluated.[Conclusions] The BiLSTM-CRF model could automatically recognize named entities without feature engineering, which promotes the sharing and utilization of clinical information.
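Decoding in the CRF layer of a BiLSTM-CRF is typically done with the Viterbi algorithm; a minimal sketch with invented emission and transition scores (in the full model, the BiLSTM supplies the real emissions):

```python
def viterbi(emissions, transitions, tags):
    """Best tag sequence under a linear-chain CRF: per-token emission scores
    plus transition scores between adjacent tags."""
    n = len(emissions)
    score = {t: emissions[0][t] for t in tags}
    back = []
    for i in range(1, n):
        new_score, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: score[p] + transitions[(p, t)])
            new_score[t] = score[prev] + transitions[(prev, t)] + emissions[i][t]
            ptr[t] = prev
        score, back = new_score, back + [ptr]
    best = max(tags, key=lambda t: score[t])
    path = [best]
    for ptr in reversed(back):          # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["O", "B-NAME", "I-NAME"]
# toy token-level scores for a three-token span containing a patient name
emissions = [
    {"O": 2.0, "B-NAME": 0.1, "I-NAME": 0.0},
    {"O": 0.1, "B-NAME": 2.0, "I-NAME": 0.5},
    {"O": 0.2, "B-NAME": 0.3, "I-NAME": 2.0},
]
transitions = {(a, b): 0.0 for a in tags for b in tags}
transitions[("O", "I-NAME")] = -5.0     # penalize I-NAME without a preceding B
path = viterbi(emissions, transitions, tags)
```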
[Objective] This paper explores methods of extracting information from scientific literature with the help of active learning strategies, aiming to address the lack of annotated corpora.[Methods] We constructed our new model based on three representative active learning strategies (MARGIN, NSE, MNLP), one novel LWP strategy, and a neural network model (CNN-BiLSTM-CRF). Then, we extracted task- and method-related information from texts with far fewer annotations.[Results] We examined our model on scientific articles with only 10%–30% of the texts selectively annotated. The proposed model yielded the same results as models trained on 100% annotated texts, significantly reducing the labor cost of corpus construction.[Limitations] The number of scientific articles in our sample corpus was small, which led to low precision.[Conclusions] The proposed model significantly reduces reliance on a large annotated corpus. Among the active learning strategies, MNLP yielded better results and normalizes sentence length to improve the model’s stability. MARGIN performs well in the initial iterations at identifying low-value instances, while LWP is suitable for datasets with more semantic labels.
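The MARGIN strategy can be sketched as uncertainty sampling on the gap between the top two class probabilities, selecting the instances where the model is least decisive; the probability values below are invented.

```python
def margin_sampling(prob_dists, budget):
    """MARGIN active learning strategy: pick the unlabeled instances whose
    top two class probabilities are closest (smallest margin)."""
    margins = []
    for idx, probs in enumerate(prob_dists):
        top = sorted(probs, reverse=True)
        margins.append((top[0] - top[1], idx))
    return [idx for _, idx in sorted(margins)[:budget]]

# model confidence over three classes for four unlabeled sentences (toy values)
probs = [
    [0.90, 0.05, 0.05],   # confident -> not worth annotating
    [0.40, 0.35, 0.25],   # small margin -> send to annotator
    [0.34, 0.33, 0.33],   # smallest margin -> annotate first
    [0.70, 0.20, 0.10],
]
picked = margin_sampling(probs, 2)
```

MNLP differs mainly in scoring whole sequences and normalizing by sentence length, which is what stabilizes it on variable-length text.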