Data Analysis and Knowledge Discovery

Research on Knowledge Base Error Detection Method Based on Confidence Learning

Li Wenna,Zhang Zhixiong

Data Analysis and Knowledge Discovery. 2021, 5(9): 1-9. https://doi.org/10.11925/infotech.2096-3467.2021.0179

[Objective] This paper explores an error detection method for knowledge bases based on confidence learning, aiming to reduce noisy data. [Methods] We used the TransE model to represent knowledge base triples and a multi-layer perceptron to detect errors. We then cleaned the dataset with confidence learning and reduced the influence of noisy data through multiple rounds of iterative training. [Results] We evaluated the method on DBpedia datasets and found that the best F1 value reached 0.7364, which is better than the control group. [Limitations] The noisy data in the experiment was artificially generated and differs in distribution from real-world data. More research is needed to evaluate the method on larger knowledge bases. [Conclusions] The proposed method reduces the influence of noisy data through confidence learning and detects knowledge base errors more effectively.
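
The two ideas named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the embeddings, labels and probabilities are made-up toy values, the function names `transe_score` and `flag_noisy` are hypothetical, and the pruning rule is a simplified binary form of confident-learning-style thresholding (flag an example when the predicted probability of the opposite class reaches that class's average self-confidence).

```python
import math

def transe_score(head, relation, tail):
    """L2 distance ||h + r - t||; a smaller value means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))

def flag_noisy(labels, probs):
    """Return indices of examples whose given 0/1 label looks wrong.

    labels: given labels (0 or 1); probs: predicted probability of class 1.
    Assumes both classes occur at least once in `labels`.
    """
    # Per-class threshold: mean predicted probability of class c
    # over the examples that are labeled c.
    thresholds = {}
    for c in (0, 1):
        scores = [p if c == 1 else 1.0 - p for y, p in zip(labels, probs) if y == c]
        thresholds[c] = sum(scores) / len(scores)

    flagged = []
    for i, (y, p) in enumerate(zip(labels, probs)):
        other = 1 - y
        p_other = p if other == 1 else 1.0 - p
        # Confidently predicted as the opposite class: treat the label as noisy.
        if p_other >= thresholds[other]:
            flagged.append(i)
    return flagged

# Toy 2-d embeddings (hypothetical values).
h, r = [1.0, 0.0], [0.0, 1.0]
good = transe_score(h, r, [1.0, 1.0])   # h + r equals t exactly
bad = transe_score(h, r, [4.0, 5.0])    # t is far from h + r

# Toy classifier output: example 2 is labeled 1 but predicted as 0.
labels = [1, 1, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.1, 0.2, 0.1, 0.3]
noisy = flag_noisy(labels, probs)
```

In an iterative setup like the one the abstract describes, the flagged examples would be removed and the classifier retrained on the remaining data; how many rounds to run is a design choice this sketch does not address.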

Diffusion Model for Tacit Knowledge of Scientific Cooperation Network Based on Relevance: Case Study of Major Sci-Tech Projects

Lu Yunmeng,Liu Tiezhong

Data Analysis and Knowledge Discovery. 2021, 5(9): 10-20. https://doi.org/10.11925/infotech.2096-3467.2021.0275

[Objective] Tacit knowledge is an important resource for R&D and innovation in major science and technology projects. It is of practical significance to study the simultaneous diffusion of multiple types of interrelated tacit knowledge. [Methods] We proposed a method for evaluating the knowledge distance between scientific teams based on knowledge relevance, and constructed a tacit knowledge diffusion model based on the scientific cooperation network. We also investigated how knowledge relevance and interaction strategies influence the diffusion of tacit knowledge through multi-agent simulation. [Results] In the early stage of dissemination, knowledge with strong relevance diffused faster than knowledge with weak relevance. As the knowledge differences among scientific teams became smaller and the similarity of their knowledge structures increased, the influence of knowledge relevance on knowledge diffusion gradually weakened. The interaction strategy between subjects had a greater impact on knowledge diffusion. [Limitations] The carrier network of tacit knowledge is a real scientific cooperation network, but its dissemination process was simulated in the lab. [Conclusions] This paper analyzes the dynamic process and effects of tacit knowledge diffusion, and provides suggestions to promote the use of tacit knowledge.

Short-Text Classification Method with Text Features from Pre-trained Models

Chen Jie,Ma Jing,Li Xiaofeng

Data Analysis and Knowledge Discovery. 2021, 5(9): 21-30. https://doi.org/10.11925/infotech.2096-3467.2021.0282

[Objective] This paper uses word vectors from different pre-trained models to enhance the text semantics captured by Word2Vec, BERT and others, and thereby significantly improve news classification. [Methods] We used the BERT and ERNIE models to extract context semantics, and obtained prior knowledge of entities and phrases through Domain-Adaptive Pretraining. Combined with the TextCNN model, the proposed method generated high-order text feature vectors and merged these features to achieve semantic enhancement and better short-text classification. [Results] We evaluated the proposed method on public datasets from Today's Headline News and THUCNews. Compared with the traditional Word2Vec word vector representation, the accuracy of our new model improved by 6.37% and 3.50% respectively. Compared with the BERT and ERNIE methods, the accuracy of our new model improved by 1.98% and 1.51% respectively. [Limitations] The news corpus in our study needs to be further expanded. [Conclusions] The proposed method could effectively classify massive short-text data, which is of great significance to follow-up text mining.

Construction and Application of GCN Model for Text Classification with Associated Information

Zhou Zeyu,Wang Hao,Zhao Zibo,Li Yueyan,Zhang Xiaoqin

Data Analysis and Knowledge Discovery. 2021, 5(9): 31-41. https://doi.org/10.11925/infotech.2096-3467.2021.0266

[Objective] This paper tries to learn text contexts and the polysemy of words, aiming to improve the performance of automatic text classification. [Methods] We proposed a GCN model for long-text classification with associated information. First, we used BERT to obtain the initial word vector features of the long texts. Then, we input these initial features into the BiLSTM model to capture their semantic relationships. Third, we represented the word features as nodes of the graph convolutional network SGCN. Fourth, we used the vector similarity between words as edges to connect the nodes and constructed a graph structure. Finally, we input the long-text representation from SGCN into fully connected layers to finish the classification task. [Results] We evaluated our model on Chinese scientific literature covering multiple subjects. The accuracy of our model is 0.83409, which is better than the benchmark model. [Limitations] We only treated the texts as single-topic ones for multi-class classification tasks. [Conclusions] The proposed model based on BERT, BiLSTM and SGCN algorithms could effectively classify long texts.
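
The graph-construction step in this abstract (word features as nodes, vector similarity as edges) might look roughly like the following. This is only an illustrative sketch: the use of cosine similarity, the `threshold` parameter, the function names and the toy vectors are all assumptions, since the abstract does not say which similarity measure or cutoff the authors used.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_edges(vectors, threshold=0.5):
    """Connect every pair of nodes whose cosine similarity is at least `threshold`.

    Returns a list of (i, j, weight) tuples with i < j.
    """
    edges = []
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            weight = cosine(vectors[i], vectors[j])
            if weight >= threshold:
                edges.append((i, j, weight))
    return edges

# Toy 2-d "word vectors" (hypothetical values standing in for BiLSTM outputs).
vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
edges = similarity_edges(vectors, threshold=0.5)
```

Here nodes 0 and 2 are orthogonal, so only the pairs (0, 1) and (1, 2) are linked. The resulting weighted edge list is the kind of adjacency information a graph convolutional layer would consume.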

Visualizing Knowledge Graph for Explosive Formula Design

Zhou Yang,Li Xuejun,Wang Donglei,Chen Fang,Peng Lijuan

Data Analysis and Knowledge Discovery. 2021, 5(9): 42-53. https://doi.org/10.11925/infotech.2096-3467.2021.0356

[Objective] This paper tries to obtain and use knowledge of formula design principles, component correlation and preparation technology, aiming to improve the process of explosive design. [Context] Our study organizes the scattered and complex knowledge for explosive formula design and visualizes the design process for researchers. [Methods] We took the formulation of polymer bonded explosive as an example and built a knowledge graph of explosive formulas with NLP technology. Then, we designed different visual analysis methods for each topic's knowledge graph. [Results] The new knowledge graph presented the related expression of structured and unstructured knowledge for researchers. We examined the effectiveness of the proposed method with the formulation of polymer bonded explosive, and found that it helped researchers obtain the required formula design knowledge effectively. [Conclusions] This study offers practical solutions for researchers to use the knowledge of explosive formula design.

Identifying Pathogens of Foodborne Diseases with Machine Learning

Wang Hanxue,Cui Wenjuan,Zhou Yuanchun,Du Yi

Data Analysis and Knowledge Discovery. 2021, 5(9): 54-62. https://doi.org/10.11925/infotech.2096-3467.2020.1105

[Objective] This paper introduces external data to enhance the word vector representation of exposure foods, and then uses machine learning methods to identify foodborne disease pathogens. [Methods] First, we extracted space, time, patient information and exposure information from foodborne disease cases as features to identify foodborne disease pathogens. Then, we used a word vector representation technique that integrates domain knowledge to embed exposure foods. Third, we used the XGBoost machine learning model to examine the correlation among features, and found several important foodborne disease pathogens. [Results] The proposed method yielded a more accurate word vector representation of exposure foods than traditional models. It also achieved 68% precision and recall on identifying four important foodborne disease pathogens: Salmonella, Escherichia coli, Vibrio parahaemolyticus and Norovirus, which provides auxiliary support for diagnosing and treating patients. [Limitations] We only analyzed four major foodborne disease pathogens. [Conclusions] The proposed method could improve the control of foodborne diseases.

Annotation Method for Extracting Entity Relationship from Ancient Chinese Works

Wang Yifan,Li Bo,Shi Hua,Miao Wei,Jiang Bin

Data Analysis and Knowledge Discovery. 2021, 5(9): 63-74. https://doi.org/10.11925/infotech.2096-3467.2021.0460

[Objective] This paper proposes an annotation method for ancient Chinese datasets, aiming to standardize the annotation procedures. [Methods] We proposed a new method integrating logical semantics, deep learning and historical knowledge. This model, which is suitable for few-shot learning, follows three principles: “annotation of relationship valence”, “annotation of propositional logic” and “existence of a single relationship”. [Results] We evaluated the proposed annotation model on the text dataset of Shiji (Historical Records in Chinese), and found that its F1 values for relationship extraction and propositional logic extraction reached 42.02% and 34.07% respectively. [Limitations] The proposed method did not include pre-trained models such as BERT or ALBERT and only used the classic Word2Vec model for word embedding. The model's performance could be further improved. [Conclusions] Our new annotation method could effectively extract entity relationships from ancient Chinese works.

Classification Model for Medical Entity Relations with Convolutional Neural Network

Fan Shaoping,Zhao Yuxuan,An Xinying,Wu Qingqiang

Data Analysis and Knowledge Discovery. 2021, 5(9): 75-84. https://doi.org/10.11925/infotech.2096-3467.2021.0015

[Objective] This paper proposes a new classification model for entity relationships based on a Convolutional Neural Network (CNN) with multi-feature embedding, aiming to improve the classification results and simplify feature calculation. [Methods] Building on existing feature embedding algorithms, our CNN model integrated word positions and lexical features, and we described the representation methods for these features. These features did not require complex calculation, which improved the model's performance. [Results] We evaluated the proposed model on the biomedical corpora AIMed, GENIA and ChemProt. The F1 scores were 0.7342, 0.9764 and 0.8900, respectively. This model yielded the best results with the GENIA and ChemProt datasets. [Limitations] Our model did not include prior domain knowledge from the biomedical field. [Conclusions] The proposed model could effectively conduct entity relationship classification, which also helps research on relation extraction and knowledge base construction in the biomedical field.

Identifying Lead Users in Open Innovation Community from Knowledge-based Perspectives

Shan Xiaohong,Wang Chunwen,Liu Xiaoyan,Han Shengxi,Yang Juan

Data Analysis and Knowledge Discovery. 2021, 5(9): 85-96. https://doi.org/10.11925/infotech.2096-3467.2021.0237

[Objective] This paper explores ways to identify lead users in different fields of the open innovation community, aiming to help enterprises obtain external knowledge resources. [Methods] First, we used LDA to extract user topics and construct a user knowledge bipartite network. Then, we combined the characteristics of the lead users' knowledge structure with traditional individual attributes. Third, we proposed a link prediction method based on the Exponential Random Graph Model to identify lead users in different fields. Finally, we conducted an empirical study using the Joint Definition Community as an example. [Results] We identified 20 lead users and found that their average link probability was greater than 0.900. Compared with traditional link prediction methods, our method had the largest AUC of 0.9967 and the smallest ARC of 0.0132. [Limitations] Our model did not include the impact of time factors on user knowledge. [Conclusions] This research enriches the perspectives and methods of lead user identification and lays a solid foundation for follow-up studies.

Sentiment Analysis of Online Users' Negative Emotions Based on Graph Convolutional Network and Dependency Parsing

Fan Tao,Wang Hao,Wu Peng

Data Analysis and Knowledge Discovery. 2021, 5(9): 97-106. https://doi.org/10.11925/infotech.2096-3467.2021.0146

[Objective] This paper develops a new method to improve the negative sentiment analysis of online users. [Methods] We proposed a model based on Graph Convolutional Networks (GCN) and dependency parsing. This model combined BiLSTM and the attention mechanism to extract textual features, which were then used as the vertex features. Then, we used the GCN to train on the vertex features and the corresponding adjacency matrices. Finally, the model predicted four types of emotions (anger, disgust, fear and sadness). [Results] We conducted an empirical study with online public opinion datasets (i.e., “COVID-19”) and compared the performance of our model with the baseline models. We found that the proposed model has certain advantages. For the emotion of “fear”, the recognition accuracy reached 93.535%. [Limitations] We only examined the proposed model with online public opinion datasets. More research is needed to evaluate its performance with other public datasets. [Conclusions] Combining dependency parsing information, the GCN and the attention mechanism could improve the performance of negative sentiment analysis.

Comparing Prediction Models for Prostate Cancer

Che Hongxin,Wang Tong,Wang Wei

Data Analysis and Knowledge Discovery. 2021, 5(9): 107-114. https://doi.org/10.11925/infotech.2096-3467.2020.1185

[Objective] This paper compares the performance of prostate cancer prediction models based on ensemble learning and non-ensemble learning algorithms, aiming to identify the optimal algorithm and key risk factors for the cancer. [Methods] First, we constructed prediction models with K-Nearest Neighbor, Decision Tree, Support Vector Machine and BP neural network. Then, we built prediction models based on AdaBoost, GradientBoost and XGBoost. Finally, we identified risk factors of prostate cancer with the two groups of models. [Results] Among the models based on non-ensemble algorithms, the Decision Tree model performed best, with an accuracy of 0.9333, an F1 score of 0.9301 and an AUC of 0.9145. Among the models based on ensemble algorithms, the XGBoost model performed best, with an accuracy of 0.9573, an F1 score of 0.9624 and an AUC of 0.9513. We found nine important risk factors for prostate cancer, including total PSA and free PSA. [Limitations] The experimental dataset and the set of model-building algorithms need to be expanded. [Conclusions] Ensemble learning algorithms are better than non-ensemble ones at predicting prostate cancer and identifying risk factors.
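
For readers comparing the figures above, accuracy and F1 are both derived from the same confusion-matrix counts. The sketch below shows the standard definitions; the function name `binary_metrics` and the counts are hypothetical and are not taken from the paper's data.

```python
def binary_metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, f1) from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return accuracy, precision, recall, f1

# Toy counts (hypothetical): 8 true positives, 2 false positives,
# 2 false negatives, 8 true negatives.
accuracy, precision, recall, f1 = binary_metrics(tp=8, fp=2, fn=2, tn=8)
```

Because F1 ignores true negatives while accuracy counts them, the two can rank models differently on imbalanced data, which is one reason abstracts like this one report both alongside AUC.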

Optimizing Large Hospital Operating Rooms with Data Analytics

Chen Donghua,Zhao Hongmei,Shang Xiaopu,Zhang Runtong

Data Analysis and Knowledge Discovery. 2021, 5(9): 115-128. https://doi.org/10.11925/infotech.2096-3467.2020.1123

[Objective] This study aims to optimize the management of large hospital operating rooms with the help of data analytics. [Methods] We collected about fifty thousand surgical cases from one large hospital in China. Then, we conducted regression analysis for the correlation of surgical indicators, cluster analysis and association rule mining for resource usage, and time series forecasting to predict the number of surgical cases. Finally, we discussed optimization strategies for operating rooms. [Results] The duration of 75% of surgical procedures showed significant ties with other indicators. The FP-Growth algorithm with a minimum confidence of 0.85 identified reliable patterns of resource usage. The accuracy of time series forecasting increased by at least 37.5% due to the use of weekly numbers of surgical procedures. [Limitations] We did not link the operating room dataset with records from other medical information systems. Therefore, the proposed method might not work for other hospital departments. Meanwhile, our method needs to be examined with data from other hospitals. [Conclusions] This study could help to optimize large hospital operating rooms.
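
The "minimum confidence of 0.85" mentioned above refers to the usual association-rule measure, confidence(A → B) = support(A ∪ B) / support(A). The sketch below computes it by brute force on toy data; it does not implement FP-Growth itself, and the resource names, transactions and function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent (0.0 if the antecedent never occurs)."""
    support_a = support(transactions, antecedent)
    if support_a == 0:
        return 0.0
    return support(transactions, set(antecedent) | set(consequent)) / support_a

# Toy transactions: resources used together in one surgical case (hypothetical names).
transactions = [
    {"anesthesia_machine", "monitor"},
    {"anesthesia_machine", "monitor"},
    {"anesthesia_machine", "monitor", "c_arm"},
    {"anesthesia_machine"},
    {"monitor"},
]

MIN_CONFIDENCE = 0.85  # threshold quoted in the abstract
conf = confidence(transactions, {"anesthesia_machine"}, {"monitor"})
keep_rule = conf >= MIN_CONFIDENCE
```

In this toy data the anesthesia machine appears in four cases and is paired with a monitor in three of them, so the rule's confidence is 0.75 and it would be discarded at the 0.85 threshold.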

25 September 2021, Volume 5 Issue 9
