Data Analysis and Knowledge Discovery

Abstracting Biomedical Documents with Knowledge Enhancement

Deng Lu,Hu Po,Li Xuanhong

Data Analysis and Knowledge Discovery. 2022, 6(11): 1-12. https://doi.org/10.11925/infotech.2096-3467.2022.0034

[Objective] This study proposes a new text summarization model for biomedical research documents, aiming to improve the quality of their abstracts. [Methods] First, we obtained the important contents of the biomedical texts with extractive abstracting technology. Then, we combined the important contents with a related knowledge base to extract the key terms and their corresponding concepts. Third, we integrated these contents and concepts into the neural network abstracting model as background knowledge for the attention mechanism. With the help of domain knowledge, the proposed model can not only focus on the important information in the texts, but also reduce the noise introduced by external information. [Results] We examined the proposed model with three biomedical data sets. The average ROUGE score of the proposed model’s PG-meta reached 31.06, which was 1.51 higher than that of the original PG model. [Limitations] We did not investigate the impacts of different knowledge acquisition methods on the effectiveness of our model. [Conclusions] The proposed model can better learn the in-depth meaning of biomedical documents and improve the quality of their abstracts.
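
For readers unfamiliar with the ROUGE metric reported above, the following is a minimal sketch of a ROUGE-N F1 score over whitespace tokens. It is a simplified illustration, not the evaluation script used in the paper: the function names `ngrams` and `rouge_n` are hypothetical, and real ROUGE toolkits typically add stemming, multiple references, and further variants such as ROUGE-L.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Simplified ROUGE-N F1 between a candidate and one reference summary."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as in the reference.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical sentences, for illustration only.
    print(rouge_n("the model summarizes biomedical text",
                  "the model summarizes long biomedical documents"))
```

Scores computed this way lie between 0 and 1; papers often report them multiplied by 100, which would be consistent with the values quoted in the abstract.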

Biomedical Text Classification Method Based on Hypergraph Attention Network

Bai Simeng,Niu Zhendong,He Hui,Shi Kaize,Yi Kun,Ma Yuanchi

Data Analysis and Knowledge Discovery. 2022, 6(11): 13-24. https://doi.org/10.11925/infotech.2096-3467.2022.0145

[Objective] This paper proposes a new model integrating label semantics. It uses a text-level hypergraph and a cross-attention mechanism to capture the organizational structure and grammatical semantics of literature, aiming to improve the classification of biomedical texts. [Methods] First, we utilized the fine-tuned BioBERT to retrieve vector features from the biomedical texts. Then, we constructed a text-level hypergraph to capture the word order, semantics, and syntax of the texts. Finally, we merged the features of the text-level hypergraph and the label semantics through the cross-attention network to finish the text classification. [Results] The experimental results on the PM-Sentence dataset show that the proposed model outperforms the baseline model by 2.34 percentage points on the overall F1 measure. [Limitations] The experimental dataset needs to be expanded to evaluate the model’s performance in other fields. [Conclusions] The newly constructed model improves the classification of biomedical texts and provides effective support for knowledge retrieval and mining.

Text Retrieval Based on Syntactic Information

Zhang Yongwei,Liu Ting,Liu Chang,Wu Bingxin,Yu Jingsong

Data Analysis and Knowledge Discovery. 2022, 6(11): 25-37. https://doi.org/10.11925/infotech.2096-3467.2022.0093

[Objective] This study aims to explore an efficient method for retrieving syntactic information from large text corpora. [Methods] First, we created linearized indices for syntactic information based on their features. Then, we used these indices to provide matching information and improve retrieval efficiency. [Results] We examined our new model with the People’s Daily Corpus of 28.51 million sentences. The average processing time for 26 queries was 802.6 milliseconds, which met the requirements of retrieval systems for large corpora. [Limitations] More research is needed to evaluate the proposed method with a larger number of queries. [Conclusions] Our new method could quickly retrieve lexical, dependency syntactic and constituency syntactic information from large text corpora.

Identifying Useful Reviews with Improved Graph Convolutional Neural Network

Li Xuemei,Jiang Jianhong

Data Analysis and Knowledge Discovery. 2022, 6(11): 38-51. https://doi.org/10.11925/infotech.2096-3467.2022.0129

[Objective] This paper utilizes the semantic deviation of comments, aiming to identify useful online reviews. [Methods] First, we constructed an FFGCN model integrating chunk analysis and feature membership to evaluate the comments’ usefulness. Then, we utilized chunk analysis to obtain the feature and opinion chunks as nodes on the graph. Third, with the help of a multi-granularity feature thesaurus, we integrated the membership relationship between feature words into the graph. Finally, we classified the comments through convolution on the graph. [Results] The recognition accuracy of the FFGCN model on the two datasets was 93.4% and 93.9%, which was 0.9 and 1.0 percentage points higher than the best results of the baseline model. [Limitations] We only examined the new model with mobile phone review data. More research is needed to evaluate the model with data sets from other fields. [Conclusions] The proposed model can effectively identify helpful online product reviews.

Multi-Truth Discovery Method Based on Attribute Fusion

Yang Haolin,Dong Yongquan,Chen Huafeng,Zhang Guoxi

Data Analysis and Knowledge Discovery. 2022, 6(11): 52-60. https://doi.org/10.11925/infotech.2096-3467.2022.0286

[Objective] This paper adds the influence of auxiliary attributes to the existing models for multi-truth discovery, aiming to improve their F1 values. [Methods] First, we used the auxiliary attributes to calculate the source expertise and consensus degree. Then, we combined the activity degree of multi-truth attribute values to get the degree of support from the source for the conflicting data. Third, we called the existing truth discovery methods to obtain the pseudo tags of the truth. Finally, we used the neural network to capture the complex relationship between the sources and the conflicting data, and identified all truths. [Results] Compared with the sub-optimal model, our method improved the F1 value by 2.25% on the book dataset and by 5.42% on the movie dataset. [Limitations] The proposed method only included auxiliary attributes reflecting object features, and more research is needed to explore the impacts of other auxiliary attributes on multi-truth discovery. [Conclusions] The proposed method could effectively discover multiple truths.

Core Patent Portfolio Identification and Application in Professional Technical Field

Zeng Wen,Wang Yuefen

Data Analysis and Knowledge Discovery. 2022, 6(11): 61-71. https://doi.org/10.11925/infotech.2096-3467.2022.0161

[Objective] This paper constructs identification methods for core patent portfolios and then examines their application with the help of large-scale datasets. [Methods] Through cross-combination, we constructed five identification models for the patents, which included six features of the patents. We then compared our methods’ performance with datasets of artificial intelligence. [Results] Different combined methods yielded highly consistent results when applied to various datasets. Meanwhile, as the number of core patents increased, the duplication rates between pairs of methods gradually decreased. For example, the core patent duplication rate between method ① and method ④ dropped from 80% to 47%. [Limitations] We only investigated the common identification requirements. More research is needed to study those for specific and individualized areas. [Conclusions] The five constructed methods can be applied to different scenarios. For the rapidly developing field of artificial intelligence, the entropy weight method combined with grey relational analysis and the entropy weight method combined with TOPSIS may yield better results.
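
As an illustration of one combination named in the conclusions, the sketch below ranks items with the entropy weight method followed by TOPSIS. It is a generic textbook version under stated assumptions, not the authors' identification model: the three feature columns and their values are hypothetical, all features are treated as benefit criteria (larger is better), every value is assumed positive, and there are at least two rows with some variation between them.

```python
import math


def entropy_weights(matrix):
    """Entropy weight method: columns with more dispersion get larger weights."""
    m, n = len(matrix), len(matrix[0])
    divergences = []
    for j in range(n):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for x in column:
            if x > 0:
                p = x / total
                entropy -= p * math.log(p)
        entropy /= math.log(m)  # scale entropy into [0, 1]
        divergences.append(1.0 - entropy)
    s = sum(divergences)
    if s <= 1e-12:  # all columns (almost) uniform: fall back to equal weights
        return [1.0 / n] * n
    return [d / s for d in divergences]


def topsis(matrix, weights):
    """TOPSIS closeness scores in [0, 1]; a higher score means closer to the ideal."""
    n = len(weights)
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(n)]
    worst = [min(row[j] for row in weighted) for j in range(n)]
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores


if __name__ == "__main__":
    # Hypothetical patents x features (e.g. citations, claims, family size).
    patents = [[10, 5, 3], [2, 1, 1], [6, 5, 2]]
    w = entropy_weights(patents)
    print(w, topsis(patents, w))
```

Sorting the patents by closeness score and taking the top k would give one candidate core set; comparing such sets across methods is one way to obtain duplication rates of the kind the abstract reports.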

Selecting Optimal LDA Numbers to Identify News Topics

Yang Yang,Jiang Kaizhong,Yuan Mingjun,Hui Lanxin

Data Analysis and Knowledge Discovery. 2022, 6(11): 72-78. https://doi.org/10.11925/infotech.2096-3467.2022.0115

[Objective] This paper proposes an adaptive method to decide the optimal number of topics for the LDA model, aiming to effectively identify news topics. [Methods] First, we extracted the needed data from news using semantics and time series, which helped us construct the corresponding feature vectors. Then, we utilized the Co-DPSC algorithm to collaboratively train the two views and obtained a semantic feature matrix containing timing effects. Finally, we conducted the density peak clustering by row after the matrix dimension reduction, which generated the optimal number of topics. [Results] The precision and F value of the proposed model were improved by 35.09% and 15.39%. [Limitations] We only clustered keywords from news and need to examine the new model with datasets from other fields. [Conclusions] The proposed method could provide a better number of topics for the LDA model.

Knowledge Modeling and Association Q&A for Policy Texts

Hua Bin,Kang Yue,Fan Linhao

Data Analysis and Knowledge Discovery. 2022, 6(11): 79-92. https://doi.org/10.11925/infotech.2096-3467.2022.0185

[Objective] This paper develops a smart question-answering model for association policy based on cognitive semantic knowledge understanding, aiming to improve the government services. [Methods] First, we established a model based on policy connotation to express policy knowledge. Then, we introduced the attention mechanism for question words and classified policy issues combining the improved ERNIE + CNN model. Third, we used the semantic role labeling IDCNN + CRF model and cognitive computing method to obtain the semantics and pragmatic knowledge. Finally, based on knowledge fusion and semantic retrieval, we utilized knowledge aggregation technology to generate relevant answers. We also adopted the BERT semantic similarity calculation and knowledge unit measurement to evaluate the quality of answers. [Results] The accuracy of problem classification reached 90.76%, which was 18.81% and 5.05% higher than those of the original BERT and ERNIE models. The precision of problem knowledge acquisition reached 95.88%, and the accuracy of the answer quality reached 93.75%. The semantic similarity of the answers was 0.88, while the knowledge consistency was 0.96. [Limitations] The performance of our model is limited by the integrity of the domain knowledge system, while the answers’ relevance relies on the accuracy of policy knowledge extraction. [Conclusions] Based on the deconstruction of policy contents and scientific knowledge representation, the proposed method can generate answers for questions on different policy contents.

GNN-MTB: An Anti-Mycobacterium Drug Virtual Screening Model Based on Graph Neural Network

Gu Yaowen,Zheng Si,Yang Fengchun,Li Jiao

Data Analysis and Knowledge Discovery. 2022, 6(11): 93-102. https://doi.org/10.11925/infotech.2096-3467.2022.0196

[Objective] This study constructs a virtual screening model for anti-tuberculosis drugs, aiming to support the research and development of new medicine. [Methods] First, we proposed a curriculum learning-optimized graph neural network model for the virtual screening of anti-tuberculosis inhibitors (GNN-MTB). Then, we created a benchmark dataset for anti-tuberculosis drugs from open access databases. Finally, we compared the performance of GNN-MTB with four classic machine learning models and two graph neural network models on the benchmark dataset of 10,789 records. [Results] The proposed GNN-MTB model’s AUC score reached 0.912 and its AUPR score was 0.679, which were higher than those of the classic models. The maximum improvements of our method in AUC and AUPR were 3.872% and 13.167%. GNN-MTB is open source and available at https://github.com/gu-yaowen/GNN-MTB. [Limitations] The proposed model needs to incorporate analysis data on drug sensitivity and bacterial resistance. [Conclusions] The proposed GNN-MTB model benefits the development of anti-tuberculosis drug screening. This method could also be used to create drug virtual screening models for other diseases.

Identifying Phishing Websites Based on URL Multi-Granularity Feature Fusion

Hu Zhongyi,Zhang Shuoguo,Wu Jiang

Data Analysis and Knowledge Discovery. 2022, 6(11): 103-110. https://doi.org/10.11925/infotech.2096-3467.2022.0141

[Objective] This study proposes a model based on URL multi-granularity feature fusion, aiming to identify phishing websites more effectively. [Methods] First, we retrieved the character-level and word-level features of URLs with one-hot encoding and BERT. Then, we constructed the new identification model by fusing the deep features of both granularities. [Results] The accuracy, recall, F-value, and AUC values of the proposed model reached 0.96, 0.98, 0.97, and 0.97, respectively. It performed better than the single-granularity feature representation-based models, benchmark classifiers, and other popular models. [Limitations] More research is needed to incorporate webpage contents into the model. [Conclusions] The proposed model can represent URL features more comprehensively and effectively identify phishing websites.
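
To make the one-hot step concrete, here is a minimal sketch of character-level one-hot encoding for a URL. It assumes the one-hot encoding is applied at the character level; the alphabet, the `max_len` truncation and padding length, and the names `ALPHABET`, `CHAR_INDEX`, and `one_hot_url` are all illustrative assumptions rather than details taken from the paper.

```python
import string

# Assumed alphabet: lowercase letters, digits, and common URL punctuation.
ALPHABET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_url(url, max_len=64):
    """Encode a URL as a max_len x len(ALPHABET) matrix of 0/1 values.

    URLs longer than max_len are truncated; shorter ones are padded with
    all-zero rows. Characters outside the alphabet also map to zero rows.
    """
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for pos, ch in enumerate(url.lower()[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


if __name__ == "__main__":
    encoded = one_hot_url("http://example.com/login", max_len=32)
    print(len(encoded), len(encoded[0]))
```

A matrix like this could feed a character-level neural encoder, whose output would then be fused with word-level features; how the fusion is done in the paper is not specified in the abstract.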

Predicting Short-Term Urban Traffics Based on Causality Analysis Graph

Wang Jie,Gao Yuan,Zhang Lei,Ma Liwen,Feng Jun

Data Analysis and Knowledge Discovery. 2022, 6(11): 111-125. https://doi.org/10.11925/infotech.2096-3467.2022.0090

[Objective] This paper examines the complex spatial interaction mechanism between regions, aiming to effectively predict short-term traffic flow. [Methods] First, based on the graph neural network, we proposed a new predictive model, which integrated the regional functional similarity matrix and the causality matrix. Then, we developed a training strategy of “Mining traffic time series causal relationship → Extracting Spatio-temporal features → Predicting traffic flows”. Third, we predicted the traffic flows by capturing the spatio-temporal dependence characteristics of regional traffic. [Results] We tested the proposed model with the Didi Chuxing data set from Chengdu. Compared with the optimal baseline model, the RMSE and MAE values were reduced by 3.098% and 4.783%, respectively. [Conclusions] The causal diagram for traffic sequence can simultaneously integrate the features of spatial distance relationships, road connectivity, and function similarities. With the help of causal relationships, the proposed model could effectively predict regional traffic flows.
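
The two error metrics in the results can be sketched as follows. This is a generic illustration that assumes the reported percentages are relative reductions with respect to the baseline's error; the helper `relative_reduction` and the sample flow values are hypothetical and do not come from the paper.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(baseline_error, proposed_error):
    """Percentage by which the proposed error is lower than the baseline error."""
    return (baseline_error - proposed_error) / baseline_error * 100.0


if __name__ == "__main__":
    # Hypothetical regional traffic flows and two sets of predictions.
    flows = [120.0, 95.0, 143.0, 110.0]
    baseline_pred = [112.0, 101.0, 150.0, 104.0]
    proposed_pred = [115.0, 99.0, 147.0, 107.0]
    print(relative_reduction(rmse(flows, baseline_pred), rmse(flows, proposed_pred)))
    print(relative_reduction(mae(flows, baseline_pred), mae(flows, proposed_pred)))
```

RMSE penalizes large individual errors more heavily than MAE, so the two reductions need not move together.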

Comprehensive Quality Profiling for Micro-, Small-, and Medium-sized Enterprises Based on Deep Learning

Cao Lina,Zhang Jian,Chen Jindong,Fan Hui

Data Analysis and Knowledge Discovery. 2022, 6(11): 126-138. https://doi.org/10.11925/infotech.2096-3467.2022.0078

[Objective] This paper develops the comprehensive quality profiling technology for the micro-, small-, and medium-sized enterprises (MSMEs) based on deep learning, aiming to more accurately depict the quality of MSMEs. [Methods] We proposed a comprehensive quality profiling system with five dimensions: quality innovation ability, process quality control, product quality level, operational quality and risks, as well as financial quality. Then, we designed a diversified profiling method based on the quality check reports and online user comments. Finally, we proposed the comprehensive quality profiling technology for MSMEs with the help of deep learning. [Results] The F value of our pre-trained BERT model was 4.66%, 1.99%, and 4.25% higher than those of the benchmark models. The review classification model based on the pre-trained Word2Vec was 6.03% higher than the traditional TF-IDF model. [Limitations] More dimensions related to enterprise quality need to be added and optimized. [Conclusions] Deep learning technology expands the dimensions and improves the accuracy of enterprise quality profiling. The proposed method also provides technical support for service innovation.

Constructing Knowledge Base for Chinese Geographical Name

Li Xiaomin,Wang Hao,Li Yueyan,Zhao Meng

Data Analysis and Knowledge Discovery. 2022, 6(11): 139-153. https://doi.org/10.11925/infotech.2096-3467.2022.0183

[Objective] This paper uses linked data technology to study the evolution of geographical names in China, aiming to support digital humanities research more effectively. [Methods] First, we constructed the knowledge base CGNE_Onto for the evolution of Chinese geographical names. Then, we formulated the strong and weak marker words to identify evolution type sentences from the historical data. Third, we utilized the BERT-BiLSTM-CRF model to identify the time and place name entities from the evolution type sentences. Fourth, we used the newly generated entities as classes to build the ontology knowledge base, which was visualized from the perspective of direct and indirect path relationships. Finally, we analyzed the number of and reasons for different evolution types in each dynasty. [Results] The proposed model intuitively demonstrated the evolution of geographical names, and provided some new directions for the analysis of geographical name data. [Limitations] The experimental data set needs to be expanded to improve the quality of evolution feature words. [Conclusions] The knowledge base for place names clearly shows their historical evolution, as well as the evolution types in different dynasties.

25 November 2022, Volume 6 Issue 11
