Data Analysis and Knowledge Discovery

Current Issue, Volume 7 Issue 5

Scenarized Intelligent Data-Driven Research Model: Concept, Technical Framework, and Experimental Verification

Wang Xuezhao, Wang Yanpeng, Zhao Ping, Chen Fang, Chen Xiaoli

2023, 7 (5): 1-9. DOI: 10.11925/infotech.2096-3467.2023.0421

[Objective] This paper proposes a scenarized intelligent data-driven research model and conducts preliminary verification through several cases. [Methods] We developed a quantitative characterization model named SDS (S&T decision scenarios (S) - scenarized data alignment (D_X) - solution scenarios (S)). The implementation path of SDS was divided into three steps: S&T decision scenarization demands, scenarized data construction, and optional solution generation. [Results] We verified the model through two cases, which supported specific decision-making scenarios such as the selection of emerging and disruptive technologies, the perception of S&T frontier trends, the evaluation of scientific research proposals, and situational awareness in the conflict between Russia and Ukraine. The research results were recognized by relevant S&T decision-makers. [Limitations] The automation level of the data scenarization process is relatively low, and the combination of intelligent technologies with the basic theories and methods of information science needs to be improved in the process of generating evidence chains. [Conclusions] The scenarized intelligent data-driven research model increases the breadth and depth of research conclusions, improves the efficiency and speed of research work, and verifies the reusability and portability of scenarized intelligent data. It can provide reference and guidance for the concept, ideas, and implementation path of research and services for future S&T decision-making.

Analyzing Evolution of Basic Research Funding Orientation: Case Study of NSF

Wei Huanan, Lei Ming, Wang Xuefeng, Yu Yin

2023, 7 (5): 10-20. DOI: 10.11925/infotech.2096-3467.2022.0627

[Objective] This paper identifies and analyzes the funding orientation of basic research projects funded in the United States, aiming to provide suggestions for improving the funding layout of science funds in China. [Methods] Based on the literature review, we established a feature system for identifying funding orientation from four dimensions: basic information, collaborative characteristics, project characteristics, and output characteristics. Then, we constructed a recognition model with the help of machine learning. Finally, we conducted the corresponding evolution analysis. [Results] The SVM model with an RBF kernel identified funding orientation better. The case analysis of synthetic biology showed that the NSF balanced “free exploration” and “demand-oriented” research. “Free exploration” basic research was funded consistently throughout. In contrast, “demand-oriented” basic research was relatively scarce in the early stages and gradually increased as the field developed. Changes in the two funding orientations are closely related to the development stage of the discipline and the national strategic policies. [Limitations] We chose only one field for case analysis, which limits representativeness. We included only NSF project data and not data from NIH, FDA, or other agencies, so the comprehensiveness of the data source needs to be strengthened. [Conclusions] This study is a valuable exploration of identifying basic research funding orientation. By identifying and analyzing the funding orientation of NSF projects in synthetic biology, this study can provide suggestions for the funding layout of NSFC in China and promote the coordinated development of basic research in China.
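
The RBF kernel mentioned in the results can be illustrated with a minimal sketch. The function name `rbf_kernel`, the `gamma` value, and the toy vectors are assumptions for illustration; the abstract does not report the kernel parameters or the features used.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel: k(x, y) = exp(-gamma * ||x - y||^2).

    gamma is a hypothetical value; the abstract does not state it.
    """
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical feature vectors give the maximum similarity.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0])
# Distant vectors give a similarity close to zero.
far = rbf_kernel([0.0, 0.0], [3.0, 4.0])
```

An SVM with this kernel separates classes by similarity to support vectors rather than by a single linear boundary, which is one plausible reason it suits heterogeneous project features.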

Hotel Stock Prediction Based on Multimodal Deep Learning

Liu Yang, Zhang Wen, Hu Yi, Mao Jin, Huang Fei

2023, 7 (5): 21-32. DOI: 10.11925/infotech.2096-3467.2022.0538

[Objective] This paper aims to predict the price trend of hotel stocks by analyzing consumer sentiment in tourism reviews using multi-modal deep learning methods. [Methods] First, we constructed a multi-modal deep learning model to encode the multi-modal information. Then, we extracted the interaction information between texts and images through LSTM and a graph neural network. Finally, we predicted the price of hotel stocks. [Results] We conducted an empirical study using Yelp’s tourism review data. The proposed model outperformed the baseline models, with an average stock prediction accuracy of 59.10%. [Limitations] The proposed model was only tested on the dataset of four hotels on the Yelp website and has not been further validated on other tourism platforms. [Conclusions] The proposed model can effectively extract the interactive information between different modalities and improve the accuracy of hotel stock prediction.

Financial Fraud Detection for Growth Enterprise Market Listed Companies Based on Data Fusion

Li Aihua, Wang Diwen, Xu Weijia, Li Zimo, Yao Sihan

2023, 7 (5): 33-47. DOI: 10.11925/infotech.2096-3467.2022.0585

[Objective] This paper builds ensemble models to detect financial fraud in Growth Enterprise Market (GEM) listed companies. [Methods] We constructed a financial fraud anomaly detection framework based on data fusion. In the data layer, we fused structured, text, and multi-source heterogeneous data to construct financial and non-financial information features. In the information layer, we combined different sampling and ensemble classification models. In the knowledge layer, we fused current domain information to construct the model evaluation indicators. [Results] After imbalance processing, the evaluation indicators of the model were better than those obtained without processing. The optimized SMOTE+ENN+LightGBM model achieved an F_β of 0.7738. In addition, the detection results containing multiple types of features were better than those containing only single-class features. [Limitations] The proposed method mainly identifies companies suspected of financial fraud. It cannot distinguish or determine specific types of fraud. [Conclusions] Imbalance processing improves the model’s ability to find abnormal samples, and the fusion of multi-source heterogeneous data positively affects the identification of financial fraud in listed companies.
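
For readers unfamiliar with the F_β score reported above, the following is a minimal sketch of the standard definition. The value of β used in the paper is not given in the abstract, so the `beta` values and the precision and recall inputs below are hypothetical.

```python
def f_beta(precision, recall, beta=1.0):
    """Standard F-beta score: (1 + b^2) * P * R / (b^2 * P + R).

    beta > 1 weights recall more heavily, which is common when missing a
    fraudulent company costs more than a false alarm (an assumption here,
    not a statement about the paper's choice).
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical inputs: the same precision/recall pair scored with two betas.
balanced = f_beta(0.5, 1.0, beta=1.0)
recall_heavy = f_beta(0.5, 1.0, beta=2.0)
```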

Paper Recommendation Based on Academic Knowledge Graph and Subject Feature Embedding

Li Kaijun, Niu Zhendong, Shi Kaize, Qiu Ping

2023, 7 (5): 48-59. DOI: 10.11925/infotech.2096-3467.2022.0424

[Objective] This paper proposes a new model that integrates multiple features to provide accurate paper recommendation services for researchers. [Methods] First, we designed a feature extraction framework to extract and fuse entity relation features and topic features from the knowledge graph and the content of academic papers, respectively. Then, we proposed a paper recommendation method based on a knowledge embedding-based encoding-decoding model, which improved the learning of high-dimensional fusion features. [Results] We evaluated our new model on the DBLP-v11 dataset. The proposed method improved the Recall and MRR scores by 8.9% and 2.9%, respectively, compared with the second-best model. [Limitations] The proposed graph feature learning method does not consider the weight of entities in the real environment. [Conclusions] The new paper recommendation method can effectively learn high-dimensional features, which provides guidance for subsequent research.
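
The two reported metrics, Recall and MRR, can be sketched as follows. This uses the common definitions of Recall@k and mean reciprocal rank; the cutoff `k`, the paper identifiers, and the function names are illustrative assumptions, not details from the study.

```python
def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant papers that appear in the top-k recommendations."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1 / rank of the first relevant paper for each user (0 if none)."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# Hypothetical recommendation lists for two researchers.
ranked_lists = [["p3", "p1", "p7"], ["p2", "p9", "p4"]]
relevant_sets = [{"p1"}, {"p4", "p8"}]
mrr = mean_reciprocal_rank(ranked_lists, relevant_sets)
recall = recall_at_k(ranked_lists[1], relevant_sets[1], k=3)
```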

Text Classification Method for Urban Portrait Based on Multi-Label Annotation Learning

Ye Guanghui, Li Songye, Song Xiaoying

2023, 7 (5): 60-70. DOI: 10.11925/infotech.2096-3467.2022.0673

[Objective] The study uses machine learning technology to analyze long social texts and obtain multiple labels for them, aiming to provide new ideas for urban portrait text analysis and other related studies. It addresses problems facing urban data portrait analysis, such as texts that are unstructured, vary in length, and cover multiple topics. [Methods] We retrieved social media texts on urban impressions from the Zhihu platform and performed sentence segmentation and noise reduction on the texts. Then, we manually annotated some texts using the existing urban portrait annotation framework. Next, we trained support vector classification, convolutional neural network, and Naive Bayes models and comprehensively evaluated their performance. We used the optimal model to obtain all labels for long texts and used the ML-kNN multi-label learning model to train a multi-label social text classification model. [Results] Among the single-label text classification models, the support vector classification model had the best overall performance, with an accuracy of 0.6900 for short text labeling. With the ML-kNN multi-label text classification model, the highest accuracy reached 0.8103, and the average Hamming loss was 0.0353. [Limitations] The impact of textual context on topic classification needs to be considered more fully. [Conclusions] Based on the long social text data on the Zhihu platform, the proposed multi-label classification model can effectively identify multiple labels for long social texts on the urban portrait.
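
The Hamming loss reported for the multi-label model can be illustrated with a short sketch of the standard definition: the fraction of individual label slots predicted incorrectly. The label matrices below are made-up examples, not data from the study.

```python
def hamming_loss(y_true, y_pred):
    """Fraction of (sample, label) slots where prediction and truth differ."""
    n_samples = len(y_true)
    n_labels = len(y_true[0])
    wrong = sum(
        t != p
        for row_t, row_p in zip(y_true, y_pred)
        for t, p in zip(row_t, row_p)
    )
    return wrong / (n_samples * n_labels)

# Two hypothetical texts, three candidate urban-portrait labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]
loss = hamming_loss(y_true, y_pred)
```

A lower value is better: a loss of 0 means every label of every text was predicted correctly.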

Name Disambiguation Based on Similar Features and Relation Graph Optimization

Cui Huanqing, Yang Junzhu, Song Weiqing

2023, 7 (5): 71-80. DOI: 10.11925/infotech.2096-3467.2022.0576

[Objective] The paper aims to fully utilize the feature information and relation information of academic literature to improve author name disambiguation. [Methods] We proposed a name disambiguation method combining feature information embedding and relation graph optimization. First, we extracted feature information from the literature and applied representation learning to obtain the embedding vectors. Then, we mined the relationship information between documents and constructed four relation graphs to optimize the embedding vector of each document. Finally, we used a hierarchical agglomerative clustering algorithm to obtain the disambiguation results. [Results] We evaluated the new model on the AMiner-na dataset and found that its average F1 score reached 68.78%, which was 1.81 percentage points higher than the second-best method. [Limitations] The proposed method focuses on the average disambiguation effect across all authors, and the disambiguation effect for some authors needs to be improved. [Conclusions] The proposed method can fully utilize the literature relation information and effectively improve author name disambiguation.
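
The final clustering step can be sketched as below. This is a generic average-linkage agglomerative clustering with a distance threshold, written under the assumption that each document is already represented by an embedding vector. The linkage criterion, the threshold, and the toy 2-D points are hypothetical; the abstract does not state which settings the authors used.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerative(points, threshold):
    """Merge the closest pair of clusters until the closest pair is farther than threshold."""
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Hypothetical document embeddings; each resulting cluster stands for one real author.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
clusters = agglomerative(points, threshold=1.0)
```

Stopping on a distance threshold rather than a fixed cluster count is convenient for name disambiguation, where the number of distinct authors sharing a name is usually unknown in advance.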

Detecting Weibo Rumors Based on Hierarchical Semantic Feature Learning Model

Huang Xuejian, Ma Tinghuai, Wang Gensheng

2023, 7 (5): 81-91. DOI: 10.11925/infotech.2096-3467.2022.0613

[Objective] This paper tries to improve the accuracy and timeliness of Weibo rumor detection. [Methods] We proposed a rumor detection method based on the hierarchical semantic feature learning model (BCGA). First, we extracted the semantic features of a single text in an event based on the BERT model. Second, we dynamically grouped the event propagation data based on the time domain. Third, we used the convolutional neural network to learn the semantic correlation features of the text sets in each time domain. Fourth, we input the semantic correlation features in each time domain into the deep bidirectional gated recurrent neural network to learn the deep semantic temporal features of the event propagation process. Finally, we integrated the attention mechanism to make the model focus on the rumor features in the semantic temporal features. [Results] Experiments on public Weibo datasets show that the detection accuracy of the model reached 95.39%, while the detection delay was within 12 hours. [Limitations] The model requires a certain amount of forwarding and commenting information, and its detection performance is limited when the event is not popular enough. [Conclusions] The hierarchical semantic feature learning model achieves a learning process from local to global semantics, improving the performance of Weibo rumor detection.
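
The last step, attention over temporal features, can be sketched as a softmax-weighted sum of per-time-window vectors. This is a generic dot-product attention pooling, not necessarily the form used in BCGA; the `query` vector, the state values, and the function names are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(states, query):
    """Weight each time-window state by its dot-product score with the query."""
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in states]
    weights = softmax(scores)
    dim = len(states[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, states)) for d in range(dim)]
    return pooled, weights

# Three hypothetical time-window features (e.g. recurrent-layer outputs) and a query vector.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [2.0, 0.0]
pooled, weights = attention_pool(states, query)
```

Windows whose features align with the query receive larger weights, so the pooled event representation is dominated by the time periods most indicative of a rumor.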

Linguistic Knowledge-Enhanced Self-Supervised Graph Convolutional Network for Event Relation Extraction

Xu Kang, Yu Shengnan, Chen Lei, Wang Chuandong

2023, 7 (5): 92-104. DOI: 10.11925/infotech.2096-3467.2022.0602

[Objective] This paper proposes a Linguistic Knowledge-enhanced Self-Supervised Graph Convolutional Network (LKS-GCN) model, aiming to improve the existing method for event relation extraction. [Methods] First, we used the BERT model to encode the input texts and learned the syntactic relationships between words with a graph convolutional network to enhance text representations. Then, we introduced a multi-head attention mechanism to distinguish different dependency features and used a segment-level max pooling operation to extract structural information. Next, the pooled results of multiple segments were combined as the relation features of event pairs. We conducted adaptive clustering based on the relation representation features and generated pseudo-labels as the self-supervision information. Finally, we optimized event relation features through iterative self-supervised training. [Results] We evaluated the new model on the TACRED and FewRel datasets, where its B³-F1 scores were 2.1% and 1.2% higher, respectively, than those of the best baseline methods. [Limitations] The model treated the syntactic dependency tree as an undirected graph and did not consider edge direction or dependency label information. [Conclusions] The LKS-GCN model could effectively enhance text representation and provide a self-supervised learning framework for event relation extraction with limited labeled data.
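
The B³-F1 metric used in the results can be sketched from its standard definition: per-item precision and recall of the predicted cluster against the gold class, averaged over items. The cluster assignments below are toy values, not data from TACRED or FewRel.

```python
def b_cubed(pred, gold):
    """Return (precision, recall, F1) under the B-cubed clustering metric.

    pred[i] is the predicted cluster id of item i, gold[i] its true class.
    """
    n = len(pred)
    precision = 0.0
    recall = 0.0
    for i in range(n):
        same_cluster = {j for j in range(n) if pred[j] == pred[i]}
        same_class = {j for j in range(n) if gold[j] == gold[i]}
        overlap = len(same_cluster & same_class)  # always >= 1 (item i itself)
        precision += overlap / len(same_cluster)
        recall += overlap / len(same_class)
    precision /= n
    recall /= n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Four hypothetical event pairs: pseudo-label clusters vs. gold relation types.
pred = [0, 0, 1, 1]
gold = ["a", "a", "a", "b"]
p, r, f1 = b_cubed(pred, gold)
```

Because B³ compares groupings rather than label names, it suits self-supervised settings where cluster ids from pseudo-labels have no fixed correspondence to gold relation types.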

Deep Cross-modal Hashing Based on Intra-modal Similarity and Semantic Preservation

Li Tianyu, Liu Libo

2023, 7 (5): 105-115. DOI: 10.11925/infotech.2096-3467.2022.0536

[Objective] This paper addresses a shortcoming of most existing cross-modal hashing methods, which only consider inter-modal similarity and do not fully utilize label semantic information, thereby ignoring heterogeneous data details and losing semantic information. [Methods] First, we used Euclidean distance and the Tanimoto coefficient to measure the intra-modal similarity of data from images and texts, respectively. Then, we used the weighted values of the two to measure the inter-modal similarity to fully utilize the detailed information of heterogeneous data. Next, we preserved the semantic information of data labels to improve the discriminability of the hash codes and prevent the loss of semantic information. Finally, we calculated the quantization loss of the generated hash codes and imposed the hash bit balance constraint to further improve the quality of the hash codes. [Results] Compared with 11 existing methods, the mAP score was increased by 9.5% and 5.8% in the Chinese image retrieval by text and text retrieval by image tasks of the MIR-Flickr25k dataset and by 4.7% and 1.1% on the NUS-WIDE dataset. [Limitations] The model training depends on label information, and its performance may decrease in unsupervised and semi-supervised situations. [Conclusions] The proposed method can preserve the detailed information of heterogeneous data and prevent the loss of semantic information, effectively improving the retrieval performance.
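
The two intra-modal measures and their weighted combination can be sketched as follows. Several choices here are assumptions not stated in the abstract: converting Euclidean distance to a similarity via 1 / (1 + d), using the real-valued form of the Tanimoto coefficient, and the weight `alpha`. The feature vectors are toy values.

```python
import math

def euclidean_similarity(a, b):
    """Map Euclidean distance d to a similarity in (0, 1] via 1 / (1 + d) (an assumed mapping)."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + dist)

def tanimoto(a, b):
    """Tanimoto coefficient for real vectors: a.b / (|a|^2 + |b|^2 - a.b)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return dot / denom if denom else 1.0  # two all-zero vectors are treated as identical

def fused_similarity(img_a, img_b, txt_a, txt_b, alpha=0.5):
    """Weighted sum of image-side and text-side similarity; alpha is hypothetical."""
    return alpha * euclidean_similarity(img_a, img_b) + (1 - alpha) * tanimoto(txt_a, txt_b)

# Hypothetical image features and bag-of-words text vectors for two samples.
score = fused_similarity([0.0, 0.0], [3.0, 4.0], [1, 1, 0], [1, 0, 1], alpha=0.5)
```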

A Novel Borderline Over-Sampling Method Based on KNN and Deep Gaussian Mixture Model for Imbalanced Data

Zhang Haibin, Xiao Han, Yi Cancan, Yuan Rui

2023, 7 (5): 116-122. DOI: 10.11925/infotech.2096-3467.2022.0609

[Objective] This paper proposes a borderline oversampling method based on the k-nearest neighbor algorithm (KNN) and Deep Gaussian Mixture Model (DGMM) to address the classifier bias due to data imbalance. [Methods] First, we used the KNN algorithm to obtain the borderline minority samples in the training set. Second, we constructed a DGMM for the minority samples. Next, we applied the DGMM in reverse to generate oversampled data that conform to the distribution characteristics of the borderline minority samples. Finally, we used the three-sigma rule to remove noise samples. We repeated the process until no outlier samples were generated. [Results] The proposed method improved the AUC and G-mean by up to 8.62% and 12.99%, respectively, and by 3.51% and 4.93% on average. [Limitations] The parameter optimization method for DGMM needs further improvement. [Conclusions] The proposed method can better address the problem of imbalanced data.
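
The noise-removal step can be illustrated with a simplified sketch of the three-sigma rule applied repeatedly to a single feature. The paper works with generated multi-dimensional samples and regenerates them between rounds; this one-dimensional filter and its toy values are assumptions made only to show the rule itself.

```python
import statistics

def three_sigma_filter(values):
    """Drop values farther than 3 standard deviations from the mean, repeating until stable."""
    kept = list(values)
    while len(kept) > 2:
        mean = statistics.mean(kept)
        std = statistics.pstdev(kept)
        inside = [v for v in kept if abs(v - mean) <= 3 * std]
        if len(inside) == len(kept):
            break  # no outliers left
        kept = inside
    return kept

# Hypothetical generated samples for one feature, with one obvious noise value.
samples = [0.9, 1.0, 1.1] * 7 + [50.0]
cleaned = three_sigma_filter(samples)
```

Repeating the check matters because a large outlier inflates the standard deviation and can mask smaller outliers on the first pass.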

Identifying Medical Named Entities with Word Information

Ben Yanyan, Pang Xueqin

2023, 7 (5): 123-132. DOI: 10.11925/infotech.2096-3467.2022.0547

[Objective] This paper utilizes word information to identify and infer the key clinical features in online consultation records and address the difficulty in recognizing the boundaries of named entities. [Methods] First, we constructed a new model based on MacBERT and conditional random fields. Then, we added word position and part-of-speech embeddings, together with a speaker role embedding, to represent the dialogue text. Finally, we used the weighted multi-class cross-entropy to address entity category imbalance. [Results] We conducted an empirical study with online consultation records from Chunyu Doctor. The F₁ value of the proposed model in the named entity recognition task was 74.35%, which was nearly 2% higher than directly using the MacBERT model. [Limitations] We did not design a specific model for Chinese word segmentation. [Conclusions] Our new model, which uses features from more dimensions, can effectively improve the recognition of key features of clinical findings.
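
The weighted multi-class cross-entropy can be sketched as below. The class weights, probabilities, and the choice to normalize by the sum of weights are illustrative assumptions; the abstract does not give the weighting scheme.

```python
import math

def weighted_cross_entropy(probs, targets, class_weights):
    """Weighted mean of -w[t] * log p[t] over samples, normalized by the sum of weights.

    probs[i] is the predicted class distribution for token i,
    targets[i] its gold class index.
    """
    total = 0.0
    weight_sum = 0.0
    for p, t in zip(probs, targets):
        w = class_weights[t]
        total += -w * math.log(p[t])
        weight_sum += w
    return total / weight_sum

# Hypothetical two-class case: class 1 (a rare entity type) gets a larger weight.
probs = [[0.9, 0.1], [0.2, 0.8]]
targets = [0, 1]
loss = weighted_cross_entropy(probs, targets, class_weights=[1.0, 3.0])
```

Giving rare entity categories larger weights makes errors on them cost more, which counteracts the tendency of the model to favor frequent categories.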

Ten-Year Prediction of Coronary Heart Disease Based on PCHD-TabNet

Jiang Linfu, Yuan Zhenming, Zhang Xingwei, Jiang Huaqiang, Sun Xiaoyan

2023, 7 (5): 133-144. DOI: 10.11925/infotech.2096-3467.2022.0603

[Objective] This paper tries to accurately predict the risk of coronary heart disease and analyze the importance of its different risk factors, helping doctors intervene in time and effectively support patients in prevention and treatment. [Methods] We proposed a coronary heart disease prediction framework based on an attention-interpretable tabular learning neural network (PCHD-TabNet). We used self-supervised learning to help the model accelerate convergence and maintain stability. [Results] The overall performance of PCHD-TabNet was better than that of other models, and its AUC on the dataset reached 0.72. [Limitations] The Framingham dataset consists of routine physical examination data. With better clinical data, the predictive performance might be further improved. [Conclusions] Comparative experiments show that the proposed method improves the model’s performance and is superior to other traditional models. This study provides an efficient method for coronary heart disease prediction. It also serves as a reference for similar data mining tasks.
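
The AUC reported above can be read as the probability that a randomly chosen patient who developed the disease receives a higher risk score than a randomly chosen patient who did not. A minimal sketch of that pairwise definition follows; the labels and scores are made-up values.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical ten-year outcomes (1 = disease) and predicted risk scores.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
auc = roc_auc(labels, scores)
```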

Customer Satisfaction Modelling for Healthcare Wearable Devices Through Online Reviews

Lin Weizhen, Liu Hongwei, Chen Yanjun, Wen Zhanming, Yi Minqi

2023, 7 (5): 145-154. DOI: 10.11925/infotech.2096-3467.2022.0420

[Objective] This paper identifies the dimensions of customer interest in wearable healthcare devices and their impact on satisfaction, aiming to inspire businesses to optimize their products and services. [Methods] First, we retrieved 11,349 online reviews from Amazon.com as the corpus. Then, we used the LDA model to identify customer satisfaction dimensions. Finally, we constructed a satisfaction model using machine learning algorithms. [Results] The satisfaction model constructed with the Multi-Layer Perceptron (MLP) had the best prediction effect (F₁=0.6534). Customers’ attention to products focused on 13 dimensions across seven comprehensive attributes: functionality, service, quality, value, ease of use, social, and usefulness attributes. Functionality was the most important product feature for customers. Social, quality, and service attributes had a negative impact on customer satisfaction and should be the priority for businesses to improve products and services. [Limitations] We did not consider the authenticity of the reviews; future work will account for false and malicious reviews in the analysis. [Conclusions] This paper identifies the dimensions of customer attention to products, their impact on satisfaction, and the order in which improvements should be made, providing management insights for businesses.
