Data Analysis and Knowledge Discovery

Explicit Rating Filling Strategy Based on Selection Data Bias Elimination and Conditional Generative Adversarial Networks

Shi Lei, Li Shuqing, Jiang Mingfeng, Zhang Zhiwang, Wang Yu

Data Analysis and Knowledge Discovery. 2023, 7(6): 1-14. https://doi.org/10.11925/infotech.2096-3467.2022.0605

[Objective] This study addresses data sparsity and user selection bias in the explicit rating data of recommender systems by proposing a rating data filling model based on uninteresting item injection. [Methods] A general rating data filling model is constructed on the Conditional Generative Adversarial Network framework. A Denoising Auto-Encoder serves as the generator to capture the nonlinear latent factors behind the interactions and improve the robustness of the model. To address selection bias, uninteresting items are identified based on the user’s time point visibility and injected into the model by modifying the mask operation, so that the generated data are consistent with the user’s real rating distribution. [Results] Experiments on the MovieLens and Amazon datasets show that, after data filling, the recommendation accuracy of ItemCF, BiasSVD, and AutoRec improves by more than three times on average. [Limitations] The data generation method relies on rating data and may not be effective when rating data are extremely sparse, as in cold-start scenarios. [Conclusions] The proposed model effectively alleviates data sparsity and eliminates selection bias, significantly improving the recommendation performance of existing collaborative filtering methods.

Technology Recognition and Link Prediction Method Based on GNN

Xu Xin, Li Qian, Yao Zhanlei

Data Analysis and Knowledge Discovery. 2023, 7(6): 15-25. https://doi.org/10.11925/infotech.2096-3467.2022.0361

[Objective] This paper integrates time features into a patent IPC co-occurrence network and trains a GNN model for link prediction. It aims to provide a reference for technology discovery and knowledge supply. [Methods] First, we collected patent data on “privacy protection” to construct an IPC co-occurrence network. Then, we assigned time distribution, stability, and attention features to the network nodes. Finally, we trained the GraphSAGE model to obtain representations of the IPC nodes and predict the link scores between them, which supports technology opportunity mining. [Results] Compared with the traditional link prediction method based on node similarity and with Node2Vec, the proposed model achieved a 30% improvement in the AUC metric. [Limitations] As a deep learning model, the GNN is at a disadvantage in training time. [Conclusions] Our new link prediction method exhibits high prediction accuracy. Combined with the time characteristics, it can capture the dynamic characteristics of nodes and provide valuable insights for technology discovery and other tasks.
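
As an illustration of the first step and of the node-similarity baseline mentioned in the results, the following minimal sketch builds an IPC co-occurrence network from per-patent code lists and scores a candidate link by common neighbours. The IPC codes, the toy patents, and the function names are hypothetical, and the GraphSAGE model with time features is not reproduced here.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(patents):
    """Count how often each unordered pair of IPC codes appears on the same patent."""
    edges = Counter()
    for codes in patents:
        # Deduplicate and sort so that every pair has one canonical orientation.
        for a, b in combinations(sorted(set(codes)), 2):
            edges[(a, b)] += 1
    return edges


def common_neighbors(edges, u, v):
    """Node-similarity baseline: number of neighbours shared by nodes u and v."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return len(neighbors.get(u, set()) & neighbors.get(v, set()))


# Hypothetical toy data: each inner list holds the IPC codes of one patent.
patents = [
    ["G06F21", "H04L9", "G06Q20"],
    ["G06F21", "H04L9"],
    ["H04L9", "H04W12"],
    ["G06Q20", "H04W12"],
]
edges = build_cooccurrence(patents)
score = common_neighbors(edges, "G06F21", "H04W12")
```

Here `G06F21` and `H04W12` never co-occur on a patent, so a high common-neighbour score marks them as a candidate future link; a learned model would replace this score with one computed from node embeddings.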

Construction and Verification of Type-Controllable Question Generation Model Based on Deep Learning and Knowledge Graphs

Wang Xiaofeng, Sun Yujie, Wang Huazhen, Zhang Hengzhang

Data Analysis and Knowledge Discovery. 2023, 7(6): 26-37. https://doi.org/10.11925/infotech.2096-3467.2022.1000

[Objective] This research aims to generate questions automatically, thereby reducing the workload of manual question writing. It also addresses the uncontrollable difficulty and limited dimensions of questions produced by collaborative question generation, and encourages learners to engage in deep reading comprehension with intelligent questions. [Methods] We proposed a question generation model based on Transformer and knowledge graph to automatically generate type-controllable questions. First, we input the knowledge graph into the Graph Transformer module of the TCQG (Type Controllable Question Generation) model for graph representation learning and obtained the subgraph vector. Then, we obtained matching external questions for each subgraph using similarity measures. Next, we input the parameters of the 4MAT question type and those external questions into the BiLSTM network to obtain externally enhanced vectors. Finally, we fed the subgraph vector and the externally enhanced vector into the Pointer-Generator Network of the TCQG model to generate questions. [Results] The TCQG model achieves better representation learning of the knowledge graph through the Graph Transformer. The BLEU score is 39.62 on the one-hop triple dataset and 38.63 on “what is” questions. Both surpass the baseline model. [Limitations] The model supports only a limited set of question types and cannot cover all question types in human language. In addition, this research did not involve matching responses to the questions, which limits its real-world applications. [Conclusions] This research generates the diverse, semantically rich, and naturally expressed questions needed in educational scenarios. It enables learners to benefit from the generated questions and engage in deeper reading comprehension.

Review of Detection Methods for Scientific Data Citations

Zhou Jiayin, Qian Qing, Tang Mingkun, Wu Sizhu

Data Analysis and Knowledge Discovery. 2023, 7(6): 38-49. https://doi.org/10.11925/infotech.2096-3467.2022.0662

[Objective] This paper analyzes the characteristics of existing data citation practices and summarizes their recognition methods. It also explores current research and future development trends. [Methods] The existing data citation detection methods can be divided into three categories: rule-based recognition, supervised machine learning algorithms, and semi-supervised machine learning algorithms. We reviewed each method’s principles, characteristics, existing problems, performance, and applications. [Results] The existing technologies are concentrated on supervised machine learning algorithms. Detecting data citations with the help of citing behaviors and extracting data citation elements are the future directions. [Limitations] This paper summarizes the characteristics of data citations and existing recognition algorithms. It does not elaborate on the technical details of these algorithms. [Conclusions] There are still some problems in detecting data citations, such as research field limitations, lack of diversity in methods, and insufficient consideration of data citation characteristics, which need further optimization.

Identifying Academic Expertise of Researchers Based on Iceberg Model

Song Peiyan, Long Chenxiang, Li Yiran, Ni Xuening

Data Analysis and Knowledge Discovery. 2023, 7(6): 50-60. https://doi.org/10.11925/infotech.2096-3467.2022.0542

[Objective] This paper aims to automatically identify the academic expertise of researchers in order to improve research project evaluation and talent assessment. [Methods] Firstly, we adopted the Iceberg Model to describe the academic expertise of researchers. The visible part of the “iceberg” reveals the researchers’ areas of expertise and specialization, which identify their core competencies and main research directions. The lower part of the “iceberg” indicates the “comparative advantages” of researchers’ expertise. Then, we used labels to represent researchers’ expertise and utilized machine learning techniques such as LDA and BERT to extract, cluster, and generate matrices of academic labels. Finally, we proposed the self-focus and the peer-relative indexes to identify the researchers’ main areas and relative position in the scientific community. [Results] Using a sample of 20 researchers, we generated 8,985 sets of label words and their weights and described researchers’ expertise at a fine-grained level. We then calculated the “Self-Focus Index” and the “Peer-relative Index” based on the domain-researcher matrix (40×20). We found that the proposed method can accurately reflect researchers’ expertise in specific research areas and their relative positions within the scientific community. [Limitations] Future work should consider incorporating the temporal factor to capture the temporal evolution characteristics of researchers’ academic expertise. [Conclusions] The advantages of the proposed method are twofold. Firstly, the iceberg model effectively explains what researchers do and how well they do it. The model provides a theoretical basis for label extraction, index design, and enhancing interpretability. Secondly, in addition to quantifiable comparative expertise index calculations, the method achieves fine-grained, precise, and dynamic talent expertise profiling.

Generating Patent Text Abstracts Based on Improved Multi-head Attention Mechanism

Shi Guoliang, Zhou Shu, Wang Yunfeng, Shi Chunjiang, Liu Liang

Data Analysis and Knowledge Discovery. 2023, 7(6): 61-72. https://doi.org/10.11925/infotech.2096-3467.2022.0530

[Objective] This paper addresses the bias in patent text summarization caused by using a single input structure of the patent text. It also addresses the issues of repeated generation, insufficient conciseness and fluency, and the loss of original information in generated abstracts. [Methods] We designed a patent text abstract generation model based on an improved multi-head attention mechanism (IMHAM). Firstly, we designed two cosine similarity-based algorithms based on the logical structure of the patent text to address the single structure issue and select the most important patent document. Then, we established a sequence-to-sequence model with a multi-head attention mechanism to learn the feature representation of patent text. Meanwhile, we added self-attention layers at the encoder and decoder levels. Next, we modified the attention function to address the problem of repetitive generation. Finally, we added an improved pointer network structure to solve the problem of original information loss. [Results] On the publicly available patent text dataset, the Rouge-1, Rouge-2, and Rouge-L scores of the proposed model were 3.3%, 2.4%, and 5.5% higher than those of the MedWriter baseline model. [Limitations] The proposed model is better suited to documents with multiple structures and cannot fully utilize the algorithm for selecting the most important document when documents have a single structure. [Conclusions] The proposed model has good generalization ability in improving the quality of summary generation for text with multi-document structures.

Tibetan News Text Classification Based on Graph Convolutional Networks

Xu Guixian, Zhang Zixin, Yu Shaona, Dong Yushuang, Tian Yuan

Data Analysis and Knowledge Discovery. 2023, 7(6): 73-85. https://doi.org/10.11925/infotech.2096-3467.2022.0453

[Objective] To improve pre-training knowledge in Tibetan, this paper proposes a classification method for Tibetan news text based on Graph Convolutional Network (GCN) using the construction relationship between Tibetan syllables and documents. [Methods] First, we constructed the Tibetan news corpus text graph based on syllable-syllable and syllable-document relations. Then, we initialized the GCN using the one-hot representation of syllables and documents and jointly learned the embedding of syllables and documents under the supervision of document category labels in the training dataset. Finally, we transformed the text classification tasks into node classification. [Results] The Graph Convolutional Network achieves an accuracy of 70.44% on the classification of Tibetan news body texts, which is 8.96%-20.66% higher than the baseline models. It had a 61.94% accuracy on the Tibetan news titles, 6.61%-26.05% higher than the baseline models. Additionally, the Graph Convolutional Network is 0.73%-15.1% higher in accuracy than the SVM and CNN with pre-trained syllable embedding and Chinese minority pre-trained language model CINO. It is 15.65% higher in accuracy on the Tibetan content text compared to Word2Vec+LSTM. [Limitations] It still relies on labeled datasets in Tibetan, which are relatively scarce. [Conclusions] This paper designs three comparative experiments to demonstrate the effectiveness of Graph Convolutional Networks on Tibetan news text classification. It effectively solves the problem of cluttered information in Tibetan news text and helps data mining for Tibetan news texts.
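
To make the graph construction and the one-hot initialization more concrete, the sketch below runs a single standard GCN propagation step, ReLU(Â X W) with Â = D^(-1/2) (A + I) D^(-1/2), on a toy graph of two documents and three syllables. It assumes binary, unweighted edges and hand-picked weights purely for illustration; the edge weighting, layer sizes, and training procedure used in the paper are not given in the abstract and are not reproduced.

```python
import math


def normalized_adjacency(n, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected graph with n nodes."""
    # Start from the identity matrix, which adds a self-loop to every node.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(x, y):
    """Plain dense matrix product for lists of lists."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]


def gcn_layer(a_hat, features, weights):
    """One propagation step: ReLU(A_hat @ features @ weights)."""
    hidden = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in hidden]


# Hypothetical toy graph: nodes 0-1 are documents, nodes 2-4 are syllables.
n = 5
edges = [(0, 2), (0, 3), (1, 3), (1, 4),  # syllable-document relations
         (2, 3)]                          # syllable-syllable relation
a_hat = normalized_adjacency(n, edges)
one_hot = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
weights = [[0.1 * (i + 1), -0.1 * (i + 1)] for i in range(n)]  # arbitrary, untrained
hidden = gcn_layer(a_hat, one_hot, weights)
```

Because documents and syllables live in the same graph, the rows of `hidden` for nodes 0 and 1 are document embeddings, which is what lets text classification be treated as node classification.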

An Event Extraction Method Based on Template Prompt Learning

Chen Nuo, Li Xuhui

Data Analysis and Knowledge Discovery. 2023, 7(6): 86-98. https://doi.org/10.11925/infotech.2096-3467.2022.0495

[Objective] This study proposes a joint event extraction model employing an automatically constructed template to leverage the knowledge of pre-trained language models, aiming to improve the existing event extraction models relying on sequence labeling and text generation. [Methods] Firstly, we designed an automatic template construction strategy based on the Event Prompt to generate unified prompt templates. Then, we introduced the Event Prompt Embedding layer for the Event Prompt at the encoding level. Next, we used the BART model to capture the semantic information of the sentence and generated the corresponding prediction sequence. Finally, we jointly extracted trigger words and event arguments from the prediction sequences. [Results] In a dataset containing complex event information, the F₁ values for event trigger and argument extraction reached 77.67% and 65.06%, which were 2.43% and 1.62% higher than the optimal baseline method. [Limitations] The proposed model could only work with sentence-level texts and optimize the Event Prompt at the encoding layer. [Conclusions] The proposed model can reduce the template construction cost while maintaining the same or even better performance. The model could recognize text with complex event information and improve the multi-label classification for event elements.

Patent Keyphrase Extraction Based on Patent Term and Layer Information

Yu Yan, Wang Li, Zheng Siyu

Data Analysis and Knowledge Discovery. 2023, 7(6): 99-112. https://doi.org/10.11925/infotech.2096-3467.2020.0577

[Objective] This paper proposes a patent key phrase extraction method incorporating terminology and hierarchical information to improve the accuracy of patent key phrase extraction. It tries to improve the existing graph-based model, which tends to select long key phrases and ignores the phrases’ positional information. [Methods] Based on the traditional graph model, we constructed a new terminology degree metric to measure the terminological information of candidate key phrases. Considering the characteristics of patent documents, we divided patents into several hierarchies and used their weight metrics to measure the positional information of candidate key phrases. [Results] By incorporating terminology information, the F value of the new method improved by 7.615% (nanotechnology), 11.515% (image recognition), 9.813% (chip), and 8.839% (LCD). By incorporating the hierarchical information, the new method’s F value improved by 9.880% (nanotechnology), 6.929% (image recognition), 6.099% (chip), and 5.576% (LCD). [Limitations] The candidate key phrase selection method based on part-of-speech rules may produce more noise. [Conclusions] The proposed method effectively enhances the accuracy of patent key phrase extraction.

Unbalanced Fake Review Processing Model Based on Cost-Sensitive Learning

Liu Meiling, Shang Yue, Zhao Tiejun, Zhou Jiyun

Data Analysis and Knowledge Discovery. 2023, 7(6): 113-122. https://doi.org/10.11925/infotech.2096-3467.2022.0442

[Objective] This study aims to enhance the detection of fake reviews by improving the model’s ability to learn deep semantic information from text and addressing the problem of data imbalance. [Methods] User behavior and text characteristics of the dataset were analyzed to automatically calculate a cost-sensitive matrix based on inter-class separability, thereby improving the model’s ability to learn from unbalanced data. Additionally, the text encoding ability of BERT was utilized to optimize the model further. [Results] Extensive experiments on the YelpCHI dataset showed that the proposed model outperformed existing advanced methods with an 18% improvement in F1 value and a 12% improvement in AUC value. [Limitations] While the proposed method has achieved promising results, further research is needed to explore its applicability to other domains. [Conclusions] Leveraging user behavior and text features for category separability calculation effectively enhances the performance of the model in detecting fake reviews. The proposed method’s integration of cost-sensitive matrix and BERT’s text encoding ability holds great potential for improving the detection of fake reviews.
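
The general idea of cost-sensitive learning on unbalanced data can be sketched as a log loss in which each example is scaled by the cost of its true class, so that errors on the rare class weigh more. The cost values and probabilities below are hypothetical; the paper derives its cost matrix from inter-class separability, and that calculation is not described in the abstract and is not reproduced here.

```python
import math


def cost_sensitive_log_loss(probs, labels, class_cost):
    """Mean negative log-likelihood, each example scaled by the cost of its true class."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += class_cost[y] * -math.log(p[y])
    return total / len(labels)


# Hypothetical predictions: class 0 = genuine review, class 1 = fake review (minority).
probs = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3]]
labels = [0, 0, 1]

plain = cost_sensitive_log_loss(probs, labels, {0: 1.0, 1: 1.0})
weighted = cost_sensitive_log_loss(probs, labels, {0: 1.0, 1: 5.0})  # assumed costs
```

In this toy case the only fake review is misclassified, so raising the cost of class 1 increases the loss and pushes a model trained on it toward the minority class.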

Evaluating Student Engagement with Deep Learning

Wang Nan, Wang Qi

Data Analysis and Knowledge Discovery. 2023, 7(6): 123-133. https://doi.org/10.11925/infotech.2096-3467.2022.0485

[Objective] This paper constructs an expression dataset for engagement recognition and designs a joint evaluation model for students’ class engagement. It addresses the lack of relevant expression datasets and the low accuracy of existing models. [Methods] First, we collected data from actual online classes and constructed an expression dataset suitable for engagement recognition. Then, we designed an improved VGG model to evaluate the dataset and recognize student engagement. Finally, we combined the expression and face scores to establish a joint evaluation model for students’ engagement and calculated the tested students’ actual class engagement scores. [Results] We adjusted and verified the network structure through parameter tuning for engagement expression recognition. The improved model VGG16+Dense+Dropout(lr=1e-5) had the highest accuracy among the four compared model architectures, reaching over 92%. The joint engagement score evaluates engagement more accurately than the single expression engagement score. [Limitations] We did not include more ablation studies in training the model; more research is needed to explore deeper neural networks. [Conclusions] The W-AttLe dataset is suitable for evaluating students’ class engagement. The proposed joint engagement evaluation model outperforms the single index model. The proposed weighted test scheme, which combines a knowledge point test with a self-test of comprehension degree, validates the joint engagement degree model.

Associations Between Following Network of Online Investment Community and Stock Market

Li Yulu, Zhao Jichang

Data Analysis and Knowledge Discovery. 2023, 7(6): 134-147. https://doi.org/10.11925/infotech.2096-3467.2022.0482

[Objective] This paper explores the stock preferences of users in the following network of Guba (a Chinese online investment community). It examines the correlation between stock market performance and the social structures of the network. [Methods] First, we used statistical analysis to study users’ preferences. Then, we utilized complex network analysis to learn the structural characteristics of the users’ following network. Finally, we conducted a correlation analysis to examine the correlations between network structures and stock price fluctuations. [Results] Users connected by following relationships in the network are more similar in their stock preferences (K-S test~0.235, p~0). The structures of the following network affect the dissemination of information, which is significantly correlated with the fluctuation of stock prices. The structural variables of network efficiency are significantly negative (p~0.01). Our findings suggest that the stronger the ability of the following network to spread information, the more independent the fluctuation of a stock’s price will be from the fluctuation of other stocks and the market average. Increasing the ability to disseminate information on the following network can reduce the co-oscillation of stock prices in China. [Limitations] This study lacks experimental validation and a comparative analysis of data from different social platforms. [Conclusions] The research methods and results presented in this paper can provide some guidance for market regulation and stock investment.

Deep Learning Model of Drug Recommendation Based on Patient Similarity Analysis

Wu Jialun, Zhang Ruonan, Kang Wulin, Yuan Puwei

Data Analysis and Knowledge Discovery. 2023, 7(6): 148-160. https://doi.org/10.11925/infotech.2096-3467.2022.0535

[Objective] This paper develops a deep learning model that accurately predicts drug combinations by analyzing structured time-series medical data and patient similarity. [Methods] Our model learned comprehensive patient representations by parsing structured time-series data through two attention mechanisms. Then, we calculated the patients’ similarity to enrich their representations and transformed the drug recommendation problem into a multi-label learning task. [Results] We examined the new model with the MIMIC-III dataset. Compared to other mainstream models, the proposed one achieved improvements of at least 1.09%, 2.38%, 1.40%, and 1.08% in DDI rate, Jaccard similarity, PRAUC, and F1-score, respectively. [Limitations] Our model does not yet incorporate prior domain knowledge from biomedical fields. More research is needed to thoroughly investigate the noise in the data and potential issues in clinical applications. [Conclusions] The proposed method can learn comprehensive patient representations and enhance the safety and accuracy of drug recommendation tasks.
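
Since drug recommendation is framed here as a multi-label task, two of the reported metrics can be illustrated on a single patient visit as set comparisons between the prescribed and the predicted drugs. This is a minimal sketch with hypothetical drug names; how the paper averages the scores over visits and patients is an assumption left out of the example.

```python
def jaccard(true_set, pred_set):
    """Size of the intersection over size of the union (0.0 when both sets are empty)."""
    union = true_set | pred_set
    return len(true_set & pred_set) / len(union) if union else 0.0


def f1_score(true_set, pred_set):
    """Harmonic mean of precision and recall over the predicted drug set."""
    hits = len(true_set & pred_set)
    if hits == 0:
        return 0.0
    precision = hits / len(pred_set)
    recall = hits / len(true_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical visit: drugs actually prescribed versus drugs predicted by a model.
prescribed = {"aspirin", "metformin", "lisinopril"}
predicted = {"aspirin", "metformin", "warfarin"}

visit_jaccard = jaccard(prescribed, predicted)
visit_f1 = f1_score(prescribed, predicted)
```

Two of the three predicted drugs are correct, and the union of both sets has four drugs, so the Jaccard similarity is 2/4 while precision and recall are both 2/3.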

25 June 2023, Volume 7 Issue 6
