Table of Contents

25 January 2025, Volume 9 Issue 1
    

  • Sun Wenju, Li Qingyong, Zhang Jing, Wang Danyu, Wang Wen, Geng Yangli’ao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 1-30. https://doi.org/10.11925/infotech.2096-3467.2024.0508

    [Objective] This study comprehensively reviews the advancements in deep incremental learning techniques from the perspective of addressing catastrophic forgetting, aiming to provide references for the research community. [Coverage] Using search terms such as “Incremental Learning”, “Continual Learning”, and “Catastrophic Forgetting”, we retrieved literature from Web of Science, Google Scholar, DBLP, and CNKI. After reading and organizing the retrieved literature, a total of 105 representative publications were selected. [Methods] The paper begins by defining incremental learning and outlining its problem formulation and inherent challenges. We then categorize incremental learning methods into regularization-based, memory-based, and dynamic architecture-based approaches, and review their theoretical underpinnings, advantages, and disadvantages in detail. [Results] We evaluated several classical and recent methods in a unified experimental setting. The results demonstrate that regularization-based methods are efficient in application but cannot fully avoid forgetting; memory-based methods are significantly affected by the number of retained exemplars; and dynamic architecture-based methods effectively prevent forgetting but incur additional computational costs. [Limitations] The scope of this review is limited to deep learning approaches, excluding traditional machine learning techniques. [Conclusions] Under optimal conditions, memory-based and dynamic architecture-based strategies tend to outperform regularization-based approaches, but their increased complexity may hinder practical application. Furthermore, current incremental learning methods still underperform joint training, marking a critical direction for future research.
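    As a concrete illustration of the regularization-based family surveyed above, a minimal numpy sketch of an EWC-style quadratic penalty (a generic textbook example, not code from any surveyed paper): parameters that were important for the old task are anchored to their previous values, so drifting them is costly.

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """EWC-style regularizer: parameters with high Fisher importance for the
    old task are anchored to their old values via a quadratic penalty."""
    return 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)

# Toy illustration: drifting an "important" parameter costs more.
theta_old = np.array([1.0, 2.0])
fisher    = np.array([10.0, 0.1])   # first parameter matters for the old task
drift_important   = ewc_penalty(np.array([1.5, 2.0]), theta_old, fisher)
drift_unimportant = ewc_penalty(np.array([1.0, 2.5]), theta_old, fisher)
```

    In a full training loop this penalty would be added to the new task's loss, which is why such methods are cheap but, as the review notes, cannot fully avoid forgetting.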

  • Mu Xin, Han Xiaoxu, Zhu Feida
    Data Analysis and Knowledge Discovery. 2025, 9(1): 31-40. https://doi.org/10.11925/infotech.2096-3467.2024.0512

    [Objective] To address the issue that existing methods rely on heuristic random data generation or random data perturbation, leading to unstable results, this paper proposes an end-to-end self-supervised data auditing framework based on deep network perturbations. [Methods] The proposed framework replaces the traditional random data perturbation approach with deep network-based perturbations. It implements data auditing by optimizing the distance relationships of the auditing data in the output space over multiple rounds of iteration. [Results] We performed comprehensive experiments on CV and NLP datasets. The results, measured by F1 and AUC, show average improvements of 5.22% and 6.29%, respectively. [Limitations] The theoretical foundation of the algorithm is not discussed, and the scalability of the model to different types of data is not elaborated. [Conclusions] The self-supervised data auditing framework based on network perturbations avoids the uncertainty introduced by random methods and outperforms existing algorithms.

  • Fu Yun, Liu Xiwen
    Data Analysis and Knowledge Discovery. 2025, 9(1): 41-54. https://doi.org/10.11925/infotech.2096-3467.2024.0692

    [Objective] This study breaks through the traditional reliance on expert experience by adopting a data- and knowledge-driven collaborative approach to construct a unified language of experiment operation units (ULEU). [Methods] A synthetic experiment operation unit recognition model, NL2ULEU, was developed based on the large language model GPT-4. Testing on 100 synthetic protocols showed that the model achieved an accuracy exceeding 91% in recognizing the components of synthetic experiment operation units (including synthetic operations and operational parameters). Forty-seven commonly used synthetic operations were selected, and the model’s recognition results, expert feedback on errors, and the co-occurrence strength of synthetic operations and parameters were used to standardize each synthetic operation and its associated parameters, thus constructing a unified representation language. [Results] NL2ULEU was used to process 811 synthetic protocols. The 47 synthetic operations and their associated parameters were standardized, yielding a unified language of experiment operation units comprising 30 operation unit sets. Each synthetic experiment operation unit consists of a synthetic operation and several associated parameters. [Limitations] This study only selected common synthetic operations for analysis. Future work can extend this method to additional synthetic operations, gradually enriching the synthetic experiment operation unit set. [Conclusions] Compared to χDL, a commonly used representation framework for synthetic experiment operation units, the ULEU developed in this study reveals the details of synthetic protocols with more accurate content and format.

  • Wang Xiaolun, Yao Qian, Lin Jiahui, Zhao Yuxiang, Sun Zhihao, Lin Xinlan
    Data Analysis and Knowledge Discovery. 2025, 9(1): 55-64. https://doi.org/10.11925/infotech.2096-3467.2024.0098

    [Objective] Based on self-determination theory, this study explores the motivations of service providers to participate in tasks on skill crowdsourcing platforms. [Methods] We retrieved 15,641 bids and 2,385 service provider records from the epwk.com platform. We utilized TF-IDF and BERT to analyze text features and compute motivation variables. Finally, we constructed a negative binomial regression model, treating the dependent variable as a count variable. [Results] The motivations and behaviors of service providers participating in skill crowdsourcing were significantly correlated at the 1% level (R²=23.10%). Task difficulty improved the model’s explanatory power, negatively moderating competence and reputation (p<0.05) while positively moderating social recognition (p<0.01). [Limitations] Representativeness is limited because the data come from a single platform; future studies could collect data from multiple platforms for comparative validation. External factors such as platform dynamics and policy environments might also affect the data and should be considered in future research. [Conclusions] This paper expands the theoretical foundation for service provider participation in crowdsourcing tasks and offers practical insights for service providers, buyers, and platforms.
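    The TF-IDF weighting step used to derive motivation variables from bid text can be sketched in plain Python (the toy bid tokens below are assumptions for illustration, not data from the study):

```python
import math
from collections import Counter

def tfidf(docs):
    """Plain TF-IDF over tokenized documents: term frequency * log(N / df)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency per term
    out = []
    for d in docs:
        tf = Counter(d)
        out.append({t: (c / len(d)) * math.log(n / df[t]) for t, c in tf.items()})
    return out

# Hypothetical tokenized bid texts.
bids = [["logo", "design", "fast"], ["web", "design"], ["logo", "vector"]]
scores = tfidf(bids)
```

    Terms that appear in fewer bids (here "fast", "vector") get higher weights than terms shared across many bids, which is what makes the resulting features discriminative inputs for the regression.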

  • Zhang Lanze, Gu Yijun, Peng Jingjie
    Data Analysis and Knowledge Discovery. 2025, 9(1): 65-78. https://doi.org/10.11925/infotech.2096-3467.2023.1009

    [Objective] To enhance the accuracy of graph neural networks in credit fraud detection, this paper introduces topological structure analysis and proposes a graph-based deep fraud detection model integrating prior structural information (PSI-GNN). [Methods] We embedded attribute information representing the topological structure of central nodes into feature vectors through structural information encoding. We then divided the message-passing process into proximal and distal aspects: proximal node information was aggregated with a shallow graph neural network, and distal homophily information was aggregated under the guidance of random-walk structural similarity. Finally, we combined the results of both message-passing stages to obtain node embedding representations. [Results] We evaluated the model on the DGraph-Fin and TFinance fraud datasets. Compared with nine related graph neural network models, PSI-GNN improved Macro-F1 and AUC by 2.62% and 4.55% on DGraph-Fin and by 4.67% and 2.33% on TFinance, respectively. [Limitations] Processing node structural information incurs significant time overhead. [Conclusions] By modeling the structural attributes and homophily information of credit networks, we can effectively detect credit fraudsters.
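    A toy sketch of the structural-information-encoding idea: prepend simple topological statistics of each node to its attribute vector before message passing. The specific statistics used here (degree and two-hop reach) are assumed stand-ins for the paper's prior structural attributes.

```python
import numpy as np

def structural_encoding(adj, feats):
    """Prepend simple topological statistics (degree, two-hop reach) to each
    node's attribute vector as a stand-in for prior structural information."""
    enc = []
    for v in range(len(feats)):
        neighbors = set(adj[v])
        two_hop = {w for u in neighbors for w in adj[u]} - neighbors - {v}
        enc.append(np.concatenate(([len(neighbors), len(two_hop)], feats[v])))
    return np.vstack(enc)

# Path graph 0-1-2 with one-hot node attributes.
adj = {0: [1], 1: [0, 2], 2: [1]}
encoded = structural_encoding(adj, np.eye(3))
```

    The downstream GNN then sees topology-aware inputs, which is the intuition behind encoding prior structure before aggregation.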

  • Xie Jun, Yang Haiyang, Xu Xinying, Cheng Lan, Zhang Yarui, Lü Jiaqi
    Data Analysis and Knowledge Discovery. 2025, 9(1): 79-89. https://doi.org/10.11925/infotech.2096-3467.2023.1072

    [Objective] This article proposes a knowledge graph completion method based on multi-view fusion and multi-feature extraction, aiming to address the low-quality knowledge representations and poor performance of existing models. [Methods] First, we generated multiple single-view networks through a view encoder and obtained the final knowledge representation of each entity by using multi-view attention to fuse information from the different views. Second, we extracted semantic and interaction features of the head entity and the relations with different feature extractors. Finally, we employed a cross-attention module to fuse the semantic and interaction features and match them with tail entities. [Results] Experiments on the link prediction task showed that, compared to baseline models, the proposed model improved the Hits@10 metric by 0.4% and 0.9% on the general datasets FB15K-237 and WN18RR, respectively. Hits@10 on the domain datasets Kinship and UMLS reached 99.0% and 99.9%. [Limitations] Relation representations were not updated during view updates, resulting in average-quality relation representation vectors. [Conclusions] The multi-view fusion model effectively improves the quality of knowledge graph representation, and the multi-feature extraction framework significantly enhances link prediction accuracy.
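    For reference, the Hits@10 link-prediction metric reported above can be computed as follows (a generic sketch of the standard metric, not the authors' evaluation code):

```python
import numpy as np

def hits_at_k(scores, true_entities, k=10):
    """Fraction of queries whose gold tail entity appears among the k
    highest-scored candidate entities."""
    topk = np.argsort(-scores, axis=1)[:, :k]   # best-scored entities first
    return float(np.mean([t in row for t, row in zip(true_entities, topk)]))

# Two toy queries over three candidate entities.
scores = np.array([[0.9, 0.1, 0.5],
                   [0.2, 0.8, 0.1]])
```

    On real benchmarks the candidate set is the full entity vocabulary, usually with known true triples filtered out before ranking.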

  • Wang Zhenyu, Zhu Xuefang, Yang Rui
    Data Analysis and Knowledge Discovery. 2025, 9(1): 90-99. https://doi.org/10.11925/infotech.2096-3467.2023.1273

    [Objective] This paper utilizes large language models (LLMs) to generate high-quality auxiliary knowledge, aiming to improve the performance of multimodal relation extraction. [Methods] We introduced a multimodal similarity detection module to construct multimodal prompt templates, which allow the LLM to integrate visual information and prior knowledge into the generated auxiliary knowledge. We combined the obtained auxiliary knowledge with the original text and fed it into downstream text models to accurately predict entity relationships. [Results] The proposed model outperformed the best baseline model on the MNRE dataset, achieving improvements of 4.09% in accuracy and 7.84% in F1 score. [Limitations] We only examined the proposed model on English datasets. [Conclusions] Comparative experiments and case studies validate the model’s effectiveness in multimodal relation extraction. The new model points to a direction for applying LLMs to multimodal information extraction tasks.

  • Rang Yuchen, Ma Jing
    Data Analysis and Knowledge Discovery. 2025, 9(1): 100-109. https://doi.org/10.11925/infotech.2096-3467.2023.1130

    [Objective] To reduce inter-modal differences and strengthen the correlation between modalities, this paper proposes a multimodal alignment sentiment analysis model that accurately captures the sentiment tendencies embedded in multimodal data. [Methods] For the textual modality, the original text, supplemented with image captions, is processed with the RoBERTa pre-trained model for text feature extraction. For the image modality, we used the CLIP vision model to extract image features. The text and image features are aligned through a multimodal alignment layer based on a Multimodal Transformer to obtain enhanced fused features. Finally, the fused multimodal features are fed into a multilayer perceptron for sentiment recognition and classification. [Results] The proposed model achieved an accuracy of 71.78% and an F1 score of 68.97% on the MVSA-Multiple dataset, improvements of 1.78% and 0.07%, respectively, over the best-performing baseline model. [Limitations] The model’s performance was not validated on additional datasets. [Conclusions] The proposed model effectively promotes inter-modal fusion, achieves better fusion representations, and enhances sentiment analysis.
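    The alignment layer can be pictured as cross-modal attention, with text tokens attending over image features; a minimal single-head numpy sketch (the shapes and the concatenation choice are assumptions for illustration, not the paper's architecture):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(text_feats, image_feats):
    """Text tokens act as queries over image patches (keys/values); the
    attended image context is concatenated onto the text features."""
    d = text_feats.shape[1]
    attn = softmax(text_feats @ image_feats.T / np.sqrt(d))
    return np.concatenate([text_feats, attn @ image_feats], axis=1)

rng = np.random.default_rng(0)
fused = cross_modal_attention(rng.normal(size=(2, 4)),   # 2 text tokens
                              rng.normal(size=(3, 4)))   # 3 image patches
```

    A full Multimodal Transformer stacks such attention in both directions with residual connections and feed-forward layers.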

  • Pang Qinghua, Xu Xun, Zhang Lina
    Data Analysis and Knowledge Discovery. 2025, 9(1): 110-120. https://doi.org/10.11925/infotech.2096-3467.2023.1076

    [Objective] To address the singularity and lack of novelty in Weibo topic recommendations, this study proposes a more comprehensive model to meet users’ personalized needs. [Methods] First, we used the LDA model to mine users’ historical topics, constructing Weibo-topic and user-topic matrices. Next, we evaluated Weibo topics along the dimensions of interaction, attributes, and frequency, forming a multi-dimensional assessment of topics by users. Meanwhile, we simulated the forgetting and decay of user interest to construct a dynamic user interest model, from which we obtained each user’s neighbor set. Finally, we used hybrid recommendation to form the users’ final evaluation of topics and provide topic recommendations. [Results] Ablation experiments on a real dataset show that the proposed model achieved higher F1 scores, coverage, and novelty than single models. [Limitations] We performed topic mining only on Weibo posts, without incorporating user comments or other information. [Conclusions] The proposed model ensures accuracy while offering users more diverse and novel Weibo recommendations.
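    The interest-forgetting step can be modeled as exponential (Ebbinghaus-style) decay of topic weights over time; a minimal sketch in which the half-life parameter is an assumption:

```python
import math

def decayed_interest(weight, days_since, half_life=7.0):
    """Exponentially decay a topic-interest weight, halving it every
    half_life days -- a simple stand-in for user-interest forgetting."""
    return weight * math.exp(-math.log(2) * days_since / half_life)
```

    Applying this to each entry of the user-topic matrix before computing neighbor sets lets recent interests dominate similarity, which is the point of a dynamic interest model.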

  • Li Linxia, Chen Bo, Zhou Maoke, Zhao Xiaobing
    Data Analysis and Knowledge Discovery. 2025, 9(1): 121-132. https://doi.org/10.11925/infotech.2096-3467.2024.0065

    [Objective] This paper aims to quantify sentence alignment scores for parallel corpora of low-resource languages, obtain high-quality parallel corpora, and improve machine translation performance. [Methods] We proposed NeuroAlign, a neural network-based unsupervised sentence embedding alignment scoring method. Parallel sentence pairs were embedded into the same vector space, alignment scores for candidate sentence pairs in the parallel corpus were calculated, and low-scoring sentence pairs were filtered out based on score ranking. Finally, we obtained high-quality bilingual parallel corpora for low-resource languages. [Results] In the BUCC2018 parallel text mining task, the F1 score improved by 0.5%–0.8%. In the CCMT2021 low-resource language neural machine translation task, the BLEU score improved by 0.1–10.9. The sentence alignment scores closely approximated human evaluation. [Limitations] Due to the scarcity of low-resource bilingual parallel corpora, our research was limited to Tibetan-Chinese, Uyghur-Chinese, and Mongolian-Chinese language pairs. [Conclusions] This new method can effectively improve sentence alignment scoring for low-resource language machine translation parallel corpora, improving corpus quality at the data source level and enhancing machine translation performance.
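    The score-and-filter step can be sketched with cosine similarity over sentence embeddings in the shared space (the embeddings and the keep ratio below are toy assumptions, not NeuroAlign's scoring function):

```python
import numpy as np

def alignment_scores(src_emb, tgt_emb):
    """Cosine similarity between paired sentence embeddings that live in a
    shared vector space."""
    s = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    t = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    return np.sum(s * t, axis=1)

def filter_pairs(pairs, scores, keep_ratio=0.8):
    """Rank sentence pairs by alignment score and drop the lowest-scoring."""
    order = np.argsort(-scores)
    kept = order[: max(1, int(len(pairs) * keep_ratio))]
    return [pairs[i] for i in sorted(kept)]

# Toy pairs: the second pair is badly aligned (orthogonal embeddings).
src = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
tgt = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
kept = filter_pairs(["a", "b", "c"], alignment_scores(src, tgt), keep_ratio=2/3)
```

    Filtering the corpus this way raises average pair quality before the machine translation model ever sees the data.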

  • Dong Wenjia, Sun Tan, Zhao Ruixue, Ma Weilu, Xiong He, Xian Guojian
    Data Analysis and Knowledge Discovery. 2025, 9(1): 133-144. https://doi.org/10.11925/infotech.2096-3467.2024.0302

    [Objective] This study explores the deep relationships among document features and enhances the effectiveness of author name disambiguation in academic literature. [Methods] We designed a knowledge-enhanced feature extraction framework incorporating prior knowledge from standardized knowledge bases such as institutional name authority files, disciplinary classification systems, and thesauri. Based on the standardized data, the framework integrates semantic and relational information of document features through heterogeneous information network embedding to generate high-quality document vector representations. Finally, we applied hierarchical agglomerative clustering to group the document vectors. [Results] Our model’s F1 score reached 89.07% on the constructed test dataset. [Limitations] The quality and scale of the knowledge base limit the model’s accuracy and generalizability in emerging and subdivided fields. [Conclusions] The proposed method combines expert prior knowledge with the powerful learning capabilities of deep learning, providing an effective approach for author name disambiguation in academic literature.
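    The final clustering step can be sketched as single-linkage agglomerative merging of document vectors under a distance threshold (the linkage choice and threshold here are assumptions; the abstract does not specify them):

```python
import numpy as np

def agglomerate(vecs, threshold):
    """Single-linkage agglomerative clustering: repeatedly merge the two
    closest clusters until the closest remaining pair exceeds threshold.
    Returns one cluster label per input vector."""
    clusters = [[i] for i in range(len(vecs))]
    def linkage(a, b):
        return min(np.linalg.norm(vecs[i] - vecs[j]) for i in a for j in b)
    while len(clusters) > 1:
        best, bi, bj = None, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage(clusters[i], clusters[j])
                if best is None or d < best:
                    best, bi, bj = d, i, j
        if best > threshold:
            break
        clusters[bi] += clusters.pop(bj)
    labels = [0] * len(vecs)
    for c, members in enumerate(clusters):
        for m in members:
            labels[m] = c
    return labels

# Two well-separated groups of toy document vectors.
docs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
labels = agglomerate(docs, threshold=1.0)
```

    In name disambiguation, each resulting cluster is read as one real-world author; the threshold controls how aggressively documents are merged.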

  • Shen Yangtai, Qi Jianglei, Ding Hao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 145-153. https://doi.org/10.11925/infotech.2096-3467.2023.0808

    [Objective] This paper proposes a latent non-negative factorization topic recommendation model based on LDA and transfer learning, aiming to address data sparsity in publication recommendations and improve recommendation accuracy in sparse-data scenarios. [Methods] We used non-negative matrix factorization to fill the high-dimensional sparse matrix of non-negative data. We then constructed a latent topic model based on LDA and non-negative matrix factorization, fully considering the thematic distribution characteristics of user reviews, and applied different dimensions of user information to rating prediction to mitigate data sparsity. Finally, we introduced a transfer learning mechanism to extract and transfer parameters from models pre-trained on related publication categories, assisting feature learning for the target model and improving recommendations for less popular publications. [Results] In comparative experiments against three baseline methods on three publication datasets, the proposed model achieved an average precision, F1 score, and NDCG of 0.7732, 0.7085, and 0.7468, surpassing all baselines. [Limitations] When the number of users in the system is too small, other methods are needed for cold-start situations. [Conclusions] The proposed method generalizes well over user interest features, alleviates popularity bias and data sparsity, and effectively improves the accuracy of publication recommendations.
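    The matrix-filling step can be sketched with masked multiplicative-update NMF, where missing ratings are predicted from the low-rank product (the rank, iteration count, and toy matrix are assumptions for illustration):

```python
import numpy as np

def nmf_fill(R, observed, rank=1, iters=500, seed=0):
    """Masked multiplicative-update NMF: fit W @ H to observed entries only,
    then read predictions for the missing entries off the product."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    W = rng.random((m, rank)) + 0.1
    H = rng.random((rank, n)) + 0.1
    eps = 1e-9
    for _ in range(iters):
        H *= (W.T @ (observed * R)) / (W.T @ (observed * (W @ H)) + eps)
        W *= ((observed * R) @ H.T) / ((observed * (W @ H)) @ H.T + eps)
    return W @ H

# Rank-1 toy rating matrix with one rating held out (bottom-right cell).
R = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 0.0]])
observed = np.array([[1.0, 1.0], [1.0, 1.0], [1.0, 0.0]])
pred = nmf_fill(R, observed)
```

    The masking is what distinguishes completion from plain reconstruction: zeros marked unobserved contribute nothing to the loss, so the factors are free to predict them from the users' and items' latent profiles.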