Online first

The manuscripts published below will continue to be available from this page until they are assigned to an issue.
  • Xiao Kui, Wang Ziming, Zheng Lele, Zhang Miao, Li Zhifei, Zhang Yan, Chen Hao, Wang Shihui
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0239
    Online available: 2025-09-25

    [Objective] Existing cognitive diagnosis methods rely heavily on prerequisite and similarity relationships among knowledge concepts, which leads to several challenges in educational applications, including difficulties in automated modeling, sparse relational structures, and the lack of standardized evaluation criteria. To address these issues, this study proposes a multi-layer graph contrastive cognitive diagnosis model based on knowledge concept co-occurrence relationships. [Methods] First, dual-view concept graphs are constructed using concept co-occurrence relations and correlated co-occurrence dependencies. Second, multi-layer graph contrastive learning is applied to iteratively enhance concept node embeddings via cross-graph alignment. Finally, the enhanced embeddings are integrated with student-exercise interaction features into diagnostic functions to generate knowledge state estimations. [Results] Experiments on three real-world educational datasets (ASSISTments09, MAT2016, and EdNet-1) demonstrated that the proposed model achieved accuracies of 72.05% and 70.70% for CO-IRT and CO-SCD, outperforming baseline models by 3.5% and 1.1%, respectively. For CO-NCD, the model attained an AUC of 76.40%, surpassing baseline methods by 1.5%, while exhibiting superior interpretability. [Limitations] The current method exhibits limitations in handling scenarios where exercises contain only a single knowledge concept. [Conclusions] The multi-layer graph contrastive cognitive diagnosis model based on concept co-occurrence relations effectively captures complex associations between knowledge concepts through the construction of knowledge concept relation graphs and adaptive contrastive enhancement mechanisms. Experimental results confirm that the model significantly improves the accuracy of knowledge state inference.
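
    For readers unfamiliar with cross-view contrastive objectives, the following is a minimal sketch of the symmetric InfoNCE loss commonly used to align node embeddings across two graph views; the function name and temperature value are illustrative, not taken from the paper.

        import torch
        import torch.nn.functional as F

        def cross_view_contrastive_loss(z1, z2, temperature=0.5):
            """z1, z2: (num_concepts, dim) embeddings of the same concepts
            produced from the two co-occurrence graph views."""
            z1 = F.normalize(z1, dim=1)
            z2 = F.normalize(z2, dim=1)
            logits = z1 @ z2.t() / temperature   # pairwise cosine similarities
            labels = torch.arange(z1.size(0))    # positives lie on the diagonal
            # symmetric InfoNCE: each view predicts its counterpart in the other
            return 0.5 * (F.cross_entropy(logits, labels)
                          + F.cross_entropy(logits.t(), labels))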

  • Xiao Kejiang, Chen Liang, Fang Shuo, Pang Shiyan, Qiu Jiefan, Dong Yaning, Yang Wenqi, Guo Shanfeng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1116
    Online available: 2025-09-23

    [Objective] In response to the problems of data sparsity, cold start, and insufficient feature utilization in online course recommendation, this paper proposes a recommendation model that fuses a course knowledge graph with a graph attention network (CKGAT). [Methods] The learner module of CKGAT enhances the memory and generalization abilities of learner features through a feature cross layer and a fully connected layer, respectively. The course module uses a graph attention network to implement message passing and mine high-order semantic features among course entities. Recommendations are obtained by calculating the dot product of the output vectors from the two modules. [Results] In comparison experiments on the MoocCubeX dataset, CKGAT improved ACC, F1, and AUC by 1.28%, 1.62%, and 1.00%, respectively, over the best baseline model. [Limitations] The course knowledge graph used in this paper is not very rich, and the computational complexity of the model can be further optimized. [Conclusions] The proposed CKGAT achieves good recommendation results and helps improve the effectiveness of online course recommendation.
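
    As a point of reference, the dot-product scoring head described above can be written in a few lines; this is a generic sketch, not the authors' implementation, and the sigmoid squashing to an interaction probability is an assumption.

        import torch
        import torch.nn as nn

        class DotProductScorer(nn.Module):
            """Score = dot product of the learner-module and course-module
            output vectors, squashed to an interaction probability."""
            def forward(self, learner_vec, course_vec):
                return torch.sigmoid((learner_vec * course_vec).sum(dim=-1))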

  • Zhu Hou, Tan Yawen, Wu Zishuai
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0309
    Online available: 2025-09-23

    [Objective] This study aims to develop a method and system for detecting violations of privacy agreements using natural language processing technology, facilitating the automated and intelligent identification of violations and their judicial interpretations. [Methods] We analyze the "GB/T 35273-2020 Personal Information Security Specifications", identifying 19 essential elements of privacy agreements and 32 specific requirements. We then construct and train a model for extracting these core elements and detecting violations, utilizing techniques such as text classification, named entity recognition, and QLoRA fine-tuning of large language models. [Results] Experimental findings indicate that the fine-tuned Gemma-2b model exhibits excellent performance in violation detection, achieving the highest results on Dataset One and significantly outperforming the ChatGLM2-6b model (F1 score 0.7647 vs. 0.3735). Additionally, the Gemma-2b model demonstrates superior quality in generating compliance explanations, as evidenced by higher BERTScore evaluation scores (F1 score 0.8054 vs. 0.7440). [Limitations] The generality of the adopted standard constrains detection granularity in scenario-specific contexts, and model input length limitations may compromise semantic completeness for long texts. [Conclusions] The technical framework proposed in this study effectively identifies the core elements of privacy agreements and facilitates interpretable violation detection, thereby enhancing oversight of how relevant laws and regulations are implemented in privacy agreements.
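
    For orientation, a minimal QLoRA setup of the kind described (4-bit quantization plus low-rank adapters) can be configured with the Hugging Face transformers and peft libraries as below; the rank, alpha, dropout, and target modules are illustrative defaults, not the paper's reported settings.

        import torch
        from transformers import AutoModelForCausalLM, BitsAndBytesConfig
        from peft import LoraConfig, get_peft_model

        bnb = BitsAndBytesConfig(load_in_4bit=True,
                                 bnb_4bit_quant_type="nf4",
                                 bnb_4bit_compute_dtype=torch.bfloat16)
        model = AutoModelForCausalLM.from_pretrained("google/gemma-2b",
                                                     quantization_config=bnb)
        lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                          target_modules=["q_proj", "v_proj"],
                          task_type="CAUSAL_LM")
        model = get_peft_model(model, lora)  # only adapter weights are trained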

  • Fu Zhu, Qiu Changchang, Liu Peng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0157
    Online available: 2025-09-23

    [Objective] To provide new solutions for entity recognition in vertical fields where data is scarce. [Methods] We adopt a multi-round prompting approach for large language models (LLMs), integrating four rounds of prompts: entity definition, core examples, knowledge enhancement, and error correction. Taking the Chinese ship fault domain as an example, we select five mainstream LLMs to conduct experiments on our self-built corpus and the CCKS2017 dataset. [Results] DeepSeek achieves the best entity recognition performance on both the self-built corpus and the CCKS2017 dataset, with F1 scores reaching 90.62% and 90.36%, respectively. [Limitations] The self-built corpus is relatively small, and the manually written prompts may carry subjective biases. [Conclusions] The entity recognition performance of different LLMs varies greatly. Core examples yield a larger gain than random examples. Each round of prompts contributes some gain, and combining multiple rounds yields the largest gain. The proposed method shows good performance and robustness.
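
    The four-round scheme can be pictured as a single chat history that accumulates one round per turn. The sketch below uses DeepSeek's OpenAI-compatible endpoint; every prompt string is a placeholder, not the paper's actual wording.

        from openai import OpenAI

        client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")
        rounds = {
            "entity definition": "Define the entity types to extract: ...",
            "core examples": "Representative labeled examples: ...",
            "knowledge enhancement": "Domain knowledge about ship faults: ...",
            "error correction": "Review your last answer; fix boundary and type errors.",
        }
        messages = [{"role": "user", "content": "Extract entities from: <text>"}]
        for name, prompt in rounds.items():
            messages.append({"role": "user", "content": prompt})
            reply = client.chat.completions.create(model="deepseek-chat",
                                                   messages=messages)
            # keep each round's answer in context for the next round
            messages.append({"role": "assistant",
                             "content": reply.choices[0].message.content})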

  • Wu Hong, Zhang Li, Wan Sihan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0449
    Online available: 2025-09-23

    [Objective] To explore how the conditional factors that affect online medical team performance combine to produce high or non-high performance. [Methods] Taking the Haodf online platform as the research context, the NCA method was first used to test whether any single conditional variable is necessary for the outcome variable. The fsQCA method was then used to conduct configurational analysis of high-performance and non-high-performance online medical teams. Finally, the dynamic fsQCA method was used to explore how the configurations for high online medical team performance evolve over time. [Results] None of the eight conditional variables alone constitutes a necessary condition for high or non-high online medical team performance. Four configurations lead to high online medical team performance and three lead to non-high performance, and the configurations for high performance vary across time stages. [Limitations] Only secondary data from a single online medical platform were obtained, which to some extent limits the generalizability of the conclusions. Future research could be conducted on other online medical platforms and Internet hospitals. [Conclusions] The configurations of high and non-high online medical team performance verify the four principles of complexity theory. The results help platforms optimize resource allocation and provide a theoretical basis and practical guidance for establishing and operating online medical teams.

  • Duan Yufeng, Xie Jiahong, Bai Ping, Gong Tianyang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0410
    Online available: 2025-09-23

    [Objective] To explore whether large language models (LLMs) and prompt engineering can replace classical deep learning models in entity-relation extraction from Chinese medical texts, which are highly specialized and domain-specific. [Methods] This study uses three LLMs, GLM-4, ERNIE-4-Turbo, and DeepSeek-R1, and three classical deep learning models, CBLUE, CasRel, and GPLinker, to systematically compare LLMs driven by prompt engineering against classical deep learning models. The comparison varies the number of relation types to be extracted, the number of examples in the prompt for LLMs, and the training data size for the classical models. We use bert-base and roberta as encoders for the classical deep learning models. [Results] Experimental results on the CMeIE-V2 dataset show that: (1) roberta-CBLUE and roberta-GPLinker extract best; when extracting one relation type their F1 scores reach 0.5826 and 0.5853, and when extracting ten relation types, 0.5112 and 0.4934; (2) LLMs are not good at extracting multiple relation types at once; when extracting two relation types, the F1 scores of GLM-4, ERNIE-4-Turbo, and DeepSeek-R1 drop by 0.1182, 0.0885, and 0.1310, respectively, compared to extracting one; (3) adding examples to the prompt can improve LLM extraction performance, but more examples do not always mean better results. [Limitations] This study is based on a single dataset; future work could extend the experiments to datasets from other domains. [Conclusions] Prompt engineering with LLMs currently struggles to replace classical deep learning models and should be considered an alternative only when labeled samples are limited.

  • Liu Yao, Wu Yani, Xiao Zheng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0427
    Online available: 2025-09-23

    [Objective] To address fragmented medical data, weak knowledge associations, and the limited generalization of traditional knowledge bases, we integrate medical knowledge using cognitive graphs and construct a dynamic vector knowledge base for precise clinical decisions. [Methods] A joint extraction model processes multi-source heterogeneous data to build a multi-level concept network. A vector knowledge base framework combining Monte Carlo tree search (MCTS) and adversarial learning optimizes diagnostic pathways and captures complex entity relationships. [Results] The BART model achieved an F1 score of 0.9974 in entity recognition. Reasoning accuracy improved from 67% to 99% after 200 rounds of adversarial training, demonstrating strong inference enhancement. [Limitations] Validation used only dermatology data and Chinese medical texts; multilingual adaptability and the handling of emerging terms need improvement. [Conclusions] Our approach overcomes the limitations of traditional knowledge bases and enhances clinical decision support.

  • Yang Liangliang, Yao Shuang, Li Yueyan, Yang Yuxiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1175
    Online available: 2025-09-23

    [Objective] To enhance link prediction accuracy through multi-dimensional node attribute embedding and edge self-representation enhancement, and to accurately identify industrial technology innovation opportunity scenarios. [Methods] A link prediction model based on two-stage GCN features is proposed to identify potential technology opportunities: the first stage learns node embedding representations, and the second stage strengthens edge representations. Market demand, technological competition, and policy orientation are then used to match technology application scenarios. [Results] The proposed link prediction model achieved an approximately 30% improvement in F1 score over the first stage alone, which relied solely on GCN node-embedding features, and an approximately 10% improvement over the second stage alone, which relied solely on similarity features. Combined with the three-dimensional evaluation model, four categories and 34 pairs of technological opportunities were identified. Sensitivity analysis identified five application scenarios for these opportunities, enhancing the adaptability and reliability of technological opportunity identification in the new energy vehicle sector. [Limitations] The study focuses solely on new energy vehicle technology, so the universality and transferability of the methods require improvement; the identification of technological opportunities is centred on single-theme pairs, lacking multi-theme combination modelling; international data and cross-disciplinary research are needed to further validate the universality of the methods. [Conclusions] By combining link prediction with multi-dimensional indicator evaluation, technological opportunities can be identified more accurately and in greater detail.
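
    A minimal sketch of the second-stage idea (strengthening an edge representation built from first-stage GCN node embeddings) follows; the Hadamard-product edge feature and the small MLP are illustrative choices, not the paper's exact design.

        import torch
        import torch.nn as nn

        class EdgeScorer(nn.Module):
            """Stage 2: build an edge feature from two stage-1 node embeddings
            (here their elementwise product), refine it, and score the link."""
            def __init__(self, dim):
                super().__init__()
                self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                         nn.Linear(dim, 1))
            def forward(self, h_u, h_v):
                return torch.sigmoid(self.mlp(h_u * h_v)).squeeze(-1)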

  • Dou Luyao, Wang Zehao, Qu Jingchen, Zhou Zhigang, Dai Longzheng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0977
    Online available: 2025-09-23

    [Objective] This study aims to address the challenges of data utilization efficiency and privacy protection in multi-institutional data fusion, thereby enhancing both the practicality and security of data sharing among multiple institutions. [Methods] A novel multi-institutional data fusion model, termed CEFDFM-MI (Cloud-Edge Federated Data Fusion Model for Multi-Institution), is proposed, which integrates cloud-edge collaboration with federated learning. Within the cloud-edge collaborative framework, a budget allocation mechanism and an information gain evaluation mechanism are devised to ensure the efficiency of data fusion processes. Furthermore, by leveraging the distributed characteristics of federated learning, the proposed model ensures secure and privacy-preserving data integration across institutions. The model is empirically evaluated on MNIST, CIFAR-10, and CIFAR-100 datasets under independent and identically distributed (IID) scenarios, as well as non-IID scenarios exhibiting low, medium, and high degrees of heterogeneity, in order to assess its performance across diverse and complex environments. [Results] Under IID conditions, the CEFDFM-MI model achieves a maximum accuracy of 94.52%. In non-IID scenarios with low, medium, and high heterogeneity, the model attains peak F1 scores of 73.71%, 74.51%, and 73.45%, respectively. Moreover, in the presence of model heterogeneity at the edge level, the proposed model demonstrates an accuracy improvement of approximately 6%–8% compared to independent training on individual edge nodes. [Limitations] The current study does not address scenarios where the objectives of cloud and edge models are misaligned, and the model's applicability in more complex environments remains to be further explored. [Conclusions] The proposed CEFDFM-MI model exhibits superior global performance relative to FedAvg and FedProx, and possesses robust capabilities in handling model heterogeneity across multiple institutions.
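
    For context, the FedAvg baseline that CEFDFM-MI is compared against aggregates edge models by a sample-count-weighted average of their parameters; a minimal sketch, assuming PyTorch state dicts with float parameters:

        import torch

        def fedavg(edge_states, edge_sizes):
            """Cloud-side aggregation: average edge-model state dicts,
            weighted by each edge node's local sample count."""
            total = float(sum(edge_sizes))
            return {
                k: sum(s[k].float() * (n / total)
                       for s, n in zip(edge_states, edge_sizes))
                for k in edge_states[0]
            }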

  • Meng Jiana, Ma Tengfei, Zhao Di, Liu Shuang, Wang Bolin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1227
    Online available: 2025-09-23

    [Objective] This paper aims to address insufficient inter-modal relationship mining in existing multi-modal rumor detection and to adaptively aggregate uni-modal and multi-modal features. [Methods] This paper proposes a rumor detection model based on multi-modal adaptive feature fusion. First, pre-trained BERT and EfficientNet models extract text and image features, respectively. Next, a multi-modal collaborative enhancement network generates complementary enhancement information between modalities, and a cross-modal similarity learning network adaptively aggregates uni-modal features with the fused multi-modal features. The aggregated features are then fed into the rumor detection network, while a domain discrimination network learns event-invariant representations across different events. [Results] Experimental results show that the accuracy of the proposed method on two public datasets, Twitter and Weibo, reaches 91.4% and 90.3%, respectively, outperforming the baseline models. [Limitations] The model only uses text and image data for rumor detection and does not incorporate video or audio that may accompany a post. [Conclusions] The proposed model can fully explore inter-modal relationships and adaptively aggregate uni-modal and multi-modal features to improve detection accuracy.

  • Song Yun, He Fan, Bao Zhijie, Zhao Lingjun, Wei Zhongyu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0425
    Online available: 2025-09-23

    [Objective] This study leverages reasoning-chain distillation from large language models to train a new model capable of joint-task deep reasoning for legal charge prediction in judicial proceedings. [Methods] We first employ a teacher reasoning model to generate a deep-reasoning distillation dataset, which is then refined through a large language model workflow. The student model is subsequently trained to learn both the reasoning process and the final charge predictions. [Results] Our model outperforms the best-performing baseline on the Criminal-S dataset for charge prediction, exceeding the 671B-parameter teacher model despite having only 7B parameters. Specifically, it improves the F1 score by 0.2% in charge prediction and by 18.3% in charge attribute prediction. [Limitations] The model does not account for the complexity of cases involving multiple defendants or multiple charges. [Conclusions] Our model outperforms existing approaches in charge reasoning capability, and achieves high accuracy and enhanced explainability in legal charge prediction.

  • Lu Xinyuan, Xu Anqi, Zhang Jinao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0291
    Online available: 2025-09-23

    [Objective] Explainability and explanation accuracy often form a paradox in the design of Explainable Artificial Intelligence (XAI). Exploring the relationship between these factors and user information behavior is crucial: it deepens the theoretical understanding of human-AI interaction and helps optimize explainable designs for various scenarios. [Methods] Through two parallel experiments under different contextual conditions, this study examines the effects of AI explainability and the relevance of explanation content on individuals' information adoption, and further explores how the AI "explanation singularity" shifts across task scenarios with varying information demands. [Results] AI explainability significantly affects individual information adoption, and the specific relationship is moderated by content relevance. Explanation singularities mark the turning point at which explainable designs influence information adoption behavior, and they vary across task contexts. These findings not only expand the research perspective on aligning explanatory rationality with human preferences, but also reveal the crucial role of the relationship between explanations and content relevance in driving users' information adoption, emphasizing its marginal impact on adoption behavior. [Limitations] The presence of false information has not been considered, which limits what can be revealed about the "explanation singularities" of XAI in individual information adoption. [Conclusions] The study reveals differences in AIGC explanation adoption mechanisms from a multi-context perspective, uncovering the variation patterns of the "explanation singularities."

  • Han Mingxing, Xu Liwei, Li Jiaxuan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0317
    Online available: 2025-09-23

    [Objective] In response to the shortcomings of existing multimodal fake news detection studies regarding foreground-background features and adaptive adjustment mechanisms for external features, this study proposes a multimodal fake news detection model based on contrast-driven feature augmentation. [Methods] A Semantic Decoupling Module separates the foreground subject representation and background contextual information of news images, which guides multi-level feature fusion and key feature augmentation. An entropy weight method combined with crowd-feedback features from news comments implements an adaptive adjustment mechanism for external feature weights. [Results] The proposed model achieved precision improvements of 1.21% and 1.58% on private and public datasets, respectively, demonstrating its superiority. [Limitations] Contrast-driven feature fusion significantly increases model complexity and hardware requirements; performance depends on the quality of crowd-feedback features; other modalities such as video and speech are not considered. [Conclusions] The model can provide technical support for social media platforms and government regulators in detecting fake news.
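
    The entropy weight method mentioned for weighting crowd-feedback features has a standard closed form: features whose values are more dispersed across samples carry more information and therefore more weight. A minimal sketch (the smoothing constant is ours; assumes a non-negative matrix with more than one sample):

        import numpy as np

        def entropy_weights(X):
            """X: (n_samples, n_features) non-negative feature matrix.
            Returns normalized weights, one per feature."""
            P = X / (X.sum(axis=0, keepdims=True) + 1e-12)   # column-normalize
            E = -(P * np.log(P + 1e-12)).sum(axis=0) / np.log(len(X))
            d = 1.0 - E                                       # divergence degree
            return d / d.sum()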

  • Xiang Qian, Cai Zangtai, Li Cuo, Ma Denghao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0109
    Online available: 2025-09-12

    [Objective] This study aims to address insufficient local feature extraction and decoding redundancy in traditional Transformer models for Tibetan news generation, with the goal of enhancing the quality, detail, and semantic coherence of generated Tibetan news texts. [Methods] The model leverages the title and prompt words as key inputs to enhance understanding of the text's theme. In the encoder, a Transformer captures the global features of the title while a CNN extracts the local features of the prompt words, and weighted fusion forms a refined text representation. The decoder employs a hybrid strategy and an autoregressive mechanism to generate news content closely related to the input step by step, reducing redundancy and improving the naturalness and coherence of the text. [Results] On a self-constructed Tibetan news generation dataset, the model achieved BLEU, ROUGE, and Distinct scores of 38.9%, 35.8%, and 47.2%, respectively, a significant improvement over baseline models. [Limitations] This study focuses on Tibetan news generation and has not yet been validated in other Tibetan text generation scenarios, such as literary creation or technical documentation. Additionally, the model's computational efficiency and resource consumption require further optimization. [Conclusions] By combining global and local feature extraction with a hybrid decoding strategy, this study significantly enhances the quality and detail of Tibetan news generation, providing an innovative solution for Tibetan natural language processing.

  • Yao Jianjun, Zhuang Zicong, Li Ruisheng, Yang Dunshun, Zhang Zhen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0156
    Online available: 2025-09-12

    [Objective] This paper targets high-quality entity-relation extraction for Chinese judicial texts, improving the precision of domain-specific triplet extraction from judgment documents through a framework tailored to judicial linguistic patterns and semantic dependencies. [Methods] This paper proposes S-CasRel-MF, a joint extraction model based on CasRel with entity feature-sensitive attention mechanisms. The model incorporates the LERT Chinese encoder to enhance semantic modeling of complex judicial texts, combines self-attention and BiLSTM to improve the capture of contextual interactive entity features, addresses error propagation in subject-object extraction through domain-specific feature dictionaries and multi-head attention, and mitigates sample imbalance during decoding with a Focal Loss function. [Results] Experimental results demonstrate that S-CasRel-MF attains F1 scores of 83.51% and 83.40% on drug-related and theft case datasets, respectively, statistically significant improvements of 9.77% and 8.73% over the CasRel baseline; compared with other types of extraction models, F1 scores increased by 16.59% and 15.70% on average. [Limitations] The model has elevated computational complexity during inference, and its reliance on external legal entity feature lexicons creates a domain adaptation bottleneck: lexicon coverage gaps during cross-domain migration degrade extraction accuracy. [Conclusions] Our model captures intricate inter-entity relationships in judicial texts and shows statistically significant advantages over existing baselines in joint entity-relation extraction, particularly for semantically dense legal interactions.
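
    The Focal Loss used against sample imbalance during decoding is a standard objective; a minimal binary-tagging sketch, with gamma and alpha set to common defaults rather than the paper's values:

        import torch
        import torch.nn.functional as F

        def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
            """Down-weights easy, well-classified tokens so that rare
            entity/relation tags dominate the gradient. logits/targets:
            binary tagging scores and 0/1 labels of the same shape."""
            p = torch.sigmoid(logits)
            ce = F.binary_cross_entropy_with_logits(logits, targets,
                                                    reduction="none")
            p_t = p * targets + (1 - p) * (1 - targets)
            alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
            return (alpha_t * (1 - p_t) ** gamma * ce).mean()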

  • Wu Huidong, Su Qiudan, Wu Dengsheng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1121
    Online available: 2025-09-12

    [Objective] To effectively integrate journal rank results with tied ranks, reflecting the academic community's overall assessment of publication quality. [Methods] A journal rank aggregation method based on a Linear Pseudo-Boolean Optimization model (LPBO) is proposed. This approach transforms various original ranking results into pairwise comparisons between journals and utilizes a generalized Kendall-τ distance to construct the LPBO model. Under the constraints of ensuring the uniqueness and transitivity of journal rankings, the method achieves an aggregated ranking outcome. [Results] An empirical study on journals in the field of information science and library science revealed that the correlation coefficient between the aggregated results of LPBO and the original ratings is 13.7% higher than that among the original ratings themselves, while preserving tied rank information. The model demonstrates robust performance when handling data with different scales, incomplete information, and varying numbers of journals and rankings. [Limitations] The model may face efficiency challenges when applied to large datasets. [Conclusions] The LPBO method avoids the biases introduced by strict ranking approaches, providing a fair and robust solution for evaluating journal quality and impact, with significant theoretical and practical value.
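
    As a rough illustration of such a formulation (our notation; the paper's exact LPBO model may differ), binary variables x_{uv} encode "journal u ranks no lower than journal v", and the aggregation minimizes generalized Kendall-τ disagreement subject to completeness and transitivity:

        \min_{x \in \{0,1\}^{n(n-1)}} \sum_{k=1}^{m} \sum_{u \neq v} w^{(k)}_{vu}\, x_{uv}
        \quad \text{s.t.} \quad x_{uv} + x_{vu} \ge 1, \qquad
        x_{uv} + x_{vw} - x_{uw} \le 1 \;\; \forall\, u, v, w,

    where w^{(k)}_{vu} is the penalty input ranking k assigns to placing u at or above v. Ties survive as x_{uv} = x_{vu} = 1, which is how tied-rank information can be preserved, while the second constraint forces a transitive (and hence unique, well-ordered) outcome.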

  • Wang Xianwen, Yin Yixian, Geng Yu, Yu Qianqian, Zhang Guangyao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0171
    Online available: 2025-09-12

    [Objective] From the perspective of patents citing papers, this study explores the relationship between the disruptiveness of scientific papers and their technological impact, enriching research on the factors influencing the flow of scientific knowledge into the technological domain. [Methods] Using over 680,000 scientific papers published in the field of artificial intelligence, combined with patent citation data, we built a large-scale dataset. Applying regression models such as Probit, we conducted analyses across five dimensions: possibility, importance, universality, persistence, and time lag. [Results] The findings reveal a positive correlation between a paper's disruptiveness and the possibility of being cited by patents, indicating that disruptive science is more likely to generate technological impact. Moreover, highly disruptive scientific outputs yield more significant, universal, and persistent technological impacts, with shorter time lags. [Limitations] The motivation and type of citations are not considered, and patent characteristics are not analyzed. [Conclusions] This study confirms a positive correlation between the disruptiveness of scientific papers and their technological impact, and provides a theoretical foundation for policies aimed at accelerating the technological translation of scientific knowledge.

  • Yang Rui, Zhu Xuefang, Wang Zhenyu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1197
    Online available: 2025-09-12

    [Objective] Based on chain-of-thought prompts, we explore the usability and effectiveness of large language models in multimodal entity disambiguation tasks. [Methods] We construct a large language model prompt template based on chain-of-thought prompts, input prior knowledge and multimodal information into the model, and help it determine, from the candidate entity set, the entity to which a mention accurately refers. [Results] Experiments show that on the Wiki-MEL, Twitter-MEL, and Weibo-MEL datasets, the accuracy of the PLMED model improves by 15.1%, 11.5%, and 4.1%, respectively, over the current most advanced models. [Limitations] The experiments did not explore in detail how the model's performance changes under different prompt construction methods. [Conclusions] Large language models based on chain-of-thought prompts adapt well to multimodal entity disambiguation tasks in different scenarios and have great application potential for this task.
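
    A chain-of-thought disambiguation prompt of the kind described might look like the template below; the field names and reasoning steps are illustrative placeholders, not the paper's template.

        PROMPT = """You are an entity disambiguation assistant.
        Mention: {mention}
        Context: {context}
        Image caption: {caption}
        Candidate entities:
        {candidates}

        Let's reason step by step:
        1. Summarize what the context and image tell us about the mention.
        2. Compare this evidence with each candidate's description.
        3. Output the single best-matching candidate ID.
        Answer:"""

        def build_prompt(mention, context, caption, candidates):
            listing = "\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
            return PROMPT.format(mention=mention, context=context,
                                 caption=caption, candidates=listing)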

  • Wang Zhenyu, Zhu Xuefang, Zhang Jundong, Yang Rui, Liu Songyin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1191
    Online available: 2025-09-12

    [Objective] To address users' growing demand for diverse and personalized information in bibliographic search scenarios and to enhance the performance and user experience of question answering systems. [Methods] This paper constructs a conversational question answering system for bibliographic search (BSCQA). The system employs the model context protocol to integrate large language models with external databases. To improve the accuracy of Text-to-SQL generation, this paper designs and integrates a contrastive learning-based example selection strategy to enhance the model's understanding of query intent in specialized domains. [Results] Experimental results on the bibliographic search semantic parsing dataset constructed in this paper demonstrate that, compared to the zero-shot scenario, the proposed approach improves the execution accuracy of DeepSeek-V3 by 18.1% in the 5-shot scenario. [Limitations] Due to the limited coverage of the experimental dataset, the system's adaptability in cross-domain applications still requires further improvement. [Conclusions] The proposed BSCQA system showcases the potential and application value of large language models in intelligent bibliographic retrieval, a typical application scenario within library and information science, providing a reference for research on conversational question answering systems in other vertical domains.
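
    The example selection step can be approximated with off-the-shelf sentence embeddings: retrieve the k most similar (question, SQL) pairs to build the few-shot prompt. The sketch below uses a generic multilingual encoder as a stand-in for the paper's contrastively trained selector.

        import numpy as np
        from sentence_transformers import SentenceTransformer

        encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

        def select_examples(query, pool, k=5):
            """pool: list of {"question": ..., "sql": ...} example pairs.
            Returns the k pairs whose questions are closest to the query."""
            q = encoder.encode([query], normalize_embeddings=True)
            E = encoder.encode([ex["question"] for ex in pool],
                               normalize_embeddings=True)
            top = np.argsort(-(E @ q.T).ravel())[:k]
            return [pool[i] for i in top]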

  • Cai Mouxi, Sun Haichun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0278
    Online available: 2025-09-12

    [Objective] To address the low accuracy of multi-step reasoning in Visual Question Answering (VQA) tasks, this study integrates external knowledge and optimizes Large Language Model (LLM) prompting methods. [Methods] A new knowledge-based visual question answering model is proposed, which helps the LLM understand image region information by introducing the PromptCap captioning model to generate image captions, and integrates open-source knowledge retrieval and image semantic knowledge retrieval to broaden reasoning sources. Meanwhile, a new question decomposition scheme is proposed to reduce reasoning difficulty, and relevant in-context examples are constructed, with multiple reasoning answers integrated to improve accuracy. [Results] The proposed model achieves accuracies of 67.2% and 64.8% on the OK-VQA and A-OKVQA datasets, respectively, outperforming current SOTA models. Ablation experiments also verify the effectiveness of each module. [Limitations] The model is best suited to multi-step reasoning VQA scenarios, as the question decomposition module introduces additional computational and time overhead, reducing efficiency for simple VQA tasks. [Conclusions] By leveraging LLMs to optimize the KB-VQA framework, this work provides an effective novel approach for knowledge-based VQA requiring complex reasoning.

  • Chen Danlei, Hua Bolin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0420
    Online available: 2025-09-12

    [Objective] This study explores the mapping between syntactic forms and rhetorical functions of research gap sentences (RGS), integrating multi-level linguistic features to advance a novel framework for sentence-level knowledge extraction in academic texts. [Methods] Guided by the principles of validity, redundancy, and complementarity, we restructure syntactic complexity indices into a unified evaluative framework. Variance analysis and linear regression are conducted to examine differences in syntactic complexity across various types of research gap sentences. To address the automatic classification task, we fine-tune scientific language models to extract textual semantic features and adaptively integrate them with syntactic information through a gated fusion mechanism. Furthermore, a hybrid loss function that dynamically combines cross-entropy and Dice loss is introduced to improve the model's performance on minority classes. [Results] There are significant inter-type distinctions among research gap sentences across multiple dimensions of syntactic complexity. Our gated fusion-based model outperforms state-of-the-art baselines by at least 1.44% in terms of F1 score, with ablation studies further demonstrating the necessity and rationality of its key components. [Limitations] This study focuses on mining research gap sentences from representative conference papers in the field of artificial intelligence, without extending the analysis beyond disciplinary boundaries. In addition, the relationship between different types of research gaps and measures of scientific innovation or impact has not yet been explored. [Conclusions] The proposed gated fusion-based model achieves accurate classification of fine-grained rhetorical functions in academic texts under low-resource conditions. It effectively supplements and reconciles distinct linguistic cues present at different representational levels, while maintaining a well-balanced trade-off among stability, robustness, and scalability.
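
    A gated fusion layer of the type described combines semantic and syntactic vectors through a learned sigmoid gate; the sketch below is a generic form, with the dimensions and projection choices assumed rather than taken from the paper.

        import torch
        import torch.nn as nn

        class GatedFusion(nn.Module):
            """Per-dimension gate decides how much syntactic evidence
            to mix into the semantic representation."""
            def __init__(self, dim_sem, dim_syn):
                super().__init__()
                self.proj = nn.Linear(dim_syn, dim_sem)
                self.gate = nn.Linear(dim_sem * 2, dim_sem)
            def forward(self, h_sem, h_syn):
                h_syn = self.proj(h_syn)
                g = torch.sigmoid(self.gate(torch.cat([h_sem, h_syn], dim=-1)))
                return g * h_sem + (1 - g) * h_syn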

  • Dai Wei, Zhu Xingce, Song Yang, Yang Xiao, Geng Xueyu, Ma Jingdong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0298
    Online available: 2025-09-12

    [Objective] To compare the effectiveness of different reasoning methods of large language models (LLMs) in policy intelligent question answering (Q&A) within the context of public health emergencies. [Methods] Taking DeepSeek-R1 as the experimental backbone, we implemented retrieval-augmented generation (RAG), knowledge graph (KG) enhanced reasoning, fine-tuning, reasoning with web search support, and self-contained reasoning without external data. All methods were benchmarked against human-annotated Q&A pairs. For the self-contained setting, we further compared the performance of Qwen-QwQ and GPT-4o. [Results] In the automatic evaluation, the combined approach of RAG, KG, and fine-tuning achieved the highest BLEU-4 and ROUGE-L scores (0.259 and 0.494, respectively), followed by the web search method (BLEU-4: 0.225; ROUGE-L: 0.465). In the manual evaluation, the combined approach received the highest score for content accuracy (3.560), while large-parameter models without external data support performed better in fluency, completeness, usability, and credibility. [Limitations] The experimental dataset was limited to publicly available policy texts and did not incorporate data from closed real-world systems. In addition, multimedia formats were not included, which precluded evaluation under multimodal conditions. [Conclusions] Supported by local data, different reasoning methods can substantially reduce hallucination and enhance accuracy in policy Q&A, demonstrating strong applicability to vertical domains such as public health emergencies. Incorporating external data further improves factual accuracy, whereas self-contained reasoning without external data tends to produce responses that are more complete and more convincing.

  • Wang Zhenyu, Ping Yifang, Xiao Tong, Wang Jianmin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0197
    Online available: 2025-09-12

    [Objective] To improve the accuracy of learning behavior detection in online education and move educational intelligence from a result-oriented to a process-interpretable paradigm, this paper proposes a large language model-driven learner behavior detection model for university MOOC courses. [Methods] The model fuses large language models, retrieval-augmented generation, and explainable artificial intelligence techniques to construct text-time-series data that incorporates learning behavioral features and emotional features. Text data are extracted and converted into time-series data by a BERT model and trained with a LightGBM model, and the SHAP method then quantifies each feature's marginal contribution to the model prediction, making the prediction process interpretable. [Results] The proposed model performs excellently on learning behavior detection tasks, with an accuracy of 99.90%, a recall of 99.78%, and an F1 score of 97.69%, all significantly better than the baseline models. Compared with the lowest-performing logistic regression model, the three indicators improved by 20.84%, 24.34%, and 21.94%, respectively, fully verifying the model's advantage in recognizing complex features. [Limitations] The data come from a single online learning platform and the sample granularity is coarse, limiting external generalization. [Conclusions] By integrating explainable artificial intelligence and multimodal features, this study effectively improves the model's accuracy and interpretability, providing decision support for online education platforms in colleges and universities.
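
    The LightGBM-plus-SHAP stage has a conventional shape; a minimal sketch, where X_train, y_train, and X_test are placeholders for the BERT-derived time-series features and labels described above, and the hyperparameters are illustrative:

        import lightgbm as lgb
        import shap

        # X_train/y_train/X_test: placeholder feature matrices and labels
        model = lgb.LGBMClassifier(n_estimators=200, learning_rate=0.05)
        model.fit(X_train, y_train)

        explainer = shap.TreeExplainer(model)        # exact SHAP for tree ensembles
        shap_values = explainer.shap_values(X_test)  # per-feature marginal contributions
        shap.summary_plot(shap_values, X_test)       # global feature-importance view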

  • Yao Tianchen, Wang Hao, Li Peiqi, Bu Wenru, Yuan Ruiyang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0245
    Online available: 2025-09-12

    [Objective] This study aims to address the challenge of constructing the semantic relationship between "imagery" and "implication" in Chinese poetic wine culture, promoting the structured representation and automatic recognition of traditional culture, and providing technical support for the intelligent understanding and dissemination of cultural resources. [Methods] We propose the UOCIP model, which involves a four-stage process including corpus construction, imagery extraction, implication recognition, and knowledge graph construction. The model incorporates adversarial training mechanisms and AIGC-based data augmentation strategies to enhance performance in low-resource scenarios. [Results] Experimental results show that the Macro-F1 score of the imagery extraction model improved from 56.2% to 58.6%. The implication recognition model achieved a Macro-F1 score of 83.4% under a multi-modal attention mechanism. The constructed implication graph covers five types of terminological systems and supports the recognition of out-of-vocabulary terms. [Limitations] Despite the promising outcomes, the model still faces challenges in adapting to atypical poetic styles and in controlling the semantic boundaries of the graph. Its generalization ability and boundary precision require further improvement. [Conclusions] This study establishes an effective paradigm for semantic modeling of Chinese poetic wine culture under cold-start conditions, demonstrating the feasibility and applicability of the UOCIP model in the intelligent processing of traditional culture and offering theoretical and methodological references for future research.

  • Song Yuxin, Liu Lin, Wang Hailong, Liu Jing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0349
    Online available: 2025-09-12

    [Objective] This paper reviews the effect of retrieval granularity on the performance of Retrieval-Augmented Generation (RAG) and analyzes the trade-off between context integrity and precision at different granularities. [Coverage] Keywords such as “Retrieval-Augmented Generation” and “Retrieval Granularity” were used to retrieve literature from 2020 to 2025 in databases including Google Scholar, ACM Digital Library, and CNKI; 106 representative papers were ultimately selected for review. [Methods] RAG methods are categorized by retrieval granularity and thoroughly compared with regard to their technical paths, core mechanisms, innovations, and limitations. [Results] A research framework of “coarse-grained, fine-grained, and hybrid-grained” retrieval is established. The findings reveal a core trade-off: coarse granularity preserves context at the cost of noise, while fine granularity offers precision at the risk of semantic fragmentation. Consequently, the fusion and scheduling mechanisms of hybrid approaches are the main challenge. [Limitations] The review focuses primarily on text-based RAG, with less comprehensive coverage of multimodal RAG involving images, audio, or video. [Conclusions] The advancement of RAG depends on smarter granularity selection and better information fusion. Future work should explore proposition-level retrieval, dynamic granularity selection, adaptive mechanisms, and the synergy of structured and unstructured knowledge.

  • Zhou Yuhao, Wang Jie, Zhang Shunxiang, Li Jiawei, Zhang Yongqi, Yang Junni
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0028
    Online available: 2025-09-12

    [Objective] To solve the problems of entity overlap and complex relations in topic-oriented sarcasm detection, this paper proposes a prompt learning model that integrates external knowledge-aware attention to improve detection accuracy. [Methods] Firstly, a topic-oriented prompt learning template is derived from the topic and comment text. Secondly, entities in the topic and comment text are identified and aligned with entities in a knowledge graph, and these entities and their contexts serve as external knowledge providing supplementary information. Then, external knowledge-aware attention is designed to measure the importance of the knowledge. Finally, the mapping words are specified and the masks are predicted through the Verbalizer module. [Results] Experiments on the public ToSarcasm dataset show that the proposed model outperforms advanced comparison models, with an accuracy of 72.25% and an F1 score of 77.16%. [Limitations] This study did not use a learnable soft-prompt method to construct the prompt template, leaving room for further optimization of the prompt design and the selection of mapping words. Moreover, only the ToSarcasm dataset was used for training, so the model's generalization ability needs improvement. [Conclusions] Introducing external knowledge effectively addresses entity overlap, prompt learning effectively handles complex relations, and the model improves the accuracy of topic-oriented sarcasm detection.

  • Sun Qinglin, Wang Xiaomei, Chen Ting, Song Xinyu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0257
    Online available: 2025-09-12

    [Objective] To better understand the trajectory of scientific development, this study proposes a general-purpose method for automatically classifying the research levels of academic papers, aiming to delineate level boundaries precisely, reveal disciplinary evolution patterns, and support research planning, resource optimization, and efficient knowledge transfer. [Methods] Building upon the framework developed by Boyack et al., we construct a cross-disciplinary research level classification scheme. A large language model is fine-tuned through supervised learning on annotated data, integrating deep semantic analysis and prompt engineering to enable automated level identification. The model's effectiveness is further validated using the science mapping structure. [Results] The proposed method achieves an F1 score of 85.45% and an accuracy of 85.44% on the research level classification task, significantly outperforming the multinomial logistic regression baseline (62.71% and 62.00%, respectively), and demonstrates clearer level distinctions within the science mapping structure. [Limitations] Due to computational constraints, comparative experiments across multiple models were not conducted. [Conclusions] Supervised fine-tuning of large language models shows strong accuracy and robustness in research level classification. Optimizing prompt design may further enhance performance, supporting faster translation of scientific knowledge and guiding discipline development.

  • Ma Weilu, Sun Tan, Zhao Ruixue, Xian Guojian
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0225
    Online available: 2025-07-08

    [Objective] To simplify scientific literature summarization and generate intuitive graph-based summaries for enhancing research efficiency. [Methods] Rice breeding-related papers were extracted from the PMC database, and 4,276 full-text-to-summary QA pairs were constructed. Optimal prompts and temperature coefficients were experimentally identified. The Qwen2.5-7B-Instruct large language model was fine-tuned using supervised datasets. The fine-tuned model was then integrated into the GraphRAG framework to generate graph-based summaries for individual papers. Global queries with the optimized prompt were subsequently executed in GraphRAG to produce textual summaries. [Results] Compared to baseline models, the proposed method achieved F1 score improvements of 44.16%, 61.36%, and 54.87% on ROUGE-1, ROUGE-2, and ROUGE-L, respectively. In a 5-point manual evaluation, the method outperformed baselines by an average of 1.78 points, with graph-based summaries demonstrating significantly enhanced intuitiveness. [Limitations] Hardware constraints limited the scale of the selected LLM, potentially restricting generative capability. Additionally, the GraphRAG framework exhibited prolonged index construction times, highlighting the need for efficient inference acceleration in practical applications. [Conclusions] Graph-enhanced retrieval-augmented generation technology effectively captures long-range implicit information in scientific papers, producing comprehensive textual summaries and hierarchically structured graph-based summaries. This methodology improves researchers' reading efficiency and supports scientific productivity.

  • Yao Yuanzhang, Xu Jian
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0011
    Online available: 2025-07-04

    [Objective] This study aims to analyze the phenomenon of semantic differences of interdisciplinary terms across different fields and explore the underlying causes of these semantic variations. [Methods] We utilize pre-trained deep learning models to automate the identification and quantification of semantic differences in terms. A semantic difference degree indicator is designed to quantitatively measure the extent of these differences, and a co-occurrence analysis is conducted for the disciplines involved in the terms. [Results] The identification accuracy of semantic differences based on the pre-trained model reaches 0.8193, and the constructed measurement indicators effectively quantify semantic differences. [Limitations] The study is limited to the semantic differences of Chinese terminology, with a restricted scope in terms of the interdisciplinary range of the terms selected. [Conclusions] The main causes of semantic differences in interdisciplinary terms are identified as: specialization and fragmentation of disciplines, linguistic and contextual differences, hierarchical and abstract conceptualization, cognitive emphasis differences, and the influence of interdisciplinary intersection and integration. This provides new perspectives and methodologies for exploring the reasons behind terminological discrepancies and their relationships with disciplines.
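
    One plausible operationalization of a semantic difference degree (ours, for illustration; the paper's indicator may be defined differently) is one minus the cosine similarity between a term's mean contextual embeddings in two disciplines' sentence sets:

        import torch
        from transformers import AutoTokenizer, AutoModel

        tok = AutoTokenizer.from_pretrained("bert-base-chinese")
        enc = AutoModel.from_pretrained("bert-base-chinese").eval()

        def term_vec(sentence, term):
            """Mean contextual embedding of `term`'s tokens within `sentence`."""
            inputs = tok(sentence, return_tensors="pt")
            with torch.no_grad():
                h = enc(**inputs).last_hidden_state[0]
            ids = tok(term, add_special_tokens=False)["input_ids"]
            seq = inputs["input_ids"][0].tolist()
            for i in range(len(seq) - len(ids) + 1):
                if seq[i:i + len(ids)] == ids:   # first occurrence of the term span
                    return h[i:i + len(ids)].mean(dim=0)
            return h.mean(dim=0)                 # fallback: sentence average

        def semantic_difference(sents_a, sents_b, term):
            va = torch.stack([term_vec(s, term) for s in sents_a]).mean(dim=0)
            vb = torch.stack([term_vec(s, term) for s in sents_b]).mean(dim=0)
            return 1 - torch.cosine_similarity(va, vb, dim=0).item()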

  • Deng Hangyu, Tang Chuan, Pu Yunqiang, Ao Lijuan, Wang Wanjing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0993
    Online available: 2025-07-04

    [Objective] Given the characteristics of large volume, broad scope, and frequent colloquial expressions in U.S. Congressional hearing transcripts, this paper proposes a framework for automatically identifying China’s science and technology security risks. [Methods] Starting from the data features of the hearings and the actual needs of analysts, this study realizes and integrates modules such as text filtering, summary generation, and question-answering by utilizing large language models. [Results] Using the 118th Congress hearings as experimental texts, the F1 score for text filtering, ROUGE-Lsum for summary generation, and the risk point recall rate for the QA system reached 0.7751, 0.6032, and 0.7636 respectively, significantly outperforming the baselines. [Limitations] This method is primarily designed for U.S. Congressional hearing transcripts and needs further validation with more types of data to consider it a general approach. [Conclusions] The proposed method can assist researchers in better extracting technological security risks from U.S. Congressional sources and preparing corresponding strategies.

  • Zhang Shuangbao, Cheng Quan, Zeng Yan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1126
    Online available: 2025-07-04

    [Objective] Semantic association information between Chinese texts must be better utilized to enhance the extraction of unstructured events from text. [Methods] This study proposes a Chinese document-level event extraction model (CSDEE) that uses an attention mechanism to construct a cross-document interactive semantic network to enhance entity recognition. The event extraction task is then completed through document encoding and decoding of event extraction information. [Results] Experimental results demonstrate that CSDEE attains an accuracy of 80.7%, a recall of 84.1%, and an F1 score of 82.3% in event extraction, outperforming existing baseline models. Ablation experiments and generalization experiments on the public ChFinAnn and DuEE-fin datasets further substantiate the model's efficacy in Chinese document-level event extraction. [Limitations] At present, the model only enhances document event extraction performance and does not yet handle multi-classification of overlapping event types. [Conclusions] Thoroughly exploiting the parallel semantic information inherent in document-level data can enhance the precision of document event extraction.

  • Xie Wei, Xia Hongbin, Liu Yuan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1132
    Online available: 2025-07-04

    [Objective] This study aims to utilize deep learning methods to address the insufficient utilization of complete entity and relation interaction information in zero-shot relation extraction tasks. [Methods] We propose a Joint Contrastive Learning model (JCL) for zero-shot relation extraction, which integrates entity and relation information based on contrastive learning. Firstly, data augmentation techniques are applied to the original input text to enrich the information available to the model. Secondly, an enhanced cross-attention module deeply integrates entity pairs and jointly processes relations, extracting interaction information between entities as well as between entities and relational semantics, thereby amplifying the subtle differences of various relations in the embedding space. Finally, the model is optimized using a combination of cross-entropy loss and contrastive loss. [Results] Compared with the baseline model, the proposed approach achieves improvements on the FewRel dataset with unseen relations: an F1 score increase of 3.12% for m=5, 5.19% for m=10, and 1.99% for m=15. On the Wiki-ZSL dataset, the improvements are 7.05% for m=5, 3.42% for m=10, and 8.08% for m=15. [Limitations] The study is limited by the small number and relative homogeneity of datasets in this research field. [Conclusions] The proposed Joint Contrastive Learning model for zero-shot relation extraction demonstrates advanced performance on the public datasets, showcasing its efficacy for this task.

  • Shengli Zhou, Rui Xu, Tinggui Chen, Shaojie Wang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1138
    Online available: 2025-07-04

    [Objective] To address the insufficient characterization of multimodal features in the AI face-swap fraud process, this study establishes a face-swapping fraud risk identification model (FSFRI) that synergistically integrates multimodal features to optimize victimization risk assessment. [Methods] By comprehensively considering the generation and propagation processes of AI face-swapping fraud, FSFRI extracts four types of features: fake face video frames, traffic composition description features, traffic payload data features, and traffic temporal features. A feature fusion module achieves complementary integration of cross-modal features, and a risk identification module then detects and identifies deception risks. [Results] On a dataset generated through simulation experiments, FSFRI achieved good identification performance, with an F1 score of 0.92. It also demonstrated strong robustness in low-noise environments (noise levels from 0 to 0.2); the F1 score decreases by only 0.019 at a noise ratio of 0.2. [Limitations] Because the use of multimodal features increases FSFRI's complexity, the model places higher demands on computational resources, and its risk identification effectiveness in high-noise environments remains to be further enhanced. [Conclusions] FSFRI can effectively extract and integrate the multimodal features generated in the AI face-swapping fraud process and precisely identify AI face-swapping fraud victimization risks.

  • Ma Yingxue, Gan Mingxin, Hu Lei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1235
    Online available: 2025-07-04

    [Objective] To address the issue that deep learning recommendation methods lack modeling of user interest distribution characteristics and cannot fully capture user preferences, a sequential recommendation method based on modeling the aggregative and hierarchical distribution of user interests is proposed. [Methods] Using an attention network and LSTM, representation vectors of users and items are obtained from behavioral sequences, and the positional centers and boundary radii of user interest distributions are learned. The hierarchy and aggregation of the interest distribution are characterized by two radii. User preferences are predicted by fitting the distance between candidate item features and the center of the user's interest distribution to an interaction probability. Recommendations are generated by fusing neural-network-based behavior predictions with preference estimates from the interest model. [Results] Experiments on the Amazon dataset demonstrate that, compared to the best-performing baseline, the proposed method achieves optimal performance on precision, recall, F-score, coverage, and other evaluation metrics, with improvements exceeding 10 percentage points. [Limitations] User-generated content beyond the behavior sequence is not considered; future work can improve interest modeling by integrating user comments and other information. [Conclusions] This method accurately describes the distribution characteristics of user interest, improves recommendation accuracy, and optimizes the overall quality of recommendation results.
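
    The core preference-estimation step, fitting distance to interaction probability, can be sketched as below; the paper's two-radius hierarchy is simplified here to a single boundary radius, which is our assumption.

```python
# Sketch under a simplifying assumption: a single learned radius bounds
# the user's interest region, and distance to the center maps to an
# interaction probability via a sigmoid.
import torch

def interest_probability(item, center, radius, temperature=1.0):
    """item, center: (d,) vectors; radius: scalar interest boundary.
    Items inside the radius get probability above 0.5."""
    dist = torch.norm(item - center)
    return torch.sigmoid((radius - dist) / temperature)

center = torch.zeros(16)
item = torch.full((16,), 0.1)                # a nearby candidate item
print(interest_probability(item, center, radius=1.0))  # > 0.5
```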

  • Sun Mengge, Wang Yanpeng, Fu Yun, Liu Xiwen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1192
    Online available: 2025-07-04

    [Objective] This study explores prompt engineering methods for large language models in multi-domain scientific knowledge entity extraction, using scientific short texts as experimental data. The aim is to address the challenges posed by insufficient semantic context and domain diversity in short-text entity extraction. [Methods] To tackle wide domain coverage, condensed semantics with insufficient contextual information, and ambiguous entity boundaries in short texts, this study proposes a Scientific Prompt entity extraction strategy grounded in knowledge-prompt learning. By integrating the BERTopic method, the strategy dynamically incorporates domain knowledge into prompt design to enhance the semantic understanding and recognition capabilities of large language models, thereby improving extraction accuracy and generalization. [Results] Under the Scientific Prompt strategy, the F1 scores of QWEN2.5-7B, QWEN2.5-7B (fine-tuned), and GPT-4o are 0.6526, 0.7407, and 0.7878, respectively; the corresponding zero-shot F1 scores are 0.5534, 0.6165, and 0.6822. Prompting the base open-source model thus outperforms fine-tuning alone (0.6526 vs 0.6165), and the fine-tuned QWEN2.5-7B under the prompt strategy surpasses the zero-shot performance of GPT-4o (0.7407 vs 0.6822). [Limitations] This study only evaluates the proposed strategy on Chinese scientific short texts; its applicability to English texts remains untested. [Conclusions] The experiments demonstrate that the Scientific Prompt strategy can significantly enhance the performance of large language models in short-text, multi-domain entity extraction without requiring parameter updates. Its effectiveness on unsupervised scientific short texts is also validated, enabling accurate extraction of scientific entities to monitor technological trends. This research provides an important reference for knowledge entity extraction in general scientific short-text tasks.
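
    A minimal sketch of the knowledge-prompt assembly: topic keywords (which the paper derives with BERTopic) are injected into the instruction so the model sees domain context before extracting entities. The template wording and the example text are hypothetical.

```python
# Hypothetical prompt template: topic keywords (derived with BERTopic in
# the paper) are injected as domain cues; the wording here is ours.
def build_scientific_prompt(text, topic_keywords):
    hints = ", ".join(topic_keywords)
    return (
        "You are an expert in scientific information extraction.\n"
        f"Domain cues for this snippet: {hints}.\n"
        "Extract all scientific knowledge entities from the text below "
        "and return them as a JSON list.\n"
        f"Text: {text}"
    )

print(build_scientific_prompt(
    "A new perovskite solar cell reaches 26.1% efficiency.",
    ["photovoltaics", "perovskite", "energy materials"],
))
```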

  • Zhang Xiaojuan, Ji Ruyi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0018
    Online available: 2025-07-04

    [Objective] This paper proposes a global citation recommendation framework based on both static and dynamic heterogeneous graphs, aiming to enhance the accuracy of citation recommendations. [Methods] The framework first constructs a static weighted heterogeneous network and a temporal heterogeneous network separately. For the static network, mixed random walks and the skip-gram model generate node embeddings that capture local and global network information. For the temporal network, meta-path instances are generated via meta-path-based random walks, and the temporal evolution process of the heterogeneous graph is then modelled to produce node embeddings. The final embeddings of paper nodes are obtained through joint and separate training, and candidate citation lists are generated for an input paper by computing similarity between final paper-node embeddings. [Results] Experiments show that the proposed methods outperform those considering only dynamic or static network information; the independent training method performs best on almost all recall metrics (except recall@40); and the uncertainty-based multi-task weighting method achieves the best MRR and MAP, with values of 0.308 and 0.297. [Limitations] The model's performance has not been verified across multiple datasets, and its running efficiency needs further optimization. [Conclusions] Considering both the static and dynamic aspects of the network can effectively enhance the performance of global citation recommendation.
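
    A toy sketch of the meta-path-based random walk underlying the embeddings; the paper-author-paper meta-path, graph, and walk length below are illustrative only, and the resulting walks would feed a skip-gram model as described above.

```python
# Toy meta-path-guided random walk on a paper-author-paper graph; the
# graph, meta-path, and walk length are illustrative only.
import random

graph = {  # node -> list of (neighbor, neighbor_type)
    "p1": [("a1", "author")],
    "p2": [("a1", "author"), ("a2", "author")],
    "p3": [("a2", "author")],
    "a1": [("p1", "paper"), ("p2", "paper")],
    "a2": [("p2", "paper"), ("p3", "paper")],
}

def metapath_walk(start, metapath=("author", "paper"), length=6):
    walk, node = [start], start
    for step in range(length):
        wanted = metapath[step % len(metapath)]
        nbrs = [n for n, t in graph[node] if t == wanted]
        if not nbrs:
            break
        node = random.choice(nbrs)
        walk.append(node)
    return walk

# Walks like ['p1', 'a1', 'p2', 'a2', 'p3', ...] would then be fed to a
# skip-gram model to learn the node embeddings.
print(metapath_walk("p1"))
```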

  • Congjing Ran, Qunzhe Ding, Yonghui Song, Fuxin Wang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1171
    Online available: 2025-07-04

    [Objective] To address the challenge of distinguishing substantive patent transactions in patent transfer data, this study proposes a systematic approach that integrates multiple methods based on the Levenshtein distance algorithm, effectively identifying substantive patent transactions and exploring their differences in technical characteristics. [Methods] A screening process is designed for different patent transfer scenarios. A key step uses multiple Levenshtein-distance-based text similarity algorithms to score the similarity of the names and addresses of the transaction parties; the scores are compared against a preset threshold to exclude non-market transaction records arising from internal resource reallocation. The accuracy of the method is validated through empirical research, and statistical analysis compares technical indicators across transaction types. [Results] The method achieves an accuracy of 81.27% and effectively identifies substantive transaction behaviours. Patents that undergo substantive transactions have significantly higher technical indicators, such as the number of independent claims, the number of family patents, and the number of citations received, than patents that do not (p < 0.05). [Limitations] The dataset's temporal scope is restricted, and the method's handling of complex address structures requires further refinement to improve generalizability. [Conclusions] This study establishes an effective and scalable methodology for classifying substantive patent transaction behaviours, offering valuable data support for research on technology transfer and patent commercialization.
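
    A minimal sketch of the name-similarity screening step, using plain normalized Levenshtein similarity and an illustrative threshold; the paper combines several Levenshtein-based algorithms and also scores addresses, which this outline omits.

```python
# Sketch of the name-similarity screen with plain Levenshtein distance;
# the 0.8 threshold is illustrative, not the paper's calibrated value.
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def similarity(a, b):
    return 1 - levenshtein(a, b) / max(len(a), len(b), 1)

assignor = "Acme Technology Co Ltd"
assignee = "Acme Technology Company Ltd"
score = similarity(assignor, assignee)
print(score, "internal reallocation" if score > 0.8 else "substantive")
```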

  • Li Yihong, Yu Yanfang, Yu Qiwei, Li Sujuan, Zhang Shaolong, Ye Junjun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0239
    Online available: 2025-07-04

    [Objective] Large generative language models have brought new ideas to Chinese open relation extraction, but optimizing the quality of the relation extraction results they generate remains an important issue. [Methods] This paper proposes a low-cost large-model fine-tuning method based on multi-dimensional self-reflective learning enhancement (SRLearn). It automatically guides the model through multi-dimensional self-reflective learning, thereby improving the quality of the model's generated Chinese relation extractions. [Results] Compared with the LoRA+DPO fine-tuning method, SRLearn improves performance by 15 percentage points on the WikiRE1.0 dataset and 6.5 percentage points on the DuIE2.0 dataset, validating the effectiveness of the approach. [Limitations] SRLearn should be extended to cover more generation-quality issues in future work. [Conclusions] Large-model fine-tuning based on multi-dimensional self-reflective learning can greatly improve the generation quality of Chinese relation extraction.
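
    A hedged sketch of what a multi-dimensional self-reflective loop could look like: `llm` is a hypothetical text-generation callable and the critique dimensions are our assumptions, not SRLearn's exact design; the collected prompt-completion pairs would then drive low-cost fine-tuning.

```python
# Hypothetical outline only: `llm` stands in for any text-generation
# callable, and the critique dimensions are our guesses, not SRLearn's.
DIMENSIONS = ["completeness", "triple format", "faithfulness to the text"]

def self_reflective_sample(llm, text):
    draft = llm(f"Extract (head, relation, tail) triples from: {text}")
    for dim in DIMENSIONS:
        critique = llm(f"Critique the extraction below for {dim}.\n"
                       f"Text: {text}\nExtraction: {draft}")
        draft = llm(f"Revise the extraction using this critique.\n"
                    f"Critique: {critique}\nExtraction: {draft}")
    # the refined pair would feed low-cost (e.g. LoRA) fine-tuning
    return {"prompt": text, "completion": draft}
```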

  • Su Yanyuan, Dong Xiaoyu, Han Cuijuan, Zhang Yaming
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1157
    Online available: 2025-07-04

    [Objective] A federated learning framework embedded with dual-channel attention convolution is designed to solve the difficulty of cross-social-network feature extraction caused by privacy protection restrictions and to identify social robot accounts accurately. [Methods] First, a federated learning framework is adopted to integrate data across social networks. Second, a dual-channel attention convolution mechanism is introduced into the local model module to comprehensively mine data features. Third, with the help of a basic convolutional neural network and blockchain, the local model parameters are integrated in the federated aggregation module to obtain and securely store the optimal model parameters. [Results] Experiments on the TwiBot-20&Weibo-bot dataset show that the accuracy, precision, recall, and F1 score of the FL-DCACNN model reach 91.63%, 97.10%, 97.14%, and 96.88%, respectively, and the model shows strong generalization ability. [Limitations] The multi-modal feature extraction considers only structured, text, and image data; video and audio data are not involved. [Conclusions] The FL-DCACNN model can effectively solve the poor recognition of social robots caused by insufficient feature extraction and single data sources under data privacy constraints, further improving recognition performance and enabling accurate identification of social robots.
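
    The federated aggregation step can be sketched as a weighted average of per-client parameters (FedAvg-style); the dual-channel attention convolution and blockchain storage described above are omitted, and the uniform weighting is illustrative.

```python
# FedAvg-style aggregation sketch; the dual-channel attention model and
# blockchain parameter storage are omitted here.
import torch

def federated_average(state_dicts, weights=None):
    """state_dicts: per-client model.state_dict() snapshots."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    return {
        key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        for key in state_dicts[0]
    }

clients = [{"w": torch.tensor([1.0, 2.0])},
           {"w": torch.tensor([3.0, 4.0])}]
print(federated_average(clients))            # {'w': tensor([2., 3.])}
```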