Table of Contents

25 September 2025, Volume 9 Issue 9
    

  • Meng Xuyang, Wang Hao, Li Yuanqing, Li Yueyan, Deng Sanhong
    Data Analysis and Knowledge Discovery. 2025, 9(9): 1-12. https://doi.org/10.11925/infotech.2096-3467.2024.0914

    [Objective] This paper proposes a paradigm integrating large language models (LLMs) with knowledge graphs (KGs). We aim to address issues such as catastrophic forgetting, poor interpretability of generated content, and excessive demand for data and computational resources in vertical-domain question-answering (QA) systems built on fine-tuned LLMs. [Methods] First, we constructed a fine-grained KG for the traditional Chinese medical text “Treatise on Cold Damage”. Then, we employed a retrieval-augmented generation (RAG) approach to incorporate this KG into an LLM through prompt learning, building a QA system. [Results] In subjective evaluations, the proposed system achieved satisfaction rates 14.67 and 1.33 percentage points higher than the baseline models and the models fine-tuned on professional data, respectively. In the objective evaluation, our model's overall accuracy was 20.00 percentage points higher than the baseline models and 2.00 percentage points lower than the fine-tuned models. [Limitations] The application is limited to the traditional Chinese medicine domain covered by the Treatise on Cold Damage. There is also a lack of standardized benchmarks for evaluating the system’s professional capabilities. [Conclusions] The proposed approach enhances the interpretability of content generated by vertical-domain QA systems while substantially reducing the need for data and computational resources.
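
    As one concrete illustration of the [Methods] step, the KG-grounded prompt construction might look like the following minimal Python sketch; the triples, entity matcher, and prompt template are illustrative assumptions, not the authors' implementation:

        # Hypothetical fine-grained triples in the style of the paper's KG
        # (illustrative entries, not the actual graph).
        KG = [
            ("Guizhi Decoction", "treats", "taiyang wind-strike pattern"),
            ("Guizhi Decoction", "contains", "cinnamon twig"),
        ]

        def retrieve_triples(question, kg):
            """Keep triples whose head or tail entity appears in the question."""
            return [t for t in kg if t[0] in question or t[2] in question]

        def build_prompt(question, kg):
            """Inject retrieved triples as grounding facts for the LLM."""
            facts = "\n".join(f"- {h} {r} {t}" for h, r, t in retrieve_triples(question, kg))
            return (
                "Answer using only the knowledge-graph facts below.\n"
                f"Facts:\n{facts}\n"
                f"Question: {question}\nAnswer:"
            )

        print(build_prompt("What does Guizhi Decoction treat?", KG))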

  • Liu Yan, Zhan Yalan, Jiang Ziheng, Li Jinliang, Yan Zhijun, He Chaocheng
    Data Analysis and Knowledge Discovery. 2025, 9(9): 13-24. https://doi.org/10.11925/infotech.2096-3467.2024.0991

    [Objective] To address the insufficient attention in existing literature to the language style characteristics of rumors and of partially truthful, dual-faced health information, this paper proposes a multimodal online health rumor detection model incorporating language style features (MWDLS: a Multimodal Wide and Deep model for online health rumor detection considering Language Style). [Methods] The MWDLS model leverages Aristotle’s rhetorical theory to extract persuasive language style features (appeals to emotion, logic, and character) and employs a bidirectional cross-modal interaction fusion strategy with a gating mechanism to achieve joint representation learning and classification prediction over shallow language style features and deep semantic features. [Results] We conducted extensive experiments on a real-world dataset from a leading Chinese social media platform and found that MWDLS outperformed the baseline models, improving the F1 score of the target task by up to 11.98 percentage points. Notably, for the health rumor category and the dual-faced health information category, MWDLS increased the F1 scores by up to 16.63 and 11.71 percentage points, respectively. [Limitations] The current model does not examine other modalities, such as video and audio, nor does it incorporate large language models or knowledge-aware mechanisms to enhance early detection of health rumors. [Conclusions] By integrating language style features with multimodal deep semantic features, MWDLS effectively enhances the performance of online health rumor detection.
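
    A minimal PyTorch sketch of the gated bidirectional cross-modal fusion, as we read the abstract; the module structure, head count, and gating form are illustrative assumptions, not the released MWDLS code:

        import torch
        import torch.nn as nn

        class GatedCrossModalFusion(nn.Module):
            """Shallow style features and deep semantic features attend to each
            other, and a sigmoid gate decides how much of each direction to keep.
            Sequences are assumed aligned (equal length) for the gate."""

            def __init__(self, dim):
                super().__init__()
                self.style_to_sem = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
                self.sem_to_style = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
                self.gate = nn.Linear(2 * dim, dim)

            def forward(self, style, semantic):
                s2m, _ = self.style_to_sem(style, semantic, semantic)  # style attends to semantics
                m2s, _ = self.sem_to_style(semantic, style, style)     # semantics attend to style
                g = torch.sigmoid(self.gate(torch.cat([s2m, m2s], dim=-1)))
                return g * s2m + (1 - g) * m2s                         # gated joint representation

        fused = GatedCrossModalFusion(64)(torch.randn(2, 5, 64), torch.randn(2, 5, 64))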

  • Duan Yufeng, Xie Jiahong
    Data Analysis and Knowledge Discovery. 2025, 9(9): 25-36. https://doi.org/10.11925/infotech.2096-3467.2024.0965

    [Objective] This study investigates performance differences among existing large language models (LLMs) in extracting entities and relations from Chinese medical texts, and analyzes how the number of examples and the number of relation types influence extraction performance. [Methods] Using a prompt engineering approach, we called 9 mainstream LLMs via their APIs, modifying the prompt along two dimensions: the number of examples and the number of relation types. Experiments were conducted on the CMeIE-V2 dataset to compare extraction performance. [Results] (Ⅰ) GLM-4-0520 ranked first in comprehensive extraction ability, with F1 scores of 0.4422, 0.3869, and 0.3874 when extracting the three relation types “clinical manifestation”, “medication”, and “etiology”, respectively. (Ⅱ) When varying the number of examples m in the prompt, the F1 score initially increased with m, reaching a maximum of 0.4742 at m=8, but declined for m>8. (Ⅲ) As the number of relation types to be extracted, n, increased, the F1 score dropped significantly: at n=2, the F1 score decreased by 0.1182 compared to n=1, and at n=10, the F1 score was only 0.2949. [Limitations] Few public datasets are currently available, so the experimental results are based on a single dataset. Additionally, since medical-domain LLMs are difficult to access via API, all models used in this study are general-domain models. [Conclusions] Extraction performance varies greatly among different LLMs; a suitable number of examples can improve extraction performance, but more is not always better; and LLMs are not good at extracting multiple relation types at the same time.
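
    The prompt-assembly step that varies the number of examples and relation types might look like the following sketch; the instruction wording and the example sentences are illustrative guesses, not the paper's exact prompt:

        def build_extraction_prompt(text, examples, relation_types):
            """Assemble a few-shot prompt for (head, relation, tail) extraction.
            `examples` is a list of (sentence, triples) pairs; varying its length
            corresponds to varying m, and len(relation_types) corresponds to n."""
            header = (
                "Extract (head, relation, tail) triples from the sentence. "
                f"Only use these relation types: {', '.join(relation_types)}.\n"
            )
            shots = "".join(f"Sentence: {s}\nTriples: {t}\n" for s, t in examples)
            return header + shots + f"Sentence: {text}\nTriples:"

        prompt = build_extraction_prompt(
            "Chest pain is a clinical manifestation of angina.",        # illustrative input
            [("Fever is a clinical manifestation of influenza.",
              [("influenza", "clinical manifestation", "fever")])],     # one shot, i.e. m=1
            ["clinical manifestation"],                                 # n=1
        )
        print(prompt)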

  • Shen Si, Feng Shuyang, Wu Na, Zhao Zhixiao
    Data Analysis and Knowledge Discovery. 2025, 9(9): 37-48. https://doi.org/10.11925/infotech.2096-3467.2024.0670

    [Objective] This paper aims to enhance the utilization efficiency of governmental information resources and advance the intelligent transformation of public services by addressing the inherent knowledge limitations of general LLMs when processing policy texts. We investigate the effectiveness of a RAG framework for constructing a more precise and reliable intelligent policy Q&A system. [Methods] This paper proposes a retrieval-augmented generation framework based on the Chinese policy large language model ChpoGPT. Specifically, the framework retrieves policy documents semantically similar to the user query from a knowledge base and combines the retrieved results with ChpoGPT to enhance the model’s capabilities on downstream tasks. [Results] Experimental results demonstrate that our framework significantly outperforms existing models on key metrics. The ChpoGPT-based framework achieved a factuality score of nearly 90%. In terms of answer relevance, it scored 80.2%, outperforming the Gemini-1.0-pro model by 2.1%. Furthermore, it attained an answer semantic similarity score of 56.4%, surpassing the ERNIE 4.0 and Gemini-1.0-pro models by 4.1% and 2.8%, respectively. [Limitations] The language model still exhibits some uncontrollable behavior in its answer output. [Conclusions] Retrieval-augmented generation over policy texts based on LLMs provides a useful reference for the intelligent transformation of government services, but it still needs further improvement and optimization.
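
    The retrieve-then-generate wiring can be sketched with cosine-similarity retrieval over precomputed embeddings; the function names (e.g. top_k_policies) and the toy documents are placeholders, and in the real system the assembled prompt would be passed to ChpoGPT:

        import numpy as np

        def top_k_policies(query_vec, doc_vecs, k=3):
            """Rank policy documents by cosine similarity to the query embedding."""
            q = query_vec / np.linalg.norm(query_vec)
            d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
            return np.argsort(-(d @ q))[:k]

        def build_rag_prompt(query, query_vec, docs, doc_vecs, k=2):
            """Assemble the retrieved context for the downstream generator."""
            context = "\n".join(docs[i] for i in top_k_policies(query_vec, doc_vecs, k))
            return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

        rng = np.random.default_rng(0)
        docs = ["Policy document 1 ...", "Policy document 2 ...", "Policy document 3 ..."]
        vecs = rng.normal(size=(3, 8))               # stand-ins for sentence-encoder embeddings
        print(build_rag_prompt("Who is eligible?", rng.normal(size=8), docs, vecs))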

  • Zhou Jie, Wang Dongyi, Dai Qinquan, Xia Sudi
    Data Analysis and Knowledge Discovery. 2025, 9(9): 49-59. https://doi.org/10.11925/infotech.2096-3467.2024.0939

    [Objective] This study explores universal and effective prompt strategies for generative AI to enhance user interaction skills and optimize user experience. [Methods] We adopted the Q method, inviting participants to rank the effectiveness of various prompt strategies based on their cross-task and cross-model experiences in general scenarios. We then identified universally effective prompt strategy types. [Results] The study found that the most effective prompt strategies include clarifying the question, defining the goal, and providing background information. Universally effective prompt strategies can be categorized into three types: (Ⅰ) clear requirements and precise guidance, (Ⅱ) explicit explanation and logical sequencing, and (Ⅲ) task decomposition and diversified expression. [Limitations] Our data were collected only from Chinese users. This study focused on overall contextual analysis without examining variations in prompt strategies across specific scenarios, task types, and model conditions. [Conclusions] From a user-centered perspective, this study employs the Q method to identify effective prompt strategies, addressing the lack of systematic and quantitative approaches in existing prompt engineering. The identified strategies provide a structured framework for prompt design theory and offer strategic insights for enhancing human-AI collaboration and AI interaction literacy.
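
    For readers unfamiliar with the Q method, its core quantitative step can be sketched in a few lines of Python: participants' rankings (Q-sorts) are correlated person by person and factor-analysed, so participants loading on the same factor share a viewpoint about strategy effectiveness. The tiny ranking matrix below is purely illustrative, not the study's data:

        import numpy as np

        # Rows: 5 participants; columns: 6 prompt strategies ranked on a
        # -2..+2 forced-distribution scale (made-up values for shape only).
        qsorts = np.array([
            [ 2,  1,  0, -1, -2,  0],
            [ 2,  0,  1, -2, -1,  0],
            [-1,  2,  0,  1, -2,  0],
            [ 2,  1, -1,  0, -2,  0],
            [-2,  1,  2,  0, -1,  0],
        ])
        corr = np.corrcoef(qsorts)                 # person-by-person correlations
        eigvals, eigvecs = np.linalg.eigh(corr)    # factor-analyse the correlation matrix
        top = eigvals.argsort()[::-1][:2]          # keep the two strongest factors
        loadings = eigvecs[:, top] * np.sqrt(eigvals[top])
        print(loadings)                            # similar rows = shared viewpoint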

  • Zhang Le, Xu Yangke, Chen Yansong, Zhang Leihan
    Data Analysis and Knowledge Discovery. 2025, 9(9): 60-73. https://doi.org/10.11925/infotech.2096-3467.2024.0765

    [Objective] When generating summaries from textual and image information, directly fusing multimodal content that is not entirely related to the reference summary can introduce noise. To address this issue, this paper proposes a multimodal sentence summarization method based on theme enhancement with a large language model. [Methods] We fine-tuned a large language model to produce high-quality theme and keyword information. We then used an attention mechanism to fuse the theme with the image information, reducing noise in the multimodal features, and fused the original text with the keywords to obtain a multimodal semantic supplementary feature with enhanced theme information. Finally, these two types of features were combined to generate the multimodal summary. [Results] On the public MMSS dataset, compared to the best-performing baseline, the Vision-GPLM model, our method improved ROUGE-1, ROUGE-2, and ROUGE-L scores by 2.79, 2.20, and 2.28 percentage points, respectively. [Limitations] The prompt templates used to fine-tune the large language model are relatively simple, and we did not attempt fine-tuning with larger-parameter versions of the model. The fine-tuning quality of the large language model also affects overall model performance. [Conclusions] By fine-tuning a large language model, this work reduces noise in multimodal features and fuses the different modalities while strengthening the model’s grasp of the main theme, thereby improving summary quality.
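
    The theme-guided fusion step can be sketched in PyTorch as cross-attention in which theme tokens query the image features, so that only theme-relevant regions contribute to the fused representation; module names and dimensions are our illustrative assumptions:

        import torch
        import torch.nn as nn

        class ThemeImageFusion(nn.Module):
            """Theme tokens act as attention queries over image patch features,
            suppressing visually noisy, theme-irrelevant regions."""

            def __init__(self, dim):
                super().__init__()
                self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

            def forward(self, theme, image_patches):
                fused, _ = self.attn(theme, image_patches, image_patches)
                return fused  # one theme-conditioned vector per theme token

        # 4 theme tokens attend over 49 image patches (e.g. a 7x7 feature map).
        out = ThemeImageFusion(64)(torch.randn(2, 4, 64), torch.randn(2, 49, 64))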

  • Ma Jie, Sun Wenjing, Hao Zhiyuan
    Data Analysis and Knowledge Discovery. 2025, 9(9): 74-87. https://doi.org/10.11925/infotech.2096-3467.2024.0938

    [Objective] This study constructs a high-quality and interpretable disease prediction model that identifies the key causes influencing disease formation, analyzes how these causes act on disease development, and provides strong support for auxiliary diagnosis and precision medicine. [Methods] First, we used a random forest model to select the most representative feature subset from the multidimensional characteristics of disease data. Then, we designed an enhanced sparrow search algorithm to adaptively obtain the kernel parameters and penalty coefficient of a support vector machine (SVM). The optimized SVM model was applied to predict and analyze the data samples, and its performance was compared with eight baseline models. Finally, we employed the SHAP framework to quantitatively analyze the relationships between disease causes and disease formation. [Results] We conducted an empirical experiment using obesity as the research subject. The proposed model achieved prediction accuracy, specificity, and Matthews correlation coefficient values of 85.5%, 83.6%, and 61.0%, respectively, all higher than those of the eight baseline models, demonstrating the model’s effectiveness. Furthermore, family history, frequency of vegetable intake, number of daily main meals, height, gender, transportation mode, and high-calorie food intake were identified as key factors influencing obesity formation. [Limitations] The empirical study focused on obesity only and therefore cannot fully verify the model’s generalizability. Interactions between the feature variables were not analyzed. [Conclusions] The proposed model not only achieves high prediction accuracy but also quantifies the magnitude and direction of different causes’ effects on disease formation, and its conclusions can provide decision support for medical institutions.
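
    A compressed scikit-learn sketch of the pipeline on synthetic data; the enhanced sparrow search algorithm is replaced here by a fixed (C, gamma) pair for brevity, and the SHAP analysis step is omitted:

        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_selection import SelectFromModel
        from sklearn.model_selection import train_test_split
        from sklearn.svm import SVC

        # Synthetic stand-in for the obesity data (illustrative only).
        X, y = make_classification(n_samples=500, n_features=20, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        # Step 1: random-forest-based selection of the most informative features.
        selector = SelectFromModel(RandomForestClassifier(random_state=0)).fit(X_tr, y_tr)
        X_tr_s, X_te_s = selector.transform(X_tr), selector.transform(X_te)

        # Step 2: SVM on the selected subset; in the paper, C and gamma are
        # tuned adaptively by the enhanced sparrow search algorithm.
        svm = SVC(C=10.0, gamma=0.1).fit(X_tr_s, y_tr)
        print("accuracy:", svm.score(X_te_s, y_te))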

  • Zhang Li, Hu Jingxuan, Liu Xiwen, Lu Wei
    Data Analysis and Knowledge Discovery. 2025, 9(9): 88-101. https://doi.org/10.11925/infotech.2096-3467.2024.0924

    [Objective] Existing literature on author name disambiguation pays little attention to Chinese-English collaborative author disambiguation, one key reason being the absence of specialized, reliable datasets. This study addresses the issue by proposing a method to automatically construct an author disambiguation dataset from open internet resources. [Methods] Using this method, we built a large labeled dataset, CHEN-AND, for research on Chinese-English collaborative author disambiguation. Based on this dataset, we developed and evaluated several baseline disambiguation methods. [Results] The evaluation shows that the better-performing disambiguation methods achieved P-F1 and B3-F1 scores of 79.86% and 84.25%, respectively, which are significantly lower than the accuracy of mainstream English author disambiguation methods. [Limitations] CHEN-AND focuses on researchers in Science, Technology, Engineering, and Mathematics (STEM) fields, so its disciplinary distribution is biased relative to the actual distribution of authors. This is mainly because one of the databases used for dataset construction, CSCD, is a literature database oriented toward STEM fields. [Conclusions] This study publicly releases the CHEN-AND dataset and the evaluation results of the disambiguation methods to facilitate future research on more efficient cross-lingual author disambiguation methods and the development of high-quality academic information exchange platforms.
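
    For reference, the B3-F1 metric reported above can be computed with a short, self-contained function; this is a minimal reference implementation of the standard metric, not the authors' evaluation script:

        def bcubed_f1(pred, gold):
            """B-cubed F1 for name disambiguation: `pred` and `gold` map each
            paper id to a cluster label; per-item precision/recall measure the
            overlap of the item's predicted and gold clusters."""
            p_sum = r_sum = 0.0
            for x in gold:
                same_pred = [y for y in gold if pred[y] == pred[x]]
                same_gold = [y for y in gold if gold[y] == gold[x]]
                correct = len(set(same_pred) & set(same_gold))
                p_sum += correct / len(same_pred)
                r_sum += correct / len(same_gold)
            p, r = p_sum / len(gold), r_sum / len(gold)
            return 2 * p * r / (p + r)

        # Identical partitions up to label renaming score a perfect 1.0.
        print(bcubed_f1({1: "a", 2: "a", 3: "b"}, {1: "x", 2: "x", 3: "y"}))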

  • Dan Zhiping, Li Lin, Yu Xiaosheng, Lu Yujie, Li Bitao
    Data Analysis and Knowledge Discovery. 2025, 9(9): 102-113. https://doi.org/10.11925/infotech.2096-3467.2024.0957

    [Objective] Because Chinese hate speech that contains no overtly malicious words is difficult to identify effectively, we propose a Chinese hate speech detection method integrating multi-dimensional sentiment features (RMSF). [Methods] First, the RoBERTa model extracts both character- and sentence-level features from the input text, while sentiment dictionaries provide multi-dimensional sentiment attributes. The character and sentiment features are then concatenated and fed into a BiLSTM network to capture deeper contextual semantic information. Subsequently, the BiLSTM output is concatenated with the sentence-level features from RoBERTa, processed through a multilayer perceptron, and classified with the softmax function. To address class imbalance, the focal loss function is applied during model optimization, improving the accurate discrimination of hate speech. [Results] On the TOXICN dataset, the RMSF method achieves precision, recall, and F1 scores of 82.63%, 82.41%, and 82.45%, respectively. On the COLDataset, it achieves precision, recall, and F1 scores of 82.94%, 82.96%, and 82.85%, respectively. Compared to existing approaches, RMSF yields F1 score improvements of 1.85% and 1.09% on the respective datasets. [Limitations] The method relies on tools such as sentiment lexicons, so the extraction of sentiment features is constrained by the lexicons’ coverage and semantic granularity. [Conclusions] The experimental findings indicate that incorporating multi-dimensional sentiment features into Chinese hate speech detection models can significantly enhance detection performance.
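
    The focal loss component is standard and can be written in a few lines of PyTorch; the gamma and alpha values below are the common defaults, not necessarily those used in RMSF:

        import torch
        import torch.nn.functional as F

        def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
            """Focal loss for class imbalance: well-classified examples are
            down-weighted by (1 - p_t)^gamma, focusing training on hard cases."""
            ce = F.cross_entropy(logits, targets, reduction="none")
            p_t = torch.exp(-ce)  # model probability of the true class
            return (alpha * (1 - p_t) ** gamma * ce).mean()

        loss = focal_loss(torch.randn(8, 2), torch.randint(0, 2, (8,)))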

  • Wang Yufei, Zhang Zhixiong, Zhang Qin, Zhang Mengting
    Data Analysis and Knowledge Discovery. 2025, 9(9): 114-125. https://doi.org/10.11925/infotech.2096-3467.2024.0570

    [Objective] To meet the demand for automatic acquisition of innovative content in scientific papers, this paper proposes an innovation sentence identification method based on enriched-segment detection. [Methods] We identified innovation sentences in two stages. First, we constructed a keyword list for sections enriched with innovation sentences and employed a sliding-window scoring approach to locate these segments, thereby narrowing the scope of sentence identification. Then, we designed a Context-BERT model that integrates contextual information to identify innovation sentences automatically. [Results] The proposed approach achieved an F1 score of 87.27% on the test dataset, demonstrating effective and accurate identification of innovation sentences in scientific papers. [Limitations] The dataset used in this study is relatively limited, focusing on the field of Natural Language Processing. [Conclusions] This paper builds an automatic innovation sentence recognition engine, preliminarily realizing the practical application of the proposed approach.
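
    The sliding-window scoring stage might look like the following minimal sketch; the window size and the tiny keyword list are illustrative placeholders for the paper's curated list:

        def score_windows(sentences, keywords, window=3):
            """Slide a fixed-size window over a paper's sentences and score each
            window by keyword hits, returning the highest-scoring segment."""
            best, best_score = 0, -1
            for i in range(len(sentences) - window + 1):
                text = " ".join(sentences[i:i + window]).lower()
                score = sum(text.count(k) for k in keywords)
                if score > best_score:
                    best, best_score = i, score
            return sentences[best:best + window]

        segment = score_windows(
            ["We study parsing.", "We propose a novel model.",
             "Our main contribution is ...", "Results follow."],
            ["propose", "novel", "contribution"],
        )
        print(segment)  # the innovation-enriched segment, passed on to Context-BERT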

  • Wen Xiaobo, Hua Bolin
    Data Analysis and Knowledge Discovery. 2025, 9(9): 126-135. https://doi.org/10.11925/infotech.2096-3467.2024.0803

    [Objective] This paper aims to identify the application domains addressed by artificial intelligence patents. [Methods] Within a metric learning framework, a BERT-based dual encoder separately encodes the patent text and its corresponding application-domain annotation, yielding an embedding that characterizes the application domain of an artificial intelligence patent and supports the recognition task. [Results] The method achieved an accuracy of 0.947 on the multi-class test of artificial intelligence patent application domains, and produced a multi-level clustering system with a silhouette coefficient of 0.36 on the application-domain identification task. [Limitations] Higher-quality annotated data is not easily accessible. Meanwhile, the metric learning framework and encoder employed are relatively simple, leaving substantial room for optimization. [Conclusions] Metric learning can be applied to the targeted identification of the application domains of artificial intelligence patents, informing the optimization of unsupervised topic recognition.
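
    A minimal sketch of the dual-encoder metric learning step, with linear layers standing in for the two BERT encoders and an in-batch contrastive loss as one plausible training objective (the abstract does not specify the exact loss):

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class DualEncoder(nn.Module):
            """Two separate encoders map patent text and domain annotation into
            one metric space; training pulls matched pairs together."""

            def __init__(self, dim=768, out=128):
                super().__init__()
                self.patent_enc = nn.Linear(dim, out)  # placeholder for BERT #1
                self.domain_enc = nn.Linear(dim, out)  # placeholder for BERT #2

            def forward(self, patent_vec, domain_vec):
                p = F.normalize(self.patent_enc(patent_vec), dim=-1)
                d = F.normalize(self.domain_enc(domain_vec), dim=-1)
                logits = p @ d.T / 0.05                # temperature-scaled similarities
                labels = torch.arange(len(p))          # i-th patent matches i-th domain
                return F.cross_entropy(logits, labels)

        loss = DualEncoder()(torch.randn(16, 768), torch.randn(16, 768))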

  • Zhang Xinsheng, Li Jiang, Wang Minghu
    Data Analysis and Knowledge Discovery. 2025, 9(9): 136-151. https://doi.org/10.11925/infotech.2096-3467.2024.0726

    [Objective] This paper proposes a new ER-BFAS model to address issues such as blurred entity boundary attributes and sparse corpus data in entity extraction. The model combines boundary features with an attention sequence structure to identify and predict traditional craft entity labels. [Methods] The model incorporates entity boundary attribute features into a joint embedding layer of text labels and uses an attention mechanism to generate feature vectors. It then employs a bidirectional long short-term memory (BiLSTM) network to capture the association information among craft-related entity labels, enhancing its ability to recognize different labels. Finally, it uses a conditional random field (CRF) to predict the craft entity labels, selecting the label sequence with the highest conditional probability as the prediction result. [Results] Compared with other sequence labeling models, the ER-BFAS model achieved an F1 score of 0.85 on the traditional craft dataset, with label precision rates exceeding 0.90. Precision reached 0.75 on the DGRE dataset, further validating the model’s generalization ability. [Limitations] The experimental data types are limited, and complex entity relationships are not addressed. [Conclusions] The ER-BFAS model effectively identifies entity boundary information in both traditional craft and general datasets, significantly enhancing entity recognition capabilities in the field of traditional crafts.
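
    A skeleton of the tagging stack as we read the abstract, in PyTorch with the third-party pytorch-crf package; the attention layer over label embeddings is omitted for brevity, and all sizes are illustrative:

        import torch
        import torch.nn as nn
        from torchcrf import CRF  # pip install pytorch-crf

        class BoundaryBiLSTMCRF(nn.Module):
            """Token embeddings are concatenated with boundary-attribute features,
            encoded by a BiLSTM, and decoded by a CRF."""

            def __init__(self, vocab, n_tags, emb=100, bdry=16, hid=128):
                super().__init__()
                self.tok = nn.Embedding(vocab, emb)
                self.bdry = nn.Embedding(4, bdry)  # e.g. B/I/E/O boundary attribute
                self.lstm = nn.LSTM(emb + bdry, hid, bidirectional=True, batch_first=True)
                self.proj = nn.Linear(2 * hid, n_tags)
                self.crf = CRF(n_tags, batch_first=True)

            def forward(self, tokens, boundaries, tags):
                x = torch.cat([self.tok(tokens), self.bdry(boundaries)], dim=-1)
                h, _ = self.lstm(x)
                return -self.crf(self.proj(h), tags)  # negative log-likelihood

        model = BoundaryBiLSTMCRF(vocab=1000, n_tags=9)
        nll = model(torch.randint(0, 1000, (2, 10)),
                    torch.randint(0, 4, (2, 10)),
                    torch.randint(0, 9, (2, 10)))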

  • Cheng Jialin, Yuan Deyu, Chen Ziyan
    Data Analysis and Knowledge Discovery. 2025, 9(9): 152-161. https://doi.org/10.11925/infotech.2096-3467.2024.0440

    [Objective] To address the challenge of identifying the same individual across different social networks, this paper proposes a cross-network user identity linkage algorithm named eDual-ViewUIL, based on dual mapping learning. [Methods] The algorithm integrates graph embedding and deep learning. First, it expands potential user relationships through network expansion. Then, it uses the DeepWalk algorithm to learn low-dimensional representations of user nodes. Finally, it introduces dual mapping learning with a rank-sensitive likelihood loss function to associate user identities across networks. [Results] Experimental validation on three datasets showed that the precision of the proposed algorithm improved by more than six percentage points over the three baseline algorithms. [Limitations] Due to the massive number of social network users, the overall computational load is relatively high, and the efficiency of the algorithm needs improvement. [Conclusions] The eDual-ViewUIL algorithm demonstrates strong generalization ability in scenarios with limited labeled data and imbalanced positive/negative samples. It holds significant practical and application value for solving the cross-network identity linkage problem.
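
    The DeepWalk step can be sketched with networkx and gensim as below; the network expansion, dual mapping learning, and rank-sensitive loss stages are omitted, and all hyperparameters are illustrative:

        import random
        import networkx as nx
        from gensim.models import Word2Vec

        def deepwalk_embeddings(g, walks_per_node=10, walk_len=20, dim=64):
            """Truncated random walks over the social graph are treated as
            sentences and fed to skip-gram Word2Vec (the DeepWalk recipe)."""
            walks = []
            for _ in range(walks_per_node):
                for node in g.nodes:
                    walk, cur = [str(node)], node
                    for _ in range(walk_len - 1):
                        nbrs = list(g.neighbors(cur))
                        if not nbrs:
                            break
                        cur = random.choice(nbrs)
                        walk.append(str(cur))
                    walks.append(walk)
            return Word2Vec(walks, vector_size=dim, window=5, min_count=0, sg=1)

        model = deepwalk_embeddings(nx.karate_club_graph())  # toy stand-in graph
        vec = model.wv["0"]  # low-dimensional representation of user node 0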

  • Zhong Ming, Qian Qing, Zhou Wei, Wu Sizhu
    Data Analysis and Knowledge Discovery. 2025, 9(9): 162-172. https://doi.org/10.11925/infotech.2096-3467.2024.0461

    [Objective] This study focuses on the construction of the data enclave of the National Population Health Data Center (NPHDC), aiming to provide a more efficient, secure, and flexible environment for data processing and analysis. It addresses challenges in centralized data storage, data security risks, and limited computing resources, as well as users’ urgent needs for data analysis and utilization. [Methods] We summarized the types, characteristics, implementation mechanisms, and scenario applicability of data enclaves. Combining these with the data application characteristics of the NPHDC, we constructed its big data analytics platform using a virtual data enclave approach that integrates security enhancement, micro-segmentation, and artificial intelligence technologies. [Results] The big data analytics platform supports services such as data review, data processing, data analysis and mining, and peer review of publication-associated data for the NPHDC. It has completed review tasks covering more than 32,000 datasets across more than 2,800 projects, along with more than 10,000 data analysis tasks and more than 5,000 data processing tasks, with zero data leakage incidents and a resource utilization rate of 80%. [Limitations] The platform cannot yet support data sharing with decentralized storage across institutions. Further research should explore data enclaves based on privacy-preserving technologies such as secure multi-party computation and federated learning, in line with the development of the NPHDC. [Conclusions] The platform effectively addresses the needs for secure sharing and collaborative analysis of population health data in a centralized manner, and is of great significance for the security protection and shared utilization of national population health scientific data.