Table of Contents

25 April 2025, Volume 9 Issue 4
    

  • Chen Jing, Cao Zhixun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 1-13. https://doi.org/10.11925/infotech.2096-3467.2024.0446

    [Objective] This paper aims to analyse the differences in combating hallucinations in large language models between unstructured knowledge, exemplified by knowledge base resources, and structured knowledge, exemplified by knowledge graph resources, using the Traditional Chinese Medicine (TCM) Q&A domain as a case study. Based on these findings, strategies for improving the ability of large language models to combat hallucinations in vertical domains are discussed. [Methods] The study designs experiments using external knowledge combined with prompt engineering techniques to analyse the differences in prompt effects between knowledge base resources and knowledge graph resources in the TCM Q&A domain. It also investigates the superiority of dynamic triplet strategies and integrated fine-tuning strategies in optimising large language models against hallucinations. [Results] Experimental results show that compared to prompts from unstructured knowledge in the knowledge base, prompts from structured knowledge in the knowledge graph perform better in terms of precision, recall and F1 score, improving by 1.9%, 2.42% and 2.2% respectively to reach 71.44%, 60.76% and 65.31%. Further analysis of the optimisation strategies shows that the combination of the dynamic triplet strategy and fine-tuning had the best effect against hallucinations, achieving precision, recall and F1 scores of 72.47%, 65.87% and 68.62% respectively. [Limitations] This study is limited to a single field, as it was only tested in the field of Traditional Chinese Medicine Q&A, and its generalisability needs to be validated in a wider range of scientific fields. [Conclusions] This study has demonstrated that in the field of Traditional Chinese Medicine, structured knowledge from knowledge graphs outperforms traditional unstructured knowledge in reducing hallucinations and improving the accuracy of model responses. 
It demonstrates the critical role of structured knowledge in enhancing model comprehension skills. The integration of fine-tuning strategies with knowledge resources provides an effective way to improve performance in large language models. This paper provides a theoretical rationale and methodological support for integrating external knowledge into large language models to improve knowledge performance.

  • Zhou Linxing, Wang Shuai
    Data Analysis and Knowledge Discovery. 2025, 9(4): 14-31. https://doi.org/10.11925/infotech.2096-3467.2024.0210

    [Objective] In response to the frequent data poisoning attacks on LLM, which result in uncontrolled and covert output of blackened information by the distributor, separate poisoning and blackening detection schemes and models are designed and integrated to form an intelligence perception method operation mechanism. [Methods] Firstly, XMC-GAN+YOLO and New Phillips-Huber are used as information generation models and data poisoning activity recognition support methods, respectively. Secondly, RNA-Seq, KdV-IE, Percolation and AREMBMTD are used as core representation methods for blackout separation, energy, permeation and destruction degree, and implementation models are obtained. Finally, based on the reconstruction mechanism and linkage problem solving framework, the above solution models are linked into a complete method operation mechanism to effectively output intelligence perception results. [Results] The empirical analysis results show that the overall performance improvement rate of the method operation mechanism compared with similar suboptimal methods is 18.81%, with a leading rate of 7.48%. Under this condition, it can output three types of information generation modes (manual, fusion and AI), three types of poisoning methods (single, composite and mixed) and four types of intelligence perception results (separation, energy, penetration and destruction). [Limitations] The descending order of intelligence perception efficiency for different redacted information modalities is text, image, and audio/video, which is limited by the granularity and channels of content parsing, and does not pay special attention to low efficiency outside of text. [Conclusions] This article fully integrates the principles of data poisoning with the LLM blackout phenomenon, and provides an intelligence perception method that avoids the separation of deep principles and surface phenomena. 
It can effectively output the results of poisoning activity detection and blackout perception, and is superior to the comparison model in terms of improvement and leading rate indicators.

  • Sun Guangyao, Zhao Zhixiao, Shen Si, Wang Dongbo
    Data Analysis and Knowledge Discovery. 2025, 9(4): 32-45. https://doi.org/10.11925/infotech.2096-3467.2024.0156

    [Objective] This paper focuses on the humanities and social sciences to optimise the practical value of large language models in machine translation. [Methods] We conduct comparative experiments using Baichuan2-13B-Base and Qwen-14B as baseline models to evaluate translation performance improvements from terminology, feature extraction, data augmentation and instruction fine-tuning methods. [Results] Experimental results show that LLMs have significant advantages over traditional neural machine translation (NMT) systems. Specifically, instruction-tuned LLMs achieve: (1) a 60% reduction in average inference time compared to baseline models; (2) translation metric improvements ranging from +0.82 to +8.54 percentage points; and (3) task-specific improvements, with English-to-Chinese translation gains being 2.99x and 4.58x greater than Chinese-to-English improvements for the fine-tuning models, respectively. [Limitations] While this research addresses terminological concepts embedded in disciplinary and socio-cultural contexts, it does not undertake a granular analysis of diverse data sources within the humanities and social sciences. [Conclusions] This work provides critical insights for optimising LLM applications in specialised domains and contributes to the advancement of cross-linguistic and cross-cultural communication.

  • Tian Xuecan, Wang Li
    Data Analysis and Knowledge Discovery. 2025, 9(4): 46-56. https://doi.org/10.11925/infotech.2096-3467.2024.0379

    [Objective] In response to the limited automation and strong empirical dependence in the detection of weak signals at the forefront of technology, a detection method is proposed that integrates two signal processing strategies: signal amplification and denoising. [Methods] By simulating the signal processing flow, the signal is first pre-processed using the RE. The weak signal is amplified using the N-gram model and the TF-IDF algorithm. The iterative shrinkage-thresholding algorithm (ISTA) is used to measure the future growth trend of the weak signal and further filter out noise. Finally, the growth signal is integrated using the K-Means++ algorithm enhanced by the Word2Vec model. [Results] Signal amplification and signal filtering, as two core processing strategies, effectively prevent noise from drowning out weak signals, thus improving the accuracy and focus of weak signal detection at the cutting edge of technology. [Limitations] The current evaluation of detection effects still relies on professional knowledge, and more objective evaluation methods need to be explored. Weak signal detection is performed on a single data source, and future work needs to expand the data sources. [Conclusions] The automated detection framework proposed in this paper reduces the reliance on human experience to some extent and achieves effective and accurate detection results.
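    As a rough illustration of the amplification step described in this abstract, the sketch below scores n-grams with a plain TF-IDF weight so that terms concentrated in few documents stand out. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `ngrams` and `tfidf_ngrams`, the unsmoothed `log(N / df)` formula, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_ngrams(docs, n=2):
    """Score each n-gram of each document with an unsmoothed TF-IDF weight.

    tf  = occurrences of the n-gram in the document / n-grams in the document
    idf = log(N / df), where df counts the documents that contain the n-gram
    """
    grams_per_doc = [ngrams(doc, n) for doc in docs]

    # Document frequency: count each n-gram once per document.
    df = Counter()
    for grams in grams_per_doc:
        df.update(set(grams))

    n_docs = len(docs)
    scores = []
    for grams in grams_per_doc:
        counts = Counter(grams)
        total = len(grams) or 1  # guard against documents shorter than n
        scores.append(
            {g: (c / total) * math.log(n_docs / df[g]) for g, c in counts.items()}
        )
    return scores


# Hypothetical toy corpus: three tokenised documents.
docs = [
    "solid state battery anode".split(),
    "solid state battery cathode".split(),
    "quantum dot battery anode".split(),
]
scores = tfidf_ngrams(docs, n=2)
```

    In this toy corpus the bigram `("battery", "cathode")` occurs in only one document, so it receives a higher weight than `("solid", "state")`, which occurs in two. That is the sense in which rare phrases are amplified relative to common ones.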

  • Zuo Min, Qiu Jiangnan
    Data Analysis and Knowledge Discovery. 2025, 9(4): 57-67. https://doi.org/10.11925/infotech.2096-3467.2024.0255

    [Objective] This paper aims to identify how group characteristics that are closely associated with crowd wisdom influence knowledge innovation performance within online knowledge communities (OKC), and the pathways through which they do so. [Methods] Using the English version of Wikipedia as the research object, 180 Wikipedia articles were randomly selected. PLS-SEM was used to test the research model. [Results] Group diversity, independence, and decentralization have different direct effects on knowledge innovation performance. Indirect collaboration has a significant mediating effect on the novelty and utility dimensions of knowledge innovation performance. [Limitations] Future research could supplement the current findings with experimental social computing methods and extend the findings to other OKC scenarios where crowd wisdom emerges. [Conclusions] This paper enriches the relevant research on OKC knowledge innovation from the perspective of group characteristics, thus expanding the application scope of crowd wisdom theory. The research findings provide a valuable reference for OKC management and platform design, facilitating more effective aggregation of crowd wisdom to promote knowledge innovation.

  • Qian Qianwen, Hu Yang, Xia Sudi, Wang Dongyi, Wang Fan
    Data Analysis and Knowledge Discovery. 2025, 9(4): 68-84. https://doi.org/10.11925/infotech.2096-3467.2024.0430

    [Objective] This paper aims to explore the mechanisms of health information acquisition behaviour among medical aesthetic consumers, thereby providing theoretical support for informed decision making and orderly development within the medical aesthetic industry. [Methods] Grounded theory was used to code and analyse data from 29 respondents and 6 online sources, leading to the construction of a theoretical model of health information seeking behaviour among medical aesthetic consumers. This model was quantitatively validated using structural equation modeling (SEM). [Results] The health information seeking behaviour of medical aesthetic consumers generally follows a mechanism pathway of “objective factors-subjective factors-demand factors-behaviour”. The constructed model passed quantitative validation and demonstrated high scientific validity (all hypotheses except H2a were supported). [Limitations] Interview data may be subject to recall bias and expression errors; future research could include field experiments for additional validation. Health information seeking behaviour may vary across stages among medical aesthetic consumers, so future research could obtain panel data for dynamic analysis. [Conclusions] This model may provide a valuable reference for the information decision-making of medical aesthetic consumers, as well as for content creation and information consulting services within the medical aesthetic industry.

  • Guo Xin, Nie Lei, Wang Jimin, Sun Jing
    Data Analysis and Knowledge Discovery. 2025, 9(4): 85-98. https://doi.org/10.11925/infotech.2096-3467.2024.0226

    [Objective] This paper aims to develop an automated quality assessment framework suitable for China’s government open data, thereby facilitating the realization of the value of open government data, such as improved data retrieval. [Methods] Based on an in-depth investigation of China’s government open data platforms and integrating insights from existing data quality assessment research, this study constructs an assessment framework consisting of four primary indicators (content quality, utility quality, metadata quality, and openness quality) and sixteen secondary indicators. An automated measurement method is designed according to the available data fields. The validity of the proposed method is then tested using real data sets from 14 provincial administrative regions. [Results] The average correlation coefficient between the automated scoring results of this study and the manual scoring results is 0.537, while the correlation coefficient with existing research results is 0.736, indicating a relatively strong correlation and provisionally validating the effectiveness of the proposed method. [Limitations] Due to the emphasis on operational feasibility, both the breadth of indicator selection and the depth of measurement have certain limitations. [Conclusions] The proposed method enables dynamic, low-resource dataset-level quality assessment based on user-defined evaluation cycles and indicator weights, thus supporting the establishment of a nationwide integrated government open data retrieval platform.
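    The scoring and validation ideas in this abstract can be sketched as a weighted combination of the four primary indicators, followed by a Pearson correlation between automated and manual scores. This is only an illustrative sketch: the abstract does not say how indicators are combined or which correlation coefficient is used, so the weighted average, the use of Pearson's r, the names `quality_score` and `pearson`, and all numbers below are assumptions.

```python
import math

# The four primary indicators named in the abstract.
PRIMARY = ("content", "utility", "metadata", "openness")


def quality_score(indicators, weights):
    """Weighted average of primary indicator scores (assumed to lie in [0, 1])."""
    total_weight = sum(weights[name] for name in PRIMARY)
    return sum(indicators[name] * weights[name] for name in PRIMARY) / total_weight


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical user-defined weights and one hypothetical dataset.
weights = {"content": 1.0, "utility": 1.0, "metadata": 1.0, "openness": 1.0}
dataset = {"content": 0.8, "utility": 0.6, "metadata": 1.0, "openness": 0.4}
score = quality_score(dataset, weights)

# Hypothetical automated vs. manual scores for a handful of datasets.
automated = [0.70, 0.55, 0.90, 0.40]
manual = [0.65, 0.60, 0.85, 0.50]
agreement = pearson(automated, manual)
```

    Changing the entries of `weights` is one way a user-defined weighting could be expressed. Re-running `quality_score` on fresh metadata at each evaluation cycle would give the dynamic assessment the abstract refers to.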

  • Ren Gang, Cheng Lingfeng, Jia Ziyao, Wang Anning
    Data Analysis and Knowledge Discovery. 2025, 9(4): 99-110. https://doi.org/10.11925/infotech.2096-3467.2024.0236

    [Objective] Utilising the multimodal information from images and texts, the ITRHP multimodal review helpfulness prediction model based on image-text matching technology is proposed. [Methods] First, Faster R-CNN and Bi-GRU models are used to extract image and text features, respectively. Second, the matching regions between text and images are captured by a co-attention mechanism to improve the consistency of feature expression. Then, positive and negative attention mechanisms are introduced to obtain the common semantic information of matching and mismatching word region pairs, and an adaptive matching threshold learning module is used to better detect the word region pairs with the highest similarity. Finally, the semantic information is passed to the fully connected layer to obtain the final classification results. [Results] The experimental results show that the ITRHP model achieves an accuracy of 80.17% and 80.27% on the Yelp and Amazon datasets, respectively, and an F1 value of 79.38% and 89.01%, respectively. Compared to the benchmark model, the accuracy on the two datasets improved by up to 2.80 and 2.42 percentage points, and the F1 values improved by up to 2.70 and 7.48 percentage points, respectively. [Limitations] The study focuses primarily on image and text data in reviews and does not explore further review features such as review sentiment and reviewer information. [Conclusions] The ITRHP model proposed in this study effectively uses multimodal information through image-text matching technology, which addresses the problem of low classification accuracy in multimodal helpfulness prediction models.

  • Zhang Xiu, Ji Ke, Ma Kun, Chen Zhenxiang, Gao Yuan, Wu Jun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 111-122. https://doi.org/10.11925/infotech.2096-3467.2024.0182

    [Objective] To find reports relevant to specific events quickly and accurately in massive volumes of Internet news, this paper proposes an event matching algorithm based on multidimensional event feature fusion and semantic feature interaction. [Methods] First, the news events are summarised by dependency syntactic analysis. Then, the multi-dimensional event features extracted by BiLSTM, DPCNN and a multi-head attention mechanism are fused by low-rank tensor feature fusion. Finally, the semantic features interact through the attention mechanism to jointly participate in the event matching judgement. [Results] Comparison experiments of different algorithms are conducted on real data from Sohu. The results show that on the three datasets with different news lengths, the method in this paper has better matching performance, especially in F1 value, which improves by 0.7, 0.69 and 0.23 percentage points, respectively. [Limitations] The current publicly available corpus is small, and a larger corpus could be created for further experiments. [Conclusions] The constructed model can better extract and interact with text features, which effectively improves the matching performance of news events.

  • Wu Qi, Yu Wei, Chen Junpeng
    Data Analysis and Knowledge Discovery. 2025, 9(4): 123-133. https://doi.org/10.11925/infotech.2096-3467.2024.0318

    [Objective] This paper aims to improve the efficiency of fake news detection by mining personality information from comment text. [Methods] BERT models learn the textual features of news and comments. A BERT-based personality prediction model learns the Big Five personality traits. The text features and personality traits of news and comments are used to predict true and fake news. [Results] The experiment is conducted on a subset of Weibo datasets. The results show that adding personality traits improves the accuracy of fake news detection (+1.96%, 90.76%) and the F1 score (+1.51%, 90.60%). [Limitations] The use of a personality prediction model requires a certain number of comment texts. The interpretability of the model needs to be further improved. [Conclusions] The personality traits of comment users can effectively improve the identification accuracy and F1 values of fake news detection.

  • Han Pu, Wang Zhiwei, Du Wenwen, Zhang Zihao
    Data Analysis and Knowledge Discovery. 2025, 9(4): 134-144. https://doi.org/10.11925/infotech.2096-3467.2024.0291

    [Objective] This paper aims to improve fine-grained classification of legal text by using a decoupling technique to alleviate over-smoothing and by constructing a deep graph network to learn hidden features of text. An attention diffusion mechanism is also adopted to enhance the long-distance interaction ability of the graph network. [Methods] In this study, we propose a fine-grained legal document classification model, FLGNN, based on a Deep Attention Diffusion Graph Neural Network. The model first uses the pre-trained model BERT as an embedding layer to obtain long-range semantic features, then constructs a text-directed graph to capture the global graph information and hidden features of the text through a deep graph network, and finally optimizes the textual features and carries out the classification task using feature fusion and node-level attention mechanisms. [Results] The model has an Acc value of 94.85% on the dataset PKULawData from the NLM database, which is an improvement of 1.15%, 3.44%, and 1.72% over the Acc values of baseline models such as BERT, DADGNN, and RCNN, respectively. On the dataset of legal contract texts, JSCLawData, it has an Acc value of 90.91%, which is an improvement of 1.35%, 4.19% and 4.10% over the Acc values of baseline models such as BERT, DADGNN and RCNN, respectively. [Limitations] Further exploration of the model’s applicability to other domains is needed. [Conclusions] The FLGNN model can capture the global graph information of legal texts and mine the semantic information of the deep network, effectively improving the classification of fine-grained legal texts and providing practical support for artificial intelligence in the legal domain.

  • Liu Leping, Liu Fang, Wang Linchen
    Data Analysis and Knowledge Discovery. 2025, 9(4): 145-157. https://doi.org/10.11925/infotech.2096-3467.2024.0132

    [Objective] This paper aims to improve the accuracy of gang fraud detection in Medicare fraud risk identification and enhance the security of Medicare fund. [Methods] A Medicare gang fraud risk identification method that integrates the attention mechanism and graph neural networks is proposed. First, the claim is transformed into a high-dimensional vector using the embedding method to obtain the static features of the claim, and then the important fraud factors are given more weight by the attention mechanism to improve the model’s ability to identify the important fraud factors in the claim. Then, a relationship graph is generated based on the dynamic behavioural characteristics of the insured, and the neighbourhood information embedded in the relationship graph is captured using graph neural networks and fused with the static characteristics of the claim to mine the dynamic abnormal behaviours caused by the gang fraud in high-dimensional space. Finally, the fraud probability of the claim is output. [Results] The experimental results on 1.83 million medical claims data from 20,000 participants of a medical insurance organization in China show that the proposed method achieves a recall and accuracy of 91.08% and 90.66%, respectively, with an F1 mean of 0.69, which is better than other classical methods. [Limitations] Only the dynamic behavioural characteristics of the enrollees are fused for Medicare fraud risk identification, and the combination of multi-subject factors such as doctors and pharmacies will be considered in future studies to further improve the accuracy of the model. [Conclusions] The integration of dynamic enrollee behaviour can complement the static feature of the claim, increase the focus on Medicare gang fraud, and improve the accuracy of model identification.
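    The two ingredients named in this abstract, attention weighting of claim factors and neighbourhood aggregation over a relationship graph, can be illustrated with a toy sketch. It is not the authors' architecture: softmax attention pooling, a single mean-aggregation step, concatenation as the fusion operator, the function names, and all data below are assumptions chosen for simplicity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, scores):
    """Attention-weighted sum of equally sized vectors.

    Vectors with higher scores contribute more to the pooled result.
    """
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def mean_aggregate(node, edges, features):
    """Average the feature vectors of a node's neighbours (one message-passing step)."""
    neighbours = edges.get(node, [])
    dim = len(features[node])
    if not neighbours:
        return [0.0] * dim
    return [
        sum(features[n][d] for n in neighbours) / len(neighbours) for d in range(dim)
    ]


# Hypothetical static side: two embedded claim factors with relevance scores.
factor_vectors = [[1.0, 0.0], [0.0, 1.0]]
factor_scores = [2.0, 0.0]
static = attention_pool(factor_vectors, factor_scores)

# Hypothetical dynamic side: insured persons linked by shared behaviour.
edges = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
features = {"a": [0.0, 0.0], "b": [1.0, 3.0], "c": [3.0, 1.0]}
dynamic = mean_aggregate("a", edges, features)

# Fuse static and dynamic views by concatenation before a classifier.
fused = static + dynamic
```

    In a trained model the relevance scores and the aggregation weights would be learned, and the fused vector would feed a classifier that outputs the fraud probability. Here they are fixed by hand to keep the example self-contained.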

  • Gao Yuan, Li Chongyang, Qu Boting, Jiao Mengyun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 158-169. https://doi.org/10.11925/infotech.2096-3467.2024.0784

    [Objective] This paper aims to advance the research on urban tourism flow network structure, and to address the issues of inaccurate point-of-interest (POI) recognition and distorted visiting sequences in current tourist journey reconstruction methods based on travelogue texts. [Methods] This paper proposes a method based on a large language model for reconstructing tourist journeys, and explores the structural characteristics of urban tourism flow networks by combining it with social network analysis methods. [Results] The proposed method for reconstructing tourist journeys achieves a precision of 94.00% and a recall of 87.78% in POI recognition, significantly outperforming the statistics-based Conditional Random Fields (CRF) method. The reconstructed journey shows a similarity of 83.81% to the actual journey. [Limitations] The quality of tourist journey reconstruction depends to a certain extent on how well the prompts of the large language model are tuned. [Conclusions] The conclusions drawn align with public perception and current research findings when taking Xi’an as a case study, demonstrating the accuracy and versatility of the proposed tourist journey reconstruction method.