
Top access

  • Wang Zhenyu, Zhu Xuefang, Yang Rui
    Data Analysis and Knowledge Discovery. 2025, 9(1): 90-99. https://doi.org/10.11925/infotech.2096-3467.2023.1273
    Abstract (979) PDF (191) HTML (441)

    [Objective] This paper utilizes large language models (LLMs) to generate high-quality auxiliary knowledge, aiming to improve the performance of multimodal relation extraction. [Methods] We introduced a multimodal similarity detection module to construct multimodal prompt templates, which allow the LLM to integrate visual information and prior knowledge when generating high-quality auxiliary knowledge. We combined the obtained auxiliary knowledge with the original text and fed it into downstream text models to accurately predict entity relationships. [Results] The proposed model outperformed the best baseline model on the MNRE dataset, achieving improvements of 4.09% in accuracy and 7.84% in F1 score. [Limitations] We only examined the proposed model on English datasets. [Conclusions] Comparative experiments and case studies validate the model’s effectiveness in multimodal relation extraction. Our new model provides a direction for applying LLMs to multimodal information extraction tasks in the future.
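
    The prompt-construction step lends itself to a small sketch. Below is a hypothetical few-shot template builder, assuming demonstrations selected by the similarity detection module; the field names and wording are illustrative, not the authors' actual template.

    ```python
    # Hypothetical prompt builder for LLM-generated auxiliary knowledge;
    # the template wording and fields are illustrative, not the paper's.
    def build_prompt(sentence: str, image_caption: str, demos: list) -> str:
        """Assemble a few-shot prompt asking an LLM for auxiliary knowledge."""
        shots = "\n\n".join(
            f"Text: {d['text']}\nImage: {d['caption']}\nKnowledge: {d['knowledge']}"
            for d in demos  # demonstrations picked by the similarity module
        )
        return (
            "Generate background knowledge that helps identify the relation "
            "between the entities mentioned in the text.\n\n"
            f"{shots}\n\nText: {sentence}\nImage: {image_caption}\nKnowledge:"
        )

    demos = [{"text": "Messi lifts the trophy.",
              "caption": "A man in a blue-and-white shirt holds a gold cup.",
              "knowledge": "Lionel Messi is an Argentine footballer."}]
    print(build_prompt("Ronaldo joins Al Nassr.", "A footballer at a press event.", demos))
    ```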

  • Chen Ting, Ding Honghao, Zhou Haoyu, Wu Jiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 159-171. https://doi.org/10.11925/infotech.2096-3467.2023.1424
    Abstract (922) PDF (130) HTML (806)

    [Objective] This study explores the impacts of bullet-screen (danmu) content and behavioral characteristics on consumers’ purchasing behavior in live-streaming e-commerce, as well as the moderating effect of host-product relevance. [Methods] First, we retrieved the bullet-screen data from the Douyin platform and the consumer data from the Huitun platform based on the Elaboration Likelihood Model. Then, we studied the impacts of bullet-screen content characteristics (central route) and behavioral characteristics (peripheral route) on consumer purchasing behavior with text mining and zero-inflated negative binomial regression. We also examined the moderating effect of host-product relevance with grouped regression. [Results] Information richness, the degree of social interaction, and the number of bullet-screen comments positively impact purchasing behavior. The emotional polarity of bullet-screen comments exhibits an inverted U-shaped effect on purchasing behavior. Compared with live-streaming rooms with low host-product relevance, those with high host-product relevance show broader positive impacts on purchasing behavior. [Limitations] We only investigated the bullet-screen data from a single live-streaming e-commerce platform. [Conclusions] This study examines the factors influencing consumers’ actual purchasing behavior from the perspective of bullet-screen comments. It provides insights for improving communication between merchants and consumers in live-streaming e-commerce, ultimately enhancing sales performance.
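
    The zero-inflated negative binomial step can be reproduced with standard tooling. Below is a minimal sketch with statsmodels, using synthetic stand-ins for the bullet-screen features and purchase counts; adding a squared emotional-polarity term would test the inverted U-shape.

    ```python
    # Sketch: zero-inflated negative binomial regression on synthetic data.
    import numpy as np
    import statsmodels.api as sm
    from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

    rng = np.random.default_rng(42)
    n = 500
    # Stand-ins for information richness, interaction degree, comment volume
    X = sm.add_constant(rng.normal(size=(n, 3)))
    # Purchase counts with excess zeros (many viewers never buy)
    y = rng.negative_binomial(1, 0.6, size=n) * rng.binomial(1, 0.5, size=n)

    model = ZeroInflatedNegativeBinomialP(y, X, exog_infl=X, inflation="logit")
    result = model.fit(maxiter=200, disp=False)
    print(result.summary())
    ```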

  • Song Mengpeng, Bai Haiyan
    Data Analysis and Knowledge Discovery. 2025, 9(6): 21-34. https://doi.org/10.11925/infotech.2096-3467.2024.0628
    Abstract (876) PDF (82) HTML (619)

    [Objective] This paper aims to automatically generate structured literature reviews with references, helping researchers quickly grasp a specific area of scientific knowledge. [Methods] A corpus was constructed by selecting 70,000 papers from the NSTL platform and identifying moves in their abstracts. The GLM3-6B model was fine-tuned on 3,000 reviews that were generated by a large language model and then revised manually. The corpus was converted into high-dimensional vectors and stored in an index, which was retrieved to implement LangChain’s external knowledge base. To address the poor retrieval of proper nouns, a hybrid search with BM25 was used, with reranking to improve retrieval accuracy. [Results] The literature review generation system built on fine-tuning and hybrid retrieval improved the BLEU and ROUGE scores by 109.64% and 40.22% respectively, as well as the authenticity score of manual evaluation by 62.17%. [Limitations] Due to limited computational resources, the local model is small in parameter scale, and its generation ability needs further improvement. [Conclusions] Retrieval-augmented generation with large language models not only generates high-quality literature reviews but also provides traceable evidence for the generated content and assists researchers in intelligent reading.
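
    The hybrid-retrieval step is sketched below with rank_bm25 and sentence-transformers; the corpus, encoder checkpoint, and equal score weighting are assumptions, not the paper's configuration.

    ```python
    # Sketch: hybrid retrieval fusing BM25 (lexical) and dense (semantic) scores.
    import numpy as np
    from rank_bm25 import BM25Okapi
    from sentence_transformers import SentenceTransformer

    corpus = [
        "BERT is a pre-trained language representation model.",
        "BM25 is a ranking function for lexical retrieval.",
        "Retrieval-augmented generation grounds LLM output in retrieved text.",
    ]
    query = "How does retrieval-augmented generation work?"

    # Lexical scores: BM25 handles proper nouns that embeddings may miss
    bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
    lexical = np.array(bm25.get_scores(query.lower().split()))

    # Dense scores: cosine similarity of normalized sentence embeddings
    encoder = SentenceTransformer("all-MiniLM-L6-v2")
    doc_vecs = encoder.encode(corpus, normalize_embeddings=True)
    dense = doc_vecs @ encoder.encode(query, normalize_embeddings=True)

    def minmax(s):
        return (s - s.min()) / (s.max() - s.min() + 1e-9)

    fused = 0.5 * minmax(lexical) + 0.5 * minmax(dense)  # fuse, then rerank
    for i in np.argsort(-fused):
        print(f"{fused[i]:.3f}  {corpus[i]}")
    ```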

  • Zhang Jing, Gao Zixin, Ding Weijie
    Data Analysis and Knowledge Discovery. 2025, 9(2): 48-58. https://doi.org/10.11925/infotech.2096-3467.2023.1347
    Abstract (653) PDF (116) HTML (193)

    [Objective] This paper proposes a new model to effectively classify massive police reports. [Methods] We constructed a text classification model based on BERT-DPCNN, using the BERT pre-trained model to generate word vectors. The model improved classification performance by optimizing the activation function in the DPCNN component and introducing a dynamic learning rate. [Results] We conducted comparative experiments between BERT-DPCNN and six other models: BERT, BERT-CNN, BERT-RCNN, BERT-RNN, BERT-LSTM, and ERNIE. BERT-DPCNN achieved the best accuracy, recall, and precision. In the binary classification tasks, the accuracy of BERT-DPCNN exceeded 98%; in the eleven-category tasks, it exceeded 82%. [Limitations] The model has many parameters, and the limited number of experiments calls for further testing. [Conclusions] The new model effectively improves the accuracy of police report classification, providing data support for police departments in analyzing and assessing police incidents.
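
    The DPCNN head on top of BERT embeddings can be sketched compactly; the channel width, depth, and the GELU activation below are illustrative choices (the paper optimizes the activation function but does not specify it here).

    ```python
    # Sketch: a simplified DPCNN classification head over BERT token embeddings.
    import torch
    import torch.nn as nn

    class DPCNNHead(nn.Module):
        def __init__(self, hidden=768, channels=250, num_classes=11):
            super().__init__()
            self.region = nn.Conv1d(hidden, channels, kernel_size=3, padding=1)
            self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
            self.act = nn.GELU()                     # stand-in activation
            self.pool = nn.MaxPool1d(3, stride=2, padding=1)
            self.fc = nn.Linear(channels, num_classes)

        def forward(self, bert_hidden):              # (batch, seq_len, hidden)
            x = self.region(bert_hidden.transpose(1, 2))
            x = x + self.conv(self.act(self.conv(self.act(x))))  # residual block
            while x.size(2) > 1:                     # pyramid: halve length, add residual
                x = self.pool(x)
                x = x + self.conv(self.act(self.conv(self.act(x))))
            return self.fc(x.squeeze(2))

    logits = DPCNNHead()(torch.randn(4, 128, 768))   # e.g. BERT last_hidden_state
    print(logits.shape)                              # torch.Size([4, 11])
    ```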

  • Zhou Zhigang, Dou Luyao, Li Yi, Bai Zengliang
    Data Analysis and Knowledge Discovery. 2024, 8(12): 52-61. https://doi.org/10.11925/infotech.2096-3467.2023.0883
    Abstract (596) PDF (108) HTML (425)

    [Objective] This paper identifies potential high-value patents by deeply mining the feature information embedded in patent texts based on bilateral semantics and text sequence features. [Methods] First, we constructed a mixed patent dataset from the fields of amorphous alloys, industrial robots, and gene chips. Then, we employed the BERT word vector model to achieve contextual semantic association and word meaning interpretation of patent texts. Third, we utilized the BiGRU network to extract global text sequence information while CNN captured local text sequence information. Finally, we predicted potential high-value patents by combining “bilateral semantics+global+local” semantic and sequence features. [Results] The proposed BERT-BiGRU-CNN model outperforms existing models and is more suitable for predicting potential high-value patents on a large data scale. Our new model achieves a prediction accuracy of over 35%, about 4% higher than the existing ones. [Limitations] The relationship and integration mechanism between standard essential and high-value patents have yet to be considered, and the algorithm complexity needs further optimization. [Conclusions] The BERT-BiGRU-CNN model performs better in text classification tasks than the CNN model. Our new model improves the prediction accuracy of potentially high-value patents by capturing global and local text sequence features.

  • Zhang Le, Chen Yansong, Zhang Leihan
    Data Analysis and Knowledge Discovery. 2025, 9(8): 47-58. https://doi.org/10.11925/infotech.2096-3467.2024.0625

    [Objective] This paper proposes a method that enhances features using large language models and integrates them through multi-level cross-fusion. It addresses the issue in multimodal sentiment analysis where emotional expressions across different modalities are inconsistent, hindering effective collaborative sentiment decision-making. [Methods] To alleviate conflicting sentiment information among modalities and improve the representation of sentiment features, we used a multimodal large language model to extract auxiliary sentiment information within each modality. Then, we employed a hierarchical cross-attention mechanism to learn shared emotional features across modalities while mining auxiliary intra-modal emotional features, thereby enhancing the expression of shared semantic sentiment. During the fusion phase, a modality-attention weighted fusion method is introduced to balance the contributions of shared and auxiliary features. Additionally, we utilized a loss function combining multimodal and unimodal inputs to address sentiment semantic inconsistencies. [Results] The proposed model outperforms baselines on the public datasets CH-SIMS and CMU-MOSI. On CH-SIMS, binary classification accuracy and F1 score increased by 1.77 and 0.63 percentage points, respectively; on CMU-MOSI, improvements of 0.43 and 0.41 percentage points were observed. For CH-SIMS data with emotional inconsistency, binary classification accuracy and F1 score increased by 1.80 and 1.72 percentage points, respectively, demonstrating that the proposed model can effectively address inconsistent sentiment semantics across modalities. [Limitations] The model does not account for the impact of personalized information about individuals in videos. [Conclusions] The proposed approach effectively integrates multimodal features using a hierarchical cross-attention mechanism, improves the representation of shared semantic sentiment, and addresses inconsistencies in emotional semantics across different modalities.
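
    One layer of the cross-attention idea is sketched below for two modalities; the dimensions, mean pooling, and the sigmoid gate standing in for the modality-attention weighting are assumptions.

    ```python
    # Sketch: bidirectional cross-modal attention with gated weighted fusion.
    import torch
    import torch.nn as nn

    class CrossModalFusion(nn.Module):
        def __init__(self, dim=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)
            self.gate = nn.Linear(2 * dim, 1)  # balances the two fused streams

        def forward(self, text, audio):         # (batch, len, dim) each
            t2a, _ = self.attn(query=text, key=audio, value=audio)
            a2t, _ = self.attn(query=audio, key=text, value=text)
            shared_t = self.norm(text + t2a).mean(1)   # pooled shared features
            shared_a = self.norm(audio + a2t).mean(1)
            w = torch.sigmoid(self.gate(torch.cat([shared_t, shared_a], dim=-1)))
            return w * shared_t + (1 - w) * shared_a   # weighted fusion

    fused = CrossModalFusion()(torch.randn(2, 20, 256), torch.randn(2, 50, 256))
    print(fused.shape)   # torch.Size([2, 256])
    ```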

  • Sun Wenju, Li Qingyong, Zhang Jing, Wang Danyu, Wang Wen, Geng Yangli’ao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 1-30. https://doi.org/10.11925/infotech.2096-3467.2024.0508
    Abstract (563) PDF (189) HTML (500)

    [Objective] This study comprehensively reviews advances in deep incremental learning from the perspective of addressing catastrophic forgetting, aiming to provide a reference for the research community. [Coverage] Using search terms such as “Incremental Learning”, “Continual Learning”, and “Catastrophic Forgetting”, we retrieved literature from Web of Science, Google Scholar, DBLP, and CNKI. After reading and organizing the retrieved literature, a total of 105 representative publications were selected. [Methods] The paper begins by defining incremental learning and outlining its problem formulation and inherent challenges. We then categorize incremental learning methods into regularization-based, memory-based, and dynamic-architecture-based approaches, and review their theoretical underpinnings, advantages, and disadvantages in detail. [Results] We evaluated several classical and recent methods in a unified experimental setting. The results demonstrate that regularization-based methods are efficient in application but cannot fully avoid forgetting; memory-based methods are significantly affected by the number of retained exemplars; and dynamic-architecture-based methods effectively prevent forgetting but incur additional computational costs. [Limitations] The scope of this review is limited to deep learning approaches, excluding traditional machine learning techniques. [Conclusions] Under optimal conditions, memory-based and dynamic-architecture-based strategies tend to outperform regularization-based approaches, but their increased complexity may hinder practical application. Furthermore, current incremental learning methods still underperform joint training, marking a critical direction for future research.
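
    Of the three families, the regularization-based one is the simplest to illustrate. Below is a compact sketch of an elastic-weight-consolidation-style penalty, a representative method of that family; the Fisher estimate here is a placeholder.

    ```python
    # Sketch: EWC-style quadratic penalty anchoring parameters important for
    # earlier tasks: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2
    import torch

    def ewc_penalty(model, old_params, fisher, lam=100.0):
        loss = torch.zeros(())
        for name, p in model.named_parameters():
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
        return 0.5 * lam * loss

    model = torch.nn.Linear(4, 2)
    old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
    fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder
    # During training on a new task: total_loss = task_loss + ewc_penalty(...)
    print(ewc_penalty(model, old_params, fisher))  # zero before any parameter drift
    ```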

  • Chen Wanzhi, Hou Yue
    Data Analysis and Knowledge Discovery. 2025, 9(7): 52-65. https://doi.org/10.11925/infotech.2096-3467.2024.0720

    [Objective] To address the issues in multimodal sentiment analysis, such as insufficient multimodal feature extraction, semantic differences between modalities, and lack of interaction, we propose a temporal multimodal sentiment analysis model that integrates multi-level attention and sentiment scale vectors. [Methods] Firstly, we introduced a scalar Long Short-Term Memory network with a multi-head attention mechanism to construct a deep temporal feature modeling network for extracting rich contextual temporal features from text, audio, and visual modalities. Secondly, we employed the text-guided dual-layer cross-modal attention mechanism and the improved self-attention mechanism to facilitate the deep information exchange across modalities, thereby generating two sentiment scale vectors for sentiment intensity and polarity. Finally, the L1 norm of the sentiment intensity vector was multiplied by the normalized sentiment polarity vector to obtain a comprehensive representation of sentiment strength and polarity, thereby enabling accurate sentiment prediction. [Results] Experiments on the CMU-MOSI dataset show that the proposed model achieves good results in both comparative and ablation experiments, outperforming the next-best model by 1.2 and 2.3 percentage points on the Acc7 and Corr metrics, respectively. On the CMU-MOSEI dataset, the proposed model surpasses baseline models across all evaluation metrics, achieving 86.0% in Acc2 and 86.1% in F1 score. [Limitations] Sentiment expression is highly context-dependent, and the sources of sentiment cues may vary across different scenarios. The proposed model may perform poorly when textual information is insufficient. [Conclusions] The proposed model effectively extracts contextual temporal features from various modalities and leverages the rich emotional information in the text modality for deep inter-modal interaction, thereby enhancing the accuracy of sentiment prediction.
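
    The final combination step stated above is a one-line computation; here is a worked sketch with illustrative vectors.

    ```python
    # Sketch: strength-times-direction combination of the two sentiment scale vectors.
    import torch
    import torch.nn.functional as F

    intensity = torch.tensor([0.2, -0.5, 0.3])   # illustrative intensity vector
    polarity = torch.tensor([0.8, -0.1, 0.4])    # illustrative polarity vector

    strength = intensity.norm(p=1)               # L1 norm: |0.2|+|-0.5|+|0.3| = 1.0
    prediction = strength * F.normalize(polarity, dim=0)  # scaled unit polarity
    print(prediction)
    ```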

  • Rang Yuchen, Ma Jing
    Data Analysis and Knowledge Discovery. 2025, 9(1): 100-109. https://doi.org/10.11925/infotech.2096-3467.2023.1130
    Abstract (537) PDF (115) HTML (413)

    [Objective] To reduce inter-modal differences and strengthen the correlation between modalities, this paper proposes a multimodal alignment sentiment analysis model to accurately capture the sentiment tendencies embedded in multimodal data. [Methods] For the textual modality, the original text, supplemented with image captions, is processed by the RoBERTa pre-trained model for text feature extraction. For the image modality, we used the CLIP Vision Model to extract image features. The text and image features are aligned through a multimodal alignment layer based on a Multimodal Transformer to obtain enhanced fused features. Finally, the fused multimodal features are fed into a multilayer perceptron for sentiment recognition and classification. [Results] The proposed model achieved an accuracy of 71.78% and an F1 score of 68.97% on the MVSA-Multiple dataset, representing improvements of 1.78% and 0.07%, respectively, over the best-performing baseline model. [Limitations] The model’s performance was not validated on additional datasets. [Conclusions] The proposed model effectively promotes inter-modal fusion, achieves better fusion representations, and enhances sentiment analysis.
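
    The two feature extractors map onto public checkpoints; below is a minimal sketch with Hugging Face transformers, where the checkpoints and the caption-concatenation format are assumptions.

    ```python
    # Sketch: RoBERTa text features (text + image caption) and CLIP image features.
    import torch
    from PIL import Image
    from transformers import (AutoModel, AutoTokenizer,
                              CLIPImageProcessor, CLIPVisionModel)

    text_tok = AutoTokenizer.from_pretrained("roberta-base")
    text_enc = AutoModel.from_pretrained("roberta-base")
    img_proc = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
    img_enc = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

    text = "A dog plays in the park. Caption: a brown dog on grass"  # text + caption
    image = Image.new("RGB", (224, 224))                             # placeholder image

    with torch.no_grad():
        t = text_enc(**text_tok(text, return_tensors="pt")).last_hidden_state
        v = img_enc(**img_proc(images=image, return_tensors="pt")).last_hidden_state
    print(t.shape, v.shape)   # token and patch features, ready for alignment
    ```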

  • Song Donghuan, Hu Maodi, Ding Jielan, Qu Zihao, Chang Zhijun, Qian Li
    Data Analysis and Knowledge Discovery. 2025, 9(2): 12-25. https://doi.org/10.11925/infotech.2096-3467.2023.0885
    Abstract (496) PDF (183) HTML (325)

    [Objective] This study addresses low classification accuracy in conventional text classification tasks caused by factors such as sparse domain-specific training data and significant differences between document types. [Methods] We constructed a novel classification model based on the BERT-DPCNN-MMOE framework, integrating deep pyramid convolutional networks with a multi-gate mixture-of-experts mechanism. We then designed multi-task and transfer learning experiments to validate the new model against eight well-established and innovative models. [Results] The research independently constructed cross-type multi-task data as the basis for training and testing. The BERT-DPCNN-MMOE model outperformed the eight baseline models in both multi-task and transfer learning experiments, with F1 score improvements exceeding 4.7%. [Limitations] Further research is needed to explore the model’s adaptability to other domains. [Conclusions] The BERT-DPCNN-MMOE model performs better in multi-task and cross-type text classification tasks and offers value for future specialized intelligence classification tasks.
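
    The multi-gate mixture-of-experts piece is sketched below: each task gets its own softmax gate over shared experts. The sizes and two-task setup are illustrative.

    ```python
    # Sketch: an MMoE layer over a shared sentence representation (e.g. BERT [CLS]).
    import torch
    import torch.nn as nn

    class MMoE(nn.Module):
        def __init__(self, dim=768, n_experts=4, n_tasks=2):
            super().__init__()
            self.experts = nn.ModuleList(
                [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_experts)])
            self.gates = nn.ModuleList(
                [nn.Linear(dim, n_experts) for _ in range(n_tasks)])

        def forward(self, x):                                  # x: (batch, dim)
            expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, D)
            task_feats = []
            for gate in self.gates:                            # one gate per task
                w = torch.softmax(gate(x), dim=-1).unsqueeze(-1)           # (B, E, 1)
                task_feats.append((w * expert_out).sum(dim=1))             # (B, D)
            return task_feats

    feats = MMoE()(torch.randn(8, 768))
    print([f.shape for f in feats])   # one representation per task head
    ```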

  • Shen Si, Feng Shuyang, Wu Na, Zhao Zhixiao
    Data Analysis and Knowledge Discovery. 2025, 9(9): 37-48. https://doi.org/10.11925/infotech.2096-3467.2024.0670

    [Objective] This paper aims to enhance the utilization efficiency of governmental information resources and advance the intelligent transformation of public services by addressing the inherent knowledge limitations of general LLMs when processing policy texts. We investigate the effectiveness of a retrieval-augmented generation (RAG) framework for constructing a more precise and reliable intelligent policy Q&A system. [Methods] We propose a retrieval-augmented generation framework based on the Chinese policy large language model ChpoGPT. Specifically, the framework retrieves semantically similar policy documents from a knowledge base according to user queries and combines the retrieved results with ChpoGPT to enhance the model’s capabilities on downstream tasks. [Results] Experimental results demonstrate that our framework significantly outperforms existing models on key metrics. The ChpoGPT-based framework achieved a factuality score of nearly 90%. In answer relevance, it scored 80.2%, outperforming the Gemini-1.0-pro model by 2.1%. Furthermore, it attained an answer semantic similarity score of 56.4%, surpassing the ERNIE 4.0 and Gemini-1.0-pro models by 4.1% and 2.8%, respectively. [Limitations] The language model still exhibits some uncontrollable behaviour in its answer output. [Conclusions] Retrieval-augmented generation over policy texts based on LLMs offers a useful reference for the intelligent transformation of government services, but it still needs further improvement and optimization.

  • Wang Zitong, Li Chenliang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 94-105. https://doi.org/10.11925/infotech.2096-3467.2023.1305
    Abstract (465) PDF (240) HTML (209)

    [Objective] To more flexibly capture the spatial-temporal features of traffic flow data and achieve more accurate multivariate traffic flow prediction, this paper proposes a Position-Aware Spatial-Temporal Graph Convolutional Network (PASTGCN). [Methods] First, the traffic data’s spatial and periodic temporal position features are represented as explicit position embeddings. Then, based on the spatiotemporal convolutional structure, we incorporated spatial information into the temporal convolutional network for space-aware sequence modeling. Finally, we used static and dynamic dual graph learning methods to capture spatial dependencies. [Results] We conducted experiments on two real-world traffic flow datasets. The PASTGCN model effectively predicted multivariate traffic flows and reduced errors by up to 1.59% compared to existing deep learning models. [Limitations] The experimental datasets are limited, and the proposed graph learning method increased the time complexity. [Conclusions] The PASTGCN model can effectively utilize spatial-temporal position information to achieve more accurate traffic flow prediction.

  • Feng Ran, Chen Danlei, Hua Bolin
    Data Analysis and Knowledge Discovery. 2025, 9(5): 19-32. https://doi.org/10.11925/infotech.2096-3467.2024.0533
    Abstract (463) PDF (123) HTML (375)

    [Objective] This paper comprehensively reviews text augmentation methods to reveal their current state of development and trends. [Coverage] Using “textual data augmentation” and “text augmentation” as search terms in Web of Science, Google Scholar, and CNKI, we screened a total of 88 representative papers for review. [Methods] Text augmentation methods were categorized and summarized according to the objects of operation, the details of implementation, and the diversity of generated results. On this basis, we thoroughly compared the methods with regard to their granularity, strengths, weaknesses, and applications. [Results] Text augmentation approaches can be divided into text-space methods and vector-space methods. The former are intuitive and easily interpretable but may compromise the overall semantic structure of the text, while the latter can directly manipulate semantic features but incur higher computational complexity. Current studies frequently require external knowledge resources, such as heuristic guidelines and task-specific data. Moreover, introducing deep learning algorithms can enhance the novelty and diversity of generated data. [Limitations] We primarily offer a systematic examination of the technical principles and performance characteristics of advanced methods, without quantitatively assessing the maturity of platform tools. Moreover, the analysis is grounded in the selected literature and may not cover all potential application scenarios of text augmentation methods. [Conclusions] Future work should focus on enriching and refining evaluation metrics for text augmentation techniques and on increasing their robustness across downstream tasks through prompt learning. Retrieval-augmented generation and graph neural networks deserve attention for addressing the challenges posed by lengthy texts and limited resources, which can further unlock the potential of text augmentation methods in natural language processing.
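
    The text-space family is easy to demonstrate; below is a minimal EDA-style sketch of random swap and random deletion, two classic text-space operations.

    ```python
    # Sketch: two text-space augmentation operations on a token list.
    import random

    def random_swap(tokens, n=1):
        tokens = tokens[:]
        for _ in range(n):
            i, j = random.sample(range(len(tokens)), 2)
            tokens[i], tokens[j] = tokens[j], tokens[i]
        return tokens

    def random_delete(tokens, p=0.1):
        kept = [t for t in tokens if random.random() > p]
        return kept or tokens   # never return an empty sentence

    random.seed(0)
    sent = "text augmentation expands small training sets".split()
    print(" ".join(random_swap(sent)))
    print(" ".join(random_delete(sent)))
    ```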

  • Liu Yu, Zeng Ziming, Sun Shouqiang
    Data Analysis and Knowledge Discovery. 2025, 9(8): 20-31. https://doi.org/10.11925/infotech.2096-3467.2024.0870

    [Objective] This paper addresses semantic shift in multi-aspect sentences and implicit sentiment in aspect-based sentiment analysis. To this end, it proposes a model based on sentiment enhancement using large language models and graph convolutional neural networks. [Methods] The model uses prompt learning to guide large language models in generating sentiment-enhanced representations of aspect semantics, then constructs an aspect-semantic, sentiment-knowledge-enhanced graph. Additionally, it presents a sentiment-target position weighting algorithm to filter irrelevant information from the syntactic dependency graph, and introduces aspect masking and gated filtering mechanisms to fully integrate semantic information and accurately identify the sentiment tendency of each aspect. [Results] The proposed model performs slightly below two baseline models on the Restaurant dataset, yet still achieves an F1 score of 81.60%. On the Laptop, Twitter, and MAMS datasets, it significantly improves F1 scores by 1.79, 1.17, and 3.02 percentage points, respectively, over the optimal baseline model. [Limitations] The role of visual information in aspect-level sentiment analysis is not considered, and experiments are only conducted on English datasets. [Conclusions] Leveraging prompt learning to guide large language models in generating sentiment representation words, combined with graph neural networks, provides an effective and efficient solution that significantly improves the accuracy of aspect-level sentiment analysis in text.

  • Shen Yangtai, Qi Jianglei, Ding Hao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 145-153. https://doi.org/10.11925/infotech.2096-3467.2023.0808
    Abstract (367) PDF (92) HTML (304)

    [Objective] This paper proposes a latent non-negative factorization topic recommendation model based on LDA and transfer learning to improve recommendation accuracy in sparse data scenarios. The new model aims to address the data sparsity issue in publication recommendations. [Methods] We used non-negative matrix factorization to fill the high-dimensional sparse matrix of non-negative data. Then, we constructed a latent topic model based on LDA and non-negative matrix factorization, fully considering the thematic distribution characteristics of user reviews. Additionally, we applied different dimensions of user information to rating prediction to mitigate data sparsity. Finally, we introduced a transfer learning mechanism to extract and transfer model parameters from pre-trained models of related publication categories. This mechanism assisted the feature learning for the target model data and improved the effectiveness of the recommendation for less popular publications. [Results] We conducted comparative experiments against three baseline methods with three publication datasets. The proposed model achieved average precision, F1 score, and NDCG of 0.7732, 0.7085, and 0.7468. The model’s overall performance surpasses other baseline models. [Limitations] When the number of users in the system is too small, other methods are needed for cold-start situations. [Conclusions] The proposed method has strong generalization capabilities for user interest features, alleviates popularity bias and data sparsity, and effectively improves the accuracy of publication recommendations.
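
    The matrix-filling step can be sketched with scikit-learn's NMF on a toy user-item matrix; treating the zero cells as values to be approximated is a simplification of the paper's approach.

    ```python
    # Sketch: filling a sparse non-negative rating matrix via NMF factors.
    import numpy as np
    from sklearn.decomposition import NMF

    # Rows: users, columns: publications; zeros stand for unobserved ratings
    R = np.array([[5, 3, 0, 1],
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [0, 1, 5, 4]], dtype=float)

    nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
    W = nmf.fit_transform(R)      # user-by-latent-topic factors
    H = nmf.components_           # latent-topic-by-item factors
    print(np.round(W @ H, 2))     # dense reconstruction fills the empty cells
    ```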

  • Han Yixiao, Ma Jing
    Data Analysis and Knowledge Discovery. 2024, 8(12): 18-29. https://doi.org/10.11925/infotech.2096-3467.2023.0923
    Abstract (367) PDF (195) HTML (253)

    [Objective] In response to the challenges that current multimodal emotion models face in feature fusion, which result in suboptimal emotion classification accuracy, we propose the RCHFN multimodal emotion classification model. [Methods] We use the CLIP and Chinese-BERT-wwm models to extract image and text features separately while performing unimodal emotion classification concurrently. Then, we use a residual fusion module, consisting of merged residual connections and convolution, to fuse image and text features and obtain multimodal emotion classification results. Finally, we pass both the unimodal and multimodal emotion classification results to a fully connected layer and adjust dynamic weights to obtain the final emotion classification result. [Results] The RCHFN model achieved sentiment classification accuracies of 81.25% and 79.21% on the Weibo and Twitter datasets, respectively, with F1 scores of 80.43% and 78.44%. Compared to other models designed for similar tasks on the same datasets, it showed accuracy increases of 1.79% and 1.79%, along with F1 score improvements of 2.39% and 2.62%, respectively. [Limitations] Further experiments are needed to establish the model’s generalisation to different datasets and its performance on additional modalities. [Conclusions] The RCHFN model proposed in this study effectively addresses the challenges of fusing multimodal discourse features and improving classification accuracy in emotion classification.

  • Zhai Dongsheng, Zhai Liang, Liang Guoqiang, Zhao Kai
    Data Analysis and Knowledge Discovery. 2025, 9(2): 120-133. https://doi.org/10.11925/infotech.2096-3467.2023.1277
    Abstract (364) PDF (127) HTML (249)

    [Objective] This study proposes a method for identifying technological evolution paths and explores key technologies and branches in specific domains. It aims to reveal the evolution trajectories of technology. [Methods] Firstly, we devised an unsupervised graph embedding model to integrate patent structural relationships, text and node information propagation, and aggregated knowledge into multi-dimensional semantic vectors. This approach expanded the technological paths while improving community division effectiveness. Secondly, we proposed methods for expanding the main path and derivative paths from the perspective of network topology and semantic correlation. Finally, we constructed a metric for technological junction points to identify the promising fields. [Results] We examined the new method with drone flight control system technology and identified four subfields’ technological evolution paths and branches. We found that pattern recognition, multiprocessor, and data fusion technologies hold promising prospects. [Limitations] Our identification framework does not incorporate the formation mechanism of technological evolution patterns. [Conclusions] The proposed method demonstrates significant advantages in path expansion effectiveness and application versatility.

  • Si Binzhou, Sun Haichun, Wu Yue
    Data Analysis and Knowledge Discovery. 2025, 9(7): 38-51. https://doi.org/10.11925/infotech.2096-3467.2024.0287

    [Objective] This study proposes a research framework for risk analysis of telecom fraud based on large language models (LLMs) and event fusion, revealing the process of telecom fraud and identifying key risk factors. [Methods] We constructed two-stage hierarchical prompt instructions specific to the telecom fraud domain and extracted risk events and their arguments from fraud cases. The framework integrates semantic dependency analysis with template-matching techniques to obtain fraud event chains. Considering the diversity of event descriptions, we employed the BERTopic model for sentence vector representation and utilized a clustering algorithm for event fusion. [Results] Our method achieved F1 scores of 67.41% for event extraction and 73.12% for argument extraction in telecom fraud case analysis. Event clustering identified 10 categories of thematic risk events, with “disclosing information” as the highest-risk behavior. [Limitations] The coarse granularity of police report data limits the framework’s early-warning capabilities. [Conclusions] The proposed approach, combining LLMs with event-fusion clustering, enables the automatic construction of fraud event evolution chains, facilitates risk analysis, and supports the early warning and deterrence of telecom fraud.
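
    The event-fusion step is sketched below with BERTopic clustering short event descriptions; the event strings are invented examples, and real police-report data would be far noisier.

    ```python
    # Sketch: clustering similar risk-event descriptions into fused event types.
    from bertopic import BERTopic

    events = [
        "victim clicks a phishing link in an SMS message",
        "victim opens a fraudulent link sent by text",
        "victim discloses bank card number and password",
        "victim reveals account credentials to the caller",
        "victim transfers money to a 'safe account'",
        "victim wires funds to an account named by the fraudster",
    ] * 5   # repeated here only to give the clusterer enough documents

    topic_model = BERTopic(min_topic_size=5)
    topics, _ = topic_model.fit_transform(events)
    print(topic_model.get_topic_info())   # one row per fused event cluster
    ```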

  • Gao Yuan, Li Chongyang, Qu Boting, Jiao Mengyun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 158-169. https://doi.org/10.11925/infotech.2096-3467.2024.0784
    Abstract (352) PDF (70) HTML (179)

    [Objective] This paper aims to advance research on urban tourism flow network structure and to address the inaccurate point-of-interest (POI) recognition and distorted visiting sequences of current travelogue-based tourist journey reconstruction methods. [Methods] We propose a method based on a large language model for reconstructing tourist journeys and explore the structural characteristics of urban tourism flow networks by combining it with social network analysis. [Results] The proposed method achieves a precision of 94.00% and a recall of 87.78% in POI recognition, significantly outperforming the statistics-based Conditional Random Fields (CRF) method. The reconstructed journeys show a similarity of 83.81% to the actual journeys. [Limitations] The quality of journey reconstruction depends to some extent on the design of the prompts for the large language model. [Conclusions] Taking Xi’an as a case study, the conclusions align with public perception and current research findings, demonstrating the accuracy and versatility of the proposed tourist journey reconstruction method.

  • Shi Xi, Chen Wenjie, Hu Zhengyin, Han Tao, Zhang Kai
    Data Analysis and Knowledge Discovery. 2025, 9(3): 1-15. https://doi.org/10.11925/infotech.2096-3467.2024.0176
    Abstract (340) PDF (233) HTML (286)

    [Objective] This study aims to efficiently extract scientific experiment knowledge and data from academic literature. It constructs a Scientific Experiment Knowledge Graph (SEKG) to provide high-quality data support for knowledge discovery. [Methods] We utilized event knowledge graph technology to uniformly represent and model the complexity, temporality, and integration of knowledge and data in scientific experiments, thereby establishing the schema layer of the SEKG. A large language model was employed to improve the efficiency of knowledge extraction in the data layer, with an empirical analysis conducted on organic solar cells. [Results] Using manual annotation and fine-tuned large language models, we constructed a scientific experiment knowledge graph for the field of organic solar cells, comprising 34 types of nodes and 9 types of relationships, totaling 24,348 nodes and 123,642 relations. [Limitations] The data sources were limited to papers and patents. Constructing the SEKG required substantial manual input from experts, highlighting the need for efficiency improvements. Furthermore, fine-grained research procedures and validation rules in subfields were not considered. [Conclusions] The proposed method provides high-quality data support for applications such as experimental protocol recommendation, scientific experiment evolution analysis, and AI for Science, effectively supporting various knowledge discovery scenarios.

  • Sun Xinxin, Sun Ya’nan, Zhao Yuxiang, Jiang Bin
    Data Analysis and Knowledge Discovery. 2025, 9(7): 104-117. https://doi.org/10.11925/infotech.2096-3467.2024.0633

    [Objective] This study explores the impact of the voice characteristics of AI medical voice assistants on perceived credibility among older adults, based mainly on the Computers Are Social Actors (CASA) paradigm and the stereotype model. [Methods] We conducted a 3 (voice gender: female/male/non-binary) × 2 (communication style: expert/partner) between-subjects experiment to explore the impact of voice gender and communication style of AI medical voice assistants on perceived credibility and intention to use among older adults. Additionally, the study sought to elucidate the mechanism of action through the stereotype dimensions of perceived warmth and perceived professionalism. [Results] Older adults perceived male expert-type and female partner-type AI medical voice assistants as more credible. Communication style influenced their credibility perception of voice gender through perceived professionalism, and this perceived credibility positively predicted their behavioral intention to use such assistants. [Limitations] As this study was conducted within the context of China’s smart healthcare system development, the generalizability of the findings warrants further validation. [Conclusions] Congruence between vocal characteristics and gender-role stereotypes enhances older adults’ perceived credibility. AI medical voice assistant design should account for the interplay of multiple vocal factors and contextual suitability.

  • Zhu Xiang, Zhang Yunqiu, Sun Shaodan, Zhang Liman
    Data Analysis and Knowledge Discovery. 2024, 8(12): 125-135. https://doi.org/10.11925/infotech.2096-3467.2023.0869
    Abstract (328) PDF (116) HTML (206)

    [Objective] This paper proposes a drug knowledge discovery method that fuses meta-path features of a heterogeneous knowledge network to improve the performance of drug knowledge discovery. [Methods] Based on the different meta-paths connecting drug and target entities in the heterogeneous knowledge network, the HeteSim algorithm is used to calculate multi-dimensional semantic similarities between drug and target entities. These meta-path features are fused with drug similarity and target entity similarity features as inputs to machine learning models for drug knowledge discovery. [Results] The drug heterogeneous knowledge network contains 12,015 nodes and 1,895,445 edges. Taking drug-target relation prediction as an example, 21-dimensional HeteSim features between drugs and targets were calculated. The method achieved the highest AUC values across three machine learning models (XGBoost=0.993, RF=0.990, SVM=0.975), and its accuracy, precision, and F-score also exceed those of the two comparison methods. A literature search of 20 prediction results found that some predictions are supported by evidence in earlier studies. [Limitations] Although a PU learning strategy is used to reduce the influence of sample imbalance, some results may still be distorted. [Conclusions] The proposed drug knowledge discovery method is advanced and effective, offering theoretical and methodological reference value.
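
    The prediction stage maps onto a standard supervised setup; below is a sketch with random stand-ins for the 21-dimensional HeteSim features and the similarity features (the PU-learning step is omitted).

    ```python
    # Sketch: meta-path + similarity features feeding an XGBoost link predictor.
    import numpy as np
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    n_pairs = 1000
    hetesim = rng.random((n_pairs, 21))     # stand-in HeteSim meta-path features
    sims = rng.random((n_pairs, 2))         # drug-drug / target-target similarity
    X = np.hstack([hetesim, sims])
    y = rng.integers(0, 2, n_pairs)         # known vs. unlabeled drug-target pairs

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    clf.fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    ```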

  • Hou Jianhua, Deng Xianjiang, Tang Shiqi
    Data Analysis and Knowledge Discovery. 2025, 9(3): 69-82. https://doi.org/10.11925/infotech.2096-3467.2024.0353
    Abstract (327) PDF (119) HTML (243)

    [Objective] This study aims to explore the influence of interdisciplinary knowledge integration on the emergence of high-value patents and to delineate their distinctive characteristics. [Methods] High-value patents are operationalized as patents that receive the China Patent Gold Award. Interdisciplinary knowledge integration is quantified by two dimensions: IPC classification and patent knowledge units. Regression analysis investigates the effects of interdisciplinary knowledge integration, measured by these two dimensions, on both patent award status and individual patent value dimensions. [Results] The analysis reveals that high-value patents tend to exhibit a narrower interdisciplinary scope in terms of IPC classification, while simultaneously demonstrating a more diverse knowledge structure. In particular, interdisciplinary knowledge integration, when indicated by IPC classification, shows an inverted U-shaped relationship with patent value. Conversely, interdisciplinary knowledge integration, when indicated by knowledge units, shows a negative correlation with patent value. [Limitations] This study is limited by its reliance on the China Patent Gold Award as the sole proxy for high-value patents, which may not fully encompass the multifaceted nature of high-value patent characteristics. [Conclusions] This research provides valuable insights into the proactive identification and protection of high-value patents. Furthermore, the findings inform strategies to enhance upstream patent quality control and to facilitate effective patent translation and commercial utilization.

  • Duan Yufeng, Xie Jiahong
    Data Analysis and Knowledge Discovery. 2025, 9(9): 25-36. https://doi.org/10.11925/infotech.2096-3467.2024.0965

    [Objective] This study investigates the performance differences among existing large language models (LLMs) in extracting entities and relations from Chinese medical texts, and analyzes how the number of examples and relation types influences extraction performance. [Methods] Using a prompt-engineering approach, we called 9 mainstream LLMs via their APIs, modifying the prompts along two dimensions: the number of examples and the number of relation types. Experiments were conducted on the CMeIE-V2 dataset to compare extraction performance. [Results] (1) GLM-4-0520 ranked first in comprehensive extraction ability, with F1 scores of 0.4422, 0.3869, and 0.3874 when extracting the three relation types “clinical manifestation”, “medication”, and “etiology”, respectively. (2) When varying the number of examples m in the prompt, the F1 score initially increased with m, reaching a maximum of 0.4742 at m=8 and declining for m>8. (3) As the number of relation types to be extracted, n, increased, the F1 score dropped significantly: at n=2, the F1 score decreased by 0.1182 compared to n=1, and at n=10 it was only 0.2949. [Limitations] Few public datasets are currently available, so the experimental results are based on a single dataset. Additionally, since medical-domain LLMs are difficult to access via API, all models used in this study are general-domain models. [Conclusions] Extraction performance varies greatly among LLMs; a suitable number of examples can improve extraction performance, but more is not always better; and LLMs are not good at extracting multiple relation types at the same time.

  • Meng Xuyang, Wang Hao, Li Yuanqing, Li Yueyan, Deng Sanhong
    Data Analysis and Knowledge Discovery. 2025, 9(9): 1-12. https://doi.org/10.11925/infotech.2096-3467.2024.0914

    [Objective] This paper proposes a paradigm integrating large language models (LLMs) with knowledge graphs (KGs). We aim to address issues such as catastrophic forgetting, poor interpretability of generated content, and excessive demand for data and computational resources in vertical-domain question-answering (QA) systems built on fine-tuned LLMs. [Methods] First, we constructed a fine-grained KG for the traditional Chinese medical text “Treatise on Cold Damage”. Then, we employed a retrieval-augmented generation (RAG) model to incorporate this KG into an LLM through prompt learning to build a QA system. [Results] Compared to baseline models and models fine-tuned on professional data, the proposed system achieved satisfaction rates 14.67 and 1.33 percentage points higher in subjective evaluations. In the objective evaluation, our model’s overall accuracy was 20.00 percentage points higher than the baseline models and 2.00 percentage points lower than the fine-tuned models. [Limitations] The application is limited to the traditional Chinese medicine domain related to the Treatise on Cold Damage. There is also a lack of standardized benchmarks to evaluate the system’s professional capabilities. [Conclusions] The proposed approach enhances the interpretability of generated content from vertical-domain QA systems while substantially reducing the need for data and computational resources.

  • Zhu Danhao, Huang Xiaoyu, Li Yaolin, Wang Dongbo
    Data Analysis and Knowledge Discovery. 2025, 9(6): 35-46. https://doi.org/10.11925/infotech.2096-3467.2024.0555
    Abstract (315) PDF (58) HTML (252)

    [Objective] This study uses large language model technology to automatically summarise legal texts, addressing issues associated with traditional methods such as inadequate handling of lengthy texts and weak logical coherence in summaries. [Methods] We propose a method for automatically summarising legal texts based on domain-specific fine-tuning of large language models. First, a legal text summarisation instruction dataset is constructed. Second, two data augmentation strategies are explored: instruction augmentation and result augmentation. Finally, domain-specific fine-tuning is performed on a pre-trained model, and the results are evaluated along multiple dimensions. [Results] On the CAIL2020 Judicial Summary Dataset, our method achieves improvements of 13.8, 21.3, and 7.4 percentage points in the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores, respectively, compared to the best baseline methods. Both human and automated evaluations further validate the effectiveness of our approach across multiple dimensions. [Limitations] When processing legal texts dense with technical terms and complex logical structures, the generated summaries still lack accuracy of detail and precision with regard to legal provisions. [Conclusions] Fine-tuning large language models for specific domains can effectively improve the quality of legal text summarisation.

  • Jin Qingwen, Li Hurong, Zhang Chen
    Data Analysis and Knowledge Discovery. 2024, 8(12): 101-111. https://doi.org/10.11925/infotech.2096-3467.2023.0892
    Abstract (315) PDF (78) HTML (156)

    [Objective] This study explores the application of the LIME algorithm and its variants in data storytelling, aiming to leverage the explanatory function of data stories. [Methods] We examined the principles, applications, and evolutionary strategies of the LIME algorithm. Based on this theoretical framework, we constructed a data storytelling process assisted by LIME-related algorithms. We collected a partial cat-and-dog recognition dataset from the Kaggle platform and trained an interpretable model on it. Finally, we applied the new data storytelling model to explain image classification performance. [Results] Using an image of a “tabby cat” as the analysis object, the LIME explanation results and the storytelling development curve indicated that the important features affecting the prediction were the M-shaped stripes, black eyes, and pink nose, with the number of key superpixels being 2. [Limitations] Optimization of feature recognition and automated generation of data stories remain challenges. [Conclusions] Applying LIME-related algorithms in data storytelling helps transform model predictions and explanation results into interpretable stories, better communicating data analysis outcomes.
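
    The LIME step itself is reproducible with the lime package; a minimal sketch follows, with a dummy brightness-based classifier standing in for the trained cat/dog model.

    ```python
    # Sketch: a LIME image explanation highlighting the key superpixels.
    import numpy as np
    from lime import lime_image

    def classifier_fn(images):
        """Dummy predictor: P('cat') rises with mean image brightness."""
        p = images.mean(axis=(1, 2, 3)) / 255.0
        return np.stack([1 - p, p], axis=1)

    image = np.random.randint(0, 255, (64, 64, 3), dtype=np.uint8)
    explainer = lime_image.LimeImageExplainer()
    explanation = explainer.explain_instance(
        image, classifier_fn, top_labels=1, num_samples=100)
    label = explanation.top_labels[0]
    _, mask = explanation.get_image_and_mask(
        label, positive_only=True, num_features=2)  # e.g. the 2 key superpixels
    print("superpixels marked as important:", int(mask.sum() > 0))
    ```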

  • Wu Shuai, Yang Xiuzhang, He Lin, Gong Zuoquan
    Data Analysis and Knowledge Discovery. 2024, 8(12): 136-148. https://doi.org/10.11925/infotech.2096-3467.2023.1002
    Abstract (306) PDF (92) HTML (150)

    [Objective] Drawing on the complex sentence-structure features of ancient texts, this study develops a higher-accuracy method for identifying entity words in ancient texts to advance digital humanities research. [Methods] Trigger words and relative words were used as key feature words for identifying entity words, and a sentence pattern template was designed. Based on the characteristics of ancient texts, a BERT-BiLSTM-MHA-CRF model was constructed. Fusing syntactic features with this model enabled deep, fine-grained entity recognition in ancient texts. [Results] The F1 score of this method is 0.88 on the conventional annotated test set, 0.83 on the small-sample annotated test set, and 0.79 (The Book of Songs), 0.81 (Master Lü’s Spring and Autumn Annals), and 0.85 (Discourses of the States) on the transfer learning test sets. [Limitations] In the design of syntactic feature templates, only single ancient books are used as feature templates. Semantic information mining does not take into account the structural features of characters, such as phonetic components and radicals, in ancient texts. [Conclusions] In small-sample annotation and transfer learning experiments, this method achieves accurate named entity recognition of ancient texts, providing high-quality corpus data for digital humanities research.

  • Liu Yan, Zhan Yalan, Jiang Ziheng, Li Jinliang, Yan Zhijun, He Chaocheng
    Data Analysis and Knowledge Discovery. 2025, 9(9): 13-24. https://doi.org/10.11925/infotech.2096-3467.2024.0991

    [Objective] To address the insufficient attention in existing literature to the language style characteristics of rumors and to partially truthful, dual-faced health information, this paper proposes a multimodal online health rumor detection model incorporating language style features (MWDLS: A Multimodal Wide and Deep Model for Online Health Rumor Detection Considering Language Style). [Methods] The MWDLS model leverages Aristotle’s rhetorical theory to extract persuasive language style features (appeals to emotion, logic, and character) and employs a bidirectional cross-modal interaction fusion strategy with a gating mechanism to achieve joint representation learning and classification prediction over shallow language style features and deep semantic features. [Results] We conducted extensive experiments on a real-world dataset from a leading Chinese social media platform and found that MWDLS outperformed the baseline models, improving the F1 score of the target task by up to 11.98 percentage points. Notably, for the health rumor category and the dual-faced health information category, MWDLS increased the F1 scores by up to 16.63 and 11.71 percentage points, respectively. [Limitations] The current model does not examine other modalities, such as video and audio, nor does it incorporate large language models or knowledge-aware mechanisms to enhance early detection of health rumors. [Conclusions] By integrating language style features with multimodal deep semantic features, MWDLS effectively enhances the performance of online health rumor detection.

  • Wang Xiaolun, Yao Qian, Lin Jiahui, Zhao Yuxiang, Sun Zhihao, Lin Xinlan
    Data Analysis and Knowledge Discovery. 2025, 9(1): 55-64. https://doi.org/10.11925/infotech.2096-3467.2024.0098
    Abstract (289) PDF (120) HTML (239)

    [Objective] Based on self-determination theory, this study explores the motivations of service providers to participate in tasks on skill crowdsourcing platforms. [Methods] We retrieved 15,641 bids and 2,385 service provider records from the epwk.com platform. We utilized TF-IDF and BERT to analyze text features and compute motivation variables. Finally, we constructed a negative binomial regression model, as the dependent variables are count variables. [Results] The motivations and behaviors of service providers participating in skill crowdsourcing were significantly correlated at the 1% level (R²=23.10%). Task difficulty improved the model’s explanatory power, negatively moderating competence and reputation (p<0.05) while positively moderating social recognition (p<0.01). [Limitations] Representativeness is limited by the single platform. Future studies could collect data from multiple platforms for comparative validation. External factors such as platform dynamics and policy environments might affect the data and should be considered in future research to deepen the conclusions. [Conclusions] This paper expands the theoretical foundation for service provider participation in crowdsourcing tasks and offers practical insights for service providers, buyers, and platforms.

  • Wen Tingxin, Bai Yunhe
    Data Analysis and Knowledge Discovery. 2024, 8(12): 86-100. https://doi.org/10.11925/infotech.2096-3467.2023.0881
    Abstract (280) PDF (111) HTML (154)

    [Objective] This study proposes an interpretable model for the interaction quality of fake news groups based on RF-GA-XGBoost and SHAP. Our model mitigates the negative impacts of fake news by leveraging the interaction quality of social media user groups and accurately identifies the causes and mechanisms of positive interactions. [Methods] First, we retrieved 500 fake news articles and 7,029 comments from the Weibo21 dataset. Then, we assessed the fake news groups’ interaction quality across three dimensions: content, form, and comment sentiment. Third, we extracted fake news text features from these dimensions. Fourth, we used the sequential forward search strategy of random forest to extract the optimal feature subset of fake news text. We constructed a prediction model for group interaction quality based on GA-XGBoost, and compared its performance with other mainstream machine learning algorithms such as LR, SVM, and XGBoost. Finally, the SHAP model provides causal explanations for the impact of important features on the group interaction quality. [Results] Our model’s F1-score and AUC values are over 86%, outperforming the comparison models across six performance metrics. Additionally, features such as the number of content characters, words, and negative sentiment words in fake news text significantly influence the interaction quality of social media groups. [Limitations] This paper does not conduct multi-feature interaction interpretation analysis or explore the early high-quality group interaction patterns based on timestamps. [Conclusions] The proposed model accurately identifies the ways in which different features impact group interaction quality, providing effective decision-making support for social media platforms to improve their operational strategies and functional designs.
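
    The final SHAP step is sketched below on a synthetic stand-in for the optimal feature subset; the GA hyperparameter search is omitted.

    ```python
    # Sketch: SHAP attributions for a tree model predicting interaction quality.
    import numpy as np
    import shap
    from xgboost import XGBClassifier

    rng = np.random.default_rng(1)
    X = rng.random((300, 5))   # stand-ins: char count, word count, negative words, ...
    y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(0, 0.1, 300) > 0.8).astype(int)

    model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)
    # Mean |SHAP| per feature = its contribution to predicted interaction quality
    print(np.abs(shap_values).mean(axis=0))
    ```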

  • Ye Guanghui, Wang Yujie, Lou Peilin, Zhou Xinghua, Liu Shuyan
    Data Analysis and Knowledge Discovery. 2025, 9(5): 62-76. https://doi.org/10.11925/infotech.2096-3467.2024.0507
    Abstract (280) PDF (74) HTML (212)

    [Objective] Tracking and observing the characteristics of public opinion circulation during emergencies can facilitate effective public opinion guidance, control, and shared governance. [Methods] Using the case study method, we construct a framework for understanding the macroscopic circulation of public opinion in emergencies. Using social network analysis, complemented by empirical research and natural language processing technology, we conduct an in-depth analysis of the circulation patterns of public opinion from a micro perspective, focusing on the dimensions of subjects, objects, and carriers. Validation analyses are conducted using data from public health emergencies. [Results] From a macro perspective, public opinion circulates across Cyber Space, Physical Space and Psychological Space, providing an interdisciplinary analytical framework for understanding and quantifying public behaviors and responses. At the micro level, public opinion circulates among multiple groups, media, events and platforms, exhibiting four effects respectively: homogeneous diffusion and heterogeneous traversal effect, field resonance and field escape effect, co-temporal and ephemeral effect, and amplified resonance and echo difference effect. [Limitations] The dynamics of social network sentiment are not considered. [Conclusions] By summarizing the laws of cross-domain circulation of public opinion from both macroscopic and microscopic perspectives and conducting empirical research linked to specific events, we provide new insights into the study of public opinion communication.

  • Zhang Lanze, Gu Yijun, Peng Jingjie
    Data Analysis and Knowledge Discovery. 2025, 9(1): 65-78. https://doi.org/10.11925/infotech.2096-3467.2023.1009
    Abstract (271) PDF (91) HTML (178)

    [Objective] To enhance the accuracy of graph neural networks in credit fraud detection, this paper introduces topological structure analysis and proposes a graph-based deep fraud detection model integrating prior structural information (PSI-GNN). [Methods] We embedded the attribute information representing the topological structure of central nodes into feature vectors through structural information encoding. Then, we divided the message-passing process into proximal and distal aspects: proximal node information was aggregated with a shallow graph neural network, while distal homophily information was aggregated under the guidance of random-walk structural similarity. Finally, we combined the results of both message-passing processes to obtain node embedding representations. [Results] We evaluated the model on the DGraph-Fin and TFinance datasets, which include fraudulent behaviors. Compared to nine graph neural network models in related fields, the Macro-F1 of PSI-GNN improved by 2.62% and 4.55%, and the AUC by 4.67% and 2.33%, on the two datasets respectively. [Limitations] Processing node structural information incurs significant time overhead. [Conclusions] By modeling the structural attributes and homophily information of credit networks, we can effectively detect credit fraudsters.

  • Cao Kun, Wu Xinnian, Bai Guangzu, Jin Junbao, Zheng Yurong, Li Li
    Data Analysis and Knowledge Discovery. 2025, 9(3): 42-55. https://doi.org/10.11925/infotech.2096-3467.2024.0006
    Abstract (269) PDF (94) HTML (167)

    [Objective] This study explores methods for identifying key core technologies by integrating the textual content characteristics of “science-technology” and complex network relationships. It supports governments, research institutions, and industries in formulating scientific and technological strategies and conducting innovation activities. [Methods] First, we employed the Sentence-BERTopic model to perform deep semantic fusion and knowledge topic clustering on sentence-level paper and patent text corpora. Then, we constructed a “science-technology” knowledge topic complex network based on the citation relationships of these documents. Third, we improved the traditional PageRank algorithm by incorporating node quality characteristics, time decay factors, the weights of incoming node edges, and outdegree. This approach ranked the importance and influence of nodes within the domain. Finally, we identified key core technologies using the head/tail breaks method. [Results] We conducted an empirical study on CNC machine tools and identified 53 key core technologies, including thermal error modeling and compensation, CNC machine tool control technology, and feed systems. A comparison with relevant domestic and international policy plans demonstrates that the identified technologies comprehensively encompass the key core technologies in the field. [Limitations] This study lacks an in-depth analysis of citation locations, motivations, behaviors, and purposes, which may affect identification accuracy. [Conclusions] This study reveals the knowledge structure and topological characteristics of science and technology by constructing a “science-technology” complex network and applying the Key Core Rank (KCR) algorithm. The proposed method achieves fine-grained and precise quantitative identification of key core technologies.
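
    The time-decay idea in the improved PageRank can be sketched with edge weights; the exponential decay form and toy graph below are assumptions, and the paper's full KCR algorithm additionally weights node quality and out-degree.

    ```python
    # Sketch: PageRank over a citation graph with time-decayed edge weights.
    import math
    import networkx as nx

    G = nx.DiGraph()
    # (citing topic, cited topic, years since the citation was made)
    for u, v, age in [("A", "B", 1), ("A", "C", 6), ("B", "C", 2),
                      ("C", "A", 4), ("B", "A", 3)]:
        G.add_edge(u, v, weight=math.exp(-0.2 * age))  # newer citations count more

    scores = nx.pagerank(G, alpha=0.85, weight="weight")
    for node, s in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(node, round(s, 3))
    ```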

  • Hai Jiali, Wang Run, Yuan Liangzhi, Zhang Kairui, Deng Wenping, Xiao Yong, Zhou Tao, Chang Kai
    Data Analysis and Knowledge Discovery. 2025, 9(7): 165-174. https://doi.org/10.11925/infotech.2096-3467.2024.0747

    [Objective] This paper constructs a retrieval-augmented question-answering (QA) system for Traditional Chinese Medicine (TCM) standards, aiming to provide efficient standard knowledge services and promote the research and application of TCM standardization. [Methods] After comparing the performance of large language models such as BaiChuan, Gemma, and Qwen, we chose GPT-3.5 as the base model. We then combined data optimization with retrieval-augmented generation to develop a QA system with semantic analysis, contextual association, and answer-generation capabilities. [Results] On a TCM literature-based question-generation dataset, the system achieved answer-relevance precision, recall, and F1 scores of 0.879, 0.839, and 0.857, and contextual-relevance scores of 0.838, 0.869, and 0.853, respectively. On a TCM standards QA dataset, it achieved answer-relevance scores of 0.871, 0.836, and 0.853, all outperforming the baseline models. [Limitations] The system's intent-recognition accuracy requires further improvement, and the scale and granularity of the TCM standards knowledge base need to be expanded and refined. [Conclusions] In response to the practical needs of TCM knowledge services, this study developed a retrieval-augmented QA system for TCM standards. The system effectively answers questions about clinical guidelines, herbal medicine standards, and information standards, covering treatment principles, syndrome classification, therapeutic methods, and technical specifications, demonstrating strong practicality and feasibility.
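
    The retrieval-augmented flow can be outlined in a few lines: embed the question, retrieve the closest standard passages, and assemble a grounded prompt. The sketch below substitutes a deterministic hash embedding and a plain template for the system's real encoder and GPT-3.5 call; all names and documents are invented for illustration.

```python
# Toy retrieval-augmented prompt construction (illustrative stand-ins only).
import hashlib
import numpy as np

def toy_embed(text, dim=64):
    """Deterministic bag-of-words hash embedding (not a real encoder)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        v[int(hashlib.md5(tok.encode()).hexdigest(), 16) % dim] += 1.0
    n = np.linalg.norm(v)
    return v / n if n else v

def retrieve(question, docs, k=2):
    """Return the k documents most similar to the question by cosine score."""
    q = toy_embed(question)
    scores = [float(q @ toy_embed(d)) for d in docs]
    return [docs[i] for i in np.argsort(scores)[::-1][:k]]

def build_prompt(question, passages):
    """Assemble a grounded prompt that an LLM would then answer."""
    ctx = "\n".join(f"- {p}" for p in passages)
    return (f"Answer using only the reference passages below.\n"
            f"References:\n{ctx}\nQuestion: {question}\nAnswer:")

docs = ["Syndrome classification follows the cold-heat framework ...",
        "Herbal decoction dosage standards specify ...",
        "Acupuncture point nomenclature is defined in ..."]
q = "How are syndromes classified?"
print(build_prompt(q, retrieve(q, docs)))
```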

  • Chen Jing, Cao Zhixun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 1-13. https://doi.org/10.11925/infotech.2096-3467.2024.0446
    Abstract (251) PDF (115) HTML (177)   Knowledge map   Save

    [Objective] Using the Traditional Chinese Medicine (TCM) Q&A domain as a case study, this paper analyses how unstructured knowledge, exemplified by knowledge base resources, and structured knowledge, exemplified by knowledge graph resources, differ in combating hallucinations in large language models. Based on these findings, strategies for improving the hallucination resistance of large language models in vertical domains are discussed. [Methods] The study designs experiments that combine external knowledge with prompt engineering to compare the prompting effects of knowledge base and knowledge graph resources in the TCM Q&A domain. It also investigates the benefits of dynamic triplet strategies and integrated fine-tuning strategies for optimising large language models against hallucinations. [Results] Compared with prompts built from unstructured knowledge base content, prompts built from structured knowledge graph triples perform better in precision, recall and F1 score, improving by 1.9%, 2.42% and 2.2% to reach 71.44%, 60.76% and 65.31%, respectively. Further analysis shows that combining the dynamic triplet strategy with fine-tuning has the strongest anti-hallucination effect, achieving precision, recall and F1 scores of 72.47%, 65.87% and 68.62%, respectively. [Limitations] The study was tested only in the TCM Q&A field, and its generalisability needs to be validated across a wider range of domains. [Conclusions] In the TCM field, structured knowledge from knowledge graphs outperforms traditional unstructured knowledge in reducing hallucinations and improving the accuracy of model responses, highlighting the critical role of structured knowledge in enhancing model comprehension. Integrating fine-tuning strategies with knowledge resources offers an effective way to improve large language model performance. The paper thus provides a theoretical rationale and methodological support for integrating external knowledge into large language models.
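
    The dynamic triplet idea can be illustrated simply: pick knowledge graph triples whose entities appear in the question and inject them into the prompt, so the model answers from structured facts rather than free recall. The mini-KG, substring matcher, and template below are assumptions, not the paper's implementation.

```python
# Toy dynamic-triplet prompt construction from a miniature knowledge graph.
KG = [
    ("ginseng", "tonifies", "qi"),
    ("ginseng", "contraindicated_with", "hellebore"),
    ("ephedra", "treats", "wind-cold exterior syndrome"),
]

def select_triples(question, kg):
    """Keep only triples whose head or tail entity appears in the question."""
    q = question.lower()
    return [t for t in kg if t[0] in q or t[2] in q]

def triple_prompt(question, kg):
    """Inject the selected triples as grounded context for the model."""
    facts = "\n".join(f"({h}, {r}, {t})" for h, r, t in select_triples(question, kg))
    return (f"Known facts:\n{facts or '(none)'}\n"
            f"Answer using only the facts above; say 'unknown' if insufficient.\n"
            f"Q: {question}\nA:")

print(triple_prompt("What does ginseng tonify?", KG))
```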

  • Dan Zhiping, Li Lin, Yu Xiaosheng, Lu Yujie, Li Bitao
    Data Analysis and Knowledge Discovery. 2025, 9(9): 102-113. https://doi.org/10.11925/infotech.2096-3467.2024.0957

    [Objective] Chinese hate speech that contains no overtly malicious words is difficult to identify effectively; to address this, a Chinese hate speech detection method integrating multi-dimensional sentiment features (RMSF) is proposed. [Methods] First, the RoBERTa model extracts both character- and sentence-level features from the input text, while sentiment dictionaries provide multi-dimensional sentiment attributes. The character and sentiment features are then concatenated and fed into a BiLSTM network to capture deeper contextual semantic information. The BiLSTM output is subsequently concatenated with the sentence-level RoBERTa features, processed by a multilayer perceptron, and classified with the Softmax function. To address class imbalance, the focal loss function is applied during model optimization, improving the discrimination of hate speech. [Results] On the TOXICN dataset, the RMSF method achieves precision, recall, and F1 scores of 82.63%, 82.41%, and 82.45%, respectively; on the COLDataset, it achieves 82.94%, 82.96%, and 82.85%. Compared with existing approaches, RMSF improves F1 scores by 1.85% and 1.09% on the respective datasets. [Limitations] The method relies on sentiment lexicons, so the extracted sentiment features are constrained by the lexicons' coverage and semantic granularity. [Conclusions] The experimental findings indicate that incorporating multi-dimensional sentiment features into Chinese hate speech detection models significantly enhances detection performance.
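
    The focal loss used here for class imbalance follows the standard formulation FL = -α(1 - p_t)^γ · log(p_t), which down-weights easy examples. A generic PyTorch version is shown below; it is the textbook form, not necessarily the paper's exact code.

```python
# Standard multi-class focal loss (generic formulation).
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """logits: (N, C) raw scores; targets: (N,) class indices."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log p_t
    pt = log_pt.exp()
    return (-alpha * (1 - pt) ** gamma * log_pt).mean()  # easy cases shrink

logits = torch.tensor([[2.0, 0.5], [0.2, 1.5]])
targets = torch.tensor([0, 1])
print(focal_loss(logits, targets))
```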

  • An Lu, Zheng Yajing
    Data Analysis and Knowledge Discovery. 2024, 8(12): 1-17. https://doi.org/10.11925/infotech.2096-3467.2023.0974
    Abstract (221) PDF (125) HTML (155)   Knowledge map   Save

    [Objective] This study explores the mechanism of social consensus formation in the context of public emergencies. It proposes methods for identifying and measuring consensus and identifies important factors that influence consensus formation, providing theoretical and methodological support for relevant departments to formulate effective information dissemination strategies and guide the evolution of public opinion. [Methods] The study takes microblog data on a barbecue restaurant incident in a city as its data source and combines topic modeling, sentiment analysis, and triplet extraction to explore users' opinions. The degree of consensus among individuals is calculated from opinion consistency and emotional consistency. Drawing on information ecology theory, feature variables are constructed along the dimensions of information people, information, and information environment, and a consensus-degree prediction model is established. The performance of four machine learning models is compared, and the SHapley Additive exPlanations (SHAP) technique is used to explain the best model. [Results] The MSE (1176.9550) and R-squared (0.6753) of the CatBoostRegressor model were superior to those of the other three models. The top five factors in the feature-importance ranking show that the proportion of people with higher education, the age gap, and the number of people with firm views are significantly negatively correlated with the degree of group consensus, while similarity of social network structure is significantly positively correlated with it. The impact of the feature variables varies by topic. [Limitations] Social consensus includes intragroup and intergroup consensus. This article focuses only on consensus within groups; the evolution of viewpoints and consensus-formation mechanisms between groups can be studied in the future. [Conclusions] This article proposes a method for identifying and measuring social consensus based on the combination of viewpoint consistency and emotional consistency. Real social media data are used for viewpoint mining and consensus recognition, revealing key factors that influence the formation of social consensus.
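
    The explanation step pairs two widely used libraries: fit a CatBoost regressor, then rank features by mean absolute SHAP value. A minimal sketch follows, assuming synthetic data and invented feature names that loosely echo the factors named above.

```python
# Sketch: CatBoost regression explained with SHAP (synthetic toy data).
import numpy as np
import shap
from catboost import CatBoostRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))                       # toy feature matrix
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)
names = ["education_ratio", "age_gap", "firm_view_count", "net_similarity"]

model = CatBoostRegressor(iterations=200, depth=4, verbose=False)
model.fit(X, y)

explainer = shap.TreeExplainer(model)               # tree-model SHAP values
shap_values = explainer.shap_values(X)
mean_abs = np.abs(shap_values).mean(axis=0)         # global importance ranking
for name, v in sorted(zip(names, mean_abs), key=lambda p: -p[1]):
    print(f"{name}: {v:.3f}")
```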

  • Siriguleng, Lin Min, Guo Zhendong, Zhang Shujun
    Data Analysis and Knowledge Discovery. 2025, 9(3): 147-160. https://doi.org/10.11925/infotech.2096-3467.2024.0325
    Abstract (215) PDF (88) HTML (178)   Knowledge map   Save

    [Objective] This study addresses the challenges of inefficient fine-tuning and suboptimal extraction performance in deep learning-based entity-relation extraction for ancient texts in low-resource scenarios, which mainly stem from dependency on large-scale annotated data. [Methods] We propose a joint extraction framework combining prompt learning and extractive machine reading comprehension (MRC). First, entity recognition and relation extraction tasks are unified into an MRC framework to streamline model architecture. Second, three lightweight prompt strategies are designed using domain-specific knowledge to reduce task complexity. Finally, we develop MPG-GP, a joint extraction model integrating a pre-trained language model with a global pointer network, to effectively extract etiquette entity-relation triples from ancient texts. [Results] Experiments on a custom ancient etiquette entity-relation extraction dataset show F1-score improvements of 0.32% to 6.05% over baseline methods. [Limitations] The prompt templates employ fixed patterns rather than learnable soft prompts, and the prompt engineering design warrants further refinement. [Conclusions] Our approach mitigates reliance on large annotated datasets while improving the accuracy of few-shot joint entity-relation extraction for ancient ritual texts, providing a novel solution for information extraction in low-resource historical documents.
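
    Casting extraction as MRC means each relation type becomes a templated question whose answer span is the tail entity. The toy sketch below substitutes regex span matching for the model's global pointer head; the templates, relation names, and passage are invented for illustration.

```python
# Toy MRC-style relation extraction with templated prompts.
import re

PROMPTS = {
    "performs_rite": "In the passage, what rite does {head} perform?",
    "located_at": "In the passage, where does {head} stand?",
}

def mrc_extract(passage, head, relation, candidate_patterns):
    """Build the templated question, then 'answer' it by matching candidate
    spans (a stand-in for a pre-trained MRC model with a pointer head)."""
    question = PROMPTS[relation].format(head=head)
    for pat in candidate_patterns:
        m = re.search(pat, passage)
        if m:
            return question, (head, relation, m.group())
    return question, None

passage = "The duke performs the capping rite in the eastern chamber."
q, triple = mrc_extract(passage, "the duke", "performs_rite",
                        [r"capping rite", r"libation rite"])
print(q)
print(triple)   # ('the duke', 'performs_rite', 'capping rite')
```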

  • Wu Yifan, Ma Songjie, Li Shuqing
    Data Analysis and Knowledge Discovery. 2025, 9(7): 1-14. https://doi.org/10.11925/infotech.2096-3467.2024.0916

    [Objective] This paper aims to achieve more accurate recommendation by perceiving the popularity preferences of users and their friends towards items. [Methods] It proposes an item popularity calculation method that integrates contribution and influence. An attention mechanism and a recurrent neural network are used to capture user popularity-preference representations, while a convolutional neural network and a graph attention mechanism are used to obtain friends' long-term and short-term popularity preferences. [Results] Comparative experiments on the Douban, Delicious, and Yelp datasets show that the method outperforms the second-best model, DGRec, with Recall@20 increasing by up to 13.03% and NDCG by up to 11.69%. Compared with traditional calculation methods, the proposed popularity calculation increases Recall@20 by up to 11.53% and NDCG by up to 10.29%. [Limitations] The method's performance on short sequences still needs improvement. [Conclusions] By adding user popularity-preference and social popularity-preference representations, the method strengthens the weighting of each interaction and can effectively recommend more long-tail items.
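
    The contribution-plus-influence idea can be sketched as a weighted blend of how much an item is interacted with and how influential the interacting users are. The mixing rule, log damping, and influence proxy below are assumptions for illustration only.

```python
# Toy item-popularity score blending contribution and influence.
import numpy as np

def item_popularity(interactions, user_influence, beta=0.5):
    """interactions: {item: [user ids]}; user_influence: per-user weights."""
    pops = {}
    for item, users in interactions.items():
        contribution = np.log1p(len(users))        # damped interaction count
        influence = np.mean([user_influence[u] for u in users])
        pops[item] = beta * contribution + (1 - beta) * influence
    return pops

interactions = {"film_a": [0, 1, 2, 3], "film_b": [2], "film_c": [0, 3]}
user_influence = np.array([0.9, 0.2, 0.5, 1.4])    # e.g. follower-weighted
print(item_popularity(interactions, user_influence))
```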