
Top access

  • Wang Zhenyu, Zhu Xuefang, Yang Rui
    Data Analysis and Knowledge Discovery. 2025, 9(1): 90-99. https://doi.org/10.11925/infotech.2096-3467.2023.1273
    Abstract (1041) PDF (195) HTML (480)

    [Objective] This paper utilizes large language models (LLMs) to generate high-quality auxiliary knowledge, aiming to improve the performance of multimodal relation extraction. [Methods] We introduced a multimodal similarity detection module to construct multimodal prompt templates, which allow the LLM to integrate visual information and prior knowledge into the generated auxiliary knowledge. We combined the obtained auxiliary knowledge with the original text and fed it into downstream text models to accurately predict entity relationships. [Results] The proposed model outperformed the best baseline model on the MNRE dataset, improving accuracy and F1 score by 4.09% and 7.84%, respectively. [Limitations] We only examined the proposed model on English datasets. [Conclusions] Comparative experiments and case studies validate the model’s effectiveness in multimodal relation extraction. Our model points to a direction for applying LLMs to multimodal information extraction tasks in the future.
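
    A rough sketch of the multimodal prompt-template idea described above: an image caption is folded into the prompt only when a similarity check deems it relevant, so an LLM can emit auxiliary knowledge for a downstream relation extractor. The function, threshold, and wording are illustrative assumptions, not the paper's implementation.

    ```python
    # Hypothetical prompt builder: visual context passes through a similarity gate.
    def build_prompt(text: str, image_caption: str, similarity: float,
                     threshold: float = 0.5) -> str:
        visual_part = f"Image context: {image_caption}\n" if similarity >= threshold else ""
        return (
            "Generate background knowledge that helps identify the relation "
            "between the entities mentioned in the sentence.\n"
            f"{visual_part}Sentence: {text}\nAuxiliary knowledge:"
        )

    print(build_prompt("JFK greeted Obama at the airport.", "two men shaking hands", 0.72))
    ```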

  • Chen Ting, Ding Honghao, Zhou Haoyu, Wu Jiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 159-171. https://doi.org/10.11925/infotech.2096-3467.2023.1424
    Abstract (1000) PDF (134) HTML (871)

    [Objective] This study explores the impacts of bullet-screen (danmu) content and behavioral characteristics on consumers’ purchasing behavior in live-streaming e-commerce, as well as the moderating effect of host-product relevance. [Methods] First, guided by the Elaboration Likelihood Model, we retrieved bullet-screen data from the Douyin platform and consumer data from the Huitun platform. Then, we studied the impacts of bullet-screen content characteristics (central route) and behavioral characteristics (peripheral route) on consumer purchasing behavior with text mining and zero-inflated negative binomial regression, and examined the moderating effect of host-product relevance with grouped regression. [Results] Information richness, degree of social interaction, and the number of bullet-screen comments positively impact purchasing behavior. The emotional polarity of bullet-screen comments exhibits an inverted U-shaped effect on purchasing behavior. Compared with live-streaming rooms with low host-product relevance, those with high host-product relevance show broader positive impacts on purchasing behavior. [Limitations] We only investigated bullet-screen data from a single live-streaming e-commerce platform. [Conclusions] This study examines the factors influencing consumers’ actual purchasing behavior from the perspective of bullet-screen comments. It provides insights for improving communication between merchants and consumers in live-streaming e-commerce, ultimately enhancing sales performance.
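
    Because purchase counts in such data are mostly zeros, the zero-inflated negative binomial setup lends itself to a short sketch with statsmodels. The synthetic variables below are placeholders for the real text-mined features, a minimal illustration rather than the study's specification.

    ```python
    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "info_richness": rng.normal(size=200),       # central-route feature (stand-in)
        "interaction_degree": rng.normal(size=200),  # peripheral-route feature (stand-in)
        "purchases": rng.poisson(0.3, size=200),     # zero-heavy count outcome
    })
    X = sm.add_constant(df[["info_richness", "interaction_degree"]])
    # The inflation part models "always-zero" viewers separately from the count part.
    model = ZeroInflatedNegativeBinomialP(df["purchases"], X,
                                          exog_infl=sm.add_constant(df[["info_richness"]]))
    print(model.fit(disp=0).summary())
    ```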

  • Song Mengpeng, Bai Haiyan
    Data Analysis and Knowledge Discovery. 2025, 9(6): 21-34. https://doi.org/10.11925/infotech.2096-3467.2024.0628
    Abstract (930) PDF (90) HTML (667)

    [Objective] This paper aims to automatically generate structured literature reviews with references, helping researchers quickly grasp a specific area of scientific knowledge. [Methods] A corpus was constructed by selecting 70,000 papers from the NSTL platform and identifying moves in the abstracts. A fine-tuning dataset was built by generating 3,000 reviews with a large language model and revising them manually, and the GLM3-6B model was fine-tuned on it. The corpus was then converted into high-dimensional vectors and stored in an index, which served as LangChain’s external knowledge base for retrieval. To address the poor retrieval of proper nouns, a hybrid search with BM25 followed by re-ranking was used to improve retrieval accuracy. [Results] The literature review generation system built on fine-tuning and the hybrid retrieval framework improved BLEU and ROUGE scores by 109.64% and 40.22%, respectively, and raised the authenticity score in manual evaluation by 62.17%. [Limitations] Due to limited computational resources, the local model is small in parameter scale, and its generation ability needs further improvement. [Conclusions] Retrieval-augmented generation with large language models not only produces high-quality literature reviews but also provides traceable evidence for the generated content and assists researchers in intelligent reading.
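
    The hybrid retrieval step reads well as a sketch: BM25 lexical scores (which handle proper nouns well) are blended with dense similarity scores, and documents are re-ranked by the combined value. The stand-in dense scores below replace real embedding-model similarities, and the 0.5/0.5 weighting is an assumption.

    ```python
    import numpy as np
    from rank_bm25 import BM25Okapi

    docs = ["GLM3-6B fine-tuning for review generation",
            "BM25 keyword retrieval basics",
            "vector stores in LangChain"]
    query = "GLM3-6B review generation"

    bm25 = BM25Okapi([d.lower().split() for d in docs])
    lexical = np.array(bm25.get_scores(query.lower().split()))
    dense = np.array([0.82, 0.10, 0.35])  # stand-in cosine similarities

    def normalize(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)

    combined = 0.5 * normalize(lexical) + 0.5 * normalize(dense)
    print([docs[i] for i in np.argsort(-combined)])  # re-ranked document order
    ```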

  • Zhang Jing, Gao Zixin, Ding Weijie
    Data Analysis and Knowledge Discovery. 2025, 9(2): 48-58. https://doi.org/10.11925/infotech.2096-3467.2023.1347
    Abstract (694) PDF (118) HTML (221)

    [Objective] This paper proposes a new model to effectively classify massive police reports. [Methods] We constructed a text classification model based on BERT-DPCNN, using the BERT pre-trained model to generate word vectors. The model improves classification performance by optimizing the activation function in the DPCNN component and adopting a dynamic learning rate. [Results] We conducted comparative experiments between BERT-DPCNN and six other models: BERT, BERT-CNN, BERT-RCNN, BERT-RNN, BERT-LSTM, and ERNIE. BERT-DPCNN achieved the best accuracy, recall, and precision. In the binary classification tasks, its accuracy exceeded 98%; in the eleven-category tasks, it exceeded 82%. [Limitations] The model has many parameters, and the limited number of experiments calls for further testing. [Conclusions] The new model effectively improves the accuracy of police report classification, providing data support for police departments in analyzing and assessing police incidents.
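
    For orientation, a DPCNN-style head over BERT token vectors looks roughly like the sketch below: a region convolution, then repeated pool-by-2 blocks with residual convolutions, with GELU standing in for the activation-function tweak the abstract mentions. Hyperparameters are illustrative, and the convolution weights are shared across pyramid levels for brevity.

    ```python
    import torch
    import torch.nn as nn

    class DPCNNHead(nn.Module):
        def __init__(self, hidden=768, channels=250, num_classes=11):
            super().__init__()
            self.region = nn.Conv1d(hidden, channels, kernel_size=3, padding=1)
            self.conv = nn.Sequential(
                nn.GELU(), nn.Conv1d(channels, channels, 3, padding=1),
                nn.GELU(), nn.Conv1d(channels, channels, 3, padding=1),
            )
            self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)
            self.fc = nn.Linear(channels, num_classes)

        def forward(self, x):                 # x: (batch, seq_len, hidden) from BERT
            x = self.region(x.transpose(1, 2))
            x = x + self.conv(x)              # residual block at full resolution
            while x.size(2) > 1:              # pyramid: halve length, then residual conv
                x = self.pool(x)
                x = x + self.conv(x)
            return self.fc(x.squeeze(2))

    print(DPCNNHead()(torch.randn(2, 128, 768)).shape)  # torch.Size([2, 11])
    ```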

  • Sun Wenju, Li Qingyong, Zhang Jing, Wang Danyu, Wang Wen, Geng Yangli’ao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 1-30. https://doi.org/10.11925/infotech.2096-3467.2024.0508
    Abstract (634) PDF (199) HTML (557)

    [Objective] This study comprehensively reviews advancements in deep incremental learning from the perspective of addressing catastrophic forgetting, aiming to provide references for the research community. [Coverage] Using search terms such as “Incremental Learning”, “Continual Learning”, and “Catastrophic Forgetting”, we retrieved literature from the Web of Science, Google Scholar, DBLP, and CNKI. After reading and organizing the retrieved literature, we selected 105 representative publications. [Methods] The paper begins by defining incremental learning and outlining its problem formulation and inherent challenges. We then categorize incremental learning methods into regularization-based, memory-based, and dynamic architecture-based approaches, and review their theoretical underpinnings, advantages, and disadvantages in detail. [Results] We evaluated classical and recent methods in a unified experimental setting. The results demonstrate that regularization-based methods are efficient in application but cannot fully avoid forgetting; memory-based methods are significantly affected by the number of retained exemplars; and dynamic architecture-based methods effectively prevent forgetting but incur additional computational costs. [Limitations] The scope of this review is limited to deep learning approaches, excluding traditional machine learning techniques. [Conclusions] Under optimal conditions, memory-based and dynamic architecture-based strategies tend to outperform regularization-based approaches. However, the increased complexity of these methods may hinder their practical application. Furthermore, current incremental learning methods still underperform joint training, marking a critical direction for future research.
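
    To make the regularization-based family concrete, here is a minimal EWC-style penalty, one representative of that category: parameters that a diagonal Fisher estimate marks as important for earlier tasks are anchored by a quadratic term while a new task is trained. The model, Fisher values, and loss below are toy assumptions.

    ```python
    import torch

    def ewc_penalty(model, fisher, old_params, lam=100.0):
        # lam/2 * sum_i F_i * (theta_i - theta_i_old)^2
        loss = torch.zeros(())
        for name, p in model.named_parameters():
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
        return 0.5 * lam * loss

    net = torch.nn.Linear(4, 2)
    old_params = {n: p.detach().clone() for n, p in net.named_parameters()}
    fisher = {n: torch.ones_like(p) for n, p in net.named_parameters()}  # stand-in Fisher
    task_loss = net(torch.randn(8, 4)).pow(2).mean()  # placeholder new-task loss
    total = task_loss + ewc_penalty(net, fisher, old_params)
    total.backward()
    print(float(total))
    ```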

  • Zhang Le, Chen Yansong, Zhang Leihan
    Data Analysis and Knowledge Discovery. 2025, 9(8): 47-58. https://doi.org/10.11925/infotech.2096-3467.2024.0625

    [Objective] This paper proposes a method that enhances features using large language models and integrates them through multi-level cross-fusion. It addresses the issue in multimodal sentiment analysis where emotional expressions across different modalities are inconsistent, hindering effective collaborative sentiment decision-making. [Methods] To alleviate the conflicting sentiment information among modalities and improve the representation of sentiment features, we used a multimodal large language model to extract auxiliary sentiment information within each modality. Then, we employed a hierarchical cross-attention mechanism to learn shared emotional features across modalities while mining auxiliary intra-modal emotional features, thereby enhancing the expression of shared semantic sentiment. During the fusion phase, a modality-attention weighted fusion method is introduced to balance the contributions of shared and auxiliary features. Additionally, we utilized a loss function combining multimodal and unimodal inputs to address the sentiment semantic inconsistencies. [Results] The proposed model outperforms baselines on the public datasets CH-SIMS and CMU-MOSI. On CH-SIMS, binary classification accuracy and F1 score increased by 1.77 and 0.63 percentage points, respectively. On CMU-MOSI, improvements of 0.43 and 0.41 percentage points were observed. For CH-SIMS data with emotional inconsistency, binary classification accuracy and F1 score increased by 1.80 and 1.72 percentage points, respectively. This demonstrates that the proposed model can effectively address the issue of inconsistent sentiment semantics across modalities. [Limitations] The model does not account for the impact of personalized information on individuals in videos. [Conclusions] The proposed approach effectively integrates multimodal features using a hierarchical cross-attention mechanism, improves the representation of shared semantic sentiment, and addresses inconsistencies in emotional semantics across different modalities.
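
    A single cross-attention hop of the kind the hierarchical mechanism stacks can be sketched in a few lines: text features act as queries over another modality's keys and values so shared emotional features surface. Dimensions and inputs are illustrative, not the paper's configuration.

    ```python
    import torch
    import torch.nn as nn

    attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
    text = torch.randn(2, 10, 64)     # query modality: text token features
    visual = torch.randn(2, 20, 64)   # key/value modality: visual frame features
    shared, _ = attn(text, visual, visual)
    print(shared.shape)               # torch.Size([2, 10, 64])
    ```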

  • Rang Yuchen, Ma Jing
    Data Analysis and Knowledge Discovery. 2025, 9(1): 100-109. https://doi.org/10.11925/infotech.2096-3467.2023.1130
    Abstract (613) PDF (116) HTML (469)

    [Objective] To reduce inter-modal differences and strengthen the correlation between modalities, this paper proposes a multimodal alignment sentiment analysis model to accurately capture the sentiment tendencies embedded in multimodal data. [Methods] For the textual modality, the original text, supplemented with image captions, is processed by the RoBERTa pre-trained model for text feature extraction. We used the CLIP Vision Model to extract features for the image modality. The text and image features are aligned through a multimodal alignment layer based on a Multimodal Transformer to obtain enhanced fused features. Finally, the fused multimodal features are fed into a multilayer perceptron for sentiment recognition and classification. [Results] The proposed model achieved an accuracy of 71.78% and an F1 score of 68.97% on the MVSA-Multiple dataset, improvements of 1.78% and 0.07%, respectively, over the best-performing baseline model. [Limitations] The model’s performance was not validated on additional datasets. [Conclusions] The proposed model effectively promotes inter-modal fusion, achieves better fused representations, and enhances sentiment analysis.

  • Chen Wanzhi, Hou Yue
    Data Analysis and Knowledge Discovery. 2025, 9(7): 52-65. https://doi.org/10.11925/infotech.2096-3467.2024.0720

    [Objective] To address the issues in multimodal sentiment analysis, such as insufficient multimodal feature extraction, semantic differences between modalities, and lack of interaction, we propose a temporal multimodal sentiment analysis model that integrates multi-level attention and sentiment scale vectors. [Methods] Firstly, we introduced a scalar Long Short-Term Memory network with a multi-head attention mechanism to construct a deep temporal feature modeling network for extracting rich contextual temporal features from text, audio, and visual modalities. Secondly, we employed the text-guided dual-layer cross-modal attention mechanism and the improved self-attention mechanism to facilitate the deep information exchange across modalities, thereby generating two sentiment scale vectors for sentiment intensity and polarity. Finally, the L1 norm of the sentiment intensity vector was multiplied by the normalized sentiment polarity vector to obtain a comprehensive representation of sentiment strength and polarity, thereby enabling accurate sentiment prediction. [Results] Experiments on the CMU-MOSI dataset show that the proposed model achieves good results in both comparative and ablation experiments, outperforming the next-best model by 1.2 and 2.3 percentage points on the Acc7 and Corr metrics, respectively. On the CMU-MOSEI dataset, the proposed model surpasses baseline models across all evaluation metrics, achieving 86.0% in Acc2 and 86.1% in F1 score. [Limitations] Sentiment expression is highly context-dependent, and the sources of sentiment cues may vary across different scenarios. The proposed model may perform poorly when textual information is insufficient. [Conclusions] The proposed model effectively extracts contextual temporal features from various modalities and leverages the rich emotional information in the text modality for deep inter-modal interaction, thereby enhancing the accuracy of sentiment prediction.
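
    The final combination step described above reduces to a one-line tensor operation: the L1 norm of the intensity vector supplies a magnitude that scales a normalized polarity vector. A toy sketch with made-up values:

    ```python
    import torch
    import torch.nn.functional as F

    intensity = torch.tensor([[0.4, -0.1, 0.3]])          # sentiment intensity vector
    polarity = torch.tensor([[0.9, -0.2, 0.1]])           # sentiment polarity vector

    strength = intensity.norm(p=1, dim=-1, keepdim=True)  # L1 norm -> scalar strength
    direction = F.normalize(polarity, p=2, dim=-1)        # normalized polarity
    print(strength * direction)                           # signed, strength-scaled output
    ```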

  • Shen Si, Feng Shuyang, Wu Na, Zhao Zhixiao
    Data Analysis and Knowledge Discovery. 2025, 9(9): 37-48. https://doi.org/10.11925/infotech.2096-3467.2024.0670

    [Objective] This paper aims to enhance the utilization efficiency of governmental information resources and advance the intelligent transformation of public services by addressing the inherent knowledge limitations of general LLMs when processing policy texts. We investigate the effectiveness of a RAG framework for constructing a more precise and reliable intelligent policy Q&A system. [Methods] This paper proposes a retrieval-augmented generation framework based on the Chinese policy large language model ChpoGPT. Specifically, the framework retrieves policy documents semantically similar to the user’s query from a knowledge base and combines the retrieved results with ChpoGPT to enhance the model’s capabilities on downstream tasks. [Results] Experimental results demonstrate that our framework significantly outperforms existing models on key metrics. The ChpoGPT-based framework achieved a factuality score of nearly 90%. In answer relevance, it scored 80.2%, outperforming the Gemini-1.0-pro model by 2.1%. Furthermore, it attained an answer semantic similarity score of 56.4%, surpassing the ERNIE 4.0 and Gemini-1.0-pro models by 4.1% and 2.8%, respectively. [Limitations] The language model still exhibits some uncontrollable behavior in its answer output. [Conclusions] Retrieval-augmented generation over policy texts based on LLMs has reference value for the intelligent transformation of government services, but it still needs further improvement and optimization.

  • Song Donghuan, Hu Maodi, Ding Jielan, Qu Zihao, Chang Zhijun, Qian Li
    Data Analysis and Knowledge Discovery. 2025, 9(2): 12-25. https://doi.org/10.11925/infotech.2096-3467.2023.0885
    Abstract (531) PDF (183) HTML (351)

    [Objective] This study addresses the issue of low classification accuracy in conventional text classification tasks due to factors such as sparse domain-specific training data and significant differences between types. [Methods] We constructed a novel classification model based on the BERT-DPCNN-MMOE framework, integrating the deep pyramid convolutional networks with the multi-gate control unit mechanism. Then, we designed multi-task and transfer learning experiments to validate the effectiveness of the new model against eight well-established and innovative models. [Results] This research independently constructed cross-type multi-task data as the basis for training and testing. The BERT-DPCNN-MMOE model outperformed the other eight baseline models in multi-task and transfer learning experiments, with F1 score improvements exceeding 4.7%. [Limitations] Further research is needed to explore the model’s adaptability to other domains. [Conclusions] The BERT-DPCNN-MMOE model performs better in multi-task and cross-type text classification tasks. It is of significance for future specialized intelligence classification tasks.

  • Feng Ran, Chen Danlei, Hua Bolin
    Data Analysis and Knowledge Discovery. 2025, 9(5): 19-32. https://doi.org/10.11925/infotech.2096-3467.2024.0533
    Abstract (522) PDF (128) HTML (425)

    [Objective] This paper comprehensively reviews text augmentation methods to reveal their current state of development and trends. [Coverage] Using “textual data augmentation” and “text augmentation” as search terms, we retrieved literature from Web of Science, Google Scholar, and CNKI, and screened out 88 representative papers for review. [Methods] Text augmentation methods were categorized and summarized according to the objects they operate on, the details of their implementation, and the diversity of generated results. On this basis, we thoroughly compared the various methods with regard to their granularity, strengths, weaknesses, and applications. [Results] Text augmentation approaches divide into text space-based methods and vector space-based methods. The former is intuitive and easily interpretable but may compromise the overall semantic structure of the text, while the latter can directly manipulate semantic features but incurs higher computational complexity. Current studies frequently require external knowledge resources, such as heuristic guidelines and task-specific data. Moreover, introducing deep learning algorithms can enhance the novelty and diversity of generated data. [Limitations] We primarily offer a systematic examination of the technical principles and performance characteristics of advanced methods, without quantitatively assessing the maturity of platform tools. Moreover, the analysis is grounded in the literature we selected and may not cover every application scenario of text augmentation. [Conclusions] Future work should enrich and refine the evaluation metrics for text augmentation techniques and increase their robustness across downstream tasks through prompt learning. Retrieval-augmented generation and graph neural networks deserve attention for addressing the challenges posed by lengthy texts and limited resources, which can further unlock the potential of text augmentation in natural language processing.
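
    The two families contrasted above can be put side by side in miniature: a text-space edit (a random word swap, intuitive but potentially syntax-disturbing) and a vector-space perturbation (Gaussian noise on an embedding). Both functions are toy sketches, not methods from the surveyed papers.

    ```python
    import random
    import numpy as np

    def random_swap(sentence: str, n_swaps: int = 1, seed: int = 0) -> str:
        # Text-space augmentation: exchange two randomly chosen words.
        rng = random.Random(seed)
        words = sentence.split()
        for _ in range(n_swaps):
            i, j = rng.sample(range(len(words)), 2)
            words[i], words[j] = words[j], words[i]
        return " ".join(words)

    def vector_noise(embedding: np.ndarray, sigma: float = 0.05) -> np.ndarray:
        # Vector-space augmentation: jitter the representation directly.
        return embedding + np.random.default_rng(0).normal(0.0, sigma, embedding.shape)

    print(random_swap("text augmentation enlarges small training sets"))
    print(vector_noise(np.ones(4)))
    ```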

  • Wang Zitong, Li Chenliang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 94-105. https://doi.org/10.11925/infotech.2096-3467.2023.1305
    Abstract (505) PDF (240) HTML (236)

    [Objective] To more flexibly capture the spatial-temporal features of traffic flow data and achieve more accurate multivariate traffic flow prediction, this paper proposes a Position-Aware Spatial-Temporal Graph Convolutional Network (PASTGCN). [Methods] First, the traffic data’s spatial and periodic temporal position features are represented as explicit position embeddings. Then, based on the spatiotemporal convolutional structure, we incorporated spatial information into the temporal convolutional network for space-aware sequence modeling. Finally, we used static and dynamic dual graph learning methods to capture spatial dependencies. [Results] We conducted experiments on two real-world traffic flow datasets. The PASTGCN model effectively predicted multivariate traffic flows and reduced errors by up to 1.59% compared to existing deep learning models. [Limitations] The experimental datasets are limited, and the proposed graph learning method increased the time complexity. [Conclusions] The PASTGCN model can effectively utilize spatial-temporal position information to achieve more accurate traffic flow prediction.

  • Liu Yu, Zeng Ziming, Sun Shouqiang
    Data Analysis and Knowledge Discovery. 2025, 9(8): 20-31. https://doi.org/10.11925/infotech.2096-3467.2024.0870

    [Objective] This paper addresses semantic shift in multi-aspect sentences and implicit sentiment in aspect-based sentiment analysis. To this end, it proposes a model based on sentiment enhancement using large language models and graph convolutional neural networks. [Methods] The model uses prompt learning to guide large language models in generating sentiment-enhanced representations of aspect semantics. It then constructs an aspect-semantic, sentiment-knowledge-enhanced graph. Additionally, the paper presents a sentiment-target position weighting algorithm to filter irrelevant information from the syntactic dependency graph. It also introduces aspect masking and gated filtering mechanisms to fully integrate semantic information and accurately identify the sentiment tendency of each aspect. [Results] The proposed model performs well across the experimental datasets. On the Restaurant dataset, it is slightly less accurate than two baseline models but still achieves an F1 score of 81.60%. On the Laptop, Twitter, and MAMS datasets, it significantly improves F1 scores by 1.79, 1.17, and 3.02 percentage points, respectively, over the optimal baseline model. [Limitations] The role of visual information in aspect-level sentiment analysis is not considered, and experiments are only conducted on English datasets. [Conclusions] By leveraging prompt learning to guide large language models in generating sentiment representation words and combining them with graph neural networks, the approach provides an effective and efficient solution for aspect-level sentiment analysis, significantly improving its accuracy on text.

  • Si Binzhou, Sun Haichun, Wu Yue
    Data Analysis and Knowledge Discovery. 2025, 9(7): 38-51. https://doi.org/10.11925/infotech.2096-3467.2024.0287

    [Objective] This study proposes a research framework for risk analysis of telecom fraud based on large language models (LLMs) and event fusion, aiming to reveal the process of telecom fraud and identify key risk factors. [Methods] We constructed a two-stage hierarchical prompt instruction specific to the telecom fraud domain and extracted risk events and their arguments from fraud cases. The framework integrates semantic dependency analysis with template-matching techniques to obtain fraud event chains. Considering the diversity of event descriptions, we employed the BERTopic model for sentence vector representation and a clustering algorithm for event fusion. [Results] Our method achieved F1-scores of 67.41% for event extraction and 73.12% for argument extraction in telecom fraud case analysis. Event clustering identified 10 categories of thematic risk events, with “disclosing information” as the highest-risk behavior. [Limitations] The coarse granularity of police report data limits the framework’s early-warning capabilities. [Conclusions] The proposed approach, combining LLMs with event-fusion clustering, enables the automatic construction of fraud event evolution chains, facilitates risk analysis, and supports the early warning and deterrence of telecom fraud.
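
    The event-fusion step can be pictured as clustering sentence vectors so that paraphrased event descriptions collapse into one thematic risk event. The sketch below uses hand-made 2-D vectors as stand-ins for BERTopic/SBERT embeddings and KMeans as a stand-in clustering algorithm.

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    events = ["clicked phishing link", "opened fraudulent URL",
              "disclosed card number", "revealed account password"]
    vectors = np.array([[0.90, 0.10], [0.88, 0.12],   # "malicious link" region
                        [0.10, 0.95], [0.12, 0.90]])  # "information disclosure" region

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)
    for event, label in zip(events, labels):
        print(label, event)   # paraphrases land in the same cluster
    ```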

  • Duan Yufeng, Xie Jiahong
    Data Analysis and Knowledge Discovery. 2025, 9(9): 25-36. https://doi.org/10.11925/infotech.2096-3467.2024.0965

    [Objective] This study investigates the performance differences among existing large language models (LLMs) in extracting entities and relations from Chinese medical text, and analyzes how the number of examples and relation types influences extraction performance. [Methods] Using a prompt engineering approach, we called 9 mainstream LLMs via their APIs, modifying the prompt along two dimensions: the number of examples and the number of relation types. Experiments were conducted on the CMeIE-V2 dataset to compare extraction performance. [Results] (Ⅰ) GLM-4-0520 ranked first in comprehensive extraction ability, with F1 scores of 0.4422, 0.3869, and 0.3874 when extracting the three relation types “clinical manifestation”, “medication”, and “etiology”, respectively. (Ⅱ) When varying the number of examples m in the prompt, the F1 score initially increased with m, reaching a maximum of 0.4742 at m=8, and declined when m>8. (Ⅲ) As the number of relation types to be extracted, n, increased, the F1 score dropped significantly: at n=2 the F1 score decreased by 0.1182 compared with n=1, and at n=10 the F1 score was only 0.2949. [Limitations] Few public datasets are currently available, so the experimental results are based on a single dataset. Additionally, since medical-domain LLMs are difficult to access via API, all models used in this study are from the general domain. [Conclusions] Extraction performance varies greatly among LLMs; a suitable number of examples can improve extraction performance, but more is not always better; and LLMs are not good at extracting multiple relation types at the same time.
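
    The dimension varied in (Ⅱ), the number of in-context examples m, amounts to truncating a shot list when building the prompt. A schematic builder with placeholder examples (not the CMeIE-V2 schema):

    ```python
    def build_fewshot_prompt(task: str, examples: list[tuple[str, str]],
                             m: int, query: str) -> str:
        # Keep only the first m demonstrations in the prompt.
        shots = "\n\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples[:m])
        return f"{task}\n\n{shots}\n\nInput: {query}\nOutput:"

    demo = [("Aspirin relieves headache.",
             '("Aspirin", "medication", "headache")')] * 10
    print(build_fewshot_prompt("Extract (subject, relation, object) triples.",
                               demo, m=8, query="Insulin treats diabetes."))
    ```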

  • Liu Yan, Zhan Yalan, Jiang Ziheng, Li Jinliang, Yan Zhijun, He Chaocheng
    Data Analysis and Knowledge Discovery. 2025, 9(9): 13-24. https://doi.org/10.11925/infotech.2096-3467.2024.0991

    [Objective] To address the insufficient attention in existing literature to the language style characteristics of rumors and to partially truthful, dual-faced health information, this paper proposes a multimodal online health rumor detection model incorporating language style features (MWDLS: A Multimodal Wide and Deep Model for Online Health Rumor Detection Considering Language Style). [Methods] The MWDLS model draws on Aristotle’s rhetorical theory to extract persuasive language style features (appeals to emotion, logic, and character) and employs a bidirectional cross-modal interaction fusion strategy with a gating mechanism to achieve joint representation learning and classification over shallow language style features and deep semantic features. [Results] We conducted extensive experiments on a real-world dataset from a leading Chinese social media platform and found that MWDLS outperformed the baseline models, improving the F1 score of the target task by up to 11.98 percentage points. Notably, for the health rumor category and the dual-faced health information category, MWDLS increased the F1 scores by up to 16.63 and 11.71 percentage points, respectively. [Limitations] The current model does not examine other modalities, such as video and audio, nor does it incorporate large language models or knowledge-aware mechanisms to enhance early detection of health rumors. [Conclusions] By integrating language style features with multimodal deep semantic features, MWDLS effectively enhances the performance of online health rumor detection.

  • Meng Xuyang, Wang Hao, Li Yuanqing, Li Yueyan, Deng Sanhong
    Data Analysis and Knowledge Discovery. 2025, 9(9): 1-12. https://doi.org/10.11925/infotech.2096-3467.2024.0914

    [Objective] This paper proposes a paradigm integrating large language models (LLMs) with knowledge graphs (KGs). We aim to address issues such as catastrophic forgetting, poor interpretability of generated content, and excessive demand for data and computational resources in vertical-domain question-answering (QA) systems built on fine-tuned LLMs. [Methods] First, we constructed a fine-grained KG for the traditional Chinese medical text “Treatise on Cold Damage”. Then, we employed retrieval-augmented generation (RAG) to incorporate this KG into an LLM through prompt learning and built a QA system on top of it. [Results] In subjective evaluations, the proposed system’s satisfaction rate was 14.67 percentage points higher than the baseline models’ and 1.33 percentage points higher than that of models fine-tuned with professional data. In the objective evaluation, its overall accuracy was 20.00 percentage points higher than the baseline models and 2.00 percentage points lower than the fine-tuned models. [Limitations] The application is limited to the traditional Chinese medicine domain related to the Treatise on Cold Damage. There is also a lack of standardized benchmarks to evaluate the system’s professional capabilities. [Conclusions] The proposed approach enhances the interpretability of generated content from vertical-domain QA systems while substantially reducing the need for data and computational resources.
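
    The KG-into-prompt step can be sketched as serializing retrieved triples into the context of a RAG prompt. The triple and wording below are invented placeholders, not content from the actual knowledge graph.

    ```python
    def kg_prompt(question: str, triples: list[tuple[str, str, str]]) -> str:
        # Serialize retrieved KG triples as plain-text facts for the LLM context.
        facts = "\n".join(f"- {h} --{r}--> {t}" for h, r, t in triples)
        return f"Answer using only the facts below.\nFacts:\n{facts}\nQuestion: {question}\nAnswer:"

    print(kg_prompt("Which decoction treats Taiyang syndrome?",
                    [("Guizhi Decoction", "treats", "Taiyang syndrome")]))
    ```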

  • Gao Yuan, Li Chongyang, Qu Boting, Jiao Mengyun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 158-169. https://doi.org/10.11925/infotech.2096-3467.2024.0784
    Abstract (392) PDF (74) HTML (209)

    [Objective] This paper aims to advance research on the structure of urban tourism flow networks and to address the issues of inaccurate point-of-interest recognition and distorted visiting sequences in current travelogue-based tourist journey reconstruction methods. [Methods] We propose a method based on a large language model for reconstructing tourist journeys and explore the structural characteristics of urban tourism flow networks by combining it with social network analysis. [Results] The proposed method achieves a precision of 94.00% and a recall of 87.78% in POI recognition, significantly outperforming the statistics-based Conditional Random Fields (CRF) method. The reconstructed journeys show a similarity of 83.81% to the actual journeys. [Limitations] The quality of tourist journey reconstruction depends, to a certain extent, on the design of the prompts for the large language model. [Conclusions] Taking Xi’an as a case study, the conclusions align with public perception and current research findings, demonstrating the accuracy and versatility of the proposed tourist journey reconstruction method.

  • Zhai Dongsheng, Zhai Liang, Liang Guoqiang, Zhao Kai
    Data Analysis and Knowledge Discovery. 2025, 9(2): 120-133. https://doi.org/10.11925/infotech.2096-3467.2023.1277
    Abstract (391) PDF (128) HTML (271)

    [Objective] This study proposes a method for identifying technological evolution paths and explores key technologies and branches in specific domains. It aims to reveal the evolution trajectories of technology. [Methods] Firstly, we devised an unsupervised graph embedding model to integrate patent structural relationships, text and node information propagation, and aggregated knowledge into multi-dimensional semantic vectors. This approach expanded the technological paths while improving community division effectiveness. Secondly, we proposed methods for expanding the main path and derivative paths from the perspective of network topology and semantic correlation. Finally, we constructed a metric for technological junction points to identify the promising fields. [Results] We examined the new method with drone flight control system technology and identified four subfields’ technological evolution paths and branches. We found that pattern recognition, multiprocessor, and data fusion technologies hold promising prospects. [Limitations] Our identification framework does not incorporate the formation mechanism of technological evolution patterns. [Conclusions] The proposed method demonstrates significant advantages in path expansion effectiveness and application versatility.

  • Sun Xinxin, Sun Ya’nan, Zhao Yuxiang, Jiang Bin
    Data Analysis and Knowledge Discovery. 2025, 9(7): 104-117. https://doi.org/10.11925/infotech.2096-3467.2024.0633

    [Objective] This study explores the impact of the voice characteristics of AI medical voice assistants on perceived credibility among older adults, based mainly on the Computers Are Social Actors (CASA) paradigm and the stereotype model. [Methods] We conducted a 3 (voice gender: female/male/non-binary) × 2 (communication style: expert/partner) between-subjects experiment to explore the impact of the voice gender and communication style of AI medical voice assistants on perceived credibility and intention to use among older adults. Additionally, the study sought to elucidate the mechanism of action on the stereotype dimensions of perceived warmth and perceived professionalism. [Results] Older adults perceive male expert-type and female partner-type AI medical voice assistants as more credible. Communication style influenced their credibility perception of voice gender through perceived professionalism, and this perceived credibility positively predicted their behavioral intention to use such assistants. [Limitations] As this study was conducted within the context of China’s smart healthcare system development, the generalizability of the findings warrants further validation. [Conclusions] Congruence between vocal characteristics and gender-role stereotypes enhanced older adults’ perceived credibility. AI medical voice assistant design should account for the interplay of multiple vocal factors and contextual suitability.

  • Shen Yangtai, Qi Jianglei, Ding Hao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 145-153. https://doi.org/10.11925/infotech.2096-3467.2023.0808
    Abstract (372) PDF (92) HTML (306)

    [Objective] This paper proposes a latent non-negative factorization topic recommendation model based on LDA and transfer learning to improve recommendation accuracy in sparse data scenarios. The new model aims to address the data sparsity issue in publication recommendations. [Methods] We used non-negative matrix factorization to fill the high-dimensional sparse matrix of non-negative data. Then, we constructed a latent topic model based on LDA and non-negative matrix factorization, fully considering the thematic distribution characteristics of user reviews. Additionally, we applied different dimensions of user information to rating prediction to mitigate data sparsity. Finally, we introduced a transfer learning mechanism to extract and transfer parameters from models pre-trained on related publication categories. This mechanism assisted feature learning on the target model’s data and improved recommendation effectiveness for less popular publications. [Results] We conducted comparative experiments against three baseline methods on three publication datasets. The proposed model achieved an average precision, F1 score, and NDCG of 0.7732, 0.7085, and 0.7468, respectively, surpassing the overall performance of the baseline models. [Limitations] When the number of users in the system is too small, other methods are needed for cold-start situations. [Conclusions] The proposed method generalizes well over user interest features, alleviates popularity bias and data sparsity, and effectively improves the accuracy of publication recommendations.
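
    The matrix-filling step rests on non-negative matrix factorization: the sparse rating matrix is approximated as a product of two non-negative factors, and the reconstruction supplies estimates for the unrated cells. A minimal sketch with toy ratings; note that vanilla NMF treats the 0s as observed values, and masking missing entries is the refinement a real recommender needs.

    ```python
    import numpy as np
    from sklearn.decomposition import NMF

    R = np.array([[5, 3, 0, 1],
                  [4, 0, 0, 1],
                  [1, 1, 0, 5],
                  [0, 1, 5, 4]], dtype=float)   # 0 = unrated

    model = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
    W = model.fit_transform(R)                   # user factors
    H = model.components_                        # item factors
    print(np.round(W @ H, 2))                    # filled-in rating estimates
    ```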

  • Hou Jianhua, Deng Xianjiang, Tang Shiqi
    Data Analysis and Knowledge Discovery. 2025, 9(3): 69-82. https://doi.org/10.11925/infotech.2096-3467.2024.0353
    Abstract (362) PDF (126) HTML (274)

    [Objective] This study aims to explore the influence of interdisciplinary knowledge integration on the emergence of high-value patents and to delineate their distinctive characteristics. [Methods] High-value patents are operationalized as patents that receive the China Patent Gold Award. Interdisciplinary knowledge integration is quantified by two dimensions: IPC classification and patent knowledge units. Regression analysis investigates the effects of interdisciplinary knowledge integration, measured by these two dimensions, on both patent award status and individual patent value dimensions. [Results] The analysis reveals that high-value patents tend to exhibit a narrower interdisciplinary scope in terms of IPC classification, while simultaneously demonstrating a more diverse knowledge structure. In particular, interdisciplinary knowledge integration, when indicated by IPC classification, shows an inverted U-shaped relationship with patent value. Conversely, interdisciplinary knowledge integration, when indicated by knowledge units, shows a negative correlation with patent value. [Limitations] This study is limited by its reliance on the China Patent Gold Award as the sole proxy for high-value patents, which may not fully encompass the multifaceted nature of high-value patent characteristics. [Conclusions] This research provides valuable insights into the proactive identification and protection of high-value patents. Furthermore, the findings inform strategies to enhance upstream patent quality control and to facilitate effective patent translation and commercial utilization.

  • Shi Xi, Chen Wenjie, Hu Zhengyin, Han Tao, Zhang Kai
    Data Analysis and Knowledge Discovery. 2025, 9(3): 1-15. https://doi.org/10.11925/infotech.2096-3467.2024.0176
    Abstract (360) PDF (240) HTML (300)

    [Objective] This study aims to efficiently extract scientific experiment knowledge and data from academic literature. It constructs a Scientific Experiment Knowledge Graph (SEKG) to provide high-quality data support for knowledge discovery. [Methods] We utilized event knowledge graph technology to uniformly represent and model the complexity, temporality, and integration of knowledge and data in scientific experiments, thereby establishing the schema layer of the SEKG. A large language model was employed to enhance the efficiency of knowledge extraction in the data layer, with an empirical analysis conducted on organic solar cells. [Results] Using manual annotation and fine-tuned large language models, we constructed a scientific experiment knowledge graph for the field of organic solar cells. This SEKG comprises 34 types of nodes and 9 types of relationships, totaling 24,348 nodes and 123,642 relations. [Limitations] The data sources were limited to papers and patents. Constructing the SEKG required substantial manual input from experts, highlighting the need for efficiency improvements. Furthermore, fine-grained research procedures and validation rules in subfields were not considered. [Conclusions] The proposed method provides high-quality data support for applications such as experimental protocol recommendation, scientific experiment evolution analysis, and AI for Science, effectively supporting various knowledge discovery scenarios.

  • Zhu Danhao, Huang Xiaoyu, Li Yaolin, Wang Dongbo
    Data Analysis and Knowledge Discovery. 2025, 9(6): 35-46. https://doi.org/10.11925/infotech.2096-3467.2024.0555
    Abstract (336) PDF (63) HTML (267)

    [Objective] This study uses large language model technology to automatically summarise legal texts, addressing issues associated with traditional methods such as inadequate handling of lengthy texts and weak logical coherence in summaries. [Methods] This study proposes a method for automatically summarising legal texts based on domain-specific fine-tuning of large language models. Firstly, a legal text summarisation instruction dataset is constructed. Secondly, two data augmentation strategies are explored: instruction augmentation and result augmentation. Finally, domain-specific fine-tuning is performed on a pre-trained model and the results are evaluated along multiple dimensions. [Results] On the CAIL2020 Judicial Summary Dataset, our method achieves improvements of 13.8, 21.3, and 7.4 percentage points in the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores, respectively, compared to the best baseline methods. Both human and automated evaluations further validate the effectiveness of our approach across multiple dimensions. [Limitations] When processing legal texts dense with technical terms and complex logical structures, the generated summaries still lack accuracy of detail and precision with regard to legal provisions. [Conclusions] Fine-tuning large language models for specific domains can effectively improve the quality of legal text summarisation.

  • Wang Xiaolun, Yao Qian, Lin Jiahui, Zhao Yuxiang, Sun Zhihao, Lin Xinlan
    Data Analysis and Knowledge Discovery. 2025, 9(1): 55-64. https://doi.org/10.11925/infotech.2096-3467.2024.0098
    Abstract (328) PDF (135) HTML (269)

    [Objective] Based on self-determination theory, this study explores the motivations of service providers to participate in tasks on skill crowdsourcing platforms. [Methods] We retrieved 15,641 bids and 2,385 service provider records from the epwk.com platform. We utilized TF-IDF and BERT to analyze text features and compute motivation variables. Finally, we constructed a negative binomial regression model, treating the dependent variables as count variables. [Results] The motivations and behaviors of service providers participating in skill crowdsourcing were significantly correlated at the 1% level (R²=23.10%). Task difficulty improved the model’s explanatory power, negatively moderating competence and reputation (p<0.05) while positively moderating social recognition (p<0.01). [Limitations] Representativeness is limited to a single platform; future studies could collect data from multiple platforms for comparative validation. External factors such as platform dynamics and policy environments might interfere with the data and should be considered in future research to deepen the conclusions. [Conclusions] This paper expands the theoretical foundation for service provider participation in crowdsourcing tasks and offers practical insights for service providers, buyers, and platforms.

  • Ye Guanghui, Wang Yujie, Lou Peilin, Zhou Xinghua, Liu Shuyan
    Data Analysis and Knowledge Discovery. 2025, 9(5): 62-76. https://doi.org/10.11925/infotech.2096-3467.2024.0507
    Abstract (301) PDF (75) HTML (228)

    [Objective] Tracking and observing the characteristics of public opinion circulation during emergencies can facilitate effective public opinion guidance, control, and shared governance. [Methods] Using the case study method, we construct a framework for understanding the macroscopic circulation of public opinion in emergencies. Using social network analysis, complemented by empirical research and natural language processing technology, we conduct an in-depth analysis of the circulation patterns of public opinion from a micro perspective, focusing on the dimensions of subjects, objects, and carriers. Validation analyses are conducted using data from public health emergencies. [Results] From a macro perspective, public opinion circulates across Cyber Space, Physical Space and Psychological Space, providing an interdisciplinary analytical framework for understanding and quantifying public behaviors and responses. At the micro level, public opinion circulates among multiple groups, media, events and platforms, exhibiting four effects respectively: homogeneous diffusion and heterogeneous traversal effect, field resonance and field escape effect, co-temporal and ephemeral effect, and amplified resonance and echo difference effect. [Limitations] The dynamics of social network sentiment are not considered. [Conclusions] By summarizing the laws of cross-domain circulation of public opinion from both macroscopic and microscopic perspectives and conducting empirical research linked to specific events, we provide new insights into the study of public opinion communication.

  • Dan Zhiping, Li Lin, Yu Xiaosheng, Lu Yujie, Li Bitao
    Data Analysis and Knowledge Discovery. 2025, 9(9): 102-113. https://doi.org/10.11925/infotech.2096-3467.2024.0957

    [Objective] Because Chinese hate speech that contains no overtly malicious words is difficult to identify effectively, this paper proposes a Chinese hate speech detection method integrating multi-dimensional sentiment features (RMSF). [Methods] Firstly, the RoBERTa model is used to extract both character- and sentence-level features from the input text, while sentiment dictionaries are used to derive multi-dimensional sentiment attributes. The character and sentiment features are then concatenated and fed into a BiLSTM network to capture deeper contextual semantic information. Subsequently, the output of the BiLSTM is concatenated with the sentence-level features from RoBERTa, processed through a multilayer perceptron, and classified with the Softmax function. To address class imbalance, the focal loss function is applied during model optimization, improving the accurate discrimination of hate speech. [Results] On the TOXICN dataset, RMSF achieves precision, recall, and F1 scores of 82.63%, 82.41%, and 82.45%, respectively. On the COLDataset, it achieves 82.94%, 82.96%, and 82.85%, respectively. Compared with existing approaches, RMSF yields F1 score gains of 1.85% and 1.09% on the respective datasets. [Limitations] The method relies on tools such as sentiment lexicons, so the extraction of sentiment features is constrained by a lexicon’s coverage and semantic granularity. [Conclusions] The experimental findings indicate that incorporating multi-dimensional sentiment features into Chinese hate speech detection models can significantly enhance detection performance.
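
    The focal loss used to counter class imbalance down-weights the cross-entropy of easy examples by a factor (1 - p_t)^gamma. A compact sketch; the gamma and alpha values are common defaults rather than the paper's settings.

    ```python
    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0, alpha=0.25):
        ce = F.cross_entropy(logits, targets, reduction="none")
        p_t = torch.exp(-ce)                        # probability of the true class
        return (alpha * (1.0 - p_t) ** gamma * ce).mean()

    print(focal_loss(torch.randn(8, 2), torch.randint(0, 2, (8,))))
    ```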

  • Hai Jiali, Wang Run, Yuan Liangzhi, Zhang Kairui, Deng Wenping, Xiao Yong, Zhou Tao, Chang Kai
    Data Analysis and Knowledge Discovery. 2025, 9(7): 165-174. https://doi.org/10.11925/infotech.2096-3467.2024.0747

    [Objective] This paper constructs a retrieval-augmented question-answering (QA) system for Traditional Chinese Medicine (TCM) standards, aiming to provide efficient standard knowledge services and promote the research and application of TCM standardization. [Methods] By comparing the performance of large language models such as BaiChuan, Gemma, and Qwen, we chose GPT-3.5 as the base model. Then, we combined data optimization and retrieval-augmented generation to develop a QA system with semantic analysis, contextual association, and answer-generation capabilities. [Results] On a TCM literature-based question generation dataset, the new system achieved answer relevance precision, recall, and F1 scores of 0.879, 0.839 and 0.857, respectively, as well as contextual relevance scores of 0.838, 0.869, and 0.853. On a TCM standards QA dataset, the system achieved answer relevance scores of 0.871, 0.836 and 0.853, all outperforming baseline models. [Limitations] The system’s intent recognition accuracy still requires further improvement. The scale and granularity of the TCM standards knowledge base need to be expanded and refined. [Conclusions] In response to the practical needs of TCM knowledge services, this study developed a retrieval-augmented QA system for TCM standards. The system can effectively answer various questions related to clinical guidelines, herbal medicine standards, and information standards, covering topics such as treatment principles, syndrome classification, therapeutic methods, and technical specifications, demonstrating its strong practicality and feasibility.

  • Cao Kun, Wu Xinnian, Bai Guangzu, Jin Junbao, Zheng Yurong, Li Li
    Data Analysis and Knowledge Discovery. 2025, 9(3): 42-55. https://doi.org/10.11925/infotech.2096-3467.2024.0006
    Abstract (290) PDF (97) HTML (180)

    [Objective] This study explores methods for identifying key core technologies by integrating the textual content characteristics of “science-technology” and complex network relationships. It supports governments, research institutions, and industries in formulating scientific and technological strategies and conducting innovation activities. [Methods] First, we employed the Sentence-BERTopic model to perform deep semantic fusion and knowledge topic clustering on sentence-level paper and patent text corpora. Then, we constructed a “science-technology” knowledge topic complex network based on the citation relationships of these documents. Third, we improved the traditional PageRank algorithm by incorporating node quality characteristics, time decay factors, the weights of incoming node edges, and outdegree. This approach ranked the importance and influence of nodes within the domain. Finally, we identified key core technologies using the head/tail break method. [Results] We conducted an empirical study on CNC machine tools and identified 53 key core technologies, including thermal error modeling and compensation, CNC machine tools control technology, and feed systems. A comparison with relevant domestic and international policy plans demonstrates that the identified technologies comprehensively encompass the key core technologies in the field. [Limitations] This study lacks an in-depth analysis of citation locations, motivations, behaviors, and purposes, which may affect identification accuracy. [Conclusions] This study reveals the knowledge structure and topological characteristics of science and technology by constructing a “science-technology” complex network and applying the Key Core Rank (KCR) algorithm. The proposed method achieves fine-grained and precise quantitative identification of key core technologies.
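
    The node-ranking step can be approximated with a biased PageRank: per-node quality and a time-decay factor form a personalization vector over the citation network, and edge weights enter the walk. The graph, scores, and decay constant below are illustrative assumptions, not the KCR algorithm itself.

    ```python
    import math
    import networkx as nx

    G = nx.DiGraph()
    G.add_weighted_edges_from([("A", "B", 1.0), ("B", "C", 2.0), ("A", "C", 1.0)])

    quality = {"A": 0.9, "B": 0.5, "C": 0.7}   # stand-in node quality scores
    age = {"A": 6, "B": 2, "C": 1}             # years since publication
    bias = {n: quality[n] * math.exp(-0.2 * age[n]) for n in G}  # time-decayed quality

    ranks = nx.pagerank(G, personalization=bias, weight="weight")
    print(sorted(ranks.items(), key=lambda kv: -kv[1]))
    ```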

  • Zhang Lanze, Gu Yijun, Peng Jingjie
    Data Analysis and Knowledge Discovery. 2025, 9(1): 65-78. https://doi.org/10.11925/infotech.2096-3467.2023.1009
    Abstract (284) PDF (92) HTML (186)

    [Objective] To enhance the accuracy of graph neural networks in credit fraud detection, this paper introduces topological structure analysis and proposes a graph-based deep fraud detection model integrating prior structural information (PSI-GNN). [Methods] We embedded the attribute information representing the topological structure of central nodes into feature vectors through structural information encoding. We then divided the message-passing process into proximal and distal aspects: proximal node information was aggregated with a shallow graph neural network, and distal homophily information was aggregated under the guidance of random-walk structural similarity. Finally, we combined the results of both message-passing stages to obtain node embedding representations. [Results] We evaluated the model on the DGraph-Fin and TFinance fraud datasets. Compared with nine related graph neural network models, PSI-GNN improved Macro-F1 and AUC by 2.62% and 4.55% on DGraph-Fin and by 4.67% and 2.33% on TFinance, respectively. [Limitations] Processing node structural information incurs significant time overhead. [Conclusions] By modeling the structural attributes and homophily information of credit networks, we can effectively detect credit fraudsters.

  • Chen Jing, Cao Zhixun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 1-13. https://doi.org/10.11925/infotech.2096-3467.2024.0446
    Abstract (279) PDF (126) HTML (200)

    [Objective] This paper aims to analyse the differences in combating hallucinations in large language models between unstructured knowledge, exemplified by knowledge base resources, and structured knowledge, exemplified by knowledge graph resources, using the Traditional Chinese Medicine (TCM) Q&A domain as a case study. Based on these findings, strategies for improving the ability of large language models to combat hallucinations in vertical domains are discussed. [Methods] The study designs experiments using external knowledge combined with prompt engineering techniques to analyse the differences in prompt effects between knowledge base resources and knowledge graph resources in the TCM Q&A domain. It also investigates the superiority of dynamic triplet strategies and integrated fine-tuning strategies in optimising large language models against hallucinations. [Results] Experimental results show that compared to prompts from unstructured knowledge in the knowledge base, prompts from structured knowledge in the knowledge graph perform better in terms of precision, recall and F1 score, improving by 1.9%, 2.42% and 2.2% respectively to reach 71.44%, 60.76% and 65.31%. Further analysis of the optimisation strategies shows that the combination of the dynamic triplet strategy and fine-tuning had the best effect against hallucinations, achieving precision, recall and F1 scores of 72.47%, 65.87% and 68.62% respectively. [Limitations] This study is limited to a single field, as it was only tested in the field of Traditional Chinese Medicine Q&A, and its generalisability needs to be validated in a wider range of scientific fields. [Conclusions] This study has demonstrated that in the field of Traditional Chinese Medicine, structured knowledge from knowledge graphs outperforms traditional unstructured knowledge in reducing hallucinations and improving the accuracy of model responses. It demonstrates the critical role of structured knowledge in enhancing model comprehension skills. The integration of fine-tuning strategies with knowledge resources provides an effective way to improve performance in large language models. This paper provides a theoretical rationale and methodological support for integrating external knowledge into large language models to improve knowledge performance.

  • Siriguleng, Lin Min, Guo Zhendong, Zhang Shujun
    Data Analysis and Knowledge Discovery. 2025, 9(3): 147-160. https://doi.org/10.11925/infotech.2096-3467.2024.0325
    Abstract (239) PDF (91) HTML (192)

    [Objective] This study addresses the challenges of inefficient fine-tuning and suboptimal extraction performance in deep learning-based entity-relation extraction for ancient texts in low-resource scenarios, which mainly stem from dependency on large-scale annotated data. [Methods] We propose a joint extraction framework combining prompt learning and extractive machine reading comprehension (MRC). First, entity recognition and relation extraction tasks are unified into an MRC framework to streamline model architecture. Second, three lightweight prompt strategies are designed using domain-specific knowledge to reduce task complexity. Finally, we develop MPG-GP, a joint extraction model integrating a pre-trained language model with a global pointer network, to effectively extract etiquette entity-relation triples from ancient texts. [Results] Experiments on a custom ancient etiquette entity-relation extraction dataset show F1-score improvements of 0.32% to 6.05% over baseline methods. [Limitations] The prompt templates employ fixed patterns rather than learnable soft prompts, and the prompt engineering design warrants further refinement. [Conclusions] Our approach mitigates reliance on large annotated datasets while improving the accuracy of few-shot joint entity-relation extraction for ancient ritual texts, providing a novel solution for information extraction in low-resource historical documents.

  • Zhang Kai, Lv Xueqiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 81-93. https://doi.org/10.11925/infotech.2096-3467.2023.1298
    Abstract (233) PDF (94) HTML (108)

    [Objective] Taking personification as a representative of unmarked rhetorical categories, this study explores a multidimensional fusion recognition strategy, which holds significance for Chinese rhetorical computing. [Methods] Based on dependency syntax theory, we constructed a cognitive model for generating and understanding personification through a cognitive framework. We then proposed a multidimensional feature fusion method for automatic personification recognition (WPGBA). This method represents and integrates multiple features of rhetorical texts, including word vectors, syntax vectors, part-of-speech vectors, and contextual semantics, using Chinese language textbooks from the K-12 curriculum as experimental data. [Results] We trained the automatic recognition model using the WPGBA method. Experiments showed that the method achieved an accuracy of 90.40%, a recall of 87.58%, and an F1 score of 88.65%. Compared with the other methods in the experimental group, accuracy improved by at least 6.27%. [Limitations] New complex sentences may arise in practical applications such as discourse reading comprehension and language proficiency evaluation. Due to the limited scale of the experimental dataset, the generalization ability of the algorithm is restricted. [Conclusions] The integration strategy of expressive and contextual semantic features, designed from a cognitive perspective, shows good recognition performance for personification rhetorical devices in unmarked categories.

  • Li Shuyu, Zhu Guangli, Li Jiawei, Duan Wenjie, Zhou Ruotong, Zhang Shunxiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 1-11. https://doi.org/10.11925/infotech.2096-3467.2023.1376
    Abstract (232) PDF (119) HTML (158)   Knowledge map   Save

    [Objective] To address the issue of feature sparsity in Chinese ironic short texts, this paper proposes a sarcasm detection method integrating hyperbolic representations. It aims to enhance the accuracy of Chinese sarcasm recognition by extracting hyperbolic representations from short texts. [Methods] First, we used pointwise mutual information and semantic similarity computation to obtain co-occurring word pairs, interjections, and degree adverbs related to sarcasm, and merged these word sets to construct a hyperbolic representation lexicon. Then, we used regular expressions to match sarcastic texts, obtained sequences of special punctuation marks, and extracted their features with one-hot encoding. We employed the RoBERTa-wwm-ext model to extract semantic features from the text, while the WoBERT method transformed the words and word pairs in the hyperbolic representation lexicon into dynamic word vectors, yielding the hyperbolic representation. Finally, we introduced an improved multi-attention mechanism to focus on text semantics, hyperbolic representations, and special punctuation features, and obtained recognition results through the Softmax function. [Results] We evaluated the proposed method on the merged, publicly available Ciron and ChineseSarcasm-Corpus datasets, achieving an accuracy of 81.49% and an F1 score of 81.24%. [Limitations] The constructed hyperbolic representation lexicon relies on corpus quality and has limited generalization ability. [Conclusions] The proposed method can effectively enrich semantic representation and improve the accuracy of Chinese sarcasm detection.
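    The pointwise mutual information step used to find sarcasm-related co-occurring word pairs can be sketched as follows; the toy counts are fabricated for illustration, and the real statistics come from the corpora above:

```python
import math
from collections import Counter

# Fabricated corpus statistics for demonstration only.
word_counts = Counter({"真": 50, "厉害": 30})
pair_counts = Counter({("真", "厉害"): 20})
total_words, total_pairs = 1_000, 500

def pmi(w1: str, w2: str) -> float:
    """PMI(w1, w2) = log[ p(w1, w2) / (p(w1) * p(w2)) ]."""
    p_pair = pair_counts[(w1, w2)] / total_pairs
    p1, p2 = word_counts[w1] / total_words, word_counts[w2] / total_words
    return math.log(p_pair / (p1 * p2))

print(round(pmi("真", "厉害"), 3))  # a high score flags a candidate word pair
```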

  • Tang Chao, Chen Bo, Tan Zelin, Zhao Xiaobing
    Data Analysis and Knowledge Discovery. 2025, 9(7): 118-129. https://doi.org/10.11925/infotech.2096-3467.2024.0722

    [Objective] This work aims to address the challenge of scarce supervised data in classical Chinese entity extraction by leveraging knowledge distillation techniques to inject knowledge from unsupervised external sources into a student model. [Methods] A large language model is utilized as a generative knowledge teacher model to perform knowledge distillation on unsupervised corpora. Additionally, a dictionary knowledge teacher model is built using supervised data from the ZuoZhuan and GuNer datasets. The knowledge distilled from both teachers is integrated to compile a semi-supervised dataset for classical Chinese entity extraction. The task is then reformulated as a sequence-to-sequence problem, and pre-trained models such as mT5 and UIE are fine-tuned on this dataset. [Results] On the ZuoZhuan and GuNer datasets, the proposed method achieves F1-scores of 89.15% and 95.47%, respectively, outperforming the baseline models SikuBERT and SikuRoBERTa, which were incrementally fine-tuned on classical Chinese corpora, by 8.15% and 9.27% in F1-score. [Limitations] The method does not incorporate additional entity type information, and the quality of data pre-retrieved by the LLM may affect extraction results. [Conclusions] In low-resource settings, the proposed approach effectively distills the knowledge advantages of pre-trained large language models and dictionary resources into the student entity extraction model, significantly improving performance on classical Chinese entity extraction tasks.
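    One way to picture the two-teacher distillation is the sketch below; both teacher functions are hypothetical stand-ins (returning toy labels) for the generative LLM teacher and the dictionary teacher, and the linearized target format is an assumption:

```python
def llm_teacher(sentence: str) -> list[tuple[str, str]]:
    # Stand-in: a real system would prompt the generative teacher model here.
    return [("晋侯", "PER")]

def dict_teacher(sentence: str) -> list[tuple[str, str]]:
    # Stand-in: dictionary lookup compiled from the supervised datasets.
    return [("晋侯", "PER"), ("曲沃", "LOC")]

def to_seq2seq(sentence: str) -> dict:
    """Union both teachers' labels and linearize them as a seq2seq target."""
    labels = sorted(set(llm_teacher(sentence)) | set(dict_teacher(sentence)))
    target = "; ".join(f"{etype}: {ent}" for ent, etype in labels)
    return {"source": sentence, "target": target}

print(to_seq2seq("晋侯围曲沃。"))
```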

  • Wang Changcong, Mu Dongmei, Jiang Jing, Zhang Xinyue, Zhang Baorui
    Data Analysis and Knowledge Discovery. 2025, 9(5): 104-113. https://doi.org/10.11925/infotech.2096-3467.2024.0556
    Abstract (224) PDF (69) HTML (180)   Knowledge map   Save

    [Objective] To mine disease-disease relationships from the biomedical literature through influencing factors and provide new perspectives for disease association analysis. [Methods] Motivated by the important role of influencing-factor interventions in multimorbidity management, we extracted disease-influencing factor entity relationships via dependency analysis, applied complex network analysis techniques for disease community discovery, and constructed and validated an influencing-factor-based disease association model using data from the Chinese Medical Association Journal Database. [Results] The model generated a network of 105 diseases, 453 influencing factors, and 2,067 edges, and discovered nine disease communities with strong internal associations mediated by influencing factors, realizing the disease association analysis. [Limitations] Extraction of disease-influencing factors was less effective for complex long sentences, which reduced the number of multimorbidity associations established from influencing factors. [Conclusions] The influencing-factor-based disease association model captures finer-grained disease-influencing factor relationships with better representativeness and interpretability. Furthermore, it provides new research ideas for disease association analysis and multimorbidity co-management.
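    The network construction and community discovery steps can be reproduced in spirit with networkx; the disease and factor names below are invented placeholders, and the real edges come from the dependency-parsed literature:

```python
import networkx as nx
from networkx.algorithms import bipartite
from networkx.algorithms.community import greedy_modularity_communities

# Toy disease / influencing-factor edges standing in for the extracted relations.
G = nx.Graph()
G.add_edges_from([
    ("hypertension", "salt intake"), ("stroke", "salt intake"),
    ("diabetes", "obesity"), ("hypertension", "obesity"),
])

# Project onto diseases: two diseases connect when they share a factor.
diseases = {"hypertension", "stroke", "diabetes"}
disease_net = bipartite.projected_graph(G, diseases)

for community in greedy_modularity_communities(disease_net):
    print(sorted(community))
```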

  • Wu Yifan, Ma Songjie, Li Shuqing
    Data Analysis and Knowledge Discovery. 2025, 9(7): 1-14. https://doi.org/10.11925/infotech.2096-3467.2024.0916

    [Objective] By perceiving the popularity preferences of users and their friends towards items, more accurate recommendation services can be achieved. [Methods] This paper proposes an item popularity calculation method that integrates contribution and influence. An attention mechanism and a recurrent neural network capture the user's popularity preference representation, while a convolutional neural network and a graph attention mechanism obtain friends' long-term and short-term popularity preferences. [Results] Comparative experiments on the Douban, Delicious, and Yelp datasets show that the method outperforms the second-best model, DGRec, with Recall@20 improving by up to 13.03% and NDCG by up to 11.69%. Compared with traditional calculation methods, the proposed popularity calculation improves Recall@20 by up to 11.53% and NDCG by up to 10.29%. [Limitations] The method's performance on short sequences still needs improvement. [Conclusions] The method adds user popularity preference and user social popularity preference representations, enhances the ability to weight each interaction, and can effectively recommend more long-tail items.
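    The abstract does not reproduce the paper's exact popularity formula; as a loose illustration of "integrating contribution and influence", one might blend the two signals as below (the linear form and the weight are assumptions, not the authors' method):

```python
def popularity(contribution: float, influence: float, alpha: float = 0.5) -> float:
    """Assumed linear blend of an item's interaction contribution and the
    social influence of the users who interacted with it."""
    return alpha * contribution + (1 - alpha) * influence

# e.g. an item with many interactions but few influential sharers:
print(popularity(contribution=0.8, influence=0.3))  # 0.55
```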

  • Lv Xueqiang, Wan Tian, Ma Denghao, Cai Zangtai, Chen Yuzhong
    Data Analysis and Knowledge Discovery. 2025, 9(10): 41-53. https://doi.org/10.11925/infotech.2096-3467.2024.0836

    [Objective] Existing keyword extraction methods often suffer from limited attention scope, weak semantic representation, and restricted generative ability. To address these challenges, this paper proposes a patent keyword extraction approach (LLM-PKE) that integrates large language models with multi-feature networks. [Methods] LLM-PKE comprises three modules. In the extraction module, topic information is embedded into a Transformer attention network, combined with Graph Convolutional Networks to enhance sensitivity to topic terms and improve feature extraction. In the generative module, large language models produce keywords highly relevant to patent texts. In the ranking module, the large language model generates similarity scores for each keyword to remove synonyms and less relevant terms, yielding refined patent keywords. [Results] Compared to the best-performing baseline model, the proposed method improves the F1@5 metric by 1.98 percentage points. [Limitations] We use semantic similarity thresholds to remove redundant keywords; however, varying similarity standards across patent texts may limit accuracy and generalizability. [Conclusions] The LLM-PKE model outperforms existing approaches on patent datasets, offering a more effective solution for patent keyword extraction.
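    The ranking module's synonym filtering can be pictured as greedy thresholding on pairwise similarity scores; the sketch below substitutes a toy token-overlap similarity for the LLM-generated scores described above, and the threshold is an assumption:

```python
def dedupe(keywords: list[str], sim, threshold: float = 0.5) -> list[str]:
    """Keep a keyword only if it is sufficiently dissimilar to all kept ones."""
    kept: list[str] = []
    for kw in keywords:
        if all(sim(kw, other) < threshold for other in kept):
            kept.append(kw)
    return kept

def overlap_sim(a: str, b: str) -> float:
    # Toy Jaccard similarity standing in for the LLM similarity score.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

print(dedupe(["graph neural network", "graph neural networks", "patent text"],
             overlap_sim))  # the near-synonym is filtered out
```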

  • Chen Chong, Wang Zongshui, Zhao Hong
    Data Analysis and Knowledge Discovery. 2025, 9(5): 1-18. https://doi.org/10.11925/infotech.2096-3467.2024.0447
    Abstract (216) PDF (114) HTML (161)   Knowledge map   Save

    [Objective] This study aims to systematically summarize sequential recommendation methods that incorporate knowledge features through a comprehensive literature review. [Coverage] Using “Sequential Recommendation * Knowledge” and “序列推荐*知识” as advanced search terms, we searched databases including Web of Science, DBLP, Google Scholar, and CNKI. A total of 97 articles were selected, with special attention to the core content of specific sections to ensure alignment with the research needs. [Methods] Employing the literature review approach, we categorized and analyzed sequential recommendation methods from three perspectives: research framework, real-world applications and evaluations, and future research directions. [Results] We constructed a research framework for the application of knowledge features in sequential recommendation, comprising three components: knowledge feature representation, temporal knowledge enhancement, and sequential recommendation algorithms integrating knowledge features. We also analyzed the limitations of existing evaluation resources in terms of datasets, evaluation metrics, and baseline models, and explored future research directions. [Limitations] While the study provides a comprehensive overview of relevant works in the rapidly evolving field of knowledge-enhanced sequential recommendation, it may not cover all existing studies due to the breadth and volume of the literature. [Conclusions] Sequential recommendation algorithms that consider knowledge features enhance the accuracy of recommendations. Integrating multimodal knowledge features contributes to a deeper understanding of user needs.

  • Zhang Yunqiu, Yin Ce
    Data Analysis and Knowledge Discovery. 2025, 9(8): 100-110. https://doi.org/10.11925/infotech.2096-3467.2024.0791

    [Objective] This study applies large language model (LLM) technology to the task of named entity recognition (NER) in Chinese electronic medical records (EMRs), aiming to improve recognition performance and promote intelligent applications in the Chinese medical field. [Methods] First, we used the Huatuo226K Chinese medical question-answering corpus to enhance the proposed model’s understanding of medical knowledge. Then, we applied the Easy Data Augmentation (EDA) technique to augment the CCKS2019 EMR dataset. Finally, the LLaMA3-8B model was fine-tuned using the LoRA method, yielding a model tailored for Chinese NER tasks. [Results] The proposed model showed significant performance improvements on the CCKS2019 Chinese EMR dataset, achieving an overall precision of 0.8889, recall of 0.8660, and F1 score of 0.8773, an increase of 0.1611 in F1 score compared with the original model. [Limitations] The study lacks an in-depth exploration of overlapping entities, and discrepancies exist in recognition accuracy across different entity categories. [Conclusions] The proposed model demonstrates the potential of LLM technology in the Chinese medical field and lays the groundwork for building a general-purpose Chinese EMR entity recognition model.
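    The LoRA step maps naturally onto the peft library; the configuration below is a minimal, illustrative setup (rank, alpha, and target modules are assumptions, not the authors' reported hyperparameters):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,                        # adapter scaling factor
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only adapter weights remain trainable
```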