Table of Contents

25 June 2025, Volume 9 Issue 6

  • Zhang Borui, Yang Ning, Zhang Xin, Wen Yi
    Data Analysis and Knowledge Discovery. 2025, 9(6): 1-20. https://doi.org/10.11925/infotech.2096-3467.2024.0549

    [Objective] This study provides a comprehensive overview of scientific data recommendation, with the aim of establishing a theoretical basis for scientific data sharing. [Coverage] A search was conducted in CNKI, Web of Science (WOS) and Google Scholar using keywords such as “scientific data recommendation” and “scientific dataset recommendation”. A total of 71 key articles were identified through thematic and snowball searches. [Methods] A systematic literature review and synthesis approach was used to evaluate existing research. This study provides a comprehensive overview and critical analysis of recommendation models, evaluation metrics and future perspectives. [Results] Recommendation of scientific datasets plays a critical role in facilitating their sharing. Prevalent methods include content-based filtering, collaborative filtering, graph models, and hybrid approaches. Identified research gaps include the synthesis of multi-source, heterogeneous data, the protection of user privacy, the development of explainable systems, and the evaluation of recommendations. [Limitations] This paper provides an overview of the latest research in this field, focusing on key studies. Due to the inherent diversity of scientific data types, it is not feasible to enumerate every individual study. [Conclusions] Future research directions include integrating heterogeneous information from multiple sources, improving the explainability of recommendations, ensuring privacy protection and refining evaluation methods.

  • Song Mengpeng, Bai Haiyan
    Data Analysis and Knowledge Discovery. 2025, 9(6): 21-34. https://doi.org/10.11925/infotech.2096-3467.2024.0628

    [Objective] This paper aims to automatically generate structured literature reviews with references, helping researchers quickly grasp a specific area of scientific knowledge. [Methods] A corpus was constructed by selecting 70,000 papers from the NSTL platform and identifying moves in the abstracts. A training set of 3,000 reviews was generated with a large language model and revised manually, and the GLM3-6B model was fine-tuned on it. The corpus was then converted into high-dimensional vectors and stored in an index, which served as the external knowledge base retrieved through LangChain. To address the poor retrieval of proper nouns, hybrid search with BM25 followed by reranking was used to improve retrieval accuracy. [Results] The literature review generation system built on fine-tuning and the hybrid retrieval framework improved the BLEU and ROUGE scores by 109.64% and 40.22% respectively, and the authenticity score in manual evaluation by 62.17%. [Limitations] Due to limitations in computational resources, the scale of the local model parameters is small and its generation ability needs further improvement. [Conclusions] Retrieval-augmented generation with large language models not only generates high-quality literature reviews but also provides traceable evidence for the generated content, assisting researchers in intelligent reading.
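
    The hybrid retrieval step described above can be illustrated compactly. Below is a minimal sketch, not the authors' implementation, of fusing BM25 lexical scores with dense-vector scores and reranking by the fused score; the `embed()` function is a stand-in assumption for whatever sentence encoder is used.

```python
# Sketch of hybrid retrieval: BM25 lexical scores fused with dense vector
# scores, then a rerank by the fused score. Assumes `rank_bm25` is
# installed; embed() is a toy stand-in for a real sentence encoder.
import numpy as np
from rank_bm25 import BM25Okapi

corpus = [
    "retrieval augmented generation for literature reviews",
    "fine-tuning GLM3-6B on move-annotated abstracts",
    "BM25 hybrid search improves proper-noun recall",
]

def embed(text: str) -> np.ndarray:
    # Toy hashing embedding (replace with a real encoder in practice).
    vec = np.zeros(64)
    for tok in text.split():
        vec[hash(tok) % 64] += 1.0
    return vec / (np.linalg.norm(vec) + 1e-9)

bm25 = BM25Okapi([doc.split() for doc in corpus])
doc_vecs = np.stack([embed(doc) for doc in corpus])

def hybrid_search(query: str, alpha: float = 0.5) -> list:
    lexical = bm25.get_scores(query.split())       # proper-noun friendly
    lexical = lexical / (lexical.max() + 1e-9)     # normalize to [0, 1]
    dense = doc_vecs @ embed(query)                # cosine similarity
    fused = alpha * lexical + (1 - alpha) * dense  # weighted fusion
    return list(np.argsort(-fused))                # rerank by fused score

print(hybrid_search("GLM3-6B fine-tuning"))
```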

  • Zhu Danhao, Huang Xiaoyu, Li Yaolin, Wang Dongbo
    Data Analysis and Knowledge Discovery. 2025, 9(6): 35-46. https://doi.org/10.11925/infotech.2096-3467.2024.0555

    [Objective] This study uses large language model technology to automatically summarise legal texts, addressing issues associated with traditional methods, such as the inadequate handling of lengthy texts and weak logical coherence in summaries. [Methods] This study proposes a method for automatically summarising legal texts based on domain-specific fine-tuning of large language models. Firstly, a legal text summarisation instruction dataset is constructed. Secondly, two data augmentation strategies are explored: instruction augmentation and result augmentation. Finally, domain-specific fine-tuning is performed on a pre-trained model, and the results are evaluated along multiple dimensions. [Results] On the CAIL2020 Judicial Summary Dataset, our method achieves improvements of 13.8, 21.3, and 7.4 percentage points in the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores, respectively, compared to the best baseline methods. Both human and automated evaluations further validate the effectiveness of our approach across multiple dimensions. [Limitations] When processing legal texts dense with technical terms and complex logical structures, the generated summaries still lack accuracy of detail and precision with regard to legal provisions. [Conclusions] Fine-tuning large language models for specific domains can effectively improve the quality of legal text summarisation.
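
    To make the two augmentation strategies concrete, here is a hedged sketch of how an instruction-tuning dataset might be augmented; the field names and paraphrase lists are illustrative assumptions, not the paper's actual schema.

```python
# Sketch of the two augmentation strategies: instruction augmentation
# (vary the prompt) and result augmentation (pair the same input with
# several valid summaries). Schema and strings are illustrative.
import json
import random

base_example = {
    "instruction": "Summarize the following judicial document.",
    "input": "<full legal text>",
    "output": "<reference summary>",
}

INSTRUCTION_PARAPHRASES = [
    "Summarize the following judicial document.",
    "Write a concise summary of this court ruling.",
    "Condense the legal text below into its key points.",
]

def augment_instruction(example: dict) -> dict:
    """Instruction augmentation: vary the prompt, keep input/output fixed."""
    out = dict(example)
    out["instruction"] = random.choice(INSTRUCTION_PARAPHRASES)
    return out

def augment_result(example: dict, alt_summaries: list) -> list:
    """Result augmentation: same input paired with several valid summaries."""
    return [dict(example, output=s) for s in alt_summaries]

dataset = [augment_instruction(base_example)]
dataset += augment_result(base_example, ["<alt summary 1>", "<alt summary 2>"])
print(json.dumps(dataset, indent=2))
```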

  • Yu Chi, Chen Liang, Xu Haiyun, Mu Lin, Xia Chunzi, Xian Xin
    Data Analysis and Knowledge Discovery. 2025, 9(6): 47-62. https://doi.org/10.11925/infotech.2096-3467.2024.0650

    [Objective] This paper aims to perform named entity recognition (NER) of key technical information in patent texts under conditions with limited labeled samples. [Methods] This paper proposes a framework for identifying named entities in patent texts using prompt templates, leveraging the extensive general knowledge and powerful semantic understanding capabilities of large language models. [Results] An empirical analysis is conducted on the hard disk drive (HDD) head patent annotation dataset TFH-2020. Experimental results show that with few-shot prompting, the large language model achieves an F1 score of 69% on named entity recognition, whereas supervised fine-tuning reaches only 54% F1 — in contrast to the performance of large language models on general-text NER. [Limitations] Although the proposed method greatly reduces data annotation costs, a performance gap remains compared with the best deep learning methods trained on large amounts of labeled data. Furthermore, the design and optimization of prompt templates, as well as the rapid generation of large-scale instruction sets, need further improvement. [Conclusions] Compared with a random sample selection strategy, a similar-sample selection strategy raises the NER F1 score of the large language model from 29% to 69%. This indicates that the sample selection strategy has a significant impact on the performance of large language models in the patent NER task; the prompt template is the core of the method, as it determines both the quality of recognition and the choice of optimization methods.
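
    The similar-sample selection strategy highlighted in the conclusions can be sketched as follows; the encoder, prompt wording, and entity types are illustrative assumptions rather than the paper's exact template.

```python
# Sketch of similar-sample selection for few-shot NER: pick the k labeled
# examples most similar to the target sentence and splice them into a
# prompt. sentence_vec() is a toy stand-in for a real encoder.
import numpy as np

def sentence_vec(text: str, dim: int = 128) -> np.ndarray:
    v = np.zeros(dim)
    for tok in text.split():
        v[hash(tok) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def select_demonstrations(target: str, labeled_pool: list, k: int = 3) -> list:
    t = sentence_vec(target)
    scored = sorted(labeled_pool,
                    key=lambda ex: -float(sentence_vec(ex["text"]) @ t))
    return scored[:k]

def build_prompt(target: str, demos: list) -> str:
    lines = ["Extract technical entities (component, function, material):"]
    for ex in demos:
        lines.append(f"Sentence: {ex['text']}\nEntities: {ex['entities']}")
    lines.append(f"Sentence: {target}\nEntities:")
    return "\n\n".join(lines)

pool = [{"text": "The slider supports the magnetic head.",
         "entities": "slider[component]; magnetic head[component]"}]
target = "The actuator positions the read head."
print(build_prompt(target, select_demonstrations(target, pool)))
```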

  • Qian Lingfei, Ma Ziyi, Dong Jiajia, Zhu Pengyu, Gao Dequan
    Data Analysis and Knowledge Discovery. 2025, 9(6): 63-72. https://doi.org/10.11925/infotech.2096-3467.2024.0610

    [Objective] To enhance relation extraction from power communication system fault texts, a multi-level graph convolution document-level relation extraction method incorporating ontology information is proposed, taking the domain’s inherent characteristics into account. [Methods] Firstly, word-level embedding was employed to encode the fault text. Secondly, sentence-level and entity-level document graphs were constructed. Thirdly, entity-level, sentence-level and document-level semantic information was aggregated by convolution. Finally, following the ontology conceptual model, an “Ontology-Ontology” construction method was developed, and an auxiliary task predicting whether an entity pair conforms to the ontology constraint was incorporated to improve the model’s performance. [Results] Comparison and ablation experiments were conducted on self-built power communication network fault datasets. The findings demonstrated that the proposed method achieved the best performance, with F1, Ign_F1 and accuracy values of 97.22%, 95.17% and 97.97%, respectively. [Limitations] The generalisability of the model needs deeper verification. [Conclusions] The proposed method is an effective solution for extracting relationships for fault knowledge graphs of power communication networks, achieving a superior extraction effect compared with existing methods.
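
    The ontology-constraint auxiliary task can be illustrated schematically: a second prediction head checks whether an entity pair conforms to the ontology constraint, and its loss is added to the relation-extraction loss. The sketch below is an assumption-laden simplification, not the paper's architecture.

```python
# Sketch of a relation extractor with an auxiliary ontology-constraint
# head. Dimensions, loss weight, and the random targets are placeholders.
import torch
import torch.nn as nn

class RelationExtractorWithOntology(nn.Module):
    def __init__(self, dim: int = 256, num_relations: int = 10):
        super().__init__()
        self.rel_head = nn.Linear(2 * dim, num_relations)  # main task
        self.onto_head = nn.Linear(2 * dim, 1)             # constraint check

    def forward(self, head_emb, tail_emb):
        pair = torch.cat([head_emb, tail_emb], dim=-1)
        return self.rel_head(pair), self.onto_head(pair).squeeze(-1)

model = RelationExtractorWithOntology()
h, t = torch.randn(4, 256), torch.randn(4, 256)
rel_logits, onto_logits = model(h, t)
loss = (nn.functional.cross_entropy(rel_logits, torch.randint(0, 10, (4,)))
        + 0.5 * nn.functional.binary_cross_entropy_with_logits(
              onto_logits, torch.randint(0, 2, (4,)).float()))
print(loss.item())
```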

  • Ma Yuekun, Zhang Jiaxin
    Data Analysis and Knowledge Discovery. 2025, 9(6): 73-87. https://doi.org/10.11925/infotech.2096-3467.2024.0636

    [Objective] To comprehensively explore the implicit semantic information in metaphors, accurately capture the semantic differences between metaphorical and literal meanings, and enhance metaphor detection, this paper proposes a metaphor detection method based on semantic graph representation and contrastive learning. [Methods] Firstly, a graph convolutional network (GCN) extracts the contextual semantic information of dependency word pairs, realising the contextual semantic graph representation. Secondly, a semantic network based on metaphor cognition is constructed, and GraphSAGE and meta-path technology are used to learn the latent conceptual semantic associations in the network, realising the graph representation of the metaphor-cognition semantic network. Finally, a bidirectional cross-attention mechanism and a multi-view fusion module fuse the different features, and supervised contrastive learning captures the similarities and differences between metaphorical and literal meanings from the perspectives of sample similarity and domain inconsistency, improving the classifier’s ability to distinguish metaphors. [Results] For the token-level task, F1 scores on the MOH-X and TroFi datasets improved by 0.6% and 1.9%, respectively. For the relation-level task, F1 scores on the MOH-X, TSV and TroFi datasets increased by 0.6%, 1.0% and 2.7% respectively, reaching the current optimal level. [Limitations] In generating the metaphor-cognition semantic network, word ambiguity may affect the results in some cases. [Conclusions] The proposed method effectively captures the semantic differences between metaphorical and literal meanings and improves metaphor detection performance.
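
    The supervised contrastive learning component can be sketched as a standard supervised contrastive loss over sample embeddings, with metaphorical and literal samples as the two classes; the shapes and temperature below are illustrative assumptions.

```python
# Sketch of a supervised contrastive loss that pulls same-class samples
# (metaphorical vs. literal) together and pushes classes apart.
import torch
import torch.nn.functional as F

def sup_con_loss(features: torch.Tensor, labels: torch.Tensor,
                 temperature: float = 0.07) -> torch.Tensor:
    """features: (N, d) L2-normalized; labels: (N,) 0=literal, 1=metaphor."""
    sim = features @ features.T / temperature                 # pairwise sims
    mask_pos = (labels[:, None] == labels[None, :]).float()
    mask_pos.fill_diagonal_(0)                                # no self-pairs
    logits = sim - sim.max(dim=1, keepdim=True).values.detach()  # stability
    exp = torch.exp(logits) * (1 - torch.eye(len(labels)))    # drop diagonal
    log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True) + 1e-9)
    pos_count = mask_pos.sum(dim=1).clamp(min=1)
    return -((mask_pos * log_prob).sum(dim=1) / pos_count).mean()

feats = F.normalize(torch.randn(8, 32), dim=1)
labels = torch.tensor([0, 0, 1, 1, 0, 1, 0, 1])
print(sup_con_loss(feats, labels))
```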

  • Zhang Shunxiang, Wen Hua, Zhang Jixu, Ding Yuanyuan, Duan Yujun
    Data Analysis and Knowledge Discovery. 2025, 9(6): 88-98. https://doi.org/10.11925/infotech.2096-3467.2024.0655

    [Objective] To address the limitation that existing depression severity sentiment analysis models insufficiently account for patients’ extreme expressions, this study proposes a novel sentiment analysis model specifically targeting extreme expressions in the context of depression severity. [Methods] Character-level and word-level features are first extracted through a combination of Jieba and RoBERTa. These multi-granularity features are then fused and fed into a Bi-directional Long Short-Term Memory (BiLSTM) network to capture sentiment information across different positions within the text. Subsequently, a multi-head attention mechanism is employed to assign varying weights to different textual segments, thereby enabling the model to more accurately capture information related to depressive sentiment. Finally, the output is passed through a fully connected layer and normalized via the Softmax function to produce the final prediction results. [Results] On the Chinese Depressive Text Sentence Corpus, the proposed model achieved an accuracy of 84.14%, a recall of 61.09%, an F1-score of 62.90%, and a precision of 64.81%. On the ZFCD dataset, it achieved an accuracy of 93.59%, a recall of 82.55%, an F1-score of 85.37%, and a precision of 88.38%. [Limitations] This study focuses solely on textual information for depression severity sentiment analysis and does not incorporate multimodal data such as images, audio, or video. [Conclusions] The proposed model effectively identifies depression-related terms and degree adverbs, captures deep semantic information through multi-granularity feature fusion, and significantly enhances the accuracy of depression severity sentiment analysis.
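
    The described pipeline (fused multi-granularity features → BiLSTM → multi-head attention → Softmax) maps naturally onto a compact PyTorch module. The sketch below abstracts the Jieba + RoBERTa feature extraction into the input tensor; all dimensions and the class count are placeholders.

```python
# Schematic sketch of the pipeline: fused features -> BiLSTM ->
# multi-head attention -> softmax classifier. Not the paper's exact model.
import torch
import torch.nn as nn

class DepressionSeverityModel(nn.Module):
    def __init__(self, feat_dim=768, hidden=256, heads=4, num_classes=4):
        super().__init__()
        self.bilstm = nn.LSTM(feat_dim, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, fused_feats):          # (B, seq_len, feat_dim)
        h, _ = self.bilstm(fused_feats)      # sentiment context per position
        a, _ = self.attn(h, h, h)            # weight salient segments
        pooled = a.mean(dim=1)               # sentence-level representation
        return self.fc(pooled).softmax(dim=-1)

model = DepressionSeverityModel()
print(model(torch.randn(2, 50, 768)).shape)  # torch.Size([2, 4])
```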

  • Ren Minglun, Gong Ningran
    Data Analysis and Knowledge Discovery. 2025, 9(6): 99-110. https://doi.org/10.11925/infotech.2096-3467.2024.0612

    [Objective] This paper aims to address cognitive limitations in knowledge graphs caused by the sparsity of relations and the difficulty of exploiting hidden relations. [Methods] We propose a global-aware knowledge graph reasoning model (GAGAT) based on a graph attention network to enhance the accuracy and interpretability of link prediction. The model introduces betweenness centrality as implicit structural information and combines it with relational semantic information to construct a hierarchical attention mechanism. [Results] On the FB15K-237 and WN18RR datasets, the GAGAT model improves the Hits@3 metric by 26.5 and 5 percentage points over ComplEx, by 15 and 1.6 percentage points over CompGCN, and by 1 percentage point over SD-GAT, demonstrating its superiority in capturing implicit relations and complex semantics. [Limitations] Only betweenness centrality is fused with relational semantic information for reasoning; the role of other implicit structural features remains unexplored. [Conclusions] By combining implicit structural and relational semantic information, the GAGAT model uncovers hidden relationships in the knowledge graph, improving the accuracy and interpretability of link prediction and enhancing the cognitive decision-making ability of intelligent systems.
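
    The structural ingredient of GAGAT — betweenness centrality as an implicit signal — is easy to illustrate. The sketch below computes centrality with NetworkX and folds it into a toy attention weight; the fusion shown is a simplification of the paper's hierarchical attention, not its actual formula.

```python
# Sketch: betweenness centrality of knowledge-graph entities as an
# implicit structural signal scaling a semantic score. Toy triples.
import networkx as nx

triples = [("A", "r1", "B"), ("B", "r2", "C"), ("A", "r1", "C"),
           ("C", "r3", "D")]

g = nx.DiGraph()
g.add_edges_from((h, t) for h, _, t in triples)
bc = nx.betweenness_centrality(g)            # implicit structural signal

def attention_weight(head, tail, semantic_score):
    # Toy fusion: structural salience scales the relation's semantic score.
    structural = 1.0 + bc[head] + bc[tail]
    return semantic_score * structural

print({n: round(v, 3) for n, v in bc.items()})
print(attention_weight("B", "C", semantic_score=0.8))
```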

  • Lu Wen, Wu Zhendong, Peng Lilan, Chen Xiangrui, Ma Huan
    Data Analysis and Knowledge Discovery. 2025, 9(6): 111-122. https://doi.org/10.11925/infotech.2096-3467.2024.0637

    [Objective] To address the suboptimal quality of entity samples and the imbalanced distribution of positive and negative samples caused by growing data volumes in the learning process of knowledge graph embedding (KGE) models, an adaptive knowledge graph embedding model (SSF) integrating subsampling and negative sampling is proposed. [Methods] Firstly, the K-Means clustering algorithm is introduced into an entity-aware negative sampling strategy that selects negative samples highly correlated with positive samples, addressing the problems of sample sparsity and sampling quality. Secondly, a multidimensional subsampling strategy is adopted to dynamically adjust the ratio of positive to negative samples, optimizing the structure of the sample dataset and ensuring a balanced distribution of sample categories. Finally, a gating network is constructed that, based on the word-frequency statistics of the dataset, adaptively selects between frequency-based and unique-based subsampling functions to improve the accuracy and stability of the output sample embeddings. [Results] Comparative experiments on the FB15K-237 and WN18RR datasets indicate that the SSF model achieves a maximum improvement of 10.7% in the MRR metric over the baseline models. [Limitations] Owing to the limited scale of open-source knowledge graph datasets, the model's effectiveness on larger datasets has not been fully verified. Moreover, the negative sampling strategy is computationally expensive, and a comprehensive evaluation of the model's computational complexity and efficiency remains to be conducted. [Conclusions] The SSF model integrates the advantages of subsampling and negative sampling strategies and outperforms the baseline models on the MR, MRR, and Hits@N metrics, enhancing the quality of knowledge graph embeddings and the generalization ability of the model.
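
    The K-Means-based, entity-aware negative sampling idea can be sketched as follows: cluster entity embeddings, then corrupt triples with entities drawn from the true entity's own cluster so negatives stay semantically close to positives. The embeddings and cluster count are placeholder assumptions.

```python
# Sketch of entity-aware negative sampling via K-Means clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
entity_emb = rng.normal(size=(100, 32))      # stand-in entity embeddings
clusters = KMeans(n_clusters=8, n_init=10,
                  random_state=0).fit_predict(entity_emb)

def sample_negative_tail(true_tail: int, n: int = 4) -> np.ndarray:
    """Pick hard negatives from the true tail's cluster (excluding itself)."""
    same = np.where(clusters == clusters[true_tail])[0]
    same = same[same != true_tail]
    return rng.choice(same, size=min(n, len(same)), replace=False)

print(sample_negative_tail(true_tail=7))
```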

  • Jiang Yuzhe, Cheng Quan
    Data Analysis and Knowledge Discovery. 2025, 9(6): 123-135. https://doi.org/10.11925/infotech.2096-3467.2024.0600

    [Objective] This paper mines and analyses patients’ temporal and physiological data to provide an accurate and safe reference for medication plans and effective support for doctors’ medication decisions. [Methods] A hybrid medication regimen recommendation model integrating temporal and vital sign data is proposed. Firstly, the model uses a Transformer architecture, convolutional neural networks (CNNs), and time-aware methods to analyse each patient’s temporal data. Then, knowledge graph technology and graph convolutional networks (GCNs) are leveraged to explore patients’ physiological data. Finally, the model incorporates adverse drug-drug interaction (DDI) information into the recommendation process, providing patients with safe and effective medication regimens. [Results] An empirical study was conducted on patients with multiple admissions drawn from the MIMIC-III dataset. The proposed model achieved Jaccard index improvements of 14.0%, 6.6% and 3.7% over the GRAM, G-BERT and TAHDNet models, respectively, and F1 improvements of 9.3%, 4.4% and 1.2%, while achieving the lowest DDI rate. [Limitations] Although the model considered abnormal signs, it did not take the specific values of these signs into account when learning from patient data. [Conclusions] Integrating and analysing patients’ time series and vital sign data enables the drug recommendation model to learn the characteristics of patients’ conditions more accurately, facilitating more precise medication regimens. Furthermore, considering adverse drug interaction information when making recommendations helps ensure safer medication plans for patients.
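
    Two of the reported evaluation quantities are simple to state in code. The sketch below shows the Jaccard index between recommended and ground-truth drug sets and a DDI rate over recommended drug pairs; the interaction table is a toy assumption.

```python
# Sketch of two evaluation quantities: Jaccard index between recommended
# and actual drug sets, and DDI rate over recommended drug pairs.
def jaccard(recommended: set, actual: set) -> float:
    return len(recommended & actual) / len(recommended | actual)

def ddi_rate(recommended: set, ddi_pairs: set) -> float:
    drugs = sorted(recommended)
    pairs = [(a, b) for i, a in enumerate(drugs) for b in drugs[i + 1:]]
    if not pairs:
        return 0.0
    bad = sum((a, b) in ddi_pairs or (b, a) in ddi_pairs for a, b in pairs)
    return bad / len(pairs)

rec, truth = {"warfarin", "aspirin", "metformin"}, {"warfarin", "metformin"}
print(jaccard(rec, truth))                       # 0.666...
print(ddi_rate(rec, {("aspirin", "warfarin")}))  # 0.333...
```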

  • Wen Yan, Sun Huizheng, Bian Wei, Yan Minghai
    Data Analysis and Knowledge Discovery. 2025, 9(6): 136-148. https://doi.org/10.11925/infotech.2096-3467.2024.0370

    [Objective] Existing session-based recommendation systems often struggle to accurately capture a user’s dynamic focus of interest while effectively suppressing noise in behavioural streams. In this paper, we address these two challenges by proposing a novel framework that integrates repetition-aware hypergraph construction and co-occurrence ranking optimization. [Methods] First, observing that item recurrence patterns imply higher session cohesion, we propose a repetition-aware hypergraph construction method. Second, we re-rank positional information by merging global co-occurrence information. Finally, a module for injecting global co-occurrence information across sessions is introduced to alleviate the data sparsity problem. [Results] Extensive evaluations on three benchmark datasets show significant improvements over the baselines: on Diginetica, P@20 improves by 1.09 percentage points and MRR@20 by 0.63 percentage points; on Tmall, P@20 improves by 8.41 percentage points and MRR@20 by 6.29 percentage points; and on RetailRocket, P@20 improves by 2.91 percentage points and MRR@20 by 1 percentage point. [Limitations] While our model advances state-of-the-art performance, its effectiveness remains limited in extreme data-sparsity scenarios where session interactions are exceptionally sparse. [Conclusions] This work provides new insights into modelling behaviour repetition patterns and cross-session dependencies, offering a principled framework for session recommendation systems in practical applications.
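
    For reference, the two reported metrics can be computed per session as below: P@20 checks whether the true next item appears in the top 20, and MRR@20 takes its reciprocal rank (zero outside the top 20); both are averaged over sessions. The toy sessions are illustrative.

```python
# Sketch of the session-recommendation metrics P@K and MRR@K.
def precision_at_k(ranked: list, target, k: int = 20) -> float:
    return float(target in ranked[:k])

def mrr_at_k(ranked: list, target, k: int = 20) -> float:
    top = ranked[:k]
    return 1.0 / (top.index(target) + 1) if target in top else 0.0

sessions = [(["i3", "i7", "i1"], "i7"), (["i2", "i9", "i4"], "i5")]
p = sum(precision_at_k(r, t) for r, t in sessions) / len(sessions)
m = sum(mrr_at_k(r, t) for r, t in sessions) / len(sessions)
print(f"P@20={p:.2f}, MRR@20={m:.2f}")   # P@20=0.50, MRR@20=0.25
```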

  • Zhang Zhipeng, Zhang Liyi
    Data Analysis and Knowledge Discovery. 2025, 9(6): 149-160. https://doi.org/10.11925/infotech.2096-3467.2024.0591

    [Objective] From the perspective of the audience attributes of short video advertisements, the psychological and demographic attributes of the audience are extracted to explore their impact on audience engagement. [Methods] Based on self-efficacy theory, data mining and deep learning techniques are employed to construct variables representing the audience’s psychological and demographic attributes. These variables are analysed using a multiple regression model to determine their influence on audience engagement and the moderating effect of product types. [Results] The study reveals that audience perception of advertisement disclosure, the proportion of enthusiastic comments, the proportion of female audiences, and the representation of Generation Z, middle-aged people, and older adults all influence audience engagement. Furthermore, the type of product moderates the observed main effects. [Limitations] The audience engagement metric is relatively simplistic, and could be expanded further by obtaining viewing and purchasing data from the audience. [Conclusions] The psychological and demographic attributes of audiences have a significant impact on their level of engagement, and this effect is moderated by product type.
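
    The moderated-regression design can be sketched with a standard interaction-term model; the variable names and synthetic data below are illustrative assumptions, not the study's dataset or exact specification.

```python
# Sketch of a moderated multiple regression: audience-attribute
# predictors, a product-type moderator, and an interaction term.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
df = pd.DataFrame({
    "engagement": rng.normal(size=200),
    "female_share": rng.uniform(0, 1, 200),
    "enthusiastic_share": rng.uniform(0, 1, 200),
    "product_type": rng.integers(0, 2, 200),   # toy binary moderator
})

# The product_type:female_share term tests the moderating effect.
model = smf.ols(
    "engagement ~ female_share + enthusiastic_share"
    " + product_type + product_type:female_share", data=df).fit()
print(model.params)
```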

  • Zeng Wen, Wang Yuefen
    Data Analysis and Knowledge Discovery. 2025, 9(6): 161-171. https://doi.org/10.11925/infotech.2096-3467.2024.0660

    [Objective] This study aims to provide references for decision-making in relation to technological innovation activities by elucidating the distributional characteristics and evolutionary patterns of patent technology transfer types. [Methods] We constructed a database of information on patent technology transfers using data from patent assignments, followed by temporal segmentation and network topology modelling. We operationalised transfer scope and depth metrics through selected patent feature indicators (including assignee diversity, geographic coverage and technology maturity), classifying them into transfer types using strategic coordinate analysis. We then employed a Markov chain model to analyse the temporal distribution and evolutionary trends of these transfer types. [Results] Among AI patent technology transfers in China, type III is the most prevalent, whereas type I is highly concentrated in the Yangtze River Delta and Pearl River Delta regions. Most provinces and cities exhibit an evolutionary trajectory from type III to type II to type I, with a high probability of temporal persistence for technology transfer types, particularly a 100% self-persistence rate for type I, and reduced cross-level transitions between types. [Limitations] This analysis relies solely on bivariate indicators for technology transfer type classification. Future research could incorporate multidimensional metrics to enhance analytical granularity. [Conclusions] The identified characteristics and evolutionary patterns of technology transfer types offer valuable references for governments and enterprises to formulate targeted patent transfer and commercialization policies and strategies.
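
    The Markov-chain step can be sketched as estimating a transition matrix between transfer types from consecutive-period observations, with self-persistence read off the diagonal; the sequences below are toy data, not the study's.

```python
# Sketch: estimate a Markov transition matrix between transfer types
# (I, II, III) from per-region sequences of observed types.
import numpy as np

TYPES = ["I", "II", "III"]
IDX = {t: i for i, t in enumerate(TYPES)}

# One sequence of observed transfer types per region, per period (toy).
sequences = [["III", "II", "I", "I"], ["III", "III", "II", "I"]]

counts = np.zeros((3, 3))
for seq in sequences:
    for a, b in zip(seq, seq[1:]):
        counts[IDX[a], IDX[b]] += 1

row_sums = counts.sum(axis=1, keepdims=True)
P = np.divide(counts, row_sums, out=np.zeros_like(counts),
              where=row_sums > 0)
print(P)   # P[i, i] is the self-persistence probability of type i
```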