
Online first

The manuscripts published below will continue to be available from this page until they are assigned to an issue.
  • Xiang Qian, Cai Zangtai, Li Cuo, Ma Denghao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0109
    Online available: 2025-09-12

    [Objective] This study aims to address the issues of insufficient local feature extraction and decoding redundancy in traditional Transformer models for Tibetan news generation. The goal is to enhance the quality, detail expression, and semantic coherence of generated Tibetan news texts. [Methods] The model leverages the title and prompt words as key inputs to strengthen understanding of the text's theme. In the encoder, a Transformer captures the global features of the title, while a CNN extracts the local features of the prompt words, forming a refined text representation through weighted fusion. The decoder employs a hybrid strategy and an autoregressive mechanism to gradually generate news content closely related to the input, reducing redundancy and improving the naturalness and coherence of the text. [Results] On a self-constructed Tibetan news generation dataset, the model achieved BLEU, ROUGE, and Distinct scores of 38.9%, 35.8%, and 47.2%, respectively, a significant improvement over baseline models. [Limitations] This study focuses on Tibetan news generation and has not yet been validated in other Tibetan text generation scenarios, such as literary creation or technical documentation. Additionally, the computational efficiency and resource consumption of the model require further optimization. [Conclusions] By combining global and local feature extraction with a hybrid decoding strategy, this study significantly enhances the quality and detail expression of Tibetan news generation, providing an innovative solution for Tibetan natural language processing.

  • Yao Jianjun, Zhuang Zicong, Li Ruisheng, Yang Dunshun, Zhang Zhen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0156
    Online available: 2025-09-12

    [Objective] This paper achieves high-quality entity-relation extraction for Chinese judicial texts, advancing the precision of domain-specific triplet extraction in judicial judgment documents through an optimized framework tailored to judicial linguistic patterns and semantic dependencies. [Methods] This paper proposes S-CasRel-MF, a joint extraction model based on CasRel with entity feature-sensitive attention mechanisms. The model incorporates a LERT Chinese encoder to enhance semantic modeling of complex Chinese judicial texts, combines self-attention mechanisms and BiLSTM to improve contextual interactive feature capture for entities, addresses error propagation in subject-object extraction through domain-specific feature dictionaries and multi-head attention mechanisms, and mitigates sample imbalance during decoding with a Focal Loss function. [Results] Experimental results demonstrate that the S-CasRel-MF model attains F1-scores of 83.51% and 83.40% on drug-related and theft case datasets, respectively, achieving statistically significant improvements of 9.77% and 8.73% over the CasRel baseline; compared with other types of extraction models, F1-scores increased by 16.59% and 15.70% on average. [Limitations] The model demonstrates elevated computational complexity during inference, while its reliance on external legal entity feature lexicons creates domain adaptation bottlenecks, manifested as lexicon coverage gaps during cross-domain migration, that compromise extraction fidelity. [Conclusions] Our model exhibits superior capability in capturing intricate inter-entity relationships within judicial texts, demonstrating statistically significant performance advantages over existing baselines in joint entity-relation extraction, particularly for semantically dense legal interactions.
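
    The Focal Loss mentioned above is a standard remedy for the sparse-label imbalance that arises when tagging subject/object spans. A minimal PyTorch sketch is given below; it is illustrative only (the alpha and gamma values are conventional defaults, not the paper's settings).

```python
# Illustrative sketch (not the authors' code): a binary focal loss of the kind
# commonly used to mitigate label imbalance when decoding entity span tags.
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss over per-token span logits; targets in {0, 1}."""
    probs = torch.sigmoid(logits)
    # p_t is the probability the model assigns to the true class
    p_t = probs * targets + (1 - probs) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    ce = F.binary_cross_entropy_with_logits(logits, targets.float(), reduction="none")
    # Down-weight easy examples by (1 - p_t)^gamma
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

# Toy usage: random token-level logits with sparse gold labels
logits = torch.randn(2, 128)
targets = (torch.rand(2, 128) > 0.95).float()
print(focal_loss(logits, targets).item())
```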

  • Wu Huidong, Su Qiudan, Wu Dengsheng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1121
    Online available: 2025-09-12

    [Objective] To effectively aggregate journal ranking results that contain tied ranks, reflecting the academic community's overall assessment of publication quality. [Methods] A journal rank aggregation method based on a Linear Pseudo-Boolean Optimization model (LPBO) is proposed. The approach transforms the original ranking results into pairwise comparisons between journals and uses a generalized Kendall-τ distance to construct the LPBO model. Under constraints ensuring the uniqueness and transitivity of journal rankings, the method produces an aggregated ranking. [Results] An empirical study on journals in the field of information science and library science showed that the correlation coefficient between the aggregated LPBO results and the original ratings is 13.7% higher than that among the original ratings themselves, while tied-rank information is preserved. The model demonstrates robust performance when handling data with different scales, incomplete information, and varying numbers of journals and rankings. [Limitations] The model may face efficiency challenges when applied to large datasets. [Conclusions] The LPBO method avoids the biases introduced by strict ranking approaches, providing a fair and robust solution for evaluating journal quality and impact, with significant theoretical and practical value.
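
    For readers unfamiliar with rank aggregation over tied ranks, the sketch below illustrates the kind of pairwise comparison and generalized Kendall-τ distance the abstract refers to; the tie penalty p and the toy rankings are illustrative assumptions, not the paper's formulation.

```python
# Minimal sketch (assumptions, not the authors' implementation): rankings with ties
# are expressed as rank dicts (lower = better, equal = tied) and compared with a
# generalized Kendall-tau distance in which a pair tied in one ranking but ordered
# in the other costs p, while a full reversal costs 1.
from itertools import combinations

def kendall_tau_with_ties(r1: dict, r2: dict, p: float = 0.5) -> float:
    """Generalized Kendall-tau distance between two rank dicts {journal: rank}."""
    dist = 0.0
    for a, b in combinations(sorted(r1), 2):
        s1 = (r1[a] > r1[b]) - (r1[a] < r1[b])   # -1, 0 (tie), or 1
        s2 = (r2[a] > r2[b]) - (r2[a] < r2[b])
        if s1 == s2:
            continue
        dist += p if (s1 == 0 or s2 == 0) else 1.0   # broken tie vs. reversal
    return dist

ranking_a = {"J1": 1, "J2": 1, "J3": 2}   # J1 and J2 tied
ranking_b = {"J1": 1, "J2": 2, "J3": 3}
print(kendall_tau_with_ties(ranking_a, ranking_b))  # 0.5 for the one broken tie
```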

  • Wang Xianwen, Yin Yixian, Geng Yu, Yu Qianqian, Zhang Guangyao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0171
    Online available: 2025-09-12

    [Objective] From the perspective of patents citing papers, this study explores the relationship between the disruptiveness of scientific papers and their technological impact, enriching research on the factors that influence the flow of scientific knowledge into the technological domain. [Methods] Using over 680,000 scientific papers published in the field of artificial intelligence, combined with patent citation data, we built a large-scale dataset. Applying regression models such as Probit, we conducted analyses across five dimensions: possibility, importance, universality, persistence, and time lag. [Results] The findings reveal a positive correlation between a paper's disruptiveness and the possibility of being cited by patents, indicating that disruptive science is more likely to generate technological impact. Meanwhile, highly disruptive scientific outputs yield more significant, universal, and persistent technological impacts, but with shorter time lags. [Limitations] The motivation and type of citations are not considered, and patent characteristics are not analyzed. [Conclusions] This study confirms a positive correlation between the disruptiveness of scientific papers and their technological impact and provides a theoretical foundation for policymaking aimed at accelerating the technological translation of scientific knowledge.
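
    A Probit regression of the kind mentioned above can be run with statsmodels; the snippet below is a minimal illustration on synthetic data with hypothetical variable names (cited_by_patent, disruptiveness, team_size), not the study's dataset.

```python
# Illustrative sketch: Probit model relating a paper's disruptiveness to whether it
# is cited by any patent (the "possibility" dimension). Data and columns are made up.
import pandas as pd
import statsmodels.api as sm

df = pd.DataFrame({
    "cited_by_patent": [0, 1, 0, 1, 0, 0, 1, 1],                 # binary outcome
    "disruptiveness":  [-0.2, 0.5, 0.4, 0.8, 0.3, -0.4, 0.6, -0.1],
    "team_size":       [3, 5, 2, 7, 4, 3, 6, 2],                  # example control variable
})
X = sm.add_constant(df[["disruptiveness", "team_size"]])
probit = sm.Probit(df["cited_by_patent"], X).fit(disp=False)
print(probit.summary())
```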

  • Yang Rui, Zhu Xuefang, Wang Zhenyu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1197
    Online available: 2025-09-12

    [Objective] Based on chain-of-thought prompts, we explore the usability and effectiveness of large language models in multimodal entity disambiguation tasks. [Methods] We construct a prompt template based on chain-of-thought prompting, feed prior knowledge and multimodal information into the large language model, and assist the model in determining, from the candidate entity set, the entity to which the mention actually refers. [Results] Experiments show that on the Wiki-MEL, Twitter-MEL, and Weibo-MEL datasets, the accuracy of the PLMED model improves by 15.1%, 11.5%, and 4.1%, respectively, over the current state-of-the-art models. [Limitations] The experiments did not explore in detail how the large language model's performance on multimodal entity disambiguation changes under different prompt construction methods. [Conclusions] A large language model guided by chain-of-thought prompts can adapt well to multimodal entity disambiguation tasks in different scenarios and shows great application potential for such tasks.
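
    The following sketch shows one plausible way to assemble a chain-of-thought disambiguation prompt from prior knowledge, multimodal context, and a candidate entity set; the field names and wording are hypothetical and are not the PLMED paper's actual template.

```python
# Hypothetical prompt builder: prior knowledge and an image caption (the multimodal
# signal) are placed before the candidates, and the model is asked to reason step by
# step before naming a single candidate.
def build_disambiguation_prompt(mention, sentence, image_caption, prior_knowledge, candidates):
    numbered = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return (
        "You are resolving which entity a mention refers to.\n"
        f"Mention: {mention}\n"
        f"Sentence: {sentence}\n"
        f"Image description: {image_caption}\n"
        f"Background knowledge: {prior_knowledge}\n"
        f"Candidate entities:\n{numbered}\n"
        "Let's think step by step: compare the mention's textual and visual context "
        "with each candidate, then answer with the number of the single best candidate."
    )

print(build_disambiguation_prompt(
    mention="Jordan",
    sentence="Jordan scored 40 points last night.",
    image_caption="A basketball player in a red jersey shooting a ball.",
    prior_knowledge="Michael Jordan is a former NBA player; Jordan is also a country.",
    candidates=["Michael Jordan (basketball player)", "Jordan (country)"],
))
```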

  • Wang Zhenyu, Zhu Xuefang, Zhang Jundong, Yang Rui, Liu Songyin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1191
    Online available: 2025-09-12

    [Objective] To address users' increasingly diverse and personalized information needs in bibliographic search scenarios and to enhance the performance and user experience of question answering systems. [Methods] This paper constructs a conversational question answering system for bibliographic search (BSCQA). The system employs the model context protocol to integrate large language models with external databases. To improve the accuracy of Text-to-SQL generation, this paper designs and integrates a contrastive learning-based example selection strategy to enhance the model's understanding of query intent in specialized domains. [Results] Experimental results on the bibliographic search semantic parsing dataset constructed in this paper demonstrate that, compared to the zero-shot scenario, the proposed approach improves the execution accuracy of DeepSeek-V3 by 18.1% in the 5-shot scenario. [Limitations] Due to the limited coverage of the experimental dataset, the system's adaptability in cross-domain applications still requires further improvement. [Conclusions] The proposed BSCQA system showcases the potential and application value of large language models in intelligent bibliographic retrieval, a typical application scenario within the library and information science domain, and provides a reference for the research and application of conversational question answering systems in other vertical domains.
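
    As a rough illustration of similarity-based demonstration selection for few-shot Text-to-SQL, the sketch below picks the k most similar example questions by cosine similarity and assembles a prompt; the embeddings, schema, and examples are placeholders, and the contrastive training of the encoder itself is omitted.

```python
# Minimal sketch: assume question vectors come from some (e.g. contrastively tuned)
# encoder; here random vectors stand in for them.
import numpy as np

def select_demonstrations(query_vec, example_vecs, examples, k=5):
    """Return the k (question, SQL) pairs whose questions are closest in cosine similarity."""
    q = query_vec / np.linalg.norm(query_vec)
    e = example_vecs / np.linalg.norm(example_vecs, axis=1, keepdims=True)
    top = np.argsort(e @ q)[::-1][:k]
    return [examples[i] for i in top]

def build_prompt(schema, demonstrations, question):
    shots = "\n\n".join(f"Question: {q}\nSQL: {sql}" for q, sql in demonstrations)
    return f"Schema:\n{schema}\n\n{shots}\n\nQuestion: {question}\nSQL:"

rng = np.random.default_rng(0)
pool = [("Which papers did author X publish in 2020?", "SELECT ..."),
        ("Count books held by branch Y.", "SELECT COUNT(*) ...")]
vecs = rng.normal(size=(len(pool), 8))          # placeholder question embeddings
demos = select_demonstrations(rng.normal(size=8), vecs, pool, k=2)
print(build_prompt("papers(id, title, year, author)", demos, "List titles from 2021."))
```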

  • Wang Zhenyu, Zhu Xuefang, Zhang Jundong, Yang Rui, Liu Songyin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0003
    Online available: 2025-09-12

    [Objective] To address users' increasingly diverse and personalized information needs in bibliographic search scenarios and to enhance the performance and user experience of question answering systems. [Methods] This paper constructs a conversational question answering system for bibliographic search (BSCQA). The system employs the model context protocol to integrate large language models with external databases. To improve the accuracy of Text-to-SQL generation, this paper designs and integrates a contrastive learning-based example selection strategy to enhance the model's understanding of query intent in specialized domains. [Results] Experimental results on the bibliographic search semantic parsing dataset constructed in this paper demonstrate that, compared to the zero-shot scenario, the proposed approach improves the execution accuracy of DeepSeek-V3 by 18.1% in the 5-shot scenario. [Limitations] Due to the limited coverage of the experimental dataset, the system's adaptability in cross-domain applications still requires further improvement. [Conclusions] The proposed BSCQA system showcases the potential and application value of large language models in intelligent bibliographic retrieval, a typical application scenario within the library and information science domain, and provides a reference for the research and application of conversational question answering systems in other vertical domains.

  • Cai Mouxi, Sun Haichun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0278
    Online available: 2025-09-12

    [Objective] To address the low accuracy of multi-step reasoning in Visual Question Answering (VQA) tasks, this study integrates external knowledge and optimizes Large Language Model (LLM) prompting methods. [Methods] A new knowledge-based visual question answering model is proposed, which helps the LLM understand image region information by introducing the PromptCap captioning model to generate image captions, and integrates open-source knowledge and image semantic knowledge retrieval to broaden the sources of reasoning evidence. Meanwhile, a new question decomposition scheme is proposed to reduce reasoning difficulty, and relevant in-context examples are constructed, with multiple reasoning answers integrated to improve accuracy. [Results] The proposed model achieves accuracies of 67.2% and 64.8% on the OK-VQA and A-OKVQA datasets, respectively, outperforming current SOTA models. Ablation experiments also verify the effectiveness of each module in the model. [Limitations] The model is best suited to multi-step reasoning VQA scenarios, as the question decomposition module introduces additional computational and time overhead, reducing efficiency for simple VQA tasks. [Conclusions] By leveraging LLMs to optimize the KB-VQA framework, this work provides an effective novel approach for knowledge-based VQA requiring complex reasoning.

  • Chen Danlei, Hua Bolin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0420
    Online available: 2025-09-12

    [Objective] This study explores the mapping between syntactic forms and rhetorical functions of research gap sentences (RGS), integrating multi-level linguistic features to advance a novel framework for sentence-level knowledge extraction in academic texts.[Methods] Guided by the principles of validity, redundancy, and complementarity, we restructure syntactic complexity indices into a unified evaluative framework. Variance analysis and linear regression are conducted to examine differences in syntactic complexity across various types of research gap sentences. To address the automatic classification task, we fine-tune scientific language models to extract textual semantic features and adaptively integrate them with syntactic information through a gated fusion mechanism. Furthermore, a hybrid loss function that dynamically combines cross-entropy and Dice loss is introduced to improve the model’s performance on minority classes.[Results] There are significant inter-type distinctions among research gap sentences across multiple dimensions of syntactic complexity. Our gated fusion-based model outperforms state-of-the-art baselines by at least 1.44% in terms of F1 score, with ablation studies further demonstrating the necessity and rationality of its key components.[Limitations] This study focuses on mining research gap sentences from representative conference papers in the field of artificial intelligence, without extending the analysis beyond disciplinary boundaries. In addition, the relationship between different types of research gaps and measures of scientific innovation or impact has not yet been explored.[Conclusions] The proposed gated fusion-based model achieves accurate classification of fine-grained rhetorical functions in academic texts under low-resource conditions. It effectively supplements and reconciles distinct linguistic cues present at different representational levels, while maintaining a well-balanced trade-off among stability, robustness, and scalability.
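
    A hybrid cross-entropy/Dice objective of the general kind described above can be written compactly in PyTorch; the sketch below uses a fixed interpolation weight rather than the paper's dynamic combination, so it should be read as an assumption-laden illustration.

```python
# Illustrative sketch (not the authors' exact formulation): cross-entropy blended
# with a soft multi-class Dice loss, a combination often used to help minority classes.
import torch
import torch.nn.functional as F

def hybrid_ce_dice_loss(logits, targets, alpha=0.5, eps=1.0):
    """logits: (batch, num_classes); targets: (batch,) class indices; alpha weights CE vs Dice."""
    ce = F.cross_entropy(logits, targets)
    probs = torch.softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes=logits.size(-1)).float()
    # Soft Dice per class, averaged; eps smooths classes absent from the batch
    intersection = (probs * one_hot).sum(dim=0)
    dice = 1 - ((2 * intersection + eps) / (probs.sum(dim=0) + one_hot.sum(dim=0) + eps)).mean()
    return alpha * ce + (1 - alpha) * dice

logits = torch.randn(16, 4, requires_grad=True)
targets = torch.randint(0, 4, (16,))
loss = hybrid_ce_dice_loss(logits, targets)
loss.backward()
print(loss.item())
```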

  • Dai Wei, Zhu Xingce, Song Yang, Yang Xiao, Geng Xueyu, Ma Jingdong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0298
    Online available: 2025-09-12

    [Objective] To compare the effectiveness of different reasoning methods of large language models (LLMs) in policy intelligent question answering (Q&A) within the context of public health emergencies. [Methods] Taking DeepSeek-R1 as the experimental backbone, we implemented retrieval-augmented generation (RAG), knowledge graph (KG) enhanced reasoning, fine-tuning, reasoning with web search support, and self-contained reasoning without external data. All methods were benchmarked against human-annotated Q&A pairs. For the self-contained setting, we further compared the performance of Qwen-QwQ and GPT-4o. [Results] In the automatic evaluation, the combined approach of RAG, KG, and fine-tuning achieved the highest BLEU-4 and ROUGE-L scores (0.259 and 0.494, respectively), followed by the web search method (BLEU-4: 0.225; ROUGE-L: 0.465). In the manual evaluation, the combined approach received the highest score for content accuracy (3.560), while large-parameter models without external data support performed better in fluency, completeness, usability, and credibility. [Limitations] The experimental dataset was limited to publicly available policy texts and did not incorporate data from closed real-world systems. In addition, multimedia formats were not included, which precluded evaluation under multimodal conditions. [Conclusions] Supported by local data, different reasoning methods can substantially reduce hallucination and enhance accuracy in policy Q&A, demonstrating strong applicability to vertical domains such as public health emergencies. Incorporating external data further improves factual accuracy, whereas self-contained reasoning without external data tends to produce responses that are more complete and convincing.

  • Wang Zhenyu, Ping Yifang, Xiao Tong, Wang Jianmin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0197
    Online available: 2025-09-12

    [Objective] To improve the accuracy of learning behavior detection in online education and move educational intelligence from a result-oriented to a process-interpretable paradigm, this paper proposes a large language model-driven learner behavior detection model for university MOOC courses. [Methods] The model fuses large language models, retrieval-augmented generation, and interpretable artificial intelligence techniques to construct text-time-series data that incorporates learners' behavioral and emotional features. Textual features are extracted with a BERT model and converted into time-series data, which is used to train a LightGBM model; the SHAP method is then applied to quantify the marginal contribution of each feature to the model's predictions, making the prediction process interpretable. [Results] The proposed model performs excellently in learning behavior detection, with an accuracy of 99.90%, a recall of 99.78%, and an F1 score of 97.69%, all significantly better than the baseline models. Compared with the lowest-performing logistic regression model, the three indicators improved by 20.84%, 24.34%, and 21.94%, respectively, fully verifying the model's advantages in recognizing complex features. [Limitations] The data were obtained from a single online learning platform, and the sample granularity was coarse, limiting external generalization. [Conclusions] This study improves the accuracy and interpretability of the model by integrating interpretable artificial intelligence and multimodal features, providing decision support for online education platforms in colleges and universities.
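
    The LightGBM-plus-SHAP step described above follows a common pattern; the sketch below shows it on synthetic tabular features (the feature names in the comments are hypothetical, not the study's variables).

```python
# Minimal sketch: train a LightGBM classifier on tabular behavioral/emotional features
# and use SHAP to quantify each feature's marginal contribution to the prediction.
import numpy as np
import lightgbm as lgb
import shap

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))      # e.g. login_count, video_time, quiz_score, sentiment
y = (X[:, 0] + 0.5 * X[:, 3] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = lgb.LGBMClassifier(n_estimators=100).fit(X, y)
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)           # per-sample, per-feature contributions

sv = np.asarray(shap_values)
if sv.ndim == 3:                                 # some SHAP versions return one array per class
    sv = sv[..., -1] if sv.shape[-1] == 2 else sv[-1]
print(np.abs(sv).mean(axis=0))                   # mean |SHAP| per feature = global importance
```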

  • Yao Tianchen, Wang Hao, Li Peiqi, Bu Wenru, Yuan Ruiyang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0245
    Online available: 2025-09-12

    [Objective]This study aims to address the challenge of constructing the semantic relationship between "imagery" and "implication" in Chinese poetic wine culture, promoting the structured representation and automatic recognition of traditional culture, and providing technical support for the intelligent understanding and dissemination of cultural resources. [Methods]We propose the UOCIP model, which involves a four-stage process including corpus construction, imagery extraction, implication recognition, and knowledge graph construction. The model incorporates adversarial training mechanisms and AIGC-based data augmentation strategies to enhance performance in low-resource scenarios. [Results]Experimental results show that the Macro-F1 score of the imagery extraction model improved from 56.2% to 58.6%. The implication recognition model achieved a Macro-F1 score of 83.4% under a multi-modal attention mechanism. The constructed implication graph covers five types of terminological systems and supports the recognition of out-of-vocabulary terms. [Limitations]Despite the promising outcomes, the model still faces challenges in adapting to atypical poetic styles and in controlling the semantic boundaries of the graph. Its generalization ability and boundary precision require further improvement. [Conclusion]This study establishes an effective paradigm for semantic modeling of Chinese poetic wine culture under cold-start conditions, demonstrating the feasibility and applicability of the UOCIP model in the intelligent processing of traditional culture and offering theoretical and methodological references for future research.

  • Song Yuxin, Liu Lin, Wang Hailong, Liu Jing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0349
    Online available: 2025-09-12

    [Objective] The effect of retrieval granularity on the performance of Retrieval-Augmented Generation (RAG) is reviewed, and the trade-off between context integrity and precision at different granularities is analyzed. [Coverage] Keywords such as “Retrieval-Augmented Generation” and “Retrieval Granularity” were used to retrieve literature from 2020 to 2025 in databases including Google Scholar, ACM Digital Library, and CNKI; a total of 106 representative papers were ultimately selected for review. [Methods] RAG methods are categorized by retrieval granularity, and on this basis their technical paths, core mechanisms, innovations, and limitations are compared in depth. [Results] A research framework of “coarse-grained, fine-grained, and hybrid-grained” retrieval is established. The findings reveal a core trade-off: coarse granularity preserves context at the cost of noise, while fine granularity offers precision at the risk of semantic fragmentation. Consequently, the fusion and scheduling mechanisms of hybrid approaches are the main challenge. [Limitations] The review focuses primarily on text-based RAG methods, with less comprehensive coverage of multimodal RAG research involving images, audio, or video. [Conclusions] The advancement of RAG depends on smarter granularity selection and better information fusion. Future work should explore proposition-level retrieval, dynamic granularity selection, adaptive mechanisms, and the synergy of structured and unstructured knowledge.

  • Zhou Yuhao, Wang Jie, Zhang Shunxiang, Li Jiawei, Zhang Yongqi, Yang Junni
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0028
    Online available: 2025-09-12

    [Objective] To solve the problems of entity overlap and complex relations in topic-oriented sarcasm detection, this paper proposes a prompt learning model that integrates external knowledge-aware attention to improve detection accuracy. [Methods] Firstly, a topic-oriented prompt learning template is constructed based on the topic and comment text. Secondly, entities in the topic and comment text are identified and aligned with entities in the knowledge graph; the entities and their contexts are used as external knowledge to provide supplementary information. Then, an external knowledge-aware attention mechanism is designed to measure the importance of this knowledge. Finally, the mapping words are specified and the masks are predicted through the Verbalizer module. [Results] Experiments on the public ToSarcasm dataset show that the proposed model outperforms the compared state-of-the-art models, with an accuracy of 72.25% and an F1 score of 77.16%. [Limitations] This study did not use a learnable soft prompt method to construct the prompt template, and there is still room for optimization in prompt design and the selection of mapping words. In addition, only the ToSarcasm dataset was used for model training, so the generalization ability of the model remains to be improved. [Conclusions] The introduction of external knowledge effectively addresses entity overlap, prompt learning effectively handles complex relations, and the model improves the accuracy of topic-oriented sarcasm detection.

  • Sun Qinglin, Wang Xiaomei, Chen Ting, Song Xinyu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0257
    Online available: 2025-09-12

    [Objective] To better understand the trajectory of scientific development, this study proposes a general-purpose method for automatically classifying the research levels of academic papers, aiming to delineate level boundaries precisely, reveal disciplinary evolution patterns, and support research planning, resource optimization, and efficient knowledge transfer. [Methods] Building upon the framework developed by Boyack et al., we construct a cross-disciplinary research level classification scheme. A large language model is fine-tuned through supervised learning on annotated data, integrating deep semantic analysis and prompt engineering techniques to enable automated level identification. The model's effectiveness is further validated using a science structure map. [Results] The proposed method achieves an F1 score of 85.45% and an accuracy of 85.44% in the research level classification task, significantly outperforming the multinomial logistic regression baseline (62.71% and 62.00%, respectively), and demonstrates clearer level distinctions within the science structure map. [Limitations] Due to computational constraints, comparative experiments across multiple models were not conducted. [Conclusions] Supervised fine-tuning of large language models shows strong accuracy and robustness in research level classification tasks. Optimizing prompt design may further enhance performance, offering effective support for accelerating scientific knowledge translation and guiding discipline development.

  • Ma Weilu, Sun Tan, Zhao Ruixue, Xian Guojian
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0225
    Online available: 2025-07-08

    [Objective] To simplify scientific literature summarization and generate intuitive graph-based summaries for enhancing research efficiency.[Methods] Rice breeding-related papers were extracted from the PMC database, and 4,276 full-text-to-summary QA pairs were constructed. Optimal prompts and temperature coefficients were experimentally identified. The Qwen2.5-7B-Instruct large language model was fine-tuned using supervised datasets. The fine-tuned model was then integrated into the GraphRAG framework to generate graph-based summaries for individual papers. Global queries with the optimized prompt were subsequently executed in GraphRAG to produce textual summaries.[Results] Compared to baseline models, the proposed method achieved F1 score improvements of 44.16%, 61.36%, and 54.87% on ROUGE-1, ROUGE-2, and ROUGE-L, respectively. In a 5-point manual evaluation, the method outperformed baselines by an average of 1.78 points, with graph-based summaries demonstrating significantly enhanced intuitiveness.[Limitations] Hardware constraints limited the scale of the selected LLM, potentially restricting generative capability. Additionally, the GraphRAG framework exhibited prolonged index construction times, highlighting the need for efficient inference acceleration in practical applications.[Conclusions] Graph-enhanced retrieval-augmented generation technology effectively captures long-range implicit information in scientific papers, producing comprehensive textual summaries and hierarchically structured graph-based summaries. This methodology improves researchers’ reading efficiency and supports scientific productivity.

  • Yao Yuanzhang, Xu Jian
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0011
    Online available: 2025-07-04

    [Objective]This study aims to analyze the phenomenon of semantic differences of interdisciplinary terms across different fields and explore the underlying causes of these semantic variations.[Methods]We utilize pre-trained deep learning models to automate the identification and quantification of semantic differences in terms. A semantic difference degree indicator is designed to quantitatively measure the extent of these differences, and a co-occurrence analysis is conducted for the disciplines involved in the terms.[Results]The identification accuracy of semantic differences based on the pre-trained model reaches 0.8193, and the constructed measurement indicators effectively quantified semantic differences.[Limitations]The study is limited to the semantic differences of Chinese terminology, with a restricted scope in terms of the interdisciplinary range of the terms selected.[Conclusions] The main causes of semantic differences in interdisciplinary terms are identified as: specialization and fragmentation of disciplines, linguistic and contextual differences, hierarchical and abstract conceptualization, cognitive emphasis differences, and the influence of interdisciplinary intersection and integration. This provides new perspectives and methodologies for exploring the reasons behind terminological discrepancies and their relationships with disciplines.

  • Deng Hangyu, Tang Chuan, Pu Yunqiang, Ao Lijuan, Wang Wanjing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0993
    Online available: 2025-07-04

    [Objective] Given the characteristics of large volume, broad scope, and frequent colloquial expressions in U.S. Congressional hearing transcripts, this paper proposes a framework for automatically identifying China’s science and technology security risks. [Methods] Starting from the data features of the hearings and the actual needs of analysts, this study realizes and integrates modules such as text filtering, summary generation, and question-answering by utilizing large language models. [Results] Using the 118th Congress hearings as experimental texts, the F1 score for text filtering, ROUGE-Lsum for summary generation, and the risk point recall rate for the QA system reached 0.7751, 0.6032, and 0.7636 respectively, significantly outperforming the baselines. [Limitations] This method is primarily designed for U.S. Congressional hearing transcripts and needs further validation with more types of data to consider it a general approach. [Conclusions] The proposed method can assist researchers in better extracting technological security risks from U.S. Congressional sources and preparing corresponding strategies.

  • Zhang Shuangbao, Cheng Quan, Zeng Yan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1126
    Online available: 2025-07-04

    [Objective] To enhance the extraction of unstructured events from Chinese text, semantic association information across documents must be fully utilized. [Methods] This study proposes a Chinese document-level event extraction model (CSDEE) that uses an attention mechanism to construct a cross-document interactive semantic network, with the objective of enhancing entity recognition performance. The event extraction task is then completed through document encoding and decoding of event extraction information. [Results] Experimental results demonstrate that the CSDEE model attains 80.7%, 84.1%, and 82.3% in accuracy, recall, and F1 score for event extraction, respectively, outperforming existing baseline models. The ablation experiments on the model and the generalization experiments on the public ChFinAnn and DuEE-fin datasets further substantiate the model's efficacy in Chinese document-level event extraction tasks. [Limitations] At present, the model only enhances the performance of document event extraction and has not yet addressed multi-classification of overlapping event types. [Conclusions] A comprehensive exploration of the parallel semantic information inherent in document-level data has the potential to enhance the precision of document event extraction.

  • XIE Wei, XIA Hongbin, LIU Yuan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1132
    Online available: 2025-07-04

    [Objective] This study aims to utilize deep learning methods to address the current issue of insufficient utilization of complete entity and relation interaction information in zero-shot relation extraction tasks. [Methods] We propose a Joint Contrastive Learning model (JCL) for zero-shot relation extraction, which integrates entity and relation information based on contrastive learning. Firstly, data augmentation techniques are applied to the original input text to enhance the model's effective information. Secondly, an enhanced cross-attention module is used to deeply integrate entity pairs and jointly process relations, extracting interaction information between entities as well as between entities and relational semantics, thereby amplifying the subtle differences of various relations in the embedding space. Finally, the model is optimized using a combination of cross-entropy loss and contrastive loss. [Results] Compared with the baseline model, the proposed approach achieves improvements on the FewRel dataset with unseen relations: an F1 score increase of 3.12% for m=5, 5.19% for m=10, and 1.99% for m=15. On the Wiki-ZSL dataset, improvements are 7.05% for m=5, 3.42% for m=10, and 8.08% for m=15. [Limitations] The study is limited by the relatively homogeneous and small number of datasets used in this research field. [Conclusions] The proposed Joint Contrastive Learning model for zero-shot relation extraction demonstrates advanced performance on three public datasets, showcasing its efficacy for this specific task.

  • Shengli Zhou, Rui Xu, Tinggui Chen, Shaojie Wang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1138
    Online available: 2025-07-04

    [Objective] To address the insufficient characterization of multimodal features in the AI face-swap fraud process, this study establishes a face-swapping fraud risk identification model (FSFRI) that synergistically integrates multimodal features to optimize victimization risk assessment. [Methods] By comprehensively considering the generation and propagation processes of AI face-swapping fraud, FSFRI extracts four types of features: fake face video frames, traffic composition description features, traffic payload data features, and traffic temporal features. Through a feature fusion module, it achieves complementary integration of cross-modal features; finally, via the risk identification module, FSFRI effectively detects and identifies deception risks. [Results] On the dataset generated through simulation experiments, FSFRI achieved good identification performance, with an F1 score of 0.92. It also demonstrated strong robustness in low-noise environments (noise levels ranging from 0 to 0.2), with the F1 score decreasing by only 0.019 at a noise ratio of 0.2. [Limitations] Because the use of multimodal features increases FSFRI's complexity, the model places higher demands on computational resources, and its risk identification effectiveness in high-noise environments remains to be further enhanced. [Conclusions] FSFRI can effectively extract and integrate the multimodal features generated in the process of AI face-swapping fraud and precisely identify AI face-swapping fraud victimization risks.

  • Ma Yingxue, Gan Mingxin, Hu Lei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1235
    Online available: 2025-07-04

    [Objective] To address the issue that deep learning recommendation methods lack modeling of user interest distribution characteristics and cannot fully capture user preferences, a sequential recommendation method based on modeling the aggregative and hierarchical distribution characteristics of user interests is proposed. [Methods] Using attention network and LSTM, representation vectors of users and items are obtained from behavioral sequences, and the positional centers and boundary radii of user interest distributions are learned. The hierarchy and aggregation of interest distribution are characterized by two radii. User preferences are predicted by fitting the distance between candidate item features and the distribution center of user interest to interaction probability. Recommendations are generated by fusing behavior predictions based on neural networks with preference estimation based on interest model. [Results] Experimental results on Amazon dataset demonstrate that compared to the best-performing baseline, the proposed method achieves optimal performance in terms of precision, recall, F-score, coverage and other evaluation metrics, with performance improvements exceeding 10 percentage points. [Limitations] User generated content besides behavior sequence is not considered. Future work can improve interest modeling by integrating user comments and other information. [Conclusions] This method can accurately describe the distribution characteristics of user interest, improve the accuracy of recommendation, and optimize the comprehensive quality of recommendation results.

  • Sun Mengge, Wang Yanpeng, Fu Yun, Liu Xiwen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1192
    Online available: 2025-07-04

    [Objective] This study explores the construction of prompt engineering methods for large language models in the task of multi-domain scientific knowledge entity extraction, using scientific short texts as experimental data. The aim is to address the challenges posed by insufficient semantic context and domain diversity in short text entity extraction.[Methods] To tackle the issues of wide domain coverage, condensed semantics leading to insufficient contextual information, and ambiguous entity boundaries in short texts, this study proposes a Scientific Prompt-based entity extraction strategy grounded in knowledge-prompt learning. By integrating the BERTopic method, the strategy dynamically incorporates domain knowledge into prompt design to enhance the semantic understanding and recognition capabilities of large language models, thereby improving extraction accuracy and generalization.[Results] Experimental results demonstrate that under the Scientific Prompt strategy, the F1-Value scores of QWEN2.5-7B, QWEN2.5-7B (fine-tuned), and GPT-4o models are 0.6526, 0.7407, and 0.7878, respectively. In contrast, the Zero-Shot F1-Values for the same models are 0.5534, 0.6165, and 0.6822, respectively. The results indicate that the Scientific Prompt strategy significantly outperforms fine-tuning in open-source models (0.6526 vs 0.6165), with the fine-tuned QWEN2.5-7B model under the prompt strategy slightly surpassing the performance of GPT-4o (0.7407 vs 0.6822).[Limitations] This study only evaluates the proposed strategy on Chinese scientific intelligence short texts, and its applicability to English texts remains untested.[Conclusions] The experiments demonstrate that the Scientific Prompt strategy can significantly enhance the performance of large language models in short text, multi-domain entity extraction tasks without requiring parameter updates. Its effectiveness in unsupervised scientific short texts is also validated, enabling accurate extraction of scientific entities to monitor technological trends. This research provides an important reference for knowledge entity extraction in general scientific short text tasks.

  • Zhang Xiaojuan, Ji Ruyi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0018
    Online available: 2025-07-04

    [Objective] This paper proposes a global citation recommendation framework based on both static and dynamic heterogeneous graphs, aiming to enhance the accuracy of citation recommendations. [Methods] The paper first constructs a static weighted heterogeneous network and a temporal heterogeneous network separately. For the static heterogeneous network, mixed random walks and the skip-gram model are used to generate node embeddings that capture local and global network information. For the temporal network, meta-path instances are first generated through meta-path-based random walks, and the temporal evolution process is then modeled on the heterogeneous graph to generate node embeddings. The final embeddings of paper nodes are produced using joint and separate training methods, and candidate citation lists are generated for input papers by calculating the similarity between these final embeddings. [Results] Experimental results show that the proposed methods outperform those that consider only the dynamic or only the static information of the network; the independent training method performs best on almost all recall metrics (except Recall@40); and the uncertainty-based multi-task weighting method achieves the best MRR and MAP, with values of 0.308 and 0.297. [Limitations] The performance of the proposed model has not been verified across multiple datasets, and the running efficiency of the model still needs to be further optimized. [Conclusions] Considering both the static and dynamic aspects of the network can effectively enhance the performance of global citation recommendation.

  • Xu Jianmin, Wang Li, Zhang Xiongtao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0730
    Online available: 2025-07-04

    [Objective] Existing research on social sequential recommendation tends to introduce friend information that is dissimilar to the user's interests and fails to consider differences in the degree of social influence among users, resulting in limited recommendation performance. To address these shortcomings, an adaptive social sequential recommendation method based on a graph attention network is proposed. [Methods] First, a self-attention mechanism is used to model the user's behavior sequence and obtain a dynamic interest representation. Second, a regularization strategy is designed to constrain the graph attention network to aggregate all friend features and accurately model the user's social interest representation. Finally, an attention-based adaptive fusion method is proposed to integrate dynamic interests and social interests and generate recommendation results. [Results] Compared with mainstream baseline models, the proposed method achieves up to a 10.8% improvement on HR@10 and a 5.3% improvement on NDCG@10. [Limitations] The proposed method depends heavily on the structure of the social network, and its performance improvement is not significant when social relationship data are sparse. [Conclusions] The proposed method utilizes social information more comprehensively, predicts user behavior effectively, and improves recommendation performance.

  • Xing Bowen, Chai Mengdan, Xiang Zhuoyuan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0792
    Online available: 2025-07-04

    [Objective] Given that the abstract of a judicial decision must be consistent with the original text in terms of the facts of the case, the application of law, and other elements, we propose a method for generating abstracts of Chinese judicial decisions that embeds a factual consistency assessment of judicial elements. [Methods] First, we define the principles and methods for determining the factual consistency of judicial decision summaries; second, we determine preprocessing processes such as data augmentation and factual consistency error correction and assessment; then, we construct the segmented extraction model and the generative summary model that introduces a knowledge graph of judicial elements, and carry out experiments on the CAIL2020 dataset. [Results] The summaries generated by the FC-JDSM model achieved 67.98%, 55.40%, 64.14%, 78.5%, and 90.01% on the ROUGE-N (N=1, 2, and L), SRO, and EM-FCJS metrics, respectively, outperforming the comparison models. The ablation experiments confirm the effectiveness of chunk extraction and the introduction of factual information. [Limitations] The data obtained from the data augmentation procedure in EM-FCJS deviate somewhat from real data. [Conclusions] Incorporating judicial elements into consistency assessment and abstract generation improves the consistency of abstracts of Chinese judicial judgment documents, which is conducive to the impartiality of judicial work.

  • Congjing Ran, Qunzhe Ding, Yonghui Song, Fuxin Wang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1171
    Online available: 2025-07-04

    [Objective] To address the challenge of distinguishing substantive patent transactions in patent transfer data, this study proposes a systematic approach that integrates multiple methods based on the Levenshtein distance algorithm. This approach effectively identifies substantive patent transactions and explores their technical characteristic differences. [Methods] A screening process is proposed for different patent transfer scenarios. One of the key steps involves using multiple text similarity algorithms based on Levenshtein distance to calculate the similarity scores of the names and addresses of the parties involved in the transaction. These scores are then combined with a set threshold to exclude non-market-based transaction records related to internal resource reallocation. At the same time, the accuracy of the method is validated through empirical research, and statistical analysis is used to compare the differences in technical indicators across different transaction types. [Results] The experimental results show that this method achieves an accuracy of 81.27% and is effective in identifying patent behaviours that involve substantive transactions. Patents that undergo substantive transactions have significantly higher technical indicators, such as the number of independent claims, the number of family patents, and the number of times cited, than patents that do not (p < 0.05). [Limitations] The dataset's temporal scope is restricted, and the model's adaptability to complex address structures requires further refinement to improve generalizability. [Conclusions] This study establishes an effective and scalable methodology for classifying substantive patent transaction behaviours, offering valuable data support for advancing research in technology transfer and patent commercialization.
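
    The name/address screening step can be illustrated with a small Levenshtein-based similarity check; the thresholds, field names, and example records below are assumptions for demonstration, not the study's calibrated values.

```python
# Minimal sketch: normalized Levenshtein similarity between assignor and assignee
# names/addresses, used to flag transfers that look like internal reallocation
# rather than substantive market transactions.
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    a, b = a.lower().strip(), b.lower().strip()
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

def is_substantive(assignor_name, assignee_name, assignor_addr, assignee_addr,
                   name_threshold=0.85, addr_threshold=0.85):
    """Flag a transfer as substantive only when both names and addresses differ enough."""
    return (similarity(assignor_name, assignee_name) < name_threshold and
            similarity(assignor_addr, assignee_addr) < addr_threshold)

print(is_substantive("Acme Robotics Co., Ltd.", "Acme Robotics (Shanghai) Co., Ltd.",
                     "12 Innovation Rd, Shanghai", "12 Innovation Rd, Shanghai"))  # False: internal
```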

  • Li Yihong, Yu Yanfang, Yu Qiwei, Li Sujuan, Zhang Shaolong, Ye Junjun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0239
    Online available: 2025-07-04

    [Objective] Large language generation models have brought new ideas to Chinese open relation extraction, but how to optimize the quality of the relation extraction results generated by these models has become an important issue. [Methods] This paper proposes a low-cost large-model fine-tuning method based on multi-dimensional self-reflective learning enhancement (SRLearn). It automatically guides the model to engage in multi-dimensional self-reflective learning, thereby optimizing the quality of the model's Chinese relation extraction output. [Results] Compared with the LoRA+DPO fine-tuning method, the SRLearn method improves performance by 15 percentage points on the WikiRE1.0 dataset and 6.5 percentage points on the DuIE2.0 dataset, validating the effectiveness of this approach. [Limitations] The SRLearn method needs to cover more generation quality issues in the future. [Conclusions] The large-model fine-tuning method based on multi-dimensional self-reflective learning can greatly improve the generation quality of Chinese relation extraction.

  • Su Yanyuan, Dong Xiaoyu, Han Cuijuan, Zhang Yaming
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1157
    Online available: 2025-07-04

    [Objective] A federated learning framework embedded with dual-channel attention convolution is designed to solve the difficult problem of cross-social-network feature extraction caused by privacy protection restrictions and to identify social robot accounts accurately. [Methods] Firstly, the federated learning framework is adopted to integrate data across social networks. Secondly, a dual-channel attention convolution mechanism is introduced into the local model module to mine data features comprehensively. Thirdly, with the help of a basic convolutional neural network and blockchain, the local model parameters are integrated in the federated aggregation module to obtain and securely store the optimal model parameters. [Results] Experimental results on the TwiBot-20&Weibo-bot dataset show that the accuracy, precision, recall, and F1 value of the FL-DCACNN model reach 91.63%, 97.10%, 97.14%, and 96.88%, respectively, and the model shows strong generalization ability. [Limitations] The multi-modal feature extraction considers only structured data, text data, and picture data, without video or audio data. [Conclusions] The FL-DCACNN model can effectively solve the problem of poor social robot recognition caused by insufficient feature extraction and single data sources under data privacy constraints, further improving recognition performance and realizing accurate recognition of social robots.

  • Zhong Ming, Qian Qing, Zhou Wei, Wu Sizhu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.0461
    Online available: 2025-07-04

    [Objective] In view of the centralized storage, data security risks, limited computing resources, and urgent user analysis and utilization needs of the National Population Health Data Center (NPHDC), this study explores construction approaches suitable for an NPHDC data enclave, so as to provide users with a more efficient, secure, and flexible data processing and analysis environment. [Methods] The types, characteristics, implementation mechanisms, and applicable scenarios of different data enclaves were summarized. Combined with the data application characteristics of the NPHDC, a big data analysis platform for the NPHDC was built based on a virtual enclave approach integrating security enhancement, micro-isolation, and artificial intelligence technologies. [Results] The big data analysis platform supports services such as data review, data processing, data analysis and mining, and peer review of the data associated with users' published papers in the NPHDC. It has completed review tasks for 32,000 datasets from more than 2,800 projects, more than 10,000 data analysis tasks, and more than 5,000 data processing tasks, with a data leakage rate of 0% and a resource utilization rate of 80%. [Limitations] Cross-institutional data sharing with decentralized storage is not yet possible; data enclave research combining privacy-preserving technologies such as secure multi-party computation and federated learning should be explored in line with the development of the NPHDC. [Conclusions] Effectively meeting the needs for secure sharing and collaborative analysis of centralized population health data is of great significance for the security, sharing, and utilization of national population health scientific data.

  • Yi Haohan, Wang Hao, Zhou Shu, Zheng Xuhui, Zhou Zhengda
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0001
    Online available: 2025-07-04

    [Objective] To address the challenges of named entity recognition (NER) in ancient texts caused by linguistic complexity, diversity, and the scarcity of annotated data. [Methods] We propose a novel RAG-LATS framework that integrates a knowledge base of ancient texts with AI-Search-driven retrieval-augmented generation (RAG). By incorporating the generation, retrieval, reflection, and revision mechanisms of the LATS framework, we enhance the zero-shot NER performance of large language models in the domain of ancient texts. [Results] Experimental results on the CHisIEC public dataset demonstrate that our method outperforms domain-specific fine-tuned models. Specifically, it achieves a 14.44 percentage point improvement in Micro F1 score compared to the Xunzi-Qwen1.5-7B_chat model, and a 16.99 percentage point improvement over the general-purpose Qwen1.5-7B_chat model. [Limitations] The prompt construction method needs further optimization, and the computational complexity of the LATS framework may affect efficiency in large-scale data scenarios. [Conclusions] Retrieval-augmented generation effectively enhances the domain knowledge of large language models, while the LATS framework optimizes the accuracy and coherence of model outputs. Together, these advancements significantly improve the performance of large language models in zero-shot NER tasks for ancient texts.

  • Li Hongmin, Yang Wenhao, Ma Hongyang, Wang Jianzhou
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1180
    Online available: 2025-07-04

    [Objective] Considering the incompleteness of urban carbon emission information, the multiplicity of its characteristics, and the complexity of emission patterns, a comprehensive portrayal of the complex dynamic process of carbon emission is crucial for improving forecasting accuracy. [Methods] A multi-source heterogeneous temporal convolutional carbon emission forecasting model, HOSVD-TCN, that fuses key information granularity is proposed. Firstly, the original granularity information is captured using automatic extraction techniques; secondly, social media text is processed with natural language processing to derive sentiment values for the key information granules. High-quality tensor representations are generated by high-order singular value decomposition (HOSVD) and reconstruction of the heterogeneous information, and the reconstructed carbon emissions are used as inputs to the forecasting model. Finally, a temporal convolutional network (TCN) is used to forecast carbon emissions. [Results] The experimental results show that the proposed model's average MAPE across the three cities is only 6.96%, outperforming other mainstream comparison models. [Limitations] The complexity of multimodal data processing is high, and forecasting effectiveness is limited by the size of the available dataset. [Conclusions] HOSVD-TCN combines the feature extraction capability of HOSVD with the spatio-temporal capture capability of the TCN, realizing accurate forecasting of urban carbon emissions and providing powerful technical support and a scientific basis for urban planning and management.

  • Zhang Zhengang, Yu Chuanming
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0049
    Online available: 2025-07-04

    [Objective] Based on modeling papers and their attributes (e.g., authors, publication venues) as a knowledge graph, this study aims to enhance the performance of citation count prediction for newly published papers by fine-grained aggregation of the temporal evolution of paper attribute data.[Methodology] We propose a citation count prediction model that incorporates temporal evolution and fine-grained information aggregation. The model consists of four key modules: (1) a graph neighborhood feature aggregation module, which extracts feature representations of academic entities in the knowledge graph; (2) a temporal evolution representation module, which captures the temporal dynamics of paper attribute data; (3) a fine-grained information aggregation module, which leverages a multi-head attention mechanism to aggregate the influence of different attributes on papers; and (4) a prediction module, which outputs citation count predictions. The proposed model is evaluated on the DBLP dataset through empirical studies.[Results] On the DBLP dataset, the proposed model achieves MALE, RMSLE, and R² scores of 0.5141, 0.7098, and 0.3470, respectively, significantly outperforming existing state-of-the-art methods.[Limitations] Due to space constraints, this study only evaluates the model on the DBLP dataset. Future work will focus on validating the model's generalizability across additional datasets.[Conclusion] The proposed model demonstrates superior performance compared to state-of-the-art methods on the DBLP dataset. This study highlights the effectiveness of leveraging the temporal evolution of paper attributes and fine-grained information aggregation to improve citation count prediction for newly published papers.

  • Zhao Guangyu, Duan Yongkang, Geng Qian, Yan Yan, Jin Jian
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0214
    Online available: 2025-07-04

    [Objective] Existing pre-trained models exhibit embedding anisotropy and limited domain generalization in government question retrieval, resulting in issues such as low recall, incomplete coverage, reduced accuracy, and suboptimal user experience. To enhance the effectiveness and efficiency of government question retrieval, this paper presents GovSQR, a fine-grained government similar question retrieval model. [Methods] GovSQR leverages structured prompt engineering and few-shot examples to guide a large language model in generating task-specific positive and negative samples. The RoBERTa model is subsequently fine-tuned using supervised SimCSE on the generated triplet data. A dynamic weighted masking mechanism and debiased contrastive loss function are introduced to reduce false negative interference in semantic representations. [Results] Evaluation on a Shenzhen government question dataset shows GovSQR achieves P@1, R@3, and MRR scores of 0.9660, 0.9811, and 0.9729, respectively, outperforming leading contrastive learning models such as InfoCSE and DiffCSE. [Limitations] The data generation process is prone to hallucination, necessitating costly manual verification. Additionally, the model's efficacy on semantically complex or ambiguous queries remains to be further validated. [Conclusions] By combining data augmentation with false negative debiasing, GovSQR learns more discriminative and uniformly distributed embeddings, significantly improving government similar question retrieval accuracy and effectively supporting intelligent government services.
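
    A minimal sketch of the supervised SimCSE-style objective on generated (anchor, positive, negative) triplets is given below; the encoder is abstracted away as precomputed sentence embeddings, and the temperature, batch size, and function name are illustrative assumptions rather than GovSQR's actual code (the dynamic weighted masking and debiased loss terms are not reproduced).

    # Triplet InfoNCE loss in the style of supervised SimCSE (illustrative sketch).
    import torch
    import torch.nn.functional as F

    def simcse_triplet_loss(anchor, positive, negative, temperature=0.05):
        """Contrastive loss with one positive per anchor and in-batch hard negatives."""
        anchor = F.normalize(anchor, dim=-1)
        positive = F.normalize(positive, dim=-1)
        negative = F.normalize(negative, dim=-1)
        pos_sim = (anchor * positive).sum(-1, keepdim=True) / temperature   # (B, 1)
        neg_sim = anchor @ negative.T / temperature                         # (B, B)
        logits = torch.cat([pos_sim, neg_sim], dim=1)
        labels = torch.zeros(anchor.size(0), dtype=torch.long)              # positive at index 0
        return F.cross_entropy(logits, labels)

    # Hypothetical RoBERTa sentence embeddings for a batch of generated triplets.
    a, p, n = (torch.randn(16, 768) for _ in range(3))
    loss = simcse_triplet_loss(a, p, n)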

  • Wang Xing, Yuan Weihua, Meng Guangting, Chen Yu, Zong Chen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1221
    Online available: 2025-07-04

    [Objective]To disentangle users’ diverse intents and capture rich node information in bundle recommendation, this paper proposes D2CBR, a bundle recommendation model based on disentanglement-aware dual-channel contrastive learning. [Methods]The local multi-view intention disentanglement module maps node representations into the latent space to obtain disentangled representations. The global hypergraph unified learning module integrates multi-type data and captures high-order correlations. The dual-channel collaborative learning module uses contrastive learning to achieve collaborative learning between the two channels. [Results]On public datasets, D2CBR demonstrates significant performance advantages. Compared with the state-of-the-art baselines, the average performance improvement reaches 2.87%, with a maximum of 6.43%. [Limitations] Hypergraph operations, such as constructing the incidence matrix, scale with the number of nodes in the graph; on extremely large-scale datasets they may incur substantial memory and computational overheads, which can limit their application in scenarios with limited computational resources. [Conclusions]In this paper, a graph variational autoencoder is used to distinguish diverse user intentions, and a hypergraph is utilized to integrate multi-type data, which significantly improves recommendation performance. The model surpasses the state-of-the-art baselines on public datasets, demonstrating its effectiveness and robustness.
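
    To illustrate how an incidence matrix drives high-order propagation, the sketch below performs one step of standard hypergraph convolution on a toy graph; the incidence matrix, embedding size, and function name are hypothetical, and the disentanglement and contrastive components of D2CBR are not shown.

    # One step of hypergraph propagation via the incidence matrix (illustrative sketch).
    import numpy as np

    def hypergraph_propagate(X, H):
        """X' = Dv^-1/2 H De^-1 H^T Dv^-1/2 X, the usual hypergraph smoothing step."""
        dv = H.sum(axis=1)                                    # node degrees
        de = H.sum(axis=0)                                    # hyperedge degrees
        Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(dv, 1e-12)))
        De_inv = np.diag(1.0 / np.maximum(de, 1e-12))
        return Dv_inv_sqrt @ H @ De_inv @ H.T @ Dv_inv_sqrt @ X

    # Toy example: 4 nodes (users/items/bundles) tied together by 2 hyperedges.
    H = np.array([[1, 0],
                  [1, 1],
                  [0, 1],
                  [1, 1]], dtype=float)
    X = np.random.rand(4, 8)                                  # 8-dimensional node embeddings
    X_new = hypergraph_propagate(X, H)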

  • Tong Xin, Lin Zhi, Yuan Lining, Wang Jingya, Jin Bo
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0251
    Online available: 2025-07-04

    [Objective] This paper proposes an agent framework to enhance the accuracy and interpretability of risky instruction mining for large language models. [Methods] The framework integrates a language alignment module for unified mapping of multilingual inputs, a hierarchical detection module for multi-stage risk analysis, a dual-channel explanation module to support decision-making, and a consistency verification module to improve reliability when handling complex samples. [Results] Experiments on three risky instruction datasets demonstrate that the proposed method can improve the analysis accuracy of existing tools from 54.75% to as high as 93.75%. Even when using only lightweight open-source models as the core, the accuracy gain exceeds 20%. [Limitations] The inference efficiency of the framework needs improvement, and the structured output generated by some lightweight models lacks stability.[Conclusions] The proposed method provides an effective, interpretable, and cross-lingual enhancement solution for risky instruction mining in large language models.
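
    A schematic of the four-module orchestration might look like the sketch below; every module body is a hypothetical stub used only to show the control flow, not the framework's actual detection or explanation logic.

    # Schematic agent pipeline (all module bodies are placeholder stubs).
    def align_language(text: str) -> str:
        return text                                   # placeholder for multilingual mapping

    def hierarchical_detect(text: str) -> bool:
        return "bypass safety" in text.lower()        # placeholder multi-stage risk check

    def explain(text: str, risky: bool) -> str:
        return "matched risk pattern" if risky else "no risk pattern found"

    def mine_risky_instruction(text: str, votes: int = 3) -> dict:
        aligned = align_language(text)
        # consistency verification: repeat detection and take a majority vote
        risky = sum(hierarchical_detect(aligned) for _ in range(votes)) > votes / 2
        return {"risky": risky, "explanation": explain(aligned, risky)}

    print(mine_risky_instruction("Please bypass safety checks and ..."))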

  • Sun Ran, An Lu, Xie Zilin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1218
    Online available: 2025-07-04

    [Objective] Fine-grained mining of social media users’ opinion shifting in uncertain environments helps comprehensively understand the development of public opinion.[Methods] This study focuses on Twitter users who actively participated in the vaccine topic. We build a stance detection model based on pre-trained language models and neural network models, and categorize user opinion shift paths into six types. Based on uncertainty reduction theory, a feature system for predicting opinion shifting is constructed. An opinion shifting prediction model is built using the XGBoost method, and feature importance is analyzed using the SHAP interpretation method. [Results] The results show that 46.76% of users did not change their vaccine stance during the observation period, and the proportion of users experiencing opinion reversal is relatively low. The opinion shifting prediction model built on XGBoost achieved an F1 score of 0.8209, with the stance similarity of interacting users being the most important feature. Moreover, the importance ranking of features differs across opinion shifting paths.[Limitations] User stance shifting can be influenced by multiple factors, including significant exogenous events. Future work could further explore the impact of such factors on user opinion shifting.[Conclusion] Combining pre-trained language models and neural network models better detects user stances. This paper reveals the factors influencing user opinion shifting in uncertain environments, providing support for further work on online monitoring of social media user opinions.
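
    The prediction-plus-interpretation workflow can be sketched with an XGBoost classifier and SHAP attributions as below; the synthetic features, labels, and hyperparameters are placeholders, and the setup is simplified to a binary label rather than the six shift paths studied in the paper.

    # XGBoost prediction with SHAP-based feature importance (illustrative sketch).
    import numpy as np
    import shap
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    X = rng.random((500, 10))                        # e.g. uncertainty-reduction features
    y = rng.integers(0, 2, 500)                      # e.g. shifted vs. unshifted stance

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    model.fit(X_tr, y_tr)

    explainer = shap.TreeExplainer(model)            # per-feature contribution to predictions
    shap_values = explainer.shap_values(X_te)
    importance = np.abs(shap_values).mean(axis=0)    # global importance ranking
    print(importance.argsort()[::-1])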

  • Xiang Shuxuan, Mao Jin, Li Gang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0096
    Online available: 2025-07-03

    [Objective] Existing methods overlook the relation between commercialization potential and patenting strategy in proxy selection, feature construction, and model structure design. This article proposes a new method for predicting patent commercialization potential. [Methods] Patent maintenance is used as the proxy for commercialization potential, and an LSTM+MTNN model is proposed. The model comprises a feature processing module and a multi-task prediction module. The feature processing module uses BERT+SimCSE and LSTM to obtain a refined continuous feature of the patent claims, which is concatenated with numerical features as the input of the multi-task prediction module. The multi-task prediction module is constructed based on the connections between legal events and commercialization potential; it is formed by a shared bottom, a commercialization potential prediction tower, and a legal event prediction tower, and its final output includes both the legal event prediction and the commercialization potential prediction. [Results] The experimental results show that the selected numerical features are effective for commercialization potential prediction. In addition, LSTM+MTNN achieves better accuracy, precision, and F1 score than the baseline models on three datasets. [Limitations] The utilization of patent text still needs further research, and methods for representing and predicting patents’ commercialization potential under a changing technology environment remain to be explored. [Conclusions] Besides numerical features, LSTM+MTNN adds the continuous feature of patent claims to the input, enriching the input information. LSTM+MTNN exploits the inner connections between legal events and patent commercialization potential through its multi-task structure, enabling the model to learn the relationship between the two tasks. Both techniques prove helpful for model optimization and make the proposed method effective for patent commercialization potential prediction.
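
    A minimal shared-bottom multi-task network with two prediction towers, in the spirit of the structure described above, is sketched below; the dimensions, random inputs, and binary class layout are assumptions, and the BERT+SimCSE/LSTM claim encoder is replaced by a generic feature vector.

    # Shared-bottom multi-task network with two towers (illustrative sketch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiTaskNet(nn.Module):
        def __init__(self, in_dim=256, hidden=128):
            super().__init__()
            self.shared = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
            self.potential_tower = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 2))
            self.legal_tower = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(), nn.Linear(64, 2))

        def forward(self, x):
            h = self.shared(x)                        # representation shared by both tasks
            return self.potential_tower(h), self.legal_tower(h)

    # Hypothetical input: text feature of the claims concatenated with numerical features.
    x = torch.cat([torch.randn(32, 192), torch.randn(32, 64)], dim=1)
    model = MultiTaskNet()
    pot_logits, legal_logits = model(x)
    loss = F.cross_entropy(pot_logits, torch.randint(0, 2, (32,))) \
         + F.cross_entropy(legal_logits, torch.randint(0, 2, (32,)))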

  • Ma Jie, Sun Wenjing, Hao Zhiyuan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0938
    Online available: 2025-07-03

    [Objective] The purpose of this study is to build a high-quality disease prediction model and explore its interpretability. By identifying the key factors contributing to disease formation and analyzing how they act on the disease, the study aims to enable auxiliary diagnosis and precision medicine. [Methods] Taking obesity as the research object, the random forest model is first used to select the most representative features of the disease data; second, an enhanced sparrow search algorithm is proposed to adaptively obtain the kernel parameters and penalty coefficient of the SVM; the optimized SVM model is then used to predict and analyze the data samples and is compared with 8 baseline methods; finally, the SHAP interpretation framework is used to quantitatively analyze the relationship between the contributing factors and the disease.[Results] The prediction accuracy of the proposed model reaches 85.5%; moreover, the accuracy, specificity, and Matthews correlation coefficient obtained by the proposed model are all higher than those of the other methods, which demonstrates the effectiveness of the model. In addition, family history, vegetable intake frequency, daily meals, height, gender, transportation usage, and high-calorie food intake are the key factors affecting the formation of obesity.[Limitations] An empirical study using obesity as the single example cannot fully verify the generalizability of the proposed model; the interactions between the feature variables are not analyzed.[Conclusions] The model proposed in this paper not only achieves superior prediction accuracy but also quantifies the magnitude and direction of the effects of the contributing factors, which can provide decision support for medical institutions.
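
    The screening-then-optimization workflow can be sketched as below, with random-forest importance ranking followed by an RBF-kernel SVM; a plain grid search stands in for the paper's enhanced sparrow search algorithm, and the synthetic data and parameter grid are placeholders.

    # Random-forest feature screening + tuned RBF SVM (illustrative sketch).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.random((400, 16))                        # e.g. lifestyle and demographic features
    y = rng.integers(0, 2, 400)                      # e.g. obese vs. non-obese

    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
    top = np.argsort(rf.feature_importances_)[::-1][:7]   # keep the most representative features
    X_sel = X[:, top]

    # Grid search over the penalty coefficient C and kernel parameter gamma
    # (the paper instead tunes these adaptively with an enhanced sparrow search algorithm).
    search = GridSearchCV(SVC(kernel="rbf"),
                          {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1]},
                          cv=5)
    search.fit(X_sel, y)
    print(search.best_params_, search.best_score_)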

  • Ni Yuan, Li Xiangyu, Zhang Jian, Dong Feixing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0074
    Online available: 2025-07-03

    [Purpose] This study constructs an interpretable ensemble learning model to provide a new decision-making approach for predicting the development effectiveness of movie IP derivatives. [Method] Based on value chain theory, the development process of movie IP derivatives is analyzed and a predictive indicator system is constructed. Influencing factors are extracted and screened based on the KLLB model, and predictive labels are constructed. A development performance prediction model based on AWStacking is then proposed. [Results] The AWStacking algorithm with XGBoost, CatBoost, and RF as base learners and LR as the meta-learner achieves the best prediction performance, with a macro-average accuracy of 0.8699, macro-average recall of 0.7889, and macro-average F1 of 0.8216.[Limitations] Due to the limitations of current data availability, the indicators for measuring the development effectiveness of movie IP derivatives can be further optimized to improve the granularity of indicator measurement.[Conclusion] The constructed model provides a basis for judging and predicting the development effectiveness of movie IP derivatives, contributing to the healthy development of the movie IP derivative market.
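
    A conventional stacking ensemble with the same base and meta learners can be sketched as follows; AWStacking's adaptive weighting is not reproduced here, and the synthetic indicator data and hyperparameters are placeholders.

    # Stacking with XGBoost, CatBoost, and RF base learners and an LR meta-learner (sketch).
    import numpy as np
    from catboost import CatBoostClassifier
    from sklearn.ensemble import RandomForestClassifier, StackingClassifier
    from sklearn.linear_model import LogisticRegression
    from xgboost import XGBClassifier

    rng = np.random.default_rng(0)
    X = rng.random((600, 12))                        # e.g. value-chain indicator features
    y = rng.integers(0, 3, 600)                      # e.g. three development-effectiveness levels

    stack = StackingClassifier(
        estimators=[
            ("xgb", XGBClassifier(n_estimators=200, eval_metric="mlogloss")),
            ("cat", CatBoostClassifier(iterations=200, verbose=0)),
            ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=5,
    )
    stack.fit(X, y)
    print(stack.predict(X[:5]))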