
Online first

The manuscripts published below will continue to be available from this page until they are assigned to an issue.
  • Liu Tiantian, Peng Fang, Zhu Tianyou, Yang Chao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0679
    Online available: 2026-01-19

    [Objective] To address the low logical and execution accuracy of natural-language-to-SQL models caused by missing real-database information in production, this paper explores the deep integration of retrieval-augmented generation into SQL statement creation, aiming to build a model that precisely captures user intent, automatically aligns with the database schema, and produces executable SQL. [Methods] The SQLGPT large model is proposed to efficiently convert users' natural language queries into SQL statements. First, the model retrieves table-structure information relevant to the user's query from the database through semantic similarity calculation. Then, it dynamically generates prompts by combining the retrieved table-structure information with in-context learning examples to guide the large language model in generating SQL statements. [Results] Experiments on the WikiSQL dataset show that SQLGPT achieves a logical-form accuracy of 86.5% and an execution accuracy of 92.6%, outperforming the state-of-the-art BRIDGE model by 0.4 and 0.8 percentage points, respectively, while also showing clear advantages in supporting multi-turn conversations.

    [Limitations] Although SQLGPT performs well on the WikiSQL dataset, it relies on a single general-purpose dataset, and its robustness and generalization ability have not been fully verified in scenarios such as non-standard table naming, complex multi-turn conversations, and industry-specific queries. [Conclusions] SQLGPT innovatively combines semantic-similarity retrieval with dynamic prompting, providing an efficient and accurate solution for the natural-language-to-SQL task that is expected to lower the threshold for users without technical backgrounds to use databases.
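    The retrieve-then-prompt pipeline described above can be sketched roughly as follows. This is not the authors' implementation: the bag-of-words cosine is only a stand-in for SQLGPT's semantic similarity calculation, and the table descriptions in `SCHEMAS` and the example query are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    # crude bag-of-words stand-in for a semantic embedding
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

# Hypothetical table-structure descriptions to retrieve from.
SCHEMAS = [
    "players table: name, team, position, height",
    "games table: date, home team, away team, score",
]

def retrieve_schema(query, k=1):
    # rank table descriptions by similarity to the user's question
    q = embed(query)
    return sorted(SCHEMAS, key=lambda s: cosine(q, embed(s)), reverse=True)[:k]

def build_prompt(query, examples):
    # combine the retrieved schema with in-context learning examples
    schema = "\n".join(retrieve_schema(query))
    shots = "\n".join(f"Q: {q}\nSQL: {s}" for q, s in examples)
    return f"{schema}\n{shots}\nQ: {query}\nSQL:"

print(build_prompt("List the position of each player",
                   [("List all player names", "SELECT name FROM players")]))
```

    The assembled prompt would then be sent to the underlying language model, which completes the final `SQL:` line.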

  • Guo Haixiang, Zou Yuzhe, Zhao Tiantian, Zhang Wenkai
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0347
    Online available: 2026-01-19

    [Objective] Given the misleading nature, diversity, immediacy, and dynamic character of distorted information during emergencies, traditional models face challenges in semantic comprehension, data coverage, and knowledge updating. Furthermore, the hallucination problem in large language models constrains their application across multidisciplinary fields. [Methods] This study constructs a misinformation dataset for sudden events using publicly available datasets. A two-stage large model is designed based on sample embedding and chain-of-thought prompting strategies to mitigate hallucinations, combined with an XLNet-BiLSTM model to mitigate their consequences. Together, these components form a trustworthy and interpretable framework for identifying misinformation in sudden events. [Results] The proposed framework achieved 85.02% accuracy in identifying misinformation during emergencies, outperforming ablation combinations of individual framework components. [Limitations] Comparative analysis between locally deployed and online-invoked large language models remains insufficient. [Conclusions] This framework ensures consistency and interpretability between identification results and generated explanations, demonstrating transferable misinformation recognition capabilities across diverse emergency scenarios.

  • Qiu Hanqi, Chen Wei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0611
    Online available: 2026-01-19

    [Objective] To propose an integrated research framework that combines technology lifecycle analysis, a multi-dimensional measurement system, and Semantic-Enhanced Hypernetwork Node Representation, aiming to systematically quantify the evolutionary characteristics of technology convergence, address the insufficient utilization of textual information, and comprehensively reveal trends in technology convergence. [Methods] This study divides the technological lifecycle into distinct stages and constructs a temporal hypernetwork for technology convergence based on hypernetwork structures. A multi-level measurement system is designed, encompassing the overall hypernetwork, hyperedges, and nodes, to systematically analyze the evolution of technology convergence. Additionally, a Semantic-Enhanced Hypernetwork Node Representation method (SHNE) is introduced to uncover potential convergence relationships. [Results] Using the all-solid-state battery field as a case study, the research demonstrates that technology convergence evolves through distinct stages, from material exploration and performance optimization to industrial application. It identifies high-value convergence directions—including battery thermal management and modifications to high-nickel cathode interfaces—providing valuable insights for overcoming technological bottlenecks.

    [Limitations] The hypernetwork is constructed solely based on IPC co-occurrence relationships, without incorporating citation relationships or other multi-dimensional information, which limits the fine-grained extraction of technological elements. [Conclusions] This framework reveals the evolutionary characteristics and potential convergence directions in the field of technology, offering new perspectives for research on technology convergence.

  • Yan Qiang, Leng Jidong, Yi Lanli, Jiang Lidan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0775
    Online available: 2026-01-16

    [Objective] To explore the cognitive and emotional response mechanisms of the public in the diffusion of disruptive technologies, and to explain the inherent logic of their social acceptance. [Methods] A three-stage model of technology perception-emotional expression-attitudinal stance is applied to Apollo Go, an autonomous driving service in China. Using over 60,000 user comments from Bilibili, Douyin, and Xiaohongshu, semantic analysis, emotion recognition, and stance detection are conducted to examine cross-platform variations. [Results] Public perceptions are multidimensional, emotional expressions differ significantly across platforms, and attitudes split between technological optimism and institutional concern. Platform context moderates attitude formation by shaping emotional expression. [Limitations] The study relies on Chinese social media data; its generalizability to other cultural contexts requires further testing. [Conclusions] The study demonstrates the interactive mechanisms of perception, emotion, and attitude in the diffusion of disruptive technology, providing new insights for the construction of public technology acceptance models, while also offering methodological references for AI-driven public opinion analysis and scenario decision-making.

  • Zhao Chunyu, Zhang Dandan, Wang Xihua, Chen Xi, Lin Chuanwen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0615
    Online available: 2026-01-16

    [Objective] To systematically review the research progress of data distillation technology and clarify its core challenges and future directions. [Coverage] Using keywords such as “data distillation” and “Dataset Distillation”, we retrieved 88 relevant papers published from 2021 to 2025 from databases including CNKI, Web of Science, and Google Scholar for this review. [Methods] This study systematically analyzes the basic algorithms based on meta-learning and data matching as well as their representative works, and draws on multi-domain application scenarios to analyze the technical characteristics and implementation challenges of data distillation. [Results] Data distillation has shown excellent performance in the fields of computer vision, smart healthcare, recommender systems, natural language processing, and graph learning, but problems remain: the basic theory is incomplete, the quality of synthetic data needs improvement, and privacy protection needs strengthening. [Limitations] This study focuses on organizing existing research and analyzing applications; it is less thorough in the in-depth derivation of algorithm principles and in verifying adaptation to complex scenarios. [Conclusions] Data distillation is a technology with broad application prospects. In the future, its practical application should be advanced through theoretical modeling, quality optimization, and technological integration.

  • Wu Dawei, Zhao Yuxiang, Tang Jian, Zhu Qinghua
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0550
    Online available: 2026-01-16

    [Objective] This study aims to explore the categories of age-friendly design affordances in human-AI interaction contexts and the mechanisms by which these affordances map to the needs of older adults. [Methods] Drawing on the needs–affordances–features framework and the perspective of affordance actualization, the study identifies key categories of affordances through a meta-ethnographic approach. A card-sorting experiment, using a smart watch scenario, is then conducted to examine the mapping relationships among product features, affordances, and user needs, thereby clarifying the process of affordance actualization. [Results] The study identifies nine categories of age-friendly design affordances—such as understanding affordances, embodied affordances, and empathetic affordances—and develops an integrated model illustrating the mapping mechanism between features, affordances, actualization processes, and outcomes. [Limitations] The study relies solely on literature-based data to derive affordance categories, and the exploration of affordance actualization is limited to a specific product scenario. [Conclusions] This research contributes to the theoretical development of age-friendly design affordances in human-AI interaction and offers practical insights for governments and enterprises to implement human-centered age-friendly design.

  • Tian Xuecan, Li Changwang, Liu Chen, Deng Zeyu, Mao Jin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0863
    Online available: 2026-01-16

    [Objective] To characterise the abnormal fluctuation differences of innovation activities in cutting-edge technology under the context of international competition, providing reference for identifying vulnerable points within the technological system. [Methods] Based on long-term patent application data, multiple time-series modelling and anomaly detection models were used to quantify abnormal performance in technology fields, and EScore was used to evaluate their frontier nature. Through cross-analysis, an anomaly profile of cutting-edge technology fields was constructed. [Results] Which model performs best varies across technology fields. Overall, China's technological system shows 'overall stability with local anomalies'. Most anomalies are negative, with sudden changes often lagging by 2–3 years. There is generally a negative correlation between frontier level and abnormality, with cutting-edge technology fields more prone to negative fluctuations, though fields such as the Internet of Things and computational chemistry demonstrate higher resilience. [Limitations] Innovation activities are measured only by application volume, and the evaluation of frontier status relies on a single indicator; future studies could introduce more comprehensive indicator systems. [Conclusions] Constructing anomaly profiles of cutting-edge technology fields based on these dual dimensions can effectively reveal vulnerable points and potential breakthrough directions in technological innovation under international competition.

  • Liu Xiumin, Hu Maodi, Song Donghuan, Sun Xi, Zou Dong, Yuan Zhixiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0849
    Online available: 2026-01-16

    [Objective] To improve the performance of large language models in extracting knowledge objects from technological patents, this study addresses the limitations of the few-shot prompting method and the insufficient alignment between demonstrations and target tasks. We propose a task-aware dynamic demonstration selection method. [Methods] The demonstration selection problem is formulated as a demonstration-guided gain (DGG) prediction task. Based on the deep semantic interaction between the query sentence and candidate demonstrations, a task-aware Cross-Encoder ranking model is constructed. Demonstrations with high guided gains are dynamically selected through a two-stage retrieval-reordering framework. [Results] Experimental results on a genomics Chinese patent dataset show that the dynamic demonstration selection model achieves an F1 score of 64.60% in knowledge object extraction, outperforming the baseline model. The experiments verify the effectiveness of the demonstration-guided gain model in improving the quality of dynamic demonstrations. [Limitations] This experiment is based on genomics Chinese patent text data; the applicability of the model to other domains and text types requires further investigation. [Conclusions] Through experimental validation on genomics patents, the proposed task-aware dynamic demonstration selection method enhances demonstration-task compatibility and improves the performance of large language models in patent knowledge object extraction tasks.
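    The two-stage retrieval-reordering idea can be sketched as follows. This is only an illustration: `dgg_score` is a toy Jaccard stand-in for the trained Cross-Encoder that predicts demonstration-guided gain, and the demonstration pool is invented.

```python
def jaccard(a, b):
    # lexical similarity used here for the coarse first-stage retrieval
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def dgg_score(query, demo):
    # toy stand-in for the Cross-Encoder's predicted demonstration-guided gain
    return jaccard(query, demo)

def select_demonstrations(query, pool, k_retrieve=2, k_final=1):
    # stage 1: coarse retrieval narrows the candidate pool
    candidates = sorted(pool, key=lambda d: jaccard(query, d), reverse=True)[:k_retrieve]
    # stage 2: rerank the survivors by predicted guided gain
    return sorted(candidates, key=lambda d: dgg_score(query, d), reverse=True)[:k_final]

pool = [
    "extract gene entities from the patent claim",
    "summarise the patent abstract",
    "translate the claim into French",
]
print(select_demonstrations("extract entities from this gene patent", pool)[0])
```

    In the paper's setting the second-stage scorer is a trained ranking model, so the two stages can disagree; here they coincide only because the stand-in reuses the retrieval similarity.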

  • Qian Danmin, Zhang Kai, Zhang Jihai, Liu Chengyu, Ma Yeqing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0497
    Online available: 2026-01-16

    [Objective] This study proposes a multi-scale temporal perception and adaptive weight fusion method for multi-modal sentiment analysis. [Methods] This paper designs a multi-scale temporal perception module to capture multi-level temporal information and proposes an adaptive weight fusion architecture that integrates cross-attention, modality importance learning, and feature gating, ultimately completing sentiment classification through weighted fusion. [Results] Experiments on two benchmark datasets show that the proposed method significantly outperforms baseline models, with accuracy and F1-score improved by 2.15% and 2.26% on CMU-MOSI, and by 3.14% and 2.67% on CH-SIMS, respectively. [Limitations] The limitations of this study include the need to improve the discriminative ability for complex and ambiguous emotional expressions, as well as insufficient robustness against modality absence and noise interference. Future work will focus on optimizing the model's performance in these scenarios. [Conclusions] By integrating temporal and cross-modal information via adaptive weighting, this method enhances sentiment recognition performance and demonstrates strong application potential.
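    Of the three fusion components named above, the modality-importance-learning step can be illustrated in isolation: a softmax over learned importance scores weights each modality's feature vector before summation. The feature values and logits below are invented, and the cross-attention and gating components are omitted entirely.

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def fuse(features, importance_logits):
    # weight each modality's feature vector by its learned importance, then sum
    w = softmax(importance_logits)
    dim = len(next(iter(features.values())))
    fused = [0.0] * dim
    for wi, feat in zip(w, features.values()):
        for j, x in enumerate(feat):
            fused[j] += wi * x
    return fused, w

# invented per-modality features and importance logits
feats = {"text": [1.0, 0.0], "audio": [0.0, 1.0], "video": [0.5, 0.5]}
fused, weights = fuse(feats, [2.0, 0.0, 0.0])
print(weights, fused)
```

    In the actual architecture the logits would be produced by a learned module and the weighted features would feed the classification head; here the text modality simply dominates because its logit is largest.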

  • Yu Houqiang, Lai Xin, Zhang Yang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0350
    Online available: 2026-01-15

    [Objective] This study aims to seize the development opportunities in intelligent informetrics, elucidate its formative process, and provide methodological references for subsequent research. [Coverage] This study formulated search queries based on the theme of AI applications in the field of informetrics. After conducting searches in the Web of Science and CNKI databases, and refining the data through data curation and extended reading, a total of 326 Chinese and English articles were identified. [Methods] Based on systematic literature retrieval and intensive reading, this paper reviews the research progress of AI applications in the identification, prediction, and classification aspects of informetrics over the past decade. [Results] In terms of recognition applications, AI is mainly applied to paper recognition and fine-grained entity recognition. Regarding prediction applications, AI is mainly employed for predicting the impact of papers, the influence of scholars, and research trends. In the area of classification applications, AI is utilized for classifying papers by discipline, categorizing paper content, and classifying sentiment and motivation. The principles and processes of AI applications in each direction are summarized and interpreted in detail. [Conclusions] Applying AI technology to resolve complex issues in the field of informetrics has become an inevitable trend. The era of intelligent informetrics is coming. Mastery of AI application skills has become an indispensable key capability for professionals in the field of informetrics.

  • Yuan Weikang, Zhang Yujie, Bao Yiyun, Jiang Zhuoren
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0503
    Online available: 2026-01-15

    [Objective] To address the challenges posed by covert hate speech on social media—including its high semantic complexity, the substantial cost of manual annotation, and the limited interpretability of traditional detection methods—this study proposes a novel hate speech detection framework leveraging large language models through self-taught reasoning enhancement, STR4HSD (Self-Taught Reasoning for Hate Speech Detection). [Methods] The methodology comprises three key components. First, a self-taught reasoning enhancement module is constructed based on the Qwen-Max model, following a “Generate-Verify-Reflect-Filter” paradigm. This module facilitates the automatic generation of high-quality Chain-of-Thought (CoT) data enriched with explicit reasoning paths. Second, the Qwen2.5-7B-Instruct model is fine-tuned to enhance its performance in identifying hate speech within Chinese textual contexts, while also improving the transparency of its decision-making process. Third, domain-specific knowledge is incorporated into the model via a priori toxicity lexicons to improve its comprehension of subtle and implicit expressions of hate speech. [Results] Extensive experiments are conducted on two benchmark Chinese datasets: TOXICN, designed for detecting covert hate speech, and TOXICLOAKCN, which includes hate speech with obfuscation-based perturbations. The proposed approach achieves F1 scores of 84.2% and 84.7% on TOXICN and TOXICLOAKCN, respectively, outperforming existing state-of-the-art methods by more than 2% and 4% in terms of F1 score. [Limitations] The current implementation focuses on monolingual text processing and does not yet address multimodal content integration or cross-lingual transferability, particularly with respect to semantic alignment across languages. [Conclusions] STR4HSD presents a promising solution for accurate and interpretable detection of covert hate speech with minimal reliance on human-annotated training data. It offers a scalable and transparent technical approach for content moderation on online social platforms.

  • Wang Ronglei, Ma Yuefeng, Li Shijian, Liang Xun, Song Yang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0659
    Online available: 2026-01-15

    [Objective] The study addressed the lack of joint analysis of asynchronous evolution in user behavior and social relationships with group sentiment polarization in cyberbullying detection. [Methods] A detection model integrating structural-difference features and group sentiment-polarization features was developed. Two types of discrete-time dynamic graphs were constructed to model user interaction and social-following relationships, respectively. A bidirectional graph convolutional network generated embeddings from both graphs, and structural-difference sequences were derived to characterize asynchronous structural evolution. A sentiment-confidence filtering mechanism produced positive and negative group sentiment-polarization sequences. An enhanced temporal-signal fusion model performed the final classification. [Results] Experiments on real-world social media datasets showed an accuracy of 89.5% and an F1-score of 87.0%, with improvements of about 2% to 7% over representative baseline models. [Limitations] This study was validated solely on specific social media datasets. Future work could incorporate data from more diverse platforms and test the model in real-world environments to assess its generalization capability and robustness. [Conclusions] Fusing social-network structural differences with group sentiment polarization, from a group-polarization perspective, significantly improves cyberbullying detection.

  • Li Jinhao, Zhao Yuxiang, Zhao Yanke, Zhu Qinghua
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0827
    Online available: 2026-01-15

    [Objective] This study aims to explore the characteristics and triggers of users' nostalgic emotions in nostalgic videos based on online comments. [Methods] Over 20,000 user comments from 40 nostalgic-themed videos on Bilibili were subjected to computational grounded analysis. Pattern detection, refinement, and confirmation were conducted using the Qwen large language model and prompt engineering. [Results] Five primary nostalgic elements—“characters,” “events,” “time,” “place,” and “objects”—were extracted from video comments. Three influencing factors triggering nostalgia were identified: sensory processing, recommendation mechanisms, and social interaction, alongside two sociocultural attributes. [Limitations] Findings derived from computational grounding may overlook insights in user comments due to biases inherent in unsupervised classification. Future research should integrate qualitative methods such as user interviews and cyberethnography. [Conclusions] This study deepens the understanding of core dimensions in nostalgic content creation on social media and provides reference points for designing user experiences in nostalgic videos.

  • Cao Wei, Zhang Yicong, Wang Wenjun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1025
    Online available: 2026-01-15

    [Objective] This study aims to integrate heterogeneous information across countries, markets, and modalities to overcome the limitations of single data sources and enhance the prediction accuracy of China's new energy stock indices. [Methods] We develop a Deeply Coupled Long Short-Term Memory Attention Model (DC-LSTM-Attention-LLM). The model employs ChatGPT with structured zero-shot prompting to quantify investor sentiment from text-based information; constructs an LSTM ensemble to learn deep representations from cross-country, cross-market, and multimodal data; incorporates a multi-head attention mechanism to capture interaction dependencies among heterogeneous features; and performs feature fusion via a shared layer with ReLU activation to generate index predictions. [Results] Empirical analyses on Chinese and U.S. new energy indices show that DC-LSTM-Attention-LLM consistently outperforms nine benchmark models across all evaluation metrics. Specifically, the model achieves an average 12.83% performance improvement relative to the standard LSTM, demonstrating its superiority in forecasting complex financial time series. [Limitations] The accuracy of sentiment recognition using zero-shot prompting is constrained when dealing with complex financial semantics. Future research will introduce advanced prompt engineering to enhance the model's recognition capability and robustness. [Conclusions] The deep modeling approach integrating cross-country, cross-market, and multi-modal data can effectively capture complex market characteristics and significantly enhance the forecasting accuracy of China's new energy stock market.
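    The structured zero-shot prompting step can be sketched as follows. The prompt wording is hypothetical (the paper's exact template is not reproduced), and the model reply is simulated rather than obtained from ChatGPT.

```python
import json

# Hypothetical prompt template; the paper's exact wording is not reproduced here.
TEMPLATE = (
    "You are a financial analyst. Rate the investor sentiment of the headline "
    "from -1 (very bearish) to 1 (very bullish).\n"
    'Reply only with JSON: {{"sentiment": <float>, "reason": "<short reason>"}}\n'
    "Headline: {headline}"
)

def build_sentiment_prompt(headline):
    return TEMPLATE.format(headline=headline)

def parse_sentiment(reply):
    # tolerate chatter around the JSON object, then clamp to the valid range
    obj = json.loads(reply[reply.find("{"): reply.rfind("}") + 1])
    return max(-1.0, min(1.0, float(obj["sentiment"])))

prompt = build_sentiment_prompt("New subsidy announced for solar manufacturers")
reply = 'Sure. {"sentiment": 0.6, "reason": "subsidies are bullish"}'  # simulated reply
print(parse_sentiment(reply))
```

    The structured JSON reply makes the sentiment score machine-readable, so it can be fed as a feature into the downstream LSTM ensemble alongside the numeric market series.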

  • Yang Renbiao, Cao Gaohui
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0769
    Online available: 2026-01-15

    [Objective] This study focuses on the realization pathways of misinformation herd immunity and aims to explore effective approaches for misinformation governance. [Methods] An RP-MHIM model incorporating two evolutionary mechanisms—inoculation-based intervention and natural infection—was developed. By introducing multiple intervention variables, including inoculation frequency, timing, and intensity, the model systematically simulates and analyzes immunity outcomes under different pathways. [Results] In terms of immunity speed, the inoculation pathway achieves herd immunity at approximately t = 150 iterations, whereas the natural infection pathway requires around 200 iterations. From the perspective of pathway robustness, the inoculation strategy exhibits significantly stronger resistance to disturbances than the natural infection strategy. [Limitations] The model still relies on simplified assumptions concerning individual behavior, platform response mechanisms, and network structures, and its strategy effects have not been empirically validated using real-world data. [Conclusions] The vaccination-based pathway consistently outperforms the natural infection pathway across multiple dimensions, demonstrating higher intervention efficiency and a faster immunity formation process.
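    A minimal SIR-style sketch contrasts the two pathways: inoculation moves susceptible users directly into the immune state, while natural infection routes them through the infected state first. All parameters below (infection rate, recovery rate, the 80% herd-immunity threshold) are illustrative, not those fitted in the RP-MHIM model.

```python
def simulate(n=1000, beta=0.3, gamma=0.03, inoculate_per_step=0, max_steps=400):
    # susceptible / infected (spreading misinformation) / removed (immune)
    s, i, r = n - 10, 10, 0
    for t in range(max_steps):
        new_inf = min(s, int(beta * s * i / n))       # natural infection
        new_rec = int(gamma * i)                      # recovery after infection
        vacc = min(s - new_inf, inoculate_per_step)   # prebunking: susceptible -> immune
        s -= new_inf + vacc
        i += new_inf - new_rec
        r += new_rec + vacc
        if r >= 0.8 * n:  # assumed herd-immunity threshold of 80%
            return t
    return max_steps

natural = simulate()
inoculated = simulate(inoculate_per_step=10)
print(natural, inoculated)
```

    Under these illustrative parameters the inoculation pathway reaches the threshold in fewer iterations than the natural-infection pathway, mirroring the qualitative finding reported above.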

  • Wang Xiaochen, Li Shijuan, Huang Wensheng, Zhao Hongmei, Zhang Runtong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0749
    Online available: 2026-01-15

    [Objective] To address the difficulty of achieving prospective identification and early intervention of clinical misdiagnosis, this study proposes an approach integrating prior knowledge with multi-source heterogeneous data for misdiagnosis risk prediction and feature identification, aiming to improve the timeliness and operability of misdiagnosis detection. [Methods] Prior misdiagnosis knowledge was extracted from expert experience and clinical rules to assist in determining misdiagnosis events and to enable knowledge-driven label construction; a hybrid machine learning model was built based on electronic medical records, with interpretable learning introduced to identify key risk features. Model development used data from 19,256 patients provided by the China National Population Health Data Center, and external evaluation and a clinical pilot were conducted on an independent validation cohort of 2,153 patients from Peking University People’s Hospital. [Results] In independent validation, the model achieved an accuracy of 92%, an AUROC of 0.90, and an AUPRC of 0.67; the clinical pilot results showed that the Youden index increased from 84.50% to 92.65% after model deployment. [Limitations] The construction of misdiagnosis labels relies on rules and expert knowledge, and the pilot sample size and duration are limited; generalizability and robustness still require multicenter and longer follow-up validation. [Conclusions] The proposed approach achieves favorable performance in misdiagnosis risk prediction and key feature identification, enhances early intervention capability and decision-support value in real clinical practice, and provides a feasible technical pathway for building an intelligent prevention and control system for clinical misdiagnosis.

  • Ding Shengchun, Gong Jingze, Qin Tianyun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0848
    Online available: 2025-12-31

    [Objective] Aiming at the challenges of diverse types, complex interactive behaviors, and poor reproducibility of cognitive subjects in the context of cognitive warfare, this study introduces large language model (LLM) technology to develop a real-data-based agent modeling and simulation method, thereby providing experimental support for the targeted delivery of intelligence strategies. [Methods] Based on real data from the Al-Ahli Hospital explosion incident, deep learning methods were used to extract the attribute distribution and behavioral probability characteristics of nine typical categories of cognitive subjects. Leveraging large language models, 10,000 agent instances with individual heterogeneity were generated and subsequently integrated into the NetLogo platform for interactive behavior simulation. The model effectiveness was validated from the perspectives of attribute distribution consistency and behavioral pattern differentiation. [Results] The model effectively characterizes the influence differences among cognitive subjects at various levels. Differentiated interactions consistent with a normal distribution emerged during the simulation, which effectively addresses the limitations of rigid rules and insufficient sample representativeness in traditional simulations. [Limitations] The current model focuses on fitting and reproducing behavioral probabilities, and fails to achieve dynamic cognitive evolution based on real-time semantic interaction during simulation operation, leading to inadequate ability to evaluate the effects of in-depth semantic confrontation. [Conclusions] The cognitive subject agent model constructed in this study effectively reproduces behavioral responses after information reception, thus verifying the feasibility of the technical route combining LLM-generated agents and NetLogo simulation. It offers quantifiable and reproducible experimental underpinnings for cognitive space operations and holds significant strategic value.

  • Shen Zhihong, Zhu Xiaojie, Zhu Guoliang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0956
    Online available: 2025-12-31

    [Objective] This paper addresses the critical need for scientific data to evolve from "machine-actionable" to "AI-Ready" under the FAIR principles in AI scenarios, aiming to construct a principle framework for AI-ready scientific data sharing and utilization. [Methods] By systematically analyzing the data requirements of five typical AI tasks—traditional machine learning, large language model pre-training, large language model fine-tuning, retrieval-augmented generation (RAG), and agents—we propose the FAIR×FAIR framework. This framework extends the traditional four FAIR dimensions with a focus on "For AI-Ready" principles and introduces a corresponding hierarchical technology stack. [Results] The FAIR×FAIR framework defines 13 technical requirements for making scientific data AI-ready, providing a systematic solution to bridge the semantic gap between AI tasks and scientific data infrastructure. [Limitations] The practical effectiveness of the proposed framework requires further validation through subsequent domain-specific application cases. [Conclusions] The FAIR×FAIR framework provides a theoretical foundation and practical pathway for scientific data sharing and efficient utilization in the AI era, contributing significantly to the evolution of data-driven research paradigms.

  • Chen Gefei, Liu Qing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0600
    Online available: 2025-12-31

    [Objective] Threshold setting is an important step in mapping knowledge domains. This paper provides an objective and generic method for threshold setting to improve the quality and efficiency of data mining in literature networks. [Methods] This paper reveals the relationship between the performance of knowledge maps and the threshold setting via degree distribution, assortativity, and giant-component size proportion. Based on this, we propose a threshold setting method and test it with practical data. [Results] In the experimental dataset, the average precision of important nodes, clusters, and temporal features extracted from networks filtered by our method is improved by 10 percent, and the average completeness is improved by 7 percent, compared with empirical methods. [Limitations] The applicability of the method needs to be tested in more domains. [Conclusions] There is a universal relationship between the threshold setting and three network structural properties (degree distribution, assortativity, and giant-component size proportion) in literature networks. Threshold setting based on these network structural properties can improve the quality of data mining in mapping knowledge domains.
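    The idea of raising the edge-weight threshold only as far as the network's structure tolerates can be sketched with one of the three properties, the giant-component size proportion (degree distribution and assortativity are omitted here). The toy co-occurrence weights and the 0.5 floor are invented for illustration.

```python
from collections import defaultdict, deque

# toy weighted co-occurrence edges: (node, node, co-occurrence count)
EDGES = [("a", "b", 5), ("b", "c", 4), ("c", "a", 3),
         ("c", "d", 2), ("d", "e", 1), ("e", "f", 1)]

def giant_component_share(edges, nodes):
    # BFS over the undirected graph; return the largest component's node share
    adj = defaultdict(set)
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, best = set(), 0
    for start in nodes:
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            x = queue.popleft()
            size += 1
            for y in adj[x]:
                if y not in seen:
                    seen.add(y)
                    queue.append(y)
        best = max(best, size)
    return best / len(nodes)

def choose_threshold(edges, min_share=0.5):
    # keep the largest threshold whose filtered network still retains a
    # giant component covering at least min_share of the nodes
    nodes = {n for u, v, _ in edges for n in (u, v)}
    best = min(w for _, _, w in edges)
    for t in sorted({w for _, _, w in edges}):
        kept = [e for e in edges if e[2] >= t]
        if giant_component_share(kept, nodes) >= min_share:
            best = t
    return best

print(choose_threshold(EDGES))  # -> 4
```

    A fuller version of the method would also monitor how the degree distribution and assortativity degrade as the threshold rises, rather than the giant-component criterion alone.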

  • Lv Tingyu, Li Xiaoying, Deng Panpan, Li Junlian
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0719
    Online available: 2025-12-31

    [Objective] To effectively uncover and formally represent combined medication information from unstructured text, this paper proposes a multi-level method to discover knowledge of combined drug therapy and to automatically evaluate the results using large language models. [Methods] Based on literature-based discovery (LBD) theory, a multi-level knowledge discovery framework for combination therapy is designed. Deep learning models are integrated and optimized for the recognition of drug and disease entities and the extraction of drug-disease treatment relationships and drug combinations, followed by automatic identification of combined drug therapy on the basis of targeted knowledge features. In addition, guided by a customized few-shot prompt in the "role + content" form, a large language model is adopted to evaluate the results. This multi-level, dual strategy not only improves the accuracy and reliability of knowledge discovery for combined drug therapy but also largely avoids time-consuming and laborious manual review. [Results] On a self-developed dataset of PubMed literature, the proposed method achieved an accuracy of 94.29%, and evaluation by GPT-4.1 reached a consistency rate of 95.71% with manual annotation. [Limitations] Only public scientific literature was collected for quantitative analysis; other data types, such as electronic medical records and adverse drug reaction reports, will be utilized for further verification in the near future. [Conclusions] The multi-level method integrating deep learning and large language models can efficiently identify knowledge of combined drug therapy from biomedical literature, meeting the high-precision requirements of large-scale structured data for downstream tasks and providing a technical path for applications such as drug-combination decision-making in precision medicine.
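The "role + content" few-shot prompting described above can be sketched as plain string assembly (an illustrative template only; the role text, example verdicts, and function name are hypothetical, not taken from the paper):

```python
def build_evaluation_prompt(role, examples, candidate):
    """Assemble a 'role + content' few-shot prompt for judging a
    candidate drug-combination statement."""
    lines = [f"Role: {role}", "", "Examples:"]
    for text, verdict in examples:
        lines.append(f"- Statement: {text}")
        lines.append(f"  Verdict: {verdict}")
    lines += ["", f"Statement to evaluate: {candidate}", "Verdict:"]
    return "\n".join(lines)

prompt = build_evaluation_prompt(
    role="You are a clinical pharmacology reviewer.",
    examples=[("Drug A plus Drug B treats disease X.", "valid combination")],
    candidate="Drug C plus Drug D treats disease Y.",
)
print(prompt.splitlines()[0])  # Role: You are a clinical pharmacology reviewer.
```

The assembled string would then be sent to the evaluating model, whose one-line verdict is compared against manual annotation for the consistency rate reported in the abstract.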

  • Wang Haoyu, Zhou Yulin, Huang Ruizhang, Qin Yongbin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0743
    Online available: 2025-12-31

    [Objective] To address the shortcomings of existing sentencing prediction models in multi-defendant cases—namely inadequate integration of legal knowledge and poor compliance with sentencing standards—this study proposes a Knowledge-Aided Sentence Prediction (KASP) framework that integrates legal constraints with knowledge-driven prediction. [Method] The KASP framework utilizes large language models for case fact decomposition; analyzes charges and legal provisions to extract foundational sentencing ranges as structured legal priors; and integrates these priors into lightweight predictive model training through a consistency fusion mechanism, achieving knowledge-driven collaborative optimization. [Results] Experiments on the CMDL-small dataset demonstrate that KASP achieves 5.44% and 4.18% improvements in accuracy and F1 score, respectively, compared to the optimal baseline DeepSeek-R1-14B, while exhibiting greater stability in complex multi-defendant scenarios. [Limitations] This study primarily focuses on knowledge modeling for extracting foundational sentencing ranges from legal constraints and discretionary factors, without addressing more complex sentencing rules such as concurrent sentencing for multiple offenses or overlapping statutory provisions. [Conclusion] By incorporating structured legal prior knowledge, the sentencing prediction model achieves enhanced performance and legal compliance in complex cases.
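The abstract does not spell out the consistency fusion mechanism, but one common way to let a structured range prior constrain a prediction is a hinge-style penalty that grows as the prediction leaves the statutory range. The sketch below is purely an assumption for illustration (the function, units, and weight are hypothetical, not KASP's actual loss):

```python
def range_consistency_loss(predicted_months, prior_low, prior_high, weight=1.0):
    """Hinge-style penalty pushing a predicted sentence (in months)
    back inside a statutory sentencing range used as a structured prior."""
    if predicted_months < prior_low:      # below the statutory floor
        return weight * (prior_low - predicted_months)
    if predicted_months > prior_high:     # above the statutory ceiling
        return weight * (predicted_months - prior_high)
    return 0.0                            # inside the range: no penalty

print(range_consistency_loss(48, 36, 120))  # 0.0
print(range_consistency_loss(24, 36, 120))  # 12.0
```

Such a term would be added to the lightweight model's training objective so predictions stay consistent with the extracted legal prior.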

  • Sun Xiaoling, Shen Tong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0280
    Online available: 2025-12-31

    [Objective] To explore the cross-domain knowledge combination characteristics of AI at the fine-grained level of knowledge memes, and to integrate multi-dimensional features to improve combination prediction. [Methods] We propose GCN_Contrast, a cross-view contrastive learning model based on graph neural networks, which effectively combines multi-dimensional features through graph contrastive learning and attention-based feature fusion to predict potential cross-domain knowledge combinations. [Results] Taking medical informatics as an example, the GCN_Contrast model improves accuracy over the traditional GCN model by 8.4%, 16%, and 20% as measured by P@500, P@100, and P@50, respectively, and increases the AUC value by 3%. [Limitations] The feature selection for knowledge memes is currently limited, and the relationships between knowledge memes can be further explored from the perspective of citation networks. [Conclusions] The GCN_Contrast model can more accurately predict knowledge meme combinations in cross-domain AI research, and supports decision-making for promoting the in-depth integration of AI with basic and cutting-edge scientific research.
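The P@k figures reported above follow the standard precision-at-k definition; a minimal sketch (toy data, hypothetical names, not the paper's evaluation code):

```python
def precision_at_k(ranked_pairs, true_pairs, k):
    """P@k: fraction of the top-k predicted pairs that are true combinations."""
    hits = sum(1 for pair in ranked_pairs[:k] if pair in true_pairs)
    return hits / k

# Toy example: four predicted meme pairs ranked by model score, two are real.
ranked = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
truth = {("a", "b"), ("c", "d")}
print(precision_at_k(ranked, truth, 2))  # 0.5
```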

  • Zhang Kewei, Yin Jing, Wen Fuquan, An Xiaomi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0810
    Online available: 2025-12-31

    [Objective] In the face of the cognitive challenges posed by the rapid iteration of AI-native technologies and the diversification of application scenarios, this paper aims to establish a concept system and a maturity evolution framework from the perspective of international standardization. This framework provides a theoretical basis for understanding the development of AI-native technologies, evaluating the behavioral quality of AI-native entities as data intelligence subjects, and formulating differentiated regulatory strategies. [Methods] Text content analysis is employed to screen and analyze 34 AI-native-related standard documents issued by ITU-T SG13. Following the principles of ISO 704:2022, a maturity model is developed based on "activity–result" feature mapping. Furthermore, representative use cases—including collaborative intelligent agents and applications in key vertical industries—are analyzed to elucidate their data roles and associated behavioral assessment patterns. [Results] The study identifies a conceptual system encompassing five categories of characteristic objects and two classes of defining features. It further establishes a three-tier maturity evolution framework, ranging from "AI-assisted" to "fully AI-native" capacities. Analysis of the use cases demonstrates the necessity of aligning governance approaches with scenario-specific risks, adopting either human–AI collaborative regulation or AI-native regulatory mechanisms as appropriate. [Limitations] This paper is limited to the standardization perspective in constructing the concept system and maturity evolution framework. [Conclusions] The concept system constructed in this paper provides a standardized consensus foundation for understanding the dynamic evolution of AI-native technologies. The study shows that the focus of governance should shift from performance efficiency to the evaluation of semantic accuracy and ethical quality.
It is recommended to adopt a tiered regulatory strategy, applying differentiated regulatory measures for scenarios with varying levels of maturity and risk.

  • Zhang Jingyuan, Yang Lei, Liu Zhaiyi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0773
    Online available: 2025-12-30

    [Objective] To address the high computational complexity and limited multi-scale feature-extraction capacity of Transformer-based models—which struggle to balance local details with global context—we propose a Multiscale Lightweight Attention Network (MLA-Net) for depression recognition.[Methods] MLA-Net adopts a lightweight Transformer backbone. A global dual-pooling attention module first harvests video features while preserving global cues; an attention-based spatio-temporal block then models long-range dependencies. Multi-scale convolutions capture information at different granularities, and a cross-fusion strategy further refines the final representation.[Results] Evaluated on a real-world depression dataset, MLA-Net achieves a mean absolute error of 4.90 and a root-mean-square error of 6.88, outperforming state-of-the-art alternatives and verifying its effectiveness and soundness.[Limitations] The current study only exploits facial expressions; speech, text and physiological signals have not yet been incorporated.[Conclusions] By synergistically combining global dual-pooling attention, multi-scale feature extraction and cross-feature fusion, MLA-Net significantly boosts recognition performance.

  • Wang Nan, Wang Juan, Liu Yaowen, Pan Jie, Xia Yixue, Xian Tingyu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0742
    Online available: 2025-12-30

    [Objective] To address the reliance on singular visual artifacts and the insufficient robustness against interference of existing AIGC image attribution methods, a novel semantic-guided multi-modal attribution model is proposed. [Methods] A semantic-guided active multi-modal fusion paradigm is proposed. To bridge the semantic gap between multi-modal features, a semantic mapping mechanism from quantitative fingerprints to natural language is designed. Building upon this, the interaction logic of the cross-attention layer is reconfigured to use semantic text as an active query, guiding the model to dynamically focus on critical artifact evidence within the deep feature space. [Results] On the WILD and DRAGON datasets, the F1 scores reached 98.4% and 69.6%, respectively. Quantitative analysis shows that, compared with the unimodal visual baseline, the proposed model improves the F1 score by 5.9% and 11.3%, respectively; compared with the Image-Only (ViT) model, the F1 score in complex scenarios is improved by 3.1% and 6.4%, respectively. [Limitations] The rule-based semantic generation limits adaptive reasoning capabilities; furthermore, the discrimination between technically homologous models with highly similar architectures remains to be improved. [Conclusions] This research confirms that the semantic-guided active multi-modal fusion strategy effectively integrates orthogonal evidence and represents a viable technical pathway for enhancing the robustness of AIGC image attribution in complex scenarios.
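The "semantic text as an active query" idea is an instance of standard cross-attention, where a text-derived query vector attends over visual feature vectors. The following is a generic single-head sketch (tiny hand-made vectors, hypothetical names; not the paper's architecture):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(query, keys, values):
    """Single-head cross-attention: one query vector (from semantic text)
    attends over a set of key/value vectors (visual features)."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted sum of value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

# The text query is aligned with the first visual feature, so the output
# is pulled toward that feature's value vector.
out = cross_attention(query=[1.0, 0.0],
                      keys=[[1.0, 0.0], [0.0, 1.0]],
                      values=[[10.0, 0.0], [0.0, 10.0]])
print(out[0] > out[1])  # True
```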

  • Du Xianjin, Xu Yuxiang, Fu Hong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0517
    Online available: 2025-12-26

    [Objective] To improve the accuracy and timeliness of identifying technological innovation partners for enterprises, this paper proposes a link prediction method based on a Dynamic Graph Convolutional Network (DGCN) with multi-modal feature fusion. [Methods] This paper proposes an attention-based model for fusing the topological, domain, and semantic features of nodes in a patent collaboration network. A GCN-LSTM architecture is designed with a sliding time window strategy to capture the network's dynamic evolution. The model performs link prediction to identify potential technological innovation partners. [Results] An empirical study on a patent dataset from China's new energy vehicle (NEV) sector from 2015 to 2024 shows that our method significantly outperforms baseline models across all metrics. It achieved an AUC of 0.858, an improvement of 5.0 percentage points over the next-best model, EvolveGCN, and an F1 score of 0.807, which is 3.5 percentage points higher than the runner-up model, DySAT. [Limitations] The study did not fully exploit patent-specific features such as citation relationships and patent value. Furthermore, it did not integrate non-patent, multi-source heterogeneous information, such as corporate R&D investment and market performance. [Conclusions] By effectively capturing the dynamic evolutionary patterns of patent cooperation networks and comprehensively utilizing multimodal patent features, this method can provide more precise and forward-looking data-driven decision support for enterprises engaging in collaborative innovation.
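The sliding time window strategy mentioned above amounts to slicing yearly network snapshots into overlapping histories, each used to predict links in the following year. A minimal sketch (the window length and function name are illustrative assumptions, not the paper's configuration):

```python
def sliding_windows(years, window, step=1):
    """Split an ordered list of yearly snapshots into overlapping
    training windows, each paired with the next year as prediction target."""
    windows = []
    for start in range(0, len(years) - window, step):
        history = years[start:start + window]   # snapshots fed to GCN-LSTM
        target = years[start + window]          # year whose links are predicted
        windows.append((history, target))
    return windows

years = list(range(2015, 2025))  # 2015-2024, as in the NEV patent study
for history, target in sliding_windows(years, window=3)[:2]:
    print(history, "->", target)
# [2015, 2016, 2017] -> 2018
# [2016, 2017, 2018] -> 2019
```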

  • Ma Yakun, Su Ying, Hu Guangwei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0356
    Online available: 2025-12-26

    [Objective] This study aims to explore a chronic disease knowledge service framework that integrates multi-source health data and domain knowledge to address limitations in personalized demand identification and the adaptability required for home-based chronic disease management. [Methods] The framework integrates three key components: (1) a risk assessment system for health monitoring and alerts, (2) a lightweight pre-parser that processes user queries using health profiles and physiological data, and (3) a knowledge graph-enhanced retrieval system for domain-specific adaptation. Together, these elements optimize the model's performance for chronic disease management. [Results] Experiments conducted on diabetes-related cases show that the model achieves higher quantitative scores in diagnostic relevance, terminology hit rate, and diagnostic regularity, providing useful insights for advancing the intelligence and precision of home- and community-based health services. [Limitations] Current knowledge services primarily rely on textual data; future work will incorporate multimodal data to enhance service comprehensiveness. [Conclusion] The proposed framework significantly improves knowledge service accuracy and standardization while establishing an extensible technical pathway for intelligent chronic disease management.

  • Hu Xinxin, Qiu Qinjun, Huang Zehua, Lu Xiechun, Cui Qianna, Ma Kai
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1222
    Online available: 2025-12-26

    [Objective] To address the problem that existing entity alignment methods underutilize the structural and semantic information of knowledge graphs, resulting in poor alignment performance, this paper constructs MuEmbedNet, an entity alignment model based on embedding enhancement and entity-relationship awareness. [Methods] The MuEmbedNet model enhances entity embeddings by generating multiple embedding representations for each entity and combining them through an attention mechanism and a two-layer improved GCN network; it further exploits the mutual mapping between entities and relations to fuse the relational features of graph-structured data into the entity features, yielding improved entity embedding representations. [Results] The MuEmbedNet model achieves average Hits@1, Hits@10, and MRR values of 89.3%, 97.2%, and 92.1%, respectively, on three publicly available cross-lingual datasets, exceeding all baseline models. Compared with the best average performance of the baselines, Hits@1, Hits@10, and MRR are improved by 6.1%, 0.5%, and 5%, respectively. [Limitations] The model performs better on entity alignment between knowledge graphs in the same language, while it remains limited for alignment across different language systems. [Conclusions] Through embedding enhancement and entity relationship-aware networks, the semantic and structural information of entities can be fully learned, which in turn effectively improves entity alignment.
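Hits@k and MRR, the metrics reported above, are computed from the rank that a model assigns to each gold-standard aligned entity; a minimal sketch with toy ranks (standard definitions, not the paper's evaluation code):

```python
def hits_and_mrr(ranks, k):
    """Hits@k and MRR from the 1-based rank of each gold entity.

    Hits@k: fraction of queries whose gold entity ranks within the top k.
    MRR: mean of the reciprocal ranks."""
    hits = sum(1 for r in ranks if r <= k) / len(ranks)
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    return hits, mrr

ranks = [1, 2, 5, 1]  # toy ranks of four gold entities
hits, mrr = hits_and_mrr(ranks, k=1)
print(round(hits, 2), round(mrr, 3))  # 0.5 0.675
```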

  • Zhang Jiacheng, Liu Zheli, Xiao Guangwen, Nie Lihai, Wang Yongchang, Shi Liang, Jin Meihong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0850
    Online available: 2025-12-26

    [Objective] To address the fragmentation of value-alignment evaluation systems for large language models, the insufficient coverage of Chinese-specific values, the scarcity of high-quality deep evaluation data, and the lagging evaluation methodologies, this study constructs a methodological framework and toolset for value-alignment assessment tailored to large language models.[Methods] We propose an integrated methodological framework that unifies value rules, evaluation data, and intelligent technologies. Under this framework, we design a three-dimensional evaluation system encompassing “capability–task–indicator,” carry out data collection, augmentation, and expert annotation, and build a systematic deep-evaluation scoring dataset. Ultimately, through pre-training, instruction fine-tuning, and expert-feedback training, we develop a value-alignment evaluation model.[Results] The constructed evaluation model achieves an accuracy of 98.57%, enabling automated assessment of value-alignment levels in large language models. Empirical findings show that domestic models exhibit overall higher alignment than foreign ones, though common issues remain, including insufficient incorporation of red cultural resources, factual and hallucinatory misinformation, weakened ideological expression, over-censorship, and limited dynamic adaptability.[Limitations] The study primarily targets text-based large language models, and its applicability to multimodal models requires further validation. In addition, the evaluation outputs are presented in three tiers—high, medium, and low—leaving room for improvement in interpretability.[Conclusion] This research contributes to improving a value-alignment assessment and governance system with Chinese characteristics, ensuring the healthy development of large language models within a safe, trustworthy, and controllable framework. 
It also provides essential technical support for effectively implementing mainstream values in China’s economic development and social governance.

  • Xue Zengcan, Zhang Xiaoran, Chen Jiarui, Liu Hai, Tan Jun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0582
    Online available: 2025-12-26

    [Objective] This paper summarizes the current state of research on knowledge graph completion technologies in China and abroad, providing a theoretical basis for promoting in-depth research on these technologies. [Literature Scope] Using keywords such as Knowledge Graph Completion and Link Prediction, we searched authoritative databases including Web of Science, Google Scholar, and China National Knowledge Infrastructure (CNKI), and selected 130 representative papers. [Methods] Based on literature review and summarization, the relevant research is reviewed and evaluated from three aspects: recommendation models, evaluation results, and future prospects. [Results] Knowledge graph completion models based on relational semantics can be classified into models for complex relational semantics, models for connected relational semantics, and models for implicit, heterogeneous, and sparse relational semantics. In terms of the MRR metric, the SimKGC model, designed for sparse relational semantics, achieves a 4.9-percentage-point improvement on the WN18RR dataset (0.666 vs. 0.617), while the DaBR model, designed for connected relational semantics, achieves a 3.4-percentage-point improvement on the FB15k-237 dataset (0.510 vs. 0.476). [Limitations] Some emerging technologies lack large-scale benchmark tests. Given the breadth of the field and the abundance of literature, not all relevant studies could be covered. [Conclusions] Compared with traditional methods, cutting-edge technologies perform better in knowledge graph completion. However, they lack interpretability and scalability, face difficulties in multimodal and temporal data fusion, and large language models carry a risk of hallucination. These are issues that future research needs to address.

  • Chen Xianlai, Xu Anming, Li Chenpeng, Zhu Zelin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0557
    Online available: 2025-12-26

    [Objective] To resolve the semantic alignment issue between clinical descriptions in medical imaging reports and lesion objects in medical images, thereby enhancing the correspondence between them. [Methods] This study proposes an Anatomy-Enhanced Medical Visual Grounding method (AEMVG) that improves medical visual grounding through two modules: the Anatomy Prior Knowledge Guidance module (APKG) and the Normal Feature Enhanced Lesion Grounding module (NELG). APKG generates guidance labels for training samples, enabling the model to comprehend anatomical structure information more accurately, thereby narrowing the localization search space at the global scale and reducing global localization uncertainty. NELG utilizes normal anatomical features as negative samples to enhance lesion identification and alleviate local localization uncertainty. [Results] Experiments on the MS-CXR dataset show that AEMVG achieves an ACC of 0.7246 and an mIoU of 0.6079, improving on the baseline model by 3.7% and 4.1%, respectively; visualization analysis indicates that its anatomical localization and lesion recognition align more closely with clinical diagnostic reasoning. [Limitations] This method has only been validated on X-ray images; its effectiveness for modalities such as CT and MRI remains untested. [Conclusions] AEMVG effectively enhances medical visual grounding models' capabilities in anatomical cognition and lesion differentiation, improving the correspondence between clinical descriptions and lesion objects.
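The mIoU metric above averages the intersection-over-union between predicted and ground-truth lesion boxes; the per-box IoU is computed as follows (standard definition with corner-format boxes, not the paper's code):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)   # overlap area (0 if disjoint)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7, so ~0.143
```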

  • Peng Mingyang, Gao Yan, Lai Yuqiao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0780
    Online available: 2025-12-26

    [Objective] This study proposes an Ordinal-Aware Hierarchical Fusion Network (OAFHN) to address two limitations in hateful meme detection: standard classification losses ignore the ordinal relationship of harmfulness levels, and symmetric penalty mechanisms misalign with content moderation needs. [Methods] First, we design an Ordinal-Aware & False Positive Penalty Loss (OPP-Loss) that reformulates classification as ordinal regression with an asymmetric penalty on false positives. Second, we construct a hierarchical multi-path fusion network that leverages vision-language models to generate semantic explanations as knowledge input and employs coarse-grained fusion, semantically-modulated attention, and low-rank bilinear pooling for multi-granularity feature modeling. [Results] On the Harm-C and Harm-P datasets, OAFHN achieves F1-scores of 83.46% and 88.39%, improving over existing methods by 0.66 and 0.13 percentage points respectively. Ablation studies validate the effectiveness of OPP-Loss and hierarchical fusion, with OPP-Loss contributing over 8 percentage points in F1-score improvement. [Limitations] The false positive penalty factor requires manual tuning, and the ordinal mapping is statically configured, insufficiently capturing internal heterogeneity within the "somewhat harmful" category. [Conclusions] Addressing task-specific challenges at the optimization level, combined with multi-granularity fusion and external knowledge injection, effectively enhances the robustness and accuracy of hateful meme detection.
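The core idea of an ordinal loss with an asymmetric false-positive penalty can be sketched with a simple distance-based cost; this toy version is an assumption for illustration only (the actual OPP-Loss formulation, level encoding, and penalty factor are the paper's, not shown here):

```python
def asymmetric_ordinal_cost(pred_level, true_level, fp_factor=2.0):
    """Toy ordinal cost: grows with the distance between predicted and true
    harmfulness levels (0 = harmless .. 2 = very harmful); over-predictions
    (false positives) are penalised `fp_factor` times more heavily, following
    the asymmetry described in the abstract."""
    dist = abs(pred_level - true_level)
    if pred_level > true_level:   # flagged as more harmful than it really is
        return fp_factor * dist
    return float(dist)            # equal or under-prediction

print(asymmetric_ordinal_cost(2, 0))  # 4.0
print(asymmetric_ordinal_cost(0, 2))  # 2.0
```

Treating the levels as ordered (rather than independent classes) makes a two-level mistake cost more than a one-level mistake, which a plain cross-entropy loss would not capture.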

  • Jun Deng, Dongyu Ye, Yidan Xing, Qi Zhang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0558
    Online available: 2025-12-26

    [Objective] To enhance the interpretability and implicit sentiment analysis capabilities of Aspect-Based Sentiment Analysis (ABSA) models, this study proposes an interpretable ABSA method leveraging supervised fine-tuning and reinforcement learning.[Methods] First, an ABSA reasoning dataset was constructed utilizing the DeepSeek R1 model. Second, Large Language Models underwent supervised fine-tuning to enhance their generative and sentiment analysis capabilities. Finally, reinforcement learning was employed to optimize the reasoning process and improve ABSA accuracy.[Results] Experimental results on the SemEval 2014 benchmark dataset demonstrate that the proposed method surpasses State-of-the-Art (SOTA) models, improving the F1 score by 1.26% and implicit sentiment classification accuracy by 3.18%.[Limitations] Experiments were limited to the aspect-based sentiment classification task and have not yet been extended to more complex tasks such as sentiment information extraction.[Conclusions] Reinforcement learning effectively optimizes the model’s reasoning and explanatory processes while enhancing implicit sentiment analysis capabilities. Furthermore, a well-designed compound reward function proves crucial for optimization. The proposed method demonstrates robust performance across both Chinese and English datasets.

  • Ma Yanzhou, Luo Tun, Wu Shengyi, Zhu Qi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0823
    Online available: 2025-12-26

    [Objective] This study focuses on topic-oriented sarcasm detection on Chinese social media. Existing Large Language Models (LLMs) tend to be over-sensitive and often misclassify strong stances or purely negative emotions as sarcasm, which undermines their robustness and accuracy on this task. [Methods] We design a dual-path reasoning framework. In the inductive path, the model retrieves similar cases accompanied by theoretical explanations to support analogy-based reasoning. In the deductive path, we construct a hierarchical judgment framework grounded in Impoliteness Theory and design layered prompts that guide the model from sentiment filtering to intention analysis. Finally, a decision fusion module integrates evidence from both paths to generate the final judgment. [Results] Comparative experiments on the ToSarcasm benchmark show that our method outperforms representative baselines. Using DeepSeek-V3.1 as the base model, our framework achieves an F1 score of 83.25% and a macro F1 score of 76.45%, surpassing the best-performing baselines by 10.44 and 14.63 percentage points, respectively. [Limitations] The performance of the inductive reasoning module is contingent upon the quality of the "sample-reasoning chain" database. The research was conducted primarily within a Chinese context, and its cross-lingual generalizability requires further validation. While the deductive framework enhances interpretability, its decision-making process fundamentally relies on the internal mechanisms of the LLM, which are not fully transparent. [Conclusion] The proposed theory-guided, dual-path reasoning framework effectively enhances the robustness and balance of Large Language Models in identifying topic-oriented sarcasm, offering a new paradigm for tackling complex pragmatic reasoning tasks.

  • Song Wenjie, Wang Liang, Xu Chao, Zheng Zhishuai, Zhu Xinjuan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0136
    Online available: 2025-12-26

    [Objective] To promote the dissemination and inheritance of museum artifacts and historical cultural knowledge, a method for constructing a cultural and historical knowledge Q&A system is proposed by integrating Knowledge Graphs (KG) with Retrieval-Augmented Generation (RAG).

    [Methods] First, to address the characteristics of cultural historical corpus texts containing numerous rare characters and classical Chinese, we developed a rare character dictionary and classical Chinese comparative lexicon, constructing a classical Chinese translation tool to reduce error generation in large language models (LLMs) when processing historical documents. Second, we proposed an effective method integrating text vectorization similarity retrieval with knowledge graph retrieval through prompt learning, with meticulously designed task-specific prompt templates to enable efficient knowledge reasoning and user interaction. Using Qin cultural historical materials as a case study, we built the Qin Culture Knowledge Q&A System (ChatQDC). [Results] Experimental results demonstrate that ChatQDC achieves accuracy improvements of approximately 8%, 20%, and 31% in question answering compared to ChatGPT-4, ChatGLM3-6B, and DeepSeek-R1-8B, respectively. [Limitations] The current system is primarily constructed for classical Chinese texts, and future research should further validate the adaptability and extensibility of the method in multi-modal historical data such as Terracotta Warrior images and tabular data. [Conclusions] Unlike traditional professional data fine-tuning methods, this system allows LLM deployment without retraining, enabling low-cost, energy-efficient applications in vertical domains.
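A lexicon-backed translation tool of the kind described above can be reduced to greedy longest-match substitution over the comparative lexicon, with unknown characters passing through unchanged. This toy sketch uses an invented two-entry lexicon and is not the ChatQDC implementation:

```python
def translate(text, lexicon):
    """Greedy longest-match substitution using a classical-to-modern lexicon;
    characters not covered by the lexicon pass through unchanged."""
    keys = sorted(lexicon, key=len, reverse=True)  # prefer longer matches
    out, i = [], 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(lexicon[key])
                i += len(key)
                break
        else:
            out.append(text[i])   # unknown character: keep as-is
            i += 1
    return "".join(out)

# Invented toy lexicon mapping classical Chinese fragments to English.
lexicon = {"秦王": "the King of Qin", "曰": " said"}
print(translate("秦王曰", lexicon))  # the King of Qin said
```

In the real pipeline, the normalized text would then be fed to the LLM, reducing errors caused by rare characters and classical grammar.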

  • Yu Chao, Liang Xudong, Zhang Tongyang, Xu Jian
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0649
    Online available: 2025-12-26

    [Objective] To break down disciplinary barriers, explore common knowledge between disciplines, and build a bridge for cross-field dialogue. [Methods] We analyze the citation contexts of existing cases of interdisciplinary common knowledge to extract co-citation patterns, and then apply these patterns, combined with indicators of commonality, interdisciplinarity, and novelty, to identify common knowledge with potential value. [Results] 1,044 candidate interdisciplinary knowledge pairs were identified from 53 million citation records, and an in-depth analysis was conducted on 7 cases scoring highly on interdisciplinarity, commonality, and novelty, revealing intersections between disciplinary issues at the level of basic methodology. [Limitations] The number of syntactic patterns currently used needs to be expanded, and there is still room for improvement in the acquisition of interdisciplinary common-knowledge instances. [Conclusions] The proposed method based on citation context analysis provides a new paradigm and methodology for uncovering common connections between disciplines.

  • Zhang Yunqiu, Yin Ce
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0494
    Online available: 2025-12-26

    [Objective] To address the performance limitations of MoE (Mixture-of-Experts) large language models in parameter mutation scenarios caused by knowledge solidification and linearized reasoning. [Methods] We propose an optimization framework that integrates dynamic knowledge verification with multi-agent collaboration. A knowledge-validated agent game mechanism is constructed, incorporating attention-guided topology reorganization and elastic resource allocation strategies to enhance nonlinear reasoning. Using DeepSeek-R1-8B as the baseline, we apply progressive knowledge distillation and validate the framework on parameter mutation problem sets from medical physics and materials science. [Results] The optimized model achieves an average score of 87.18 on the test set, improving by 11.12 points over the original 8B model, and outperforms the 671B model that relies solely on prompt engineering. [Limitations] The training data is domain-specific, limiting the model's generalization across broader parameter mutation contexts. [Conclusion] The proposed framework dynamically validates knowledge within MoE-based large models through agent-based game interactions, significantly enhancing reasoning performance in complex parameter mutation scenarios, and offers a reference for future research in this field.

  • Ao Yuxuan, Wang Hao, Zhou Shu, Bu Wenru
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0755
    Online available: 2025-12-26

    [Objective] To address the challenges of abstract poetic semantics, misalignment between emotion and imagery, and the insufficient artistic quality of existing results, we propose a poetry-to-image generation approach that balances semantic fidelity and aesthetic expressiveness.[Methods] We build a deep semantic understanding-multimodal imagery space-multi-stage collaborative generation framework: a pretrained language model with multi-task learning extracts structured semantics (emotion, imagery, rhetoric) and aligns them with visual features, followed by three-stage conditional diffusion to produce images.[Results] On our Poetic Visions dataset (3,124 poems and 4,212 images), compared with baselines such as GPT-4o + DALL·E 2, our method achieves an average relative improvement of about 6% over the best baseline in IS (26.87 vs 25.32), FID (14.98 vs 15.75; lower is better), CLIP Score (0.72 vs 0.68), and human ratings (3.7 vs 3.3).[Limitations] The multi-stage pipeline depends on early layout outputs, where deviations may propagate; fine-grained control and stability for long or highly abstract poems remain challenging.[Conclusions] Semantic guidance combined with collaborative multi-stage generation enhances poem–image alignment and artistic expression, offering a reusable framework for digital humanities and creative applications.

  • Han Mingxing, Lin Litao, Ou Shiyan, Xu Liwei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0834
    Online available: 2025-12-26

    [Objective] To address the issues that existing social bot detection studies overlook subtle differences in generated content and fail to construct an effective multi-dimensional feature-integration framework.[Methods] Large Language Models (LLMs) and a Mixture-of-Experts (MoE) system are used for social bot detection to capture the subtle differences in content features and integrate multi-dimensional features.[Results] On the Twibot-20 and Twibot-22 datasets, the proposed model outperforms other models in precision by at least 1.70% and 4.45%, respectively.[Limitations] Potential adversarial attacks have not been considered, nor have the ethical issues raised by misclassifying genuine accounts as social bots.[Conclusions] The proposed model provides powerful technical support for maintaining a healthy online ecosystem.
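
    The Mixture-of-Experts fusion described above can be sketched as a softmax-gated weighting of per-expert feature vectors. The expert outputs and gate logits below are hypothetical placeholders, not the paper's trained networks:

```python
import math

def softmax(xs):
    """Softmax over gate logits, yielding expert weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_combine(expert_outputs, gate_logits):
    """Fuse expert output vectors as a softmax-weighted sum."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [sum(w * out[d] for w, out in zip(weights, expert_outputs))
            for d in range(dim)]

# Three hypothetical experts (e.g. content, metadata, graph features).
experts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
print(moe_combine(experts, [0.0, 0.0, 0.0]))  # uniform gate, ~[0.5, 0.5]
```

    A learned gate shifts the logits per input, so different accounts are routed toward whichever feature dimension is most discriminative for them.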

  • Qiu Jingwen, Wang Hao, Yang Simin, Yao Tianchen, Tan Yuyao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2025.0858
    Online available: 2025-12-26

    [Objective] Historical figures are not only important actors on the stage of history but also epitomes of their times. Comprehensively and objectively reconstructing the images of historical figures is of great significance for grasping the connections between historical events and understanding the trends of historical development.[Methods] This paper proposes a framework for generating historical figure portraits that integrates event sentiment analysis, comprising the construction of historical figure event sets, multi-label classification of historical events, and sentiment recognition of historical figure events. On this basis, the paper analyzes historical figures from three aspects: the results of event classification, the matching degree between event sentiment and evaluative sentiment, and the evolutionary trends of event sentiment.

    [Results] The proposed figure behavior unit identification model (ISSI-RM) identifies approximately three groups of Figure's Behavior Units (FBUs) per sentence on average, while effectively resolving issues such as ambiguous or erroneous entity coreference. In the multi-label event classification task, the proposed ABG-MLC model attains an F1-score of 0.8, outperforming the baseline model by 0.14. The best sentiment recognition results are obtained by applying prompt learning to event sentiment recognition after historical event extraction and classification are complete, which verifies the effectiveness of the recognition pipeline proposed in this paper. Among multi-label events, the composite label "political ability + military ability" accounts for the highest proportion; in the Benji (basic annals), such composite labels concentrate on emperors who founded and stabilized dynasties, while in the Liezhuan (biographies) they concentrate on strategists. The matching degree between event sentiment and evaluative sentiment differs significantly between the Benji and the Liezhuan, with positive sentiment dominating among highly matched cases. The temporal trajectories of figure event sentiment cluster into four typical patterns.[Limitations] Future research can further expand the corpus and carry out cross-temporal comparative analysis of historical figures.[Conclusions] By integrating event sentiment analysis into the generation of historical figure portraits with digital technologies, this paper addresses the challenges of scalability and objectivity in traditional research on historical figures. It provides a new perspective for the study of historical figures and offers practical references for the in-depth mining of historical documents and the transformation of cultural resources.
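
    The multi-label F1 reported for ABG-MLC can, under the usual micro-averaging convention, be computed by pooling true and false positives across all labels and samples. A minimal sketch with illustrative label sets (the actual label inventory is the paper's):

```python
def micro_f1(gold, pred):
    """Micro-averaged F1 over multi-label sets: pool TP/FP/FN globally."""
    tp = sum(len(g & p) for g, p in zip(gold, pred))
    fp = sum(len(p - g) for g, p in zip(gold, pred))
    fn = sum(len(g - p) for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Illustrative event labels per sentence.
gold = [{"political", "military"}, {"diplomatic"}]
pred = [{"political", "military"}, {"diplomatic", "military"}]
# tp=3, fp=1, fn=0 -> precision 0.75, recall 1.0
print(round(micro_f1(gold, pred), 3))  # 0.857
```

    Micro-averaging rewards getting frequent composite labels (such as "political ability + military ability") right, which matters when one class dominates the distribution.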