
25 April 2026, Volume 10 Issue 4
    

  • Qian Li, Jiang Tian, Chang Zhijun, Ding Jielan, Hu Maodi, Liu Yi, Zhang Zhixiong
    Data Analysis and Knowledge Discovery. 2026, 10(4): 2-12. https://doi.org/10.11925/infotech.2096-3467.2026.0087

    [Objective] To support the pre-training, fine-tuning, and knowledge-enhanced reasoning of general and scientific large models, as well as the large-scale application of intelligent scientific literature information services. [Methods] This paper first analyzes the core challenges and intrinsic requirements faced by the AI4S paradigm, clarifying the conceptual connotation and key characteristics of the scientific literature knowledge base for AI4S. On this basis, it proposes a “data-model-service” three-layer theoretical framework for constructing this knowledge base, systematically outlines the key technologies and feasible implementation paths, and verifies the effectiveness of this theoretical framework through practical cases. [Results] Based on the aforementioned theories and methods, a preliminary infrastructure has been established, consisting of an AI-Ready data infrastructure centered on “Smart Data”, an intelligent model infrastructure centered on large language models for scientific literature and domain-specific models, and a multi-scenario-driven AI4S and AI4Data agent service infrastructure. This infrastructure has successfully supported research innovation activities in typical AI4S scenarios such as intelligent chemical engineering and digital cells. [Limitations] In large-scale practical applications involving multi-user collaboration and cross-domain scenarios, the theoretical framework of the scientific literature knowledge base for AI4S proposed in this paper still requires continuous verification and iterative optimization. [Conclusions] The constructed three-layer theoretical framework can provide a feasible pattern reference for scenarios such as the R&D of general and specialized intelligent models and the standardized processing of AI-ready corpora. The established scientific literature knowledge base will serve as a new type of research infrastructure to promote the improvement of pre-training efficiency and fine-tuning precision of general and scientific large models, as well as the enhancement of digital-intelligence capabilities including knowledge reasoning and computational analysis.

  • Zhang Yuanzhe, Ding Jielan, Qu Zihao, Lu Hongjun, Peng Wenjie, Zhou Jibin, Hu Maodi, Qian Li, Zhang Zhixiong
    Data Analysis and Knowledge Discovery. 2026, 10(4): 13-24. https://doi.org/10.11925/infotech.2096-3467.2026.0088

    [Objective] This study aims to meet the demand for intelligent transformation in chemical engineering by aggregating and mining domain knowledge from scientific literature to construct an AI4S knowledge base for chemical engineering. [Methods] A full-process toolchain is developed that aggregates and curates large-scale chemical engineering scientific literature into an organized raw repository, decomposes and reorganizes multimodal objects (text, tables, images, and formulas) into a multimodal repository, constructs a chemical engineering ontology, mines entities and relations using an intelligent chemical engineering large model (ChemELLM 3.0), and integrates and aligns the results to build a domain-specific knowledge base. [Results] A full-process toolchain covering scientific literature aggregation, cleaning and curation, decomposition and reorganization, knowledge mining, and alignment has been established. A chemical engineering knowledge base containing about 10.2 million raw curated records, about 10.2 million multimodal records, and about 260,000 knowledge triples has been constructed and deployed in the Big Data Center for Full-Chain Petrochemical and Chemical Engineering. [Limitations] For low-frequency long-tail entities and scenario-specific complex knowledge, the proposed knowledge mining method still requires improvement. [Conclusions] By bridging the “curation-parsing-mining” pipeline, this work enables the efficient construction of a chemical engineering knowledge base from scientific literature and provides knowledge support for AI4S in chemical engineering.

  • Yu Shirui, Yu Chi, Hu Zhengyin, Zhou Jibin, Peng Wenjie, Ye Mao, Ren Qianqian
    Data Analysis and Knowledge Discovery. 2026, 10(4): 25-38. https://doi.org/10.11925/infotech.2096-3467.2026.0089

    [Objective] Extracting high-value logical chains from massive amounts of chemical knowledge interactions is difficult, and traditional graph mining methods struggle to balance deep semantic understanding with high interpretability. To address both problems, this paper proposes a framework for mining key knowledge paths that integrates a chemical knowledge base with large language models. [Methods] First, based on the knowledge system of the chemical knowledge base, we design task-oriented multidimensional prompts to guide the large language model in accurately extracting knowledge entities from raw data. Next, the large language model is constrained by the knowledge base's authoritative terminology system and thesaurus to map entities to the knowledge base's standard ontology, thereby achieving efficient alignment of knowledge entities. Subsequently, within the knowledge base's graph network, graph mining algorithms identify candidate knowledge paths between target knowledge entities. Finally, by combining the large language model's analysis of key knowledge entities with degree centrality, we filter the large set of candidate knowledge paths by quality to identify high-value, interpretable, and traceable key knowledge paths. [Results] Empirical validation was conducted on a dataset of 1,650 question-answer pairs in the chemical engineering field: 10 high-quality key knowledge paths were extracted from each question-answer pair, and manual evaluation by domain experts found that 71.8% of these paths were valid. When these paths were injected into the reasoning process of a large language model, the model's answer accuracy improved to 77.7%, significantly higher than the accuracy without path injection (17.3%) and the accuracy relying solely on the model's self-extraction (60.5%). [Limitations] Knowledge nodes are represented solely by keywords, making it difficult to comprehensively cover the original information; the single-dimensional path selection strategy restricts in-depth mining in complex scenarios; and fine-grained classification and in-depth value assessment of knowledge paths have not yet been conducted. [Conclusions] The synergy between the chemical engineering knowledge base and large language models significantly improves the quality and efficiency of key knowledge path mining, providing high-quality data support for intelligent research in the chemical engineering field.
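
    The graph-side steps of this abstract (enumerating candidate paths between two target entities, then filtering them with degree centrality) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `max_hops` limit, and the scoring rule (mean degree centrality of the nodes on a path) are assumptions made for illustration, and the large-language-model analysis that the paper combines with centrality is omitted.

```python
from collections import defaultdict


def build_graph(triples):
    """Build an undirected adjacency map from (head, relation, tail) triples."""
    graph = defaultdict(set)
    for head, _relation, tail in triples:
        graph[head].add(tail)
        graph[tail].add(head)
    return graph


def degree_centrality(graph):
    """Degree of each node divided by (n - 1), the usual normalization."""
    n = len(graph)
    if n <= 1:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}


def candidate_paths(graph, source, target, max_hops=3):
    """Enumerate simple paths from source to target with at most max_hops edges."""
    paths = []
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in path:  # keep paths simple (no repeated nodes)
                stack.append((neighbor, path + [neighbor]))
    return paths


def rank_paths(graph, source, target, top_k=10, max_hops=3):
    """Rank candidate paths by the mean degree centrality of their nodes.

    The scoring rule is an illustrative assumption, not the paper's criterion.
    """
    centrality = degree_centrality(graph)

    def score(path):
        return sum(centrality[node] for node in path) / len(path)

    paths = candidate_paths(graph, source, target, max_hops)
    return sorted(paths, key=score, reverse=True)[:top_k]
```

    Calling `rank_paths(build_graph(triples), source, target)` on a list of triples returns up to ten candidate paths, highest-scoring first; in a fuller pipeline these would then pass through the model-based quality filter described in the abstract.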

  • Wu Yaoting, Chang Yingxiao, Qian Li, Qu Yunpeng, Guo Dan, Ding Jielan, Chang Zhijun, Wang Haolin, Yang Yanxi, Zhu Ziping
    Data Analysis and Knowledge Discovery. 2026, 10(4): 39-54. https://doi.org/10.11925/infotech.2096-3467.2026.0090

    [Objective] To address the demand for precise knowledge services in critical stages of digital cell research, such as knowledge discovery and hypothesis generation, an intelligent question-answering framework empowered by a knowledge base infrastructure is proposed. [Methods] The framework constructs a hybrid knowledge base to provide multimodal data support, designs a query-aware dynamic retrieval strategy to enable cross-base interaction and retrieval weight optimization, and guides agents to iteratively refine question-answering outputs through a self-reflection mechanism. [Results] Experimental evaluation on core query benchmarks in the digital cell field demonstrates an average recall rate of 93.2%, validating the framework's capability for precise knowledge filtering. Building upon this, a digital cell knowledge base infrastructure encompassing over 10.77 million scholarly publications is further developed, along with an intelligent knowledge service platform supporting multimodal retrieval. [Limitations] There remains room for improvement in vectorized representation and recall precision for complex query scenarios. [Conclusions] The proposed approach effectively enables deep fusion and retrieval utilization of multimodal knowledge from scientific literature, providing specialized intelligent question-answering support for digital cell research.

  • Wang Qianqian, Liu Chuxuan, Wang Lu, Cheng Laixiu, Dai Jingyi, Qian Li
    Data Analysis and Knowledge Discovery. 2026, 10(4): 55-65. https://doi.org/10.11925/infotech.2096-3467.2026.0091

    [Objective] To develop an intelligent assessment method for carbon density based on multimodal data mining, enabling dynamic and accurate quantification. [Methods] Multimodal content is parsed using document structure and layout analysis techniques. Key fields such as carbon density are extracted using prompt engineering and retrieval-augmented generation (RAG) with a large language model (LLM), combined with optical character recognition (OCR), to construct a database. Carbon density interpolation is performed using regression kriging. [Results] A high-precision carbon density database was constructed, and analysis revealed that the average aboveground carbon density in mining areas increased from its lowest value of 1.13 kg/m² in 2010 to 1.41 kg/m² in 2020, with the recovery period closely aligning with the implementation timeline of multiple environmental protection policies. [Limitations] Because the quantity and distribution of carbon density data are uneven across years, accuracy varies considerably among sampling points. Further collection of relevant data is required to improve precision. [Conclusions] This study provides data and model support for carbon cycle assessment and offers new approaches for regional ecological monitoring and quantitative policy evaluation.

  • Chang Yuan, Li Ziyue, Kong Yuanbo, Le Xiaoqiu
    Data Analysis and Knowledge Discovery. 2026, 10(4): 66-88. https://doi.org/10.11925/infotech.2096-3467.2026.0133

    [Objective] To systematically review the methodological framework and application progress of large language model-driven scientific hypothesis generation, and to reveal the current research landscape and development trends in this field. [Coverage] Using keywords such as “Large Language Models” and “Scientific Hypothesis Generation”, we conducted comprehensive searches in three major academic databases: Web of Science, Google Scholar, and CNKI. Representative literature from 2021 to 2026 was screened, resulting in a final set of 98 papers for analysis. [Methods] An analytical framework was established along three dimensions: generation process logic, evolution of technical pathways, and key issues. Existing approaches at each stage (knowledge acquisition, preliminary hypothesis generation, iterative refinement, and evaluation and validation) were systematically reviewed. The underlying technical architectures were analyzed and compared, core difficulties and current solutions were examined in depth, and relevant benchmark datasets and representative applications were summarized. [Results] LLMs' capabilities in knowledge integration and association discovery offer a new paradigm for scientific hypothesis generation and have already yielded experimentally verified hypotheses in real-world scenarios across multiple domains. Current research shows a clear synergistic trend among five core technical pathways: context engineering, supervised fine-tuning, reinforcement learning, planning and search, and multi-agent collaboration. A preliminary methodological system has been established for the core generation process; however, challenges remain in knowledge clue discovery, innovative hypothesis reasoning, and credibility, with model hallucination and intrinsic reasoning capabilities being the primary bottlenecks. [Limitations] As this emerging interdisciplinary field evolves rapidly, some of the most recent works may not be fully covered. This study mainly focuses on reviewing methodological frameworks and does not systematically compare the quantitative performance of existing methods. [Conclusions] Large language models have demonstrated the capability to assist in generating, or even to autonomously discover, scientifically valuable hypotheses, enabling scalable and cross-disciplinary hypothesis exploration. Future research should seek breakthroughs in four key areas: balancing reliability and novelty of generated hypotheses, enhancing LLMs' deep reasoning capabilities, developing innovative human-AI collaborative paradigms, and establishing a closed-loop system integrating hypothesis generation and experimental verification.

  • Qiu Hanqi, Chen Wei
    Data Analysis and Knowledge Discovery. 2026, 10(4): 89-103. https://doi.org/10.11925/infotech.2096-3467.2025.0611

    [Objective] This study proposes a framework that integrates technology life cycle analysis, a multi-dimensional measurement system, and semantic-enhanced hypernetwork node embedding to quantify the evolutionary characteristics of technology convergence and reveal its development trends. [Methods] Development stages are delineated based on technology life cycle analysis, and a temporal technology convergence network is constructed using hypernetwork modeling. A multi-dimensional measurement system is developed across three levels—hypernetwork, hyperedge, and node—to systematically characterize the evolutionary patterns of technology convergence. In addition, a Semantic-Enhanced Hypernetwork Node Embedding (SHNE) method is introduced to uncover potential technology convergence relationships. [Results] A case study of the all-solid-state battery field demonstrates a staged evolutionary process of technology convergence, from material exploration and performance optimization to industrial application. High-value potential convergence directions, including battery thermal management and high-nickel cathode interface modification, are identified. [Limitations] The hypernetwork is constructed solely based on IPC co-occurrence relationships, without incorporating patent citation data, and the extraction of fine-grained technological elements remains limited. [Conclusions] The proposed framework effectively reveals the evolutionary characteristics and potential convergence directions of technology fields, providing a new perspective for studying technology convergence.

  • Zhu Hou, Tan Yawen, Wu Zishuai
    Data Analysis and Knowledge Discovery. 2026, 10(4): 104-115. https://doi.org/10.11925/infotech.2096-3467.2025.0309

    [Objective] This paper aims to construct a technical framework for privacy agreement violation detection based on natural language processing technologies, so as to automatically identify non-compliant content and semantically interpret the relevant laws and regulations. [Methods] First, we analyze the Information Security Technology — Personal Information Security Specification (GB/T 35273-2020) and extract 19 core items of privacy agreement content. On this basis, a complete technical framework, from content identification to violation judgement, is constructed by integrating text classification, named entity recognition, and QLoRA fine-tuning of large language models. [Results] The fine-tuned Gemma-2b model achieves its best violation detection results on Dataset 1, significantly outperforming the ChatGLM2-6b model (F1-score: 0.7647 vs. 0.3735). In generating compliance explanations, the Gemma-2b model also surpasses the ChatGLM2-6b model in the BERTScore evaluation (F1-score: 0.8054 vs. 0.7440), indicating higher-quality explanations. [Limitations] The general orientation of current standards restricts the detection granularity in specific scenarios, and the input length limitation of models affects the semantic integrity of context. [Conclusions] The technical framework proposed in this study can quickly identify the core content of privacy policies and conduct interpretable violation detection, enhancing the ability to supervise and monitor the implementation of relevant laws and regulations in privacy policies.

  • Zhao Guangyu, Duan Yongkang, Geng Qian, Yan Yan, Jin Jian
    Data Analysis and Knowledge Discovery. 2026, 10(4): 116-129. https://doi.org/10.11925/infotech.2096-3467.2025.0214

    [Objective] When pre-trained language models are applied to government question retrieval, they often suffer from embedding anisotropy and limited domain generalization, which leads to insufficient recall and degraded matching accuracy. To address these challenges, this paper presents GovSQR, a fine-grained semantic retrieval model for government similar question matching. [Methods] GovSQR employs structured prompt engineering with few-shot demonstrations to guide a large language model in generating task-adapted positive and hard negative samples. A supervised SimCSE framework is then used to fine-tune RoBERTa on the resulting triplet data. To mitigate false negative interference, GovSQR incorporates a dynamic weighted masking mechanism and a debiased contrastive loss function. [Results] Experiments on a Shenzhen government question dataset show that GovSQR achieves P@1, R@3, and MRR scores of 0.9660, 0.9811, and 0.9729, respectively, outperforming contrastive learning baselines including InfoCSE and DiffCSE. [Limitations] The LLM-based data generation process is susceptible to hallucination, incurring non-trivial manual verification costs. Moreover, the effectiveness of the model on queries that are semantically complex or ambiguously phrased requires further investigation. [Conclusions] By integrating LLM-driven data augmentation with false negative debiasing, GovSQR yields more discriminative and isotropically distributed sentence embeddings, substantially improving retrieval accuracy in the government question answering domain.
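
    For readers unfamiliar with the supervised SimCSE objective mentioned in this abstract, the sketch below computes the standard contrastive loss over (anchor, positive, hard negative) triplets in plain Python. It shows only the baseline objective: the dynamic weighted masking and debiased loss that GovSQR adds are not reproduced, and the function names and the temperature of 0.05 are assumptions for illustration rather than the paper's settings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supervised_simcse_loss(anchors, positives, hard_negatives, temperature=0.05):
    """Mean over i of -log(exp(s(i, i+)) / sum_j [exp(s(i, j+)) + exp(s(i, j-))]),

    where s(a, b) = cosine(a, b) / temperature and j ranges over the batch.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        positive_logit = cosine(anchor, positives[i]) / temperature
        # Every positive and every hard negative in the batch is a candidate.
        logits = [cosine(anchor, p) / temperature for p in positives]
        logits += [cosine(anchor, n) / temperature for n in hard_negatives]
        # Log-sum-exp with max subtraction for numerical stability.
        max_logit = max(logits)
        log_denominator = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_denominator - positive_logit
    return total / len(anchors)
```

    The loss is small when each anchor is closer to its own positive than to any other positive or hard negative in the batch, which is the behavior the fine-tuning step relies on.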

  • Liu Ji, Dai Wei
    Data Analysis and Knowledge Discovery. 2026, 10(4): 130-145. https://doi.org/10.11925/infotech.2096-3467.2024.1135

    [Objective] This paper proposes a dual-channel model incorporating syntactic structure and knowledge enhancement (SKE) to improve aspect-level sentiment classification. [Methods] The SKE model comprises two channels: the BERT-Enhanced Graph Network (BEGN), which constructs a supplementary dependency graph using syntactic information from BERT's intermediate layers to strengthen graph convolutional network modeling of dependency relations; and the Semantic Enhanced Knowledge Network (SEKN), which leverages a generative model to produce external knowledge, thereby enriching the semantic representation of sentences. The outputs of both channels are fused via a BiAffine parser, achieving deep integration of syntactic and semantic information. [Results] On the Twitter, Laptop, and Restaurant datasets, compared with the best-performing model among 14 mainstream baselines, SKE achieves accuracy improvements of 4.05, 3.62, and 1.11 percentage points, and Macro-F1 improvements of 4.36, 3.31, and 2.12 percentage points, respectively. [Limitations] The model is evaluated solely on public datasets, and relies exclusively on textual information without incorporating multimodal inputs. [Conclusions] By reinforcing dependency syntactic information and introducing external semantic knowledge, SKE achieves dual enhancement of syntactic and semantic representations, effectively improving aspect-level sentiment classification accuracy, and demonstrates particular applicability in handling sentences with complex syntactic structures and relatively insufficient semantic information.

  • Tong Xin, Lin Zhi, Yuan Lining, Wang Jingya, Jin Bo
    Data Analysis and Knowledge Discovery. 2026, 10(4): 146-160. https://doi.org/10.11925/infotech.2096-3467.2025.0251

    [Objective] To address the issues of insufficient accuracy and limited interpretability in risk instruction mining for large language models, this study proposes an agent-driven enhancement framework to improve the performance of existing detection tools. [Methods] The framework integrates four key modules: language alignment, hierarchical detection, dual-stream explanation, and consistency verification. The language alignment module enables unified mapping of multilingual inputs; the hierarchical detection module performs multi-stage risk analysis; the dual-stream explanation module provides rationales for analysis and decision-making; and the consistency verification module enhances reliability when processing complex samples. [Results] Experiments on three risk instruction datasets show that the proposed framework can improve the accuracy of commonly used detection tools from 54.75% to 93.75%. Even when using only open-source models as the core engine, the framework achieves an accuracy improvement of over 20%. [Limitations] The inference efficiency of the framework still needs to be improved, and the stability of structured outputs remains insufficient for some lightweight models. [Conclusions] The proposed framework provides a general, interpretable, and cross-lingual enhancement solution for risk instruction mining in large language models.

  • Hu Xinxin, Qiu Qinjun, Huang Zehua, Lu Xiechun, Cui Qianna, Ma Kai
    Data Analysis and Knowledge Discovery. 2026, 10(4): 161-173. https://doi.org/10.11925/infotech.2096-3467.2024.1222

    [Objective] Existing entity alignment methods perform poorly because they make insufficient use of the structural and semantic information in large-scale knowledge graphs. To address this problem, this paper proposes ERAEA (Embedding-enhanced and Relation-Aware Entity Alignment), an entity alignment model based on embedding enhancement and entity-relation awareness. [Methods] The ERAEA model generates multiple embedding representations for entities and enhances entity embeddings by incorporating an attention mechanism and a two-layer improved GCN. Furthermore, through mutual mapping between entities and relations, relational features from graph-structured data are integrated into entity representations to obtain enriched entity embeddings. [Results] Experimental results show that the ERAEA model achieves average Hits@1, Hits@10, and MRR values of 89.3%, 97.2%, and 92.1%, respectively, on three publicly available cross-lingual datasets, outperforming all baseline models. Compared with the average best performance among baseline methods, Hits@1, Hits@10, and MRR are improved by 6.1, 0.5, and 5.0 percentage points, respectively. [Limitations] The model struggles to establish reliable cross-lingual mappings when applied to entity alignment across different language families. [Conclusions] Through its embedding enhancement and entity-relation awareness modules, the model fully learns the semantic and structural information of entities, which in turn effectively improves entity alignment performance.
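
    The Hits@k and MRR figures reported in this abstract follow the usual ranking definitions, which the sketch below spells out. The similarity-matrix input, the function name, and the optimistic handling of ties are assumptions for illustration; the paper's evaluation code may differ.

```python
def alignment_metrics(similarity, gold, ks=(1, 10)):
    """Compute Hits@k and MRR for entity alignment.

    similarity[i][j] is the score between source entity i and target entity j;
    gold[i] is the index of the correct target for source entity i.
    """
    hits = {k: 0 for k in ks}
    reciprocal_rank_sum = 0.0
    for i, row in enumerate(similarity):
        gold_score = row[gold[i]]
        # Rank = 1 + number of candidates scoring strictly higher than the
        # correct target (ties are resolved in the correct target's favor).
        rank = 1 + sum(1 for score in row if score > gold_score)
        reciprocal_rank_sum += 1.0 / rank
        for k in ks:
            if rank <= k:
                hits[k] += 1
    n = len(similarity)
    hits_at_k = {f"Hits@{k}": hits[k] / n for k in ks}
    return hits_at_k, reciprocal_rank_sum / n
```

    Hits@k is the share of source entities whose correct counterpart appears among the top k candidates, and MRR averages the reciprocal of that counterpart's rank, so both lie between 0 and 1 and are often reported as percentages.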