Table of Contents

25 March 2025, Volume 9 Issue 3
    

  • Shi Xi, Chen Wenjie, Hu Zhengyin, Han Tao, Zhang Kai
    Data Analysis and Knowledge Discovery. 2025, 9(3): 1-15. https://doi.org/10.11925/infotech.2096-3467.2024.0176

    [Objective] This study aims to efficiently extract scientific experiment knowledge and data from academic literature. It constructs a Scientific Experiment Knowledge Graph (SEKG) to provide high-quality data support for knowledge discovery. [Methods] We utilized Event Knowledge Graph technology to uniformly represent and model the complexity, temporality, and integration of knowledge and data in scientific experiments, thereby establishing the schema layer of the SEKG. A large language model was employed to enhance the efficiency of knowledge extraction in the data layer, with an empirical analysis conducted on organic solar cells. [Results] Using manual annotation and fine-tuned large language models, we constructed a scientific experiment knowledge graph for the field of organic solar cells. This SEKG comprises 34 types of nodes and 9 types of relationships, totaling 24,348 nodes and 123,642 relations. [Limitations] The data sources were limited to papers and patents. The construction of the SEKG required substantial manual input from experts, highlighting the need for efficiency improvements. Furthermore, fine-grained research procedures and validation rules in subfields were not considered. [Conclusions] The proposed method provides high-quality data support for applications such as experimental protocol recommendation, scientific experiment evolution analysis, and AI for Science, effectively supporting various knowledge discovery scenarios.
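
    As an illustrative aside (not the authors' implementation), an event-style scientific experiment knowledge graph can be represented as typed nodes and typed relations; the minimal networkx sketch below uses hypothetical node and relation types standing in for the paper's 34 node types and 9 relation types.

    ```python
    # Minimal sketch: a typed, multi-relational graph assembled from extracted triples.
    # The node and relation types here are hypothetical placeholders, not the paper's schema.
    import networkx as nx

    sekg = nx.MultiDiGraph()

    # Hypothetical extracted triples: (head, head_type), relation, (tail, tail_type)
    triples = [
        (("Device fabrication", "ExperimentStep"), "uses_material", ("Donor polymer", "Material")),
        (("Device fabrication", "ExperimentStep"), "uses_material", ("Acceptor molecule", "Material")),
        (("Device fabrication", "ExperimentStep"), "produces", ("Efficiency measurement", "Result")),
    ]

    for (head, head_type), relation, (tail, tail_type) in triples:
        sekg.add_node(head, node_type=head_type)
        sekg.add_node(tail, node_type=tail_type)
        sekg.add_edge(head, tail, relation=relation)

    print(sekg.number_of_nodes(), sekg.number_of_edges())  # 4 3
    ```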

  • Feng Yong, Shen Jintao, Xu Hongyan, Wang Rongbing, Liu Tingting, Zhang Yonggang
    Data Analysis and Knowledge Discovery. 2025, 9(3): 16-27. https://doi.org/10.11925/infotech.2096-3467.2024.0003

    [Objective] Existing sentiment analysis research insufficiently fuses multimodal data and overlooks the impact of intermodal heterogeneity, which limits sentiment classification accuracy. To address this, a cross-fusion multimodal sentiment analysis model based on the Translate mechanism is proposed. [Methods] Firstly, the Translate mechanism is employed to achieve mutual transformation between text, image, and audio modality features. Subsequently, the transformed modality features are fused with the target modality features (unimodal fusion) to mitigate the impact of intermodal heterogeneity on model performance. Finally, cross-modal fusion is used to enable comprehensive interaction among the different modality features, generating multimodal features that effectively capture unimodal information for sentiment classification via a classifier. [Results] Comparative experiments with current mainstream sentiment analysis models are conducted on the CMU-MOSI and CMU-MOSEI public datasets. The results show that the proposed model achieves a 0.96% improvement in accuracy and a 1.00% improvement in F1-score compared to the suboptimal models. [Limitations] The contribution of each modality to sentiment analysis varies in multimodal data, and the model does not specifically consider scenarios where the contribution of the image and audio modalities is higher than that of the text modality. [Conclusions] The proposed model fully integrates intermodal information, avoids the influence of intermodal heterogeneity, and can effectively improve overall performance.
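
    As a rough illustration of translating one modality's features into another and then fusing them, the PyTorch sketch below uses simple linear translators and concatenation; the module choices and dimensions are assumptions for illustration, not the paper's architecture.

    ```python
    # Minimal sketch: translate text features into the audio/image feature spaces,
    # fuse per modality, then fuse across modalities for classification.
    import torch
    import torch.nn as nn

    class TranslateFuseSketch(nn.Module):
        def __init__(self, dim=128, num_classes=3):
            super().__init__()
            # Translators between modality feature spaces (illustrative linear maps).
            self.text_to_audio = nn.Linear(dim, dim)
            self.text_to_image = nn.Linear(dim, dim)
            # Unimodal fusion: combine translated features with the target modality.
            self.fuse_audio = nn.Linear(2 * dim, dim)
            self.fuse_image = nn.Linear(2 * dim, dim)
            # Cross-modal fusion followed by a sentiment classifier.
            self.classifier = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(),
                                            nn.Linear(dim, num_classes))

        def forward(self, text, image, audio):
            audio_fused = self.fuse_audio(torch.cat([audio, self.text_to_audio(text)], dim=-1))
            image_fused = self.fuse_image(torch.cat([image, self.text_to_image(text)], dim=-1))
            multimodal = torch.cat([text, image_fused, audio_fused], dim=-1)
            return self.classifier(multimodal)

    logits = TranslateFuseSketch()(torch.randn(4, 128), torch.randn(4, 128), torch.randn(4, 128))
    print(logits.shape)  # torch.Size([4, 3])
    ```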

  • Zhao Yong, Fu Zhongmeng, Wang Yunuo, Mao Beinan
    Data Analysis and Knowledge Discovery. 2025, 9(3): 28-41. https://doi.org/10.11925/infotech.2096-3467.2024.0048

    [Objective] This study aims to analyze the evolutionary characteristics of academic lineages by constructing networks of disciplinary knowledge transfer. [Methods] First, academic genealogy networks are established, and scholars’ generational positions are identified based on mentor-mentee relationship data. Then, using disciplinary knowledge classification systems, a hybrid recommendation algorithm is employed to reorganize and link scholars’ knowledge domains. This process facilitates the construction of knowledge transfer networks, enabling an analysis of the intergenerational and temporal evolution characteristics of disciplinary knowledge. [Results] The methodology is systematically demonstrated through the acquisition of academic genealogy and publication data in the field of computer science. The findings indicate that intergenerational relationships among scholars are predominantly two-generational, with cross-disciplinary mentorship playing a significant role. As the number of generations within the field increases, the transfer of disciplinary knowledge tends to be more biased towards propagation among sub-disciplines within the same field or inheritance within the same sub-discipline. [Limitations] Knowledge transfer occurs in a variety of ways; however, this study does not fully consider the integration of citation data and other multi-source data. Additionally, the scale and quality of academic genealogy data may affect the analysis results. [Conclusions] By constructing disciplinary knowledge transfer networks from the perspective of academic genealogy and conducting evolutionary analysis, this study can refine the granularity of the description of the scientific development process and provide methodological insights for exploring the micro-mechanisms of knowledge transfer.
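
    For illustration only, the sketch below links mentor-mentee pairs to the sub-disciplines their scholars work in and counts intergenerational transfer edges; the names, fields, and simple counting scheme are invented stand-ins for the paper's hybrid recommendation and network construction pipeline.

    ```python
    # Illustrative sketch: count intergenerational knowledge-transfer edges between
    # sub-disciplines from mentor-mentee pairs. Names and fields are invented.
    from collections import Counter

    import networkx as nx

    mentorships = [("A", "B"), ("B", "C"), ("A", "D")]          # mentor -> mentee
    scholar_field = {"A": "databases", "B": "machine learning",
                     "C": "machine learning", "D": "databases"}

    genealogy = nx.DiGraph(mentorships)
    transfer = Counter((scholar_field[m], scholar_field[s]) for m, s in genealogy.edges())

    # Same-field pairs indicate inheritance within a sub-discipline; cross-field
    # pairs indicate propagation between sub-disciplines.
    print(transfer)
    ```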

  • Cao Kun, Wu Xinnian, Bai Guangzu, Jin Junbao, Zheng Yurong, Li Li
    Data Analysis and Knowledge Discovery. 2025, 9(3): 42-55. https://doi.org/10.11925/infotech.2096-3467.2024.0006

    [Objective] This study explores methods for identifying key core technologies by integrating the textual content characteristics of “science-technology” and complex network relationships. It supports governments, research institutions, and industries in formulating scientific and technological strategies and conducting innovation activities. [Methods] First, we employed the Sentence-BERTopic model to perform deep semantic fusion and knowledge topic clustering on sentence-level paper and patent text corpora. Then, we constructed a “science-technology” knowledge topic complex network based on the citation relationships of these documents. Third, we improved the traditional PageRank algorithm by incorporating node quality characteristics, time decay factors, the weights of incoming node edges, and outdegree. This approach ranked the importance and influence of nodes within the domain. Finally, we identified key core technologies using the head/tail breaks method. [Results] We conducted an empirical study on CNC machine tools and identified 53 key core technologies, including thermal error modeling and compensation, CNC machine tool control technology, and feed systems. A comparison with relevant domestic and international policy plans demonstrates that the identified technologies comprehensively encompass the key core technologies in the field. [Limitations] This study lacks an in-depth analysis of citation locations, motivations, behaviors, and purposes, which may affect identification accuracy. [Conclusions] This study reveals the knowledge structure and topological characteristics of science and technology by constructing a “science-technology” complex network and applying the Key Core Rank (KCR) algorithm. The proposed method achieves fine-grained and precise quantitative identification of key core technologies.
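
    As a hedged illustration of the general idea of re-weighting PageRank with node quality and time decay (not the paper's KCR formulation), the following pure-Python sketch runs a power iteration in which teleportation is proportional to node quality and edge weights decay with the age of the citing node.

    ```python
    # Illustrative sketch of a PageRank variant with quality-weighted teleportation and
    # time-decayed edge weights; this is not the paper's KCR algorithm, and the decay
    # rate, damping factor, and toy scores below are assumptions.
    import math

    def weighted_pagerank(edges, quality, year, d=0.85, decay=0.1, iters=50, now=2025):
        nodes = set(quality)
        q_total = sum(quality.values())
        # Outgoing edges, weighted by a time-decay factor favouring recent citing nodes.
        out = {n: [] for n in nodes}
        for src, dst in edges:
            out[src].append((dst, math.exp(-decay * (now - year[src]))))

        rank = {n: 1.0 / len(nodes) for n in nodes}
        for _ in range(iters):
            new = {n: (1 - d) * quality[n] / q_total for n in nodes}
            for src in nodes:
                total = sum(w for _, w in out[src]) or 1.0
                for dst, w in out[src]:
                    new[dst] += d * rank[src] * w / total
            rank = new
        return rank

    citations = [("topic_A", "topic_B"), ("topic_C", "topic_B"), ("topic_B", "topic_A")]
    print(weighted_pagerank(citations,
                            quality={"topic_A": 1.0, "topic_B": 2.0, "topic_C": 1.0},
                            year={"topic_A": 2018, "topic_B": 2022, "topic_C": 2024}))
    ```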

  • Dou Luyao, Zhou Zhigang, Shen Jing, Feng Yu, Miao Junzhong
    Data Analysis and Knowledge Discovery. 2025, 9(3): 56-68. https://doi.org/10.11925/infotech.2096-3467.2024.0023

    [Objective] This study aims to address the challenges of identifying high-value patents, specifically the issue of long-distance dependencies in sequence modeling and the extraction of key features from patent text sequences, and to improve both the accuracy and interpretability of high-value patent identification. [Methods] We propose XLBBC, a model for high-value patent identification, which integrates the pre-trained XLNet model and a bidirectional attention mechanism (BiAttention). The XLNet model is utilized for patent text representation and semantic extraction, while a BiGRU network captures global sequence information. The BiAttention layer is incorporated to allow the model to focus on different segments of the input sequence, and a CNN layer captures key phrases and patterns in the patent text. Empirical research is conducted using a mixed patent dataset from industries including amorphous alloys, industrial robotics, perovskite solar cells, and gene chips. [Results] The XLBBC model demonstrates strong performance, achieving an accuracy of 0.89 and consistency of 0.65 on a dataset of 40,000 patent records. The prediction accuracy of the model is around 42%, which is a 9% improvement over existing models. [Limitations] The model does not account for the relationship and integration mechanisms between standard-essential patents and high-value patents. Additionally, there is room for improvement in the efficiency and scalability of the algorithm. [Conclusions] The XLBBC model outperforms traditional methods in handling complex textual data. It shows superior performance in text classification compared to CNN-based ensemble models. XLNet excels in global semantic understanding, and placing the attention layer between the XLNet-BiGRU and CNN layers leads to the best overall model performance.
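
    The stacking order described above (pre-trained encoder, BiGRU, attention, CNN, classifier) can be sketched in PyTorch as follows; an nn.Embedding stands in for XLNet and standard multi-head self-attention stands in for the BiAttention layer, so treat this as a simplified, assumption-laden illustration rather than the XLBBC model itself.

    ```python
    # Minimal sketch of the encoder -> BiGRU -> attention -> CNN -> classifier stack.
    import torch
    import torch.nn as nn

    class XLBBCSketch(nn.Module):
        def __init__(self, vocab=30000, dim=256, num_classes=2):
            super().__init__()
            self.encoder = nn.Embedding(vocab, dim)          # stand-in for the XLNet encoder
            self.bigru = nn.GRU(dim, dim // 2, bidirectional=True, batch_first=True)
            self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
            self.cnn = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
            self.classifier = nn.Linear(dim, num_classes)

        def forward(self, token_ids):
            x = self.encoder(token_ids)                      # (batch, seq, dim)
            x, _ = self.bigru(x)                             # global sequence information
            x, _ = self.attn(x, x, x)                        # attend over the sequence
            x = torch.relu(self.cnn(x.transpose(1, 2)))      # capture local key phrases
            return self.classifier(x.max(dim=-1).values)     # pool and classify

    logits = XLBBCSketch()(torch.randint(0, 30000, (2, 64)))
    print(logits.shape)  # torch.Size([2, 2])
    ```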

  • Hou Jianhua, Deng Xianjiang, Tang Shiqi
    Data Analysis and Knowledge Discovery. 2025, 9(3): 69-82. https://doi.org/10.11925/infotech.2096-3467.2024.0353

    [Objective] This study aims to explore the influence of interdisciplinary knowledge integration on the emergence of high-value patents and to delineate their distinctive characteristics. [Methods] High-value patents are operationalized as patents that receive the China Patent Gold Award. Interdisciplinary knowledge integration is quantified by two dimensions: IPC classification and patent knowledge units. Regression analysis investigates the effects of interdisciplinary knowledge integration, measured by these two dimensions, on both patent award status and individual patent value dimensions. [Results] The analysis reveals that high-value patents tend to exhibit a narrower interdisciplinary scope in terms of IPC classification, while simultaneously demonstrating a more diverse knowledge structure. In particular, interdisciplinary knowledge integration, when indicated by IPC classification, shows an inverted U-shaped relationship with patent value. Conversely, interdisciplinary knowledge integration, when indicated by knowledge units, shows a negative correlation with patent value. [Limitations] This study is limited by its reliance on the China Patent Gold Award as the sole proxy for high-value patents, which may not fully encompass the multifaceted nature of high-value patent characteristics. [Conclusions] This research provides valuable insights into the proactive identification and protection of high-value patents. Furthermore, the findings inform strategies to enhance upstream patent quality control and to facilitate effective patent translation and commercial utilization.
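
    The inverted U-shaped relationship reported above is conventionally tested by adding a squared term to a regression; the sketch below shows that standard specification on synthetic data, with variable names that are illustrative rather than taken from the paper.

    ```python
    # Minimal sketch: test for an inverted U by including a quadratic term.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    ipc_breadth = rng.uniform(0, 4, 500)                       # interdisciplinary scope (synthetic)
    value = 2 * ipc_breadth - 0.6 * ipc_breadth**2 + rng.normal(0, 0.5, 500)

    X = sm.add_constant(np.column_stack([ipc_breadth, ipc_breadth**2]))
    model = sm.OLS(value, X).fit()
    # An inverted U is indicated by a positive linear and a negative quadratic coefficient.
    print(model.params)
    ```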

  • Xie Xiaodong, Wu Jie, Sheng Yongxiang, Wang Jiangang, Zhou Xiao
    Data Analysis and Knowledge Discovery. 2025, 9(3): 83-95. https://doi.org/10.11925/infotech.2096-3467.2024.0050

    [Objective] To promote collaboration among inventors and enhance innovation efficiency, this study identifies potential collaborators and distinguishes their types from patent documents. [Methods] From a dynamic network perspective, this study considered the structural and attribute changes of the inventor collaboration network over time. We proposed a method of identifying potential collaborators for inventors based on dynamic graph convolutional networks, allowing for further classification of inventor partners. [Results] Utilizing patent data from the integrated circuit sector for empirical validation, the proposed method achieved an AUC of 0.8464, an error rate of 0.2897, an ER+ of 0.0830, and an ER- of 0.2067, all of which significantly outperformed the baseline models. [Limitations] The study only considers inventors’ patent information while ignoring other multi-source innovation outputs, such as academic papers. [Conclusions] The proposed method effectively enhances the accuracy of potential partner identification by leveraging dynamic changes in network structures and node attributes. Identifying potential collaborators and categorizing their types help inventors formulate collaboration strategies, enhancing efficiency and outcomes. This study effectively supplements the existing frameworks for partner selection methodologies.
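
    As a generic illustration of dynamic-graph link prediction (not the authors' model), the sketch below applies one dense graph-convolution layer per yearly snapshot, runs a GRU over the snapshot sequence, and scores candidate collaborations with a dot product; the layer choices and dimensions are assumptions.

    ```python
    # Minimal sketch: per-snapshot graph convolution, GRU over time, dot-product link score.
    import torch
    import torch.nn as nn

    class DynamicGCNSketch(nn.Module):
        def __init__(self, in_dim=16, hid=32):
            super().__init__()
            self.gcn = nn.Linear(in_dim, hid)
            self.gru = nn.GRU(hid, hid, batch_first=True)

        def forward(self, adjs, feats):
            # adjs, feats: per-snapshot (N, N) adjacency and (N, in_dim) feature matrices.
            states = []
            for adj, x in zip(adjs, feats):
                deg = adj.sum(1, keepdim=True).clamp(min=1)
                states.append(torch.relu(self.gcn((adj / deg) @ x)))   # one simple GCN layer
            h, _ = self.gru(torch.stack(states, dim=1))                # (N, T, hid)
            return h[:, -1]                                            # latest node embeddings

        def link_score(self, emb, i, j):
            return torch.sigmoid((emb[i] * emb[j]).sum())

    n = 5
    model = DynamicGCNSketch()
    emb = model([torch.eye(n)] * 3, [torch.randn(n, 16)] * 3)
    print(model.link_score(emb, 0, 1))
    ```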

  • Zhang Xiaoli, Kuang Heng
    Data Analysis and Knowledge Discovery. 2025, 9(3): 96-105. https://doi.org/10.11925/infotech.2096-3467.2023.1150

    [Objective] This study proposes a link prediction method based on dynamic graph neural networks for text embedding, aiming to model and predict the integration trend of technological innovation in artificial intelligence. It also reveals potential technological connections and innovation pathways. [Methods] We integrated patent abstract texts into the node feature representations of the dynamic graph neural networks. By leveraging the learning capabilities of dynamic graph neural networks, we obtained more accurate link prediction results. [Results] Using the domestic AI field as an example, the method achieved an AUC improvement of approximately 0.06 compared to similar and traditional graph representation learning models. [Limitations] Because high-dimensional embeddings are difficult to integrate with graph neural networks, large language models were not used to embed the patent abstracts. [Conclusions] The proposed method has a high predictive accuracy, enhancing the credibility of AI patent convergence forecasting. It is an effective way to predict fine-grained links.
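
    One simple way to obtain text-based node features for such a dynamic graph is sketched below using TF-IDF plus truncated SVD; this choice of vectorizer and the toy abstracts are assumptions for illustration, not the embedding method used in the paper.

    ```python
    # Minimal sketch: turn patent abstract text into low-dimensional node features.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer

    abstracts = {
        "patent_A": "A convolutional neural network for image recognition ...",
        "patent_B": "A speech recognition method based on recurrent networks ...",
        "patent_C": "A knowledge graph embedding method for recommendation ...",
    }

    tfidf = TfidfVectorizer().fit_transform(abstracts.values())
    node_features = TruncatedSVD(n_components=2).fit_transform(tfidf)
    # node_features[i] would then be attached to node i in each snapshot of the dynamic graph.
    print(dict(zip(abstracts, node_features.round(3))))
    ```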

  • Liu Xiaohui, Tao Chengxu, Xu Wei, Wu Jiang
    Data Analysis and Knowledge Discovery. 2025, 9(3): 106-116. https://doi.org/10.11925/infotech.2096-3467.2024.0972

    [Objective] This paper provides a method for pricing personal data in the context of large language models by quantifying privacy loss and compensating for privacy value. [Methods] Based on the premise hypothesis and differential privacy, we proposed a method that evaluates the data privacy value and quantifies the data value using directional statistics. We assessed the new method with the SST-2 dataset. [Results] Model accuracy increases as the privacy parameter or the budget increases, demonstrating the new method’s effectiveness. [Limitations] The choice of dataset and model architecture is relatively limited. The pricing mechanism only considers the impact of privacy factors on pricing. [Conclusions] The proposed method can evaluate the privacy value of data and quantify the data’s value, providing support for personal data pricing in the context of large language models.
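
    As a purely illustrative numeric sketch of the intuition that a larger privacy parameter means less noise, higher utility, and thus higher compensation, the code below pairs a Laplace mechanism with a hypothetical linear pricing rule; neither is the paper's pricing method.

    ```python
    # Minimal sketch: Laplace-noised statistic plus a hypothetical epsilon-based price.
    import numpy as np

    rng = np.random.default_rng(0)

    def noisy_statistic(values, epsilon, sensitivity=1.0):
        # Laplace mechanism: noise scale shrinks as epsilon grows.
        return values.mean() + rng.laplace(scale=sensitivity / (epsilon * len(values)))

    def price(epsilon, base_value=10.0):
        return base_value * epsilon          # hypothetical compensation rule, not the paper's

    data = rng.normal(50, 5, size=1000)
    for eps in (0.1, 1.0, 5.0):
        print(eps, round(noisy_statistic(data, eps), 3), price(eps))
    ```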

  • Li Wanbin, Shen Si
    Data Analysis and Knowledge Discovery. 2025, 9(3): 117-126. https://doi.org/10.11925/infotech.2096-3467.2023.1394

    [Objective] This study establishes a unified entity recognition framework, which effectively identifies valuable entities in unstructured academic literature. [Methods] We adopted the BERT+Global Pointer (GP) framework to model entity boundaries with a unified approach. Then, we designed the cross-entropy loss function for the pointer mechanism. Finally, we conducted multi-model comparison verification using CRF, GPT-4, and BERT. [Results] The proposed model demonstrated robust precision, recall, and F1 scores across datasets. The average F1 scores on non-nested datasets reached 95.38% and 79.81%, while on nested datasets, the scores reached 66.91% and 61.47%. Moreover, the overall model performance surpassed the comparison models without requiring manually designed feature templates. [Limitations] For the overall recognition of nested entities, further optimization of the GP model is necessary to efficiently and accurately identify relevant entities from an application perspective. [Conclusions] The GP framework effectively leverages entity location features in unified recognition. For complex nested entities, it not only improves recognition accuracy but also enhances convenience in identification.
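
    The Global Pointer mechanism scores every candidate (start, end) span with a per-type product of start and end representations; the PyTorch sketch below shows that scoring step in simplified form, omitting rotary position encoding and the paper's loss design.

    ```python
    # Minimal sketch of Global Pointer span scoring: one score matrix per entity type.
    import torch
    import torch.nn as nn

    class GlobalPointerSketch(nn.Module):
        def __init__(self, hidden=768, head_dim=64, num_types=4):
            super().__init__()
            self.num_types, self.head_dim = num_types, head_dim
            self.proj = nn.Linear(hidden, num_types * head_dim * 2)

        def forward(self, h):                                   # h: (batch, seq, hidden)
            b, n, _ = h.shape
            qk = self.proj(h).view(b, n, self.num_types, 2, self.head_dim)
            q, k = qk[..., 0, :], qk[..., 1, :]                 # (batch, seq, types, head_dim)
            # scores[b, t, i, j] is the score of span (i, j) as an entity of type t.
            scores = torch.einsum("bmtd,bntd->btmn", q, k) / self.head_dim**0.5
            return scores

    scores = GlobalPointerSketch()(torch.randn(2, 20, 768))
    print(scores.shape)  # torch.Size([2, 4, 20, 20])
    ```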

  • Yu Xiaosheng, Huang Ying, Zhang Yuntao, Chen Peng
    Data Analysis and Knowledge Discovery. 2025, 9(3): 127-135. https://doi.org/10.11925/infotech.2096-3467.2024.0203

    [Objective] To address the challenges of multiple nested entities and semantic ambiguity in English texts, this study proposes a nested entity recognition method named GTR-NNER, which integrates multi-dimensional word information. [Methods] The proposed method employs a triaffine attention-guided graph convolutional network (GCN) module to integrate multiple types of word information, including word content, word position, word boundary, word label, and syntactic information. Based on the extracted multi-dimensional information, span enumeration is performed, followed by entity recognition through a discriminator. [Results] The proposed GTR-NNER method achieves average F1 scores of 84.38% and 91.44% on two nested NER datasets through 10-fold cross-validation. Additionally, on two partially nested datasets, GENIA and ACE2005, it attains F1 scores of 82.19% and 89.27%, respectively. [Limitations] The integration of multi-dimensional word information slows down the model’s convergence speed. [Conclusions] Incorporating multi-dimensional word information into NER models effectively enhances the performance of nested entity recognition.
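
    Span enumeration followed by a discriminator, as described above, can be sketched as follows; the boundary-concatenation scorer is a simplifying assumption, and the triaffine attention-guided GCN fusion is not reproduced here.

    ```python
    # Minimal sketch: enumerate bounded-width spans and classify each with an MLP.
    import torch
    import torch.nn as nn

    def enumerate_spans(seq_len, max_width=8):
        return [(i, j) for i in range(seq_len) for j in range(i, min(i + max_width, seq_len))]

    class SpanDiscriminator(nn.Module):
        def __init__(self, dim=128, num_labels=6):             # labels include "not an entity"
            super().__init__()
            self.scorer = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                        nn.Linear(dim, num_labels))

        def forward(self, token_reprs, spans):
            feats = torch.stack([torch.cat([token_reprs[i], token_reprs[j]]) for i, j in spans])
            return self.scorer(feats)                           # (num_spans, num_labels)

    tokens = torch.randn(30, 128)                               # fused multi-dimensional word reprs
    spans = enumerate_spans(len(tokens))
    print(SpanDiscriminator()(tokens, spans).shape)
    ```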

  • Zhang Dongliang, Liao Yongan, Cheng Ge
    Data Analysis and Knowledge Discovery. 2025, 9(3): 136-146. https://doi.org/10.11925/infotech.2096-3467.2024.0336

    [Objective] This paper proposes an enhanced case similarity calculation method, addressing the limitations of existing case similarity calculation methods in capturing long-distance, global, and discontinuous legal relationships between key legal elements, as well as the challenges in distinguishing between textually similar but legally dissimilar cases. [Methods] First, we constructed a case knowledge graph to structurally represent factual elements. Then, we combined graph convolutional networks with bidirectional long short-term memory networks to encode the graph and perceive complex legal relationships between subjects and objects. Finally, we introduced a hard/easy mixed negative sample mining mechanism to improve the model’s ability to distinguish difficult cases. [Results] Experiments conducted on the benchmark dataset provided by CAIL show that our proposed model outperforms the champion model by 11% and the optimal attention-based convolutional neural network method by 7%. [Limitations] The construction of the case knowledge graphs may affect the efficiency of similarity computation. However, this issue can be mitigated by strategies such as offline graph construction and node pre-vectorization. [Conclusions] Our method effectively perceives complex legal relationships between key legal elements, learns the distinctions and connections between different cases, and significantly improves the performance of case similarity computation.
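
    A hedged sketch of mixing hard and easy negatives in a triplet-style objective is given below; the cosine-based hard-negative selection, the mixing ratio, and the margin are illustrative assumptions rather than the paper's mining mechanism.

    ```python
    # Minimal sketch: pick the most confusable cases as hard negatives, add random easy
    # negatives, and train with a triplet margin loss.
    import torch
    import torch.nn.functional as F

    def mixed_negatives(anchor_emb, candidate_embs, num_hard=2, num_easy=2):
        sims = F.cosine_similarity(anchor_emb.unsqueeze(0), candidate_embs)
        hard_idx = sims.topk(num_hard).indices                  # most confusable cases
        easy_idx = torch.randperm(len(candidate_embs))[:num_easy]
        return torch.cat([hard_idx, easy_idx])

    anchor = torch.randn(64)                                     # query case embedding
    positive = torch.randn(64)                                   # a truly similar case
    candidates = torch.randn(100, 64)                            # pool of dissimilar cases

    neg_idx = mixed_negatives(anchor, candidates)
    loss = F.triplet_margin_loss(anchor.expand(len(neg_idx), -1),
                                 positive.expand(len(neg_idx), -1),
                                 candidates[neg_idx], margin=0.3)
    print(loss)
    ```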

  • Siriguleng, Lin Min, Guo Zhendong, Zhang Shujun
    Data Analysis and Knowledge Discovery. 2025, 9(3): 147-160. https://doi.org/10.11925/infotech.2096-3467.2024.0325

    [Objective] This study addresses the challenges of inefficient fine-tuning and suboptimal extraction performance in deep learning-based entity-relation extraction for ancient texts in low-resource scenarios, which mainly stem from dependency on large-scale annotated data. [Methods] We propose a joint extraction framework combining prompt learning and extractive machine reading comprehension (MRC). First, entity recognition and relation extraction tasks are unified into an MRC framework to streamline the model architecture. Second, three lightweight prompt strategies are designed using domain-specific knowledge to reduce task complexity. Finally, we develop MPG-GP, a joint extraction model integrating a pre-trained language model with a global pointer network, to effectively extract etiquette entity-relation triples from ancient texts. [Results] Experiments on a custom ancient etiquette entity-relation extraction dataset show F1-score improvements of 0.32% to 6.05% over baseline methods. [Limitations] The prompt templates employ fixed patterns rather than learnable soft prompts, and the prompt engineering design warrants further refinement. [Conclusions] Our approach mitigates reliance on large annotated datasets while improving the accuracy of few-shot joint entity-relation extraction for ancient ritual texts, providing a novel solution for information extraction in low-resource historical documents.
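
    Casting entity and relation extraction as machine reading comprehension with prompt-style questions can look roughly like the sketch below; the templates and the etiquette schema are invented examples, not the paper's prompts or its MPG-GP model.

    ```python
    # Minimal sketch: build (question, context) MRC instances for entities, then for relations.
    ENTITY_PROMPTS = {
        "Person": "Which spans in the passage name a person involved in the rite?",
        "Ritual": "Which spans in the passage name a ritual or ceremony?",
    }
    RELATION_PROMPT = "In the passage, which {tail_type} does the {head_type} '{head}' participate_in?"

    def build_mrc_instances(passage, entities):
        # Stage 1: one (question, passage) pair per entity type.
        instances = [{"question": q, "context": passage} for q in ENTITY_PROMPTS.values()]
        # Stage 2: one relation question per recognised head entity.
        for head, head_type in entities:
            instances.append({"question": RELATION_PROMPT.format(
                tail_type="Ritual", head_type=head_type, head=head), "context": passage})
        return instances

    for item in build_mrc_instances("The duke performed the capping ceremony ...",
                                    [("the duke", "Person")]):
        print(item["question"])
    ```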