
Top access

  • Yu Bengong, Xing Yu, Zhang Shuwen
    Data Analysis and Knowledge Discovery. 2024, 8(11): 22-32. https://doi.org/10.11925/infotech.2096-3467.2023.0746
    Abstract (975) PDF (204) HTML (593)

    [Objective] To fully extract features from multiple modalities, align and integrate multimodal features, and design downstream tasks, we propose a multimodal collaborative contrastive learning model for aspect-based sentiment analysis (MCCL-ABSA). [Methods] Firstly, on the text side, we utilized the similarity between aspect words and their encodings within sentences. On the image side, the model used the similarity of images encoded in different sequences after random cropping to construct the positive and negative samples required for contrastive learning. Secondly, we designed the loss function for the contrastive learning tasks to learn more distinguishable feature representations. Finally, we fully integrated text and image features for multimodal aspect-based sentiment analysis while dynamically fine-tuning the encoder through the contrastive learning tasks. [Results] On the TWITTER-2015 dataset, our model's accuracy and F1 scores improved by 0.82% and 2.56%, respectively, compared to the baseline model. On the TWITTER-2017 dataset, the highest accuracy and F1 scores were 0.82% and 0.25% higher than those of the baseline model. [Limitations] The model's generalization to other datasets remains to be examined. [Conclusions] The MCCL-ABSA model effectively improves feature extraction quality, achieves feature integration with a simple and efficient downstream structure, and enhances the efficacy of multimodal sentiment classification.
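
    The abstract does not spell out the contrastive objective; the following is a minimal InfoNCE-style sketch in PyTorch, where the temperature and tensor sizes are illustrative assumptions rather than the paper's settings.

```python
# Hypothetical InfoNCE-style contrastive loss: matched (anchor, positive)
# pairs sit on the diagonal; every other pair in the batch is a negative.
import torch
import torch.nn.functional as F

def info_nce_loss(anchor: torch.Tensor, positive: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    anchor = F.normalize(anchor, dim=-1)        # unit-length features
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.T / temperature  # (B, B) cosine similarities
    labels = torch.arange(anchor.size(0))       # i-th anchor matches i-th positive
    return F.cross_entropy(logits, labels)

# Toy usage: 8 pairs of 256-dimensional embeddings.
print(info_nce_loss(torch.randn(8, 256), torch.randn(8, 256)))
```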

  • Wang Zhenyu, Zhu Xuefang, Yang Rui
    Data Analysis and Knowledge Discovery. 2025, 9(1): 90-99. https://doi.org/10.11925/infotech.2096-3467.2023.1273
    Abstract (916) PDF (191) HTML (384)

    [Objective] This paper utilizes large language models (LLMs) to generate high-quality auxiliary knowledge, aiming to improve the performance of multimodal relation extraction. [Methods] We introduced a multimodal similarity detection module to construct multimodal prompt templates, which allow the LLM to integrate visual information and prior knowledge into the generated high-quality auxiliary knowledge. We combined the obtained auxiliary knowledge with the original text and input it into downstream text models to accurately predict entity relationships. [Results] The proposed model outperformed the best baseline model on the MNRE dataset, achieving 4.09% and 7.84% improvements in accuracy and F1 score, respectively. [Limitations] We only examined the proposed model on English datasets. [Conclusions] Comparative experiments and case studies validate the model's effectiveness in multimodal relation extraction. Our new model provides a direction for applying LLMs to multimodal information extraction tasks in the future.

  • Chen Ting, Ding Honghao, Zhou Haoyu, Wu Jiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 159-171. https://doi.org/10.11925/infotech.2096-3467.2023.1424
    Abstract (819) PDF (127) HTML (716)

    [Objective] This study explores the impacts of bullet-screen (danmu) content and behavioral characteristics on consumers' purchasing behavior in live-streaming e-commerce, as well as the moderating effect of host-product relevance. [Methods] First, we retrieved the bullet-screen data from the Douyin platform and the consumer data from the Huitun platform, guided by the Elaboration Likelihood Model. Then, we studied the impacts of bullet-screen content characteristics (central route) and behavioral characteristics (peripheral route) on consumer purchasing behavior with text mining and zero-inflated negative binomial regression. We also examined the moderating effect of host-product relevance with grouped regression. [Results] Information richness, degree of social interaction, and number of bullet-screen comments positively impact purchasing behavior. The emotional polarity of bullet-screen comments exhibits an inverted U-shaped effect on purchasing behavior. Compared with live-streaming rooms with low host-product relevance, those with high host-product relevance show broader positive impacts on purchasing behavior. [Limitations] We only investigated bullet-screen data from a single live-streaming e-commerce platform. [Conclusions] This study examines the factors influencing consumers' actual purchasing behavior from the perspective of bullet-screen comments. It provides insights for improving communication between merchants and consumers in live-streaming e-commerce, ultimately enhancing sales performance.
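
    As an illustration of the regression technique named here, the sketch below fits a zero-inflated negative binomial model with statsmodels on synthetic data; the column names are hypothetical stand-ins for the paper's bullet-screen features.

```python
# Sketch: zero-inflated negative binomial regression with statsmodels.
import numpy as np
import pandas as pd
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "info_richness": rng.normal(size=n),  # central-route feature (assumed)
    "danmu_count": rng.normal(size=n),    # peripheral-route feature (assumed)
})
df["purchases"] = rng.poisson(1.0, n)     # zero-heavy count outcome

X = pd.DataFrame({"const": 1.0,
                  "info_richness": df["info_richness"],
                  "danmu_count": df["danmu_count"]})
# exog models the count process, exog_infl the excess-zero process.
model = ZeroInflatedNegativeBinomialP(df["purchases"], X, exog_infl=X)
print(model.fit(method="bfgs", maxiter=500, disp=False).summary())
```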

  • Song Mengpeng, Bai Haiyan
    Data Analysis and Knowledge Discovery. 2025, 9(6): 21-34. https://doi.org/10.11925/infotech.2096-3467.2024.0628
    Abstract (814) PDF (73) HTML (557)

    [Objective] This paper aims to automatically generate structured literature reviews with references, helping researchers quickly grasp a specific area of scientific knowledge. [Methods] A corpus was constructed by selecting 70,000 papers from the NSTL platform and identifying moves in their abstracts. The GLM3-6B model was fine-tuned on 3,000 reviews that were generated with a large language model and then revised manually. The corpus was converted into high-dimensional vectors and stored in an index, which was retrieved to implement LangChain's external knowledge base. To address the poor retrieval of proper nouns, a hybrid search with BM25 and re-ranking was used to improve retrieval accuracy. [Results] The literature review generation system built on fine-tuning and the hybrid retrieval framework improved the BLEU and ROUGE scores by 109.64% and 40.22%, respectively, and the authenticity score in manual evaluation by 62.17%. [Limitations] Due to limited computational resources, the local model is small in parameter scale, and its generation ability needs further improvement. [Conclusions] Retrieval-augmented generation with large language models not only generates high-quality literature reviews but also provides traceable evidence for the generated content and assists researchers in intelligent reading.
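
    The hybrid-retrieval step can be illustrated as below: BM25 lexical scores are fused with dense-retrieval scores via min-max normalization and a weighted sum. This is a sketch under assumed toy data and an assumed mixing weight, not the system's actual configuration.

```python
import numpy as np
from rank_bm25 import BM25Okapi  # pip install rank-bm25

docs = ["fine-tuning large language models for review generation",
        "BM25 ranking for proper noun retrieval",
        "vector indexes for retrieval-augmented generation"]
bm25 = BM25Okapi([d.split() for d in docs])

query = "proper noun retrieval".split()
lexical = np.asarray(bm25.get_scores(query))
# Stand-in dense scores; in a real system these come from the vector index.
dense = np.random.default_rng(0).random(len(docs))

def minmax(x: np.ndarray) -> np.ndarray:
    return (x - x.min()) / (x.max() - x.min() + 1e-9)

alpha = 0.5  # assumed lexical/dense mixing weight
fused = alpha * minmax(lexical) + (1 - alpha) * minmax(dense)
print("ranking:", fused.argsort()[::-1])  # best document first
```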

  • Zhang Jing, Gao Zixin, Ding Weijie
    Data Analysis and Knowledge Discovery. 2025, 9(2): 48-58. https://doi.org/10.11925/infotech.2096-3467.2023.1347
    Abstract (636) PDF (115) HTML (182)

    [Objective] This paper proposes a new model to effectively classify massive police reports. [Methods] We constructed a text classification model based on BERT-DPCNN, using the BERT pre-trained model to generate word vectors. The model improved classification performance by optimizing the activation function in the DPCNN and refining the dynamic learning-rate strategy. [Results] We conducted comparative experiments between BERT-DPCNN and six other models: BERT, BERT-CNN, BERT-RCNN, BERT-RNN, BERT-LSTM, and ERNIE. BERT-DPCNN achieved the best accuracy, recall, and precision. In the binary classification tasks, its accuracy exceeded 98%; in the eleven-category tasks, it exceeded 82%. [Limitations] The model has many parameters, and the limited number of experiments calls for further testing. [Conclusions] The new model effectively improves the accuracy of police report classification, providing data support for police departments in analyzing and assessing police incidents.

  • Zhou Zhigang, Dou Luyao, Li Yi, Bai Zengliang
    Data Analysis and Knowledge Discovery. 2024, 8(12): 52-61. https://doi.org/10.11925/infotech.2096-3467.2023.0883
    Abstract (574) PDF (106) HTML (407)

    [Objective] This paper identifies potential high-value patents by deeply mining the feature information embedded in patent texts based on bilateral semantics and text sequence features. [Methods] First, we constructed a mixed patent dataset from the fields of amorphous alloys, industrial robots, and gene chips. Then, we employed the BERT word vector model to achieve contextual semantic association and word meaning interpretation of patent texts. Third, we utilized the BiGRU network to extract global text sequence information while CNN captured local text sequence information. Finally, we predicted potential high-value patents by combining “bilateral semantics+global+local” semantic and sequence features. [Results] The proposed BERT-BiGRU-CNN model outperforms existing models and is more suitable for predicting potential high-value patents on a large data scale. Our new model achieves a prediction accuracy of over 35%, about 4% higher than the existing ones. [Limitations] The relationship and integration mechanism between standard essential and high-value patents have yet to be considered, and the algorithm complexity needs further optimization. [Conclusions] The BERT-BiGRU-CNN model performs better in text classification tasks than the CNN model. Our new model improves the prediction accuracy of potentially high-value patents by capturing global and local text sequence features.

  • Zhang Le, Chen Yansong, Zhang Leihan
    Data Analysis and Knowledge Discovery. 2025, 9(8): 47-58. https://doi.org/10.11925/infotech.2096-3467.2024.0625

    [Objective] This paper proposes a method that enhances features using large language models and integrates them through multi-level cross-fusion. It addresses the issue in multimodal sentiment analysis, where emotional expressions across different modalities are inconsistent, hindering effective collaborative sentiment decision-making. [Methods] To alleviate the conflicting sentiment information among modalities and improve the representation of sentiment features, we used the multimodal large language model to extract the auxiliary sentiment information within each modality. Then, we employed a hierarchical cross-attention mechanism to learn shared emotional features across modalities while mining auxiliary intra-modal emotional features, thereby enhancing the expression of shared semantic sentiment. During the fusion phase, a modality-attention weighted fusion method is introduced to balance the contributions of shared and auxiliary features. Additionally, we utilized a loss function combining multimodal and unimodal inputs to address the sentiment semantic inconsistencies. [Results] The proposed model outperforms baselines on the public datasets CH-SIMS and CMU-MOSI. On CH-SIMS, binary classification accuracy and F1 score increased by 1.77 and 0.63 percentage points, respectively. On CMU-MOSI, improvements of 0.43 and 0.41 percentage points were observed. For CH-SIMS data with emotional inconsistency, the binary classification accuracy and F1 score increased by 1.80 and 1.72 percentage points, respectively. This demonstrates that the proposed model can effectively address the issue of inconsistent sentiment semantics across modalities. [Limitations] The model does not account for the impact of personalized information on individuals in videos. [Conclusions] The proposed approach effectively integrates multimodal features using a hierarchical cross-attention mechanism, improves the representation of shared semantic sentiment, and addresses inconsistencies in emotional semantics across different modalities.
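
    One level of cross-modal attention of the kind described can be sketched with PyTorch's built-in attention module; all dimensions below are illustrative, not the paper's.

```python
# Sketch: text tokens act as queries over image patches, yielding text
# features enriched with shared visual-emotional context.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
text = torch.randn(2, 20, 256)   # (batch, text tokens, dim)
image = torch.randn(2, 49, 256)  # (batch, image patches, dim)

shared, _ = attn(query=text, key=image, value=image)
print(shared.shape)  # torch.Size([2, 20, 256])
```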

  • Chen Wanzhi, Hou Yue
    Data Analysis and Knowledge Discovery. 2025, 9(7): 52-65. https://doi.org/10.11925/infotech.2096-3467.2024.0720

    [Objective] To address the issues in multimodal sentiment analysis, such as insufficient multimodal feature extraction, semantic differences between modalities, and lack of interaction, we propose a temporal multimodal sentiment analysis model that integrates multi-level attention and sentiment scale vectors. [Methods] Firstly, we introduced a scalar Long Short-Term Memory network with a multi-head attention mechanism to construct a deep temporal feature modeling network for extracting rich contextual temporal features from text, audio, and visual modalities. Secondly, we employed the text-guided dual-layer cross-modal attention mechanism and the improved self-attention mechanism to facilitate the deep information exchange across modalities, thereby generating two sentiment scale vectors for sentiment intensity and polarity. Finally, the L1 norm of the sentiment intensity vector was multiplied by the normalized sentiment polarity vector to obtain a comprehensive representation of sentiment strength and polarity, thereby enabling accurate sentiment prediction. [Results] Experiments on the CMU-MOSI dataset show that the proposed model achieves good results in both comparative and ablation experiments, outperforming the next-best model by 1.2 and 2.3 percentage points on the Acc7 and Corr metrics, respectively. On the CMU-MOSEI dataset, the proposed model surpasses baseline models across all evaluation metrics, achieving 86.0% in Acc2 and 86.1% in F1 score. [Limitations] Sentiment expression is highly context-dependent, and the sources of sentiment cues may vary across different scenarios. The proposed model may perform poorly when textual information is insufficient. [Conclusions] The proposed model effectively extracts contextual temporal features from various modalities and leverages the rich emotional information in the text modality for deep inter-modal interaction, thereby enhancing the accuracy of sentiment prediction.
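
    The final scoring step, as described, multiplies the L1 norm of the intensity vector by the normalized polarity vector; a toy PyTorch rendering with made-up vectors:

```python
# Sketch: the L1 norm carries "how strong", the normalized polarity
# vector carries "which way"; their product is the combined representation.
import torch
import torch.nn.functional as F

intensity = torch.tensor([0.3, -0.2, 0.5])  # example intensity vector
polarity = torch.tensor([0.8, -0.1, 0.4])   # example polarity vector

strength = intensity.abs().sum()            # L1 norm
direction = F.normalize(polarity, dim=0)    # unit polarity vector
print(strength * direction)                 # signed sentiment representation
```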

  • Rang Yuchen, Ma Jing
    Data Analysis and Knowledge Discovery. 2025, 9(1): 100-109. https://doi.org/10.11925/infotech.2096-3467.2023.1130
    Abstract (488) PDF (113) HTML (377)

    [Objective] To reduce inter-modal differences and strengthen the correlation between modalities, this paper proposes a multimodal alignment sentiment analysis model to accurately capture the sentiment tendencies embedded in multimodal data. [Methods] For the textual modality, the original text data, supplemented with image captions, is processed using the RoBERTa pre-trained model for text feature extraction. For the image modality, we used the CLIP Vision Model to extract image features. The text and image features are aligned through a multimodal alignment layer based on a Multimodal Transformer to obtain enhanced fused features. Finally, the fused multimodal features are fed into a multilayer perceptron for sentiment recognition and classification. [Results] The proposed model achieved an accuracy of 71.78% and an F1 score of 68.97% on the MVSA-Multiple dataset, representing improvements of 1.78% and 0.07%, respectively, over the best-performing baseline model. [Limitations] The model's performance was not validated on additional datasets. [Conclusions] The proposed model effectively promotes inter-modal fusion, achieves better fused representations, and enhances sentiment analysis.

  • Sun Wenju, Li Qingyong, Zhang Jing, Wang Danyu, Wang Wen, Geng Yangli’ao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 1-30. https://doi.org/10.11925/infotech.2096-3467.2024.0508
    Abstract (486) PDF (188) HTML (433)

    [Objective] This study comprehensively reviews the advancements in deep incremental learning techniques from the perspective of addressing catastrophic forgetting, aiming to provide references for the research community. [Coverage] Utilizing search terms such as "Incremental Learning", "Continual Learning", and "Catastrophic Forgetting", we retrieved literature from the Web of Science, Google Scholar, DBLP, and CNKI. By reading and organizing the retrieved literature, a total of 105 representative publications were selected. [Methods] The paper begins by defining incremental learning and outlining its problem formulation and inherent challenges. Subsequently, we categorize incremental learning methods into regularization-based, memory-based, and dynamic architecture-based approaches, and review their theoretical underpinnings, advantages, and disadvantages in detail. [Results] We evaluated some classical and recent methods in a unified experimental setting. The experimental results demonstrate that regularization-based methods are efficient in application but cannot fully avoid forgetting; memory-based methods are significantly affected by the number of retained exemplars; and dynamic architecture-based methods effectively prevent forgetting but incur additional computational costs. [Limitations] The scope of this review is limited to deep learning approaches, excluding traditional machine learning techniques. [Conclusions] Under optimal conditions, memory-based and dynamic architecture-based strategies tend to outperform regularization-based approaches. However, the increased complexity of these methods may hinder their practical application. Furthermore, current incremental learning methods show suboptimal performance compared to joint training models, marking a critical direction for future research.
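
    As a concrete instance of the regularization-based family the review covers, the sketch below shows an elastic-weight-consolidation-style penalty in PyTorch; the Fisher importances are placeholders, and the model is a toy.

```python
# Sketch: penalize drift from old-task parameters, weighted by
# (placeholder) Fisher importances, to curb catastrophic forgetting.
import torch
import torch.nn as nn

model = nn.Linear(4, 2)
old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder

def ewc_penalty(model: nn.Module, lam: float = 100.0) -> torch.Tensor:
    loss = torch.zeros(())
    for name, p in model.named_parameters():
        loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return lam * loss

# Added to the new task's objective: total = task_loss + ewc_penalty(model)
print(ewc_penalty(model))
```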

  • Song Donghuan, Hu Maodi, Ding Jielan, Qu Zihao, Chang Zhijun, Qian Li
    Data Analysis and Knowledge Discovery. 2025, 9(2): 12-25. https://doi.org/10.11925/infotech.2096-3467.2023.0885
    Abstract (453) PDF (177) HTML (285)

    [Objective] This study addresses the issue of low classification accuracy in conventional text classification tasks due to factors such as sparse domain-specific training data and significant differences between document types. [Methods] We constructed a novel classification model based on the BERT-DPCNN-MMOE framework, integrating deep pyramid convolutional networks with a multi-gate mixture-of-experts (MMOE) mechanism. Then, we designed multi-task and transfer learning experiments to validate the effectiveness of the new model against eight well-established and innovative models. [Results] This research independently constructed cross-type multi-task data as the basis for training and testing. The BERT-DPCNN-MMOE model outperformed the other eight baseline models in multi-task and transfer learning experiments, with F1 score improvements exceeding 4.7%. [Limitations] Further research is needed to explore the model's adaptability to other domains. [Conclusions] The BERT-DPCNN-MMOE model performs better in multi-task and cross-type text classification tasks and is valuable for future specialized intelligence classification tasks.
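
    A minimal multi-gate mixture-of-experts layer, matching the MMOE idea in the model's name, might look like the PyTorch sketch below; the expert count and sizes are illustrative assumptions.

```python
# Sketch: shared experts with one softmax gate per task, so each task
# mixes the experts differently.
import torch
import torch.nn as nn

class MMOE(nn.Module):
    def __init__(self, dim: int = 128, n_experts: int = 4, n_tasks: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.gates = nn.ModuleList(nn.Linear(dim, n_experts) for _ in range(n_tasks))

    def forward(self, x):                    # x: (batch, dim)
        e = torch.stack([ex(x) for ex in self.experts], dim=1)  # (B, E, D)
        outs = []
        for gate in self.gates:              # one expert mixture per task
            w = gate(x).softmax(dim=-1).unsqueeze(-1)           # (B, E, 1)
            outs.append((w * e).sum(dim=1))                     # (B, D)
        return outs

print([t.shape for t in MMOE()(torch.randn(8, 128))])
```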

  • Wang Zitong, Li Chenliang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 94-105. https://doi.org/10.11925/infotech.2096-3467.2023.1305
    Abstract (430) PDF (239) HTML (182)

    [Objective] To more flexibly capture the spatial-temporal features of traffic flow data and achieve more accurate multivariate traffic flow prediction, this paper proposes a Position-Aware Spatial-Temporal Graph Convolutional Network (PASTGCN). [Methods] First, the traffic data’s spatial and periodic temporal position features are represented as explicit position embeddings. Then, based on the spatiotemporal convolutional structure, we incorporated spatial information into the temporal convolutional network for space-aware sequence modeling. Finally, we used static and dynamic dual graph learning methods to capture spatial dependencies. [Results] We conducted experiments on two real-world traffic flow datasets. The PASTGCN model effectively predicted multivariate traffic flows and reduced errors by up to 1.59% compared to existing deep learning models. [Limitations] The experimental datasets are limited, and the proposed graph learning method increased the time complexity. [Conclusions] The PASTGCN model can effectively utilize spatial-temporal position information to achieve more accurate traffic flow prediction.

  • Li Hui, Pang Jingwei
    Data Analysis and Knowledge Discovery. 2024, 8(11): 11-21. https://doi.org/10.11925/infotech.2096-3467.2023.0744
    Abstract (415) PDF (162) HTML (290)

    [Objective] To effectively utilize information containing audio and video and fully capture the multi-modal interaction among text, image, and audio, this study proposes a multi-modal sentiment analysis model for online users (TIsA) incorporating text, image, and STFT-CNN audio feature extraction. [Methods] First, we separated the video data into audio and image data. Then, we used BERT and BiLSTM to obtain text feature representations and applied the STFT to convert audio time-domain signals to the frequency domain. We also utilized CNNs to extract audio and image features. Finally, we fused the features from the three modalities. [Results] We conducted empirical research using the "9.5 Luding Earthquake" public sentiment data from Sina Weibo. The proposed TIsA model achieved an accuracy, macro-averaged recall, and macro-averaged F1 score of 96.10%, 96.20%, and 96.10%, respectively, outperforming related baseline models. [Limitations] We did not explore the deeper effects of different fusion strategies on sentiment recognition results. [Conclusions] The proposed TIsA model demonstrates high accuracy in processing audio-containing videos, effectively supporting online public opinion analysis.
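
    The STFT step (a time-domain audio signal turned into a spectrogram a CNN can consume) can be illustrated with SciPy; the sample rate, window length, and synthetic tone below are assumptions, not the study's settings.

```python
# Sketch: short-time Fourier transform of a synthetic 440 Hz tone.
import numpy as np
from scipy.signal import stft

fs = 16_000                          # assumed sample rate (Hz)
t = np.arange(fs) / fs
audio = np.sin(2 * np.pi * 440 * t)  # 1-second synthetic tone

freqs, frames, Z = stft(audio, fs=fs, nperseg=512)
spectrogram = np.abs(Z)              # magnitude spectrogram as CNN input
print(spectrogram.shape)             # (257, number of frames)
```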

  • Li Jiawei, Zhang Shunxiang, Li Shuyu, Duan Wenjie, Wang Yuqing, Deng Jinke
    Data Analysis and Knowledge Discovery. 2024, 8(11): 1-10. https://doi.org/10.11925/infotech.2096-3467.2023.1005
    Abstract (400) PDF (192) HTML (298)

    [Objective] This paper proposes a Chinese implicit sentiment analysis model based on text graph representation. It fully utilizes external knowledge and context to enhance implicit sentiment text and achieve word-level semantic interaction. [Methods] First, we modeled the target sentence and context as a text graph with words as nodes. Then, we obtained the semantic expansion of the word nodes in the graph through external knowledge linking. Finally, we used the Graph Attention Network to transfer semantic information between the nodes of this text graph. We also obtained the text graph representation through the Readout function. [Results] We evaluated the model on the publicly available implicit sentiment analysis dataset SMP2019-ECISA. Its F1 score reached 78.8%, at least 1.2% higher than existing models. [Limitations] The size of the generated text graph is related to the length of the text, leading to significant memory and computational overhead for processing long text. [Conclusions] The proposed model uses graph structure to model the relationship between external knowledge, context, and the target sentence at the word level. It effectively represents text semantics and enhances the accuracy of implicit sentiment analysis.

  • Shen Si, Feng Shuyang, Wu Na, Zhao Zhixiao
    Data Analysis and Knowledge Discovery. 2025, 9(9): 37-48. https://doi.org/10.11925/infotech.2096-3467.2024.0670

    [Objective] This paper aims to enhance the utilization efficiency of governmental information resources and advance the intelligent transformation of public services by addressing the inherent knowledge limitations of general LLMs when processing policy texts. We investigate the effectiveness of a RAG framework to construct a more precise and reliable intelligent policy Q&A system. [Methods] This paper proposes a retrieval-augmented generation framework based on the Chinese policy large language model ChpoGPT. Specifically, the framework retrieves semantically similar policy documents from a knowledge base based on user queries and combines the retrieved results with ChpoGPT to enhance the model's capabilities for downstream tasks. [Results] Experimental results demonstrate that our framework significantly outperforms existing models on key metrics. The ChpoGPT-based framework achieved a factuality score of nearly 90%. In terms of answer relevance, it scored 80.2%, outperforming the Gemini-1.0-pro model by 2.1%. Furthermore, it attained an answer semantic similarity score of 56.4%, surpassing the ERNIE 4.0 and Gemini-1.0-pro models by 4.1% and 2.8%, respectively. [Limitations] The language model still exhibits some uncontrollable behaviour in its answer output. [Conclusions] The retrieval-augmented generation of policy texts based on LLMs provides a useful reference for the intelligent transformation of government services, but it still needs further improvement and optimization.

  • Feng Ran, Chen Danlei, Hua Bolin
    Data Analysis and Knowledge Discovery. 2025, 9(5): 19-32. https://doi.org/10.11925/infotech.2096-3467.2024.0533
    Abstract (369) PDF (111) HTML (287)

    [Objective] This paper comprehensively reviews text augmentation methods to reveal their current state of development and trends. [Coverage] Using "textual data augmentation" and "text augmentation" as search terms to retrieve literature from Web of Science, Google Scholar, and CNKI, we screened out a total of 88 representative papers for review. [Methods] Text augmentation methods were categorized and summarized according to the objects of operation, the details of implementation, and the diversity of generated results. On this basis, we thoroughly compared the methods with regard to their granularity, strengths, weaknesses, and applications. [Results] Text augmentation approaches can be divided into text space-based methods and vector space-based methods. The former is intuitive and easily interpretable but may compromise the overall semantic structure of the text, while the latter can directly manipulate semantic features but incurs higher computational complexity. Current studies frequently require external knowledge resources, such as heuristic guidelines and task-specific data. Moreover, introducing deep learning algorithms can enhance the novelty and diversity of generated data. [Limitations] We primarily offer a systematic examination of the technical principles and performance characteristics of advanced methods, without quantitatively assessing the maturity of platform tools. Besides, the analysis is grounded in the selected literature and may not cover all potential application scenarios of text augmentation methods. [Conclusions] Future work should pay more attention to enriching and refining the evaluation metrics for text augmentation techniques and to increasing their robustness across downstream tasks through prompt learning. Retrieval-augmented generation and graph neural networks deserve attention for addressing the challenges posed by lengthy texts and limited resources, which can further unlock the potential of text augmentation methods in natural language processing.
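
    Two classic text space-based operations of the kind surveyed (EDA-style random swap and random deletion) can be sketched in a few lines; the rates and the example sentence are illustrative.

```python
# Toy text-space augmentations: random swap and random deletion.
import random

def random_swap(tokens: list, n: int = 1) -> list:
    tokens = tokens[:]
    for _ in range(n):
        i, j = random.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_delete(tokens: list, p: float = 0.1) -> list:
    kept = [t for t in tokens if random.random() > p]
    return kept or tokens  # never return an empty sentence

sentence = "text augmentation enlarges small training sets".split()
print(random_swap(sentence), random_delete(sentence))
```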

  • Shen Yangtai, Qi Jianglei, Ding Hao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 145-153. https://doi.org/10.11925/infotech.2096-3467.2023.0808
    Abstract (354) PDF (92) HTML (294)

    [Objective] This paper proposes a latent non-negative factorization topic recommendation model based on LDA and transfer learning to improve recommendation accuracy in sparse data scenarios. The new model aims to address the data sparsity issue in publication recommendations. [Methods] We used non-negative matrix factorization to fill the high-dimensional sparse matrix of non-negative data. Then, we constructed a latent topic model based on LDA and non-negative matrix factorization, fully considering the thematic distribution characteristics of user reviews. Additionally, we applied different dimensions of user information to rating prediction to mitigate data sparsity. Finally, we introduced a transfer learning mechanism to extract and transfer model parameters from pre-trained models of related publication categories. This mechanism assisted feature learning on the target model data and improved recommendation effectiveness for less popular publications. [Results] We conducted comparative experiments against three baseline methods on three publication datasets. The proposed model achieved an average precision, F1 score, and NDCG of 0.7732, 0.7085, and 0.7468, respectively, and its overall performance surpasses that of the baseline models. [Limitations] When the number of users in the system is too small, other methods are needed for cold-start situations. [Conclusions] The proposed method has strong generalization capabilities for user interest features, alleviates popularity bias and data sparsity, and effectively improves the accuracy of publication recommendations.
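
    The matrix-filling step can be illustrated with scikit-learn's NMF on a toy rating matrix; treating unobserved entries as zeros is a simplification of the paper's approach, and the data are made up.

```python
# Sketch: approximate a sparse non-negative rating matrix with NMF and
# read predictions off the low-rank reconstruction.
import numpy as np
from sklearn.decomposition import NMF

R = np.array([[5, 0, 3, 0],
              [4, 0, 0, 1],
              [0, 2, 0, 5],
              [1, 0, 4, 0]], dtype=float)  # toy users x publications

nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(R)                   # user latent factors
H = nmf.components_                        # item latent factors
print(np.round(W @ H, 2))                  # reconstructed (filled) matrix
```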

  • Han Yixiao, Ma Jing
    Data Analysis and Knowledge Discovery. 2024, 8(12): 18-29. https://doi.org/10.11925/infotech.2096-3467.2023.0923
    Abstract (329) PDF (136) HTML (218)

    [Objective] In response to the challenges that current multimodal emotion models face in feature fusion, resulting in suboptimal accuracy in emotion classification, we propose the RCHFN multimodal emotion classification model. [Methods] We use the CLIP and Chinese-BERT-wwm models to extract image and text features separately while performing unimodal emotion classification concurrently. Then, we use a residual fusion module consisting of merged residual connections and convolution to fuse image and text features and obtain multimodal emotion classification results. Finally, we pass both the unimodal and multimodal emotion classification results to a fully connected layer and adjust dynamic weights to obtain the final emotion classification result. [Results] The RCHFN model achieved sentiment classification accuracies of 81.25% and 79.21% on the Weibo and Twitter datasets, respectively, with F1 scores of 80.43% and 78.44%. Compared to other models designed for similar tasks on the same datasets, it improved accuracy by 1.79% and 1.79% and F1 scores by 2.39% and 2.62%, respectively. [Limitations] Further experiments are needed to establish the generalisation of this model to different datasets and its performance on additional modalities. [Conclusions] The RCHFN model proposed in this study effectively addresses the challenges of fusing multimodal discourse features and improving classification accuracy in emotion classification.

  • Shi Xi, Chen Wenjie, Hu Zhengyin, Han Tao, Zhang Kai
    Data Analysis and Knowledge Discovery. 2025, 9(3): 1-15. https://doi.org/10.11925/infotech.2096-3467.2024.0176
    Abstract (319) PDF (226) HTML (269)

    [Objective] This study aims to efficiently extract scientific experiment knowledge and data from academic literature. It constructs a Scientific Experiment Knowledge Graph (SEKG) to provide high-quality data support for knowledge discovery. [Methods] We utilized Event Knowledge Graph technology to uniformly represent and model the complexity, temporality, and integration of knowledge and data in scientific experiments, thereby establishing the schema layer of the SEKG. A large language model was employed to improve the efficiency of knowledge extraction in the data layer, with an empirical analysis conducted on organic solar cells. [Results] Using manual annotation and fine-tuned large language models, we constructed a scientific experiment knowledge graph for the field of organic solar cells. This SEKG comprises 34 types of nodes and 9 types of relationships, totaling 24,348 nodes and 123,642 relations. [Limitations] The data sources were limited to papers and patents. The construction of the SEKG required substantial manual input from experts, highlighting the need for efficiency improvements. Furthermore, fine-grained research procedures and validation rules in subfields were not considered. [Conclusions] The proposed method provides high-quality data support for applications such as experimental protocol recommendation, scientific experiment evolution analysis, and AI for Science, effectively supporting various knowledge discovery scenarios.

  • Zhai Dongsheng, Zhai Liang, Liang Guoqiang, Zhao Kai
    Data Analysis and Knowledge Discovery. 2025, 9(2): 120-133. https://doi.org/10.11925/infotech.2096-3467.2023.1277
    Abstract (318) PDF (125) HTML (210)

    [Objective] This study proposes a method for identifying technological evolution paths and explores key technologies and branches in specific domains. It aims to reveal the evolution trajectories of technology. [Methods] Firstly, we devised an unsupervised graph embedding model to integrate patent structural relationships, text and node information propagation, and aggregated knowledge into multi-dimensional semantic vectors. This approach expanded the technological paths while improving community division effectiveness. Secondly, we proposed methods for expanding the main path and derivative paths from the perspective of network topology and semantic correlation. Finally, we constructed a metric for technological junction points to identify the promising fields. [Results] We examined the new method with drone flight control system technology and identified four subfields’ technological evolution paths and branches. We found that pattern recognition, multiprocessor, and data fusion technologies hold promising prospects. [Limitations] Our identification framework does not incorporate the formation mechanism of technological evolution patterns. [Conclusions] The proposed method demonstrates significant advantages in path expansion effectiveness and application versatility.

  • Zhu Xiang, Zhang Yunqiu, Sun Shaodan, Zhang Liman
    Data Analysis and Knowledge Discovery. 2024, 8(12): 125-135. https://doi.org/10.11925/infotech.2096-3467.2023.0869
    Abstract (311) PDF (114) HTML (191)

    [Objective] This paper proposes a drug knowledge discovery method that fuses the meta-path features of a heterogeneous knowledge network to improve the performance of drug knowledge discovery. [Methods] Based on different meta-paths connecting drug and target entities in the heterogeneous knowledge network, the HeteSim algorithm is used to calculate the multi-dimensional semantic similarity of drug-target entity pairs. These meta-path features are fused with drug similarity and target entity similarity features as inputs for machine learning models to achieve drug knowledge discovery. [Results] The drug heterogeneous knowledge network contains 12,015 nodes and 1,895,445 edges. Taking drug-target relation prediction as an example, the 21-dimensional HeteSim features between drugs and targets were calculated. The method achieved the highest AUC values on three machine learning models (XGBoost=0.993, RF=0.990, SVM=0.975). Its accuracy, precision, and F-value are also higher than those of the two comparison methods. A literature search of 20 prediction results found that some of them are supported by evidence in previous literature. [Limitations] Although a PU learning strategy is used to reduce the influence of sample imbalance, some results may still be distorted. [Conclusions] The proposed drug knowledge discovery method is advanced and effective, and it offers theoretical and methodological references.

  • Gao Yuan, Li Chongyang, Qu Boting, Jiao Mengyun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 158-169. https://doi.org/10.11925/infotech.2096-3467.2024.0784
    Abstract (308) PDF (64) HTML (143)

    [Objective] This paper aims to advance research on urban tourism flow network structures and to address the inaccurate point-of-interest recognition and distorted visiting sequences in current travelogue-based methods for tourist journey reconstruction. [Methods] This paper proposes a large language model-based method for reconstructing tourist journeys and explores the structural characteristics of urban tourism flow networks by combining it with social network analysis methods. [Results] The proposed method achieves a precision of 94.00% and a recall of 87.78% in POI recognition, significantly outperforming the statistics-based Conditional Random Fields (CRF) method. The reconstructed journeys show a similarity of 83.81% to the actual journeys. [Limitations] The quality of journey reconstruction depends to some extent on the prompts designed for the large language model. [Conclusions] Taking Xi'an as a case study, the conclusions align with public perception and current research findings, demonstrating the accuracy and versatility of the proposed tourist journey reconstruction method.

  • Liu Yu, Zeng Ziming, Sun Shouqiang
    Data Analysis and Knowledge Discovery. 2025, 9(8): 20-31. https://doi.org/10.11925/infotech.2096-3467.2024.0870

    [Objective] This paper addresses the issues of semantic shift in multi-aspect sentences and implicit sentiment analysis in aspect-based sentiment analysis. To this end, it proposes a model based on sentiment enhancement using large language models and graph convolutional neural networks. [Methods] The model uses prompt learning to guide large language models in generating sentiment-enhanced representations of aspect semantics. It then constructs an aspect-semantic, sentiment-knowledge-enhanced graph. Additionally, the paper presents a sentiment-target position weighting algorithm to filter irrelevant information from the syntactic dependency graph. It also introduces aspect masking and gated filtering mechanisms to fully integrate semantic information and accurately identify the sentiment tendency of each aspect. [Results] On the Restaurant dataset, the proposed model is slightly less accurate than two baseline models but still achieves an F1 score of 81.60%. On the Laptop, Twitter, and MAMS datasets, it improves F1 scores by 1.79, 1.17, and 3.02 percentage points, respectively, over the best baseline model. [Limitations] The role of visual information in aspect-level sentiment analysis is not considered, and experiments are only conducted on English datasets. [Conclusions] By leveraging prompt learning to guide large language models in generating sentiment representation words and combining them with graph neural networks, the model provides an effective and efficient solution for aspect-level sentiment analysis, significantly improving its accuracy on text.

  • Chang Bolin, Yuan Yiguo, Li Bin, Xu Zhixing, Feng Minxuan, Wang Dongbo
    Data Analysis and Knowledge Discovery. 2024, 8(11): 102-113. https://doi.org/10.11925/infotech.2096-3467.2023.0834
    Abstract (300) PDF (130) HTML (170)

    [Objective] This paper proposes an integrated model incorporating radical information to improve the low accuracy and efficiency of existing automatic word segmentation and part-of-speech tagging for Classical Chinese. [Methods] Based on over 70,000 Chinese characters and their radicals, we constructed a radical vector representation model, Radical2Vector. We combined this model with SikuRoBERTa for representing Classical Chinese texts, forming an integrated BiLSTM-CRF model as the main experimental framework. Additionally, we designed a dual-layer scheme for word segmentation and part-of-speech tagging. Finally, we conducted experiments on the Zuo Zhuan dataset. [Results] The model achieved an F1 score of 95.75% for the word segmentation task and 91.65% for the part-of-speech tagging task, representing 8.71% and 13.88% improvements over the baseline model. [Limitations] The approach only incorporates a single radical for each character and does not utilize other components of the characters. [Conclusions] The proposed model successfully integrates radical information, effectively enhancing the performance of textual representation for Classical Chinese. It demonstrates exceptional performance in word segmentation and part-of-speech tagging tasks.

  • Si Binzhou, Sun Haichun, Wu Yue
    Data Analysis and Knowledge Discovery. 2025, 9(7): 38-51. https://doi.org/10.11925/infotech.2096-3467.2024.0287

    [Objective] This study proposes a research framework for risk analysis of telecom fraud based on large language models (LLMs) and event fusion to reveal the process of telecom fraud and identify key risk factors. [Methods] We constructed a two-stage hierarchical prompt instruction specific to the telecom fraud domain and extracted risk events and their arguments from fraud cases. The framework integrates semantic dependency analysis with template-matching techniques to obtain fraud event chains. Considering the diversity in event descriptions, we employed the BERTopic model for sentence vector representation and utilized a clustering algorithm for event fusion. [Results] Our method achieved F1 scores of 67.41% for event extraction and 73.12% for argument extraction in telecom fraud case analysis. Event clustering identified 10 categories of thematic risk events, with "disclosing information" as the highest-risk behavior. [Limitations] The coarse granularity of police report data limits the framework's early warning capabilities. [Conclusions] The proposed approach, combining LLMs with event fusion clustering, enables the automatic construction of fraud event evolution chains, facilitates risk analysis, and supports the early warning and deterrence of telecom fraud.

  • Jin Qingwen, Li Hurong, Zhang Chen
    Data Analysis and Knowledge Discovery. 2024, 8(12): 101-111. https://doi.org/10.11925/infotech.2096-3467.2023.0892
    Abstract (294) PDF (77) HTML (140)

    [Objective] This study explores the application of the LIME algorithm and its evolutions in data storytelling, aiming to leverage the explanatory function of data stories. [Methods] We examined the principles, applications, and evolutionary strategies of the LIME algorithm. Based on this theoretical framework, we constructed a data storytelling process assisted by LIME-related algorithms. We collected a partial cat-and-dog recognition dataset from the Kaggle platform and trained an interpretable model on it. Finally, we applied the new data storytelling model to explain image classification performance. [Results] Using an image of a "tabby cat" as the analysis object, the LIME explanation results and the storytelling development curve indicated that the important features affecting the prediction were the M-shaped stripes, black eyes, and pink nose, with two key superpixels identified. [Limitations] Optimizing feature recognition and automating the generation of data stories remain challenges. [Conclusions] Applying LIME-related algorithms in data storytelling helps transform model predictions and explanation results into interpretable stories, better communicating data analysis outcomes.
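
    The LIME mechanics themselves can be reproduced with the lime package; the sketch below scores superpixels of a random image against a stand-in classifier (it merely reacts to mean redness), so only the procedure, not the study's trained model, is mirrored.

```python
# Sketch: superpixel-level explanation of one image prediction with LIME.
import numpy as np
from lime import lime_image  # pip install lime

def classifier(images: np.ndarray) -> np.ndarray:  # (N, H, W, 3) -> (N, 2)
    red = images[..., 0].mean(axis=(1, 2)) / 255.0
    return np.stack([1 - red, red], axis=1)        # two-class probabilities

image = np.random.randint(0, 255, (64, 64, 3)).astype(np.uint8)
explainer = lime_image.LimeImageExplainer()
exp = explainer.explain_instance(image, classifier,
                                 top_labels=1, num_samples=200)
label = exp.top_labels[0]
_, mask = exp.get_image_and_mask(label, positive_only=True, num_features=2)
print("key superpixel ids:", np.unique(mask))
```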

  • Wu Shuai, Yang Xiuzhang, He Lin, Gong Zuoquan
    Data Analysis and Knowledge Discovery. 2024, 8(12): 136-148. https://doi.org/10.11925/infotech.2096-3467.2023.1002
    Abstract (284) PDF (90) HTML (132)

    [Objective] Combining the complex sentence-structure features of ancient texts, this study develops a higher-accuracy method for identifying entity words in ancient texts to advance digital humanities research. [Methods] Trigger words and relative words were used as key feature words to identify entity words, and a sentence pattern template was designed. Based on the characteristics of ancient texts, a BERT-BiLSTM-MHA-CRF model was constructed, and the fusion of syntactic features with this model enables deep, fine-grained entity recognition in ancient texts. [Results] The F1 score of this method is 0.88 on the conventionally annotated test set, 0.83 on the small-sample annotated test set, and 0.79 (The Book of Songs), 0.81 (Master Lü's Spring and Autumn Annals), and 0.85 (Discourses of the States) on the transfer learning test sets. [Limitations] The syntactic feature templates are designed from single ancient books only. Semantic information mining does not consider character-structure features such as phonetic components and radicals in ancient texts. [Conclusions] In small-sample annotation and transfer learning experiments, this method achieves accurate named entity recognition of ancient texts, providing high-quality corpus data for digital humanities research.

  • Zhu Danhao, Huang Xiaoyu, Li Yaolin, Wang Dongbo
    Data Analysis and Knowledge Discovery. 2025, 9(6): 35-46. https://doi.org/10.11925/infotech.2096-3467.2024.0555
    Abstract (283) PDF (52) HTML (227)

    [Objective] This study uses large language model technology to automatically summarise legal texts, addressing issues associated with traditional methods, such as the inadequate handling of lengthy texts and weak logical coherence in summaries. [Methods] This study proposes a method for automatically summarising legal texts based on fine-tuning large language models for specific domains. Firstly, a legal text summarisation instruction dataset is constructed. Secondly, two data augmentation strategies are explored: instruction augmentation and result augmentation. Finally, the study performs domain-specific fine-tuning on a pre-trained model and conducts a multi-dimensional evaluation of the results. [Results] On the CAIL2020 Judicial Summary Dataset, our method achieves improvements of 13.8, 21.3, and 7.4 percentage points in the ROUGE-1, ROUGE-2, and ROUGE-L F1 scores, respectively, compared to the best baseline methods. Both human and automated evaluations further validate the effectiveness of our approach across multiple dimensions. [Limitations] When processing legal texts dense with technical terms and complex logical structures, the generated summaries still lack detail accuracy and precision with regard to legal provisions. [Conclusions] Fine-tuning large language models for specific domains can effectively improve the quality of legal text summarisation.

  • Teng Fei, Zhang Qi, Qu Jiansheng, Li Haiying, Liu Jiangfeng, Liu Boyu
    Data Analysis and Knowledge Discovery. 2024, 8(11): 33-46. https://doi.org/10.11925/infotech.2096-3467.2023.0767
    Abstract (277) PDF (150) HTML (136)

    [Objective] This study utilizes big data analytics to identify key and core technologies, improving the accuracy of identification results and providing robust data support for future technological innovation and large-scale applications. [Methods] We proposed a key and core technology identification method using the patent competitiveness index and the Doc-LDA topic model, based on the definitions of key and core technology concepts. The method distinguished topics by evaluating topic strength, topic co-occurrence strength, and the effective cohesion constraint coefficient. [Results] Taking new energy vehicles (NEVs) as an empirical research example, a total of 10 key and core technologies were identified: fuel cells, solid-state power batteries, high-efficiency high-density motor drive systems, lightweight plastic and composite materials, cellular communication, electro-mechatronics integration, multi-gear transmissions, vehicle operations, intelligent control, and autonomous driving. Further trend analysis was conducted. [Limitations] Due to the limited granularity of topic refinement, some potential micro-mechanisms have not been fully revealed. [Conclusions] Using the patent competitiveness index and the Doc-LDA topic model provides a comprehensive assessment of the market value and competitive advantage of technologies. The proposed method also enhances the accuracy of technology development trend predictions.

  • Hou Jianhua, Deng Xianjiang, Tang Shiqi
    Data Analysis and Knowledge Discovery. 2025, 9(3): 69-82. https://doi.org/10.11925/infotech.2096-3467.2024.0353
    Abstract (275) PDF (107) HTML (195)

    [Objective] This study aims to explore the influence of interdisciplinary knowledge integration on the emergence of high-value patents and to delineate their distinctive characteristics. [Methods] High-value patents are operationalized as patents that receive the China Patent Gold Award. Interdisciplinary knowledge integration is quantified by two dimensions: IPC classification and patent knowledge units. Regression analysis investigates the effects of interdisciplinary knowledge integration, measured by these two dimensions, on both patent award status and individual patent value dimensions. [Results] The analysis reveals that high-value patents tend to exhibit a narrower interdisciplinary scope in terms of IPC classification, while simultaneously demonstrating a more diverse knowledge structure. In particular, interdisciplinary knowledge integration, when indicated by IPC classification, shows an inverted U-shaped relationship with patent value. Conversely, interdisciplinary knowledge integration, when indicated by knowledge units, shows a negative correlation with patent value. [Limitations] This study is limited by its reliance on the China Patent Gold Award as the sole proxy for high-value patents, which may not fully encompass the multifaceted nature of high-value patent characteristics. [Conclusions] This research provides valuable insights into the proactive identification and protection of high-value patents. Furthermore, the findings inform strategies to enhance upstream patent quality control and to facilitate effective patent translation and commercial utilization.

  • Sun Xinxin, Sun Ya’nan, Zhao Yuxiang, Jiang Bin
    Data Analysis and Knowledge Discovery. 2025, 9(7): 104-117. https://doi.org/10.11925/infotech.2096-3467.2024.0633

    [Objective] This study explores the impact of the voice characteristics of AI medical voice assistants on perceived credibility among older adults, based mainly on the Computers Are Social Actors (CASA) paradigm and the stereotype model. [Methods] This study conducted a 3 (voice gender: female/male/non-binary) × 2 (communication style: expert/partner) between-subjects experiment to explore the impact of the voice gender and communication style of AI medical voice assistants on perceived credibility and intention to use among older adults. Additionally, the study sought to elucidate the mechanism of action on the stereotype dimensions of perceived warmth and perceived professionalism. [Results] The results indicate that older adults perceive male expert-type and female partner-type AI medical voice assistants as more credible. Communication style influenced their credibility perception of voice gender through perceived professionalism, and this perceived credibility positively predicted their behavioral intention to use such assistants. [Limitations] As this study was conducted within the context of China's smart healthcare system development, the generalizability of the findings warrants further validation. [Conclusions] The congruence between vocal characteristics and gender-role stereotypes enhanced older adults' perceived credibility. AI medical voice assistant design should account for the interplay of multiple vocal factors and contextual suitability.
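
    A 3×2 between-subjects design of this kind is typically analyzed with a two-way ANOVA; the sketch below runs one on simulated credibility ratings with statsmodels (synthetic data, assumed rating scale).

```python
# Sketch: two-way ANOVA over voice gender x communication style.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
n = 180
df = pd.DataFrame({
    "voice": rng.choice(["female", "male", "nonbinary"], n),
    "style": rng.choice(["expert", "partner"], n),
})
df["credibility"] = rng.normal(4.0, 1.0, n)  # placeholder ratings

model = ols("credibility ~ C(voice) * C(style)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))       # main effects + interaction
```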

  • Wen Tingxin, Bai Yunhe
    Data Analysis and Knowledge Discovery. 2024, 8(12): 86-100. https://doi.org/10.11925/infotech.2096-3467.2023.0881
    Abstract (258) PDF (106) HTML (137)

    [Objective] This study proposes an interpretable model for the interaction quality of fake news groups based on RF-GA-XGBoost and SHAP. Our model mitigates the negative impacts of fake news by leveraging the interaction quality of social media user groups and accurately identifies the causes and mechanisms of positive interactions. [Methods] First, we retrieved 500 fake news articles and 7,029 comments from the Weibo21 dataset. Then, we assessed the fake news groups’ interaction quality across three dimensions: content, form, and comment sentiment. Third, we extracted fake news text features from these dimensions. Fourth, we used the sequential forward search strategy of random forest to extract the optimal feature subset of fake news text. We constructed a prediction model for group interaction quality based on GA-XGBoost, and compared its performance with other mainstream machine learning algorithms such as LR, SVM, and XGBoost. Finally, the SHAP model provides causal explanations for the impact of important features on the group interaction quality. [Results] Our model’s F1-score and AUC values are over 86%, outperforming the comparison models across six performance metrics. Additionally, features such as the number of content characters, words, and negative sentiment words in fake news text significantly influence the interaction quality of social media groups. [Limitations] This paper does not conduct multi-feature interaction interpretation analysis or explore the early high-quality group interaction patterns based on timestamps. [Conclusions] The proposed model accurately identifies the ways in which different features impact group interaction quality, providing effective decision-making support for social media platforms to improve their operational strategies and functional designs.
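
    The explanation step over a gradient-boosted model can be sketched with the shap package on synthetic features; this shows only the TreeExplainer mechanics, not the paper's RF-GA-XGBoost pipeline, and the feature meanings are assumptions.

```python
# Sketch: SHAP values for an XGBoost classifier on synthetic data.
import numpy as np
import shap      # pip install shap
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))  # stand-ins: char count, word count, negativity
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=500) > 0).astype(int)

model = xgb.XGBClassifier(n_estimators=50, max_depth=3).fit(X, y)
shap_values = shap.TreeExplainer(model).shap_values(X)
print(np.abs(shap_values).mean(axis=0))  # global feature importance
```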

  • Wang Xiaolun, Yao Qian, Lin Jiahui, Zhao Yuxiang, Sun Zhihao, Lin Xinlan
    Data Analysis and Knowledge Discovery. 2025, 9(1): 55-64. https://doi.org/10.11925/infotech.2096-3467.2024.0098
    Abstract (253) PDF (115) HTML (207)

    [Objective] Based on self-determination theory, this study explores the motivations of service providers to participate in tasks on skill crowdsourcing platforms. [Methods] We retrieved 15,641 bids and 2,385 service provider records from the epwk.com platform. We utilized TF-IDF and BERT to analyze text features and calculate motivation variables. Finally, we constructed a negative binomial regression model, as the dependent variables are count variables. [Results] The motivations and behaviors of service providers participating in skill crowdsourcing were significantly correlated at the 1% level (R²=23.10%). Task difficulty improved the model's explanatory power, negatively moderating competence and reputation (p<0.05) while positively moderating social recognition (p<0.01). [Limitations] Representativeness is limited by the single platform; future studies could collect data from multiple platforms for comparative validation. External factors such as platform dynamics and policy environments might interfere with the data and should be considered in future research to deepen the conclusions. [Conclusions] This paper expands the theoretical foundation for service provider participation in crowdsourcing tasks and offers practical insights for service providers, buyers, and platforms.

  • Du Jialin, Wang Xizi, Hu Guangwei
    Data Analysis and Knowledge Discovery. 2024, 8(11): 59-71. https://doi.org/10.11925/infotech.2096-3467.2023.0778
    Abstract (249) PDF (157) HTML (126)

    [Objective] This study investigates the factors influencing public satisfaction with government-citizen interaction platforms. We constructed an analysis model for factors affecting public satisfaction. [Methods] We extracted micro-level variables from the leadership mailbox corpus, which were combined with macroeconomic variables to establish a public satisfaction analysis model using the Gradient Boosting Decision Tree (GBDT) method. We also eliminated less influential variables with SHAP analysis to optimize the model. [Results] The proposed model outperformed comparison models across accuracy, recall, precision, and F1-score. Key features affecting public satisfaction with the leadership mailbox include GDP growth rate, PCDI growth rate, CPI growth rate, message topic, message type, and response mode. [Limitations] The study did not explore a broader range of influencing factors or more extensive government-citizen interaction scenarios. [Conclusions] The new model optimizes the variable selection process and visualizes how each feature influences the level, direction, and manner of public satisfaction with government responses. The model is a data-driven tool for administrative decision-making.

  • Ye Guanghui, Wang Yujie, Lou Peilin, Zhou Xinghua, Liu Shuyan
    Data Analysis and Knowledge Discovery. 2025, 9(5): 62-76. https://doi.org/10.11925/infotech.2096-3467.2024.0507
    Abstract (246) PDF (73) HTML (183)   Knowledge map   Save

    [Objective] Tracking the characteristics of public opinion circulation during emergencies can facilitate effective public opinion guidance, control, and shared governance. [Methods] Using the case study method, we construct a framework for understanding the macroscopic circulation of public opinion in emergencies. Using social network analysis, complemented by empirical research and natural language processing techniques, we conduct an in-depth analysis of the circulation patterns of public opinion from a micro perspective, focusing on the dimensions of subjects, objects, and carriers. Validation analyses are conducted with data from public health emergencies. [Results] From a macro perspective, public opinion circulates across Cyber Space, Physical Space, and Psychological Space, providing an interdisciplinary analytical framework for understanding and quantifying public behaviors and responses. At the micro level, public opinion circulates among multiple groups, media, events, and platforms, exhibiting four paired effects: homogeneous diffusion and heterogeneous traversal, field resonance and field escape, co-temporal and ephemeral effects, and amplified resonance and echo difference. [Limitations] The dynamics of social network sentiment are not considered. [Conclusions] By summarizing the laws of cross-domain circulation of public opinion from both macroscopic and microscopic perspectives and conducting empirical research linked to specific events, we provide new insights into the study of public opinion communication.

  • Zhang Lanze, Gu Yijun, Peng Jingjie
    Data Analysis and Knowledge Discovery. 2025, 9(1): 65-78. https://doi.org/10.11925/infotech.2096-3467.2023.1009
    Abstract (240) PDF (90) HTML (152)   Knowledge map   Save

    [Objective] To enhance the accuracy of graph neural networks in credit fraud detection, this paper introduces topological structure analysis and proposes a graph-based deep fraud detection model (PSI-GNN) integrating prior structural information. [Methods] We embedded attribute information representing the topological structure of central nodes into feature vectors through structural information encoding. We then divided the message-passing process into proximal and distal channels: proximal node information was aggregated with a shallow graph neural network, while distal homophily information was aggregated under the guidance of random-walk structural similarity. Finally, we combined the results of both channels to obtain node embedding representations. [Results] We evaluated the new model on the DGraph-Fin and TFinance datasets, which contain fraudulent behaviors. Compared with nine graph neural network models in related fields, PSI-GNN improved Macro-F1 and AUC by 2.62% and 4.55% on the former dataset and by 4.67% and 2.33% on the latter. [Limitations] Processing node structural information incurs significant time overhead. [Conclusions] By modeling the structural attributes and homophily information of credit networks, we can effectively detect credit fraudsters.
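
    The core idea, two message-passing channels fused into one embedding, can be sketched in plain PyTorch. This is not the authors' released code: the degree-profile similarity used to build the distal graph below is a simple stand-in for the paper's random-walk structural similarity, and the toy graph and dimensions are invented for illustration.

```python
import torch
import torch.nn as nn

def norm_adj(A):
    # Symmetric normalization D^-1/2 (A + I) D^-1/2, as in a shallow GCN layer.
    A_hat = A + torch.eye(A.size(0))
    d = A_hat.sum(1)
    return torch.diag(d.pow(-0.5)) @ A_hat @ torch.diag(d.pow(-0.5))

class TwoChannelLayer(nn.Module):
    """One proximal and one distal aggregation, fused by summation."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.w_prox = nn.Linear(in_dim, out_dim)
        self.w_dist = nn.Linear(in_dim, out_dim)

    def forward(self, X, A_prox, A_dist):
        h_prox = torch.relu(self.w_prox(A_prox @ X))  # proximal neighbours
        h_dist = torch.relu(self.w_dist(A_dist @ X))  # distal, similarity-guided
        return h_prox + h_dist                        # fused node embeddings

# Toy graph: a 6-node ring plus one chord, random features, and the node
# degree appended as a crude "prior structural information" channel.
A = torch.zeros(6, 6)
for i in range(6):
    A[i, (i + 1) % 6] = A[(i + 1) % 6, i] = 1.0
A[0, 3] = A[3, 0] = 1.0
X = torch.cat([torch.randn(6, 4), A.sum(1, keepdim=True)], dim=1)

# Distal graph: connect each node to its most degree-similar peers
# (a stand-in for random-walk structural similarity).
deg = A.sum(1, keepdim=True)
sim = -(deg - deg.T).abs()
A_dist = (sim >= sim.topk(2, dim=1).values[:, -1:]).float() * (1 - torch.eye(6))

layer = TwoChannelLayer(in_dim=5, out_dim=8)
print(layer(X, norm_adj(A), norm_adj(A_dist)).shape)  # torch.Size([6, 8])
```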

  • Cao Kun, Wu Xinnian, Bai Guangzu, Jin Junbao, Zheng Yurong, Li Li
    Data Analysis and Knowledge Discovery. 2025, 9(3): 42-55. https://doi.org/10.11925/infotech.2096-3467.2024.0006
    Abstract (238) PDF (92) HTML (138)   Knowledge map   Save

    [Objective] This study explores methods for identifying key core technologies by integrating the textual content characteristics of “science-technology” with complex network relationships. It supports governments, research institutions, and industries in formulating scientific and technological strategies and conducting innovation activities. [Methods] First, we employed the Sentence-BERTopic model to perform deep semantic fusion and knowledge topic clustering on sentence-level paper and patent text corpora. Then, we constructed a “science-technology” knowledge topic complex network based on the citation relationships of these documents. Third, we improved the traditional PageRank algorithm by incorporating node quality characteristics, time decay factors, the weights of incoming edges, and out-degree, ranking the importance and influence of nodes within the domain. Finally, we identified key core technologies using the head/tail breaks method. [Results] We conducted an empirical study on CNC machine tools and identified 53 key core technologies, including thermal error modeling and compensation, CNC machine tool control technology, and feed systems. A comparison with relevant domestic and international policy plans demonstrates that the identified technologies comprehensively cover the key core technologies in the field. [Limitations] This study lacks an in-depth analysis of citation locations, motivations, behaviors, and purposes, which may affect identification accuracy. [Conclusions] This study reveals the knowledge structure and topological characteristics of science and technology by constructing a “science-technology” complex network and applying the Key Core Rank (KCR) algorithm. The proposed method achieves fine-grained, precise, quantitative identification of key core technologies.
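
    The ranking step can be approximated with networkx by folding node quality and time decay into a PageRank personalization vector, then applying head/tail breaks to cut off the "key" nodes. This is a hedged sketch of the general approach; the paper's exact KCR weighting of in-edge weights and out-degree is not reproduced, and the topic nodes, qualities, and years below are invented.

```python
import math
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("t1", "t2"), ("t1", "t3"), ("t2", "t3"),
                  ("t4", "t3"), ("t3", "t5")])
year = {"t1": 2015, "t2": 2018, "t3": 2021, "t4": 2019, "t5": 2022}
quality = {"t1": 0.4, "t2": 0.7, "t3": 0.9, "t4": 0.5, "t5": 0.6}

# Fold node quality and recency into a personalization vector: newer,
# higher-quality topic nodes receive more random-jump probability mass.
now = 2024
person = {n: quality[n] * math.exp(-0.1 * (now - year[n])) for n in G}
rank = nx.pagerank(G, alpha=0.85, personalization=person)

# Head/tail breaks: keep nodes above the mean score, then recurse on the head.
def head_tail_breaks(scores, max_depth=3):
    head = dict(scores)
    for _ in range(max_depth):
        mean = sum(head.values()) / len(head)
        new_head = {k: v for k, v in head.items() if v > mean}
        if len(new_head) in (0, len(head)):
            break
        head = new_head
    return head

print(sorted(rank.items(), key=lambda kv: -kv[1]))
print("key nodes:", list(head_tail_breaks(rank)))
```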

  • Chen Jing, Cao Zhixun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 1-13. https://doi.org/10.11925/infotech.2096-3467.2024.0446
    Abstract (220) PDF (114) HTML (156)   Knowledge map   Save

    [Objective] This paper analyzes the differences between unstructured knowledge, exemplified by knowledge base resources, and structured knowledge, exemplified by knowledge graph resources, in combating hallucinations in large language models, using the Traditional Chinese Medicine (TCM) Q&A domain as a case study. Based on the findings, it discusses strategies for improving the ability of large language models to combat hallucinations in vertical domains. [Methods] We designed experiments combining external knowledge with prompt engineering to compare the prompting effects of knowledge base resources and knowledge graph resources in the TCM Q&A domain. We also investigated the advantages of dynamic triplet strategies and integrated fine-tuning strategies in optimizing large language models against hallucinations. [Results] Compared with prompts built from unstructured knowledge-base text, prompts built from structured knowledge-graph triples performed better in precision, recall, and F1 score, improving by 1.9%, 2.42%, and 2.2% respectively to reach 71.44%, 60.76%, and 65.31%. Further analysis of the optimization strategies shows that combining the dynamic triplet strategy with fine-tuning had the best anti-hallucination effect, achieving precision, recall, and F1 scores of 72.47%, 65.87%, and 68.62%, respectively. [Limitations] The study was tested only in the TCM Q&A domain, and its generalizability needs to be validated in a wider range of fields. [Conclusions] This study demonstrates that, in the TCM domain, structured knowledge from knowledge graphs outperforms traditional unstructured knowledge in reducing hallucinations and improving the accuracy of model responses, highlighting the critical role of structured knowledge in enhancing model comprehension. Integrating fine-tuning strategies with knowledge resources provides an effective way to improve the performance of large language models. The paper offers theoretical rationale and methodological support for integrating external knowledge into large language models to improve knowledge performance.
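
    The contrast the paper tests, passage-style context versus serialized knowledge-graph triples, reduces to two prompt templates. The sketch below shows one plausible way to assemble them; the templates, the example facts, and the ask() stub are illustrative assumptions, not the paper's actual prompts.

```python
# Two prompt styles: unstructured passage context vs. structured KG triples.
passage = ("Ephedra (ma huang) induces sweating and disperses cold; "
           "it is commonly combined with cinnamon twig for wind-cold.")
triples = [("Ephedra", "function", "induces sweating"),
           ("Ephedra", "indication", "wind-cold exterior syndrome"),
           ("Ephedra", "commonly_paired_with", "Cinnamon twig")]

question = "What is Ephedra used for?"

# Knowledge-base style: raw text dropped into the context window.
kb_prompt = f"Context:\n{passage}\n\nQuestion: {question}\nAnswer:"

# Knowledge-graph style: triples serialized line by line, with an instruction
# to ground the answer strictly in the listed facts.
triple_lines = "\n".join(f"({h}, {r}, {t})" for h, r, t in triples)
kg_prompt = (f"Known facts (subject, relation, object):\n{triple_lines}\n\n"
             f"Answer strictly based on the facts above.\n"
             f"Question: {question}\nAnswer:")

def ask(prompt: str) -> str:
    # Placeholder for a call to the LLM under test (e.g. an API client).
    raise NotImplementedError

print(kg_prompt)
```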

  • Hai Jiali, Wang Run, Yuan Liangzhi, Zhang Kairui, Deng Wenping, Xiao Yong, Zhou Tao, Chang Kai
    Data Analysis and Knowledge Discovery. 2025, 9(7): 165-174. https://doi.org/10.11925/infotech.2096-3467.2024.0747

    [Objective] This paper constructs a retrieval-augmented question-answering (QA) system for Traditional Chinese Medicine (TCM) standards, aiming to provide efficient standards knowledge services and promote the research and application of TCM standardization. [Methods] After comparing the performance of large language models such as BaiChuan, Gemma, and Qwen, we chose GPT-3.5 as the base model. We then combined data optimization with retrieval-augmented generation to develop a QA system with semantic analysis, contextual association, and answer-generation capabilities. [Results] On a TCM literature-based question-generation dataset, the new system achieved answer-relevance precision, recall, and F1 scores of 0.879, 0.839, and 0.857, respectively, as well as contextual-relevance scores of 0.838, 0.869, and 0.853. On a TCM standards QA dataset, the system achieved answer-relevance scores of 0.871, 0.836, and 0.853, all outperforming the baseline models. [Limitations] The system’s intent recognition accuracy still requires improvement, and the scale and granularity of the TCM standards knowledge base need to be expanded and refined. [Conclusions] In response to the practical needs of TCM knowledge services, this study developed a retrieval-augmented QA system for TCM standards. The system effectively answers questions related to clinical guidelines, herbal medicine standards, and information standards, covering topics such as treatment principles, syndrome classification, therapeutic methods, and technical specifications, demonstrating strong practicality and feasibility.
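
    A retrieval-augmented QA loop of the kind described can be sketched as retrieve-then-generate. Below, TF-IDF cosine retrieval stands in for the system's actual retriever and a generate() stub stands in for the GPT-3.5 call; the mini-corpus and prompt wording are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "Clinical guideline: wind-cold common cold is treated by releasing the exterior.",
    "Herbal standard: Ephedra decoction pieces must meet moisture limits.",
    "Information standard: TCM terminology coding rules for clinical records.",
]

vec = TfidfVectorizer().fit(corpus)
doc_mat = vec.transform(corpus)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Rank corpus passages by cosine similarity to the query.
    sims = cosine_similarity(vec.transform([query]), doc_mat).ravel()
    return [corpus[i] for i in sims.argsort()[::-1][:k]]

def generate(prompt: str) -> str:
    # Placeholder for the base LLM call.
    raise NotImplementedError

question = "Which standard governs Ephedra decoction pieces?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the retrieved standards:\n{context}\n\nQ: {question}\nA:"
print(prompt)
```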

  • An Lu, Zheng Yajing
    Data Analysis and Knowledge Discovery. 2024, 8(12): 1-17. https://doi.org/10.11925/infotech.2096-3467.2023.0974
    Abstract (202) PDF (124) HTML (140)   Knowledge map   Save

    [Objective] This study explores the mechanism of social consensus formation in the context of public emergencies. It proposes methods for identifying and measuring consensus and identifies important factors that influence consensus formation, providing theoretical and methodological support for relevant departments to formulate effective information dissemination strategies and guide the evolution of public opinion. [Methods] Taking microblogging data about a barbecue restaurant incident in a city as the data source, we combined topic modeling, sentiment analysis, and triplet extraction to mine users’ opinions. The degree of consensus among individuals was calculated from opinion consistency and emotional consistency. Drawing on information ecology theory, we constructed feature variables along the dimensions of information people, information, and information environment, established a consensus degree prediction model, and compared the performance of four machine learning models. The SHapley Additive exPlanations (SHAP) technique was used to explain the best model. [Results] The CatBoostRegressor model’s MSE (1176.9550) and R-squared (0.6753) were superior to those of the other three models. The top five factors in the feature importance ranking show that the proportion of people with higher education, the age gap, and the number of people with firm views are significantly negatively correlated with the degree of group consensus, while the similarity of social network structure is significantly positively correlated with it. The impact of the feature variables varies across topics. [Limitations] Social consensus includes intragroup and intergroup consensus. This article focuses only on consensus within groups; future research can examine the evolution of viewpoints and consensus formation mechanisms between groups. [Conclusions] This article proposes a method for identifying and measuring social consensus that combines viewpoint consistency and emotional consistency. Using real social media data for viewpoint mining and consensus recognition, it reveals key factors that influence the formation of social consensus.
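
    The paper's distinctive measurement step, combining opinion consistency and emotional consistency into a consensus degree, can be illustrated with a short sketch. The equal 0.5/0.5 weighting, the pairwise-gap consistency formula, and the toy stance/sentiment scores below are assumptions for illustration; the resulting score is the kind of target a CatBoost regressor would then be trained to predict.

```python
import numpy as np

# Per-user opinion stance (-1 oppose .. +1 support) and sentiment (-1 .. +1),
# e.g. derived upstream from topic modelling, triplet extraction, and
# sentiment analysis of each user's posts.
stance = np.array([0.8, 0.6, 0.7, -0.9, 0.5])
sentiment = np.array([0.4, 0.3, 0.5, -0.6, 0.2])

def pairwise_consistency(x):
    # 1 minus the normalized absolute gap, averaged over all user pairs.
    n = len(x)
    gaps = [abs(x[i] - x[j]) / 2 for i in range(n) for j in range(i + 1, n)]
    return 1 - float(np.mean(gaps))

consensus = (0.5 * pairwise_consistency(stance)
             + 0.5 * pairwise_consistency(sentiment))
print(f"group consensus degree: {consensus:.3f}")
```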