Top access

  • Yu Bengong, Xing Yu, Zhang Shuwen
    Data Analysis and Knowledge Discovery. 2024, 8(11): 22-32. https://doi.org/10.11925/infotech.2096-3467.2023.0746
    Abstract (941) PDF (197) HTML (550)

    [Objective] To fully extract features from multiple modalities, align and integrate multimodal features, and design downstream tasks, we propose an aspect-based sentiment analysis model of multimodal collaborative contrastive learning (MCCL-ABSA). [Methods] Firstly, on the text side, we utilized the similarity between aspect words and their encoding within sentences. On the image side, the model used the similarity of images encoded in different sequences after random cropping to construct positive and negative samples required for contrastive learning. Secondly, we designed the loss function for contrastive learning tasks to learn more distinguishable feature representation. Finally, we fully integrated text and image features for multimodal aspect-based sentiment analysis while dynamically fine-tuning the encoder by combining contrastive learning tasks. [Results] On the TWITTER-2015 dataset, our model’s accuracy and F1 scores improved by 0.82% and 2.56%, respectively, compared to the baseline model. On the TWITTER-2017 dataset, the highest accuracy and F1 scores were 0.82% and 0.25% higher than the baseline model. [Limitations] We need to examine the model’s generalization on other datasets. [Conclusions] The MCCL-ABSA model effectively improves feature extraction quality, achieves feature integration with a simple and efficient downstream structure, and enhances the efficacy of multimodal sentiment classification.

  • Wang Zhenyu, Zhu Xuefang, Yang Rui
    Data Analysis and Knowledge Discovery. 2025, 9(1): 90-99. https://doi.org/10.11925/infotech.2096-3467.2023.1273
    Abstract (879) PDF (181) HTML (341)

    [Objective] This paper utilizes large language models (LLMs) to generate high-quality auxiliary knowledge, aiming to improve the performance of multimodal relation extraction. [Methods] We introduced a multimodal similarity detection module to construct multimodal prompt templates, which allow the LLM to integrate visual information and prior knowledge into the generated high-quality auxiliary knowledge. We combined the obtained auxiliary knowledge with the original text and input it into downstream text models to accurately predict entity relationships. [Results] The proposed model outperformed the best baseline model on the MNRE dataset, improving accuracy and F1 score by 4.09% and 7.84%, respectively. [Limitations] We only examined the proposed model on English datasets. [Conclusions] Comparative experiments and case studies validate the model’s effectiveness in multimodal relation extraction. Our new model provides a direction for applying LLMs to multimodal information extraction tasks in the future.

  • Zhang Jing, Gao Zixin, Ding Weijie
    Data Analysis and Knowledge Discovery. 2025, 9(2): 48-58. https://doi.org/10.11925/infotech.2096-3467.2023.1347
    Abstract (621) PDF (109) HTML (169)

    [Objective] This paper proposes a new model to effectively classify massive police reports. [Methods] We constructed a text classification model based on BERT-DPCNN. Then, we used the BERT pre-trained model to generate word vectors. The model improved the classification performance by optimizing the activation function in the DPCNN model and enhancing the dynamic learning rate. [Results] We conducted comparative experiments between BERT-DPCNN and six other models, including BERT, BERT-CNN, BERT-RCNN, BERT-RNN, BERT-LSTM, and ERNIE. The BERT-DPCNN achieved the best accuracy, recall, and precision. In the binary classification tasks, the accuracy of BERT-DPCNN exceeded 98%. In the eleven-category tasks, the model’s accuracy exceeded 82%. [Limitations] The model has many parameters, and the limited number of experiments calls for further testing. [Conclusions] The new model effectively improves the accuracy of police report classification, providing data support for police departments in analyzing and assessing police incidents.

  • Song Mengpeng, Bai Haiyan
    Data Analysis and Knowledge Discovery. 2025, 9(6): 21-34. https://doi.org/10.11925/infotech.2096-3467.2024.0628
    Abstract (579) PDF (58) HTML (323)

    [Objective] This paper aims to automatically generate structured literature reviews with references, helping researchers quickly grasp a specific area of scientific knowledge. [Methods] A corpus was constructed by selecting 70,000 papers from the NSTL platform and identifying moves in the abstracts. The GLM3-6B model was fine-tuned on 3,000 reviews that were generated by a large language model and then revised manually. The corpus was then converted into high-dimensional vectors and stored in an index, from which the vectors were retrieved to serve as an external knowledge base via LangChain. To address the poor retrieval of proper nouns, a hybrid search with BM25 was used and the results were reranked to improve retrieval accuracy. [Results] The literature review generation system, built on the fine-tuning and hybrid retrieval frameworks, improved the BLEU and ROUGE scores by 109.64% and 40.22%, respectively, and the authenticity score in manual evaluation by 62.17%. [Limitations] Due to limited computational resources, the local model has a small number of parameters, and its generation ability needs further improvement. [Conclusions] Retrieval-augmented generation with large language models not only generates high-quality literature reviews but also provides traceable evidence for the generated content and assists researchers in intelligent reading.
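
    The hybrid retrieval step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the BM25 and vector results are combined or reordered, so the reciprocal rank fusion used here, the parameter values (k1, b, k), and the function names are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings (lists of doc ids, best first) into one list.

    Assumption: the paper's reordering step is replaced here by plain RRF.
    """
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in fused.most_common()]


if __name__ == "__main__":
    docs = [["glm3", "fine", "tuning"],
            ["bm25", "hybrid", "retrieval"],
            ["vector", "retrieval", "index"]]
    lexical = bm25_scores(["bm25", "retrieval"], docs)
    lexical_rank = sorted(range(len(docs)), key=lambda i: -lexical[i])
    dense_rank = [1, 0, 2]  # hypothetical ranking from a vector index
    print(reciprocal_rank_fusion([lexical_rank, dense_rank]))
```

    Exact term matching is what lets BM25 recover proper nouns that an embedding model may blur, which is the motivation the abstract gives for the hybrid design.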

  • Rang Yuchen, Ma Jing
    Data Analysis and Knowledge Discovery. 2025, 9(1): 100-109. https://doi.org/10.11925/infotech.2096-3467.2023.1130
    Abstract (457) PDF (104) HTML (325)

    [Objective] To reduce inter-modal differences and strengthen the correlation between modalities, this paper proposes a multimodal alignment sentiment analysis model to accurately capture the sentiment tendencies embedded in multimodal data. [Methods] For the textual modality, the original text data, supplemented with image captions, is processed using the RoBERTa pre-trained model for text feature extraction. For the image modality, we used the CLIP Vision Model to extract image features. The text and image features are aligned through a multimodal alignment layer based on a Multimodal Transformer to obtain enhanced fused features. Finally, the fused multimodal features are fed into a multilayer perceptron for sentiment recognition and classification. [Results] The proposed model achieved an accuracy of 71.78% and an F1 score of 68.97% on the MVSA-Multiple dataset, representing improvements of 1.78% and 0.07%, respectively, over the best-performing baseline model. [Limitations] The model’s performance was not validated using additional datasets. [Conclusions] The proposed model effectively promotes inter-modal fusion, achieves better fusion representations, and enhances sentiment analysis.

  • Sun Wenju, Li Qingyong, Zhang Jing, Wang Danyu, Wang Wen, Geng Yangli’ao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 1-30. https://doi.org/10.11925/infotech.2096-3467.2024.0508
    Abstract (451) PDF (170) HTML (398)

    [Objective] This study comprehensively reviews the advancements in deep incremental learning techniques from the perspective of addressing catastrophic forgetting, aiming to provide references for the research community. [Coverage] Utilizing search terms such as “Incremental Learning”, “Continual Learning”, and “Catastrophic Forgetting”, we retrieved literature from the Web of Science, Google Scholar, DBLP, and CNKI. By reading and organizing the retrieved literature, a total of 105 representative publications were selected. [Methods] The paper begins by defining incremental learning and outlining its problem formulation and inherent challenges. Subsequently, we categorize incremental learning methods into regularization-based, memory-based, and dynamic architecture-based approaches, and review their theoretical underpinnings, advantages and disadvantages in detail. [Results] We evaluated some classical and recent methods in a unified experimental setting. The experimental results demonstrate that regularization-based methods are efficient in application but cannot fully avoid forgetting; memory-based methods are significantly affected by the number of retained exemplars; and dynamic architecture-based methods effectively prevent forgetting but incur additional computational costs. [Limitations] The scope of this review is limited to deep learning approaches, excluding traditional machine learning techniques. [Conclusions] Under optimal conditions, memory-based and dynamic architecture-based strategies tend to outperform regularization-based approaches. However, the increased complexity of these methods may hinder their practical application. Furthermore, current incremental learning methods show suboptimal performance compared to joint training models, marking a critical direction for future research.

  • Chen Ting, Ding Honghao, Zhou Haoyu, Wu Jiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 159-171. https://doi.org/10.11925/infotech.2096-3467.2023.1424
    Abstract (432) PDF (116) HTML (326)

    [Objective] This study explores the impacts of bullet-screen (danmu) content and behavioral characteristics on consumers’ purchasing behavior in live-streaming e-commerce, as well as the moderating effect of host-product relevance. [Methods] First, we retrieved the bullet-screen data from the Douyin platform and the consumer data from the Huitun platform based on the Elaboration Likelihood Model. Then, we studied the impacts of bullet-screen content characteristics (central route) and behavior characteristics (peripheral route) on consumer purchasing behavior with text mining and zero-inflated negative binomial regression. We also discussed the moderating effect of host-product relevance with grouping regression. [Results] Information richness, social interaction degree and number of bullet-screen comments positively impact purchasing behavior. The emotional polarity of bullet-screen comments exhibits an inverted U-shaped effect on purchasing behavior. Compared with live streaming rooms with low host-product relevance, those with high host-product relevance have broader positive impacts on purchasing behavior. [Limitations] We only investigated the bullet-screen data from a single live-streaming e-commerce platform. [Conclusions] This study examines the factors influencing consumers’ actual purchasing behavior from the perspective of bullet-screen comments. It provides insights for improving communication between merchants and consumers in live-streaming e-commerce, ultimately enhancing sales performance.
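
    For readers unfamiliar with the zero-inflated negative binomial (ZINB) model named in this abstract, the sketch below writes out its probability mass function. The parameterization (mean mu, dispersion alpha, zero-inflation probability pi) and the function names are assumptions chosen for illustration; the abstract does not give the authors' specification or covariates.

```python
import math


def nb_log_pmf(y, mu, alpha):
    """Log-probability of count y under a negative binomial.

    Assumed parameterization: mean mu > 0, dispersion alpha > 0,
    so that variance = mu + alpha * mu**2.
    """
    r = 1.0 / alpha
    return (math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
            + r * math.log(r / (r + mu)) + y * math.log(mu / (r + mu)))


def zinb_pmf(y, mu, alpha, pi):
    """Probability of count y when a share pi of observations are structural zeros."""
    nb = math.exp(nb_log_pmf(y, mu, alpha))
    if y == 0:
        # A zero arises either structurally or from the count process.
        return pi + (1 - pi) * nb
    return (1 - pi) * nb


if __name__ == "__main__":
    for y in range(5):
        print(y, zinb_pmf(y, mu=3.0, alpha=0.5, pi=0.3))
```

    A model of this kind suits purchase counts in which many observations record no purchase at all and the non-zero counts are overdispersed relative to a Poisson distribution.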

  • Song Donghuan, Hu Maodi, Ding Jielan, Qu Zihao, Chang Zhijun, Qian Li
    Data Analysis and Knowledge Discovery. 2025, 9(2): 12-25. https://doi.org/10.11925/infotech.2096-3467.2023.0885
    Abstract (428) PDF (171) HTML (252)

    [Objective] This study addresses the issue of low classification accuracy in conventional text classification tasks due to factors such as sparse domain-specific training data and significant differences between types. [Methods] We constructed a novel classification model based on the BERT-DPCNN-MMOE framework, integrating the deep pyramid convolutional networks with the multi-gate control unit mechanism. Then, we designed multi-task and transfer learning experiments to validate the effectiveness of the new model against eight well-established and innovative models. [Results] This research independently constructed cross-type multi-task data as the basis for training and testing. The BERT-DPCNN-MMOE model outperformed the other eight baseline models in multi-task and transfer learning experiments, with F1 score improvements exceeding 4.7%. [Limitations] Further research is needed to explore the model’s adaptability to other domains. [Conclusions] The BERT-DPCNN-MMOE model performs better in multi-task and cross-type text classification tasks. It is of significance for future specialized intelligence classification tasks.

  • Wang Zitong, Li Chenliang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 94-105. https://doi.org/10.11925/infotech.2096-3467.2023.1305
    Abstract (414) PDF (233) HTML (164)

    [Objective] To more flexibly capture the spatial-temporal features of traffic flow data and achieve more accurate multivariate traffic flow prediction, this paper proposes a Position-Aware Spatial-Temporal Graph Convolutional Network (PASTGCN). [Methods] First, the traffic data’s spatial and periodic temporal position features are represented as explicit position embeddings. Then, based on the spatiotemporal convolutional structure, we incorporated spatial information into the temporal convolutional network for space-aware sequence modeling. Finally, we used static and dynamic dual graph learning methods to capture spatial dependencies. [Results] We conducted experiments on two real-world traffic flow datasets. The PASTGCN model effectively predicted multivariate traffic flows and reduced errors by up to 1.59% compared to existing deep learning models. [Limitations] The experimental datasets are limited, and the proposed graph learning method increased the time complexity. [Conclusions] The PASTGCN model can effectively utilize spatial-temporal position information to achieve more accurate traffic flow prediction.

  • Zhao Jiayi, Xu Yuemei, Gu Hanwen
    Data Analysis and Knowledge Discovery. 2024, 8(10): 44-53. https://doi.org/10.11925/infotech.2096-3467.2023.0714
    Abstract (400) PDF (116) HTML (182)

    [Objective] This study addresses the performance degradation due to catastrophic forgetting when multilingual models handle tasks in new languages. [Methods] We proposed a multilingual sentiment analysis model, mLMs-EWC, based on continual learning. The model incorporates continual learning into multilingual models, enabling it to learn new language features while retaining the linguistic characteristics of previously learned languages. [Results] In continual sentiment analysis experiments involving three languages, the mLMs-EWC model outperformed the Multi-BERT model by approximately 5.0% in French and 4.5% in English tasks. Additionally, the mLMs-EWC model was evaluated on a lightweight distilled model, showing an improvement of up to 24.7% in English tasks. [Limitations] This study focuses on three widely used languages, and further validation is needed to assess the model’s generalization capability to other languages. [Conclusions] The proposed model can alleviate catastrophic forgetting in multilingual sentiment analysis tasks and achieve continual learning on multilingual datasets.
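
    The abstract does not expand the "EWC" in mLMs-EWC. Assuming it refers to elastic weight consolidation, a standard continual learning regularizer, the penalty can be sketched as follows. The flat parameter lists, the Fisher values, and the weight lam are hypothetical placeholders, not values from the paper.

```python
def ewc_penalty(params, old_params, fisher, lam=1.0):
    """Quadratic penalty that anchors parameters important to earlier tasks.

    fisher[i] approximates how much parameter i mattered for the
    previously learned languages (diagonal Fisher information).
    """
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher)
    )


def ewc_loss(new_task_loss, params, old_params, fisher, lam=1.0):
    """Total loss: new-language task loss plus the consolidation penalty."""
    return new_task_loss + ewc_penalty(params, old_params, fisher, lam)


if __name__ == "__main__":
    old = [0.0, 2.0]
    new = [1.0, 2.0]
    fisher = [4.0, 10.0]  # hypothetical importance weights
    print(ewc_loss(0.7, new, old, fisher, lam=2.0))
```

    Parameters with large Fisher values are expensive to move, so the model can adapt to a new language mainly through weights that mattered little for the earlier ones.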

  • Li Jiawei, Zhang Shunxiang, Li Shuyu, Duan Wenjie, Wang Yuqing, Deng Jinke
    Data Analysis and Knowledge Discovery. 2024, 8(11): 1-10. https://doi.org/10.11925/infotech.2096-3467.2023.1005
    Abstract (386) PDF (188) HTML (282)

    [Objective] This paper proposes a Chinese implicit sentiment analysis model based on text graph representation. It fully utilizes external knowledge and context to enhance implicit sentiment text and achieve word-level semantic interaction. [Methods] First, we modeled the target sentence and context as a text graph with words as nodes. Then, we obtained the semantic expansion of the word nodes in the graph through external knowledge linking. Finally, we used the Graph Attention Network to transfer semantic information between the nodes of this text graph. We also obtained the text graph representation through the Readout function. [Results] We evaluated the model on the publicly available implicit sentiment analysis dataset SMP2019-ECISA. Its F1 score reached 78.8%, at least 1.2% higher than the existing model. [Limitations] The size of the generated text graph is related to the length of the text, leading to significant memory and computational overhead for processing long text. [Conclusions] The proposed model uses graph structure to model the relationship between external knowledge, context, and the target sentence at the word level. It effectively represents text semantics and enhances the accuracy of implicit sentiment analysis.

  • Li Hui, Pang Jingwei
    Data Analysis and Knowledge Discovery. 2024, 8(11): 11-21. https://doi.org/10.11925/infotech.2096-3467.2023.0744
    Abstract (362) PDF (159) HTML (236)

    [Objective] To effectively utilize information containing audio and video and fully capture the multi-modal interaction among text, image, and audio, this study proposes a multi-modal sentiment analysis model for online users (TIsA) incorporating text, image, and STFT-CNN audio feature extraction. [Methods] First, we separated the video data into audio and image data. Then, we used BERT and BiLSTM to obtain text feature representations and applied STFT to convert audio time-domain signals to the frequency domain. We also utilized CNN to extract audio and image features. Finally, we fused the features from the three modalities. [Results] We conducted empirical research using the “9.5 Luding Earthquake” public sentiment data from Sina Weibo. The proposed TIsA model achieved an accuracy, macro-averaged recall, and macro-averaged F1 score of 96.10%, 96.20%, and 96.10%, respectively, outperforming related baseline models. [Limitations] We have not yet explored in depth how different fusion strategies affect sentiment recognition results. [Conclusions] The proposed TIsA model demonstrates high accuracy in processing audio-containing videos, effectively supporting online public opinion analysis.
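
    The STFT step mentioned in this abstract, which turns a time-domain audio signal into a time-frequency representation that a CNN can consume, can be sketched in plain Python. The frame length, hop size, and Hann window are illustrative assumptions; the abstract does not state the settings used in the paper.

```python
import cmath
import math


def stft(signal, frame_len=8, hop=4):
    """Short-time Fourier transform.

    Returns one list of complex DFT bins (0 .. frame_len // 2) per frame.
    """
    # A Hann window reduces spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len)
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_len)]
        bins = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len))
            for k in range(frame_len // 2 + 1)
        ]
        frames.append(bins)
    return frames


def magnitude_spectrogram(signal, frame_len=8, hop=4):
    """Magnitudes of the STFT: a 2D (time x frequency) array."""
    return [[abs(c) for c in bins] for bins in stft(signal, frame_len, hop)]


if __name__ == "__main__":
    # A sinusoid completing 2 cycles per 8 samples.
    tone = [math.sin(2 * math.pi * 2 * n / 8) for n in range(32)]
    for row in magnitude_spectrogram(tone):
        print([round(v, 3) for v in row])
```

    In practice a library routine such as an FFT-based STFT would replace the direct DFT loop, which is written out here only to keep the sketch dependency-free.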

  • Zhou Zhigang, Dou Luyao, Li Yi, Bai Zengliang
    Data Analysis and Knowledge Discovery. 2024, 8(12): 52-61. https://doi.org/10.11925/infotech.2096-3467.2023.0883
    Abstract (348) PDF (101) HTML (199)

    [Objective] This paper identifies potential high-value patents by deeply mining the feature information embedded in patent texts based on bilateral semantics and text sequence features. [Methods] First, we constructed a mixed patent dataset from the fields of amorphous alloys, industrial robots, and gene chips. Then, we employed the BERT word vector model to achieve contextual semantic association and word meaning interpretation of patent texts. Third, we utilized the BiGRU network to extract global text sequence information while CNN captured local text sequence information. Finally, we predicted potential high-value patents by combining “bilateral semantics+global+local” semantic and sequence features. [Results] The proposed BERT-BiGRU-CNN model outperforms existing models and is more suitable for predicting potential high-value patents on a large data scale. Our new model achieves a prediction accuracy of over 35%, about 4% higher than the existing ones. [Limitations] The relationship and integration mechanism between standard essential and high-value patents have yet to be considered, and the algorithm complexity needs further optimization. [Conclusions] The BERT-BiGRU-CNN model performs better in text classification tasks than the CNN model. Our new model improves the prediction accuracy of potentially high-value patents by capturing global and local text sequence features.

  • Shen Yangtai, Qi Jianglei, Ding Hao
    Data Analysis and Knowledge Discovery. 2025, 9(1): 145-153. https://doi.org/10.11925/infotech.2096-3467.2023.0808
    Abstract (344) PDF (87) HTML (284)

    [Objective] This paper proposes a latent non-negative factorization topic recommendation model based on LDA and transfer learning to improve recommendation accuracy in sparse data scenarios. The new model aims to address the data sparsity issue in publication recommendations. [Methods] We used non-negative matrix factorization to fill the high-dimensional sparse matrix of non-negative data. Then, we constructed a latent topic model based on LDA and non-negative matrix factorization, fully considering the thematic distribution characteristics of user reviews. Additionally, we applied different dimensions of user information to rating prediction to mitigate data sparsity. Finally, we introduced a transfer learning mechanism to extract and transfer model parameters from pre-trained models of related publication categories. This mechanism assisted the feature learning for the target model data and improved the effectiveness of the recommendation for less popular publications. [Results] We conducted comparative experiments against three baseline methods with three publication datasets. The proposed model achieved average precision, F1 score, and NDCG of 0.7732, 0.7085, and 0.7468, respectively. The model’s overall performance surpasses other baseline models. [Limitations] When the number of users in the system is too small, other methods are needed for cold-start situations. [Conclusions] The proposed method has strong generalization capabilities for user interest features, alleviates popularity bias and data sparsity, and effectively improves the accuracy of publication recommendations.
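
    The first step in this abstract, using non-negative matrix factorization to fill a sparse non-negative rating matrix, can be sketched with masked multiplicative updates. This is a generic illustration under assumed settings (rank, iteration count, random initialization); it does not include the LDA topic model or the transfer learning mechanism, and it is not the authors' implementation.

```python
import random


def masked_error(V, mask, P):
    """Squared reconstruction error over observed entries only."""
    return sum(mask[i][j] * (V[i][j] - P[i][j]) ** 2
               for i in range(len(V)) for j in range(len(V[0])))


def masked_nmf(V, mask, rank=2, iters=200, seed=0, eps=1e-9):
    """Factorize V ~ W H with W, H >= 0, fitting only entries where mask == 1.

    Returns W, H and the dense product W H, whose values at masked-out
    positions serve as the filled-in predictions.
    """
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]

    def predict():
        return [[sum(W[i][k] * H[k][j] for k in range(rank))
                 for j in range(m)] for i in range(n)]

    for _ in range(iters):
        P = predict()
        for i in range(n):
            for k in range(rank):
                num = sum(mask[i][j] * V[i][j] * H[k][j] for j in range(m))
                den = sum(mask[i][j] * P[i][j] * H[k][j] for j in range(m)) + eps
                W[i][k] *= num / den
        P = predict()
        for k in range(rank):
            for j in range(m):
                num = sum(mask[i][j] * V[i][j] * W[i][k] for i in range(n))
                den = sum(mask[i][j] * P[i][j] * W[i][k] for i in range(n)) + eps
                H[k][j] *= num / den
    return W, H, predict()


if __name__ == "__main__":
    ratings = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]]
    observed = [[1, 1, 0], [1, 1, 1], [1, 1, 1]]  # entry (0, 2) is missing
    _, _, filled = masked_nmf(ratings, observed)
    print(filled[0][2])
```

    Because the updates are multiplicative and start from positive values, the factors stay non-negative throughout, which is what makes the filled-in entries interpretable as ratings.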

  • Li Ying, Li Ming
    Data Analysis and Knowledge Discovery. 2024, 8(10): 89-99. https://doi.org/10.11925/infotech.2096-3467.2023.0683
    Abstract (342) PDF (122) HTML (236)

    [Objective] This paper proposes a recommendation method for supplementary question-and-answer (Q&A) based on a multi-label, multi-document Q&A classification model enhanced by transfer learning. It aims to identify and recommend supplementary answers in online Q&A communities. [Methods] We introduced new features alongside existing ones to classify the supplementary relationships between questions and answers. Then, we established a transfer learning-enhanced multi-label, multi-document classification model to identify and recommend supplementary answers. [Results] We conducted three meta-tasks on real datasets from the Zhihu community. The proposed method improves precision, recall, and F1 score by 48.29%, 15.75%, and 32.53%, respectively, on average. [Limitations] The method was only applied to health-related Q&A topics in Zhihu and has yet to be validated across different platforms or topics. [Conclusions] The proposed recommendation method effectively recommends supplementary answers. It helps users in Q&A communities obtain more comprehensive answers and promotes knowledge utilization within the community.

  • Cheng Quan, Jiang Shihui, Li Zhuozhuo
    Data Analysis and Knowledge Discovery. 2024, 8(10): 112-124. https://doi.org/10.11925/infotech.2096-3467.2023.0638
    Abstract (321) PDF (94) HTML (164)

    [Objective] This paper aims to achieve semantic discovery and relation extraction from a large amount of complex user-generated information from an online healthcare platform. [Methods] First, we constructed a semantic discovery model for online health information based on an improved CasRel model. Then, we introduced the ERNIE-Health pre-trained model, which is more suitable for the healthcare domain, into the text encoding layer of the CasRel-based model. Finally, we used a multi-level pointer network in the entity and relation decoding layer to annotate and fuse subject features for relations and object decoding via neural networks. [Results] Compared to the original model, the improved CasRel entity-relation extraction model increased the F1-scores of entity recognition and entity-relation extraction tasks for online health information semantic discovery by 7.62% and 4.87%, respectively. [Limitations] The overall effectiveness of the model still needs to be validated with larger datasets and empirical studies on health information from different disease types. [Conclusions] Three sets of comparative experiments validated the effectiveness of the improved CasRel entity-relation extraction model for online diabetes health information semantic discovery tasks.

  • Feng Ran, Chen Danlei, Hua Bolin
    Data Analysis and Knowledge Discovery. 2025, 9(5): 19-32. https://doi.org/10.11925/infotech.2096-3467.2024.0533
    Abstract (317) PDF (105) HTML (229)

    [Objective] This paper comprehensively reviews the methods of text augmentation to reveal their current state of development and trends. [Coverage] Using “textual data augmentation” and “text augmentation” as search terms to retrieve literature from Web of Science, Google Scholar and CNKI, we selected a total of 88 representative papers for review. [Methods] Text augmentation methods were categorized and summarized according to the objects of operation, the details of implementation and the diversity of generated results. On this basis, we conducted a thorough comparison of various methods with regard to their granularity, strengths, weaknesses and applications. [Results] Text augmentation approaches can be divided into text space-based methods and vector space-based methods. The former are intuitive and easily interpretable but may compromise the overall semantic structure of the text, while the latter can directly manipulate semantic features but incur higher computational complexity. Current studies frequently require external knowledge resources, such as heuristic guidelines and task-specific data. Moreover, the introduction of deep learning algorithms can enhance the novelty and diversity of generated data. [Limitations] We primarily offer a systematic examination of the technical principles and performance characteristics of advanced methods, without quantitatively assessing the developmental stage of platform tools. In addition, the analysis is grounded in our selected literature and may not encompass all potential application scenarios of text augmentation methods. [Conclusions] Future work should pay more attention to enriching and refining the evaluation metrics for text augmentation techniques and to increasing their robustness across different downstream tasks through prompt learning. Retrieval-augmented generation and graph neural networks deserve serious attention for addressing the challenges posed by lengthy texts and limited resources, which can further unlock the potential of text augmentation methods in the field of natural language processing.

  • Han Yixiao, Ma Jing
    Data Analysis and Knowledge Discovery. 2024, 8(12): 18-29. https://doi.org/10.11925/infotech.2096-3467.2023.0923
    Abstract (302) PDF (134) HTML (189)

    [Objective] In response to the challenges that current multimodal emotion models face in feature fusion, resulting in suboptimal accuracy in emotion classification, we propose the RCHFN multimodal emotion classification model. [Methods] We use the CLIP and Chinese-BERT-wwm models to extract image and text features separately while performing unimodal emotion classification concurrently. Then, we use a residual fusion module consisting of merged residual connections and convolution to fuse image and text features to obtain multimodal emotion classification results. Finally, we pass both unimodal and multimodal emotion classification results to a fully connected layer and adjust dynamic weights to obtain the final emotion classification result. [Results] The experimental results show that the RCHFN model achieved sentiment classification accuracies of 81.25% and 79.21% on the Weibo dataset and the Twitter datasets, respectively, with F1 scores of 80.43% and 78.44%, respectively. Compared to other models designed for similar tasks on the same dataset, the model showed an increase in accuracy of 1.79% and 1.79%, along with F1 score improvements of 2.39% and 2.62%, respectively. [Limitations] Further experiments are needed to establish the generalisation of this model to different datasets and its performance on additional modalities. [Conclusions] The RCHFN model proposed in this study effectively addresses the challenges of fusing multimodal discourse features and improving classification accuracy in emotion classification.

  • Shi Xi, Chen Wenjie, Hu Zhengyin, Han Tao, Zhang Kai
    Data Analysis and Knowledge Discovery. 2025, 9(3): 1-15. https://doi.org/10.11925/infotech.2096-3467.2024.0176
    Abstract (298) PDF (218) HTML (247)

    [Objective] This study aims to efficiently extract scientific experiment knowledge and data from academic literature. It constructs a Scientific Experiment Knowledge Graph (SEKG) to provide high-quality data support for knowledge discovery. [Methods] We utilized Event Knowledge Graph technology to uniformly represent and model the complexity, temporality, and integration of knowledge and data in scientific experiments, thereby establishing the schema layer of the SEKG. A large language model was employed to enhance the efficiency of knowledge extraction in the data layer, with an empirical analysis conducted on organic solar cells. [Results] By using manual annotation and fine-tuning large language models, we constructed a scientific experiment knowledge graph in the field of organic solar cells. This SEKG comprises 34 types of nodes and 9 types of relationships, totaling 24,348 nodes and 123,642 relations. [Limitations] The data sources were limited to papers and patents. The construction of the SEKG required substantial manual input from experts, highlighting the need for efficiency improvements. Furthermore, fine-grained research procedures and validation rules in subfields were not considered. [Conclusions] The proposed method provides high-quality data support for applications such as experimental protocol recommendations, scientific experiment evolution analysis, and AI for Science, effectively supporting various knowledge discovery scenarios.

  • Zhai Dongsheng, Zhai Liang, Liang Guoqiang, Zhao Kai
    Data Analysis and Knowledge Discovery. 2025, 9(2): 120-133. https://doi.org/10.11925/infotech.2096-3467.2023.1277
    Abstract (288) PDF (115) HTML (176)

    [Objective] This study proposes a method for identifying technological evolution paths and explores key technologies and branches in specific domains. It aims to reveal the evolution trajectories of technology. [Methods] Firstly, we devised an unsupervised graph embedding model to integrate patent structural relationships, text and node information propagation, and aggregated knowledge into multi-dimensional semantic vectors. This approach expanded the technological paths while improving community division effectiveness. Secondly, we proposed methods for expanding the main path and derivative paths from the perspective of network topology and semantic correlation. Finally, we constructed a metric for technological junction points to identify the promising fields. [Results] We examined the new method with drone flight control system technology and identified four subfields’ technological evolution paths and branches. We found that pattern recognition, multiprocessor, and data fusion technologies hold promising prospects. [Limitations] Our identification framework does not incorporate the formation mechanism of technological evolution patterns. [Conclusions] The proposed method demonstrates significant advantages in path expansion effectiveness and application versatility.

  • Zhu Xiang, Zhang Yunqiu, Sun Shaodan, Zhang Liman
    Data Analysis and Knowledge Discovery. 2024, 8(12): 125-135. https://doi.org/10.11925/infotech.2096-3467.2023.0869
    Abstract (284) PDF (111) HTML (168)

    [Objective] This paper proposes a drug knowledge discovery method that fuses meta-path features of a heterogeneous knowledge network to improve the performance of drug knowledge discovery. [Methods] Based on different meta-paths connecting drug and target entities in the heterogeneous knowledge network, the HeteSim algorithm is used to calculate the multi-dimensional semantic similarity of drug-target entity pairs. These meta-path features are fused with drug similarity and target entity similarity features as feature inputs for machine learning models to achieve drug knowledge discovery. [Results] The drug heterogeneous knowledge network contains 12,015 nodes and 1,895,445 edges. Taking drug-target relation prediction as an example, the 21-dimensional HeteSim features between drug and target were calculated. The method achieved the highest AUC on all three machine learning models (XGBoost=0.993, RF=0.990, SVM=0.975). Its accuracy, precision, and F-value are also higher than those of the other two comparison methods. A literature search on 20 prediction results found that some of them are supported by evidence in previous literature. [Limitations] Although a PU learning strategy is used to reduce the influence of sample imbalance, some results will still be distorted. [Conclusions] The proposed drug knowledge discovery method is effective and reasonably advanced, and offers a theoretical and methodological reference.

  • Jin Qingwen, Li Hurong, Zhang Chen
    Data Analysis and Knowledge Discovery. 2024, 8(12): 101-111. https://doi.org/10.11925/infotech.2096-3467.2023.0892
    Abstract (283) PDF (73) HTML (125)

    [Objective] This study explores the application of the LIME algorithm and its evolutions in data storytelling, aiming to leverage the explanatory function of data stories. [Methods] We examined the principles, applications, and evolutionary strategies of the LIME algorithm. Based on this theoretical framework, we constructed a data storytelling process assisted by LIME-related algorithms. We collected a partial dataset for cat and dog recognition from the Kaggle platform, and trained an interpretable model with this dataset. Finally, we applied the new data storytelling model to explain image classification performance. [Results] Using an image of a “tabby cat” as the analysis object, the LIME explanation results and storytelling development curve indicated that the important features affecting the prediction results were the M-shaped stripes, black eyes, and pink nose, and that the number of key superpixels was 2. [Limitations] Optimization of feature recognition and automated generation of data stories remain challenges. [Conclusions] Applying LIME-related algorithms in data storytelling helps transform model predictions and explanation results into interpretable stories, better communicating data analysis outcomes.

  • Gao Yuan, Li Chongyang, Qu Boting, Jiao Mengyun
    Data Analysis and Knowledge Discovery. 2025, 9(4): 158-169. https://doi.org/10.11925/infotech.2096-3467.2024.0784
    Abstract (282) PDF (59) HTML (115)

    [Objective] This paper aims to advance the research on urban tourism flow network structure, and to address the issues of inaccurate point-of-interest (POI) recognition and distorted visiting sequence in current tourist journey reconstruction methods based on travelogue texts. [Methods] This paper proposes a method based on a large language model for reconstructing tourist journeys, and explores the structural characteristics of urban tourism flow networks by combining it with social network analysis methods. [Results] The proposed method for reconstructing tourist journeys achieves a precision of 94.00% and a recall of 87.78% in POI recognition, significantly outperforming the statistics-based Conditional Random Fields (CRF) method. The reconstructed journey shows a similarity of 83.81% to the actual journey. [Limitations] The quality of tourist journey reconstruction depends to some extent on how effective the prompts for the large language model are. [Conclusions] With Xi’an as a case study, the conclusions drawn align with public perception and current research findings, demonstrating the accuracy and versatility of the proposed tourist journey reconstruction method.
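
The precision and recall figures reported for POI recognition above follow the usual set-based definitions. The sketch below shows how such scores could be computed for one travelogue. It is only an illustration: the function name `precision_recall` and the example place names are hypothetical, and the abstract does not describe the authors' actual evaluation procedure.

```python
def precision_recall(predicted, actual):
    """Set-based precision and recall for recognized POIs.

    predicted: POI names extracted from a travelogue (hypothetical input)
    actual:    POI names in the ground-truth journey (hypothetical input)
    """
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Illustrative data only; not taken from the paper.
extracted = ["Bell Tower", "City Wall", "Muslim Quarter"]
ground_truth = ["Bell Tower", "City Wall", "Terracotta Army", "Big Wild Goose Pagoda"]
p, r = precision_recall(extracted, ground_truth)
```

In this toy case two of the three extracted names are correct and two of the four true POIs are found, so precision is 2/3 and recall is 1/2.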

  • Shi Bin, Wang Hao, Liu Maolin, Deng Sanhong
    Data Analysis and Knowledge Discovery. 2024, 8(10): 146-158. https://doi.org/10.11925/infotech.2096-3467.2023.0688
    Abstract (281) PDF (117) HTML (193)

    [Objective] This study aims to construct a Chinese Ceramic Image Description Model (CCI-ClipCap) to provide technical support for ceramic culture research and digital preservation. [Methods] Based on ClipCap, the prompt paradigm is introduced to improve the model’s understanding of cross-modal data, enabling automatic description of ceramic images. Additionally, we proposed a text similarity evaluation method tailored for structured textual representation. [Results] The CCI-ClipCap model improved the multi-modal fusion process with the prompt paradigm, effectively extracting information from ceramic images and generating accurate textual descriptions. Compared to baseline models, the BLEU and ROUGE values increased by 0.04 and 0.14, respectively. [Limitations] The data used originated from the British Museum collections, not native Chinese datasets. This single-source data may affect the model’s performance. [Conclusions] The CCI-ClipCap model generates text with rich levels of expression, demonstrating a solid understanding of ceramic knowledge and exhibiting high professionalism.

  • Chang Bolin, Yuan Yiguo, Li Bin, Xu Zhixing, Feng Minxuan, Wang Dongbo
    Data Analysis and Knowledge Discovery. 2024, 8(11): 102-113. https://doi.org/10.11925/infotech.2096-3467.2023.0834
    Abstract (272) PDF (127) HTML (146)

    [Objective] This paper proposes an integrated model incorporating radical information to improve the low accuracy and efficiency of existing automatic word segmentation and part-of-speech tagging for Classical Chinese. [Methods] Based on over 70,000 Chinese characters and their radicals, we constructed a radical vector representation model, Radical2Vector. We combined this model with SikuRoBERTa for representing Classical Chinese texts, forming an integrated BiLSTM-CRF model as the main experimental framework. Additionally, we designed a dual-layer scheme for word segmentation and part-of-speech tagging. Finally, we conducted experiments on the Zuo Zhuan dataset. [Results] The model achieved an F1 score of 95.75% for the word segmentation task and 91.65% for the part-of-speech tagging task. These scores represent 8.71% and 13.88% improvements over the baseline model. [Limitations] The approach only incorporates a single radical for each character and does not utilize other components of the characters. [Conclusions] The proposed model successfully integrates radical information, effectively enhancing the performance of textual representation for Classical Chinese. This model demonstrates exceptional performance in word segmentation and part-of-speech tagging tasks.

  • Teng Fei, Zhang Qi, Qu Jiansheng, Li Haiying, Liu Jiangfeng, Liu Boyu
    Data Analysis and Knowledge Discovery. 2024, 8(11): 33-46. https://doi.org/10.11925/infotech.2096-3467.2023.0767
    Abstract (268) PDF (145) HTML (124)

    [Objective] This study utilizes big data analytics to identify key and core technologies, improving the accuracy of identification results and providing robust data support for future technological innovation and large-scale applications. [Methods] We proposed a key and core technology identification method using the patent competitiveness index and Doc-LDA topic model based on the definitions of key and core technology concepts. The method distinguished topics by evaluating their strength, topic co-occurrence strength, and effective cohesion constraint coefficient. [Results] Taking new energy vehicles (EVs) as an empirical research example, a total of 10 key and core technologies were identified: fuel cells, solid-state power batteries, high-efficiency high-density motor drive system, lightweight plastic and composite materials, cellular communication, electro-mechatronics integration, multi-gear transmission, vehicle operations, intelligent control, and autonomous driving. Further trend analysis was conducted. [Limitations] Due to the limited granularity of topic refinement, some potential micro-mechanisms have not been fully revealed. [Conclusions] Using the patent competitiveness index and the Doc-LDA topic model provides a comprehensive assessment of the market value and competitive advantage of technologies. The proposed method also enhances the accuracy of technology development trend predictions.

  • Wu Shuai, Yang Xiuzhang, He Lin, Gong Zuoquan
    Data Analysis and Knowledge Discovery. 2024, 8(12): 136-148. https://doi.org/10.11925/infotech.2096-3467.2023.1002
    Abstract (266) PDF (82) HTML (114)

    [Objective] This study develops a more accurate method for identifying entity words in ancient texts by exploiting their complex sentence structure features, with the aim of furthering digital humanities research. [Methods] Trigger words and relative words were used as key feature words to identify entity words, and a sentence pattern template was designed. Based on the characteristics of ancient texts, a Bert-BiLSTM-MHA-CRF model was constructed. Syntactic features were fused with the Bert-BiLSTM-MHA-CRF model to achieve deep and fine-grained entity recognition of ancient texts. [Results] The F1 score of this method is 0.88 on the conventional annotated test data set and 0.83 on the small-sample annotated test data set. On the transfer learning test data sets, it is 0.79 (The Book of Songs), 0.81 (Master Lü’s Spring and Autumn Annals), and 0.85 (Discourses of the States). [Limitations] In the design of syntactic feature templates, only single ancient books are used as feature templates. Semantic information mining does not take into account the structural features of characters such as phonetic symbols and radicals in ancient texts. [Conclusions] In small-sample annotation and transfer learning experiments, this method can also achieve accurate named entity recognition of ancient texts, providing high-quality corpus data for digital humanities research.

  • Chen Wanzhi, Hou Yue
    Data Analysis and Knowledge Discovery. 2025, 9(7): 52-65. https://doi.org/10.11925/infotech.2096-3467.2024.0720

    [Objective] To address the issues in multimodal sentiment analysis, such as insufficient multimodal feature extraction, semantic differences between modalities, and lack of interaction, we propose a temporal multimodal sentiment analysis model that integrates multi-level attention and sentiment scale vectors. [Methods] Firstly, we introduced a scalar Long Short-Term Memory network with a multi-head attention mechanism to construct a deep temporal feature modeling network for extracting rich contextual temporal features from text, audio, and visual modalities. Secondly, we employed the text-guided dual-layer cross-modal attention mechanism and the improved self-attention mechanism to facilitate the deep information exchange across modalities, thereby generating two sentiment scale vectors for sentiment intensity and polarity. Finally, the L1 norm of the sentiment intensity vector was multiplied by the normalized sentiment polarity vector to obtain a comprehensive representation of sentiment strength and polarity, thereby enabling accurate sentiment prediction. [Results] Experiments on the CMU-MOSI dataset show that the proposed model achieves good results in both comparative and ablation experiments, outperforming the next-best model by 1.2 and 2.3 percentage points on the Acc7 and Corr metrics, respectively. On the CMU-MOSEI dataset, the proposed model surpasses baseline models across all evaluation metrics, achieving 86.0% in Acc2 and 86.1% in F1 score. [Limitations] Sentiment expression is highly context-dependent, and the sources of sentiment cues may vary across different scenarios. The proposed model may perform poorly when textual information is insufficient. [Conclusions] The proposed model effectively extracts contextual temporal features from various modalities and leverages the rich emotional information in the text modality for deep inter-modal interaction, thereby enhancing the accuracy of sentiment prediction.
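
The final fusion step described in the [Methods] above (the L1 norm of the intensity vector multiplied by the normalized polarity vector) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `combine_sentiment` is hypothetical, and because the abstract does not say which normalization is applied to the polarity vector, L2 normalization is assumed here.

```python
import math


def combine_sentiment(intensity, polarity):
    """Scale a unit-length polarity vector by the L1 norm of the intensity vector.

    intensity: sentiment intensity vector (list of floats)
    polarity:  sentiment polarity vector (list of floats)
    Assumption: "normalized" is taken to mean L2-normalized.
    """
    strength = sum(abs(x) for x in intensity)            # L1 norm of intensity
    length = math.sqrt(sum(p * p for p in polarity))     # L2 norm of polarity
    if length == 0.0:
        # A zero polarity vector carries no direction; return zeros.
        return [0.0] * len(polarity)
    return [strength * p / length for p in polarity]


# Illustrative values only; not taken from the paper.
fused = combine_sentiment([1.0, -2.0, 3.0], [3.0, 4.0])
```

With these toy inputs the intensity has L1 norm 6 and the polarity has length 5, so the fused vector points in the polarity direction with magnitude 6.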

  • He Jun, Yu Jianjun, Rong Xiaohui
    Data Analysis and Knowledge Discovery. 2024, 8(10): 136-145. https://doi.org/10.11925/infotech.2096-3467.2023.0645
    Abstract (250) PDF (300) HTML (143)

    [Objective] This paper aims to ensure the objectivity, timeliness, and accuracy of the overall budget performance evaluation of research institutions, and to improve the efficiency of performance evaluation work. [Methods] We proposed a method for predicting research institutions’ overall budget performance evaluation based on LightGBM. Our method integrates various data from scientific research management information systems. It uses machine learning algorithms to analyze and predict the overall budget performance evaluation results by correlating research inputs and outputs with performance. [Results] In the application of the overall budget performance evaluation of research institutions, the accuracy of the proposed method reached 94.12%. The human resources required for the budget performance evaluation process were reduced from 10 people to 5, and the time cost was shortened from 38 days to about 10 days. [Limitations] Some performance evaluation indicators are subjective and difficult to quantify using business data from scientific research management information systems. [Conclusions] The proposed method has excellent performance in predicting overall budget performance evaluation results. It reduces fairness issues caused by subjective evaluation and saves human resources and time in budget performance evaluation, thus improving efficiency.

  • Yu Bengong, Cao Chengwei
    Data Analysis and Knowledge Discovery. 2024, 8(10): 54-65. https://doi.org/10.11925/infotech.2096-3467.2023.0722
    Abstract (249) PDF (108) HTML (102)

    [Objective] This paper aims to address the problem in current aspect-based sentiment analysis research, where the use of sentiment knowledge to enhance syntactic dependency graphs overlooks syntactic reachability and positional relationships between words and does not adequately extract semantic information. [Methods] We proposed an aspect-based sentiment analysis model based on a position-weighted reachability matrix and multi-space semantic information extraction. First, we used a reachability matrix to incorporate syntactic reachability relationships between words into the syntactic dependency graph, and we employed position-weighting to adjust the matrix to enhance contextual feature extraction. Then, we integrated the sentiment features with the enhanced dependency graph to extract aspect word features. Third, we used the multi-head self-attention mechanism combined with a graph convolutional network (GCN) to learn contextual semantic information from multiple feature spaces. Finally, we fused feature vectors containing positional information, syntactic information, affective knowledge, and semantic information for sentiment polarity classification. [Results] Compared to the best-performing models, the proposed model improved accuracy on the Lap14, Rest14, and Rest15 datasets by 1.00%, 1.25%, and 0.76%. When using BERT, the PRM-GCN-BERT model’s accuracy on the Lap14, Rest14, Rest15, and Rest16 datasets increased by 0.50%, 0.22%, 1.98%, and 0.31%. [Limitations] The proposed model was not applied to Chinese or other language datasets. [Conclusions] The proposed model enhances feature aggregation in graph convolutional networks, improves contextual feature extraction, and boosts semantic learning effectiveness, thereby significantly improving the accuracy of aspect-based sentiment analysis.

  • Xu Haoshuai, Hong Liang, Hou Wenjun
    Data Analysis and Knowledge Discovery. 2024, 8(10): 66-76. https://doi.org/10.11925/infotech.2096-3467.2023.0973
    Abstract (242) PDF (117) HTML (120)

    [Objective] This paper addresses the challenge of constructing label mapping in prompt learning-based relation extraction methods when labeled data is scarce. [Methods] The proposed approach enhances prompt effectiveness by injecting relational semantics into the prompt template. Data augmentation is performed through prompt ensemble, and an instance-level attention mechanism is used to extract important features during the prototype construction process. [Results] On the public FewRel dataset, the accuracy of the proposed method surpasses the baseline model by 2.13%, 0.55%, 1.40%, and 2.91% in four few-shot test scenarios, respectively. [Limitations] The method does not utilize learnable virtual prompt templates in constructing prompt templates, and there is still room for improvement in the representation of answer words. [Conclusions] The proposed method effectively mitigates the problem of limited information and insufficient accuracy in prototype construction under few-shot scenarios, improving the model’s accuracy in few-shot relation extraction tasks.

  • Wen Tingxin, Bai Yunhe
    Data Analysis and Knowledge Discovery. 2024, 8(12): 86-100. https://doi.org/10.11925/infotech.2096-3467.2023.0881
    Abstract (242) PDF (101) HTML (122)

    [Objective] This study proposes an interpretable model for the interaction quality of fake news groups based on RF-GA-XGBoost and SHAP. Our model mitigates the negative impacts of fake news by leveraging the interaction quality of social media user groups and accurately identifies the causes and mechanisms of positive interactions. [Methods] First, we retrieved 500 fake news articles and 7,029 comments from the Weibo21 dataset. Then, we assessed the fake news groups’ interaction quality across three dimensions: content, form, and comment sentiment. Third, we extracted fake news text features from these dimensions. Fourth, we used the sequential forward search strategy of random forest to extract the optimal feature subset of fake news text. We constructed a prediction model for group interaction quality based on GA-XGBoost, and compared its performance with other mainstream machine learning algorithms such as LR, SVM, and XGBoost. Finally, the SHAP model provides causal explanations for the impact of important features on the group interaction quality. [Results] Our model’s F1-score and AUC values are over 86%, outperforming the comparison models across six performance metrics. Additionally, features such as the number of content characters, words, and negative sentiment words in fake news text significantly influence the interaction quality of social media groups. [Limitations] This paper does not conduct multi-feature interaction interpretation analysis or explore the early high-quality group interaction patterns based on timestamps. [Conclusions] The proposed model accurately identifies the ways in which different features impact group interaction quality, providing effective decision-making support for social media platforms to improve their operational strategies and functional designs.

  • Du Jialin, Wang Xizi, Hu Guangwei
    Data Analysis and Knowledge Discovery. 2024, 8(11): 59-71. https://doi.org/10.11925/infotech.2096-3467.2023.0778
    Abstract (241) PDF (154) HTML (116)

    [Objective] This study investigates the factors influencing public satisfaction with government-citizen interaction platforms. We constructed an analysis model for factors affecting public satisfaction. [Methods] We extracted micro-level variables from the leadership mailbox corpus, which were combined with macroeconomic variables to establish a public satisfaction analysis model using the Gradient Boosting Decision Tree (GBDT) method. We also eliminated less influential variables with SHAP analysis to optimize the model. [Results] The proposed model outperformed comparison models across accuracy, recall, precision, and F1-score. Key features affecting public satisfaction with the leadership mailbox include GDP growth rate, PCDI growth rate, CPI growth rate, message topic, message type, and response mode. [Limitations] The study did not explore a broader range of influencing factors or more extensive government-citizen interaction scenarios. [Conclusions] The new model optimizes the variable selection process and visualizes how each feature influences the level, direction, and manner of public satisfaction with government responses. The model is a data-driven tool for administrative decision-making.

  • Hou Jianhua, Deng Xianjiang, Tang Shiqi
    Data Analysis and Knowledge Discovery. 2025, 9(3): 69-82. https://doi.org/10.11925/infotech.2096-3467.2024.0353
    Abstract (241) PDF (91) HTML (160)

    [Objective] This study aims to explore the influence of interdisciplinary knowledge integration on the emergence of high-value patents and to delineate their distinctive characteristics. [Methods] High-value patents are operationalized as patents that receive the China Patent Gold Award. Interdisciplinary knowledge integration is quantified by two dimensions: IPC classification and patent knowledge units. Regression analysis investigates the effects of interdisciplinary knowledge integration, measured by these two dimensions, on both patent award status and individual patent value dimensions. [Results] The analysis reveals that high-value patents tend to exhibit a narrower interdisciplinary scope in terms of IPC classification, while simultaneously demonstrating a more diverse knowledge structure. In particular, interdisciplinary knowledge integration, when indicated by IPC classification, shows an inverted U-shaped relationship with patent value. Conversely, interdisciplinary knowledge integration, when indicated by knowledge units, shows a negative correlation with patent value. [Limitations] This study is limited by its reliance on the China Patent Gold Award as the sole proxy for high-value patents, which may not fully encompass the multifaceted nature of high-value patent characteristics. [Conclusions] This research provides valuable insights into the proactive identification and protection of high-value patents. Furthermore, the findings inform strategies to enhance upstream patent quality control and to facilitate effective patent translation and commercial utilization.

  • Zhu Yujing, Chen Fang, Wang Xuezhao
    Data Analysis and Knowledge Discovery. 2024, 8(10): 1-13. https://doi.org/10.11925/infotech.2096-3467.2023.0699
    Abstract (239) PDF (135) HTML (178)

    [Objective] In response to Western technology export controls on China, this study proposes a method for identifying critical core technologies by mapping the U.S. Commerce Control List (CCL) to a patent-based dual-layer network. The goal is to provide a reference for selecting and prioritizing technology breakthrough directions. [Methods] The study integrates the CCL and patent data to build a dual-layer network consisting of a CCL-related network and a weighted patent citation network. We used a community detection algorithm to identify technology clusters in both layers and calculated the semantic similarity of inter-layer clusters to achieve automatic mapping. Using Word2Vec and the n-gram method, we extracted keywords from each cluster to represent technical topics. Finally, we identified the patent clusters with the highest similarity to the CCL clusters as critical core technologies. [Results] Empirical results in industrial software demonstrate that this method identifies 12 distinct patent clusters with the highest similarity to the CCL clusters, all of which have a similarity of over 0.85. They involve integrated circuit IP cores, precision measurement, process control, motion control, and turbine detection. Literature research has verified them as key core technologies in industrial software. [Limitations] The study only focused on industrial software for empirical research. The technical approach can be improved, and the identification results require further interpretation and analysis. [Conclusions] The proposed method efficiently and accurately identifies key core technology at a micro-level, features a high degree of automation, and is highly readable, providing significant practical application value.
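
The inter-layer mapping step described in the [Methods] above, which pairs each CCL cluster with its most similar patent cluster, might look like the sketch below. Several things here are assumptions: the abstract does not specify the similarity measure, so cosine similarity over cluster vectors is used; the names `cosine` and `map_clusters` and the `threshold` parameter are hypothetical; and the 0.85 default only echoes the similarity level reported in the [Results], which the abstract does not present as a cutoff.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def map_clusters(ccl_vectors, patent_vectors, threshold=0.85):
    """Map each CCL cluster to its most similar patent cluster.

    ccl_vectors, patent_vectors: dicts of cluster id -> semantic vector
    threshold: hypothetical minimum similarity for keeping a mapping
    """
    mapping = {}
    for ccl_id, ccl_vec in ccl_vectors.items():
        best_id, best_sim = None, -1.0
        for patent_id, patent_vec in patent_vectors.items():
            sim = cosine(ccl_vec, patent_vec)
            if sim > best_sim:
                best_id, best_sim = patent_id, sim
        if best_id is not None and best_sim >= threshold:
            mapping[ccl_id] = (best_id, best_sim)
    return mapping


# Toy two-dimensional vectors; real cluster vectors would come from text embeddings.
ccl = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}
patents = {"p1": [0.9, 0.1], "p2": [0.5, 0.5]}
result = map_clusters(ccl, patents)
```

In this toy example only `c1` is mapped (to `p1`), because the best match for `c2` falls below the assumed threshold.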

  • Si Binzhou, Sun Haichun, Wu Yue
    Data Analysis and Knowledge Discovery. 2025, 9(7): 38-51. https://doi.org/10.11925/infotech.2096-3467.2024.0287

    [Objective] This study proposes a research framework for risk analysis of telecom fraud based on large language models (LLMs) and event fusion to reveal the process of telecom fraud and identify key risk factors. [Methods] We constructed a two-stage hierarchical prompt instruction specific to the telecom fraud domain and extracted risk events and their arguments from fraud cases. The framework integrates semantic dependency analysis with template-matching techniques to obtain the fraud event chains. Considering the diversity in event descriptions, we employed the BERTopic model for sentence vector representation and utilized a clustering algorithm for event fusion. [Results] Our method achieved F1-scores of 67.41% for event extraction and 73.12% for argument extraction in telecom fraud case analysis. Event clustering identified 10 categories of thematic risk events, with “disclosing information” as the highest-risk behavior. [Limitations] The coarse granularity of police report data limits the framework’s early warning capabilities. [Conclusions] The proposed approach, combining LLMs with event fusion clustering, enables the automatic construction of fraud event evolution chains, facilitates risk analysis, and supports the early warning and deterrence of telecom frauds.

  • Zhang Jinzhu, Sun Wenwen, Qiu Mengmeng
    Data Analysis and Knowledge Discovery. 2024, 8(10): 14-27. https://doi.org/10.11925/infotech.2096-3467.2023.0724
    Abstract (226) PDF (119) HTML (112)

    [Objective] This study aims to expand the heterogeneous network in citation recommendations by including more nodes and relationships. It seeks to provide deep semantic representations and reveal how different relationships impact citation recommendations, ultimately improving the effectiveness of such recommendations. [Methods] By introducing semantic links, we constructed a heterogeneous network representation learning model incorporating an attention mechanism. This model generates deep semantic and structural representations, as well as similarity metrics for citation recommendations. We also conducted ablation experiments to explore the impact of different factors on citation recommendation. [Results] After introducing semantic links, the citation recommendation model’s AUC improved by 0.012. With the addition of a dual-layer attention mechanism, there was a further improvement of 0.079 in AUC. Compared to the baseline model CR-HBNE, the AUC and AP improved by 0.185 and 0.204, respectively. [Limitations] Manual selection of relationship paths is inefficient, and evaluating the recommendation results based on only two metrics is relatively simplistic. [Conclusions] The proposed method fully utilizes the complex associations and deep semantic information among citations, effectively improving citation recommendation performance.

  • Cao Kun, Wu Xinnian, Bai Guangzu, Jin Junbao, Zheng Yurong, Li Li
    Data Analysis and Knowledge Discovery. 2025, 9(3): 42-55. https://doi.org/10.11925/infotech.2096-3467.2024.0006
    Abstract (216) PDF (84) HTML (120)

    [Objective] This study explores methods for identifying key core technologies by integrating the textual content characteristics of “science-technology” and complex network relationships. It supports governments, research institutions, and industries in formulating scientific and technological strategies and conducting innovation activities. [Methods] First, we employed the Sentence-BERTopic model to perform deep semantic fusion and knowledge topic clustering on sentence-level paper and patent text corpora. Then, we constructed a “science-technology” knowledge topic complex network based on the citation relationships of these documents. Third, we improved the traditional PageRank algorithm by incorporating node quality characteristics, time decay factors, the weights of incoming node edges, and outdegree. This approach ranked the importance and influence of nodes within the domain. Finally, we identified key core technologies using the head/tail break method. [Results] We conducted an empirical study on CNC machine tools and identified 53 key core technologies, including thermal error modeling and compensation, CNC machine tools control technology, and feed systems. A comparison with relevant domestic and international policy plans demonstrates that the identified technologies comprehensively encompass the key core technologies in the field. [Limitations] This study lacks an in-depth analysis of citation locations, motivations, behaviors, and purposes, which may affect identification accuracy. [Conclusions] This study reveals the knowledge structure and topological characteristics of science and technology by constructing a “science-technology” complex network and applying the Key Core Rank (KCR) algorithm. The proposed method achieves fine-grained and precise quantitative identification of key core technologies.
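
The head/tail breaks step named in the [Methods] above is a general scheme for splitting heavy-tailed values around their mean, repeatedly, so that a small "head" of top-ranked items separates from a long "tail". The sketch below illustrates the idea on a list of node importance scores. It is not the KCR algorithm itself: the function name `head_tail_breaks` and the `head_limit` parameter are hypothetical, and the 40% stopping rule is a commonly used convention that the abstract does not confirm.

```python
def head_tail_breaks(values, head_limit=0.4):
    """Return the successive mean values used as class breaks.

    values:     importance scores (e.g. from a PageRank-style ranking)
    head_limit: stop when the head is no longer a minority of the
                current values (assumed 40% convention)
    """
    breaks = []
    current = list(values)
    while len(current) > 1:
        mean = sum(current) / len(current)
        head = [v for v in current if v > mean]
        # Stop if nothing lies above the mean or the head is too large.
        if not head or len(head) / len(current) > head_limit:
            break
        breaks.append(mean)
        current = head  # recurse into the head only
    return breaks


# Toy heavy-tailed scores; not taken from the paper.
scores = [1, 1, 1, 1, 1, 1, 1, 2, 3, 10]
breaks = head_tail_breaks(scores)
```

Items scoring above the last break form the final head, which in a setting like the one described would be the candidate set of top-ranked topics.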

  • Zhang Lanze, Gu Yijun, Peng Jingjie
    Data Analysis and Knowledge Discovery. 2025, 9(1): 65-78. https://doi.org/10.11925/infotech.2096-3467.2023.1009
    Abstract (215) PDF (86) HTML (129)

    [Objective] To enhance the accuracy of graph neural networks in credit fraud detection, this paper introduces topological structure analysis. It proposes a graph-based deep fraud detection model (PSI-GNN) integrating prior structural information. [Methods] We embed the attribute information representing the topological structure of central nodes into feature vectors through structural information encoding. Then, we divided the message-passing process into proximal and distal aspects. We aggregated proximal node information based on a shallow graph neural network model and aggregated distal homophily information guided by random walk structural similarity. Finally, we combined the results of the above message passing to obtain node embedding representations. [Results] We examined the new model on the DGraph-Fin and TFinance datasets, which include fraudulent behaviors. The Macro-F1 and AUC of the PSI-GNN model improved by 2.62%, 4.55%, and 4.67%, 2.33%, respectively, compared to nine graph neural network models in related fields. [Limitations] The processing of node structural information incurs significant time overhead. [Conclusions] By modeling the structural attributes and homophily information of credit networks, we can effectively detect credit fraudsters.

  • Wang Xiaolun, Yao Qian, Lin Jiahui, Zhao Yuxiang, Sun Zhihao, Lin Xinlan
    Data Analysis and Knowledge Discovery. 2025, 9(1): 55-64. https://doi.org/10.11925/infotech.2096-3467.2024.0098
    Abstract (211) PDF (89) HTML (165)

    [Objective] Based on self-determination theory, this study explores the motivations of service providers to participate in tasks on skill crowdsourcing platforms. [Methods] We retrieved 15,641 bids and 2,385 service provider records from the epwk.com platform. We utilized the TF-IDF and the BERT to analyze text features and calculate motivation variables. Finally, we constructed a negative binomial regression model considering the dependent variables as count variables. [Results] The motivations and behaviors of service providers participating in skill crowdsourcing were significantly correlated at the 1% level (R²=23.10%). Task difficulty improved the model’s explanatory power, negatively moderating competence and reputation (p<0.05) while positively moderating social recognition (p<0.01). [Limitations] The representativeness is limited to a single platform. Future studies could collect data from multiple platforms for comparative validation. External factors such as platform dynamics and policy environments might interfere with the data, which should be considered in future research to deepen the conclusions. [Conclusions] This paper expands the theoretical foundation for service provider participation in crowdsourcing tasks and offers practical insights for service providers, buyers, and platforms.