
Online first

The manuscripts published below will continue to be available from this page until they are assigned to an issue.
  • Dan Zhiping, Li Lin, Yu Xiaosheng, Lu Yujie, Li Bitao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0957
    Online available: 2025-02-12

[Objective] To address the problem that Chinese hate speech lacking overtly malicious words is difficult to identify, a Chinese hate speech detection method integrating multi-dimensional sentiment features (RMSF) is proposed. [Method] First, RoBERTa is employed to extract character- and sentence-level features of the input text, while sentiment dictionaries are used to derive multi-dimensional sentiment features. The character features and sentiment features are then concatenated and fed into a BiLSTM network to capture deeper contextual semantic information. Finally, the BiLSTM output is concatenated with RoBERTa's sentence features, processed by an MLP layer, and passed through a Softmax function for category prediction. To tackle class imbalance, the focal loss function is used to optimize the model so that it can accurately distinguish whether the input text is hate speech. [Results] The RMSF method achieves precision, recall, and F1 scores of 82.63%, 82.41%, and 82.45%, respectively, on the TOXICN dataset. On the COLDataset, it attains a precision of 82.82%, a recall of 82.81%, and an F1 score of 82.69%. Compared with existing methods, RMSF improves F1 scores by 1.85% and 0.94%, respectively. [Limitations] The method's effectiveness is constrained by its reliance on sentiment dictionaries: the quality of sentiment feature extraction depends on how comprehensive these dictionaries are. [Conclusion] Integrating multi-dimensional sentiment features into a Chinese hate speech detection model significantly enhances detection performance, and the experimental results confirm the effectiveness of the proposed approach.
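The focal loss used above to counter class imbalance has a compact closed form; below is a minimal binary-case sketch in plain Python (the RoBERTa-BiLSTM pipeline is not reproduced, and the `alpha`/`gamma` defaults are common choices, not values reported by the authors):

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for predicted positive-class probability p
    and gold label y in {0, 1}.

    The (1 - p_t)^gamma factor down-weights well-classified examples
    so training concentrates on hard (often minority-class) cases.
    """
    p_t = p if y == 1 else 1.0 - p        # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction contributes almost nothing...
easy = focal_loss(0.95, 1)
# ...while a hard example keeps a large loss (and gradient) signal.
hard = focal_loss(0.30, 1)
```

With `gamma=0` and `alpha=0.5`, the expression reduces to ordinary (halved) cross-entropy, which is a quick sanity check on the implementation.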

  • Yang Ying, Zhang Lingfeng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0733
    Online available: 2025-02-11

[Objective] Existing research on multimodal review helpfulness focuses mainly on the simple integration of image and text modalities. This paper investigates the impact of product-domain knowledge and dynamic image-text interactions on review helpfulness, thereby improving the performance of multimodal review helpfulness detection. [Methods] We propose a domain knowledge-enhanced multimodal review helpfulness detection method. First, domain keywords are extracted from the latent thematic features of the review text and combined with topic attention to represent knowledge features. A knowledge-enhanced dynamic text-image interaction module is then developed: it uses a knowledge-enhanced intra-modality self-attention mechanism to integrate knowledge with text and images, and a knowledge-enhanced inter-modality co-attention mechanism to refine feature representations through dynamic interaction between the knowledge-enriched text and images. [Results] The detection F1-score on the Amazon dataset reaches 89.57%, an improvement of 0.9% over the best baseline model. [Limitations] Experiments are conducted only on English datasets; performance on Chinese datasets remains to be investigated. [Conclusions] Leveraging domain knowledge not only improves the performance of review helpfulness detection but also helps the model extract key information from images and text, increasing its interpretability.
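The inter-modality co-attention described above amounts to letting vectors of one modality attend over the other; below is a toy dot-product attention sketch in plain Python (list-of-lists vectors stand in for real text/image features, and the knowledge-enhancement terms are omitted):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def co_attention(text, image):
    """One direction of inter-modality co-attention: each text vector
    attends over the image vectors via dot-product scores and returns
    the attention-weighted image summary."""
    out = []
    for q in text:
        scores = softmax([sum(a * b for a, b in zip(q, k)) for k in image])
        out.append([sum(w * k[j] for w, k in zip(scores, image))
                    for j in range(len(image[0]))])
    return out
```

In the full method this would run in both directions (text→image and image→text), with learned projections rather than raw vectors.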

  • Hai Jiali, Wang Run, Yuan Liangzhi, Zhang Kairui, Deng Wenping, Xiao Yong, Zhou Tao, Chang Kai
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0747
    Online available: 2025-02-11

[Objective] To develop a retrieval-augmented question-answering (QA) system for Traditional Chinese Medicine (TCM) standards that accurately and effectively delivers high-quality standardized TCM knowledge and practical experience to clinicians and the general public, thereby advancing the research and application of TCM standardization. [Methods] After comparing the performance of existing large language models (such as BaiChuan, Gemma, and Tongyi Qianwen), GPT-3.5 was selected as the foundation model. It was combined with data optimization and retrieval-augmented generation techniques to build a TCM standards QA system with capabilities in semantic analysis, contextual association, and content generation. [Results] On answer relevance, the system achieved precision, recall, and F1 scores of 87.9%, 83.9%, and 85.7%, respectively, on the TCM literature question generation dataset, and 87.1%, 83.6%, and 85.3% on the TCM standards QA dataset. On contextual relevance, it achieved precision, recall, and F1 scores of 83.8%, 86.9%, and 85.3% on the TCM literature question generation dataset. These metrics outperform the compared models, indicating that the system answers questions related to TCM standards more accurately. [Limitations] The system's intent recognition module requires further optimization, and the TCM standards knowledge base needs to be expanded and refined at a more granular level. [Conclusions] Addressing the practical needs of TCM knowledge services, this study explores the construction of a retrieval-augmented TCM standards QA system. The system can answer a range of questions on TCM treatment guidelines, herbal medicine standards, information standards, and related topics, including treatment principles, disease classification, treatment methods, and the technical requirements of TCM standards, demonstrating strong practicality and feasibility.

  • Tang Chao, Chen Bo, Tan Zelin, Zhao Xiaobing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0722
    Online available: 2025-02-11

    [Objective] Through knowledge distillation techniques, this work aims to inject additional knowledge from unsupervised external sources into a student entity extraction model for classical Chinese texts, alleviating the issue of scarce supervised data in this task. [Methods] A large language model is utilized as a generative knowledge teacher model to perform knowledge distillation on unsupervised corpora. A dictionary knowledge teacher model is constructed based on the supervised data from the ZuoZhuan and GuNer datasets, and the distilled dictionary knowledge is combined to construct a semi-supervised dataset for classical Chinese entity extraction. The task is then reformulated as a sequence-to-sequence problem, and pre-trained models like mT5 and UIE are fine-tuned on this dataset. [Results] On the ZuoZhuan and GuNer datasets, the proposed method achieves F1 scores of 89.15% and 95.47% respectively, outperforming the baseline models SikuBERT and SikuRoBERTa, which were incrementally fine-tuned on classical Chinese corpora, by 8.15 and 9.27 percentage points in F1 score. [Limitations] Additional entity information is not included, and the quality of data pre-recalled by the large model can affect the extraction results. [Conclusions] In low-resource settings, the proposed approach effectively distills the knowledge advantages of pre-trained large language models and dictionary resources into the student entity extraction model, significantly improving the performance on classical Chinese entity extraction tasks.

  • Zeng Wen, Wang Yuefen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0660
    Online available: 2025-02-11

[Objective] This article aims to provide decision-making references for scientific and technological innovation activities by studying the distribution characteristics and evolution patterns of patent technology transfer types. [Methods] Based on the inflow and outflow of patent assignments, a patent technology transfer information database was constructed; the study period was divided into intervals and transfer networks were built for each. Patent feature values were selected to construct indicators of transfer scope and depth, and technology transfer types were defined using strategic coordinate diagrams. The type distribution and evolution trends across periods were then analyzed with Markov chain methods. [Results] Type III is the most common type of AI patent technology transfer in China, while Type I is highly concentrated in the Yangtze River Delta and Pearl River Delta regions. Most provinces and cities follow a development path from Type III to Type II to Type I. The probability of maintaining a technology transfer type over time is high (the maintenance probability of Type I is 100%), and cross-level transitions between types are infrequent. [Limitations] Only two indicator dimensions are used to characterize technology transfer types; future work could adopt multi-dimensional indicators. [Conclusions] The identified characteristics and evolution patterns of technology transfer types can help governments and enterprises formulate targeted patent transfer and commercialization policies and strategies.
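The Markov chain analysis of type maintenance and transition can be illustrated with maximum-likelihood transition estimates from observed type sequences; the sequences below are hypothetical, not the paper's data:

```python
from collections import Counter, defaultdict

def transition_matrix(sequences, states):
    """Maximum-likelihood Markov transition probabilities estimated
    from observed state sequences (e.g. a province's technology
    transfer type in consecutive time periods)."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    matrix = {}
    for a in states:
        total = sum(counts[a].values())
        if total:  # skip states never observed as a source
            matrix[a] = {b: counts[a][b] / total for b in states}
    return matrix

# Hypothetical transfer-type sequences over four periods.
seqs = [["III", "III", "II", "I"],
        ["III", "II", "II", "I"],
        ["I", "I", "I", "I"]]
P = transition_matrix(seqs, ["I", "II", "III"])
```

The diagonal entries of `P` are exactly the "maintenance probabilities" discussed in the results (e.g. `P["I"]["I"]` for Type I).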

  • Gao Yuan, Li Chongyang, Qu Boting, Jiao Mengyun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0784
    Online available: 2025-02-11

[Objective] To optimize research on the structure of tourism flow networks and to address the inaccurate point-of-interest (POI) recognition and distorted visiting sequences produced by current travelogue-based tourist journey reconstruction methods. [Methods] This paper proposes a tourist journey reconstruction method based on large language models and combines it with social network analysis to explore the structural characteristics of urban tourism flow networks. [Results] The proposed reconstruction method achieves a precision of 94.00% and a recall of 87.78% in POI recognition, significantly outperforming the statistics-based Conditional Random Fields (CRF) method. The reconstructed journeys show a similarity of 83.81% to the actual journeys. [Limitations] The effectiveness of journey reconstruction relies to some extent on the quality of the prompts supplied to the large language model. [Conclusion] Taking Xi'an as a case study, the conclusions align with public perception and existing research findings, demonstrating the accuracy and versatility of the proposed method.
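POI-recognition precision and recall of the kind reported above are set-based comparisons against the ground-truth itinerary; a minimal sketch (the POI names below are placeholders, not from the study):

```python
def prf(predicted, gold):
    """Set-based precision, recall, and F1 for recognised items
    (e.g. points of interest) against a ground-truth set."""
    tp = len(set(predicted) & set(gold))        # true positives
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Hypothetical example: two of three predicted POIs are correct.
p, r, f1 = prf(["poi_a", "poi_b", "poi_c"], ["poi_b", "poi_c", "poi_d"])
```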

  • He Duokui, Tang Zhongjun, Chen Qianqian, Wang Yiran, Hu Feng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0982
    Online available: 2025-02-11

[Objective] To propose a dynamic topic modeling method for short texts, driven by large language models, that ensures topic identification accuracy and reveals patterns of topic evolution. [Methods] We combine instruction tuning, Retrieval-Augmented Generation (RAG), and clustering techniques to improve the accuracy of topic identification. Based on topic mapping relationships, we compile comprehensive statistics on topics in chronological order to reveal patterns of topic evolution. [Results] We validated the proposed method on four short text datasets; it outperformed the second-best model by an average of 6.15 and 7.71 percentage points in Topic Coherence (TC) and Topic Diversity (TD) scores, respectively. Ablation experiments further analyzed the impact of fine-tuning, RAG, and clustering on topic identification performance. The study also uncovered distinctive topic evolution patterns across datasets, notably including "M" and "L" structures. [Limitations] Future research could integrate knowledge graphs to optimize RAG, further enhancing topic identification, and validate the model's versatility on short texts from more domains. [Conclusions] Experimental results demonstrate that the proposed method has clear advantages in topic identification and in tracing topic evolution.
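Topic Diversity, one of the two metrics reported above, is commonly computed as the fraction of unique words among the top-k words of all topics; a minimal sketch under that common definition (the paper may use a variant):

```python
def topic_diversity(topics, topk=10):
    """Topic Diversity: fraction of unique words across the top-k
    words of every topic. 1.0 means no word is shared between topics;
    values near 0 mean the topics are highly redundant."""
    words = [w for topic in topics for w in topic[:topk]]
    return len(set(words)) / len(words)
```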

  • Yan Liu, Yalan Zhan, Ziheng Jiang, Jinliang Li, Zhijun Yan, Chaocheng He
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0991
    Online available: 2025-02-11

[Objective] Existing studies on health rumor detection typically treat it as a binary classification task and leverage linguistic content features for prediction, overlooking language style features and the partially misleading case in which true and false content coexist within a single post. To bridge these gaps, this study proposes a multimodal wide-and-deep approach for online health rumor detection based on Aristotle's rhetorical theory (MWDLS). [Methods] MWDLS draws on Aristotle's rhetorical theory to extract persuasive language style features. It also employs a bidirectional cross-modal fusion strategy combined with a gating mechanism to jointly learn representations of shallow and deep features for label prediction. [Results] Extensive experiments on a real-world Weibo dataset show that MWDLS significantly outperforms the baselines, achieving a 1.74%-11.98% improvement in F1 score. [Limitations] Future work could combine the proposed framework with large language models to explore further performance improvements. [Conclusions] This study proposes a theory-driven multimodal framework for health rumor detection and validates its effectiveness on a real-world social media dataset, offering both theoretical and practical implications.

  • Zhang Xinsheng, Li Jiang, Wang Minghu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0726
    Online available: 2025-02-11

[Objective] To address the indistinct entity boundaries, sparse data, and manual annotation limits in traditional craftsmanship entity extraction, this paper combines traditional intangible cultural heritage skills with named entity tagging and presents the ER-BFAS model, which uses boundary features and attention sequences to identify skill entities in textual corpora. [Methods] Entity boundary attribute features are integrated into the joint embedding layer of text labels, and feature vectors are formed through an attention mechanism. A bidirectional LSTM then captures the interdependence of skill-related entity labels, strengthening the model's ability to discriminate between labels. Finally, a CRF layer predicts entity labels, selecting the sequence with the highest conditional probability as the final prediction. [Results] Compared with other sequence labeling models, ER-BFAS reaches an F1 score of 85% and over 90% accuracy across distinct tags on the traditional skill dataset, and a precision of 75% on the general DGRE dataset. [Limitations] The study is confined to a limited variety of experimental data types and does not address intricate entity relationships. [Conclusions] ER-BFAS effectively identifies entity boundaries in traditional skill texts, enhancing entity recognition for intangible cultural heritage traditional skills.
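The CRF layer's label prediction described above is typically done with Viterbi decoding over per-token emission scores and tag-to-tag transition scores; a minimal sketch (toy tag set and scores, not the model's learned parameters):

```python
def viterbi(emissions, transitions, tags):
    """Viterbi decoding: find the highest-scoring tag sequence given
    per-token emission scores and tag-to-tag transition scores, as in
    the output layer of a BiLSTM-CRF tagger."""
    scores = {t: emissions[0][t] for t in tags}   # best score ending in tag t
    back = []                                     # backpointers per step
    for emit in emissions[1:]:
        nxt, ptr = {}, {}
        for t in tags:
            prev = max(tags, key=lambda p: scores[p] + transitions[(p, t)])
            ptr[t] = prev
            nxt[t] = scores[prev] + transitions[(prev, t)] + emit[t]
        scores, back = nxt, back + [ptr]
    # Trace the best path backwards from the best final tag.
    last = max(tags, key=scores.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]
```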

  • Sun Xinxin, Sun Yanan, Zhao Yuxiang, Jiang Bin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0633
    Online available: 2025-02-11

[Objective] Based on the CASA paradigm and the stereotype model, this study explores how the voice characteristics of AI medical voice assistants affect perceived credibility among older adults. [Methods] A 3 (voice gender: female/male/non-binary) × 2 (communication style: expert/partner) between-subjects experiment examined the impact of voice gender and communication style on older adults' perceived credibility and willingness to use such assistants, as well as the mediating mechanism through the stereotype dimensions of perceived warmth and perceived professionalism. [Results] Older adults attribute higher credibility to male expert-style and female partner-style AI medical voice assistants. Communication style also affects older adults' perceived credibility of a given voice gender through the perceived professionalism dimension. Perceived credibility is verified to positively influence and predict older adults' intention to use AI medical voice assistants. [Limitations] Because the study is grounded in the current state of intelligent construction in China's medical system, the global generalizability of the conclusions remains to be examined. [Conclusion] Matching voice characteristics to stereotypes positively affects older adults' perceived credibility. The design and deployment of AI medical voice assistants should consider the interaction of multiple voice factors and their suitability for specific scenarios.

  • Zhang Le, Chen Yansong, Zhang Leihan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0625
    Online available: 2025-02-11

[Objective] In multimodal sentiment analysis, different modalities may express inconsistent sentiment, which suppresses their ability to make collaborative sentiment decisions. To address this problem, a multimodal sentiment analysis method based on large-model feature enhancement and multi-level cross-fusion is proposed. [Methods] To alleviate conflicting sentiment information among modalities and improve the expression of common sentiment features, a multimodal large language model is used to extract auxiliary sentiment information within each modality. A hierarchical cross-attention mechanism learns the common sentiment features across modalities and explores the auxiliary sentiment features within each modality, improving the expression of shared sentiment semantics. In the fusion stage, a modality-attention weighted fusion method balances the contributions of common and auxiliary sentiment features, and a loss function combining multimodal and unimodal terms is introduced to handle inconsistent sentiment semantics. [Results] The proposed model outperforms existing methods on the public CH-SIMS and CMU-MOSI datasets. On CH-SIMS, binary classification accuracy and F1 score increase by 1.77 and 1.63 percentage points, respectively; on CMU-MOSI, by 0.43 and 0.41 percentage points. On the sentiment-inconsistent subset of CH-SIMS, binary classification accuracy and F1 score increase by 1.80 and 1.72 percentage points, demonstrating that the model effectively addresses inconsistent sentiment semantics across modalities. [Limitations] Personalized information about the people appearing in the videos is not taken into consideration. [Conclusions] The hierarchical cross-attention mechanism effectively fuses features from the different modalities and enhances the expression of common semantics, addressing the problem of inconsistent sentiment semantics across modalities.

  • Duan Yufeng, Bai Ping
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1128
    Online available: 2025-02-11

[Objective] To achieve early detection of rumors by exploring rumor detection models and methods based purely on text content. [Methods] A Chinese health rumor detection model is proposed that combines large language model knowledge enhancement with multi-scale graph neural networks. The model first builds a text graph for each individual text to capture hidden information in the sentence; it then extracts entity information from the text through prompt engineering for knowledge enhancement; finally, it applies a multi-scale graph neural network combined with feature decomposition for rumor detection. [Results] The proposed model achieves accuracies of 95.21% and 87.39% on the CHECKED and LTCR datasets, respectively, outperforming the selected baseline models. [Limitations] The model detects rumors from text alone, without using multimodal information such as images and videos. [Conclusions] Using large language models for knowledge enhancement makes entity extraction faster and more convenient and enriches the semantic information of sentences. Multi-scale graph neural networks with feature decomposition maintain computational stability while capturing multi-scale features, and building a text graph for each individual text is flexible and convenient. The model shows strong performance in Chinese health rumor detection.
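Building a text graph for a single text, as in the first step above, is often done with a sliding co-occurrence window; a minimal sketch of that common construction (the paper's exact graph definition may differ):

```python
def build_text_graph(tokens, window=2):
    """Build an undirected co-occurrence graph for one document:
    nodes are tokens, and an edge links two tokens that appear
    within `window` positions of each other."""
    edges = set()
    for i, u in enumerate(tokens):
        for v in tokens[i + 1: i + 1 + window]:
            if u != v:
                # Sort the endpoints so each undirected edge is unique.
                edges.add(tuple(sorted((u, v))))
    return edges
```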

  • Wang Song, Jiao Haiyan, Liu Xinmin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1086
    Online available: 2025-02-11

[Objective] An in-depth exploration of the thematic structure and dynamics of valuable user-generated content, aimed at effectively extracting and analyzing co-creation knowledge and promoting the efficient collection and utilization of multi-modal knowledge. [Methods] Taking into account both the text semantics and the image features of multi-modal user-generated content, knowledge aggregation and topic mining are implemented under a "model-driven + knowledge-enhanced" research paradigm. First, the BERT, Doc2Vec, ResNet, and K-BERT models are integrated to capture deep vector representations of multi-level text and images and to complete knowledge enhancement. Second, a distance matrix is constructed to characterize the intrinsic relationships between multi-modal contents. Finally, spectral clustering and DTM models are used to analyze the dynamic evolution of knowledge topics in depth. [Results] Validation on data from a typical online virtual community shows that the multi-modal model integrating "long and short text + image + external knowledge" improves knowledge aggregation most effectively, with a CH index of 62.25, outperforming the other model combinations. Multi-modal deep clustering clarifies the evolution of core knowledge themes. [Limitations] The study is based mainly on the fusion of text and image data and does not explore rich media content such as audio and video. [Conclusions] The constructed deep learning fusion model improves topic recognition and evolutionary analysis for valuable multi-modal content, and can raise the management quality and utilization efficiency of co-creation knowledge.

  • Liu Qigang, Wang Yinfan, Mu Lifeng, Xu Wei, Sun Xiangyang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0984
    Online available: 2025-02-11

[Objective] To address the issues that traditional literature analysis tools are ill-suited to fine-grained literature analysis and cannot construct networks of research problems and research methods. [Methods] We propose constructing a problem- and method-oriented fine-grained academic graph. To this end, a new ontology for describing the research features of each paper and a new contextual-information-integrated multi-relation joint extraction model were developed, and innovative applications of Large Language Models (LLMs) in information extraction and graph-based Q&A are proposed. [Results] Our triplet extraction and augmentation method improves the F1 scores of entity and relation classification by 9.18% and 8.07%, respectively. The constructed problem- and method-oriented graph effectively supports research hotspot analysis and literature association analysis, and with LLM support, high-quality graph-based Q&A can be accomplished. [Limitations] The graph's ontology is not yet complete enough for comprehensive literature content analysis, and triplet extraction based on the GPT-4o LLM is not efficient enough for building large-scale graphs. [Conclusions] The proposed graph supports fine-grained literature review and literature association analysis along the problem and method dimensions, and LLMs can play an important role in academic graph construction and application.

  • Jing Hao, Wu Xinnian, Li Huijia, Zhu Zhongming
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.1150
    Online available: 2025-02-11

[Objective] This study aims to provide an efficient and accurate solution for the multi-label classification of domain-specific scientific literature, while offering insights into potential applications of knowledge distillation in other tasks. [Method] We first use labels generated by a large model to pre-annotate domain-specific scientific literature, then employ knowledge distillation to transfer knowledge from the large model to a smaller one, training a lightweight multi-label classification model. The model is evaluated on a friction science dataset, comparing the distillation-trained model with a supervised model trained on manually annotated labels in terms of accuracy and F1 score. [Results] The lightweight classification model trained with labels generated by the large model performs well across multiple metrics, with accuracy and F1 score exceeding 0.96 and 0.86, respectively, while significantly reducing time and cost. [Limitations] There is room for improvement in dataset diversity, framework robustness, and label quality optimization; the extension of the framework to other application scenarios also needs further exploration. [Conclusion] This research demonstrates the value of knowledge distillation for multi-label classification of domain-specific scientific literature, highlighting its efficiency and cost-effectiveness. Future work can optimize the distillation process, expand applicable scenarios, and support the efficient handling of more tasks.
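Knowledge distillation of the kind described can be sketched as training the small model against the large model's soft label distribution, optionally blended with hard labels; a minimal loss-function sketch (the `alpha` blend is a generic textbook formulation, not necessarily the paper's exact objective):

```python
import math

def distillation_loss(student_probs, teacher_probs, hard_label=None, alpha=0.5):
    """Blend of soft-target cross-entropy against the teacher's
    probability distribution and ordinary cross-entropy against a
    hard label (when one is available)."""
    soft = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    if hard_label is None:
        return soft
    hard = -math.log(student_probs[hard_label])
    return alpha * soft + (1 - alpha) * hard
```

When the teacher's distribution is one-hot, the soft term collapses to ordinary cross-entropy, which connects distillation back to the pre-annotation setup above.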

  • Zheng Bili, Hou Jianhua
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0224
    Online available: 2025-02-11

[Objective] To explore the differences in knowledge creation rhythms between elite and ordinary researchers, thereby revealing essential characteristics of academic careers. [Methods] Using a knowledge creation capability matrix, this study analyzes researchers across 19 disciplinary fields, focusing on knowledge sources and diffusion, and calculates rhythmic features of active and dormant periods. [Results] Elite researchers experience significantly more active periods than ordinary researchers, and those periods last longer. They also have a higher probability of entering active periods at mid-career stages. Publication volume, citation counts, and collaborative relationships show significant positive correlations with active periods. [Limitations] The study does not fully account for complex interactions between disciplines or the influence of diverse cultural backgrounds on research rhythms. [Conclusions] The knowledge creation capability perspective highlights differences in research rhythms, providing a theoretical basis for understanding academic career development and decision-making.

  • Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0791
    Online available: 2025-01-07

[Objective] Using large-model technology with Llama3-8b as the base model, this study aims to improve performance on Chinese electronic medical record (EMR) named entity recognition (NER) and to promote intelligent applications in the Chinese medical field. [Method] First, we enhance the model's grasp of medical knowledge using the Huatuo226K Chinese medical question-answering corpus. Next, we apply Easy Data Augmentation (EDA) to the CCKS2019 electronic medical record dataset and fine-tune Llama3-8b with the LoRA method, producing a model tailored to Chinese EMR named entity recognition. [Results] The fine-tuned Llama3 model shows significant improvement on the CCKS2019 dataset, achieving an overall precision of 0.8889, a recall of 0.8660, and an F1 score of 0.8773, an increase of 0.1611 in F1 over the original model. [Limitations] The study does not examine entity overlap in depth, and recognition accuracy varies across entity categories. [Conclusion] The proposed large-model framework for Chinese EMR named entity recognition validates the application potential of large-model technology in the Chinese medical field and lays a foundation for a general-purpose Chinese EMR NER model. Future research will address entity category imbalance, overlapping entities, and nested entities.
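LoRA fine-tuning, as used above, freezes the base weight matrix and trains only a low-rank update; a minimal forward-pass sketch in plain Python (toy dimensions; in practice LoRA is applied to the attention projections of Llama3-8b via libraries such as PEFT):

```python
def lora_forward(x, W, A, B, alpha=16, r=4):
    """LoRA forward pass: y = xW + (alpha/r) * x A B, where only the
    low-rank factors A (d_in x r) and B (r x d_out) are trained and
    the frozen base weight W (d_in x d_out) stays untouched."""
    def matvec(M, v):
        # Row vector v times matrix M.
        return [sum(v[i] * M[i][j] for i in range(len(M)))
                for j in range(len(M[0]))]
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))   # x -> r dims -> d_out dims
    s = alpha / r                     # standard LoRA scaling
    return [b + s * d for b, d in zip(base, delta)]
```

Note that initializing `B` to zeros (as real LoRA does) makes the adapted model start out identical to the frozen base model.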

  • Shen Si, Feng Shuyang, Wu Na, Zhao Zhixiao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0670
    Online available: 2025-01-07

[Objective] To improve the utilization efficiency of government information resources and promote the intelligent transformation of government services, this paper examines the effectiveness of retrieval-augmented generation over policy texts based on large language models, aiming to support policy text mining, utilization, formulation, and decision-making. [Methods] Building on the Chinese policy large language model ChpoGPT, this paper proposes a retrieval-augmented generation framework. The framework retrieves semantically similar policy documents from a knowledge base in response to user queries and feeds the retrieved results to ChpoGPT to strengthen its downstream task capabilities. [Results] Experiments show that the framework surpasses existing models in factual consistency, answer relevance, and semantic similarity. The ChpoGPT model performs best on faithfulness, reaching nearly 90%. On answer relevance it reaches 80.2%, surpassing the Gemini-1.0-pro model by 2.1%. On answer semantic similarity it reaches 56.4%, improvements of 4.1% and 2.8% over the ernie-4.0 and Gemini-1.0-pro models, respectively. These findings indicate that the framework effectively improves the quality and accuracy of policy text generation. [Limitations] Analysis of the experimental results shows that the language model's answer output is still not fully controllable. [Conclusions] Retrieval-augmented generation of policy texts based on LLMs offers a useful reference for the intelligent transformation of government services, but further improvement and optimization are needed.
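The retrieval step of such a RAG framework is, at its core, nearest-neighbour search over embedded documents; a minimal cosine-similarity sketch (toy 2-d vectors stand in for real sentence embeddings of policy documents):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve(query_vec, kb, k=2):
    """Return the texts of the k knowledge-base entries most similar
    to the query embedding; these would be prepended to the prompt."""
    ranked = sorted(kb, key=lambda d: cosine(query_vec, d["vec"]), reverse=True)
    return [d["text"] for d in ranked[:k]]
```

In a production framework the knowledge base would be a vector index and the retrieved passages would be concatenated into the model's prompt.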

  • Ren Minglun, Gong Ningran
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0612
    Online available: 2025-01-07

    [Objective] To solve the problem of cognitive limitations in knowledge graphs caused by the sparsity of relations and the difficulty of exploiting hidden relations. [Methods] A global-aware knowledge graph reasoning model based on graph attention networks (GAGAT) is proposed. It builds a hierarchical attention mechanism that introduces betweenness centrality as implicit structural information and combines it with relational semantic information to enhance the accuracy and interpretability of link prediction. [Results] On the FB15K-237 and WN18RR datasets, the GAGAT model improves the Hits@3 metric by 26.5% and 5% over ComplEx, 15% and 1.6% over CompGCN, and 1% and 1% over SD-GAT, respectively, demonstrating its superiority in capturing implicit relations and complex semantics. [Limitations] Only betweenness centrality is considered as the implicit structural information fused with relational semantic information; the role of other structural features in reasoning has not been explored and needs further research and verification. [Conclusions] By fusing implicit structural information with relational semantic information, the GAGAT model further mines hidden relationships in the knowledge graph, effectively improves the accuracy and interpretability of knowledge graph link prediction, and provides strong support for the cognitive ability of intelligent systems.
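One simple way to fuse a structural signal into attention, loosely following the idea above (a hedged sketch, not GAGAT itself: centrality values are assumed precomputed, and the bias weight `beta` is hypothetical), is to add a betweenness-centrality bias to the semantic attention logits before normalising:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def centrality_biased_attention(sem_scores, centrality, neighbors, beta=1.0):
    """Combine semantic attention logits with a betweenness-centrality bias
    for each neighbor node, then normalise into attention weights."""
    logits = [sem_scores[n] + beta * centrality[n] for n in neighbors]
    return dict(zip(neighbors, softmax(logits)))
```

With equal semantic scores, the more structurally central neighbor receives more attention mass, which is the intended effect of injecting implicit structural information.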

  • Zhou Ning, Ma Li, Xu Ke
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0457
    Online available: 2025-01-07

    [Objective] To address the limitations of the traditional LDA model in handling short texts, particularly the abstracts of Chinese medicine papers, which are rich in jargon and yield subject terms with poor interpretability, an improved LDA model (I-LDA) fused with rough data inference is proposed. [Methods] The most representative keywords are extracted using a TextRank algorithm that incorporates rough data inference. Domain vocabulary weights are improved by constructing domain-specific dictionaries, and the scope of subject word selection is expanded by combining rough data inference. [Results] In terms of topic coherence and inter-topic distance, the I-LDA model improves by about 25% compared with the traditional LDA, HCA, and HDP models. [Limitations] Owing to the abundance of specialized terms in the abstracts of Chinese medicine research papers, the preset dictionaries used in the experiments may not cover all related terms, potentially impacting the model's effectiveness in topic modeling. [Conclusions] The experimental results indicate that the I-LDA model performs excellently in topic modeling of Chinese medicine research paper abstracts, and the identified topics are more representative and professional.
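The keyword-extraction step can be sketched as plain TextRank with a domain-dictionary boost (a minimal illustration, not the paper's rough-data-inference variant; the window size, damping factor, and boost values are hypothetical):

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iters=30,
                      domain_boost=None, top_k=3):
    """Rank tokens by TextRank over a co-occurrence graph; optionally boost
    the scores of terms found in a domain dictionary."""
    graph = defaultdict(set)
    for i, t in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != t:
                graph[t].add(tokens[j])
                graph[tokens[j]].add(t)
    scores = {t: 1.0 for t in graph}
    for _ in range(iters):
        scores = {t: (1 - damping) + damping *
                     sum(scores[u] / len(graph[u]) for u in graph[t])
                  for t in graph}
    if domain_boost:
        scores = {t: s * domain_boost.get(t, 1.0) for t, s in scores.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Boosting dictionary terms raises domain vocabulary in the ranking, which is the role the domain-specific dictionaries play in I-LDA's keyword selection.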

  • Zhang Li, Hu Jingxuan, Liu Xiwen, Lu Wei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0924
    Online available: 2025-01-07

    [Objective] Existing author disambiguation research pays little attention to the Chinese-English joint author disambiguation task. This paper focuses on this task and creates fundamental research material for it. [Methods] This paper proposes a method to automatically construct an author disambiguation dataset from scholarly resources publicly available on the Internet, and uses this method to build a large labeled dataset, CHEN-AND, for the Chinese-English joint author disambiguation task. Based on the dataset, several basic joint author disambiguation methods are developed and evaluated. [Results] The evaluation shows that the best-performing methods achieve a P-F1 of 79.86% and a B3-F1 of 84.25%, significantly lower than the accuracies of widely explored English-language author disambiguation methods. [Limitations] CHEN-AND leans toward researchers in science and engineering (S&E), which differs from the author discipline distribution in large-scale literature databases, mainly because CSCD, the Chinese indexing database used to construct the dataset, caters to S&E fields. [Conclusions] Basic research materials such as the CHEN-AND dataset are made public so that subsequent research can explore more efficient cross-language author disambiguation methods and build a national high-end academic information exchange platform for Chinese researchers.
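The B3-F1 metric reported above scores a clustering of author mentions mention-by-mention. A minimal stdlib implementation (mention ids and cluster labels below are hypothetical):

```python
def b_cubed_f1(pred, gold):
    """B3 precision/recall/F1 for author disambiguation.
    pred and gold map each mention id to a cluster label."""
    def clusters(assign):
        out = {}
        for m, c in assign.items():
            out.setdefault(c, set()).add(m)
        return out
    pc, gc = clusters(pred), clusters(gold)
    p = r = 0.0
    for m in pred:
        same_pred = pc[pred[m]]          # mentions the system grouped with m
        same_gold = gc[gold[m]]          # mentions truly by the same author
        overlap = len(same_pred & same_gold)
        p += overlap / len(same_pred)
        r += overlap / len(same_gold)
    p /= len(pred)
    r /= len(pred)
    return 2 * p * r / (p + r) if p + r else 0.0
```

Unlike pairwise F1 (P-F1), B3 weights every mention equally, so it is less dominated by a few very large author blocks.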

  • Tian Xuecan, Wang Li
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0379
    Online available: 2025-01-07

    [Objective] In response to the limited automation and strong dependence on experience in detecting weak signals at the technology frontier, a detection method integrating two signal processing strategies, signal amplification and denoising, is proposed. [Methods] By simulating the signal processing flow, the signal is first preprocessed using RE; the weak signal is then amplified using an N-gram model and the TF-IDF algorithm; subsequently, the iterative shrinkage-thresholding algorithm (ISTA) is used to measure the future growth trend of the weak signal and further filter out noise; finally, the growth signals are integrated using a K-means++ algorithm enhanced by the Word2vec model. [Results] Signal amplification and signal filtering, as the two core processing strategies, effectively prevent noise from drowning out weak signals, improving the accuracy and focus of weak signal detection at the technology frontier. [Limitations] The current evaluation of detection effects still relies on professional knowledge, and more objective evaluation methods need to be explored; detection is based on a single data source, and future work should expand the data sources. [Conclusions] The proposed automated detection framework reduces the dependence on human experience to some extent and achieves effective and accurate detection results.
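The TF-IDF amplification step rests on a simple property: rare terms get larger IDF weights, so low-frequency candidate weak signals are boosted relative to common vocabulary. A stdlib sketch (toy token lists, not the paper's corpus):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Per-document TF-IDF scores over tokenised documents. Terms that occur
    in few documents (candidate weak signals) get amplified IDF weights."""
    n = len(docs)
    df = Counter()
    for d in docs:
        df.update(set(d))                 # document frequency per term
    scores = []
    for d in docs:
        tf = Counter(d)
        scores.append({t: (tf[t] / len(d)) * math.log(n / df[t]) for t in tf})
    return scores
```

In the paper's pipeline these amplified scores feed the ISTA-based trend filter, which then suppresses terms whose growth does not persist.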

  • Song Mengpeng, Bai Haiyan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0628
    Online available: 2025-01-07

    [Objective] To automatically generate structured literature reviews with references, helping research users quickly grasp a particular area of scientific knowledge. [Methods] A corpus was constructed by selecting 70,000 papers from the NSTL platform and identifying moves in their abstracts. The GLM3-6B model was fine-tuned on 3,000 review samples constructed through large language model generation and manual revision. The corpus was converted into high-dimensional vectors, stored in a vector index, and retrieved to implement LangChain's external knowledge base. To address the poor retrieval of proper nouns, hybrid search with BM25 and reranking is used to improve retrieval accuracy. [Results] The literature review generation system built on the fine-tuning and hybrid retrieval framework improved the model's BLEU and ROUGE scores by 109.63% and 40.21%, and the authenticity score in manual evaluation by 62.17%. [Limitations] Due to limited computational resources, the local model is small in parameter scale, and its generation ability still needs improvement. [Conclusions] Leveraging retrieval-augmented generation with large language models not only generates high-quality literature reviews but also provides evidence-based traceability for the generated content, assisting researchers in intelligent reading.
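The hybrid-search idea can be illustrated with Okapi BM25 blended against dense scores (a minimal sketch: the blending weight `alpha` and the toy documents are hypothetical, and the dense scores are assumed to come from an embedding model):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenised document for a tokenised query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter()
    for d in docs:
        df.update(set(d))
    out = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query:
            if t not in tf:
                continue
            idf = math.log((n - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        out.append(s)
    return out

def hybrid_rank(query, docs, dense_scores, alpha=0.5):
    """Blend BM25 (exact term match, strong on proper nouns) with dense
    similarity scores, then return document indices ranked by the blend."""
    sparse = bm25_scores(query, docs)
    blended = [alpha * s + (1 - alpha) * d for s, d in zip(sparse, dense_scores)]
    return sorted(range(len(docs)), key=lambda i: blended[i], reverse=True)
```

Because BM25 matches surface forms exactly, rare proper nouns that embed poorly still rank highly, which is precisely the failure mode the hybrid search targets.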

  • Liu Yu, Zeng Ziming, Sun Shouqiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0870
    Online available: 2025-01-07

    [Objective] To address the issues of semantic shift in multi-aspect sentences and implicit sentiment analysis in aspect-based sentiment analysis, this paper proposes an aspect-based sentiment analysis model based on sentiment enhancement using large language models and graph convolutional neural networks. [Methods] The model employs prompt learning to guide large language models in generating sentiment-enhanced representations of aspect semantics. It then constructs an aspect semantic sentiment knowledge-enhanced graph. In addition, this paper proposes an emotion-target position weighting algorithm to filter irrelevant information in the syntactic dependency graph. At the same time, it introduces an aspect mask and gated filtering mechanism to fully integrate semantic information, accurately identifying the sentiment tendency of each aspect. [Results] On the experimental datasets, only the accuracy on the Restaurant dataset is slightly lower than that of the other two baseline models, yet its F1 score is still as high as 81.60. The F1 scores on the Laptop, Twitter, and MAMS datasets showed significant improvements, with increases of 1.79%, 0.97%, and 3.02%, respectively, over the best baseline model. [Limitations] The role of visual information in aspect-level sentiment analysis is not considered, and experiments are conducted only on English datasets. [Conclusions] By leveraging prompt learning to guide large language models in generating sentiment representation words and combining them with graph neural networks, an effective and efficient aspect-level sentiment analysis solution is provided, significantly improving the accuracy of aspect-level sentiment analysis in text.

  • Meng Xuyang, Wang Hao, Li Yuanqing, Li Yueyan, Deng Sanhong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0914
    Online available: 2025-01-07

    [Objective] The ancient grammar and linear content presentation of traditional Chinese medicine (TCM) ancient books create obstacles to acquiring TCM knowledge, while combining large language models with knowledge graphs can make reliable TCM knowledge more accessible. [Methods] Based on the "Shang Han Lun" (Treatise on Febrile Diseases), a fine-grained knowledge graph of TCM ancient books was constructed, and retrieval-augmented generation was used to integrate the knowledge graph into a large language model through prompt learning, forming a question answering system. [Results] Compared with the baseline model and a model fine-tuned with professional data, the system's mean overall satisfaction rate in subjective evaluation is 14.67% and 1.33% higher, respectively; its overall accuracy in objective evaluation is 20% higher than the baseline model and 2% lower than the fine-tuned model. [Limitations] The system has so far been applied only to the area of TCM related to the "Treatise on Febrile Diseases", and standards for evaluating its professional capabilities are lacking. [Conclusions] This method not only enhances the accuracy and interpretability of large language model responses but also saves considerable resources.
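Injecting retrieved graph facts into the prompt can be sketched as a simple template (a hedged illustration of the KG-RAG pattern, not the authors' system; the example triple is hypothetical):

```python
def kg_prompt(question, triples):
    """Format retrieved knowledge-graph triples as grounding context
    for a question-answering prompt."""
    facts = "\n".join(f"- {h} --{r}--> {t}" for h, r, t in triples)
    return (
        "Answer using only the facts below.\n"
        f"Facts:\n{facts}\n"
        f"Question: {question}\nAnswer:"
    )
```

Constraining the model to verbalise retrieved triples is what makes the answers both more accurate and traceable back to the source graph.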

  • Xu Chun, Su Mingyu, Ma Huan, Ji Shuangyan, Wang Mengmeng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0773
    Online available: 2025-01-07

    [Objective] In response to challenges in the tourism sector, including fragmented knowledge and limited annotated data, which result in low fine-tuning efficiency and suboptimal extraction performance, this research investigates entity-relation joint extraction in few-shot scenarios. [Methods] After prompt learning tailored to the tourism domain, a generative language model (GLM) encodes the input text for representation; combined with a global pointer network, the model completes potential relation prediction and entity recognition under specific relations and extracts relation triples. [Results] We conducted experiments on a self-constructed tourism dataset and the Baidu DuIE dataset. The model's F1 values were 90.51% and 89.45%, improvements of 2.37 and 0.16 percentage points over the traditional relation extraction model. [Limitations] Prompt learning is applied only in the tourism domain and with specific encoders; there is still room to expand the application scenarios. [Conclusions] The proposed method performs joint entity and relation extraction on tourism texts well. Prompt learning and large language model encoders can alleviate poor model performance in small-sample scenarios, effectively improving the accuracy of entity and relation extraction.

  • Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0442
    Online available: 2025-01-07

    [Objective] To solve the problem of limited information in rumor data and the lack of associated common-sense information, so as to improve the accuracy of rumor identification. [Methods] A Multi-Branch Graph Convolutional Inference Network (MGCIN) is proposed. Built on a bi-directional graph convolutional network, it adds the common-sense inference module COMET, which both generates feature representations fused with common-sense inference information for input to the convolutional network and, as an independent auxiliary module, generates classification labels that influence decision-making in the final stage. [Results] Experiments on three public datasets, Twitter15, Twitter16, and PHEME, show that the proposed method outperforms most baseline models, achieving accuracies of 0.878, 0.898, and 0.776, respectively, with excellent early rumor detection performance. [Limitations] The multimodality of background and common-sense information related to rumor data still needs in-depth study. [Conclusions] The proposed model better simulates the human thought process, effectively integrates text features, propagation features, and common-sense information, and provides new ideas and methods for rumor detection research.

  • Chen Jing, Cao Zhixun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0446
    Online available: 2025-01-07

    [Objective] This paper aims to analyze the differences between unstructured knowledge, exemplified by knowledge base resources, and structured knowledge, exemplified by knowledge graph resources, in combating hallucinations in large language models, using the Traditional Chinese Medicine (TCM) Q&A domain as a case study. It further discusses strategies to enhance large language models' ability to combat hallucinations in vertical domains based on these findings. [Methods] The study designed experiments combining external knowledge with prompt engineering techniques to analyze the differences in prompting effects between knowledge base resources and knowledge graph resources in the TCM Q&A domain. It also explores the superiority of dynamic triplet strategies and integrated fine-tuning strategies in optimizing large language models against hallucinations. [Results] Compared with prompts from unstructured knowledge in the knowledge base, prompts from structured knowledge in the knowledge graph perform better in accuracy, recall, and F1 score, improving by 1.9%, 2.42%, and 2.2%, respectively, to reach 71.44%, 60.76%, and 65.31%. Further analysis of optimization strategies revealed that combining the dynamic triplet strategy with fine-tuning yielded the best effects against hallucinations, achieving accuracy, recall, and F1 scores of 72.47%, 65.87%, and 68.62%, respectively. [Limitations] This study is limited to a single field, having been tested only in TCM Q&A, and its generalizability needs validation in a broader range of scientific fields. [Conclusions] This study demonstrates that, in the TCM field, structured knowledge from knowledge graphs surpasses traditional unstructured knowledge in reducing hallucinations and enhancing the accuracy of model responses. It reveals the critical role of structured knowledge in boosting model comprehension; integrating fine-tuning strategies with knowledge resources provides an effective pathway for performance enhancement in large language models. This paper provides theoretical justification and methodological support for integrating external knowledge into large language models to enhance knowledge services.

  • Jiang Yuzhe, Cheng Quan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0600
    Online available: 2025-01-07

    [Objective] Patients' temporal and physiological data are mined and analyzed to provide accurate and safe medication plan references for patients and effective medication decision support for doctors. [Methods] A hybrid medication regimen recommendation model integrating temporal and vital-sign data is proposed. Initially, the model employs a Transformer architecture, convolutional neural networks (CNN), and time-aware methods to mine patients' temporal data. Subsequently, it leverages knowledge graph technology and graph convolutional neural networks (GCNN) to explore patients' physiological data. Ultimately, the model incorporates information on adverse drug-drug interactions (DDI) into the recommendation process, thereby providing patients with safe and effective medication regimens. [Results] An empirical study was conducted on patients with multiple admissions in the MIMIC-III dataset. The recommendation model improved on the GRAM, G-BERT, and TAHDNet models by 13.9%, 6.6%, and 3.7% in the Jaccard index, respectively, and by 9.3%, 3.2%, and 1.2% in the F1 metric. The model achieved the lowest DDI rate. [Limitations] Although the model considers patients' abnormal signs, it does not consider their specific values when learning from the sign data. [Conclusions] By integrating and analyzing patients' time-series and vital-sign data, the drug recommendation model learns the characteristics of patients' conditions more accurately, facilitating more precise medication regimens. Additionally, accounting for adverse drug interaction information during recommendation helps propose safer medication plans.
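The DDI-aware step and the Jaccard evaluation metric can be sketched as follows (a simplified illustration, not the paper's model: the greedy filter and the example drug names are hypothetical):

```python
def filter_ddi(recommended, ddi_pairs):
    """Greedily drop drugs that form a known adverse pair with an
    already-accepted drug in the recommendation list."""
    ddi = {frozenset(p) for p in ddi_pairs}
    safe = []
    for drug in recommended:
        if all(frozenset((drug, s)) not in ddi for s in safe):
            safe.append(drug)
    return safe

def jaccard(pred, gold):
    """Jaccard index between recommended and ground-truth drug sets."""
    pred, gold = set(pred), set(gold)
    return len(pred & gold) / len(pred | gold) if pred | gold else 1.0
```

In the actual model the DDI signal is folded into training rather than applied as a hard post-filter, but the goal is the same: a low DDI rate at a high Jaccard score.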

  • DING Shengchun, Long Xiang, Ye Zi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024-0732
    Online available: 2025-01-07

    [Objective] To enhance enterprises' situational awareness and emergency decision-making abilities through a situational awareness model for sudden incidents. [Methods] This paper establishes a model along the perception and understanding dimensions. It utilizes knowledge graphs and deep learning for event knowledge extraction, and constructs a corporate emergency symbiotic system based on symbiosis theory and the logistic growth model. The "Earth Pit Pickled Mustard Greens" incident is used as a case study. [Results] The model is constructed for the "Earth Pit Pickled Mustard Greens" incident, followed by an analysis of the situational development stages and the relationships among situational elements, providing situational assessment recommendations. [Limitations] Data quality and the knowledge graph structure need improvement. [Conclusions] The proposed model helps extract situational elements and identify their relationships, enhancing emergency decision-making capabilities.
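The logistic growth model referenced above describes how incident attention saturates at a carrying capacity. A minimal form (the parameter values below are illustrative, not fitted to the case study):

```python
import math

def logistic(t, k, r, x0):
    """Logistic growth curve x(t) = k / (1 + ((k - x0) / x0) * exp(-r * t)):
    carrying capacity k, growth rate r, initial level x0. Used here as a
    sketch of how incident attention builds up and then saturates."""
    return k / (1 + (k - x0) / x0 * math.exp(-r * t))
```

Fitting k, r, and x0 to observed attention data lets the stages of situational development (slow start, rapid growth, saturation) be read off the curve.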

  • Ye Guanghui, Wang Yujie, Lou Peilin, Zhou Xinghua, Liu Shuyan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0507
    Online available: 2025-01-07

    [Objective] With the rapid convergence of online and offline boundaries, the complexity of public opinion dissemination in emergencies has greatly increased; tracking the characteristics of public opinion circulation in emergencies can assist public opinion guidance, control, and shared governance. [Methods] Using the case study method, we propose a framework for the macroscopic circulation of public opinion in emergencies; using social network analysis, supplemented by empirical research and natural language processing technology, we analyze in depth the laws of circulation along the dimensions of subjects, objects, and carriers from a micro perspective, and validate the analysis with data from public health emergencies. [Results] From a macro perspective, public opinion circulates across cyber space, physical space, and psychological space, providing an interdisciplinary analytical framework for understanding and quantifying public behaviors and responses. From a micro perspective, public opinion circulates across multiple groups, media, events, and platforms, showing four effects: homogeneous diffusion and heterogeneous traversal, field resonance and field escape, co-temporality and ephemerality, and amplified resonance and echo difference. [Limitations] The dynamics of social network sentiment are not taken into account. [Conclusions] By summarising the laws of cross-domain circulation of public opinion from both macroscopic and microscopic perspectives and conducting empirical research on specific events, we provide new ideas for the study of public opinion communication.

  • Ren Gang, Cheng Lingfeng, Jia Ziyao, Wang Anning
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0236
    Online available: 2025-01-07

    [Objective] Leveraging multimodal information from images and texts, the ITRHP multimodal review helpfulness prediction model is proposed based on image-text matching technology. [Methods] First, Faster R-CNN and Bi-GRU models are used to extract image and text features, respectively; second, the mutually matching regions between text and images are captured by a co-attention mechanism to improve the consistency of feature expression; then, positive and negative attention mechanisms are introduced to obtain the shared semantic information of matching and mismatching word-region pairs, and an adaptive matching-threshold learning module is used to better recognize the word-region pairs with the highest similarity; finally, the semantic information is passed to a fully connected layer to obtain the final classification results. [Results] The experimental results show that the ITRHP model achieves accuracies of 80.17% and 80.27% on the Yelp and Amazon datasets, respectively, with F1 values of 79.38% and 89.01%. Compared with the benchmark models, accuracy improved by up to 2.8% and 2.42% on the two datasets, and F1 improved by up to 2.70% and 7.48%, respectively. [Limitations] This study does not yet fuse additional review features (e.g., review sentiment, reviewer characteristics); future work will construct a more comprehensive and effective multimodal review helpfulness fusion mechanism to further improve model performance. [Conclusions] The ITRHP model proposed in this study effectively utilizes multimodal information through image-text matching technology, addressing the low classification accuracy of multimodal helpfulness prediction models.

  • Qian Lingfei, Ma Ziyi, Dong Jiajia, Zhu Pengyu, Gao Dequan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0610
    Online available: 2025-01-07

    [Objective] To improve relation extraction from power communication system fault text, a multi-level graph convolution document-level relation extraction method that incorporates ontology information is proposed, tailored to the characteristics of the domain. [Methods] First, word-level embeddings are used to encode the fault text. Second, sentence-level and entity-level document graphs are constructed. Third, semantic information at the entity, sentence, and document levels is aggregated with graph convolution. Finally, an "ontology-ontology" edge construction method is designed according to the ontology conceptual model, and an auxiliary task of predicting whether an entity pair conforms to the ontology constraints is added to improve model performance. [Results] Ablation and comparison experiments on a self-built power communication network fault dataset show that the proposed method performs best, with F1, Ign F1, and AUC values of 97.22%, 95.17%, and 97.97%, respectively. [Limitations] The generalization ability of the model needs further verification. [Conclusions] The proposed method is suitable for the relation extraction task of the power communication network fault knowledge graph and extracts relations better than existing methods.

  • Duan Yufeng, Xie Jiahong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0965
    Online available: 2025-01-07

    [Objective] To study the performance differences of existing Chinese large language models in extracting entities and relations from Chinese medical text, and to analyze how the number of examples and the number of relation types affect extraction. [Methods] Based on a prompt engineering approach, we call 9 mainstream large language models via their APIs, vary the prompt along two dimensions, the number of examples and the number of relation types, and compare extraction performance on the CMeIE-V2 dataset. [Results] (1) glm-4-0520 ranks first in comprehensive extraction ability, with F1 values of 0.4422, 0.3869, and 0.3874 when extracting the three relation types "clinical manifestation", "medication", and "etiology", respectively. (2) As the number of examples in the prompt increases, the F1 value first rises, reaches a maximum of 0.4742, and then declines. (3) As the number of relation types to be extracted increases, the F1 value decreases significantly, falling by 0.1182 and eventually dropping to only 0.2949. [Limitations] Few public datasets are available, so the experimental results in this paper come from a single dataset. Since medical-domain large language models are currently difficult to call through APIs, all models used here are from general domains. [Conclusions] Extraction performance varies greatly among LLMs; a suitable number of examples can improve extraction, but more examples are not always better; LLMs are not good at extracting multiple relation types at the same time.

  • Yu Chi, Chen Liang, Xu Haiyun, Mu Lin, Xia Chunzi, Xian Xin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024-0650
    Online available: 2025-01-07

    [Objective] To perform named entity recognition (NER) of core technical information in patent texts with few labeled samples. [Methods] Exploiting large language models' abundant general knowledge and strong semantic understanding, we propose a methodological framework for named entity recognition from patent text using prompt templates. [Results] An empirical analysis is presented on the TFH-2020 annotated dataset for hard disk drives. The experimental results demonstrate that, with the few-shot learning ability of the large language model, NER performance reaches an F1 of 68%, while the supervised fine-tuning method reaches 54%, contrary to its performance on general texts. [Limitations] Although the proposed method significantly reduces the cost of data annotation, a considerable gap remains compared with deep learning methods trained on large amounts of labeled data. In addition, the design and optimization of prompt templates, as well as techniques for rapidly generating large batches of instruction sets, need further improvement. [Conclusions] Compared with a random sample selection strategy, the NER performance of the large language model with a similar-sample selection strategy increases from 29% to 69% in F1. This indicates that the sample selection strategy has a significant impact on large language models' performance in the patent NER task, and that the prompt template is the core of this method: it determines the quality of the recognition effect and influences the choice of optimization methods.
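The similar-sample selection strategy can be sketched as picking the in-context examples most similar to the target sentence (a minimal stand-in using token-level Jaccard similarity; the example sentences are hypothetical, and a real system would use embedding similarity):

```python
def select_examples(target, pool, k=2):
    """Pick the k labelled examples whose token overlap (Jaccard) with the
    target sentence is highest -- the 'similar sample' selection strategy."""
    def sim(a, b):
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
    return sorted(pool, key=lambda ex: sim(target, ex), reverse=True)[:k]
```

Examples that share vocabulary with the target sentence tend to share entity types and annotation conventions, which is one plausible reason this strategy lifts few-shot patent NER so sharply over random selection.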

  • Wanbin Li, Si Shen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1394
    Online available: 2025-01-07

    [Objective] To solve the problem that a large number of valuable academic entities exist in unstructured data, a unified entity recognition framework is established to handle different types of downstream NER tasks. [Methods] The BIO labeling scheme and JSON format are used to label the datasets, and nested and non-nested data are recognized with the unified nested entity recognition model Global Pointer. The model's performance is evaluated from different perspectives, supplemented by comparative experiments with CRF, GPT-4, and BERT. [Results] Across different types of datasets, Global Pointer performs well in precision, recall, and F1. The average F1 on the non-nested datasets reaches 95.38% and 79.81%, respectively, while the average F1 on the nested datasets reaches at least 66.91% and 61.47%, respectively; the overall performance surpasses CRF, GPT-4, and BERT without manual feature templates. [Limitations] For overall nested entity recognition, Global Pointer still needs further optimization to identify the corresponding entities efficiently and accurately from an application perspective. [Conclusions] Based on the idea of "global unification", Global Pointer effectively uses the positional features of entities in unified entity recognition, improving recognition accuracy while remaining convenient for complex nested entities.

  • Chen Wanzhi, Hou Yue
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0720
    Online available: 2025-01-07

    [Objective] To address insufficient multimodal feature extraction, semantic feature disparity, and inadequate interaction between modalities in multimodal sentiment analysis, a temporal multimodal sentiment analysis model that integrates multi-level attention and sentiment scale vectors is proposed. [Methods] First, a scalar Long Short-Term Memory network (with scalar memory, scalar update, and new memory mixing) is introduced and combined with a multi-head attention mechanism to build a multimodal deep temporal feature modeling network that extracts rich contextual temporal features from the text, audio, and visual modalities. Second, a text-guided dual-layer cross-modal attention mechanism and an improved self-attention mechanism enable deep information exchange between modalities, producing two sentiment scale vectors for analyzing sentiment intensity and for determining whether the sentiment is positive, negative, or neutral. Finally, the L1 norm of the sentiment intensity vector is multiplied by the normalized sentiment polarity vector to obtain a joint representation of intensity and polarity for accurate prediction. [Results] Experiments on the CMU-MOSI and CMU-MOSEI datasets show that the proposed model performs well in both comparative and ablation experiments; on CMU-MOSI, Acc7 and Corr improve by 1.2% and 2.3% over the next best model, and on CMU-MOSEI the model surpasses the baseline models on all metrics, with Acc2 and F1 scores of 86.0% and 86.1%, respectively. [Limitations] Sentiment expression is highly context-dependent, and the sources of sentiment cues may vary across scenarios; where textual information is insufficient, performance may degrade. [Conclusions] The proposed model effectively extracts contextual temporal features from each modality and leverages the rich sentiment information in the text modality for deep cross-modal interaction, thereby enhancing the accuracy of sentiment prediction.
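    The final fusion step in this abstract is concrete enough to sketch: the L1 norm of the intensity vector gives a scalar sentiment strength, and the polarity vector is normalized to a unit direction before the two are multiplied. A minimal sketch, where the vector sizes and the exact normalization are assumptions for illustration:

```python
import numpy as np

def combine_sentiment(intensity_vec, polarity_vec, eps=1e-8):
    """Fuse the two sentiment scale vectors into one representation.

    The L1 norm of the intensity vector measures how strong the
    sentiment is; the polarity vector, normalized to unit length,
    encodes which way it points. Their product carries both signals.
    (Illustrative sketch; dimensions are made up for the example.)
    """
    strength = np.abs(intensity_vec).sum()                      # L1 norm
    direction = polarity_vec / (np.linalg.norm(polarity_vec) + eps)
    return strength * direction

fused = combine_sentiment(np.array([0.5, -0.25, 0.25]),
                          np.array([3.0, 4.0]))
print(fused)  # strength 1.0 scaled along the unit polarity direction
```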

  • Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0287
    Online available: 2025-01-07

    [Objective] To unveil the process of telecom fraud and identify key risk factors, this paper proposes a research framework for risk analysis of telecom fraud incidents based on large language models and event fusion. [Methods] A two-stage hierarchical prompt directive for the telecom fraud domain is constructed to extract risk events and event arguments from fraud cases. Semantic dependency analysis and template matching are then combined to form chains of fraud events. To handle the diversity of event descriptions, the BERTopic model provides sentence vector representations, and clustering algorithms are used for event fusion. [Results] The method achieves an F1 score of 67.41% for fraud case event extraction and 73.12% for argument extraction, with event clustering identifying 10 categories of thematic risk events. Among these, the act of "providing personal information" carries the highest risk. The paper illustrates the general evolution of fraud impersonating public security, prosecution, and law enforcement agencies with the fraud case of Ms. Liu. [Limitations] The granularity of the police incident data is relatively coarse, which limits the strength of early warning capabilities. [Conclusions] Large language models combined with event fusion clustering methods allow for the automatic construction of event evolution chains and the analysis of event risk values, offering support for the early warning and deterrence of telecom fraud.
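    The event fusion step merges differently worded mentions of the same underlying event by clustering their sentence vectors. A greedy cosine-similarity grouping can stand in for that step; the vectors, threshold, and algorithm below are illustrative assumptions, not the paper's actual BERTopic embeddings or clustering method:

```python
import numpy as np

def fuse_events(vectors, threshold=0.9):
    """Greedy event fusion over sentence vectors.

    Each mention joins the first existing cluster whose seed vector
    is within `threshold` cosine similarity; otherwise it starts a
    new cluster. (Toy stand-in for embedding-based event clustering.)
    """
    vecs = np.asarray(vectors, dtype=float)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    labels = []
    seeds = []
    for v in vecs:
        for c, seed in enumerate(seeds):
            if v @ seed >= threshold:       # cosine sim of unit vectors
                labels.append(c)
                break
        else:
            labels.append(len(seeds))
            seeds.append(v)
    return labels

# Two paraphrases of one event plus one distinct event (made-up vectors).
vectors = [[1.0, 0.1, 0.0], [0.95, 0.15, 0.05], [0.0, 0.0, 1.0]]
print(fuse_events(vectors))  # → [0, 0, 1]
```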

  • Jiayue Lu, Xiaoli Chen, Xuezhao Wang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0704
    Online available: 2025-01-07

    [Objective] To explore the patterns and characteristics of the main actors in international scientific research collaboration from the perspective of research content. [Methods] Large language models and scientific memes are used to gauge content homogeneity; cooperation patterns are categorized by homogeneity and frequency, with a quantitative analysis of actor characteristics in terms of partners and content. [Results] An empirical study was carried out in the brain-computer interface domain. Core countries predominantly exhibit homogeneous and heterogeneous cooperation models, while free cooperation mainly involves exploratory ventures in emerging fields and collaborations of dependent small countries. Most nations lean toward international cooperation over domestic independence; for leading countries and dependent small countries, partner homogeneity is high in both multilateral and bilateral alliances. Nations such as China and the UK favor international cooperation grounded in their existing research content, whereas the US and Germany prefer collaborative research on new topics. [Limitations] The study's scope is limited to content homogeneity and omits a comprehensive analysis of other homogeneity factors. [Conclusion] This research enriches the analytical tools for international collaboration patterns, offering a new lens for understanding global scientific partnerships.
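    The abstract does not spell out how content homogeneity between two actors is computed; one simple proxy is the overlap between their sets of scientific memes. The set-overlap measure and the example meme strings below are assumptions for illustration only:

```python
def content_homogeneity(memes_a, memes_b):
    """Toy content-homogeneity measure between two actors.

    Computed as the Jaccard overlap of their scientific-meme sets:
    1.0 means identical research content, 0.0 means no shared memes.
    (The paper's LLM-based measure is not specified in the abstract;
    this is only an illustrative proxy.)
    """
    a, b = set(memes_a), set(memes_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical brain-computer-interface memes for two countries.
print(content_homogeneity({"EEG decoding", "motor imagery", "SSVEP"},
                          {"EEG decoding", "motor imagery", "neural implant"}))
# → 0.5  (2 shared memes out of 4 distinct memes)
```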

  • Zhang Yunqiu, Huang Qifei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0714
    Online available: 2025-01-07

    [Objective] Drawing on both the research platform and social media dimensions, this study proposes a knowledge combination discovery method that integrates multi-source semantic features reflecting the value and importance of knowledge elements, and that is sensitive to relationships between low- and medium-frequency knowledge elements. [Methods] First, semantic features affecting the value and importance of knowledge elements are designed and calculated from the perspectives of scientific research outputs and social media: the contribution of knowledge elements, the influence of the literature they belong to, and their social visibility. Then, a heat conduction model is constructed to integrate these multiple semantics and explore the latent associations between knowledge elements and literature. Finally, based on the resulting knowledge element-document network, weighted Jaccard coefficients between knowledge elements are calculated to achieve knowledge combination discovery. [Results] The information science literature in the CSSCI database and Baidu Baike were selected as data sources for visualization and retrospective literature validation. On the P@50, P@100, P@500, P@1000, and P@2000 indicators, the knowledge combination discovery method integrating the value and importance of knowledge elements improves by 0.3, 0.23, 0.184, 0.183, and 0.278, respectively. Combinations not yet verified by the literature, such as "government information resources - industry think tank alliance", "microblog comments - metaphor recognition", and "social impact theory - stochastic resonance", also show high combination potential and application interpretability. [Limitations] The discovered knowledge combinations have not been further evaluated and analyzed, and fine-grained semantic relationship mining between knowledge elements remains to be solved. [Conclusions] The proposed knowledge combination discovery method shows reliability and superiority, providing a methodological reference for future research on knowledge combination; the discovered combinations can offer decision-making suggestions for academic innovation and discipline development.
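    The last step of the method, the weighted Jaccard coefficient over the knowledge element-document network, has a standard form that can be sketched directly. The document weights below are made-up placeholders for whatever the heat conduction model would produce:

```python
def weighted_jaccard(w_a, w_b):
    """Weighted Jaccard coefficient between two knowledge elements.

    Each element is represented as a dict mapping documents to
    association weights (e.g. after heat-conduction reweighting).
    The coefficient is sum(min) / sum(max) over all documents,
    reducing to ordinary Jaccard when all weights are 0 or 1.
    """
    docs = set(w_a) | set(w_b)
    num = sum(min(w_a.get(d, 0.0), w_b.get(d, 0.0)) for d in docs)
    den = sum(max(w_a.get(d, 0.0), w_b.get(d, 0.0)) for d in docs)
    return num / den if den else 0.0

# Hypothetical element-document weights.
a = {"doc1": 0.8, "doc2": 0.3}
b = {"doc1": 0.4, "doc3": 0.6}
print(weighted_jaccard(a, b))  # 0.4 / 1.7 ≈ 0.235
```

    Element pairs with high coefficients but no co-occurrence in the existing literature are the candidate knowledge combinations the method surfaces.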