[Objective] U.S. congressional hearings generate vast amounts of text that are broad in scope and often expressed in colloquial language, posing challenges for intelligence analysis. This paper proposes a framework to automatically identify China’s technology security risks. [Methods] Starting from the features of hearings and the needs of analysts, we utilized large language models to integrate modules such as text filtering, summarization, and question-answering, thereby achieving high-quality intelligent identification. [Results] We examined our method with the 118th Congress hearing transcripts. The F1 score for text filtering, ROUGE-Lsum for summary generation, and the risk point recall rate for the QA system reached 0.7751, 0.6032, and 0.7636, respectively, significantly outperforming baseline models. [Limitations] Our method is primarily designed for U.S. Congressional hearing transcripts. Future work requires validation with more types of corpora to generalize and extend it into a universal approach. [Conclusions] The proposed method provides a powerful tool for extracting and analyzing technology-related intelligence from U.S. congressional hearings, offering valuable support for developing China’s technology security strategies.
[Objective] This paper aims to explore differences in knowledge creation rhythms between elite and ordinary researchers, thereby revealing the essential characteristics of academic careers. [Methods] This study adopts a knowledge creation capability perspective to analyse researchers across 19 disciplinary fields, focusing on knowledge sources and diffusion. The rhythmic features of active and dormant periods are then calculated. [Results] Elite researchers have an average of approximately 1.71 active periods, compared with about 1.39 for ordinary researchers, a difference of approximately 23.02%. The two groups have nearly the same number of dormant periods, averaging about 2.51 for elite researchers and about 2.52 for ordinary researchers. During the mid-career stage (years 6-15), the probability of elite researchers entering an active period is 28.22%, compared to 8.30% for ordinary researchers. Publication volume, citation counts, and collaborative relationships show a significant positive correlation with active periods. [Limitations] The study does not fully account for the complex interactions between disciplines or the influence of diverse cultural backgrounds on research rhythms. [Conclusions] The perspective of knowledge creation capability highlights differences in research rhythms, providing a theoretical basis for understanding academic career development and decision-making.
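The relative figures reported above can be checked from the stated averages. The following is a minimal arithmetic sketch, not the authors' code; the function name `relative_increase` is a hypothetical helper, and only the four numbers quoted in the abstract are used.

```python
def relative_increase(elite: float, ordinary: float) -> float:
    """Return the percentage by which `elite` exceeds `ordinary`."""
    return (elite - ordinary) / ordinary * 100.0


# Average number of active periods, as quoted in the abstract.
active_gap = relative_increase(1.71, 1.39)

# Mid-career (years 6-15) probabilities of entering an active period, in percent.
mid_career_ratio = 28.22 / 8.30

print(f"Elite researchers have {active_gap:.2f}% more active periods")
print(f"Mid-career probability ratio (elite / ordinary): {mid_career_ratio:.2f}")
```

Under these figures, the 23.02% gap follows directly from 1.71 and 1.39, and the mid-career probability for elite researchers is roughly 3.4 times that of ordinary researchers.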
[Objective] This study addresses the shortage of domain-adapted labeled data in emergency management and improves the effectiveness of event recognition. [Methods] We propose a continuous automatic labeling framework that integrates ensemble learning with semi-supervised learning. The framework is further combined with named entity recognition, co-occurrence network analysis, and sentiment analysis to construct a comprehensive system for recognizing sudden events. [Results] Using only 20% to 35% of the full dataset, the proposed mechanism achieves recognition performance comparable to, or even exceeding, that obtained with the full dataset. [Limitations] The current evaluation relies solely on data from the China News Service and emphasizes the mining of existing information, leaving opportunities to broaden data sources and expand application scenarios. [Conclusions] By grounding the framework in theory and validating it with empirical data, this research demonstrates the system’s practical effectiveness. It offers insights that may inform future studies on the recognition of emergency events.
[Objective] Existing keyword extraction methods often suffer from limited attention scope, weak semantic representation, and restricted generative ability. To address these challenges, this paper proposes a patent keyword extraction approach (LLM-PKE) that integrates large language models with multi-feature networks. [Methods] LLM-PKE comprises three modules. In the extraction module, topic information is embedded into a Transformer attention network, combined with Graph Convolutional Networks to enhance sensitivity to topic terms and improve feature extraction. In the generative module, large language models produce keywords highly relevant to patent texts. In the ranking module, the large language model generates similarity scores for each keyword to remove synonyms and less relevant terms, yielding refined patent keywords. [Results] Compared to the best-performing baseline model, the proposed method improves the F1@5 metric by 1.98 percentage points. [Limitations] We use semantic similarity thresholds to remove redundant keywords; however, varying similarity standards across patent texts may limit accuracy and generalizability. [Conclusions] The LLM-PKE model outperforms existing approaches on patent datasets, offering a more effective solution for patent keyword extraction.
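For readers unfamiliar with the F1@5 metric cited above, the sketch below shows one common way to compute it for a single document. It assumes exact string matching between predicted and gold keywords; the abstract does not state the matching or averaging rules used in the paper, and the function name and sample keywords are hypothetical.

```python
def f1_at_k(predicted: list[str], gold: list[str], k: int = 5) -> float:
    """F1 between the top-k predicted keywords and the gold keyword set.

    Assumes exact string matching and no duplicate predictions.
    """
    top_k = predicted[:k]
    gold_set = set(gold)
    hits = sum(1 for keyword in top_k if keyword in gold_set)
    if hits == 0:
        return 0.0
    precision = hits / len(top_k)
    recall = hits / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical ranked predictions and gold keywords for one patent text.
predicted = ["battery", "electrode", "anode", "lithium", "separator", "cathode"]
gold = ["battery", "anode", "cathode", "electrolyte"]

score = f1_at_k(predicted, gold, k=5)
print(f"F1@5 = {score:.4f}")
```

Here "cathode" is ranked sixth and therefore does not count at k = 5, which illustrates why the ranking module's ordering affects the metric as much as the candidate set does.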
[Objective] To bridge the cross-modal semantic gap and enhance aspect-related image feature extraction, this paper proposes a multi-modal aspect-level sentiment analysis model based on multi-perspective fusion representation. The model captures fine-grained cross-modal sentiment expressions from global and local perspectives. [Methods] First, the text and image descriptions are jointly encoded from a global perspective and combined with a multi-head self-attention mechanism to capture cross-modal global semantic features. Second, two graph structures are constructed to mine fine-grained sentiment information from text and images from the local perspective. A syntactic dependency graph is introduced into the text graph structure to enhance text syntactic feature extraction. In the fusion graph structure, dilated convolution is used to expand the receptive field, extracting key information from image patches and strengthening inter-patch feature associations. A multi-head cross-attention mechanism further guides the model to focus on aspect-related image features. Finally, global and local fine-grained sentiment information is integrated for aspect-level sentiment classification. [Results] The accuracy and F1 values of the proposed model are higher than those of the baseline models on both the Twitter-2015 and Twitter-2017 datasets. Compared with the second-best model, the accuracy and F1-score improve by 0.44% and 1.51% on Twitter-2015, and by 0.54% and 0.72% on Twitter-2017, respectively. [Limitations] The generalizability of the model has not yet been validated across a broader range of datasets. [Conclusions] The proposed model effectively narrows the semantic gap between modalities and fully extracts the aspect-related image features, which improves the sentiment classification performance.
[Objective] This study aims to enhance the efficiency of policy information retrieval, enable intelligent analysis and comparison of policies, and provide precise decision support for policy formulation by constructing a structured policy knowledge base. [Methods] Using pro-business policies as a case study, we propose a framework based on large language models for efficiently comparing related policies. The framework consists of three core steps: knowledge base construction, retrieval and storage, and answer generation. [Results] Validation on datasets of pro-business policies demonstrates that the framework can automatically integrate multiple policies and perform semantic analysis to construct a knowledge base, supporting policy matching and comparative analysis. The Chroma-RAG model demonstrates clear advantages, achieving 60% on the Hit@1 index, 76% on the Hit@3 index, and 71.13% on the MRR index. Compared with traditional models such as TF-IDF, Word2Vec, USE, BERT, SBERT, DPR, and SimCSE, Chroma-RAG outperforms across retrieval metrics, underscoring the superiority of the proposed framework. [Limitations] The study primarily relies on cross-sectional data, which cannot capture the dynamic evolution of policies during implementation, thereby limiting deeper evaluation of policy impacts. [Conclusions] Knowledge base construction and policy comparison leveraging large language models significantly improve the intelligent analysis and comparison of policy texts. In particular, the approach offers strong decision-support capabilities in policy knowledge base development and comparative policy evaluation, providing valuable guidance for policymakers.
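The retrieval metrics reported above (Hit@1, Hit@3, and MRR) can be illustrated with a short sketch. It assumes one gold policy per query and uses the standard definitions of the metrics; the function names and policy identifiers are hypothetical and are not taken from the paper.

```python
def hit_at_k(ranked_lists: list[list[str]], gold: list[str], k: int) -> float:
    """Fraction of queries whose gold item appears in the top-k results."""
    hits = sum(1 for ranked, target in zip(ranked_lists, gold) if target in ranked[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked_lists: list[list[str]], gold: list[str]) -> float:
    """Average of 1/rank of the gold item; a missing item contributes 0."""
    total = 0.0
    for ranked, target in zip(ranked_lists, gold):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(gold)


# Hypothetical retrieval results for three queries, best match first.
ranked_lists = [
    ["policy_1", "policy_2", "policy_3"],
    ["policy_4", "policy_1", "policy_2"],
    ["policy_3", "policy_2", "policy_5"],
]
gold = ["policy_1", "policy_1", "policy_9"]

hit1 = hit_at_k(ranked_lists, gold, k=1)
hit3 = hit_at_k(ranked_lists, gold, k=3)
mrr = mean_reciprocal_rank(ranked_lists, gold)
print(f"Hit@1 = {hit1:.2%}, Hit@3 = {hit3:.2%}, MRR = {mrr:.2%}")
```

Because MRR rewards a correct result at any rank while Hit@1 counts only the top position, an MRR that sits between Hit@1 and Hit@3, as in the reported results, is the expected pattern.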
[Objective] This study aims to develop an approach for the early identification of patents with high disruptive potential by analyzing patent data in emerging technological domains. [Methods] Based on technology life cycle theory, we construct an indicator system for detecting disruptive technologies and apply it to the quantum computing domain using patent data from the PatSnap database. An ensemble learning model is employed to identify patents with strong disruptive potential. [Results] Leveraging the BERTopic topic modeling framework, we identify five prominent disruptive research fronts: quantum encryption, quantum processors, superconducting qubits, semiconductor-based quantum technologies, and quantum neural networks. These findings demonstrate the effectiveness and feasibility of the proposed method. [Limitations] The empirical analysis is restricted to quantum computing, without extending to other critical technology fields. Moreover, the framework and indicators are solely based on patent data, suggesting opportunities for incorporating additional data sources. [Conclusions] The proposed method provides a systematic approach for the early recognition of highly disruptive patents and the mapping of disruptive research trajectories. The findings offer valuable insights to inform the formulation and implementation of national science and technology strategies.
[Objective] This study proposes a dynamic topic modelling approach for short texts, driven by fine-tuned large language models (LLMs), aiming to ensure high accuracy in topic identification and to reveal patterns in topic evolution. [Methods] The proposed method integrates instruction tuning, retrieval-augmented generation (RAG) and clustering techniques to improve the performance of topic identification. Topic evolution is modelled by establishing topic mapping relationships and conducting a time-sequential statistical analysis. [Results] Experiments on four short text datasets demonstrate that the proposed method outperforms the second-best baseline by an average of 6.15 and 7.71 percentage points in terms of topic coherence (TC) and topic diversity (TD), respectively. Ablation studies further evaluate the individual contributions of fine-tuning, RAG and clustering to the overall performance. Additionally, the method reveals distinct topic evolution patterns across datasets, including M-shaped and L-shaped trends. [Limitations] The current method does not use knowledge graphs to optimise the RAG component further, and its generalisability has yet to be validated across diverse, domain-specific short text corpora. [Conclusions] The proposed approach clearly outperforms others in terms of both topic identification accuracy and the ability to capture meaningful patterns of topic evolution.
[Objective] This study aims to exploit semantic associations across Chinese documents to enhance the performance of document-level event extraction. [Methods] We propose a Chinese document-level event extraction model (CSDEE) based on interactive semantic enhancement. The model utilizes an attention mechanism to construct a cross-document interactive semantic network that enhances entity recognition. The event extraction task is then completed through document encoding and event information decoding. [Results] The CSDEE model achieves a precision of 80.7%, a recall of 84.1%, and an F1 score of 82.3% in event extraction, outperforming existing baseline models. Ablation studies and generalization experiments on the ChFinAnn and DuEE-fin datasets further confirm the efficacy of CSDEE in Chinese document-level event extraction tasks. [Limitations] The current work focuses on improving document-level event extraction performance, without yet addressing multi-classification tasks involving overlapping event types. [Conclusions] Leveraging semantic similarities and associations across related documents can significantly improve the accuracy and robustness of document-level event extraction.
[Objective] This study aims to optimize large language models (LLMs) to improve the quality of relation extraction in Chinese. [Methods] We propose a low-cost fine-tuning model based on multi-dimensional self-reflective learning (SRLearn). It automatically guides large language models to iteratively reflect on and refine their outputs across multiple dimensions, thereby enhancing the generation quality of Chinese relation extraction results. [Results] Compared to the LoRA+DPO fine-tuning approach, SRLearn achieves performance gains of 15 percentage points on the WikiRE1.0 dataset and 6.7 percentage points on the DuIE2.0 dataset, validating its effectiveness. [Limitations] Future research needs to address more generation quality issues. [Conclusions] The proposed model significantly improves the quality of Chinese relation extraction.
[Objective] This paper proposes an automated method for privacy policy compliance analysis that balances both completeness and consistency, without requiring annotated training samples. [Methods] Drawing on the Personal Information Protection Law of the People’s Republic of China and related regulations and standards, we construct a compliance evaluation framework along two dimensions: completeness and consistency. Then, we develop a Knowledge-Integrated Prompt Learning (KIPL) model, which fine-tunes a pretrained language model with domain knowledge and leverages prompt templates to enable zero-shot compliance analysis. Finally, the model is used to analyze the privacy policies of apps across 14 domains from the Xiaomi App Store. [Results] KIPL outperforms baseline methods by more than 3% in precision and recall on domain-specific datasets. The empirical analysis further uncovers compliance gaps across domains, particularly in areas such as children's privacy and data security. [Limitations] The evaluation sample size is relatively small. [Conclusions] The KIPL model, by combining completeness and consistency analysis, enables automated, zero-shot evaluation of privacy policy compliance, improving performance while reducing costs. The findings not only provide actionable guidance for app developers to refine privacy policies but also deliver valuable compliance insights for regulators, supporting the harmonization and advancement of industry standards.
[Objective] This study introduces convex hull knowledge distillation to enhance the accuracy and efficiency of lightweight models in time series forecasting. [Methods] We propose KDConv, a novel knowledge distillation method that leverages convex hull theory to characterize the shape and distribution of time series data. By refining the distillation loss function, KDConv addresses the limitations of traditional Mean Squared Error (MSE) in capturing periodic and trend features. [Results] Experiments across multiple datasets show that KDConv achieves an average improvement of 2.48% in MSE over existing knowledge distillation methods, demonstrating its superior effectiveness. [Limitations] The method’s evaluation is constrained by the limited diversity of available datasets, and its performance may vary with different types of time series. Future research should further examine its generalizability and robustness. [Conclusions] KDConv significantly improves the performance of lightweight models in forecasting time series with strong periodicity and trends, offering a promising direction for efficient predictive modeling.
[Objective] This study proposes a novel framework, RSM-OC, aiming to overcome the limited consideration of node positions and overlapping community characteristics in rumor suppression. [Methods] RSM-OC employs trust centrality to accurately identify key nodes, integrates overlapping nodes to create a candidate seed set, and applies a genetic algorithm to optimize the positive seed selection. A one-way state-transition linear threshold model is then used to simulate the competitive diffusion between rumors and truth. [Results] Experiments on four real-world datasets show that RSM-OC improves the rumor suppression rate by 23.3% on average compared to the baseline algorithm, and approximately doubles the spread of truth. The framework performs well in dense and medium-scale networks. [Limitations] The computational cost of RSM-OC increases substantially in large-scale networks and may lead to performance bottlenecks. [Conclusions] RSM-OC demonstrates strong effectiveness in both rumor suppression and truth propagation range expansion.
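The competitive diffusion step described above can be sketched as a toy simulation. This is an illustration under stated assumptions, not the RSM-OC implementation: the abstract does not specify the transition rules, so the sketch assumes uniform edge weights, synchronous updates, truth taking priority over rumor, and a one-way transition in which a rumor node may switch to truth but a truth node never switches back. The function name, graph, and thresholds are hypothetical.

```python
def simulate(adj, thresholds, rumor_seeds, truth_seeds, max_steps=50):
    """Run a competitive linear threshold process until it stabilises.

    adj maps each node to its neighbour list; thresholds maps each node
    to a value in [0, 1]. Assumed rule: rumor -> truth is allowed,
    truth -> rumor is not (the "one-way" transition).
    """
    state = {node: "inactive" for node in adj}
    for node in rumor_seeds:
        state[node] = "rumor"
    for node in truth_seeds:
        state[node] = "truth"
    for _ in range(max_steps):
        new_state = dict(state)
        for node, neighbours in adj.items():
            if state[node] == "truth" or not neighbours:
                continue  # truth is absorbing; isolated nodes never change
            truth_share = sum(state[n] == "truth" for n in neighbours) / len(neighbours)
            rumor_share = sum(state[n] == "rumor" for n in neighbours) / len(neighbours)
            if truth_share >= thresholds[node]:
                new_state[node] = "truth"
            elif state[node] == "inactive" and rumor_share >= thresholds[node]:
                new_state[node] = "rumor"
        if new_state == state:
            break
        state = new_state
    return state


def count(state, label):
    """Number of nodes currently holding the given label."""
    return sum(1 for value in state.values() if value == label)


# Hypothetical 6-node network: a path 0-1-2-3-4 plus a leaf 5 attached to 0.
adj = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3], 5: [0]}
thresholds = {0: 0.6, 1: 0.5, 2: 0.5, 3: 0.5, 4: 0.5, 5: 0.5}

without_truth = simulate(adj, thresholds, rumor_seeds={0}, truth_seeds=set())
with_truth = simulate(adj, thresholds, rumor_seeds={0}, truth_seeds={4})
print("Rumor nodes without truth seeds:", count(without_truth, "rumor"))
print("Rumor nodes with truth seed at node 4:", count(with_truth, "rumor"))
```

In a full pipeline, a seed-selection procedure such as the genetic algorithm mentioned in the abstract would call a simulator like this repeatedly to score candidate truth seed sets by how many rumor nodes remain.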