
25 February 2025, Volume 9 Issue 2
    

  • Li Shuyu, Zhu Guangli, Li Jiawei, Duan Wenjie, Zhou Ruotong, Zhang Shunxiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 1-11. https://doi.org/10.11925/infotech.2096-3467.2023.1376

    [Objective] To address the issue of feature sparsity in Chinese ironic short texts, this paper proposes a sarcasm detection method integrating hyperbolic representations. It aims to enhance the accuracy of Chinese sarcasm recognition by extracting hyperbolic representations from short texts. [Methods] Firstly, we used pointwise mutual information and semantic similarity computation to obtain co-occurring word pairs, interjections, and degree adverbs related to sarcasm, and merged these word sets to construct a hyperbolic representation lexicon. Then, we used regular expressions to match sarcastic texts and obtained sequences of special punctuation marks, whose features we extracted with one-hot encoding. We employed the RoBERTa-wwm-ext model to extract semantic features from the text and used WoBERT to transform the words and word pairs in the hyperbolic representation lexicon into dynamic word vectors, obtaining the hyperbolic representation. Finally, we introduced an improved multi-attention mechanism to focus on text semantics, hyperbolic representations, and special punctuation features and obtained the recognition results through the Softmax function. [Results] We evaluated the proposed method on the merged, publicly available Ciron and ChineseSarcasm-Corpus datasets, achieving an accuracy of 81.49% and an F1 value of 81.24%. [Limitations] The constructed hyperbolic representation lexicon relies on corpus quality and has limited generalization ability. [Conclusions] The proposed method can effectively enrich semantic representation and improve the accuracy of Chinese sarcasm detection.
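    The pointwise mutual information step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `pmi_pairs`, the document-level co-occurrence window, the `min_count` threshold, and the toy English tokens are all assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_pairs(docs, min_count=2):
    """Return {(w1, w2): PMI} for word pairs that co-occur in the same document.

    Probabilities are estimated from document frequencies (an assumption;
    a sliding window over tokens would be another common choice).
    """
    n_docs = len(docs)
    word_df = Counter()  # number of documents containing each word
    pair_df = Counter()  # number of documents containing both words
    for tokens in docs:
        vocab = sorted(set(tokens))
        word_df.update(vocab)
        pair_df.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), count in pair_df.items():
        if count < min_count:
            continue  # drop rare pairs, whose PMI estimates are unreliable
        p_pair = count / n_docs
        p1 = word_df[w1] / n_docs
        p2 = word_df[w2] / n_docs
        scores[(w1, w2)] = math.log2(p_pair / (p1 * p2))
    return scores


# Hypothetical tokenized short texts, invented for this example.
docs = [
    ["really", "great", "monday"],
    ["really", "great", "traffic"],
    ["nice", "weather"],
    ["nice", "monday"],
]
scores = pmi_pairs(docs)
print(scores)
```

    Pairs with high PMI co-occur more often than their individual frequencies would predict, which is why such pairs are candidates for a lexicon of this kind.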

  • Song Donghuan, Hu Maodi, Ding Jielan, Qu Zihao, Chang Zhijun, Qian Li
    Data Analysis and Knowledge Discovery. 2025, 9(2): 12-25. https://doi.org/10.11925/infotech.2096-3467.2023.0885

    [Objective] This study addresses the issue of low classification accuracy in conventional text classification tasks due to factors such as sparse domain-specific training data and significant differences between types. [Methods] We constructed a novel classification model based on the BERT-DPCNN-MMOE framework, integrating the deep pyramid convolutional networks with the multi-gate control unit mechanism. Then, we designed multi-task and transfer learning experiments to validate the effectiveness of the new model against eight well-established and innovative models. [Results] This research independently constructed cross-type multi-task data as the basis for training and testing. The BERT-DPCNN-MMOE model outperformed the other eight baseline models in multi-task and transfer learning experiments, with F1 score improvements exceeding 4.7%. [Limitations] Further research is needed to explore the model’s adaptability to other domains. [Conclusions] The BERT-DPCNN-MMOE model performs better in multi-task and cross-type text classification tasks. It is of significance for future specialized intelligence classification tasks.

  • Chen Wenjia, Yang Lin, Li Jinlin
    Data Analysis and Knowledge Discovery. 2025, 9(2): 26-38. https://doi.org/10.11925/infotech.2096-3467.2023.1435

    [Objective] This study builds a more accurate semantic similarity classification model based on intent recognition to provide precise answer-matching results for Chinese medical and health Q&A services. [Methods] We integrated BERT and Convolutional Neural Networks (CNN) to construct an intent recognition model, which is used as an embedding layer to develop an intent-recognizing twin BERT (ITBERT) semantic classification model. [Results] On the CHIP-STS dataset, compared to single BERT and TextCNN models, the integrated model improved the Top-1 accuracy of intent recognition by 8.2% and 1.5%, reaching 73.6%. The Top-3 accuracy improved by 7.6% and 3.2%, reaching 91.2%, demonstrating the new model’s enhanced intent recognition effectiveness. For semantic similarity classification, the ITBERT model improves the AUC value by 0.015 to 0.087 compared to benchmark models, proving that embedding intent knowledge improves the effectiveness of medical semantic similarity classification. [Limitations] Manually annotated intent information may contain biases and affect the classification results of semantic similarity. [Conclusions] The proposed model can improve intent recognition in medical and health Q&A services. Embedding intent knowledge enhances the accuracy of semantic similarity classification models, contributing to more precise automated Q&A services.
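    For readers unfamiliar with the Top-1 and Top-3 accuracy figures reported above, the metric can be sketched as follows. The function `top_k_accuracy` and the score matrix are hypothetical and are not taken from the paper.

```python
def top_k_accuracy(score_rows, labels, k):
    """Fraction of samples whose true label is among the k highest-scored classes."""
    hits = 0
    for scores, label in zip(score_rows, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Invented per-class scores for four samples over three intent classes.
score_rows = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 0, 2]

top1 = top_k_accuracy(score_rows, labels, 1)
top2 = top_k_accuracy(score_rows, labels, 2)
top3 = top_k_accuracy(score_rows, labels, 3)
print(top1, top2, top3)
```

    Top-k accuracy never decreases as k grows, so a Top-3 figure is always at least as high as the corresponding Top-1 figure.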

  • Yu Juan, Zhao Huiyun, Wu Shaocheng, Xi Yunjiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 39-47. https://doi.org/10.11925/infotech.2096-3467.2023.1441

    [Objective] This study aims to reduce the semantic deviation and loss caused by language differences and text feature selection in the text classification process while preserving more textual information. [Methods] Firstly, we used a pre-trained SBERT model for sentence representation. Secondly, we calculated the sentence similarity between texts with a Sentence Vectors Rotator’s Similarity method. We also applied sentence weighting within texts to form vectors. Finally, we combined machine learning and neural network classification methods to achieve cross-lingual text classification. [Results] Experiments on multiple cross-lingual text datasets in Chinese, English, Russian, French, and Spanish, as well as the multilingual public dataset Reuters, demonstrated that the proposed method significantly improved accuracy compared to existing methods. Recall, precision, and F1 scores also improved. [Limitations] The study does not consider the impact of sentence position within the text on its weight. [Conclusions] The proposed model could reduce semantic deviation and loss, thus improving the performance of cross-lingual text classification.

  • Zhang Jing, Gao Zixin, Ding Weijie
    Data Analysis and Knowledge Discovery. 2025, 9(2): 48-58. https://doi.org/10.11925/infotech.2096-3467.2023.1347

    [Objective] This paper proposes a new model to effectively classify massive police reports. [Methods] We constructed a text classification model based on BERT-DPCNN. Then, we used the BERT pre-trained model to generate word vectors. The model improved the classification performance by optimizing the activation function in the DPCNN model and enhancing the dynamic learning rate. [Results] We conducted comparative experiments between BERT-DPCNN and six other models, including BERT, BERT-CNN, BERT-RCNN, BERT-RNN, BERT-LSTM, and ERNIE. The BERT-DPCNN achieved the best accuracy, recall, and precision. In the binary classification tasks, the accuracy of BERT-DPCNN exceeded 98%. In the eleven-category tasks, the model’s accuracy exceeded 82%. [Limitations] The model has many parameters, and the limited number of experiments calls for further testing. [Conclusions] The new model effectively improves the accuracy of police report classification, providing data support for police departments in analyzing and assessing police incidents.

  • Pan Hongpeng, Liu Zhongyi
    Data Analysis and Knowledge Discovery. 2025, 9(2): 59-70. https://doi.org/10.11925/infotech.2096-3467.2024.0014

    [Objective] This paper aims to identify social network rumor spreaders by leveraging multi-modal data. [Methods] Given the multi-modal nature of rumor propagation and the imbalance in user sample distribution, we first applied an oversampling technique to the raw data. Then, we deeply integrated traditional user attributes and microblogging features with multi-modal information extracted from user-generated content. Third, we constructed the intelligent identification method for social network rumor spreaders, which effectively integrates diverse user features based on the XGBoost model. Additionally, SHAP values were embedded in the model’s output layer to enhance algorithmic interpretability. [Results] The XGBoost model achieves optimal overall performance after sample balancing, with a 12.3% improvement in recall. The identification method incorporating multi-modal information features can attain an accuracy of 0.912, 2.5% higher than the control group. [Limitations] This paper only considered text and image modalities. Future research can be expanded by incorporating audio and video data. [Conclusions] The proposed model can effectively identify social network rumor spreaders.
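    The abstract above does not say which oversampling technique was applied. As a minimal sketch of the general idea, the example below uses plain random oversampling, which duplicates minority-class samples until every class matches the largest one. The function name `random_oversample` and the toy data are assumptions for illustration only.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw with replacement from the class's own samples to fill the gap.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Invented one-feature samples: label 1 stands for the rare class.
samples = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = [0, 0, 0, 0, 1]

bal_x, bal_y = random_oversample(samples, labels)
counts = Counter(bal_y)
print(counts)
```

    Oversampling should be applied to the training split only, since duplicating samples before the train/test split would leak copies of training items into the test set.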

  • Li Ying, Li Ming
    Data Analysis and Knowledge Discovery. 2025, 9(2): 71-80. https://doi.org/10.11925/infotech.2096-3467.2023.1174

    [Objective] This paper proposes a dynamic generation method for question descriptions based on the Generator-Evaluator Framework to capture and retrieve Q&A content. [Methods] In the Generator module, we constructed a Q&A encoding layer integrating multiple attention mechanisms. Then, we improved the pointer-generator network with bidirectional attention weights to establish the decoding layer. In the Evaluator module, we constructed a hybrid evaluator by combining reinforcement learning and cross-entropy to optimize the Generator. We also designed reward functions specific to question descriptions to build the optimal question description generation model. [Results] We conducted an experiment using the public dataset “webtext2019zh”, and the proposed method improved the syntactic and semantic performance by 15.26% and 3.34%, on average. [Limitations] This study only focused on the question titles and answers without incorporating answer comments to construct a richer reward function. [Conclusions] The proposed method can generate question descriptions that cover the original question content and reflect the latest answer knowledge.

  • Zhang Kai, Lv Xueqiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 81-93. https://doi.org/10.11925/infotech.2096-3467.2023.1298

    [Objective] Taking personification as a representative of unmarked rhetorical categories, this study explores a multidimensional fusion recognition strategy, which holds significance for Chinese rhetorical computing. [Methods] Based on dependency syntax theory, we constructed a cognitive model for generating and understanding personification rhetorical figures through a cognitive framework. Then, we proposed a multidimensional feature fusion automatic recognition method for personification (WPGBA). This method represents and integrates multiple features of rhetorical texts, including word vectors, syntax vectors, part-of-speech vectors, and contextual semantics, using Chinese language textbooks from the K-12 curriculum as experimental data. [Results] We trained the automatic recognition model using the WPGBA method. Experiments showed that the method achieved an accuracy of 90.40%, a recall rate of 87.58%, and an F1 score of 88.65%. Compared to other methods in the experimental group, the accuracy rate was increased by at least 6.27%. [Limitations] New complex sentences may arise in practical applications such as discourse reading comprehension and language proficiency evaluation. Due to the limited scale of the experimental dataset, the generalization ability of the algorithm is restricted. [Conclusions] The integration strategy of expressive and contextual semantic features designed from a cognitive perspective shows good recognition performance for personification rhetorical devices in unmarked categories.

  • Wang Zitong, Li Chenliang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 94-105. https://doi.org/10.11925/infotech.2096-3467.2023.1305

    [Objective] To more flexibly capture the spatial-temporal features of traffic flow data and achieve more accurate multivariate traffic flow prediction, this paper proposes a Position-Aware Spatial-Temporal Graph Convolutional Network (PASTGCN). [Methods] First, the traffic data’s spatial and periodic temporal position features are represented as explicit position embeddings. Then, based on the spatiotemporal convolutional structure, we incorporated spatial information into the temporal convolutional network for space-aware sequence modeling. Finally, we used static and dynamic dual graph learning methods to capture spatial dependencies. [Results] We conducted experiments on two real-world traffic flow datasets. The PASTGCN model effectively predicted multivariate traffic flows and reduced errors by up to 1.59% compared to existing deep learning models. [Limitations] The experimental datasets are limited, and the proposed graph learning method increased the time complexity. [Conclusions] The PASTGCN model can effectively utilize spatial-temporal position information to achieve more accurate traffic flow prediction.

  • Qiu Jiangnan, Xu Xuedong, Lu Yanxia, Yang Zhilong
    Data Analysis and Knowledge Discovery. 2025, 9(2): 106-119. https://doi.org/10.11925/infotech.2096-3467.2023.1371

    [Objective] This paper identifies and classifies issues from public appeals. It also explores regional differences in issue types and response rates. [Methods] Taking pension insurance disputes as an example, the ERNIE model was enhanced with knowledge and data through domain-specific vocabulary construction, key appeal content extraction, and simple data augmentation. An ERNIE-BiLSTM contradiction identification and classification model was developed to deeply analyze contradictions in public appeals in low-data-resource scenarios, addressing existing studies’ lack of quantitative methods for social contradiction analysis. Finally, a differentiation analysis of contradictions was conducted based on the classification results. [Results] During the data collection period, pension insurance payment-related conflicts were more frequent in Henan and Liaoning provinces, while pension insurance service-related conflicts were more prevalent in Guangdong Province and Beijing. Significant differences in response rates were observed across different types of contradictions. [Limitations] This paper does not consider the correlation between different types of conflicts. [Conclusions] This paper reveals the inter-provincial differences in pension insurance disputes, providing governments with insights into hotspots and trends to assist in decision-making.

  • Zhai Dongsheng, Zhai Liang, Liang Guoqiang, Zhao Kai
    Data Analysis and Knowledge Discovery. 2025, 9(2): 120-133. https://doi.org/10.11925/infotech.2096-3467.2023.1277

    [Objective] This study proposes a method for identifying technological evolution paths and explores key technologies and branches in specific domains. It aims to reveal the evolution trajectories of technology. [Methods] Firstly, we devised an unsupervised graph embedding model to integrate patent structural relationships, text and node information propagation, and aggregated knowledge into multi-dimensional semantic vectors. This approach expanded the technological paths while improving community division effectiveness. Secondly, we proposed methods for expanding the main path and derivative paths from the perspective of network topology and semantic correlation. Finally, we constructed a metric for technological junction points to identify the promising fields. [Results] We examined the new method with drone flight control system technology and identified four subfields’ technological evolution paths and branches. We found that pattern recognition, multiprocessor, and data fusion technologies hold promising prospects. [Limitations] Our identification framework does not incorporate the formation mechanism of technological evolution patterns. [Conclusions] The proposed method demonstrates significant advantages in path expansion effectiveness and application versatility.

  • Chen Jing, Zhao Yuke, Lu Quan, Zhang Lu
    Data Analysis and Knowledge Discovery. 2025, 9(2): 134-145. https://doi.org/10.11925/infotech.2096-3467.2023.1393

    [Objective] This paper reveals the mechanism of user cognitive differences brought about by the interpretability features of navigation tools. [Methods] Two text-based navigation tools, THC-DAT and BOOKMARK, were selected based on themes and directories. Using eye-tracking technology and the Mann-Whitney test, we explored the cognitive differences in completing reading tasks caused by interpretability features such as topic coverage, navigation accuracy, and semantic readability. [Results] The importance of interpretability features of navigation tools varies with task difficulty. For low-difficulty tasks, navigation accuracy significantly impacts cognitive efficiency, cognitive performance, and navigation-assisted cognitive strategy. For high-difficulty tasks, semantic readability significantly affects cognitive efficiency. [Limitations] The sample size was limited and structurally homogeneous. Cognitive differences were only compared between the two types of navigation tools. [Conclusions] This study provides a new perspective for improving the knowledge organization service level of reading navigation tools and optimizing user reading quality.
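    The Mann-Whitney test named in the abstract above compares two independent groups through the ranks of their pooled observations. The sketch below computes only the U statistics, with average ranks for ties, and leaves out the p-value. The function `mann_whitney_u` and the measurements are invented for illustration and do not come from the study.

```python
def mann_whitney_u(a, b):
    """Return (U_a, U_b) for two independent samples, using average ranks for ties."""
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)

    i = 0
    while i < len(pooled):
        j = i
        # Extend j to cover the run of values tied with pooled[i].
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[t] = avg_rank
        i = j + 1

    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    n_a, n_b = len(a), len(b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Hypothetical task completion times (seconds) for two navigation tools.
tool_a = [1, 2, 3]
tool_b = [4, 5, 6]
u_separated = mann_whitney_u(tool_a, tool_b)

# A second invented pair of samples that share a tied value.
u_tied = mann_whitney_u([1, 2], [2, 3])
print(u_separated, u_tied)
```

    A U value of 0 for one group means every observation in that group ranks below every observation in the other, which is the most extreme separation the test can register.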

  • Ni Yuan, Hua Junpeng, Zhang Jian, Yang Cuifen, Zhang Teng
    Data Analysis and Knowledge Discovery. 2025, 9(2): 146-158. https://doi.org/10.11925/infotech.2096-3467.2023.1341

    [Objective] This paper integrates sentiment features into the prediction model for danmaku video propagation effects to improve prediction performance and to quantify the impact of various feature variables using model interpretability. [Methods] We extracted sentiment features influencing danmaku video propagation using the BERT-BiLSTM model. Then, we proposed a combined prediction model based on PCA-CVRFE-RF-XGBoost to predict the propagation effect of danmaku videos. Finally, we conducted an empirical analysis using propagation data from 1,515 cultural danmaku videos. [Results] Thirty-one variables were identified, covering three aspects: information quality, source credibility, and perceived quality of information dissemination. For sentiment feature extraction, the BERT-BiLSTM model achieved precision rates of 0.81 and 0.85 for the positive and negative classes on the test set, with an F1 score of 0.84. Our prediction model based on CRFE-RFR-XGBoost showed an improvement across four evaluation metrics compared to SVM and BP neural network models. [Limitations] The granularity of sentiment analysis for danmaku text requires further refinement. [Conclusions] The proposed model provides a novel approach for predicting the propagation effects of danmaku videos with complex and highly dynamic sentiment features. Empirical results show that source credibility contributes more to propagation effects than information quality. Key features include media platform reputation, media platform professionalism, personal influence, and content publishing frequency.

  • Chen Ting, Ding Honghao, Zhou Haoyu, Wu Jiang
    Data Analysis and Knowledge Discovery. 2025, 9(2): 159-171. https://doi.org/10.11925/infotech.2096-3467.2023.1424

    [Objective] This study explores the impacts of bullet-screen (danmu) content and behavioral characteristics on consumers’ purchasing behavior in live-streaming e-commerce, as well as the moderating effect of host-product relevance. [Methods] First, we retrieved the bullet-screen data from the Douyin platform and the consumer data from the Huitun platform based on the Elaboration Likelihood Model. Then, we studied the impacts of bullet-screen content characteristics (central route) and behavior characteristics (peripheral route) on consumer purchasing behavior with text mining and zero-inflated negative binomial regression. We also discussed the moderating effect of host-product relevance with grouping regression. [Results] Information richness, social interaction degree, and number of bullet-screen comments positively impact purchasing behavior. The emotional polarity of bullet-screen comments exhibits an inverted U-shaped effect on purchasing behavior. Compared with live-streaming rooms with low host-product relevance, those with high host-product relevance have broader positive impacts on purchasing behavior. [Limitations] We only investigated the bullet-screen data from a single live-streaming e-commerce platform. [Conclusions] This study examines the factors influencing consumers’ actual purchasing behavior from the perspective of bullet-screen comments. It provides insights for improving communication between merchants and consumers in live-streaming e-commerce, ultimately enhancing sales performance.