Online first

The manuscripts published below will continue to be available from this page until they are assigned to an issue.
  • Yu Juan, Zhao Huiyun, Wu Shaocheng, Xi Yunjiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1441
    Online available: 2024-10-23

    [Objective] To reduce the semantic deviation and semantic loss caused by language diversity and text feature selection, so as to retain more text information in cross-lingual text classification. [Methods] First, cross-lingual sentences are represented using pretrained SBERT models. Second, the similarity of sentences from different cross-lingual texts is calculated with a proposed method, Sentence Vectors Rotator’s Similarity (SVRS), and each text vector is generated from weighted sentence vectors. Finally, cross-lingual texts are classified by integrating machine learning methods and neural network classification methods. [Results] Experiments on multiple cross-lingual text datasets in Chinese, English, Russian, French, and Spanish and on the multilingual public dataset Reuters show that the proposed method significantly improves on existing methods in accuracy and also performs better on other common classification metrics, including recall, precision, and F1 score. [Limitations] The sentence weighting does not consider the position at which a sentence appears in the text. [Conclusions] Text representation based on weighted sentence vectors can reduce semantic deviation and semantic loss, thereby improving the performance of cross-lingual text classification.

  • ZHANG Lanze, GU Yijun, Peng Jingjie
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1009
    Online available: 2024-10-23

    [Objective] To improve the accuracy of graph neural networks in credit fraud detection by introducing topology analysis. [Methods] We propose PSI-GNN, a deep graph fraud detection model that incorporates prior structural information. First, attribute information characterizing the topology around the central node is embedded in the feature vector through structural information encoding. Second, message passing is divided into proximal and distal parts: one uses a shallow GNN model to aggregate information from proximal nodes, and the other uses random-walk structural similarity to filter and aggregate distal homophilic information. Finally, the two node embeddings are fused to perform fraud identification. [Results] Experimental results show that, compared with nine graph neural network models, PSI-GNN achieves improvements of 2.62%, 4.55% and 4.67%, 2.33% in Macro-F1 and AUC on the DGraph-Fin and TFinance datasets, which are credit or transaction networks that contain fraud. [Limitations] Reducing the time overhead of pre-embedding structural information is the focus of further research. [Conclusions] Fraud detection can be accomplished effectively by fully utilizing the structural attributes and homophilic information of entities in the credit network.

  • Chen Jing, Zhao Yuke, Lu Quan, Zhang Lu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1393
    Online available: 2024-10-23

    [Objective] To reveal the mechanism behind the differences in user cognition caused by the interpretable characteristics of navigation tools. [Methods] Two navigation tools based on topics and table of contents, named THC-DAT and BOOKMARK, were selected as research tools. Eye-tracking technology and the Mann-Whitney test were used to explore the cognitive differences among users caused by different interpretable characteristics such as topic coverage, navigation accuracy, and semantic readability. [Results] The importance of the interpretable characteristics of navigation tools varies across tasks. In low-difficulty tasks, navigation accuracy significantly affects cognitive efficiency, cognitive effectiveness, and navigation-assisted cognitive strategy. In high-difficulty tasks, semantic readability significantly affects cognitive efficiency. [Limitations] The sample is small and homogeneous in structure, and cognitive differences were compared between only two types of navigation tools. [Conclusions] This study provides new ideas for improving the knowledge organization service of navigation tools and optimizing the quality of user reading from the perspective of interpretability.

  • Cao Kun, Wu Xinnian, Bai Guangzu, Jin Junbao, Zheng Yurong, Li Li
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0006
    Online available: 2024-10-11

    [Objective] Combining textual content features and the complex network relationship between “science and technology”, this study conducts research on the identification method of key core technologies, aiming to provide intelligence support for governments, research institutions, and the industry to rationally formulate scientific and technological strategic plans, or carry out scientific and technological innovation activities. [Methods] The Sentence-BERTopic model is used to perform deep semantic fusion and knowledge topic clustering on sentence-level paper and patent text corpora. Based on the citation relationships of papers and patents, a “science-technology” knowledge topic complex network is constructed, and the traditional PageRank algorithm is improved by combining node quality characteristics, time decay factors, weights of incoming node edges, and outdegree, etc., to objectively rank the importance and influence of nodes in the field. Finally, key core technologies are selected in combination with the head/tail breaks method. [Results] An empirical study was conducted in the field of numerical control machines, resulting in the identification of 53 key core technologies, including thermal error modeling and compensation, numerical control machine tool control technology, and numerical control machine tool feed systems. When compared with relevant domestic and international policy plans, this outcome comprehensively encompassed the key core technologies within the domain, thereby demonstrating the scientific validity and rationality of the methodology employed. [Limitations] The lack of in-depth analysis of citation locations, citation motivations, citation behaviors, and sentence purposes may affect the accuracy of identification. 
    [Conclusions] By constructing a “science-technology” complex network and the KCR algorithm, the knowledge structure and topological characteristics of science and technology can be more comprehensively revealed, achieving fine-grained and precise quantitative identification of key core technologies.
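
    The head/tail breaks step mentioned in the methods can be illustrated with a minimal sketch. It assumes node importance scores have already been computed; the example scores, the function name `head_tail_breaks`, and the 40% head limit (a common rule of thumb for this method) are illustrative assumptions, not details taken from the paper.

```python
def head_tail_breaks(scores, head_limit=0.4):
    """Return the successive mean thresholds found by head/tail breaks."""
    data = list(scores)
    thresholds = []
    while len(data) > 1:
        mean = sum(data) / len(data)
        head = [s for s in data if s > mean]
        # Stop when the head is empty or is no longer a clear minority.
        if not head or len(head) / len(data) > head_limit:
            break
        thresholds.append(mean)
        data = head  # recurse into the head only
    return thresholds


# Hypothetical heavy-tailed importance scores: many low values, few high ones.
scores = [1] * 80 + [5] * 15 + [50] * 5
thresholds = head_tail_breaks(scores)
# Nodes above the last threshold form the top tier of candidates.
key_nodes = [s for s in scores if s > thresholds[-1]]
print(thresholds, len(key_nodes))
```

    Each threshold is the mean of the current subset, so the method suits heavy-tailed score distributions where a small head dominates.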

  • ZHU Xiping, XIAO Lijuan, GAO Ang, GUO Lu, YANG Huan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.0765
    Online available: 2024-10-11

    [Objective] To increase the overall accuracy of triplet extraction by mining semantic correlations across carbon-neutral data. [Methods] For joint entity and relation extraction based on MacBERT, we propose the HmBER (Handle Missing Label and Error Boundaries Model Based on MacBERT) model. To enhance the performance of joint extraction of carbon-neutral entities and relations, similarity assessment, auxiliary training on entity boundaries, and entity category features are incorporated into the model. [Results] On the carbon-neutral dataset, a comparison with other SOTA algorithms shows average increases of 2.39% and 13.84% in the F1 scores of entity extraction and relation extraction, respectively. [Limitations] The method does not explore the deeper latent semantics of the data; it infers the relations between entities only from the meaning of the sentence. [Conclusions] The proposed HmBER model effectively addresses the problems of missing labels and entity boundary errors, increasing F1 on the dataset by 2.39% and 13.84% on average.

  • Xia Zhonghua, Qi Jianglei, Ding Hao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0022
    Online available: 2024-10-11

    [Objective] This article proposes a medical publication recommendation model that uses cross-modal information to improve recommendation accuracy. [Methods] Medical knowledge is extracted in standardized form through the Unified Medical Language System and used to pair image and text tags; the paired semantic tags are then used to align feature semantics between images and text through contrastive learning. A cross-modal cross-attention mechanism is built on the aligned feature semantics, and recommendations are made by predicting the user's preference for publications from the user's interest weights for different modalities. [Results] In comparative experiments against 3 recent multimodal baseline methods on 2 publication datasets, the model averaged 68% F1, 64% Precision, and 63% NDCG, and its results were generally better than those of the baseline models. [Limitations] Additional cold-start methods may be required for pre-training data that contains only a single modality. [Conclusion] The proposed model fuses cross-modal information features well, which can effectively alleviate the semantic gap between modalities and improve the accuracy of medical publication recommendation.

  • Wang Xin, Diao Xiuli, Ni Weijian, Zeng Qingtian, Song Zhengguo
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0035
    Online available: 2024-10-11

    [Objective] To trace learners’ learning progress and knowledge state in order to provide personalized educational support. [Methods] We propose Fine-grained Learning Ability boosted Interpretable Knowledge Tracing (FLAB-IKT), which models learners in terms of both knowledge and ability to predict their answer results at the next moment. [Results] Experiments on three datasets show that the proposed knowledge tracing model improves performance over many baseline methods. [Limitations] The method improves model interpretability by adding learning factors. However, further validation is needed to improve the interpretability of deep-learning-based knowledge tracing models. [Conclusion] The proposed model not only improves prediction performance but can also portray the learner model and the prediction process from multiple perspectives, which improves the interpretability of the knowledge tracing model.

  • Chen Wenjia, Yang Lin, Li Jinlin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1435
    Online available: 2024-10-11

    [Objective] To build a more accurate semantic similarity classification model based on intent recognition, providing more accurate answer matching for Chinese medical and health query services. [Methods] An intent recognition model is constructed by integrating BERT (Bidirectional Encoder Representations from Transformers) and CNN (Convolutional Neural Networks). It is then used as an embedding layer to build the intention-recognition embedded twin Bert (ITBert) model for semantic classification. [Results] On the CHIP-STS dataset, compared with benchmark models, the Top-1 accuracy of the integrated model in intent recognition increases by 8.2 and 1.5 percentage points, reaching 73.6%. The Top-3 accuracy increases by 7.6 and 3.2 percentage points, reaching 91.2%. These results demonstrate the improvement of the integrated model in intent recognition. For semantic similarity classification, the AUC of the ITBert model increases by 0.015-0.087 compared with benchmark models, showing that embedding intent knowledge improves the effectiveness of medical semantic similarity classification. [Limitations] The manually annotated intent information contains some deviation, which may affect the semantic similarity classification results. [Conclusions] The fusion model can improve intent recognition performance in medical and health query services. Embedding the recognized intent knowledge can improve the accuracy of semantic similarity classification models, which helps provide more accurate automatic question answering for medical and health queries.

  • Qiu Jiang-nan, Xu Xue-dong, Lu Yan-xia, Yang Zhi-long
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1371
    Online available: 2024-10-11

    [Objective] To identify and multi-label classify the pension insurance dispute conflicts reflected in public demands, and to explore the differences among conflict types and in response rates across regions. [Methods] First, the ERNIE model is enhanced with knowledge and data through the construction of a domain lexicon, extraction of key claim content, and simple data augmentation. On this basis, an ERNIE-BILSTM conflict identification and classification model is constructed for pension insurance disputes, which enables in-depth mining of the conflicts in public claims under low-resource data conditions and addresses the reliance of existing social conflict analysis on qualitative methods. Finally, based on the results, we analyze the differences in pension insurance disputes. [Results] During the data collection period, pension insurance payment conflicts are more frequent in Henan and Liaoning provinces, while pension insurance service conflicts are more likely to occur in Guangdong Province and Beijing, and the response rates for different types of conflicts differ considerably. [Limitation] This paper does not consider the correlation between different types of conflicts, which can be analyzed further. [Conclusion] The findings reveal inter-provincial differences in pension insurance disputes, which can help policy makers grasp the hotspots and dynamics of disputes and assist government decision-making.

  • Pan Hongpeng, Liu Zhongyi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0014
    Online available: 2024-10-11

    [Objective] To automatically identify social network rumor spreaders among a large number of social users, based on multi-modal data. [Methods] Considering the "multi-modal" and "unbalanced user sample" characteristics of the data, the original data are first oversampled, and then traditional features such as user attributes and microblog posts are deeply integrated with the multi-modal information features in user-generated content. Based on the XGBoost model, an intelligent identification framework for social network rumor spreaders is constructed that can broadly integrate the characteristics of social users. Finally, SHAP values are embedded in the output layer of the model to increase the interpretability of the algorithm. [Results] XGBoost has the best overall performance on sample-balanced datasets compared with raw data, with a 12.3% increase in recall. The accuracy of the identification method with integrated multi-modal information features reaches 91.2%, which is 2.5% higher than that of the control group. [Limitations] Only the text and image modalities are considered in the multi-modal information features; audio and video modalities could be added in the future. [Conclusion] The identification method based on multi-modal data and oversampling training can effectively handle the intelligent identification of social network rumor spreaders.
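
    The oversampling step can be sketched as follows. The abstract does not say which oversampling technique was used, so this sketch assumes plain random oversampling of the minority class; the function name `random_oversample` and the toy feature vectors are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples at random until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        # Draw with replacement from the class's own samples.
        extra = [rng.choice(pool) for _ in range(target - count)]
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels


# Hypothetical user feature vectors; label 1 marks a rumor spreader (minority).
users = [[0.1, 3], [0.4, 1], [0.2, 7], [0.9, 2], [0.8, 5]]
labels = [0, 0, 0, 0, 1]
balanced_users, balanced_labels = random_oversample(users, labels)
print(Counter(balanced_labels))
```

    Oversampling should be applied to the training split only, so that duplicated samples do not leak into evaluation data.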

  • Jin Bo, Zhang Jiawan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0113
    Online available: 2024-10-11

    [Objective] To address the time-consuming and cumbersome nature of manual sleep staging, as well as the long training time and poor recognition performance of existing automatic sleep staging models, and to improve the accuracy and robustness of sleep staging prediction. [Methods] This paper designs an automatic sleep staging model based on the discrete wavelet transform and a residual shrinkage network. First, the original physiological signal data are decomposed by the discrete wavelet transform, and multi-resolution features are extracted through two convolutional neural networks of different sizes. Then, a deep residual shrinkage network is used to model the interdependence of features at the channel level. Finally, a temporal context encoder with multi-head attention is deployed to capture the temporal dependencies in the features. [Results] Experiments on three public sleep datasets show that the classification accuracy of the proposed model reaches 85.4%, 81.9%, and 84.4% respectively, an improvement of 1.0, 0.6, and 0.2 percentage points over the best baseline model. [Limitations] The proposed model does not perform well on imbalanced datasets. [Conclusions] The WaveSleep model can effectively improve the efficiency and accuracy of sleep staging prediction and is highly robust.
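
    The wavelet decomposition step can be illustrated with a minimal sketch. The abstract does not name the wavelet family or the number of levels, so the Haar wavelet, the two-level depth, and the toy signal below are assumptions chosen for simplicity.

```python
import math


def haar_dwt(signal):
    """One level of the Haar DWT: returns (approximation, detail) coefficients."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    scale = math.sqrt(2.0)
    pairs = [(signal[i], signal[i + 1]) for i in range(0, len(signal), 2)]
    approx = [(a + b) / scale for a, b in pairs]  # low-pass: local averages
    detail = [(a - b) / scale for a, b in pairs]  # high-pass: local differences
    return approx, detail


def multilevel_haar(signal, levels):
    """Repeatedly decompose the approximation band, collecting detail bands."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_dwt(approx)
        details.append(detail)
    return approx, details


# Hypothetical signal samples standing in for a physiological recording.
signal = [4.0, 4.0, 2.0, 2.0, 6.0, 0.0, 1.0, 3.0]
approx, details = multilevel_haar(signal, levels=2)
print(approx, details)
```

    Each level halves the time resolution of the approximation band, which is what makes the resulting coefficients a multi-resolution input for later feature extractors.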

  • Chen Ting, Ding Honghao, Zhou Haoyu, Wu Jiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1424
    Online available: 2024-10-11

    [Objective] This study explores the impacts of bullet-screen content and behavior characteristics on consumers' purchase behavior in live streaming e-commerce and examines the moderating effect of broadcaster-product relevance. [Methods] First, we retrieved bullet-screen data from the TikTok platform and consumption data from the Huitun platform. Then, based on the elaboration likelihood model, we studied the impacts of bullet-screen content characteristics (central route) and behavior characteristics (peripheral route) on consumer purchase behavior using text mining and zero-inflated negative binomial regression. We also examined the moderating effect of broadcaster-product relevance with grouped regression. [Results] Information richness, degree of social interaction, and the number of bullet-screen comments positively impact purchase behavior. The impact of bullet-screen sentiment valence on purchase behavior is inverted U-shaped. Moreover, compared with live streaming rooms with low broadcaster-product relevance, the sentiment valence of bullet-screen comments in rooms with high broadcaster-product relevance has a larger positive impact on purchase behavior. [Limitations] We only investigated bullet-screen data from one live streaming e-commerce platform. [Conclusion] This study focuses on the factors influencing consumers’ actual purchase behavior from the perspective of bullet-screen comments. It provides guidance for live streaming practitioners on how to communicate better with consumers and improve sales performance.

  • Liu Qingtang, Jiang Ruyi, Wu Linjing, Yin Xinghan, Wang Deng, Ma Xinqian
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.0961
    Online available: 2024-10-11

    [Objective] Entity relationship extraction models built for general domains perform poorly when applied directly to specific domains. Statistical analysis reveals that entity location, entity type, and entity relationship are strongly correlated in Tujia ethnic instrumental music texts, so this paper proposes an entity relationship extraction model that integrates entity location and type features. [Methods] A pipeline relationship extraction model is adopted. After the named entity recognition task is completed, the relative position of each character to the subject and object, together with entity type features, is spliced into the original relationship statement; the features are then learned through the BERT model, and relationship classification is finally learned through a fully connected layer. [Results] Ablation and model comparison experiments on the self-constructed Tujia ethnic instrumental music dataset show that the model incorporating entity type features (BERT_E) performs best, with an F1-micro of 97.359%. [Limitations] The sample size is small, and the entity location features do not take entity length into account. [Conclusions] The results promote the digital protection of Tujia ethnic instrumental music culture and intelligent application services for it, and also have important reference value for entity relationship extraction in fields related to ethnic instrumental music.

  • Dou Luyao, Zhou Zhigang, Shen Jing, Feng Yu, Miao Junzhong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2024.0023
    Online available: 2024-10-11

    [Purpose] To address the issues of long-distance dependencies in sequence modeling and key information extraction from sequence features during the identification of potential high-value patents, thereby enhancing the accuracy and interpretability of identifying potential high-value patents.[Method] A potential high-value patent identification model (XLBBC) based on the pre-trained XLNet model and the BiAttention dual attention mechanism is proposed. The XLNet model is utilized for patent text representation and high-quality semantic extraction, followed by the use of the BiGRU network to obtain global text sequence information. Subsequently, the embedding of the BiAttention layer focuses the model's attention on different parts of the input sequence, while the CNN layer captures key phrases and specific patterns in patent text. An empirical study is conducted on a mixed patent dataset in fields such as amorphous alloys, industrial robots, and perovskite solar cells.[Results] The model demonstrates high accuracy (0.89) and consistency (0.65) advantages with a certain data volume (40,000 patent data points). The model achieves a prediction accuracy of approximately 42%, representing an improvement of around 9% compared to existing research models. [Limitations] Noteworthy limitations include the omission of considerations regarding the correlation and integration mechanisms between standard essential patents and high-value patents. Additionally, there exists room for enhancing algorithmic complexity.[Conclusion] The XLBBC model surpasses composite models such as CNN in text classification, underscoring the efficacy of the XLNet model in global semantic comprehension. Optimal model performance is achieved when the attention layer is strategically positioned between the XLNet-BiGRU and CNN layers.

  • Wang Yudong, Bai Yu, Ye Na, Chen Jianjun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.0968
    Online available: 2024-10-09

    [Objective] This study aims to address topic drift in interactive retrieval for hyponym expansion. [Method] Graph attention networks are used to encode the nodes of conceptual chains and of a textual relationship graph, where conceptual chains are modeled through word interaction processes and the textual relationship graph is obtained from character co-occurrence relationships. By introducing an attention mechanism, the approach aims to overcome the loss of query context information in traditional text encoding. [Results] Experimental results on a hyponym expansion test dataset show that the approach improves overall performance, with an F1 score 2.0% higher than existing methods. [Limitations] The proposed method applies only to interactive scenarios, since it relies on interactive information. [Conclusion] The proposed model effectively integrates the structural and semantic features of conceptual chains into text features. It also calculates attention for both the conceptual chain and the candidate text, reducing information loss during encoding and alleviating topic drift.

  • Qian Xiaodong, Shi Yulin, Guo Ying
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.0893
    Online available: 2024-10-09

    [Purpose] This paper uses an improved DeepWalk link prediction algorithm to study node similarity and recommendation in e-commerce networks. [Methods] To address the problem that the traditional DeepWalk algorithm treats every node equally during the random walk, the walk is biased using the structure and attribute information of the e-commerce network, guiding it to traverse the different types of nodes in the graph in a more targeted way. To address the problem that cosine similarity in the traditional DeepWalk algorithm cannot represent the relationship between users and commodities well, the Bhattacharyya coefficient is introduced into an existing nonlinear similarity calculation model to create a new similarity model. [Results] On this basis, an optimized DeepWalk model is proposed, and three e-commerce network datasets are used to verify the proposed algorithm. The results show that the accuracy of the optimized algorithm is higher than that of six comparison algorithms, including the traditional DeepWalk algorithm, Node2vec, and M-NMF. [Conclusion] The improved algorithm can learn node embedding vectors well and thereby capture the similarity of nodes in the e-commerce network.
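
    For reference, the Bhattacharyya coefficient between two discrete distributions p and q is the sum over i of sqrt(p_i * q_i); it is 1 for identical distributions and 0 for distributions with no overlap. The sketch below shows only this coefficient, not the paper's full nonlinear similarity model, and the rating histograms are hypothetical.

```python
import math


def normalize(counts):
    """Turn non-negative counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("counts must not all be zero")
    return [c / total for c in counts]


def bhattacharyya_coefficient(p, q):
    """Overlap between two discrete distributions defined on the same bins."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


# Hypothetical rating histograms (counts of 1- to 5-star ratings) for three users.
user_a = normalize([0, 1, 2, 5, 2])
user_b = normalize([0, 0, 3, 5, 2])
user_c = normalize([6, 3, 1, 0, 0])
sim_ab = bhattacharyya_coefficient(user_a, user_b)
sim_ac = bhattacharyya_coefficient(user_a, user_c)
print(round(sim_ab, 3), round(sim_ac, 3))
```

    Unlike cosine similarity on co-rated items, the coefficient compares whole rating distributions, so it remains defined when two users share few or no rated items.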

  • Xie Jun, Gao Jing, Xu Xinying, Hao Shufeng, Liu Yuxin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.0793
    Online available: 2024-10-09

    [Objective] To address the shortcomings of most GCN models for ABSA (aspect-based sentiment analysis): affective knowledge is ignored when constructing the syntactic dependency graph, excessive dependencies in the graph generate noise, and performance degrades when modeling long-distance or incoherent words. [Methods] First, the sentiment scores in SenticNet7 are used to improve the syntactic dependency graph, and noise reduction for various syntactic dependency types is considered. Second, a dual Transformer network is used to improve the handling of long-distance words. Meanwhile, the improved syntactic dependency graph enhances the representation learning of semantic features. [Results] On five public datasets, the F1 values of the proposed model reached 74.97%, 76.13%, 74.83%, 68.01%, and 74.54%, respectively. Compared with various benchmark models, F1 values increased by 3.85%, 5.22%, 3.48%, 6.80%, and 7.49%, respectively. [Limitations] Because the datasets contain a certain proportion of sentences with implicit emotion, the proposed model cannot learn accurate implicit emotion features, which limits the analysis results. [Conclusions] The proposed model combines emotional common-sense knowledge with noise-reduced syntactic relations to reconstruct a Dual-Transformer network, which improves ABSA performance.

  • Ye Naifu, Yuan Deyu, Zhang Zhi, Hou Xiaolong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.0841
    Online available: 2024-10-09

    [Objective] To address the problem that current textual relation extraction models capture only part of the textual features, a dual-channel textual relation extraction model based on cross-attention is constructed to improve the comprehensiveness and accuracy of textual relation extraction and to achieve high-performance relation extraction on domain datasets. [Methods] We propose the DCCAM (Dual Channel Cross Attention Model) relation extraction model, design a dual-channel structure that fuses a sequence channel and a graph channel, and construct a cross-attention mechanism with self-attention and gated attention, which promotes deep fusion of textual features to uncover latent related information in the text. Experiments are conducted on public datasets and on two constructed datasets from the policing domain. [Results] Experimental results on the NYT and WebNLG public datasets show that the F1 values of the DCCAM model improve by 3% and 4%, respectively, over the baseline model. Ablation experiments demonstrate the effectiveness of the modules in enhancing text extraction capability. Experimental results on the telecommunication fraud dataset and the dataset on aiding information network crime show that the DCCAM model improves textual relation extraction in the policing domain, with F1 values 8.8% and 11.8% higher than the baseline model, respectively. [Limitations] Large language models offer a new research direction for textual relation extraction, and future work on textual relation extraction will be carried out from the perspective of large language models. [Conclusions] The DCCAM model can significantly improve textual relation extraction. It also demonstrates the effectiveness and usefulness of textual relation extraction in the policing domain, where it can provide textual correlation analysis and guidance for policing work.

  • Pang Qinghua, Xu Xun, Zhang Lina
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1076
    Online available: 2024-10-09

    [Objective] To address the homogeneity and lack of novelty in Weibo topic recommendations, a more comprehensive topic recommendation model is proposed to better meet users' needs for personalized information and enhance their overall experience. [Methods] First, the LDA model is used to extract topics from users' historical Weibo text, forming a Weibo-topic matrix and a user-topic matrix. Second, users' multi-dimensional interest in Weibo topics is evaluated along the dimensions of interaction, attributes, and posting frequency. Meanwhile, the forgetting and decay of user interest is simulated to construct a dynamic user interest preference model and to obtain each user's set of similar neighbors. Finally, hybrid recommendation is used to derive the final evaluation of user preferences for topics and recommend the top-N topics to the user. [Results] Ablation experiments on a real dataset show that the topic recommendation model achieves a higher overall evaluation in terms of F1 value, coverage, and novelty. [Limitations] Topics were mined only from the content of Weibo text; subsequent studies could further incorporate information such as user comments. [Conclusions] The model can provide users with more diverse and novel recommendations while maintaining a certain level of accuracy, which effectively addresses the homogeneity and lack of novelty in Weibo recommendation.
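
    The idea of interest forgetting and decay can be sketched with a simple exponential model. The abstract does not give the decay function, so the half-life form, the 30-day half-life, the function name `decayed_topic_interest`, and the interaction records below are all assumptions for illustration.

```python
from collections import defaultdict


def decayed_topic_interest(interactions, now_day, half_life_days=30.0):
    """Sum interaction strengths per topic, halving each one every half-life."""
    interest = defaultdict(float)
    for topic, day, strength in interactions:
        age = now_day - day
        interest[topic] += strength * 0.5 ** (age / half_life_days)
    return dict(interest)


# Hypothetical records: (topic, day of interaction, interaction strength).
interactions = [("sports", 0, 1.0), ("sports", 60, 1.0), ("music", 30, 2.0)]
interest = decayed_topic_interest(interactions, now_day=60)
# Rank topics by current (decayed) interest to pick the top-N.
top_topics = sorted(interest, key=interest.get, reverse=True)
print(interest, top_topics)
```

    Older interactions contribute less, so a topic the user engaged with heavily long ago can rank below one with fewer but more recent interactions.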

  • Rang Yuchen, Ma Jing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1130
    Online available: 2024-10-09

    [Objective] To reduce inter-modal differences and strengthen inter-modal correlations so that the model can accurately grasp the emotional tendencies embedded in image-text pairs, enhancing the effectiveness of sentiment analysis. [Methods] For image-text pair data, the text side supplements the text with the image caption and then uses the RoBERTa pre-trained model for feature extraction, while the image side uses ClipVisionModel to extract image features. The separately extracted text and image features are passed through a Transformer-based multimodal alignment layer to obtain enhanced fusion features, and the fused features are finally fed into a multilayer perceptron for emotion recognition and classification. [Results] The proposed model achieves 71.78% accuracy and a 68.97% F1 value on the MVSA-Multiple dataset, higher than all baseline models. Compared with the best-performing baseline, it improves accuracy and F1 value by 1.78% and 0.07%, respectively. [Limitations] Model performance was not tested on additional datasets. [Conclusions] The proposed model effectively facilitates inter-modal fusion, obtains better fusion representations, and enhances sentiment analysis.

  • Xie Jun, Yang Haiyang, Xun Xinying, Cheng Lan, Zhang Yarui, Lv Jiaqi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.1072
    Online available: 2024-10-09

    [Objective] This article proposes a knowledge graph completion method based on multi-view fusion and multi-feature extraction, aiming to address the low quality of knowledge representation and the poor performance of existing models. [Methods] First, multiple single-view networks are generated through a view encoder, and the final knowledge representation of each entity is obtained by using multi-view attention to fuse information from the different views. Second, semantic and interaction features of the head entity and the relation are extracted separately using different feature extractors. These features are then matched against the tail entity using a cross-attention module. [Results] Experimental results on the link prediction task show that, compared with the baseline model, the Hits@10 metric improves by 0.4 and 0.9 percentage points on the general datasets FB15k-237 and WN18RR, respectively. In addition, Hits@10 on the domain datasets Kinship and UMLS reaches 99.0% and 99.9%. [Limitations] Relations were not updated when the views were updated, and the relation representation vectors are of only average quality. [Conclusions] The multi-view fusion model effectively improves the quality of knowledge graph representation, while the multi-feature extraction framework effectively enhances the accuracy of link prediction.

  • Hu Zhongyi, Qin Wei, Wu Jiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.0838
    Online available: 2024-10-09

    [Objective] This paper aims to expand the application of diffusion models in text generation and to address the monotonous and redundant output of existing models. [Methods] The TextRank algorithm extracts keyword information from the original text, and the keywords are then integrated into a sequence diffusion model (DiffuSeq) to construct a keyword-enhanced sequence diffusion model (K-DiffuSeq). [Results] Compared with the benchmark models, K-DiffuSeq improves by at least 4.14% in PPL, 42.69% in ROUGE, and 29.43% in the diversity measure.

    [Limitations] Only product-related text corpora were considered; richer multimodal product information such as images and videos was ignored. [Conclusions] Integrating keywords effectively improves the performance of marketing text generation models, and this study confirms the potential of diffusion models in text generation.
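    The keyword extraction step named above can be sketched as a small TextRank over a word co-occurrence graph. This is an illustrative sketch under assumed settings (window size, damping factor, iteration count, pre-tokenized input), not the configuration used in the paper.

```python
from collections import defaultdict

def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by TextRank score on an undirected co-occurrence graph."""
    # Link two different words if they fall inside the same sliding
    # window of `window` consecutive tokens.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word]
            )
            for word in neighbors
        }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Hypothetical pre-tokenized product text.
tokens = ["battery", "life", "battery", "screen", "battery", "price"]
keywords = textrank_keywords(tokens, top_k=2)
```

    Here "battery" co-occurs with every other word and therefore ranks first. How the extracted keywords are injected into DiffuSeq is not described in the abstract and is not modeled here.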

  • Yun YangLin, Tang Xiaobin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2023.0868
    Online available: 2024-10-09

    [Objective] Traditional recommendation methods cannot effectively generate personalized recommendations when facing the cold-start problem, which reduces recommendation accuracy and user satisfaction. [Methods] Unlike the usual approach of recommending financial commodities from commodity information alone, this paper also builds representations of users' historical transaction records and introduces multi-level fused representations. By linking the two, a recommendation system that incorporates implicit information can capture complex user investment patterns. [Results] The model was first compared with different benchmark methods, and a few-shot learning scenario was then constructed to verify its ability to handle the commodity cold-start problem. Experimental results show that, compared with the previous best method, the mean reciprocal rank, hit rate, and normalized discounted cumulative gain of this method increase by 18.6%, 26.08%, and 23.52%, respectively. [Limitations] The model is not based on the most advanced neural network architecture; more advanced deep neural networks could be used in the future to further improve recommendation performance. [Conclusions] The results of the ablation experiments and the comparison with the benchmark models demonstrate the effectiveness of the proposed method in recommending financial commodities.

  • Wanying Lv, Jie Zhao, Liushen Huang, Zhenning Dong, Zhouyang Liang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1153

    [Objective] This paper applies feature grouping and feature combination to trust evaluation. Grouping provides replaceable features when data are missing and reduces the search space. Combination effectively reduces the feature dimension and further alleviates the difficulty of trust evaluation caused by missing data. [Methods] Features with similar discriminative ability are grouped using the Markov blanket, by analyzing the relationships among the features' discriminative abilities. An RVNS-based method then searches within and between groups to complete the feature combinations. [Results] When features have missing values, the method effectively provides substitute features while keeping trust evaluation performance stable; the feature dimension is reduced to 1.7%, and the average accuracy of trust evaluation is higher than 92%. [Limitations] This study only discusses methods for alleviating the missing-data problem; how to exploit knowledge contained in data with missing values can be explored in the future. [Conclusions] We integrate feature grouping and combination to provide an efficient trust evaluation model that alleviates, from two sides, the problems caused by missing data in trust evaluation.

  • Chen Wen, Chen Wei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0075

    [Objective] Emerging topics contained in multi-source data are identified, and a multivariable LSTM with bibliometric indicators is established to predict the popularity of emerging topics.

    [Methods] First, topics of fund projects, papers, and patents are identified. Second, emerging topics are screened out according to their novelty, growth, and persistence. Finally, a topic popularity indicator is designed, and the popularity score of emerging topics is predicted with the multivariable LSTM model using four bibliometric indicators: fund amount, fund number, average citation frequency per article, and number of patent IPC subclasses.

    [Results] Taking the field of solid oxide fuel cells as an example, the multivariable LSTM with bibliometric indicators predicts better than BP, KNN, SVM, and univariate LSTM, with the lowest MAE (16.534) and RMSE (23.494) and the highest R² (0.642).

    [Limitations] The patent citation number and other indicators are not selected as input variables because it is difficult to obtain specific data for each time slice.

    [Conclusions] Including bibliometric indicators improves the popularity prediction of emerging topics.
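    For readers unfamiliar with the three reported measures, the following sketch shows their standard definitions. The function name and the sample values are illustrative and are not taken from the paper.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) using the standard definitions."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot                       # undefined if y_true is constant
    return mae, rmse, r2

# Made-up popularity scores and predictions.
mae, rmse, r2 = regression_metrics([1, 2, 3, 4], [1, 2, 3, 5])
```

    Lower MAE and RMSE and a higher R² indicate a better fit, which is the sense in which the abstract ranks the models.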


  • Hu Jiming, Qian Wei, Wen Peng, Lv Xiaoguang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1167

    [Objective] To improve the accuracy of text representation and the effectiveness of subsequent text mining, the structural and functional information of Chinese medical records is used to enrich the semantic content of the text representation.

    [Methods] Based on the structure-function features of Chinese medical records, this research proposes a new semantic representation strategy for the text. The BiLSTM-CRF model is used to recognize named entities based on text structure, introducing entity and structure information at the word vector level. The TextCNN model is also used to extract local context features, yielding a vector representation with richer semantic content.

    [Results] In the medical entity recognition experiment, the precision, recall, and F value of structure-function-based entity recognition reached 93.20%, 95.19%, and 94.19%, respectively. In the text classification experiment, which verifies the text representation method proposed in this article, the classification accuracy reached 92.12%.

    [Limitations] It is necessary to strengthen the verification on more texts and refine the structure recognition process, so that the proposed method can better serve text mining.

    [Conclusions] The method proposed in this paper introduces structure-function information of medical records into text representation. The experiments show that it not only effectively improves the accuracy of named entity recognition, but also enriches the semantic content of the text and improves the text representation.


  • Yang Yang, Jang Kaizhong, Yuan Mingjun, Hui Lanxin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0115

    [Objective] To address the need to specify the number of topics in advance in the traditional LDA model, an adaptive method for determining the number of topics is proposed for news topic recognition.

    [Methods] This paper extracts features from news data using semantics and time series as two views, obtaining the corresponding feature vectors. The Co-DPSC algorithm collaboratively trains the two views to obtain a semantic feature matrix that contains timing effects. Finally, after dimensionality reduction, density peak clustering is applied to the rows of the matrix, and the result is used as the optimal number of topics.

    [Results] The experimental results show that considering semantic and temporal factors improves the precision and F value of the optimal number of topics: precision increases by 35.09% and the F value by 15.39%.

    [Limitations] The method clusters a keyword set, so the way keywords are obtained affects the clustering quality and the running time to a certain extent. Because the method requires both textual and temporal elements, as news data provide, its applicability to other types of data is limited.

    [Conclusions] Experiments show that this method combines the timeliness and content of news data when determining news categories, which improves the accuracy of the optimal number of topics to a certain extent.

  • Yang Meifang, Yang Bo
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1308

    [Objective] To effectively learn the text characteristics and contextual semantic relevance of the risk domain, and to improve the performance of entity extraction in the enterprise risk domain. [Method] An entity extraction model for the enterprise risk domain is proposed, based on stroke ELMo embedded in IDCNN-CRF. First, a bidirectional language model is pre-trained on large-scale unstructured enterprise risk domain data to obtain stroke ELMo vectors as input features. These are then sent to the IDCNN network for training, and a CRF processes the output layer of the IDCNN to obtain the globally optimal entity sequence labeling in the enterprise risk domain. [Results] The experimental results show that the F value of this model for entity extraction in the enterprise risk domain is 91.5%, which is 2% higher than that of the BiLSTM-CRF deep neural network model, and its testing speed is 2.36 times faster. [Limitations] Fusing additional text features with the stroke-based ELMo character vectors can effectively improve Chinese entity extraction, but the generality of this model for entity extraction tasks in other domains was not considered. [Conclusion] This article gives the specific process of applying the model, which provides a reference for the construction of entity corpora in the field of enterprise risk.

  • Li Xiaomin, Wang Hao, Li Yueyan, Zhao Meng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0183

    [Objective] Geographical names are a product of human society reaching a certain stage of development, and they evolve continuously as society develops. Using linked data technology to study the evolution of geographical names helps them better play their role in cultural inheritance, which is of positive significance.

    [Method] This paper constructs the knowledge base CGNE_Onto on the evolution of Chinese geographical names. It formulates strong and weak marker words for identifying evolution types, uses them to identify evolution-type sentences in historical evolution data, and then uses the BERT-BiLSTM-CRF model to identify the time and place name entities in those sentences. The resulting time and place name entities are used as classes in the ontology to build the ontology knowledge base. The constructed ontology knowledge base for the evolution of administrative division place names is also visualized from the perspectives of direct path relationships and indirect path relationships. The number of evolution types in each dynasty and the reasons for their formation are analyzed statistically.

    [Result] The experimental results show that the model proposed in this paper can clearly and intuitively display the evolution of geographical names, which provides a new idea for the analysis and mining of geographical names data.

    [Limitations] Due to the small scale of the dataset in this paper, the evolution feature words also have certain limitations.

    [Conclusion] The knowledge base of place name evolution constructed in this paper can intuitively and clearly show the evolution of place names from ancient times to the present, as well as the evolution types of various dynasties.

  • Zhao RuiJie, Tong XinYu, Liu XiaoHua, Lu YongHe
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1414

    [Objective] A new entity recognition model was proposed to improve the effectiveness of medical entity recognition, realize the mining of new medical knowledge and improve the utilization rate of medical scientific papers.

    [Methods] An Att-BiLSTM-CRF-based pharmaceutical entity recognition model was constructed and tested on the public datasets GENIA Term Annotation Task and BioCreative II Gene Mention Tagging for F1 values and accuracy, respectively. The model was used to annotate the abstracts of biomedical scientific papers.

    [Results] The experimental results show that the model is superior to the two benchmark models. The F1 values of the two data sets are 81.57% and 84.23%, and the accuracy is 92.51% and 97.85%, respectively. Moreover, the model has more advantages in the data sets with extremely unbalanced data.

    [Limitation] The volume of data and application of entity labeling experiments is relatively homogeneous and could be further expanded.

    [Conclusion] The medical entity recognition model based on Att-BiLSTM-CRF can improve the effectiveness of entity recognition and realize the mining of new medical knowledge.


  • Cheng Peng, Chunxia Zhang, Xin Zhang, Jingtao Guo, Zhendong Niu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0225

    [Objective] To address incomplete extraction of entity information and the lack of a measure of how important different timestamps are to the event being inferred in temporal knowledge graph reasoning. [Methods] A temporal knowledge graph reasoning model based on entity multiple unit encoding (EMUC) is proposed. EMUC introduces three entity feature encodings: the entity slice feature encoding of the current timestamp, the entity dynamic feature encoding that fuses timestamp embeddings with static entity features, and the entity segment feature encoding that is relatively stable over historical time steps. Furthermore, a temporal attention mechanism learns the importance weights of local structural information at different timestamps for the inference target. [Results] On the ICEWS14 test set, the model achieves MRR 0.4704, Hits@1 40.31%, Hits@3 50.02%, and Hits@10 59.98%; on the ICEWS18 test set, MRR 0.4385, Hits@1 37.55%, Hits@3 46.92%, and Hits@10 56.85%; and on the YAGO test set, MRR 0.6564, Hits@1 63.07%, Hits@3 65.87%, and Hits@10 68.37%. The model outperforms existing methods on these evaluation metrics. [Limitations] EMUC is slow at inference on large-scale datasets. [Conclusions] EMUC captures multiple entity features in the temporal knowledge graph, including entity slice, entity dynamic, and entity segment features. The temporal attention mechanism measures the importance of historical local structural information for reasoning, which effectively improves reasoning performance on temporal knowledge graphs.
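    The MRR and Hits@k figures above follow the usual link prediction definitions, sketched below. The ranks in the example are made up; each rank is the 1-based position of the correct entity among the candidates scored for one test query.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of queries whose correct entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the correct entity for four test queries.
ranks = [1, 2, 4, 10]
mrr = mean_reciprocal_rank(ranks)
hits3 = hits_at_k(ranks, 3)
hits10 = hits_at_k(ranks, 10)
```

    Both measures lie between 0 and 1, and Hits@k can only grow as k grows, which matches the ordering of the Hits@1, Hits@3, and Hits@10 values reported for each dataset.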

  • Deng Lu, Hu Po, Li Xuanhong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0034

    [Objective] To map biomedical text to the super thesaurus of the biomedical field in order to obtain the biomedical terms contained in the text and their corresponding concepts, and to integrate these terms and concepts as background knowledge into the text summarization model, improving the quality of the summaries it generates for biomedical texts.

    [Methods] The method first obtains the important content of the text through extractive summarization. It then matches this content against the biomedical knowledge base to extract the terms it contains and their corresponding knowledge base concepts, and integrates them as background knowledge into the attention mechanism of the neural abstractive summarization model. Under the guidance of domain knowledge, the model can focus on the important information inside the text while suppressing the noise that may arise from introducing external information, which significantly improves the quality of summary generation.

    [Results] The experimental results on three biomedical field data sets verify the effectiveness of the proposed method. The average ROUGE of the proposed model PG-meta on the three data sets reaches 31.06, which is 1.51 higher than the average ROUGE of the original PG model.

    [Limitations] The impact of different ways of acquiring background knowledge in biomedical fields on the effectiveness of model enhancement remains to be further explored.

    [Conclusions] The proposed method can help the model better learn the deep meaning of biomedical texts and improve the quality of summary generation.


  • Cao Zhe, Guo Huilan, Wu Jiang, Hu Zhongyi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0371

    [Objective] From the perspective of technology-user interaction, the gap between users’ realistic perception of technology and the ideal technical requirements of the metaverse is investigated, and optimization suggestions for relevant technology are proposed.


    [Methods] Based on user reviews of 64 VR products on the JD platform, the LDA topic model and the BERT language model are combined to construct indicators of attention and affection, so as to quantitatively analyze users' perception of VR technology. A comparative analysis is conducted based on the objective attributes of VR products and the technical requirements of the metaverse.


    [Results] Five perceived attributes (function, quality control, use feeling, marketing, and audio-visual experience) are extracted from user reviews. Audio-visual experience receives the highest attention and affection, whereas marketing receives the lowest. Three attributes (function, use feeling, and audio-visual experience) show eight progressive or regressive manifestations across the four dimensions of technical requirements in the metaverse (immersion experience, accessibility, interoperability, and scalability): high immersion, sensory imbalance, multiple connections, time and space constraints, multiplayer interaction, mobile obstacles, multi-functional design, and equipment problems.


    [Limitations] The diversity and balance of samples need to be improved, and extended research on other types of metaverse technology equipment is not included.


    [Conclusions] It can be learnt from the process of perceived attributes extraction, perceptual preference recognition and perceptual degree analysis that VR products can meet the technical requirements of the metaverse in immersion experience, but there is still a long way to go to achieve accessibility, interoperability and scalability. Taking objective attributes of products into consideration, a reference for the optimization of the technology in the metaverse can be provided.

  • Yang Defang, Tang Li
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0428

    [Objective] Responsible research and innovation is an important topic in global scientific and technological competition and sustainable development. This paper analyzes the general situation, knowledge base, and research hotspots of responsible research and innovation based on international literature. [Coverage] We searched the three core databases of the Web of Science using “responsible research and innovation” and “responsible innovation” as keywords, retrieving a total of 657 English-language articles. [Methods] This paper combines bibliometrics and visual analysis to investigate the status quo and to mine the knowledge base and research hotspots of responsible research and innovation. [Results] The results show that scholars in the Netherlands and the United Kingdom have led responsible research and innovation. China's international publications start in 2014, with a total of 14 papers. This research also finds that research in the field is based on technology assessment and anticipatory governance, conceptual development in the EU context, conceptual speculation, and strengthening. Research hotspots focus on science, society and governance, conceptual framework and practice, ethics and value of technology development, and sustainability research. [Limitations] The data range of the review should be further expanded, and the dynamic evolution of hotspots should be further analyzed. [Conclusions] This study calls on Chinese scholars to follow international trends in future research on responsible research and innovation, and to combine them with China's unique research problems and research practice to safeguard the responsible development of emerging technologies.

  • Zhang Yongwei, Liu Ting, Liu Chang, Wu Bingxin, Yu Jingsong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0093

    [Objective] This study aims to explore an efficient method for retrieving syntactic information in large text corpora.

    [Methods] Linearized indices are created for syntactic information in line with the features of syntactic information. They can directly provide information required for conditional matching during retrieval and improve retrieval efficiency.

    [Results] An experiment is conducted, using People's Daily Corpus, which contains 28.51 million sentences, to test the speed of queries. The results show that the average time for 26 queries is 802.6 milliseconds, which meets the retrieval efficiency requirements of retrieval systems for large corpora.

    [Limitations] More research is needed to examine the proposed method with more queries.

    [Conclusions] The method proposed in this study can help to quickly retrieve lexical, dependency syntactic, and constituency syntactic information in large text corpora.


  • Chen Yuanyuan, Ma Jing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1362

    [Objective] To address the low prediction accuracy of existing multimodal sarcasm detection models and the difficulty of fusing multimodal features, this paper designs an SC-attention fusion mechanism.

    [Methods] The CLIP and RoBERTa models are used to extract features from three modalities: image, image attributes, and text. The SC-attention mechanism combines SENet's attention mechanism with the co-attention mechanism to fuse the multimodal features. Guided by the original modal features, attention weights are allocated appropriately. Finally, the features are fed into a fully connected layer for sarcasm detection.

    [Results] The experimental results show that multimodal sarcasm detection based on the SC-attention mechanism achieves an accuracy of 93.71% and an F1 value of 91.89%. Compared with the existing model on the same dataset, accuracy increases by 10.27% and the F1 value by 11.5%.

    [Limitations] The generalization of the model needs to be verified on more datasets.

    [Conclusions] The model proposed in this paper reduces information redundancy and feature loss, and effectively improves the accuracy of multimodal sarcasm detection.


  • Zeng Wen, Wang Yuefen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0161

    [Objective] To study the construction of combined core patent identification methods and compare their application, taking into account the diversity of identification indicators, the combination of different weighting and ranking algorithms, and the characteristics of large-scale datasets. [Methods] Five combined identification methods are constructed through cross-combination, and six patent features are selected. Taking the field of artificial intelligence as an example, the characteristics and application scenarios of each method are compared at the overall and local levels. [Results] The combined identification methods remain highly consistent when applied to different datasets and time periods. At the same time, as the number of core patents to be identified increases, the coincidence rate between pairs of methods gradually decreases. For example, the core patent coincidence rate of method 1 and method 4 drops from 80% to 47%. [Limitations] The methods are applied to only one field, and the application characteristics of the combined methods can be explored further. [Conclusions] The five combined identification methods can be applied to different result requirements and specific core patent identification situations, depending on the scale, dispersion, time span, and feature value performance of the patent datasets and on differences in the development of technical fields. For the rapidly developing field of artificial intelligence, entropy weighting combined with grey relational analysis and entropy weighting combined with TOPSIS achieve better identification results.
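    One of the combinations named in the conclusions, entropy weighting followed by TOPSIS, can be sketched in its textbook form as below. The patent feature values are invented, all features are assumed to be benefit criteria (larger is better), and the sketch does not reproduce the six features or the preprocessing used in the paper.

```python
import math

def entropy_weights(matrix):
    """Entropy weight method: columns with more dispersion get more weight.

    Assumes non-negative values, at least two rows, and at least one
    column whose values are not all identical.
    """
    rows, cols = len(matrix), len(matrix[0])
    k = 1.0 / math.log(rows)
    divergences = []
    for j in range(cols):
        column = [row[j] for row in matrix]
        total = sum(column)
        entropy = 0.0
        for value in column:
            p = value / total
            if p > 0:  # treat 0 * log(0) as 0
                entropy -= k * p * math.log(p)
        divergences.append(1.0 - entropy)
    weight_sum = sum(divergences)
    return [d / weight_sum for d in divergences]

def topsis_scores(matrix, weights):
    """TOPSIS closeness to the ideal solution, benefit criteria only."""
    cols = len(matrix[0])
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(cols)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(cols)] for row in matrix]
    best = [max(row[j] for row in weighted) for j in range(cols)]
    worst = [min(row[j] for row in weighted) for j in range(cols)]
    scores = []
    for row in weighted:
        d_best = math.sqrt(sum((row[j] - best[j]) ** 2 for j in range(cols)))
        d_worst = math.sqrt(sum((row[j] - worst[j]) ** 2 for j in range(cols)))
        scores.append(d_worst / (d_best + d_worst))
    return scores

# Hypothetical patents (rows) described by three made-up features (columns).
patents = [[10, 3, 5], [6, 2, 4], [2, 1, 1]]
weights = entropy_weights(patents)
scores = topsis_scores(patents, weights)
```

    Patents with the highest closeness scores would be taken as core patents. In this toy input the first patent is best on every feature and the last is worst on every feature, so they receive the extreme scores.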

  • Wang Dailin, Liu Lina, Liu Meiling, Liu Yaqiu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1317

    [Objective] Existing recommendation algorithms mostly recommend books according to their title, keywords, and abstract, or mine readers' interest preferences from their book browsing behavior. However, they ignore readers' attention to the book's content framework, namely its catalog. Because existing methods fail to capture readers' interest in the book catalog, recommendation accuracy is unsatisfactory. To solve this problem, a reader preference analysis method based on a book catalog attention mechanism and a personalized recommendation model, IABiLSTM, are proposed.

    [Methods] The semantic features of books are extracted from the book title and catalog content: a BiLSTM network captures the long-distance dependencies and word-order context of the text, and a two-layer attention mechanism strengthens the deeper semantic expression of the book catalog features. Readers' historical browsing behavior is analyzed, and an interest function is used to fit and quantify readers' interest. The semantic features of books are combined with readers' interest to generate a reader preference vector, and the similarity between the semantic feature vector of each candidate book and the reader preference vector is calculated to predict the score and complete personalized book recommendation.

    [Results] MSE, Precision, and Recall were evaluated on the Douban Reading and Amazon datasets. When N is 50, the results are 1.14% and 1.20%, 89% and 75%, and 85% and 73%, respectively, superior to the comparison models, which verifies that the proposed model effectively improves the accuracy of book recommendation.

    [Limitations] The model is only validated on the Douban Reading and Amazon datasets, and its generalization performance on other datasets needs to be further verified.

    [Conclusions] By increasing the attention paid to the book catalog and analyzing readers' historical browsing behavior, the model effectively expresses readers' interests and preferences and makes an important contribution to improving the accuracy of book recommendation. The proposed model is suitable for recommendation tasks based on implicit preference mining from book content and readers' browsing behavior, and can also provide a useful reference for other common NLP tasks.
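    The final matching step, comparing a reader preference vector with candidate book vectors, might look like the sketch below. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the vectors and book names are invented for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend_top_n(preference, candidates, n):
    """Return the n candidate names most similar to the preference vector."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine_similarity(preference, candidates[name]),
        reverse=True,
    )
    return ranked[:n]

# Hypothetical two-dimensional semantic vectors.
reader = [1.0, 0.0]
books = {"book_a": [1.0, 0.0], "book_b": [0.0, 1.0], "book_c": [1.0, 1.0]}
top_two = recommend_top_n(reader, books, 2)
```

    In the paper the vectors come from the BiLSTM and attention layers and the interest function; here they are fixed by hand so that only the ranking logic is shown.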

  • Zhao Pengwu, Li Zhiyi, Lin Xiaoqi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1079

    [Objective] This paper studies the extraction of dynamic semantic features in Chinese entity relationship tasks and the recognition of Chinese character relationships. [Methods] Using a public corpus of character entity relationships, an improved convolutional neural network combined with an attention mechanism automatically extracts features from the training data. The experimental results are compared and verified along several dimensions: the entity relationship recognition efficiency of different models, the extraction performance for different relationship labels, and the extraction efficiency with different vector training sets. [Results] Experimental results show that the CNN+Attention model is superior to the SVM, LR, LSTM, BiLSTM, and CNN models in prediction accuracy and overall performance on the Chinese character relationship extraction task. Compared with the BiLSTM model, which has the best extraction performance among the baselines, it is 0.9% higher in accuracy, 0.8% higher in recall, and 0.8% higher in F1 value. [Limitations] Only a single sample data source is used; multiple data source channels have not been explored, and the sample dataset is not broad enough. [Conclusions] The convolutional neural network based on the attention mechanism can effectively improve the accuracy and recall of entity relationship extraction in the task of Chinese character relationship extraction.

  • Zhang Zhipeng, Mao Yusheng, Zhang Liyi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1303

    [Objective] An opinion reason sentence classification model is proposed to mine opinion reason sentences from reviews on online booking platforms. [Methods] First, a pretraining corpus containing millions of online reviews is constructed, and an ORSC dataset is manually annotated to test the proposed model. Next, the text features of the ORSC dataset are extracted by adding the constructed corpus to the ERNIE model. Finally, the BiLSTM model is used to merge all the features and identify the reviews containing opinion reasons. [Results] On the ORSC dataset, the DERNIE model reaches an accuracy of 91.33% and an F1 value of 91.20%. After BiLSTM feature fusion, the accuracy improves to 94.57% and the F1 value to 94.62%. [Limitations] The pre-trained language models require a large amount of data in the additional corpus, which affects computational speed and efficiency. [Conclusions] The feature extraction and fusion method based on the DERNIE-BiLSTM model can mine opinion reason sentences in online reviews more accurately.