
Online first

The manuscripts published below will continue to be available from this page until they are assigned to an issue.
  • Wanying Lv, Jie Zhao, Liushen Huang, Zhenning Dong, Zhouyang Liang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1153

[Objective] This study uses the ideas of feature grouping and feature combination: grouping provides replaceable features for trust evaluation when data are missing and reduces the search space, while combination reduces the feature dimensionality and further alleviates the difficulty of trust evaluation caused by missing data. [Methods] Features with similar distinguishing ability are grouped using Markov Blankets by analyzing the relationships between the features' distinguishing abilities; an RVNS-based search within and between groups then completes the feature combination. [Results] When feature values are missing, the method effectively provides substitute features while keeping trust evaluation stable; the feature dimensionality is reduced to 1.7%, and the average trust evaluation accuracy exceeds 92%. [Limitations] This study only discusses ways to alleviate the missing-data problem; how to exploit the knowledge contained in missing-value data can be explored in the future. [Conclusions] We integrate feature grouping and combination into an efficient trust evaluation model that alleviates the problems caused by missing data in trust evaluation from both sides.

  • Chen Wen, Chen Wei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0075

[Objective] Emerging topics contained in multi-source data are identified, and a multivariable LSTM with bibliometric indicators is established to predict the popularity of emerging topics.

[Methods] Firstly, topics in fund projects, papers and patents are identified. Secondly, emerging topics are screened according to their novelty, growth and persistence. Finally, a topic popularity indicator is designed, and the popularity score of emerging topics is predicted with a multivariable LSTM model using four bibliometric indicators: fund amount, fund count, average citation frequency per article, and number of patent IPC subclasses.

[Results] Taking the field of solid oxide fuel cells as an example, the multivariable LSTM with bibliometric indicators outperforms BP, KNN, SVM and univariate LSTM models, with the lowest MAE (16.534) and RMSE (23.494) and the highest R2 (0.642).

[Limitations] Patent citation counts and other indicators are not selected as input variables because it is difficult to obtain their values for each time slice.

[Conclusions] The inclusion of bibliometric indicators improves the popularity prediction of emerging topics.
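The data preparation behind such a multivariable LSTM can be sketched as follows: each training sample is a window of consecutive time slices over the four bibliometric indicators, with the popularity score of the following slice as the target. A minimal pure-Python sketch; the indicator values and window length are hypothetical, not the paper's data.

```python
# Sketch: building (timesteps, features) training windows for a multivariable
# LSTM that predicts next-period topic popularity from bibliometric indicators.

def make_windows(series, lookback):
    """series: list of feature vectors per time slice (oldest first), with the
    popularity score as the last column. Returns (X, y): X[i] is a window of
    `lookback` steps of indicator columns, y[i] the popularity that follows."""
    X, y = [], []
    for i in range(len(series) - lookback):
        window = [row[:-1] for row in series[i:i + lookback]]  # indicator columns
        X.append(window)
        y.append(series[i + lookback][-1])                     # next popularity
    return X, y

# columns: fund amount, fund count, avg citations per article, IPC subclasses, popularity
history = [
    [1.2, 3, 0.8, 2, 10.0],
    [1.5, 4, 1.1, 2, 12.5],
    [1.9, 4, 1.3, 3, 15.0],
    [2.4, 5, 1.6, 3, 18.2],
]
X, y = make_windows(history, lookback=2)
```

Each `X[i]` then has the `(timesteps, features)` shape an LSTM layer expects for one sample.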


  • Hu Jiming, Qian Wei, Wen Peng, Lv Xiaoguang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1167

[Objective] To improve the accuracy of text representation and the effect of subsequent text mining, the structural and functional information of Chinese medical records is used to enrich the semantic content of the text representation.

[Methods] Based on the structure-function features of Chinese medical records, this research proposes a new semantic representation strategy for the text. A BiLSTM-CRF model recognizes named entities based on text structure, introducing entity and structure information at the word-vector level; a TextCNN model then extracts local context features, yielding a vector representation with richer semantic content.

[Results] In the medical entity recognition experiment, the precision, recall and F value of structure-function-based entity recognition reached 93.20%, 95.19% and 94.19% respectively; in the text classification experiment used to verify the proposed text representation, the classification accuracy reached 92.12%.

[Limitations] The method needs to be verified on more texts, and the structure recognition process needs refinement, so that the proposed method better serves text mining.

[Conclusions] The proposed method introduces the structure-function information of medical records into text representation. Experiments show that it not only effectively improves the accuracy of named entity recognition, but also enriches the semantic content of the text and improves the representation effect.


  • Yang Yang, Jang Kaizhong, Yuan Mingjun, Hui Lanxin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0115

[Objective] To address the problem that the traditional LDA model requires the number of topics to be specified in advance, an adaptive method for determining the topic number is proposed for news topic recognition.

[Methods] News data are represented from two views, semantics and time series, to obtain the corresponding feature vectors. The Co-DPSC algorithm co-trains the two views to obtain a semantic feature matrix that incorporates temporal effects; after dimensionality reduction, density peak clustering is applied to the rows of this matrix, and the number of resulting clusters is taken as the optimal number of topics.

[Results] The experiments show that considering semantic and temporal factors improves the precision and F value of the optimal topic number: precision increases by 35.09% and the F value by 15.39%.

[Limitations] The keyword set is clustered, so the way keywords are obtained affects the clustering quality and running time to some extent. Because the method requires both the textual and temporal elements of news data, it is limited for other types of data.

[Conclusions] Experiments show that by combining the timeliness and content of news data when considering news categories, the method improves the accuracy of the optimal topic number to a certain extent.
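The density-peak step that yields the topic number can be sketched as follows: points with both high local density (rho) and a large distance to any denser point (delta) are taken as cluster centres, and the centre count becomes the topic number. A toy sketch on 2-D points with a Gaussian-kernel density and an illustrative threshold, not the paper's Co-DPSC features:

```python
from math import exp

# Sketch of density peak clustering used to pick a cluster (topic) count.

def density_peaks(points, d_c):
    n = len(points)
    dist = [[sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for q in points]
            for p in points]
    # Gaussian-kernel local density
    rho = [sum(exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]
    # delta: distance to the nearest denser point (farthest point for a peak)
    delta = []
    for i in range(n):
        higher = [dist[i][j] for j in range(n) if rho[j] > rho[i]]
        delta.append(min(higher) if higher else max(dist[i]))
    return rho, delta

# two tight blobs -> two density peaks
pts = [(0, 0), (0.1, 0), (-0.1, 0), (5, 5), (5.1, 5), (4.9, 5)]
rho, delta = density_peaks(pts, d_c=0.5)
gamma = [r * d for r, d in zip(rho, delta)]
n_topics = sum(1 for g in gamma if g > 1.0)   # illustrative threshold
```

Points inside a blob have small delta (a denser neighbour sits 0.1 away), so only the two blob centres get a large rho-times-delta product and `n_topics` comes out as 2.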

  • Yang Meifang, Yang Bo
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1308

[Objective] To effectively learn the textual characteristics and contextual semantic relevance of the enterprise risk domain and improve entity extraction performance in this domain. [Method] An entity extraction model for the enterprise risk domain based on stroke ELMo embeddings and IDCNN-CRF is proposed. First, a bidirectional language model is pre-trained on large-scale unstructured enterprise risk data to obtain stroke ELMo vectors as input features; these are fed into an IDCNN network for training, a CRF layer then processes the IDCNN output, and the globally optimal entity sequence labeling for the enterprise risk domain is obtained. [Results] The experimental results show that the model achieves an F value of 91.5% for entity extraction in the enterprise risk domain, 2% higher than the BiLSTM-CRF deep neural network model, with a test speed 2.36 times faster. [Limitations] Fusing additional text features on top of the stroke-based ELMo character vectors could further improve Chinese entity extraction, and the model's generalizability to entity extraction tasks in other fields is not considered. [Conclusion] This article gives the specific process of applying the model, providing a reference for constructing entity corpora in the enterprise risk domain.
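The IDCNN in the model above gains speed over a BiLSTM by stacking 1-D convolutions with increasing dilations, which widens the receptive field exponentially while staying parallelizable. A minimal pure-Python sketch with a toy all-ones kernel and an impulse input; the kernel and dilation schedule are illustrative:

```python
# Sketch of the iterated dilated convolution (IDCNN) idea: dilations 1, 2, 4
# let the impulse at the centre reach every position of the sequence.

def dilated_conv1d(xs, kernel, dilation):
    """Same-length 1-D convolution with zero padding and the given dilation."""
    out = []
    for i in range(len(xs)):
        s = 0.0
        for k, w in enumerate(kernel):
            j = i + (k - len(kernel) // 2) * dilation
            if 0 <= j < len(xs):
                s += w * xs[j]
        out.append(s)
    return out

signal = [0, 0, 0, 1, 0, 0, 0]          # unit impulse at the centre token
kernel = [1.0, 1.0, 1.0]
h = signal
for d in (1, 2, 4):                     # iterated, growing dilations
    h = dilated_conv1d(h, kernel, d)
support = sum(1 for v in h if v != 0)   # positions reached by the impulse
```

After three layers the width-3 kernel's effective receptive field is 15 tokens, so the impulse influences all 7 positions; a non-dilated stack of the same depth would reach only 7 at best and grow linearly.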

  • Li Xiaomin, Wang Hao, Li Yueyan, Zhao Meng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0183

[Objective] Geographical names are a product of human society reaching a certain stage of development, and they evolve continuously as society develops. Studying their evolution with linked data technology helps geographical names better play their role in cultural inheritance.

[Method] This paper constructs CGNE_Onto, a knowledge base on the evolution of Chinese geographical names. Strong and weak marker words for evolution types are formulated to identify evolution-type sentences in the historical evolution data, and a BERT-BiLSTM-CRF model then recognizes the time and place-name entities in those sentences. The recognized entities are used as classes to build the ontology knowledge base; the resulting knowledge base on the evolution of administrative place names is visualized from the perspectives of direct and indirect path relationships, and the number of evolution types in each dynasty and the reasons for their formation are analyzed statistically.

[Result] The experimental results show that the proposed model can clearly and intuitively display the evolution of geographical names, providing a new idea for the analysis and mining of geographical name data.

    [Limitations] Due to the small scale of the dataset in this paper, the evolution feature words also have certain limitations.

    [Conclusion] The knowledge base of place name evolution constructed in this paper can intuitively and clearly show the evolution of place names from ancient times to the present, as well as the evolution types of various dynasties.

  • Zhao RuiJie, Tong XinYu, Liu XiaoHua, Lu YongHe
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1414

[Objective] A new entity recognition model is proposed to improve the effectiveness of medical entity recognition, enable the mining of new medical knowledge, and increase the utilization of medical scientific papers.

[Methods] An Att-BiLSTM-CRF-based medical entity recognition model was constructed and tested on the public datasets GENIA Term Annotation Task and BioCreative II Gene Mention Tagging for F1 values and accuracy, respectively. The model was then used to annotate the abstracts of biomedical scientific papers.

[Results] The experimental results show that the model is superior to the two benchmark models. The F1 values on the two data sets are 81.57% and 84.23%, and the accuracies are 92.51% and 97.85%, respectively. Moreover, the model has a greater advantage on data sets with extremely unbalanced data.

[Limitation] The data volume and application scope of the entity labeling experiments are relatively limited and could be further expanded.

[Conclusion] The medical entity recognition model based on Att-BiLSTM-CRF can improve the effectiveness of entity recognition and realize the mining of new medical knowledge.


  • Cheng Peng, Chunxia Zhang, Xin Zhang, Jingtao Guo, Zhendong Niu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0225

[Objective] To solve the problems of incomplete entity information extraction and the lack of importance measurement of different timestamps for the event to be reasoned about in temporal knowledge graph reasoning. [Methods] A temporal knowledge graph reasoning model based on entity multiple unit encoding (EMUC) is proposed. EMUC introduces three entity feature encodings: entity slice features for the current timestamp, entity dynamic features that fuse timestamp embeddings with entity static features, and entity segment features that remain relatively stable over historical time steps. Furthermore, a temporal attention mechanism learns the importance weights of local structural information at different timestamps for the inference target. [Results] On the ICEWS14 test set the model achieves MRR 0.4704, Hits@1 40.31%, Hits@3 50.02%, Hits@10 59.98%; on ICEWS18, MRR 0.4385, Hits@1 37.55%, Hits@3 46.92%, Hits@10 56.85%; and on YAGO, MRR 0.6564, Hits@1 63.07%, Hits@3 65.87%, Hits@10 68.37%, outperforming existing methods on these metrics. [Limitations] EMUC suffers from slow inference on large-scale datasets. [Conclusions] EMUC captures multiple entity features in the temporal knowledge graph, including entity slice, dynamic and segment features, and the designed temporal attention mechanism measures the importance of historical local structure for reasoning, effectively improving reasoning performance on temporal knowledge graphs.
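The temporal attention step described above can be sketched generically: each historical timestamp's local-structure embedding is scored against the query event's embedding, and softmax weights mix the histories into one context vector. The 3-d vectors below are toy stand-ins, not EMUC's learned encodings:

```python
from math import exp

# Sketch of attention over historical timestamps: similar histories get
# higher softmax weight in the mixed context vector.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def temporal_attention(query, histories):
    scores = [dot(query, h) for h in histories]
    m = max(scores)                                   # stabilise the softmax
    weights = [exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    context = [sum(w * h[k] for w, h in zip(weights, histories))
               for k in range(len(query))]
    return weights, context

query = [1.0, 0.0, 0.0]
histories = [[1.0, 0.0, 0.0],   # timestamp t-3: similar to the query event
             [0.0, 1.0, 0.0],   # t-2: unrelated structure
             [0.0, 0.0, 1.0]]   # t-1: unrelated structure
weights, context = temporal_attention(query, histories)
```

The timestamp whose structure resembles the query receives the largest weight, which is exactly the "importance of different timestamps" the model measures.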

  • Deng Lu, Hu Po, Li Xuanhong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0034

[Objective] Biomedical text is mapped to a biomedical metathesaurus to obtain the biomedical terms it contains and their corresponding concepts, and these terms and concepts are integrated into a text summarization model as background knowledge to improve the quality of summaries generated for biomedical text.

[Methods] The method first obtains the important content of a text through extractive summarization, then combines this content with a biomedical knowledge base to extract the terms it contains and their corresponding knowledge-base concepts, and integrates them as background knowledge into the attention mechanism of a neural abstractive summarization model. Guided by domain knowledge, the model can focus on the important information inside the text while suppressing the noise that may arise from introducing external information, significantly improving summary quality.

[Results] The experimental results on three biomedical data sets verify the effectiveness of the proposed method. The average ROUGE of the proposed PG-meta model on the three data sets reaches 31.06, which is 1.51 higher than that of the original PG model.

    [Limitations] The impact of different ways of acquiring background knowledge in biomedical fields on the effectiveness of model enhancement remains to be further explored.

    [Conclusions] The proposed method can help the model better learn the deep meaning of biomedical texts and improve the quality of abstract generation.


  • Cao Zhe, Guo Huilan, Wu Jiang, Hu Zhongyi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0371

    [Objective] From the perspective of technology-user interaction, the gap between users’ realistic perception of technology and the ideal technical requirements of the metaverse is investigated, and optimization suggestions for relevant technology are proposed.


    [Methods] Based on user reviews of 64 VR products on JD platform, the mixed methods of LDA topic model and BERT language model are used to construct the indicators of attention and affection, so as to quantitatively analyze the users’ perception of VR technology. The comparative analysis is conducted based on the objective attributes of VR products and the technical requirements of the metaverse.


[Results] Five perceived attributes (function, quality control, use feeling, marketing and audio-visual experience) are extracted from user reviews. Audio-visual experience has the highest attention and affection, whereas marketing has the lowest. Three attributes (function, use feeling and audio-visual experience) show eight progressive or regressive manifestations across the four dimensions of the metaverse's technical requirements (immersive experience, accessibility, interoperability and scalability): high immersion, sensory imbalance, multiple connections, time and space constraints, multiplayer interaction, mobility obstacles, multi-functional design and equipment problems.


    [Limitations] The diversity and balance of samples need to be improved, and extended research on other types of metaverse technology equipment is not included.


[Conclusions] The extraction of perceived attributes, recognition of perceptual preferences and analysis of perception levels show that VR products can meet the metaverse's technical requirement of immersive experience, but accessibility, interoperability and scalability remain distant goals. Taking the objective attributes of products into consideration, the findings provide a reference for optimizing metaverse technology.

  • Yang Defang, Tang Li
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0428

[Objective] Responsible research and innovation is an important topic in global scientific and technological competition and sustainable development. This paper analyzes the overall situation, knowledge base, and research hotspots of responsible research and innovation based on the international literature. [Coverage] Using "responsible research and innovation" and "responsible innovation" as keywords, we searched the three core databases of the Web of Science and retrieved a total of 657 English articles. [Methods] This paper combines bibliometrics and visual analysis to investigate the status quo and to mine the knowledge base and research hotspots of responsible research and innovation. [Results] The results show that scholars in the Netherlands and the United Kingdom have led responsible research and innovation. China's international publications started in 2014, with a total of 14 papers. The research in this field is based on technology assessment and anticipatory governance, conceptual development in the EU context, conceptual speculation, and strengthening. Research hotspots focus on science, society and governance; conceptual frameworks and practice; the ethics and value of technology development; and sustainability research. [Limitations] The data range of the review should be further expanded, and the dynamic evolution of hot spots further analyzed. [Conclusions] This study urges Chinese scholars to follow international trends in responsible research and innovation and to combine them with China's distinctive research problems and practice to safeguard the responsible development of emerging technologies.

  • Zhang Yongwei, Liu Ting, Liu Chang, Wu Bingxin, Yu Jingsong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0093

    [Objective] This study aims to explore an efficient method for retrieving syntactic information in large text corpora.

[Methods] Linearized indices are created for syntactic information in line with its features; during retrieval they directly provide the information required for conditional matching, improving retrieval efficiency.

[Results] An experiment is conducted on the People's Daily Corpus, which contains 28.51 million sentences, to test query speed. The results show that the average time for 26 queries is 802.6 milliseconds, which meets the retrieval efficiency requirements of retrieval systems for large corpora.

[Limitations] More research is needed to examine the proposed method with more queries.

[Conclusions] The method proposed in this study can help to quickly retrieve lexical, dependency syntactic and constituency syntactic information in large text corpora.
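The linearization idea can be sketched as an inverted index: each dependency triple is flattened into one key, so a syntactic query becomes a direct posting-list lookup instead of a tree traversal at query time. The sentences, relation names and key shapes below are toy assumptions, not the paper's actual index layout:

```python
from collections import defaultdict

# Sketch of a linearized syntactic index: (token, relation, head) triples
# are flattened into dictionary keys pointing at sentence ids.

def build_index(parsed_sentences):
    index = defaultdict(set)
    for sid, triples in enumerate(parsed_sentences):
        for token, rel, head in triples:
            index[(token, rel, head)].add(sid)   # exact-triple key
            index[(token, rel, None)].add(sid)   # token + relation wildcard
    return index

corpus = [
    [("cat", "nsubj", "sleeps"), ("the", "det", "cat")],   # sentence 0
    [("dog", "nsubj", "barks"), ("cat", "obj", "chases")], # sentence 1
]
index = build_index(corpus)
nsubj_cat = sorted(index[("cat", "nsubj", None)])  # sentences where "cat" is a subject
```

The trade-off is classic inverted-index economics: extra index size (one entry per materialized key pattern) buys constant-time conditional matching at query time.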


  • Chen Yuanyuan, Ma Jing
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1362

[Objective] To solve the problems of low prediction accuracy and difficult fusion of multimodal features in existing multimodal sarcasm detection models, this paper designs an SC-attention fusion mechanism.

[Methods] CLIP and RoBERTa models extract features from three modalities: image, image attributes and text. The SC-attention mechanism combines SENet's attention mechanism with co-attention to fuse the multimodal features; guided by the original modal features, attention weights are allocated reasonably. Finally, the fused features are fed to a fully connected layer for sarcasm detection.

[Results] The experimental results show that multimodal sarcasm detection based on the SC-attention mechanism reaches an accuracy of 93.71% and an F1 score of 91.89%. Compared with models on the same data set, the accuracy of this model increases by 10.27% and the F1 value by 11.5%.

[Limitations] The generalization of the model needs to be verified on more data sets.

[Conclusions] The proposed model reduces information redundancy and feature loss, effectively improving the accuracy of multimodal sarcasm detection.
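The SENet side of SC-attention can be sketched as the "squeeze and excitation" step: squeeze each modality's feature vector to one statistic, pass it through a small gate, and rescale the features channel-wise. The gate weights and feature vectors below are hand-picked toys, not learned parameters:

```python
from math import exp

# Sketch of SENet-style squeeze-and-excitation over per-modality features.

def sigmoid(x):
    return 1.0 / (1.0 + exp(-x))

def squeeze_excite(channels, gate_weights):
    squeezed = [sum(c) / len(c) for c in channels]            # squeeze: mean pool
    gates = [sigmoid(w * s) for w, s in zip(gate_weights, squeezed)]
    return [[g * v for v in c] for g, c in zip(gates, channels)]

# three "modalities": text, image, image-attribute feature vectors (toy values)
features = [[2.0, 2.0], [0.1, 0.1], [1.0, 1.0]]
scaled = squeeze_excite(features, gate_weights=[1.0, 1.0, 1.0])
```

Modalities with stronger activations are gated up and weak ones are damped, which is one way such a fusion mechanism suppresses redundant modality information before the co-attention step.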


  • Zeng Wen, Wang Yuefen
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0161

[Objective] From the comprehensive perspective of diverse identification-index information combined with different weighting and ranking algorithms, and considering the characteristics of large-scale data sets, this paper studies the construction and comparative application of core patent portfolio identification methods. [Methods] Five combined identification methods are constructed through cross-combination, and six items of patent feature information are selected. Taking the field of artificial intelligence as an example, the characteristics and application scenarios of each method are compared at the overall and local levels. [Results] The different combined identification methods remain highly consistent when applied to different data sets and time periods. At the same time, as the number of core patents to be identified increases, the coincidence rate between any two methods gradually decreases; for example, the core patent coincidence rate of methods 1 and 4 drops from 80% to 47%. [Limitations] The methods are applied to only one field, and the application characteristics of the combined methods can be explored further. [Conclusions] The five combined identification methods can be applied to different result requirements and specific situations of core patent identification, depending on the scale, dispersion, time span and feature values of the patent data set and differences in the development of technical fields. For the rapidly developing field of artificial intelligence, entropy-weight weighting combined with grey relational analysis and entropy-weight weighting combined with TOPSIS achieve the best identification results.

  • Wang Dailin, Liu Lina, Liu Meiling, Liu Yaqiu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1317

[Objective] Existing recommendation algorithms mostly recommend books according to their titles, keywords and abstracts, or mine readers' preferences from their book browsing behavior. However, they ignore readers' attention to a book's content framework: the catalog. To address the loss of recommendation accuracy caused by failing to express readers' concern for the catalog, a reader preference analysis method based on an attention mechanism over the book catalog and a personalized recommendation model, IABiLSTM, are proposed.

[Methods] The semantic features of a book are extracted from its title and catalog content: a BiLSTM network captures long-distance dependencies and word-order context, and a two-layer attention mechanism enhances the deeper semantic expression of catalog features. Readers' historical browsing behavior is analyzed, and an interest function fits and quantifies their interest. The book semantics and reader interest are combined into a reader preference vector; the similarity between a candidate book's semantic feature vector and the preference vector predicts the score, completing personalized book recommendation.

[Results] MSE, Precision and Recall are evaluated on the Douban Reading and Amazon data sets. With N = 50, the results are 1.14% and 1.20%, 89% and 75%, and 85% and 73% respectively, outperforming the comparison models and verifying that the proposed model effectively improves book recommendation accuracy.

[Limitations] The model is only validated on the Douban Reading and Amazon data sets; its generalization to other data sets needs further verification.

[Conclusions] By strengthening attention to the book catalog and analyzing readers' historical browsing interactions, we effectively express readers' interests and preferences, making an important contribution to improving book recommendation accuracy. The proposed model is suited to recommendation tasks based on implicit preference mining from book content and readers' browsing behavior, and can also provide an important reference for other common NLP tasks.
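The final scoring step described in the methods can be sketched as cosine similarity between the reader preference vector and each candidate book's semantic vector, with the top-N books recommended. The vectors and titles below are toy stand-ins for the model's learned representations:

```python
# Sketch of preference-vector scoring: rank candidate books by cosine
# similarity to the reader's preference vector and keep the top N.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def recommend(preference, books, n):
    ranked = sorted(books, key=lambda kv: cosine(preference, kv[1]), reverse=True)
    return [title for title, _ in ranked[:n]]

preference = [0.9, 0.1, 0.0]                          # fitted reader interest
books = [("Deep Learning Basics", [1.0, 0.0, 0.0]),
         ("Cooking at Home",      [0.0, 0.0, 1.0]),
         ("Applied NLP",          [0.7, 0.3, 0.0])]
top = recommend(preference, books, n=2)
```

Cosine similarity ignores vector magnitude, so a book whose semantic direction matches the reader's interests ranks high regardless of how "strong" its feature vector is.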

  • Zhao Pengwu, Li Zhiyi, Lin Xiaoqi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1079

[Objective] The paper studies the extraction of dynamic semantic features for entity relationships in Chinese text and the recognition of relationships between Chinese person entities. [Methods] Using a public corpus of person entity relationships, an attention mechanism combined with an improved convolutional neural network automatically extracts features from the training data. The experimental results are compared and verified along several dimensions: the entity relationship recognition efficiency of different models, the extraction effect for different relationship labels, and the extraction efficiency for different vector training sets. [Results] Experimental results show that the CNN+Attention model outperforms SVM, LR, LSTM, BiLSTM and CNN models in prediction accuracy and overall performance on the Chinese person relationship extraction task, exceeding the relatively strong BiLSTM model by 0.9% in accuracy, 0.8% in recall and 0.8% in F1 value. [Limitations] Only a single data source is used; multiple data source channels have not been explored, and the sample data set is not broad enough. [Conclusions] A convolutional neural network with an attention mechanism can effectively improve the accuracy and recall of entity relationship extraction for Chinese person relationships.

  • Zhang Zhipeng, Mao Yusheng, Zhang Liyi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1303

[Objective] An opinion reason sentence classification model is proposed to mine the opinion reason sentences of reviews on online booking platforms. [Methods] Firstly, a pre-training corpus containing millions of online reviews is constructed and an ORSC dataset is manually annotated to test the proposed model. The text features of the ORSC dataset are then extracted by adding the constructed corpus to an ERNIE model. Finally, a BiLSTM model merges all the features and identifies reviews containing opinion reasons. [Results] On the ORSC dataset, the DERNIE model reaches an accuracy of 91.33% and an F1 value of 91.20%; after BiLSTM feature fusion, accuracy improves to 94.57% and F1 to 94.62%. [Limitations] The pre-trained language models require a large amount of additional corpus data, which affects computational speed and efficiency. [Conclusions] The feature extraction and fusion method based on the DERNIE-BiLSTM model can mine opinion reason sentences in online reviews more accurately.

  • Hua Bin, Kang Yue, Fan Linhao
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0185

Keywords: Intelligent question answering; Text mining; E-government; Policy knowledge model; Knowledge graph; Knowledge aggregation

  • Hu Zhongyi, Zhang Shuoguo, Wu Jiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0141

    [Objective] To alleviate the problem of inadequate URL representation in phishing websites identification, this study proposes an identification model based on URL multi-granularity feature fusion.

    [Methods] Character-level and word-level features of URLs are deeply learned based on one-hot encoding and BERT, respectively, and then an identification model is constructed by fusing the deep features of both granularities.

[Results] The accuracy, recall, F value, and ROC value of the proposed model fusing multi-granularity URL features reach 96.1%, 0.98, 0.97, and 0.97, respectively, outperforming single-granularity feature representation models, benchmark classifiers, and previous state-of-the-art models.

[Limitations] Beyond URL feature representation, more features, including page content, remain to be extracted and incorporated.

    [Conclusion] The proposed model can represent URL features more comprehensively and deeply, which effectively improves the identification performance of phishing websites.


  • CAO Lina, ZHANG Jian, CHEN Jindong, FAN Hui
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0078

[Objective] To solve the difficulty of accurately profiling the quality of micro, small, and medium enterprises (MSMEs), comprehensive quality profiling technology for MSMEs based on deep learning is studied.

[Methods] A comprehensive quality profiling system for MSMEs with five dimensions (quality innovation ability, process quality control, product quality level, operational quality and risk, and financial quality) is proposed, with diversified profiling methods designed for the different data types of the indicators. Focusing on web text data such as quality spot-check reports and user comments, a deep-learning-based technique for constructing comprehensive quality profiles of MSMEs is proposed.

[Results] The empirical results show that, in terms of F value, the pre-trained BERT model recognizes the three types of quality entities 4.66%, 1.99%, and 4.25% better than the benchmark model, respectively, and the review classification model based on pre-trained Word2Vec is 6.03% higher than the traditional TF-IDF model.

    [Limitations] Limited by the availability of data, more dimensions of portraits related to enterprise quality need to be further optimized and improved.

[Conclusions] Deep learning expands the dimensions and improves the accuracy of enterprise quality profiling, providing technical support for service model innovation by enterprise quality service organizations.


  • Qu Zongxi, Sha Yongzhong, Li Yutong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1269

[Objective] Building an accurate and effective forecasting model for major infectious diseases based on multiple machine learning methods can predict outbreak trends and help formulate countermeasures in advance.

[Methods] Based on the Grey Wolf Optimization algorithm, optimal weight combinations of three machine learning models (ANFIS, LSSVM and LSTM) are searched to establish an ensemble prediction model. Experiments assess the model's prediction performance on COVID-19 epidemic data.

    [Results] The results show that ANFIS, LSSVM, and LSTM were best suited to the confirmed-case, death-case, and recovered-case scenarios, respectively; the R2 of the ensemble prediction model based on Gray Wolf Optimization reached 0.987, 0.993, and 0.987 for the three scenarios, and its average RMSE was reduced by 38.79%, 64.40%, and 53.88% compared to the single models, respectively.

    [Limitations] The model needs to be further verified by using other major infectious disease epidemic data sets.

    [Conclusions] Different machine learning models have their own prediction performance, and the ensemble prediction model based on Gray Wolf Optimization can effectively merge the advantages of multiple machine learning models to obtain stable and accurate prediction results.
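The weight-search step at the core of the ensemble can be sketched in a minimal form. The sketch below substitutes a plain random search for the Gray Wolf Optimization step; the function names and toy data are illustrative, not the authors' implementation:

```python
import random

def rmse(pred, actual):
    return (sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(pred)) ** 0.5

def ensemble(preds, weights):
    # weighted sum of each model's prediction at every time step
    return [sum(w * p[i] for w, p in zip(weights, preds))
            for i in range(len(preds[0]))]

def search_weights(preds, actual, iters=2000, seed=0):
    # stand-in for Gray Wolf Optimization: random search over normalized weights
    rng = random.Random(seed)
    best_w, best_err = None, float("inf")
    for _ in range(iters):
        raw = [rng.random() for _ in preds]
        total = sum(raw)
        w = [x / total for x in raw]  # weights sum to 1
        err = rmse(ensemble(preds, w), actual)
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err
```

Gray Wolf Optimization would replace the random proposals with wolf-pack position updates, but the objective, the RMSE of the weighted ensemble, is the same.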


  • Gao Jinsong, Zhang Qiang, Li Shuaike, Sun Yanling, Zhou Shubin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1413

    [Objective]To explore poets' changes in spatio-temporal trajectory and emotional dimension, providing a new research perspective for knowledge discovery in the humanities.

    [Context]To improve the visualization of the current digital humanities research process and the readability of its results, the poet's emotional trajectory in time and space is expressed by applying ontology technology and GIS technology, providing new research ideas and visualization methods for scholars in related fields.

    [Methods]Taking Li Bai as an example, a poet ontology model is built and the poet's related concepts and relationships are modeled as knowledge; GIS technology is then used to display the changes in Li Bai's spatio-temporal emotional trajectory and to explore the tacit knowledge behind it.

    [Results]Li Bai's life trajectory spanned more than half of China, with Nanjing the most frequently visited place. Spatially, Dangtu was where Li Bai was both "sorrowful and joyful", while Nanjing was his "sorrowful" place. Temporally, Li Bai was more "joyful" than "sorrowful" in his youth, more "sorrowful" than "joyful" in middle age, and both "sorrowful and joyful" in his later years.

    [Conclusions]This paper provides practical experience for the study of poet's emotional trajectories in time and space, and provides new ideas and methods for the study of related issues in the humanities field.


  • Liu Linlin, Gong Daqin, Zhang Yujie, Bai Rujiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1042

    [Objective] Discovering potential technological opportunities is of great significance for promoting scientific and technological progress. This paper proposes a causal-knowledge-guided technology opportunity discovery method to improve the effectiveness of technology opportunity identification, taking the electric vehicle charging pile as an example. [Methods] A three-step method is proposed: automatic extraction of causal pairs, construction of a causal network, and technology opportunity matching. First, using rule matching, causal pairs contained in multi-source data are automatically extracted based on causal trigger words and rule templates, and represented as triples. Then, a causal network including technical elements is constructed, and demand factors in the process of user use are found through emotion recognition and demand word extraction. Finally, link prediction over the causal network completes potential causal correlations, which are matched with user demand factors to realize the discovery of technology opportunities. [Results] Battery performance and price of the charging pile are found to be the key factors for improving technical performance and user satisfaction, respectively. Comparing the two algorithms, GraphSAGE predicts edge connections more accurately than node2vec and can effectively identify potential technical opportunities of the charging pile. [Limitations] Due to the sparsity of the causal network, accuracy needs to be improved. [Conclusions] The proposed method can effectively promote the discovery of science and technology innovation opportunities, find potential uncertain problems, and provide guidance and reference for further technological optimization and industrial upgrading.
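The rule-matching step for causal pair extraction can be illustrated with a small sketch. The trigger words and regular expressions below are hypothetical stand-ins for the paper's rule templates, and the example sentences are invented:

```python
import re

# Hypothetical trigger-word templates; real rule sets would be far richer.
PATTERNS = [
    re.compile(r"(?P<cause>[\w\s]+?)\s+(?:leads to|causes|results in)\s+(?P<effect>[\w\s]+)"),
    re.compile(r"(?P<effect>[\w\s]+?)\s+(?:is caused by|results from)\s+(?P<cause>[\w\s]+)"),
]

def extract_causal_triples(sentence):
    """Return (cause, relation, effect) triples matched by the rule templates."""
    triples = []
    for pat in PATTERNS:
        for m in pat.finditer(sentence):
            triples.append((m.group("cause").strip(), "causes", m.group("effect").strip()))
    return triples
```

Each extracted triple would then become an edge in the causal network over which link prediction is run.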

  • Ou Guiyan, Pang Na, Wu Jiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1233

    [Purpose] We aim to examine the factors that may affect the patent examination cycle and explore the mechanism behind it in the field of artificial intelligence in China. [Method] Taking 78,254 invention patent applications in the field of artificial intelligence in China as the research object, this article uses the Kaplan-Meier method from survival analysis and the Cox proportional hazards regression model to give an overview of patent examination in the field, and explores the factors that significantly affect the patent examination cycle based on the characteristics of patent objects and patent subjects. [Results] The results show that the average survival period of the Chinese invention patent examination process in the field of AI is 32.81 months. The number of claims, the number of IPC classification codes, and the number of inventors are protective factors of the patent examination cycle, prolonging it; the number of patent citations is a risk factor: the more patent citations, the shorter the time required to obtain authorization. Among applicant types, universities and scientific research institutions, as well as institutions and organizations, all spend less time on patent examination than individuals. Surprisingly, company applicants reduce the hazard rate of application-to-authorization, implying a longer patent examination cycle. [Limitations] The patent examination cycle is closely related to the examination process of the patent office and the personal characteristics of patent examiners; this article could not obtain finer-grained data on these for analysis.
    [Conclusion] To optimize the patent examination procedure and shorten the patent granting cycle, this paper proposes combining different technical fields and applicant characteristics to establish a diversified examination mode, strengthening the use of automated technology in the patent examination process, and establishing classification examination standards to improve overall patent examination efficiency.
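The Kaplan-Meier estimator used for the examination-cycle overview can be sketched in a few lines. Here a grant is the event of interest and still-pending applications are treated as censored; the toy durations are illustrative, not the paper's data:

```python
def kaplan_meier(durations, events):
    """Kaplan-Meier survival curve S(t).

    durations: time to grant or censoring (e.g. months in examination)
    events: 1 = granted (event observed), 0 = censored (still pending)
    Returns a list of (t, S(t)) pairs at each observed event time.
    """
    event_times = sorted(set(d for d, e in zip(durations, events) if e == 1))
    curve, s = [], 1.0
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        d_t = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        s *= 1.0 - d_t / at_risk  # product-limit update
        curve.append((t, s))
    return curve
```

The Cox model would then regress the hazard of grant on covariates such as claim count and applicant type; libraries such as lifelines implement both estimators.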

  • Zhang Wanshu, Yao Haitao, Wang Xuefeng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1439

    [Objective] To explore the characteristics of discipline distribution in China, the US and the UK, taking ESI highly cited papers as research objects. [Methods] From the perspective of diversity, fusing sub-category and text content, we constructed three indicators: discipline variety, discipline balance and discipline disparity, and analyzed their trends using five-year time windows. [Results] There is still a gap between China and the US and the UK in the diversity of Social Sciences and Biomedical Sciences, in the balance of Engineering, Mathematics, and Environment & Ecology, and in the disparity of Computer Sciences, Geosciences, and Plant & Animal Sciences. Some indicators show an upward trend. [Limitations] The threshold of discipline coverage needs further discussion, and the differing contribution of the author's country rank remains to be considered. [Conclusions] The research helps provide new ideas for discipline evaluation and future discipline distribution.
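One common way to operationalize the three diversity indicators is sketched below (the paper's exact definitions may differ): variety as the count of occupied disciplines, balance as Shannon evenness, and disparity as the mean pairwise distance between occupied disciplines, in the Rao-Stirling tradition:

```python
import math

def variety(shares):
    """Number of disciplines with a non-zero share of papers."""
    return sum(1 for p in shares if p > 0)

def balance(shares):
    """Shannon evenness: H / ln(variety); 1.0 means perfectly even shares."""
    ps = [p for p in shares if p > 0]
    if len(ps) < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in ps)
    return h / math.log(len(ps))

def disparity(shares, dist):
    """Mean pairwise distance between occupied disciplines.

    dist[i][j] is a dissimilarity between disciplines i and j
    (e.g. 1 - cosine similarity of their text content)."""
    idx = [i for i, p in enumerate(shares) if p > 0]
    pairs = [(i, j) for i in idx for j in idx if i < j]
    if not pairs:
        return 0.0
    return sum(dist[i][j] for i, j in pairs) / len(pairs)
```

With a country's per-discipline paper shares and a discipline dissimilarity matrix, the three indicators can be tracked over each five-year window.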

  • Wang Yan, Xu Meimei, Tong Yujia, Gou Huan, Shan Zhiyi, An Xinying
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0012

    [Objective] To build and evaluate a prediction and early-warning model of circulatory system disease death using machine learning, so as to provide a reference for disease prevention.

    [Methods] Death data for circulatory system diseases in a region of China from 2014 to 2018 were analyzed, and prediction models were constructed with GAM, RF and XGBoost. A distributed lag non-linear model was used to calculate cumulative lag effects, and the early-warning model was constructed and evaluated.

    [Results] The cumulative lag analysis found that continuous low and high temperatures, long sunshine hours and high concentrations of environmental pollutants increase the risk of death from circulatory system diseases, with cumulative seven-day relative risks of 1.236, 1.130, 1.56, 1.062, 1.218, 1.153 and 1.796 respectively. The RMSE of the RF and XGBoost models is 4.979 and 5.341, showing good performance. Age, sex, temperature, sunshine hours and the concentrations of SO2, NO2, CO, O3, PM10 and PM2.5 are the screened characteristic variables, and the warning thresholds were determined from the screened cumulative-lag-effect data, with good early-warning performance. The sensitivity, specificity and area under the curve of the XGBoost predictions were 0.948, 0.939 and 0.941 respectively.

    [Limitations] There is a lack of independent data on concomitant diseases.

    [Conclusions] The increase in the number of deaths in this area is related to older age, male sex, and increases in temperature, sunshine hours and pollutant concentrations. The prediction and early-warning model constructed with XGBoost performs well and can provide a reference for disease prevention and intervention by relevant departments.


  • Zhou Ning, Jin Gaoya, Shi Wenqian
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1162

    [Objective] This paper proposes an entity coreference resolution model that integrates a neural network with global reasoning, to address complex entity information in text, its ambiguity, and the sparse distribution of referential information, and to explore more effective coreference resolution methods. [Methods] The paper first uses a neural network model to extract entities and their antecedents in a document, then performs global reasoning with the sentence's context information, and adds the reasoning results back into the neural network model to improve the accuracy of entity coreference resolution. [Results] Experiments on the OntoNotes5.0 dataset verify the effectiveness of the proposed model. The coreference resolution algorithm integrating a neural network with global reasoning effectively improves coreference performance and better captures text semantics, reaching an F1 score of 74.76% under the CoNLL evaluation standard. [Limitations] More precise knowledge reasoning needs to be added. [Conclusion] Comparison with other coreference resolution models from recent years proves the practicability and effectiveness of this model.

  • Meng Fansi, Zhong Han, Shi Shuicai, Xie Zekun
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0067

    [Objective]This paper studies the differences in public opinion on the three-child policy across provinces.

    [Context]Analyses of public opinion on the three-child policy tend to treat opinion across the whole network as a single whole, ignoring the different demands and concerns of groups in different provinces. Existing text research on three-child-policy public opinion also suffers from simple methods and single text sources.

    [Methods]First, we analyse the popularity of three-child public opinion from a statistical perspective using time-series methods. Sentiment is then analysed with an SVM model, negative opinion is identified, keywords are extracted with a CRF model, and word clouds are formed. Public opinion texts from different provinces are studied to obtain province-level negative-opinion word clouds, and political and economic statistics are compared with each province's negative keywords to analyse their association.

    [Results]The experimental results show that the popularity of the three-child policy is higher than that of other policies in the same period. Public opinion is dominated by neutral sentiment (60.56%), supplemented by positive sentiment (35.15%), with a small amount of negative opinion (4.29%). Public opinion concerns differ across provinces, and these differences are related to political, economic and ecological differences among the provinces.

    [Conclusions]Public opinion guidance and supervision of the three-child policy should consider the actual conditions of different provinces, respond to people's concerns and timely follow up relevant supporting measures.


  • Yang Haolin, Dong Yongquan, Chen Huafeng, Zhang Guoxi
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0286

    [Objective] The purpose is to address the problem that most existing methods focus only on the multi-truth attribute itself, neglecting the influence of auxiliary attributes, and to improve the F1 of multi-truth discovery. [Methods] We use auxiliary attributes to calculate source expertise and consensus degree, and combine them with the activity degree of multi-truth attribute values to obtain each source's support for conflicting data. An existing truth discovery method is called to obtain pseudo labels of truths, a neural network is used to capture the complex relationship between sources and conflicting data, and finally all truths are deduced. [Results] The experimental results show that F1 is improved by 2% on the book dataset and 5% on the movie dataset compared with the suboptimal model. [Limitations] This method incorporates auxiliary attributes that reflect object features; the impact of the remaining auxiliary attributes on multi-truth discovery has not been explored. [Conclusions] The method fusing multi-truth attributes with auxiliary attributes improves the F1 of multi-truth discovery.

  • Yu Yan, Zhu ShengCheng
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1458

    Keywords: patent keyword extraction; restriction relationship; claim; TextRank

  • Feng Xiaodong, Hui Kangxin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0038

    [Objective] This paper explores an effective topic clustering method to address the semantic sparsity and multiple interactions of social media text data. [Methods] The multiple interaction relationships between users and online content in social media are modeled as a heterogeneous information network. The representation of the text is first obtained by word embedding methods as the input feature; based on a heterogeneous graph neural network, node representations are propagated and aggregated, the representations of text nodes and other nodes are trained and learned, and an unsupervised clustering algorithm is applied to the text representations for topic clustering. [Results] The experimental results show that the NMI of the proposed method reaches 0.83 and 0.86 for post and comment clustering respectively on the English benchmark dataset, higher than traditional LDA or direct clustering on word or text embedding vectors, including Word2Vec, Doc2Vec and GloVe. [Limitations] Due to data limitations, the model does not capture social relationships between users or the multimedia content of online information. [Conclusions] The proposed model can effectively improve text topic clustering performance by modeling the multiple interactions of social media content.
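The NMI score used to evaluate the clustering can be computed directly from label co-occurrence counts. A minimal sketch (not the authors' evaluation code; sklearn's `normalized_mutual_info_score` is the usual choice in practice):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two clusterings (sqrt normalization)."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    mi = sum((c / n) * math.log((c / n) / ((ca[a] / n) * (cb[b] / n)))
             for (a, b), c in joint.items())
    ha = -sum((c / n) * math.log(c / n) for c in ca.values())
    hb = -sum((c / n) * math.log(c / n) for c in cb.values())
    denom = math.sqrt(ha * hb)
    return mi / denom if denom > 0 else 1.0  # degenerate single-cluster case
```

NMI is invariant to label permutation, which is why identical partitions with swapped cluster IDs still score 1.0.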

  • Tang Jiao, Zhang Lisheng, Sang Chunyan
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1376

    [Objective]To make up for the shortcomings of existing news recommendation in using content information and in exploring long- and short-term user interests, by considering users' current concerns and stable preferences while making full use of the textual and additional information of news.

    [Methods]We build a news representation model that integrates textual information such as title and abstract with additional information such as explicit and latent topics, and a user representation model that characterizes long- and short-term user interests by exploring the user's current concerns and stable preferences.

    [Results] Under four evaluation indices, our proposed model scores 69.51%, 34.09%, 37.25%, 43.00% and 66.05%, 30.93%, 34.30%, 40.46% respectively on two large-scale news recommendation datasets, higher than seven advanced baseline models.

    [Limitations]We don't give enough consideration to users with few historical behaviors, so the following research will focus on the cold-start users.

    [Conclusions]We obtained informative news and user representation vectors using advanced natural language processing techniques, and the design of our proposed model effectively improves news recommendation performance.


  • Huang Xuejian, Liu Yuyang, Ma Tinghuai
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0071

    [Objective] To solve the over-smoothing problem of traditional graph neural networks, realize adaptive weight allocation across different depths and different neighbors of the graph neural network, and improve the performance of academic paper classification.

    [Methods] An improved graph neural network model for academic paper classification, based on a multi-head attention mechanism and a residual network structure, is proposed. First, the multi-head attention mechanism learns a variety of relational features between documents and realizes adaptive weighting of different neighbor nodes. Then, the residual structure aggregates the outputs of each layer's nodes, giving the model an adaptive aggregation radius. Finally, the improved graph neural network learns a feature representation for each node in the paper citation graph, and the features are input into a multi-layer fully connected network to obtain the final classification result.

    [Results] The experimental results on large-scale real datasets show that the accuracy of the model reaches 61%, which is 4% and 14% higher than that of the traditional GCN and Transformer models, respectively.

    [Limitations] The classification accuracy of samples with a small proportion of categories and samples that are difficult to distinguish is not high.

    [Conclusions] The improved graph neural network can effectively avoid the over-smoothing problem and realize the adaptive allocation of different weights.
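The per-neighbor attention weighting combined with a residual (skip) connection can be sketched as follows. The scoring functions stand in for learned attention heads and are purely illustrative, not the paper's architecture:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_aggregate(h_self, h_neigh, score_fns):
    """Multi-head neighbor aggregation with a residual connection.

    h_self: feature vector of the target node
    h_neigh: list of neighbor feature vectors
    score_fns: one scoring function per head (hypothetical learned scorers)
    """
    dim = len(h_self)
    head_outs = []
    for score in score_fns:
        # attention weights over neighbors for this head
        alphas = softmax([score(h_self, h) for h in h_neigh])
        head_outs.append([sum(a * h[k] for a, h in zip(alphas, h_neigh))
                          for k in range(dim)])
    # average the heads, then add the residual to counter over-smoothing
    agg = [sum(ho[k] for ho in head_outs) / len(head_outs) for k in range(dim)]
    return [agg[k] + h_self[k] for k in range(dim)]
```

The residual term keeps each node's own features in the output, which is the usual remedy for over-smoothing as depth grows.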

  • Jia Minghua, Wang Xiuli
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2022-0009

    [Objective]To prevent and control financial risks by quantifying the logical relationships of financial risks, and to address the unreliability of word-frequency-based quantification of financial events.

    [Methods] A quantitative analysis method of logical relation of financial risk based on BERT and mutual information combined with domain knowledge was proposed, and the relation was quantified in COPA and financial domain data sets.

    [Results] BERT and mutual information can effectively solve the problem of unreliable word-frequency-based quantification. The accuracy of quantifying the logical relations of financial risk reached 80.1%, an improvement of 3.09%~37.39% over the benchmark models. [Limitations] Only financial corpora are considered, and the method's effect on non-financial corpora remains to be tested.

    [Conclusions] This method can reveal the evolutionary path of financial risk events and improve the effect of quantitative financial risk logical relationship.
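The mutual-information side of the quantification can be illustrated with document-level pointwise mutual information between two event terms. The term pairs and documents below are hypothetical examples, not from the paper's corpus:

```python
import math

def pmi(docs, x, y):
    """Pointwise mutual information of terms x and y over a document collection.

    docs: iterable of sets of tokens (one set per document)
    Positive PMI means x and y co-occur more often than chance.
    """
    docs = list(docs)
    n = len(docs)
    cx = sum(1 for d in docs if x in d)
    cy = sum(1 for d in docs if y in d)
    cxy = sum(1 for d in docs if x in d and y in d)
    if cxy == 0:
        return float("-inf")  # never co-occur
    return math.log((cxy / n) / ((cx / n) * (cy / n)))
```

In the paper's setting, scores like this (reweighted with BERT-based context similarity and domain knowledge) replace raw word-frequency counts when quantifying causal links between risk events.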


  • You Xindong, Yuan Menglong, Zhang Le, Lv Xueqiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1369

    [Objective] To solve the problem of insufficient accuracy in defect word recognition in the consumer product field, this paper proposes a CNN model based on sememes and multiple features to improve recognition accuracy.

    [Methods] First, the model input is a distributed word vector fused with sememes; part-of-speech features and randomly embedded word-position vectors are added to enrich the information in the word vector. Max pooling is removed from the structure, so the depth vectors output by the convolution kernels retain fuller information for word classification.

    [Results] Compared with the CNN model that only adds word position vectors, the method proposed in this paper improves the accuracy, recall and F1 values by 2.1%, 0.2% and 1.2%, respectively.

    [Limitations] The polarity recognition of the same expression in different scenarios is insufficient.

    [Conclusions] Through ablation experiments, it is proved that the sememe, part-of-speech, and the removal of pooling layer can help improve the performance of domain word recognition model.


  • Cheng Quan, She Dexin
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1452

    [Objective] To enhance the scientific soundness and rationality of drug recommendation in intelligent diagnosis and treatment, patient signs and historical medication data are integrated into the drug recommendation process using a graph neural network.

    [Methods] In this study, we model the transitive relationship between abnormal signs and drugs to achieve precise drug recommendation with sign perception based on Graph Neural Network. By constructing the “sign-patient-drug” heterogeneous graph, the node representation with sign perception is learned by the R-GCN encoder. We further integrate the abnormal signs into the recommendation by designing a sign-aware interaction decoder.

    [Results] An empirical study on drug recommendation was conducted on the diagnosis and treatment data of three types of diseases in the MIMIC-III dataset. Compared with the SVD, NeuMF and NGCF models, the drug recommendation method proposed in this study improved Recall@20 by 13.49%, 12.36% and 1.91% respectively, and NDCG@20 by 16.69%, 13.75% and 8.22% respectively.

    [Limitations] The dynamic changes of patient drug use with time were not considered in this method.

    [Conclusions] Compared with other methods, the drug recommendation method based on graph neural network, which integrates patient physical signs information and medication data, is effective and feasible, and can perceive the impact of patient signs on medication, which also provides a basis for the research of accurate drug recommendation by integrating multi-dimensional information.
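The two evaluation metrics, Recall@k and NDCG@k, can be computed as sketched below (binary relevance; the drug IDs are illustrative placeholders):

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of truly prescribed drugs appearing in the top-k recommendations."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG@k: DCG over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0
```

NDCG rewards placing the correct drugs near the top of the ranked list, which is why it complements the position-agnostic recall.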


  • Dong Wenhui, Xiong Huixiang, Du Jin, Wang Niuniu
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1457

    [Objective] To help scholars quickly find suitable scientific research partners, promote scientific research output and enhance academic exchanges.

    [Methods] Using the LDA topic model, the PageRank algorithm and social network analysis, this paper comprehensively and deeply mines four dimensions of scholars' characteristics: natural attributes, interest attributes, ability attributes and social attributes, to construct scholar portraits and recommend scientific research collaborators based on scholars' preferences.

    [Results] 14,007 documents, 13,292 citation records and 11,869 authors in the field of library and information science were obtained from CNKI and CSSCI to verify the proposed model. Twenty potential scientific research collaborators with similar and complementary research interests were recommended to the target scholars.

    [Limitations] This paper does not solve the cold-start problem well, ignores the differing contributions of authors at different positions in the author order when representing scholars' ability, and uses a limited data selection in the empirical study.

    [Conclusion] This model can effectively recommend potential scientific research collaborators with high authority, high relevance, and high matching characteristics such as scientific research productivity and social relations to target scholars, and has good application value.
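The PageRank component used for scholars' ability attributes can be sketched with a plain power-iteration implementation over a citation or co-authorship graph; the three-node graph in the test is a toy example, not the paper's data:

```python
def pagerank(adj, damping=0.85, iters=100):
    """Power-iteration PageRank.

    adj: {node: [nodes it links to]}; returns {node: rank}, ranks sum to 1.
    """
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v, outs in adj.items():
            if outs:
                share = rank[v] / len(outs)
                for u in outs:
                    new[u] += damping * share
            else:
                # dangling node: distribute its rank evenly
                for u in nodes:
                    new[u] += damping * rank[v] / n
        rank = new
    return rank
```

In a citation graph, a high PageRank marks an author whose work is cited by other highly cited authors, a natural proxy for scholarly authority.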

  • Shi Yunmei, Yuan Bo, Zhang Le, Lv Xueqiang
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1245

    [Objective]Aiming at the proliferation of fake comments published by the "Internet Water Army" on e-commerce websites, a fake comment detection method (IMTS) that integrates image information and text semantics is proposed for Chinese e-commerce website comments.

    [Methods]The IMTS method uses a text convolutional neural network (TextCNN) and the BERT pre-trained model to extract features from review texts and obtain the corresponding feature vectors. Reviewer features are then integrated, and the model's capture of overall semantic information is further enhanced by concatenating the review-text semantics with the output features of the reviewer ID. A Residual Network (ResNet) then extracts features from the pictures users post in comments to obtain the corresponding visual features, and finally text features and visual features are fused multimodally to detect fake comments.

    [Results]The IMTS method achieves 96.36% accuracy, 96.35% recall and 96.35% F1 value on the self-built multimodal Chinese fake comment dataset.

    [Limitations]Due to computing power constraints, the dataset in this paper is small in scale, and because the BERT pre-trained model is used in the text processing stage, the time cost is high for large-scale data.

    [Conclusions] It is effective to use multi-modal thinking and feature fusion method to supplement the fake comment text to detect fake comments. This method can effectively improve the overall detection accuracy of fake comments.

  • Ding Hao, Hu Guangwei, Wang Ting, Suo Wei
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1464

    [Objective] A latent factor decomposition model based on time-series drift was proposed to capture user interest trends and improve recommendation accuracy. [Methods] The temporal dynamic evolution of user preference and the influence of users' past behavior on current behavior were combined to build the model: an auxiliary matrix captures the evolution relationship between a user's two periods, and a time impact factor balances the influence of current and past behavior. [Results] Compared with the baseline method, accuracy improved by 40%, 3.75% and 19.8% on average on three experimental datasets, indicating the effectiveness of the proposed algorithm. [Limitations] Because the evolution analysis of interest drift relies on users' historical data, when historical data are too sparse, other user information should be used for cold start. [Conclusion] Experimental comparison shows that the model generalizes better to fluctuating interests, analyzes the evolution trend of user interest more accurately, and effectively improves recommendation performance.

  • Chen Donghua, Zhang Runtong
    Data Analysis and Knowledge Discovery. https://doi.org/10.11925/infotech.2096-3467.2021-1312

    [Objective] Topic mining and multi-label classification are leveraged to provide decision support for vaccine adverse event (VAE) monitoring and public opinion analysis.

    [Methods] We propose a latent Dirichlet allocation-based VAE topic modeling method with the use of domain knowledge and accordingly develop a public opinion analysis method for vaccine-associated posts based on different strategies of multi-label classification. Finally, we discuss the relationships between user vaccine-related sentiments and the patterns of online user behaviors.

    [Results] The use of sentiment dictionaries and MedDRA terminology sets improves the accuracy of VAE-related sentiment analysis by up to 15.17%. The One-vs-Rest-based methods achieve an accuracy of up to 97.15% while the other methods merely achieve an average accuracy of 80%.

    [Limitations] Many non-standard terms in VAE-associated posts on social media greatly affect vaccine-related information extraction. Further use of controlled medical terminology and multimodal information analysis would improve the accuracy of vaccine-related public opinion analysis.

    [Conclusions] VAE topic mining and sentiment analysis improve the accuracy of public opinion analysis and decision support for people after massive vaccination.
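The One-vs-Rest strategy trains one independent binary classifier per label. A minimal sketch, using a toy nearest-centroid learner in place of the paper's classifiers, with hypothetical adverse-event labels:

```python
def fit_centroid(samples, ys):
    """Trivial binary learner: nearest class centroid (illustrative only)."""
    pos = [s for s, y in zip(samples, ys) if y]
    neg = [s for s, y in zip(samples, ys) if not y]
    cp = [sum(v) / len(pos) for v in zip(*pos)] if pos else None
    cn = [sum(v) / len(neg) for v in zip(*neg)] if neg else None
    def clf(x):
        if cp is None:
            return False
        if cn is None:
            return True
        dp = sum((a - b) ** 2 for a, b in zip(x, cp))
        dn = sum((a - b) ** 2 for a, b in zip(x, cn))
        return dp <= dn
    return clf

def train_one_vs_rest(samples, labelsets, fit_binary=fit_centroid):
    """One independent binary classifier per label (One-vs-Rest)."""
    labels = sorted({l for ls in labelsets for l in ls})
    return {l: fit_binary(samples, [l in ls for ls in labelsets]) for l in labels}

def predict_labels(models, x):
    return {l for l, m in models.items() if m(x)}
```

Because each label gets its own classifier, a post can receive several adverse-event labels at once, which is the property that distinguishes One-vs-Rest from single-label strategies.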