Table of Contents

25 July 2022, Volume 6 Issue 7
    

    Original article
  • Liu Chunjiang, Li Shuying, Hu Hanlin, Fang Shu
    Data Analysis and Knowledge Discovery. 2022, 6(7): 1-11. https://doi.org/10.11925/infotech.2096-3467.2021.1168

    [Objective] This paper systematically reviews the progress and trends of graph database research and applications for complex network analysis. [Coverage] We searched the Web of Science, Scopus, and CNKI databases for Chinese and English literature. A total of 15 graph databases and open-source packages, 21 practical cases, and 14 research papers were retrieved. [Methods] First, we compared the mainstream graph database products from China and abroad. Then, we explored the latest solutions for complex network analysis, including algorithms (such as centrality, path finding, link prediction, and community detection), graph visualization, performance, and related applications. [Results] Graph databases have become an important analysis tool and research method for complex network analysis and big data mining. They also work closely with graph computing engines for complex network analysis. [Limitations] This paper only examined a few representative cases. [Conclusions] Graph databases can effectively query, represent, and analyze complex network data for patterns or structures. Their presentation of multi-dimensional data is crucial for mining implicit relationships.

  • Zhang Jinzhu, Wang Qiuyue, Qiu Mengmeng
    Data Analysis and Knowledge Discovery. 2022, 6(7): 12-31. https://doi.org/10.11925/infotech.2096-3467.2022.0142

    [Objective] This paper reviews the literature on identifying disruptive technologies, aiming to examine research topics and development trends and to establish a framework for further studies. [Coverage] We searched CNKI and Web of Science for Chinese and English papers using relevant keywords. We retrieved 1,974 papers published between 2011 and 2020 for quantitative analysis, and 61 papers published between 2001 and 2020 for qualitative analysis. [Methods] First, we identified the popular topics and development trends through quantitative analysis. Then, we examined the highly cited papers and the latest literature to review their research methods. Finally, we built a framework based on the results of the quantitative and qualitative analyses, which also predicted future trends. [Results] Studies identifying disruptive technologies were more popular in the fields of information technology, medical treatment, the chemical industry, and high-end manufacturing. They used multiple methodologies from the perspectives of the technologies themselves, products, sci-tech information mining, and the external environment. We established three frameworks for disruptive technology identification and explored some future developments. [Limitations] Research on macro indicators, such as society- and economy-related issues, needs to be reviewed more comprehensively. [Conclusions] Research on disruptive technology identification has become interdisciplinary and includes more quantitative methodologies and nonlinear algorithms based on deep learning.

  • Ding Hao, Hu Guangwei, Qi Jianglei, Zhuang Guangguang
    Data Analysis and Knowledge Discovery. 2022, 6(7): 32-43. https://doi.org/10.11925/infotech.2096-3467.2021.1148

    [Objective] This paper tries to find valuable content in a large body of medical literature, aiming to help physicians make diagnoses and to improve medical literature recommendation. [Methods] We proposed a new method based on the random forest model and keyword query expansion. First, we used the MeSH dictionary and an automatically constructed acronym dictionary to establish the complete relationship between keywords and corresponding articles at three levels: sentence, paragraph, and document. Then, we calculated multiple similarities between topics and articles. For each article, the PageRank and HITS Authority weights were calculated from the citation network of the literature set. [Results] Compared with the average of the 10 highest NDCG@100 values from the TREC clinical decision support follow-up evaluation, the overall average difference of the proposed method was within 0.9%, which is a very small gap. [Limitations] New literature and “Sleeping Beauty” literature may rank lower in retrieval because of low citation counts in the early stage. Our method cannot make accurate recommendations for these papers. [Conclusions] The proposed method effectively improves medical literature recommendation.
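The NDCG@100 figure reported above is a ranking metric. As a rough illustration of how such a score is computed, here is a minimal sketch. It assumes linear gain and a log2 position discount, and it takes the ideal ordering from the retrieved list only (a full evaluation would use all judged documents for the topic). The function names `dcg_at_k` and `ndcg_at_k` are hypothetical and are not taken from the paper or from the TREC tooling.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalised by the DCG of the same labels sorted best-first."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

For example, `ndcg_at_k([3, 2, 1], 3)` scores a perfectly ordered list, while swapping relevant and non-relevant items lowers the score.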

  • Li Hui, Hu Jixia, Tong Zhiying
    Data Analysis and Knowledge Discovery. 2022, 6(7): 44-55. https://doi.org/10.11925/infotech.2096-3467.2021.1296

    [Objective] This paper examines the evolution of research topics, which helps researchers quickly identify the status quo and trends in their fields. [Methods] First, we merged multi-source datasets and divided the domain research topics by time period. Then, we calculated topic importance from their popularity, density, and closeness centrality. Third, we used topic semantic similarity to identify related topics in adjacent time periods. Finally, we combined the fluctuation in topic importance with topic similarity to determine their evolution types and paths. [Results] We tested our model on papers on artificial intelligence and analyzed the changes in topics over the past 20 years. We identified the popular research topics and their evolution paths, which showed obvious thematic fusion and splitting across four periods. [Limitations] The topic naming rules could be more effective, and we could not show the whole life cycle of the booming artificial intelligence research. [Conclusions] The proposed model could effectively reveal the evolution of research topics.

  • Wu Jiang, Liu Tao, Liu Yang
    Data Analysis and Knowledge Discovery. 2022, 6(7): 56-69. https://doi.org/10.11925/infotech.2096-3467.2021.1449

    [Objective] This paper explores the patterns, evolutionary laws, and group differences of online users’ self-presentation topics, as well as their influence on community recognition. [Methods] First, we identified online users of the NetEase music community and constructed their profiles from the perspectives of qualification and participation. Then, we adopted the BERT model to cluster users’ short comments and identified their self-presentation topics. Third, we used cosine similarity to analyze the evolution of topics and group differences. Finally, we used covariance to analyze the impacts of self-presentation topics on community recognition. [Results] There are eight self-presentation topics; the proportion of “reviews” decreased while that of “recollection” increased. “Interaction” topics were more popular in the “relax” style than in others. The proportion of each topic at different times was almost the same. Under the theme of “recollection”, the cosine similarity of quality users was higher than that of other users. The cosine similarity of continuous participants was higher than that of inactive participants. The impact of users’ self-presentation topics on their community recognition was significant at the 0.1 level. [Limitations] More research is needed to examine users of other online communities. [Conclusions] “Recollection” is the most popular of the users’ self-presentation topics, which are affected by styles and time. The topics became more diverse as the community developed, and there were obvious differences among user groups.
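The cosine similarity used above to compare topics across periods and user groups can be sketched as follows. This is only an illustration, assuming each period or group is represented by a vector of topic proportions. The function name `cosine_similarity` and the zero-vector convention are assumptions, not details from the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumed convention: similarity with an all-zero vector is 0.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)
```

Two groups with proportional topic distributions score close to 1, and groups with no topics in common score 0.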

  • Guo Jinjing, Xia Guanghui, Huang Qi, He Liyun, Zhang Huabing
    Data Analysis and Knowledge Discovery. 2022, 6(7): 70-86. https://doi.org/10.11925/infotech.2096-3467.2021.1263

    [Objective] Online health communities provide new information for detecting adverse drug reaction (ADR) signals. This study identifies ADR signals from patients’ reviews and generates early warnings for potential side effects of antidiabetic drugs. [Methods] First, we retrieved patients’ reviews (adverse reactions) of antidiabetic drugs from the Ask a Patient website. Then, we combined natural language processing techniques and lexicons (UMLS and MedDRA) to normalize and map these reviews. Third, we constructed a drug-ADR co-occurrence matrix and used the PRR method to identify drug-ADR pairs meeting the signal detection threshold. Finally, we invited experts to interpret the extracted results, which were evaluated against Drugs.com standards. [Results] A total of 539 drug-ADR pairs were identified, with an overall identification accuracy of 85% and recall of 82%. [Limitations] The accuracy of identifying ADR terms was affected by the inclusion of non-ADR terms from MedDRA, such as examination, surgical operation, and social environment. [Conclusions] The proposed model enriches the data sources and methods of ADR signal detection.
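The co-occurrence and PRR step above can be illustrated with a short sketch. The proportional reporting ratio is PRR = [a / (a + b)] / [c / (c + d)], where a counts reports of the drug with the ADR, b the drug with other ADRs, c other drugs with the ADR, and d other drugs with other ADRs. The thresholds below (PRR of at least 2 and at least 3 co-occurrences) are commonly cited screening values and are an assumption here, since the abstract does not state the threshold used. The function name `prr_signals` is hypothetical.

```python
from collections import Counter

def prr_signals(reports, min_prr=2.0, min_count=3):
    """Return {(drug, adr): PRR} for pairs passing the assumed thresholds.

    reports: iterable of (drug, adr) pairs, one per normalised mention.
    """
    pair_counts = Counter(reports)
    drug_totals, adr_totals = Counter(), Counter()
    for (drug, adr), n in pair_counts.items():
        drug_totals[drug] += n
        adr_totals[adr] += n
    total = sum(pair_counts.values())

    signals = {}
    for (drug, adr), a in pair_counts.items():
        b = drug_totals[drug] - a   # this drug, other ADRs
        c = adr_totals[adr] - a     # other drugs, this ADR
        d = total - a - b - c       # other drugs, other ADRs
        if c == 0:
            continue                # PRR is undefined without a comparison rate
        prr = (a / (a + b)) / (c / (c + d))
        if a >= min_count and prr >= min_prr:
            signals[(drug, adr)] = prr
    return signals
```

Pairs seen only with a single drug are skipped in this sketch because the comparison rate is zero. A production pipeline would need an explicit policy for that case.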

  • Zhang Han, An Xinyu, Liu Chunhe
    Data Analysis and Knowledge Discovery. 2022, 6(7): 87-98. https://doi.org/10.11925/infotech.2096-3467.2021.1364

    [Objective] This paper constructs a cross-platform semantic knowledge graph with whole datasets, which helps us find novel drug knowledge. [Methods] First, we developed a new model for the proposed knowledge graph, which integrated semantic relations from PubMed, DrugBank and CTD, as well as knowledge fusion and attribute definition. Then, we conducted drug repositioning with pathway identification and link prediction to discover new treatments for cancers. [Results] The F-score of pathway identification (0.57) was better than that of link prediction (0.56). The more pathways that exist between drugs and indications, the greater the possibility of a positive prediction. [Limitations] Since the reasoning mechanism was based on the existing associations among knowledge units, it is hard to discover novel indications for drugs without known targets. It is also difficult to update the knowledge graph dynamically due to the huge data volume. [Conclusions] The proposed knowledge graph could effectively find new drug indications and improve the efficiency of drug research and development.

  • Zheng Jie, Huang Hui, Qin Yongbin
    Data Analysis and Knowledge Discovery. 2022, 6(7): 99-106. https://doi.org/10.11925/infotech.2096-3467.2022.0040

    [Objective] This paper constructs a model to match similar cases with integrated legal knowledge, aiming to improve the accuracy of case matching. [Methods] First, we concatenated the legal knowledge with the case texts, which helped the model learn the characteristics of legal knowledge and text information simultaneously. Then, we used the LSTM network to model the text segment by segment, which increased the length of text the model can accommodate. Finally, we used triplet loss and adversarial-based contrastive loss to jointly train the model and enhance its robustness. [Results] The proposed model significantly improved the accuracy of similar case matching, which is 7.07% higher than that of the baseline BERT model. [Limitations] We used longer text sequences for matching, which is more time-consuming than other models. [Conclusions] The proposed model has stronger matching and generalization ability, which helps legal case retrieval.
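The triplet loss mentioned above pulls an anchor case towards a similar case and pushes it away from a dissimilar one. A minimal sketch of the standard formulation follows, assuming Euclidean distance between embedding vectors and a margin of 1.0. Both choices are assumptions, and the adversarial-based contrastive loss from the abstract is not covered here. The function name `triplet_loss` is hypothetical.

```python
import math

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin)."""
    d_pos = math.dist(anchor, positive)  # distance to the similar case
    d_neg = math.dist(anchor, negative)  # distance to the dissimilar case
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the dissimilar case is at least `margin` farther from the anchor than the similar case, so training only updates on triplets that still violate that gap.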

  • Zhang Le, Du Yifan, Lü Xueqiang, Dong Zhian
    Data Analysis and Knowledge Discovery. 2022, 6(7): 107-117. https://doi.org/10.11925/infotech.2096-3467.2021.1307

    [Objective] This paper proposes an abstracting model for Chinese patents based on an integration strategy (STNLTP), aiming to reduce the duplication and long-document dependency issues of existing automatic abstracting techniques. [Methods] First, we introduced a patent term dictionary and used the sememe vector based on the SAT model to represent traditional Chinese medicine patents. Then, with the help of the integration strategy, we used the TextRank, Lead4 and NMF models to extract key sentences from the patents. Third, we identified the optimal key sentences through clustering and redundancy removal. Finally, we processed these optimal key sentences with the pointer-generator network based on Transformer character vectors to create the abstracts. [Results] Our new model successfully combined the extractive and generative methods. Compared with the existing RLCPAR model, we improved ROUGE-1, ROUGE-2 and ROUGE-L by 2.00%, 9.73% and 2.35%, respectively. [Limitations] There are still some errors in the new abstracts. [Conclusions] The new STNLTP model could effectively generate Chinese patent abstracts.

  • Zhang Shunxiang, Zhang Zhenjiang, Zhu Guangli, Zhao Tong, Huang Ju
    Data Analysis and Knowledge Discovery. 2022, 6(7): 118-127. https://doi.org/10.11925/infotech.2096-3467.2021.1344

    [Objective] This paper proposes a network model with Bi-LSTM and two-way CNN, which addresses the missing feature information in causality identification and improves its accuracy. [Methods] First, we used the Bi-LSTM to generate the text feature matrix for the financial texts. Then, we extracted the causal features from the matrix using a two-way CNN with different convolution kernels. Third, we concatenated the feature vectors obtained by the maximum and average pooling methods. Finally, we passed the concatenated vectors to the fully connected layer for output. [Results] The accuracy of our new model reached 82.3%, which is at least 3% higher than that of the existing methods. [Limitations] We did not establish a specific function module for the financial texts. [Conclusions] The proposed model could effectively identify causality in the documents.
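The pooling step above, in which max-pooled and average-pooled features are joined into one vector, can be sketched in plain Python. This only illustrates the concatenation idea, assuming each convolution channel has already produced a list of activations. The function name `pool_and_concat` and the output ordering (all maxima first, then all averages) are assumptions.

```python
def pool_and_concat(feature_maps):
    """Concatenate global max pooling and global average pooling outputs.

    feature_maps: list of channels, each a non-empty list of activations.
    """
    max_pooled = [max(channel) for channel in feature_maps]
    avg_pooled = [sum(channel) / len(channel) for channel in feature_maps]
    # The joined vector would then feed a fully connected layer.
    return max_pooled + avg_pooled
```

Max pooling keeps the strongest response per channel, while average pooling keeps an overall summary, so the joined vector carries both signals.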

  • Bian Xiaohui, Xu Tong
    Data Analysis and Knowledge Discovery. 2022, 6(7): 128-140. https://doi.org/10.11925/infotech.2096-3467.2021.0711

    [Objective] This study analyzes social media posts during the COVID-19 pandemic, aiming to reveal the temporal and spatial differences of public opinion, the sentiment evolution under different circumstances, and the trans-regional spread of public sentiment. [Methods] First, we used the Latent Dirichlet Allocation (LDA) model to generate the latent topics and related keyword groups, and analyzed the evolution of public sentiment from the perspectives of global and individual topics. Then, we described the trans-regional spread of public sentiment based on a social spread model adapted from the classic Independent Cascade Model. [Results] The new model summarized the general rules of temporal evolution and spatial difference, as well as the impacts of distance to the epidemic centers and financial levels. We also found two different types of topics indicating reasons for popularity and sentiment differences, as well as multi-view connections among these topics. The strength of trans-regional sentiment spread could be affected by both regional distance and the epidemic situation. [Limitations] The new framework could not process multimodal data. [Conclusions] The proposed model helps local governments make better strategies according to specific conditions and pay more attention to the impacts of related events. Local governments should also strengthen regional cooperation and coordination for controlling pandemics and monitoring public sentiment.
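The spread model above is described as adapted from the classic Independent Cascade Model. The sketch below shows only the classic version, in which each newly activated node gets one chance to activate each inactive neighbour with a fixed edge probability. The paper's adaptation is not reproduced. The function name `independent_cascade` and the graph format are hypothetical.

```python
import random

def independent_cascade(graph, seeds, rng=None):
    """Simulate one run of the classic Independent Cascade Model.

    graph: {node: {neighbour: activation_probability}}
    seeds: initially active nodes
    Returns the set of nodes active when the cascade stops.
    """
    rng = rng or random.Random()
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for node in frontier:
            for neighbour, prob in graph.get(node, {}).items():
                # Each active node tries each inactive neighbour exactly once.
                if neighbour not in active and rng.random() < prob:
                    active.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return active
```

Because each run is random, spread strength is usually estimated by averaging the final active set size over many runs. Passing a seeded `random.Random` makes a run reproducible.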

  • Yang Wenli, Li Nana
    Data Analysis and Knowledge Discovery. 2022, 6(7): 141-151. https://doi.org/10.11925/infotech.2096-3467.2021.1462

    [Objective] This paper tries to improve the accuracy of cross-language sentiment classification by narrowing the distribution of bilingual text pairs in the shared space. [Methods] In the process of emotional knowledge transfer, we aligned the word and text pairs simultaneously by adjusting the balance coefficient. Then, we combined the language discriminator to generate the conversion matrix for adversarial network optimization. Finally, we used a multi-feature fusion hierarchical neural network to represent the texts, the contexts, and the topic relevance of words and sentences, which addressed the issue of long-distance feature dependence in the texts. [Results] We tested our model on the NLP&CC 2013 standard datasets, and the average cross-language sentiment classification accuracy was 83.66%, which was 2.30% higher than that of the benchmark model. [Limitations] This method was only tested with Chinese and English datasets. More research is needed to evaluate its effectiveness with other languages. [Conclusions] Improving the similarity of bilingual texts could effectively increase the accuracy of cross-language sentiment classification.