Table of Contents

25 January 2018, Volume 2 Issue 1
    

    Original Article
  • Zhang Zhiqiang, Fan Shaoping, Chen Xiujuan
    Data Analysis and Knowledge Discovery. 2018, 2(1): 1-8. https://doi.org/10.11925/infotech.2096-3467.2017.1330

    [Objective] This paper reviews the latest studies in Biomedical Informatics and identifies future directions for data-driven knowledge discovery in precision medicine. [Methods] Through literature review and service trials, we summarized developments in data resources, data analysis platforms and methods, and clinical decision-making applications in Biomedical Informatics. [Results] Future directions of Biomedical Informatics include building better big data management systems, proposing theories and methods for big data analysis, developing new tools and platforms, applying research findings clinically, and training senior personnel. [Limitations] More biomedical data resources, methods, and case studies should be added. [Conclusions] This study identifies future developments of Biomedical Informatics in precision medicine, which uses big data analytics to discover more knowledge.

  • Shen Zhihong, Yao Chang, Hou Yanfei, Wu Linhuan, Li Yuepeng
    Data Analysis and Knowledge Discovery. 2018, 2(1): 9-20. https://doi.org/10.11925/infotech.2096-3467.2017.1341

    [Objective] This article analyzes the concept, connotation and characteristics of big linked data, aiming to explore possible solutions to the technical challenges of managing it. [Methods] We proposed a new model based on NoSQL data management, distributed graph computing and big data pipeline technologies, and designed and developed gETL, a large-scale graph data warehouse processing system. [Results] The proposed system was used in the NSFC-KBMS and WDCM projects, where it effectively managed large-scale knowledge data and biological data. [Limitations] The proposed system could be improved with new applications. [Conclusions] NoSQL data storage, distributed graph computing, and big data pipeline technologies, as well as the gETL system, help address the challenges of managing big linked data.

  • Guo Shaoqing, Le Xiaoqiu
    Data Analysis and Knowledge Discovery. 2018, 2(1): 21-28. https://doi.org/10.11925/infotech.2096-3467.2017.1091

    [Objective] This paper aims to identify the actual values of numerical indicators in scientific literature. [Methods] First, we analyzed the Shortest-Path-Tree between the indicator and the digital entities. Second, we used distant supervision to learn the syntactic and descriptive characteristics of sentences containing numerical indicators. Third, we created four types of relationship templates: “more than”, “less than”, “equal” and “times”. Finally, we obtained the actual values of these indicators. [Results] We examined the proposed method in the fields of climate change and astronomy. The F-values were 82.35% and 77.55%, which were above the average of related studies. [Limitations] We did not investigate indicator values that span multiple sentences. [Conclusions] The proposed method could help obtain the actual values of numerical indicators effectively.

  • Wang Tingting, Han Man, Wang Yu
    Data Analysis and Knowledge Discovery. 2018, 2(1): 29-40. https://doi.org/10.11925/infotech.2096-3467.2017.0715

    [Objective] This paper proposes a K-wrLDA model based on adaptive clustering, aiming to improve the topic recognition ability of the traditional LDA model and to identify the optimal number of topics. [Methods] First, we used the LDA and word2vec models to construct the T-WV matrix, which contains the probability information and the semantic relevance of the topic words. Then, we selected the number of topics based on an evaluation of clustering quality and the pseudo-F statistic. Finally, we compared the topic identification results of the proposed model with those of existing models. [Results] The optimal number of topics was 33 for the proposed model, which also had lower perplexity than the traditional models. [Limitations] The sample size needs to be expanded. [Conclusions] The proposed model, which has a better recognition rate than the traditional LDA model, can also calculate the optimal number of topics. The new model may be applied to process large corpora in various fields.
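
The pseudo-F statistic mentioned in the Methods is commonly computed as the Calinski-Harabasz ratio of between-cluster to within-cluster dispersion. The sketch below is a minimal illustration under that assumption. The function name `pseudo_f`, the toy points and the labels are hypothetical and are not taken from the paper, and how the authors applied the statistic to the T-WV matrix is not stated in the abstract.

```python
def pseudo_f(points, labels):
    """Pseudo-F (Calinski-Harabasz) statistic for a clustering.

    points: list of equal-length numeric vectors (hypothetical input format).
    labels: one cluster label per point.
    """
    n = len(points)
    dims = len(points[0])

    # Group the points by their cluster label.
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if not 1 < k < n:
        raise ValueError("need more than one cluster and fewer clusters than points")

    def centroid(pts):
        return [sum(p[d] for p in pts) / len(pts) for d in range(dims)]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    # Between-cluster sum of squares, weighted by cluster size.
    ssb = sum(len(pts) * sq_dist(centroid(pts), overall) for pts in clusters.values())
    # Within-cluster sum of squares (must be non-zero for the ratio to exist).
    ssw = sum(sq_dist(p, centroid(pts)) for pts in clusters.values() for p in pts)
    return (ssb / (k - 1)) / (ssw / (n - k))


# Toy data: two well-separated groups on a line.
toy_points = [[0.0], [2.0], [10.0], [12.0]]
good = pseudo_f(toy_points, [0, 0, 1, 1])  # clusters match the groups
bad = pseudo_f(toy_points, [0, 1, 0, 1])   # clusters cut across the groups
```

A higher value suggests better separated clusters, so one plausible use is to compute the statistic for several candidate numbers of topics and keep the candidate with the largest value.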

  • Li Weiqing, Wang Weijun
    Data Analysis and Knowledge Discovery. 2018, 2(1): 41-50. https://doi.org/10.11925/infotech.2096-3467.2017.0717

    [Objective] This paper proposes a method to build a product feature dictionary from large-scale review data, aiming to improve its precision and recall. [Methods] First, we constructed a seed dictionary by manual labeling and extension with the synonym forest. Then, we trained word vectors on large-scale product reviews to calculate the semantic similarity and relevance of words. Finally, we identified and categorized the product features to construct the dictionary. [Results] We examined the proposed method with product reviews of mobile phones, cameras and books; its average precision and recall were 0.774 and 0.855, respectively. [Limitations] The proposed method required a great deal of human participation at the labeling and verification stages, and it did not consider the implicit features in product reviews. [Conclusions] The proposed method could effectively build a feature dictionary with better recall.
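
One common way to compare word vectors is cosine similarity. The abstract does not say which similarity measure the authors used, so the sketch below is only an illustration under that assumption. The names `cosine_similarity` and `assign_feature`, the two-dimensional vectors and the threshold are all hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def assign_feature(candidate_vec, seed_vectors, min_similarity):
    """Return the seed word most similar to the candidate, or None if no seed
    reaches min_similarity (a hypothetical cut-off, not a value from the paper)."""
    best_word, best_score = None, min_similarity
    for word, vec in seed_vectors.items():
        score = cosine_similarity(candidate_vec, vec)
        if score >= best_score:
            best_word, best_score = word, score
    return best_word


# Toy seed dictionary with made-up 2-d vectors; real word vectors have many more dimensions.
seeds = {"screen": [1.0, 0.0], "battery": [0.0, 1.0]}
close_match = assign_feature([0.9, 0.1], seeds, min_similarity=0.8)
no_match = assign_feature([1.0, 1.0], seeds, min_similarity=0.8)
```

Here the first candidate points almost in the same direction as "screen", while the second sits halfway between both seeds and falls below the cut-off.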

  • Zhang Pengyi, Wang Danxue, Jiao Yifan, Chen Xiuyu, Wang Jun
    Data Analysis and Knowledge Discovery. 2018, 2(1): 51-63. https://doi.org/10.11925/infotech.2096-3467.2017.0890

    [Objective] This research characterizes users’ browsing patterns, aiming to predict their purchasing decisions in mobile shopping applications. [Methods] First, we mapped the request parameters in the logs to types of user information behavior. Then, we used binary logistic regression and C&R decision tree techniques to build models that predict buying decisions. The data set included 3,923,429 lines of server logs generated by 290 heavy users of a popular mobile shopping app in March 2015. [Results] We found that the frequency of users’ browsing behaviors was stable on weekdays and peaked every night before bedtime. Users paid much attention to product details, and those with deeper browsing behaviors were more likely to read the introduction to the shop and share related information. The number of views followed a power-law distribution, and 90% of the merchandise was viewed fewer than 16 times. We also found that goods viewed 9 times and placed in the cart were most likely to be bought. There was a positive correlation between purchases of goods and the numbers of views or shares of the item and the shop. The prediction accuracy of the C&R decision tree model was slightly higher than that of the binary logistic regression model. However, the former used far fewer variable types than the latter. [Limitations] Logs cannot fully reflect all user behaviors, which leads to some ambiguity in our analysis. The conclusions might not tell the whole story, since the logs were generated by heavy users in one month. [Conclusions] The patterns of user browsing and buying behaviors could be used to enhance the user experience of mobile shopping applications. Binary logistic regression might better predict users’ buying decisions than the C&R decision tree model.

  • Qu Jiabin, Ou Shiyan
    Data Analysis and Knowledge Discovery. 2018, 2(1): 64-75. https://doi.org/10.11925/infotech.2096-3467.2017.1114

    [Objective] The topics identified by the LDA model include many irrelevant results, which reduces the accuracy of evolution analysis. This paper constructs topic evolution paths and analyzes topic evolution by filtering out noise and calculating relevance. [Methods] First, we filtered out irrelevant topics by their probability of appearing in all documents and the word propensity distribution of topics. Then, we calculated the Jensen-Shannon divergence to identify related topics. Finally, we constructed the topic evolution paths based on the correlation between topics. [Results] The effectiveness of the proposed method was examined with scientific literature on “machine learning”, which yielded five types of evolution paths: rebirth, extinction, succession, division and merger. [Limitations] The estimated threshold values involve some subjective factors. [Conclusions] The proposed method could avoid the interference of noise topics and identify relevant topics from adjacent time intervals. It helps discover the evolution of discipline topics more accurately.
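
The Jensen-Shannon divergence between two topic-word distributions can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names, the toy distributions and the threshold value are hypothetical, and base-2 logarithms are assumed so that the divergence lies between 0 and 1.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions over the same vocabulary."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def related_topics(prev_topics, next_topics, threshold):
    """Index pairs (i, j) of topics from adjacent time intervals whose divergence
    is below a chosen threshold (hypothetical value, set by the analyst)."""
    return [
        (i, j)
        for i, p in enumerate(prev_topics)
        for j, q in enumerate(next_topics)
        if js_divergence(p, q) < threshold
    ]


# Toy word distributions over a two-word vocabulary.
identical = js_divergence([0.5, 0.5], [0.5, 0.5])
disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])
pairs = related_topics([[1.0, 0.0], [0.5, 0.5]], [[1.0, 0.0]], threshold=0.1)
```

Identical distributions give a divergence of 0 and distributions with no shared words give 1, so a lower threshold links only topics whose word distributions are very close.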

  • Zhang Liyi, Li Huiran
    Data Analysis and Knowledge Discovery. 2018, 2(1): 76-87. https://doi.org/10.11925/infotech.2096-3467.2017.1038

    [Objective] This paper explores the impacts of social interaction on online medical question-answering services. [Methods] We proposed a new research model to study the usage of online medical question-answering services. We collected data with a questionnaire and examined the proposed model with SmartPLS 3.0. A total of 371 valid samples were obtained and analyzed. [Results] We found that users were happy after contributing information online. The usefulness and ease of use in human-machine interaction, as well as cognitive trust and affective trust, had positive effects on patients’ usage of online medical question-answering services. We also found that information support and emotional support had different impacts on cognitive and affective trust, which led to different behaviors in patient-doctor and patient-patient interactions. [Limitations] The impacts of different diseases and information functions (direct or indirect) on the interaction should be further studied. [Conclusions] Human-human and human-machine interactions have positive effects on patients’ intention to use online medical question-answering services.

  • Mu Dongmei, Wang Ping, Zhao Danning
    Data Analysis and Knowledge Discovery. 2018, 2(1): 88-98. https://doi.org/10.11925/infotech.2096-3467.2017.1053

    [Objective] This paper explores strategies for reducing the data dimension of electronic medical records, aiming to improve knowledge discovery. [Methods] First, we conducted a preliminary dimension reduction through literature review. Then, we used three methods for the second round of dimension reduction: extracting the factors with eigenvalues greater than 1, extracting the factors with a cumulative contribution rate greater than 85%, and extracting the factors with significant differences. Finally, we compared the results of the three methods through empirical research. [Results] The three dimension reduction methods extracted 8, 17 and 14 attributes, respectively. After qualitative and quantitative evaluation, principal component analysis with eigenvalues greater than 1 yielded the best result. [Limitations] The sample size needs to be expanded for more in-depth analysis. [Conclusions] The proposed method could effectively reduce the data dimension of electronic medical records.
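
The first two selection criteria in the Methods can be illustrated on a list of eigenvalues. The sketch below assumes the eigenvalues have already been computed, for example from a correlation matrix. The function names and the sample eigenvalues are hypothetical and do not reproduce the paper's data.

```python
def count_by_eigenvalue(eigenvalues, cutoff=1.0):
    """Number of components whose eigenvalue is greater than the cutoff."""
    return sum(1 for ev in eigenvalues if ev > cutoff)


def count_by_cumulative_contribution(eigenvalues, target=0.85):
    """Smallest number of leading components whose cumulative share of the
    total variance is greater than the target."""
    total = sum(eigenvalues)
    running = 0.0
    for k, ev in enumerate(sorted(eigenvalues, reverse=True), start=1):
        running += ev
        if running / total > target:
            return k
    # Reached only if the target cannot be exceeded (e.g. target >= 1).
    return len(eigenvalues)


# Made-up eigenvalues that sum to 8.0.
sample = [3.0, 2.0, 1.5, 0.8, 0.4, 0.3]
kept_by_eigenvalue = count_by_eigenvalue(sample)
kept_by_contribution = count_by_cumulative_contribution(sample)
```

On this toy input the eigenvalue rule keeps three components, while the 85% rule keeps four because the first three explain only 81.25% of the variance. The two criteria can therefore disagree, which is consistent with the different attribute counts reported above.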

  • He Yue, Wang Aixin, Feng Yue, Wang Li
    Data Analysis and Knowledge Discovery. 2018, 2(1): 99-108. https://doi.org/10.11925/infotech.2096-3467.2017.0946

    [Objective] This paper optimizes the layout of drugs in the pharmacy to improve service efficiency as the number of outpatient visits increases. [Methods] First, we chose the two departments with the largest number of prescriptions, which were divided into four subgroups with the K-means clustering method. Then, we used the Apriori algorithm to explore the association rules among them. Finally, we obtained 31 effective drug layout rules and 18 effective drug class rules. [Results] Based on the collected data and national drug storage and display standards, we designed general layout rules for prescription drugs, which were approved by experts. [Limitations] We only studied prescription records from two departments, which might not yield the best association rules. [Conclusions] The proposed method could reduce the workload of pharmacists and the waiting time of patients, thereby improving pharmacy services.
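
Association rules of the kind Apriori produces are scored by support and confidence. The sketch below is not a full Apriori implementation: it only enumerates rules between pairs of items, which is enough to show how the two measures work. The function names, the drug labels A, B and C, and the thresholds are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def pair_rules(transactions, min_support, min_confidence):
    """Rules x -> y between single items, as tuples (x, y, support, confidence),
    that meet both thresholds."""
    items = sorted({item for t in transactions for item in t})
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b}, transactions)
        if pair_support < min_support:
            continue  # infrequent pairs cannot yield qualifying rules
        for x, y in ((a, b), (b, a)):
            confidence = pair_support / support({x}, transactions)
            if confidence >= min_confidence:
                rules.append((x, y, pair_support, confidence))
    return rules


# Toy prescriptions; each set lists the drugs dispensed together.
prescriptions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]
rules = pair_rules(prescriptions, min_support=0.5, min_confidence=0.9)
```

In this toy data every prescription containing C also contains A, so the only rule that passes both thresholds is C -> A. In a layout setting, a rule like this would suggest storing the two drugs near each other.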