Table of Contents

25 May 2019, Volume 3 Issue 5
    

  • Jing Shi,Chenlu Li,Yuxing Qian,Liqin Zhou,Bin Zhang
    Data Analysis and Knowledge Discovery. 2019, 3(5): 1-10. https://doi.org/10.11925/infotech.2096-3467.2018.0813

    [Objective] This paper identifies and analyzes the information needs of domestic and international health question-and-answer (HCQA) communities, aiming to find the patterns of topic evolution and explore the reasons behind them. [Methods] First, we selected diabetes-related data from ManYouBang and DailyStrength. Then, we compared topic evolution and co-occurrence from the perspectives of theme and time, using theme coding, social network analysis and content analysis. [Results] The essential need of HCQA users was “how to treat the disease”. For the chronic disease community, the “diet” theme was closely related to its co-occurring themes. [Limitations] Our research did not examine the relevance of answers to questions; more in-depth study of topic evolution and content is needed. [Conclusions] The domestic HCQA community is still developing, while its foreign counterparts are stable. The former has only the “question and answer” attribute, while the latter have both “Q&A” and “social” attributes.
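
    To illustrate the co-occurrence side of this analysis, here is a minimal Python sketch, assuming posts have already been coded with themes (the theme labels and posts below are illustrative, not the paper's data): themes coded to the same post are linked, and degree centrality hints at which theme co-occurs most broadly.

        # Minimal theme co-occurrence sketch (hypothetical theme codes).
        from itertools import combinations
        import networkx as nx

        # Each inner list is the set of coded themes for one Q&A post (illustrative).
        coded_posts = [
            ["treatment", "diet"],
            ["diet", "exercise", "blood sugar"],
            ["treatment", "blood sugar"],
            ["diet", "blood sugar"],
        ]

        G = nx.Graph()
        for themes in coded_posts:
            for a, b in combinations(sorted(set(themes)), 2):
                w = G.get_edge_data(a, b, {"weight": 0})["weight"]
                G.add_edge(a, b, weight=w + 1)

        # Degree centrality suggests which theme co-occurs most broadly.
        print(nx.degree_centrality(G))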

  • Mengji Zhang,Wanyu Du,Nan Zheng
    Data Analysis and Knowledge Discovery. 2019, 3(5): 11-18. https://doi.org/10.11925/infotech.2096-3467.2018.0871

    [Objective] This paper tries to predict stock trends with the help of deep learning models, financial data and related news events. [Methods] First, we built a classification model for news events. Then, we used recurrent neural networks to construct a forecasting model for stock trends based on news, capital flows and corporate financial reports. [Results] The proposed model improved prediction accuracy (76.22% and 77.36% for the mining and pharmaceutical manufacturing industries, respectively). [Limitations] We did not examine the different impacts of news headlines and full texts on the stock market, and we only chose news events from the past year, which needs to be expanded. [Conclusions] News events could improve the accuracy of predicting stock trends.
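
    As a hedged sketch of the forecasting step (not the authors' exact architecture), the following Python/Keras snippet trains a recurrent classifier on a window of daily features such as capital flows, financial ratios and a news-event code; all shapes and the toy data are assumptions.

        import numpy as np
        import tensorflow as tf

        WINDOW, N_FEATURES = 20, 8   # 20 trading days, 8 daily features (assumed)
        X = np.random.rand(256, WINDOW, N_FEATURES).astype("float32")   # toy data
        y = np.random.randint(0, 2, size=(256,))                        # 1 = up, 0 = down

        model = tf.keras.Sequential([
            tf.keras.Input(shape=(WINDOW, N_FEATURES)),
            tf.keras.layers.LSTM(32),
            tf.keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        model.fit(X, y, epochs=2, batch_size=32, verbose=0)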

  • Wancheng Chen,Haoran Dai,Yinghan Jin
    Data Analysis and Knowledge Discovery. 2019, 3(5): 19-26. https://doi.org/10.11925/infotech.2096-3467.2018.0881

    [Objective] This paper proposes a model based on hedonic pricing theory, aiming to assess home prices more efficiently, cost-effectively and accurately. [Methods] We adopted the spatial analysis method to extract important features from pre-processed data. Then, we built the model with Random Forest, KNN and Neural Networks. [Results] We examined our model with property price data of Seattle (USA) from 2014 to 2015 and found its precision was 11.20% higher than that of the linear model. [Limitations] The sample data was not retrieved from the same time slice, which might affect the performance of our model. Using this model to assess home prices in China might be biased due to the different market environment and other factors. [Conclusions] The proposed model is a reliable method to appraise property prices.
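
    A minimal sketch of the model comparison, assuming tabular hedonic features (synthetic stand-ins for the Seattle data; the feature names in the comment are hypothetical):

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.linear_model import LinearRegression
        from sklearn.model_selection import train_test_split
        from sklearn.metrics import mean_absolute_percentage_error

        rng = np.random.default_rng(0)
        X = rng.random((500, 5))             # e.g. area, rooms, age, lat, lon (assumed)
        y = 300_000 + 500_000 * X[:, 0] + 50_000 * rng.standard_normal(500)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
            pred = model.fit(X_tr, y_tr).predict(X_te)
            print(type(model).__name__, mean_absolute_percentage_error(y_te, pred))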

  • Guangshang Gao
    Data Analysis and Knowledge Discovery. 2019, 3(5): 27-40. https://doi.org/10.11925/infotech.2096-3467.2018.1388

    [Objective] This paper discusses classical entity resolution methods and the logical thinking behind entity resolution theory. [Coverage] Google Scholar was searched with the keywords “Entity Resolution”, “Collective Analysis”, “Crowdsourced”, “Active Learning” and “Privacy-Preserving”, and CNKI with the Chinese equivalent of “Entity Resolution”. A total of 86 representative papers were obtained through topic screening, intensive reading and retrospective search. [Methods] For each entity resolution method, the paper first summarizes and analyzes its basic idea and illustrates the resolution process, and then focuses on the key strategies, algorithms or techniques adopted by existing research to implement the method. [Results] Entity resolution is the basic operation of data quality management and a key step in discovering the value of data. [Limitations] There is no in-depth analysis of the evaluation indicators and applications of each entity resolution method. [Conclusions] Although existing entity resolution methods can meet the requirements of most applications to some extent, they still face challenges from data heterogeneity, privacy protection and distributed environments in the big data era.
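
    As a concrete illustration of the basic operation the survey covers, here is a minimal Python sketch of blocking plus pairwise matching; the records, blocking key and similarity threshold are all assumptions, not drawn from any surveyed method.

        from difflib import SequenceMatcher
        from collections import defaultdict

        records = [
            {"id": 1, "name": "Acme Corp.", "city": "Beijing"},
            {"id": 2, "name": "ACME Corporation", "city": "Beijing"},
            {"id": 3, "name": "Globex Ltd.", "city": "Shanghai"},
        ]

        blocks = defaultdict(list)            # blocking key: first letter + city
        for r in records:
            blocks[(r["name"][0].lower(), r["city"])].append(r)

        for block in blocks.values():         # compare only within blocks
            for i, a in enumerate(block):
                for b in block[i + 1:]:
                    sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
                    if sim > 0.6:             # threshold is an assumption
                        print("match:", a["id"], b["id"], round(sim, 2))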

  • Qiang Liu,Yunwei Chen,Zhiqiang Zhang
    Data Analysis and Knowledge Discovery. 2019, 3(5): 41-50. https://doi.org/10.11925/infotech.2096-3467.2018.1222

    [Objective] This paper provides a comprehensive introduction to the Norwegian Model, aiming to promote the development of science and technology evaluation in China. [Methods] With case studies, this paper first discussed the implementation of the Norwegian Model and success stories from regions outside Norway. Then we explored the application of the Norwegian Model at various levels and for various subjects. Finally, we compared the Norwegian Model with two classic bibliometric measures. [Results] Six European countries have used the Norwegian Model, a performance-based research funding system (PRFS), to promote their scientific publications. [Limitations] The Norwegian Model and its applications are still evolving; therefore, we are not able to discuss their future trends. [Conclusions] The Norwegian Model has some value in science and technology evaluation. More research is needed to explore its applications in China.
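
    As a hedged illustration of how the Norwegian Model turns publications into points, the sketch below follows commonly published descriptions of the revised model (base points by type and level, the square root of the institution's author share, and a 1.3 factor for international co-authorship); treat the values as illustrative and check them against the paper.

        import math

        BASE_POINTS = {("article", 1): 1.0, ("article", 2): 3.0,
                       ("monograph", 1): 5.0, ("monograph", 2): 8.0}

        def publication_points(pub_type, level, author_share, international=False):
            points = BASE_POINTS[(pub_type, level)] * math.sqrt(author_share)
            return points * (1.3 if international else 1.0)

        # A level-2 article where the institution holds 2 of 4 authors, with foreign co-authors:
        print(round(publication_points("article", 2, 2 / 4, international=True), 2))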

  • Jingjing Pei,Xiaoqiu Le
    Data Analysis and Knowledge Discovery. 2019, 3(5): 51-56. https://doi.org/10.11925/infotech.2096-3467.2018.1380

    [Objective] This paper proposes a method to identify coordinate text blocks distributed across different paragraphs, using semantic and layout features. It also provides a pre-trained model for these knowledge objects. [Methods] First, we used each paragraph as a processing unit and added layout features on top of the character and word vectors. Then, we concatenated the multi-dimensional features to represent each paragraph. Third, we employed a convolutional neural network (CNN) model to train on the annotated data and obtained the recognition model for coordinate-relationship text blocks. [Results] The proposed approach achieved a precision of 96% on manually annotated scientific papers, which was 3% higher than that of the baseline model. The recall was also improved by 2%. [Limitations] Our model can only work with HTML files. More research is needed to examine it with other data formats. [Conclusions] The proposed method is able to effectively identify coordinate text blocks in discourse, which can serve as a pre-trained model for coordinate knowledge objects.
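
    A minimal Keras sketch of the general idea (assumed shapes and random toy data, not the authors' exact network): a text CNN over word-vector sequences whose pooled output is concatenated with a small layout-feature vector before classification.

        import numpy as np
        import tensorflow as tf

        SEQ_LEN, EMB_DIM, N_LAYOUT = 100, 50, 6          # assumed dimensions
        text_in = tf.keras.layers.Input(shape=(SEQ_LEN, EMB_DIM))
        layout_in = tf.keras.layers.Input(shape=(N_LAYOUT,))

        x = tf.keras.layers.Conv1D(64, 3, activation="relu")(text_in)
        x = tf.keras.layers.GlobalMaxPooling1D()(x)
        x = tf.keras.layers.Concatenate()([x, layout_in])  # fuse text and layout features
        out = tf.keras.layers.Dense(1, activation="sigmoid")(x)  # coordinate block or not

        model = tf.keras.Model([text_in, layout_in], out)
        model.compile(optimizer="adam", loss="binary_crossentropy")
        model.fit([np.random.rand(32, SEQ_LEN, EMB_DIM), np.random.rand(32, N_LAYOUT)],
                  np.random.randint(0, 2, 32), epochs=1, verbose=0)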

  • Jianhua Liu,Zhixiong Zhang,Qin Zhang
    Data Analysis and Knowledge Discovery. 2019, 3(5): 57-67. https://doi.org/10.11925/infotech.2096-3467.2018.1379

    [Objective] The paper tries to describe the evolutionary path of science and technology (S&T) policies using knowledge from documents generated in policy promotion. [Methods] We proposed a multi-index model with direct semantic relationships, direct co-occurrence relationships, indirect co-occurrence relationships and a link path attenuation index. The S&T policy entities and their relationships used in the proposed model were extracted from the policy texts. We described the S&T policy evolution paths along their time properties and then analyzed the structural features of policy entities and their relationships. [Results] We found the evolution paths of these policies at different stages, and 80% of the retrieved paths existed in the real world. [Limitations] The proposed model relies on human comparison and interpretation. Besides, the sample size needs to be expanded. [Conclusions] This study reveals the evolutionary path of S&T policies based on related records. It expands the scope and depth of S&T policy analysis research.
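
    As a hedged sketch of one ingredient of such a model, the function below scores a candidate evolution path while attenuating the contribution of later hops; the decay constant and edge weights are assumptions for illustration.

        def path_score(edge_weights, decay=0.8):
            """edge_weights: relationship strengths along a candidate path."""
            score = 1.0
            for hop, w in enumerate(edge_weights):
                score *= w * (decay ** hop)   # later hops contribute less
            return score

        # A path with strong early links outscores one with a weak first link.
        print(path_score([0.9, 0.8, 0.7]), path_score([0.3, 0.9, 0.9]))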

  • Jinzhu Zhang,Yiming Hu
    Data Analysis and Knowledge Discovery. 2019, 3(5): 68-76. https://doi.org/10.11925/infotech.2096-3467.2018.0659

    [Objective] This paper aims to automatically identify scientific references in patents (SRP), and then extract titles from SRPs to support in-depth data mining. [Methods] First, we used the Doc2Vec method to generate vectors for the patent citations. Then, we identified the SRPs with a support vector machine (SVM). Third, we created vectors for the metadata (such as titles) of SRPs, and extracted titles with the SVM. [Results] We examined the proposed method with patent citations from the genetics field. The accuracy of SRP recognition and title extraction reached 99.27% and 92.59%, respectively. The latter was 5.96% higher than that of the traditional methods. [Limitations] Manually tagging the training set was very time-consuming, and there are format requirements for the experimental data. [Conclusions] The proposed method could effectively identify and extract patent citations and titles.
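
    A minimal sketch of the Doc2Vec-plus-SVM pipeline (toy citation strings and labels; real inputs are full patent citations):

        from gensim.models.doc2vec import Doc2Vec, TaggedDocument
        from sklearn.svm import SVC

        citations = ["smith j gene expression nature 1999",
                     "us patent 5123456 widget assembly",
                     "doe a dna sequencing science 2001",
                     "ep patent 0345678 fastening device"]
        labels = [1, 0, 1, 0]                 # 1 = scientific reference (SRP)

        docs = [TaggedDocument(c.split(), [i]) for i, c in enumerate(citations)]
        d2v = Doc2Vec(docs, vector_size=32, min_count=1, epochs=40)

        X = [d2v.dv[i] for i in range(len(citations))]
        clf = SVC(kernel="rbf").fit(X, labels)
        print(clf.predict([d2v.infer_vector("jones b rna splicing cell 2000".split())]))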

  • Bengong Yu,Yangnan Chen,Ying Yang
    Data Analysis and Knowledge Discovery. 2019, 3(5): 77-85. https://doi.org/10.11925/infotech.2096-3467.2018.0758

    [Objective] This paper tries to find an effective way to classify unstructured, short-text business complaints, aiming to improve the efficiency of corporate problem solving. [Methods] We first combined the topic model and distributed representation techniques to construct an SVM input space vector. Then, we integrated an ensemble learning method to build the nBD-SVM text classification model. [Results] We examined the proposed model with business complaint texts and found its precision reached 81.83%, which is much higher than that of the traditional methods. [Limitations] We only evaluated our model with complaints from one company. [Conclusions] The proposed nBD-SVM model could process short-text business complaints effectively.
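
    A hedged sketch of the idea behind nBD-SVM, simplified with random stand-in features (the paper's exact feature construction and ensemble scheme differ): concatenate a topic distribution with a distributed-representation vector, then train bagged SVMs on the joint vector.

        import numpy as np
        from sklearn.ensemble import BaggingClassifier
        from sklearn.svm import SVC

        n_docs = 100
        topic_feats = np.random.dirichlet(np.ones(10), size=n_docs)   # e.g. LDA output
        embed_feats = np.random.rand(n_docs, 50)                      # e.g. mean word2vec
        X = np.hstack([topic_feats, embed_feats])
        y = np.random.randint(0, 4, n_docs)                           # complaint category

        ensemble = BaggingClassifier(SVC(), n_estimators=10).fit(X, y)
        print(ensemble.predict(X[:3]))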

  • Yuemin Wu,Ganggui Ding,Bin Hu
    Data Analysis and Knowledge Discovery. 2019, 3(5): 86-92. https://doi.org/10.11925/infotech.2096-3467.2018.0818

    [Objective] This paper proposes a new method to extract relations from Chinese texts automatically. [Methods] We retrieved annual reports of 224 listed agricultural companies from 2015 to 2017. Then we adopted a Gated Recurrent Unit (GRU) algorithm with a double attention mechanism to extract the needed data. [Results] The average accuracy of our model on the agricultural financial dataset reached 78%. Compared with the Recurrent Neural Network algorithm, the average accuracy of the new model increased by about 12%. [Limitations] We only studied data from 224 companies, which needs to be expanded. [Conclusions] The proposed model can effectively extract relationships from agricultural financial texts.
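
    As a simplified stand-in for the double attention mechanism (assumed shapes, single word-level attention only), the following Keras sketch shows a GRU relation classifier with attention pooling.

        import numpy as np
        import tensorflow as tf

        SEQ_LEN, EMB_DIM, N_RELATIONS = 60, 100, 5        # assumed
        inp = tf.keras.layers.Input(shape=(SEQ_LEN, EMB_DIM))
        h = tf.keras.layers.GRU(64, return_sequences=True)(inp)

        # Word-level attention: score each time step, then weight and sum the GRU states.
        scores = tf.keras.layers.Dense(1)(h)
        weights = tf.keras.layers.Softmax(axis=1)(scores)
        context = tf.keras.layers.Dot(axes=1)([weights, h])   # (batch, 1, 64)
        context = tf.keras.layers.Flatten()(context)

        out = tf.keras.layers.Dense(N_RELATIONS, activation="softmax")(context)
        model = tf.keras.Model(inp, out)
        model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
        model.fit(np.random.rand(32, SEQ_LEN, EMB_DIM),
                  np.random.randint(0, N_RELATIONS, 32), epochs=1, verbose=0)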

  • Guang Zhu,Hu Liu,Xinmeng Du
    Data Analysis and Knowledge Discovery. 2019, 3(5): 93-106. https://doi.org/10.11925/infotech.2096-3467.2018.0844

    [Objective] This paper studies the usage intention of mobile health (mHealth) APPs and the related privacy concerns. It analyzes the interactions among the behaviors of various entities, aiming to improve mHealth privacy protection and increase APP usage. [Methods] Using evolutionary game theory, we proposed a model to examine patient behaviors, mHealth APP providers and government regulation. Then we analyzed the benefits, costs and losses of different behaviors to establish the payoff matrices and evolutionarily stable strategies (ESSs). Finally, we discussed the impacts of different factors on patient behaviors. [Results] The usage intention of mHealth APPs was correlated with the benefits from mHealth services and the probability of privacy leakage. However, government regulation had little impact on patients’ behaviors. Investments of mHealth service providers in privacy were correlated with APP usage intention, government regulation, costs and privacy losses, etc. Government regulation was correlated with costs and social credibility. [Limitations] We did not include a nonlinear benefit function in this study. Other factors, such as the success rate of regulation and advertisement effects, should also be examined. [Conclusions] This study promotes the development of mHealth services by analyzing the impacts of various factors on APP usage, privacy protection and government regulation.
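
    A hedged sketch of the analysis style, not the paper's actual payoff values: replicator dynamics for the share x of patients adopting the APP, where the payoff gain depends on an assumed service benefit b and expected privacy loss p * l.

        def simulate(x=0.5, b=3.0, p=0.2, l=5.0, dt=0.01, steps=2000):
            for _ in range(steps):
                gain = b - p * l                 # adopt vs. not-adopt payoff difference
                x += dt * x * (1 - x) * gain     # replicator equation dx/dt = x(1-x)*gain
                x = min(max(x, 0.0), 1.0)
            return x

        print(simulate(b=3.0, p=0.2, l=5.0))     # benefit outweighs expected loss: x -> 1
        print(simulate(b=0.5, p=0.4, l=5.0))     # privacy risk dominates: x -> 0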

  • Yujie Cao,Jin Mao,Rongqing Pan,Zhichao Ba,Gang Li
    Data Analysis and Knowledge Discovery. 2019, 3(5): 107-116. https://doi.org/10.11925/infotech.2096-3467.2018.0905

    [Objective] The paper explores the evolution of interdisciplinary research, aiming to identify its characteristics. [Methods] We chose “Medical Informatics” as the example of an interdisciplinary field and divided its evolution into different phases. Then, we introduced interdisciplinary characteristics from the perspectives of knowledge input and output. Finally, we analyzed the co-word patterns of knowledge output to reveal the features of research evolution. [Results] At the beginning, developing and stable stages of Medical Informatics research, both the interdisciplinarity indicators and the structural properties of the co-word network differed. At the stable stage, knowledge began to internalize and specialize even as output grew explosively. [Limitations] The sample of interdisciplinary fields needs to be further expanded. [Conclusions] The changing characteristics of interdisciplinary research are the result of multi-disciplinary knowledge input and output.
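
    To make the structural comparison concrete, here is a minimal networkx sketch that contrasts toy co-word networks from two stages; the keywords and edges are illustrative, not the paper's data.

        import networkx as nx

        stages = {
            "beginning": [("MIS", "record"), ("record", "coding")],
            "stable":    [("MIS", "record"), ("record", "coding"),
                          ("coding", "ontology"), ("ontology", "MIS")],
        }
        for name, edges in stages.items():
            G = nx.Graph(edges)
            print(name, "density:", round(nx.density(G), 2),
                  "clustering:", round(nx.average_clustering(G), 2))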

  • Cheng Zhou,Hongqin Wei
    Data Analysis and Knowledge Discovery. 2019, 3(5): 117-124. https://doi.org/10.11925/infotech.2096-3467.2018.0674

    [Objective] This paper proposes a new method for evaluating and classifying patent values. [Methods] With the help of value indicators, we designed a patent value analysis and classification system based on self-organizing map (SOM) and support vector machine (SVM) techniques. We used the SOM to determine value categories, and then applied the random forest (RF) algorithm to rank the value indicators by their significance. Finally, we improved classification performance with a wrapper feature reduction method. [Results] The value tags determined by the SOM effectively represented the patent values. Meanwhile, the value indicators were reduced from 14 to 10, and the classification accuracy increased from 76.28% to 86.89%. [Limitations] Further refinement of patent values in each category is needed, which might further reduce the patent value indicators. [Conclusions] The proposed SOM-RF-SVM method could support research and development activities as well as reduce the dependence on human factors.
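
    A hedged sketch of the SOM-RF-SVM pipeline on synthetic data (all sizes assumed; MiniSom stands in for whatever SOM implementation the authors used):

        import numpy as np
        from minisom import MiniSom
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.svm import SVC

        X = np.random.rand(200, 14)                      # 14 value indicators
        som = MiniSom(2, 2, 14, sigma=0.5, learning_rate=0.5, random_seed=0)
        som.train_random(X, 500)
        # Map each patent to one of 4 value categories via its winning SOM node.
        labels = np.array([som.winner(x)[0] * 2 + som.winner(x)[1] for x in X])

        rf = RandomForestClassifier(random_state=0).fit(X, labels)
        top = np.argsort(rf.feature_importances_)[::-1][:10]   # keep 10 of 14 indicators
        clf = SVC().fit(X[:, top], labels)
        print(clf.score(X[:, top], labels))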

  • Jiaming Liang,Jie Zhao,Jianlong Zhou,Zhenning Dong
    Data Analysis and Knowledge Discovery. 2019, 3(5): 125-138. https://doi.org/10.11925/infotech.2096-3467.2018.0665

    [Objective] This paper explores a new data mining method for implicit user behaviors, aiming to improve the precision of collusive fraud detection models. [Methods] First, we proposed a framework for implicit user behavior analysis. Then, we designed a two-stage algorithm to select the needed implicit features. [Results] We examined our new model with massive data from an existing e-commerce platform and found that the proposed model was more effective than the existing ones. [Limitations] The size of our experimental dataset needs to be expanded. [Conclusions] Using implicit features is an effective way to improve the precision of collusive fraud detection models.
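
    As a hedged sketch of a two-stage feature selection in this spirit (the paper's own algorithm and criteria differ), stage 1 filters candidate implicit features by mutual information and stage 2 refines the set with a wrapper.

        import numpy as np
        from sklearn.feature_selection import SelectKBest, mutual_info_classif, RFE
        from sklearn.linear_model import LogisticRegression

        X = np.random.rand(300, 30)               # 30 candidate implicit features
        y = np.random.randint(0, 2, 300)          # 1 = collusive fraud (toy labels)

        stage1 = SelectKBest(mutual_info_classif, k=15).fit(X, y)   # cheap filter
        X1 = stage1.transform(X)
        stage2 = RFE(LogisticRegression(max_iter=500), n_features_to_select=8).fit(X1, y)
        print("kept:", stage2.support_.sum(), "of", X.shape[1], "features")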