Table of Contents

25 May 2018, Volume 2 Issue 5
    

    Original Article
  • Lu Wei,Luo Mengqi,Ding Heng,Li Xin
    Data Analysis and Knowledge Discovery. 2018, 2(5): 1-10. https://doi.org/10.11925/infotech.2096-3467.2018.0052

    [Objective] This paper proposes a user tagging framework and examines the limitations of tagging images with deep learning techniques, aiming to improve the performance of automatic annotation services. [Methods] We analyzed the user-added tags from one million images on flickr.com to extract the high-frequency ones. Then, we mapped these tags with the proposed framework and compared them with tags from the ImageNet database. Finally, we analyzed images with high-frequency tags using the deep learning framework MXNet. [Results] The automatic image annotation techniques based on deep learning could not effectively understand the image’s background knowledge or the image’s descriptions from the human perspective. [Limitations] Our dataset needs to be expanded and analyzed with other deep learning algorithms. [Conclusions] The development of automatic image annotation requires us to establish the association between image information, background knowledge, and description, as well as to cultivate deductive reasoning and context-aware abilities.

  • Wang Xueying,Wang Hao,Zhang Zixuan
    Data Analysis and Knowledge Discovery. 2018, 2(5): 11-22. https://doi.org/10.11925/infotech.2096-3467.2017.1065

    [Objective] This paper aims to extract the semantic information from continuous strings in Chinese patent documents in the field of iron and steel metallurgy. [Methods] First, we collected strings with identified semantics as the learning corpus. Then, we used the corpus to examine the basic features, as well as the characteristics of Chinese characters and strings, to establish the best model. Finally, we used this model to recognize the semantics of other strings. [Results] The proposed model could effectively extract the semantics of the continuous strings. [Limitations] We did not include the identified characters in the training corpus. [Conclusions] The new model could identify the semantics of continuous strings in Chinese patent documents, and it could be used to study the continuous strings in English literature.

  • Zhang Tingting,Zhao Yuxiang,Zhu Qinghua
    Data Analysis and Knowledge Discovery. 2018, 2(5): 23-31. https://doi.org/10.11925/infotech.2096-3467.2017.1218

    [Objective] This paper analyzes the attributes and task characteristics of crowdsourcing community users, aiming to identify their potential interests or preferences. [Methods] First, we studied users’ sensitivity to task attributes using sensitivity analysis, and constructed a model for mining users’ potential preferences with a bipartite graph. Then, we used this model to discover the implicit preferences from users’ behaviors. Finally, we confirmed the validity of the proposed model through experimental analysis. [Results] Our model could effectively identify the degrees of users’ sensitivity to Books, Software, Music, etc. It could also discover users’ potential interests in or preferences for Pyrex Oblong Roaster, Oxford, Cashback, etc., to predict their choices. Compared with traditional collaborative filtering algorithms, the proposed model has a smaller MAE value. [Limitations] Our preference mining model is based on users in a competitive environment, and it does not consider the complementarity among the interests of collaborative users. [Conclusions] The proposed model could accurately understand users’ interests in crowdsourcing communities and then reveal their potential preferences. It helps us distribute crowdsourcing tasks effectively.

  • Wang Jiaqi,Zhang Junsheng,Qiao Xiaodong
    Data Analysis and Knowledge Discovery. 2018, 2(5): 32-39. https://doi.org/10.11925/infotech.2096-3467.2017.1328

    [Objective] This study proposes an approach to analyze research events from scholarly literature, aiming to help researchers obtain scientific information quickly and grasp development trends. [Methods] First, we examined the characteristics of scientific literature with the help of metadata and text content analysis. Then, we proposed a representation method for research events, and established semantic links among them to construct a network of related events. [Results] We established rules for building semantic links from the perspectives of temporal and citation relations among scientific research events. [Limitations] More research is needed to automatically or semi-automatically construct semantic links among scientific research events. [Conclusions] The network of scientific research events could improve event-based information analysis and retrieval.

  • Feng Guoming,Zhang Xiaodong,Liu Suhui
    Data Analysis and Knowledge Discovery. 2018, 2(5): 40-47. https://doi.org/10.11925/infotech.2096-3467.2017.1302

    [Objective] This paper tries to improve the accuracy of word segmentation for literature with many scientific terms. [Methods] First, we implemented the DBLC model, which combined dictionary-based, statistical, and deep learning methods. Then, we retrieved articles from the Chinese Management Case Center to build the experimental corpus. Finally, we compared the performance of this new model with that of the existing ones. [Results] The DBLC model performed better than the others. Its word segmentation accuracy was up to 96.3%. [Limitations] We did not separate the words of the original dictionary from the new words. We did not re-design the storage structure of the dictionary, which prolonged the computing time of our model. [Conclusions] The proposed DBLC model improves the accuracy of word segmentation, which is also positively correlated with the dictionary size.

  • Li Lin,Li Hui
    Data Analysis and Knowledge Discovery. 2018, 2(5): 48-58. https://doi.org/10.11925/infotech.2096-3467.2018.0007

    [Objective] This paper proposes a method to compute the semantic similarity of texts based on a concept vector space model. [Methods] First, we analyzed the text with a dependency parser and extracted keywords. Then, we used a word embedding method to build a vector space for each document. Third, we measured similarities between the two vector spaces and actual texts. Finally, we evaluated the new similarity measures with a data set of short texts and proposed an algorithm for long document classification. [Results] The proposed method effectively measured the semantic similarity of short texts and long documents. The accuracy of document classification was over 92% for the long ones. [Limitations] The performance of our method relies on the quality of the dependency parser and the word embedding vectors. [Conclusions] Combining linguistic theory and word embedding techniques could effectively measure the semantic similarity among texts. This new method also reduces computational complexity and could be used in document classification, text clustering, and automatic question answering systems.

  • Wen Tingxin,Li Yangzi,Sun Jingshuang
    Data Analysis and Knowledge Discovery. 2018, 2(5): 59-69. https://doi.org/10.11925/infotech.2096-3467.2017.1119

    [Objective] This paper tries to reduce the dimension of the text feature vector space and thereby improve the accuracy of text classification. [Methods] We proposed a text feature selection model, IFOATFSO, based on the improved fruit fly optimization algorithm. It introduced the classification accuracy variance to monitor the convergence degree of the model. We also used the crossover operator, the roulette wheel selection method based on the simulated annealing mechanism and the genetic algorithm to deepen the global search and improve population diversity. [Results] The IFOATFSO model, which optimized the feature selection based on the CHI method, not only reduced the feature dimension, but also improved the accuracy of text classification by up to 10.5%. [Limitations] The performance of the IFOATFSO model for extracting English text features needs to be improved. [Conclusions] The IFOATFSO model improves text classification.

  • Wang Yong,Wang Yongdong,Guo Huifang,Zhou Yumin
    Data Analysis and Knowledge Discovery. 2018, 2(5): 70-76. https://doi.org/10.11925/infotech.2096-3467.2017.1019

    [Objective] This study aims to solve the issues facing traditional methods of measuring item similarity, such as dependence on common ratings and poor prediction accuracy in highly sparse data environments. [Methods] First, we constructed the dissimilarity coefficient with the increment of diversity from bioinformatics. Then, we calculated item similarity according to the frequency and distribution of ratings, which effectively addressed the data sparsity issue. Finally, we improved the accuracy of measurement with the item attributes. [Results] Compared with traditional algorithms, the proposed method reduced RMSE by 2.56% and increased the F value by 3.88%. [Limitations] The diversity of our recommendations might be insufficient. [Conclusions] The proposed method could effectively measure item similarity.

  • Yang Sinan,Xu Jian,Ye Pingping
    Data Analysis and Knowledge Discovery. 2018, 2(5): 77-87. https://doi.org/10.11925/infotech.2096-3467.2017.1316

    [Objective] This paper reviews the main techniques for sentiment analysis of online reviews, and then discusses their major development trends. [Methods] First, we surveyed relevant scientific literature on sentiment analysis of web reviews published in recent years. Then, we summarized the characteristics of visualization methods and analyzed the features of visualization tools. [Results] We could visualize the sentiment of web reviews from the perspectives of content, space-time, and topics. The visualization tools include static, interactive, and programming ones. [Conclusions] This paper reviews the major methods and tools for online content visualization and indicates three major development trends. It could promote future research and the development of new visualization tools.

  • Qi Huiying,Guo Jianguang
    Data Analysis and Knowledge Discovery. 2018, 2(5): 88-93. https://doi.org/10.11925/infotech.2096-3467.2017.1321

    [Objective] This study explores new ways to integrate multi-source clinical research data based on the CDISC standards. [Context] The proposed method simplifies the procedures of submitting research data to the drug regulatory department and speeds up the listing of new drugs. It also promotes the sharing of data from different studies. [Methods] First, we designed a CRF based on the CDISC CDASH standard. Then, we mapped the electronic medical records to the CRF in accordance with the ODM standard. Third, we integrated the medical records with the clinical experimental data in the EDC system. Finally, all data were stored in a database in the standard SDTM format. [Results] We successfully integrated data from different systems into a CDISC database. [Conclusions] The proposed method effectively integrates electronic medical records and clinical experimental data. It helps us avoid entering duplicated data and improves the efficiency of clinical research.

  • Hua Lingfeng,Yang Gaoming,Wang Xiujun
    Data Analysis and Knowledge Discovery. 2018, 2(5): 94-104. https://doi.org/10.11925/infotech.2096-3467.2017.1009

    [Objective] Location-based hybrid recommendation methods are not accurate and suffer from a cold-start problem for existing users in new locations, because their design does not incorporate users’ location information well. This paper proposes the Diversity news Location-oriented Recommendation algorithm (DLR), aiming to improve the performance of traditional methods. [Methods] First, we clustered the location tags from users’ historical behavior data. Then, we used the LDA model and the classic collaborative filtering algorithm based on 3D similarity to establish a preference model for each position cluster. Finally, we obtained a user’s current position with the help of GPS, and selected a preference cluster model for this user. [Results] The proposed method generated two preference lists, and chose the Top-n of the two lists as recommended news for the user. [Limitations] The proposed method could not effectively solve the cold-start issue facing new users. [Conclusions] The DLR model could improve the diversity and accuracy of recommended news.
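
    As a rough illustration of the last step in the abstract above (choosing the Top-n of two preference lists), the sketch below interleaves two ranked lists and keeps the first n distinct items. The abstract does not say how the two lists are combined, so the round-robin interleaving, the function name `merge_top_n`, and the sample news IDs are all assumptions made for illustration, not the authors' method.

```python
from itertools import chain, zip_longest


def merge_top_n(list_a, list_b, n):
    """Interleave two ranked lists and return the first n distinct items.

    Hypothetical helper: round-robin merging is an assumed strategy,
    not the procedure described in the DLR paper.
    """
    if n <= 0:
        return []
    missing = object()  # sentinel for the shorter list's padding
    merged, seen = [], set()
    pairs = zip_longest(list_a, list_b, fillvalue=missing)
    for item in chain.from_iterable(pairs):
        if item is missing or item in seen:
            continue  # skip padding and duplicates
        merged.append(item)
        seen.add(item)
        if len(merged) == n:
            break
    return merged


# Made-up news IDs standing in for the two preference lists.
topic_based = ["n1", "n2", "n3"]
cf_based = ["n2", "n4", "n5"]
recommended = merge_top_n(topic_based, cf_based, 4)
```

    Alternating between the lists gives each preference model an equal share of the final slots; a score-weighted merge would be another reasonable choice if both lists carried comparable scores.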