Table of Contents

25 December 2017, Volume 1 Issue 12
    

    Original Article
  • Guo Bo,Li Shouguang,Wang Hao,Zhang Xiaojun,Gong Wei,Yu Zhaojun,Sun Yu
    Data Analysis and Knowledge Discovery. 2017, 1(12): 1-9. https://doi.org/10.11925/infotech.2096-3467.2017.0618

    [Objective] This study conducts a comprehensive analysis of the large volume of reviews generated by e-commerce website users, aiming to assess marketing strategies. [Methods] We used syntactic parsing, the bag-of-words model, and machine learning techniques to examine real-world datasets from JD and TMall. The proposed method could analyze sentiment and extract opinions from the reviews automatically. [Results] The accuracy of the sentiment analysis was 90%. We constructed an automatic vocabulary-building mechanism without dictionary dependency. The F-measure of the new system was 71%. [Limitations] The recall of the opinion extraction needs to be improved. [Conclusions] The proposed system could effectively monitor word-of-mouth issues facing products sold online. It could be transferred to many online businesses.
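
The abstract reports an F-measure but not the underlying precision and recall. As a reminder of how the metric is usually defined, here is a minimal sketch of the weighted F-measure. It assumes the balanced F1 variant (`beta = 1`), which the abstract does not confirm, and the input figures are hypothetical rather than taken from the paper.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0  # both inputs are zero, so the score is defined as zero
    return (1 + beta ** 2) * precision * recall / denom


# Hypothetical figures for illustration only (not reported in the abstract).
example = f_measure(precision=0.8, recall=0.6)
```

Because the F-measure is a harmonic mean, a low recall pulls the score down sharply, which is consistent with the stated limitation that recall needs to improve.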

  • Wu Jiang,Jin Mengmeng
    Data Analysis and Knowledge Discovery. 2017, 1(12): 10-20. https://doi.org/10.11925/infotech.2096-3467.2017.0789

    [Objective] This paper investigates the impact of listing photos on consumers’ behavior, with data collected from online room-sharing services. [Methods] First, we built a model to describe the relationship between listing photos and consumers’ intention. The model was based on the SOR model and Cue Utilization Theory, as well as the task-relevant and affection-relevant cues of listing photos. Then, we collected the needed data with surveys. Finally, we employed SmartPLS 3.2 to examine the proposed model. [Results] Both the task-relevant and affection-relevant cues of listing photos had positive impacts on the consumer’s perceived diagnosticity and mental imagery, which increased the consumer’s intention to use the platform in the future. Product involvement had a significant positive effect on the relationship between the task-relevant cues of listing photos and the consumer’s mental imagery. [Limitations] We did not include the impacts of other factors on consumers’ behavioral intention. Image recognition is needed in future research. [Conclusions] Listing photos on room-sharing platforms could influence consumers’ behavioral intention and product involvement.

  • Hu Xiaoxue
    Data Analysis and Knowledge Discovery. 2017, 1(12): 21-31. https://doi.org/10.11925/infotech.2096-3467.2017.0588

    [Objective] This paper proposes an adaptive evolutionary clustering framework for contracted customer segmentation with changing cluster structure, aiming to solve the multi-period dynamic customer segmentation problem. [Methods] The proposed framework could track customer segmentation results within a clustering cycle, updating the proximity matrix and clustering parameters dynamically. For each clustering period, we eliminated expired clusters from the latest adjacent period based on the contract termination date. Then, we calculated the estimated proximity matrix for current customers. We also changed the existing clusters’ structure according to the data of new customers and developed guidelines for adding new clusters. Finally, we ran the proposed algorithm with the updated proximity matrix and parameters to obtain the final clustering results of a specific period. [Results] The proposed framework could significantly improve the efficiency of clustering by excluding the process of selecting and matching clusters. [Limitations] The proposed algorithm was not examined with other datasets. [Conclusions] The proposed framework could effectively track the evolutionary trajectories of customer groups and eliminate problems facing traditional methods. It could perform multi-period dynamic segmentation for contracted customers.

  • Liu Ruilun,Ye Wenhao,Gao Ruiqing,Tang Mengjia,Wang Dongbo
    Data Analysis and Knowledge Discovery. 2017, 1(12): 32-40. https://doi.org/10.11925/infotech.2096-3467.2017.0817

    [Objective] This study analyzes the requirements of big data related positions, aiming to help companies identify high-quality candidates. [Methods] We retrieved job postings in the field of big data from major recruitment websites during the first quarter of 2017. Then, we used the TF-IDF, word2vec, and k-means algorithms to cluster the texts semantically, and optimized the clustering with the help of the silhouette coefficient. [Results] We obtained very good clustering results, and divided the job requirements into three categories: capability, educational background, and work experience. [Limitations] First, the formats of job announcements posted on different websites were not unified, which affected the data cleaning and clustering. Second, the training set for word2vec was small due to insufficient data retrieved from the Web. [Conclusions] We found that big data related jobs do not require advanced degrees and that companies prefer experienced candidates. Applicants with no relevant experience will also be considered. The candidates’ professionalism is more important than their computer skills.
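
The abstract says the k-means clustering was optimized with the silhouette coefficient but gives no details. The sketch below illustrates one common way to do this: run k-means for several values of k and keep the k with the highest mean silhouette. It is not the authors' pipeline. The toy 2D points stand in for the TF-IDF/word2vec text vectors, and the function names, the farthest-point initialization, and the candidate range of k are all assumptions made for illustration.

```python
import math
from typing import List, Sequence

Point = Sequence[float]


def kmeans(points: List[Point], k: int, iters: int = 50) -> List[int]:
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        # Next center: the point farthest from its nearest existing center.
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def silhouette(points: List[Point], labels: List[int]) -> float:
    """Mean silhouette coefficient; requires at least two clusters."""
    clusters = sorted(set(labels))
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance within the point's cluster
        b = min(  # mean distance to the nearest other cluster
            sum(math.dist(p, q) for q, lab in zip(points, labels) if lab == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy data: three well-separated groups (hypothetical, not from the paper).
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
        (20, 0), (20, 1), (21, 0)]
scores_by_k = {k: silhouette(data, kmeans(data, k)) for k in range(2, 6)}
best_k = max(scores_by_k, key=scores_by_k.get)
```

On real text vectors the silhouette curve is usually much flatter than on this toy data, so the chosen k should be checked against the interpretability of the resulting clusters.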

  • He Wanying,Yang Jianlin
    Data Analysis and Knowledge Discovery. 2017, 1(12): 41-48. https://doi.org/10.11925/infotech.2096-3467.2017.0625

    [Objective] This paper tries to obtain the tagging data of the training corpus for supervised ranking learning tasks. [Methods] First, we proposed a ranking learning method based on the random walk model. Then, we used this method to automatically tag the training data, which also reduced the dependency of ranking on the tags. Finally, we examined our method with the OHSUMED data set. [Results] We finished the ranking learning tasks with only half of the samples tagged. Compared with algorithms trained on fully tagged samples, the proposed method performed better than the RankNet algorithm but not as well as the ListNet one. [Limitations] Our method requires a random walk for each query, which is time-consuming in practice. [Conclusions] The proposed method can effectively rank the learning results of training data.

  • Yan Jing,Bi Qiang,Li Jie,Wang Fu
    Data Analysis and Knowledge Discovery. 2017, 1(12): 49-62. https://doi.org/10.11925/infotech.2096-3467.2017.0786

    [Objective] This paper proposes a model to predict the quality of library digital resource aggregation with the help of an improved BP neural network based on a genetic algorithm. [Methods] The genetic algorithm is simple to compute, less dependent on the problems to be solved, and lends itself to fast parallel computation. First, we obtained the initial weights and thresholds with increased population diversity, selection, crossover, and mutation. Second, we introduced the improved genetic algorithm to the BP neural network, which rapidly reached the fitness setting level by constantly adjusting the weight and threshold values. Finally, we further optimized the performance of the prediction model. [Results] We used the MATLAB R2014a platform to examine the proposed model, and the average prediction error was 2.74E-04, which was smaller than the actual data. It took the program 18.56 seconds or three steps to finish the task. The prediction accuracy and efficiency of the proposed model were better than those of the single genetic or BP algorithms. [Limitations] The quality of the sample data needs to be improved. We did not compare our training time and prediction accuracy with those of other quick training functions. The population size was limited due to computational complexity. [Conclusions] The proposed model could predict the quality of digital resource aggregation efficiently and objectively.

  • Zhai Dongsheng,Hu Dengjin,Zhang Jie,He Xijun,Liu He
    Data Analysis and Knowledge Discovery. 2017, 1(12): 63-73. https://doi.org/10.11925/infotech.2096-3467.2017.0820

    [Objective] This paper proposes a new model to process patent information based on a machine learning classification algorithm, aiming to determine the level of invention. [Methods] First, we extracted the technology feature words from the patent texts. Then, we constructed the patent technology feature vector with a model trained by Word2Vec. Third, we calculated patent text indicators and backward references to build the training set. Finally, we constructed the new model with a machine learning classification algorithm. [Results] We retrieved patents in the field of speech recognition technology with the proposed model. We found that the ratio of advanced-level to entry-level patents was around 1:4, which was in line with the actual situation. [Limitations] The WordNet dictionary limits the results of extraction. [Conclusions] The proposed model could effectively identify advanced patents and recommend them to business owners.

  • Zhang Yanfeng,Li He,Peng Lihui,Hou Litie
    Data Analysis and Knowledge Discovery. 2017, 1(12): 74-83. https://doi.org/10.11925/infotech.2096-3467.2017.0866

    [Objective] We propose a model to identify useful online Chinese reviews, which helps consumers make purchasing decisions. [Methods] First, we calculated six attributes affecting the usefulness of online reviews based on their form and content characteristics. Then, we constructed a usefulness evaluation system with the weighted grey relational degree analysis method. Finally, we created a model to retrieve useful online reviews with the k-means clustering method. [Results] We examined the effectiveness of our model with online reviews from Amazon.com. The recall, precision, and F-measure values showed that our method could effectively identify useful online reviews and classify them by polarity. [Limitations] The samples, metrics, and e-commerce platforms could be further improved. [Conclusions] The proposed method could rank and classify online reviews accurately and reliably.
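
The abstract names weighted grey relational degree analysis without giving the formula. The sketch below follows the textbook form of grey relational analysis: a relational coefficient per attribute, then a weighted sum. It is not necessarily the authors' exact formulation. The six attribute values, the weights, the ideal reference sequence, and the distinguishing coefficient `rho = 0.5` are all hypothetical.

```python
from typing import List, Sequence


def grey_relational_degrees(
    reference: Sequence[float],
    candidates: Sequence[Sequence[float]],
    weights: Sequence[float],
    rho: float = 0.5,
) -> List[float]:
    """Weighted grey relational degree of each candidate to the reference.

    Attribute values are assumed to be normalized to a common scale, and
    the weights are assumed to sum to 1.
    """
    # Absolute difference between the reference and every candidate attribute.
    deltas = [[abs(r - x) for r, x in zip(reference, row)] for row in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every candidate coincides with the reference.
        return [1.0] * len(candidates)
    degrees = []
    for row in deltas:
        # Grey relational coefficient for each attribute of this candidate.
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        degrees.append(sum(w * c for w, c in zip(weights, coeffs)))
    return degrees


# Hypothetical example: six normalized attributes for two reviews, compared
# against an ideal review that scores 1.0 on every attribute.
ideal = [1.0] * 6
reviews = [
    [0.9, 0.8, 1.0, 0.7, 0.9, 0.8],  # detailed, informative review
    [0.2, 0.4, 0.1, 0.5, 0.3, 0.2],  # short, uninformative review
]
attribute_weights = [0.25, 0.20, 0.20, 0.15, 0.10, 0.10]
degrees = grey_relational_degrees(ideal, reviews, attribute_weights)
```

A higher degree means the review sits closer to the ideal profile, so the degrees can serve directly as a usefulness ranking or as input to a later clustering step.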

  • Luo Yanfu,Qian Xiaodong
    Data Analysis and Knowledge Discovery. 2017, 1(12): 84-91. https://doi.org/10.11925/infotech.2096-3467.2017.0724

    [Objective] This paper proposes a new algorithm to cluster uncertain data, aiming to reduce the shortcomings inherited from the classic ones. [Methods] First, we modified the measurement of uncertain distance and compared the probability differences between two existing uncertain objects. Then, we defined the cluster centers and proposed a new algorithm to group the data into the related clusters based on the concepts of maximum supporting points and density chain regions. [Results] We used two data sets from the UCI machine learning library to examine the proposed algorithm. We found that the F values of the two data sets increased by 13.23% and 23.44% compared to the traditional algorithms (UK-Means and FDBSCAN). The algorithm took longer to calculate the distance matrix. Therefore, the overall clustering time was only slightly shorter than that of the traditional algorithms. [Limitations] There was no appropriate method to define the parameter for the proposed algorithm, and the time complexity of clustering was high. [Conclusions] The proposed algorithm could quickly determine the clustering centers and complete the clustering tasks. The value of t (the only parameter) strongly influences the clustering results.

  • Jiang Siwei,Xie Zhenping,Chen Meijie,Cai Ming
    Data Analysis and Knowledge Discovery. 2017, 1(12): 92-100. https://doi.org/10.11925/infotech.2096-3467.2017.0955

    [Objective] This paper aims to mine data with continuous numeric and label features. [Methods] We proposed a self-explainable reduction model to represent the data. The proposed model used the new reduction objective to create an adaptive discrete division for each continuous data dimension. [Results] We examined the new model with standard datasets and found that it performed better than the existing ones. [Limitations] The computational efficiency of the proposed method was limited and could not meet the demands of large-scale data mining. [Conclusions] The proposed model is an innovative and practical way to model mixed-feature data.