Data Analysis and Knowledge Discovery

Examining Product Reviews with Sentiment Analysis and Opinion Mining

Guo Bo,Li Shouguang,Wang Hao,Zhang Xiaojun,Gong Wei,Yu Zhaojun,Sun Yu

Data Analysis and Knowledge Discovery. 2017, 1(12): 1-9. https://doi.org/10.11925/infotech.2096-3467.2017.0618

[Objective] This study analyzes the large volume of reviews generated by users of e-commerce websites, aiming to assess marketing strategies. [Methods] We used syntactic parsing, the bag-of-words model and machine learning techniques to examine real-world datasets from JD and TMall. The proposed method analyzes sentiment and extracts opinions from the reviews automatically. [Results] The accuracy of the sentiment analysis was 90%. We constructed an automatic vocabulary-building mechanism with no dictionary dependency. The F-measure of the new system was 71%. [Limitations] The recall of the opinion extraction needs to be improved. [Conclusions] The proposed system could effectively monitor word-of-mouth issues facing products sold online, and it could be transferred to many online businesses.
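The F-measure reported in this abstract is conventionally the harmonic mean of precision and recall, which is why low recall in opinion extraction pulls the score down. The sketch below is only illustrative: the function name and the counts are hypothetical, since the abstract does not report precision or recall separately.

```python
def f_measure(true_positives, false_positives, false_negatives, beta=1.0):
    """Return (precision, recall, F_beta) computed from raw extraction counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_score


# Hypothetical counts chosen for illustration only (not taken from the paper).
precision, recall, f1 = f_measure(true_positives=60, false_positives=15, false_negatives=40)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.3f}")
```

With these made-up counts, precision is high but the lower recall dominates the harmonic mean.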

Online Room Listing Photos Affect Consumer’s Intentions

Wu Jiang,Jin Mengmeng

Data Analysis and Knowledge Discovery. 2017, 1(12): 10-20. https://doi.org/10.11925/infotech.2096-3467.2017.0789

[Objective] This paper investigates the impact of listing photos on consumers’ behavior, with data collected from online room-sharing services. [Methods] First, we built a model to describe the relationship between listing photos and consumers’ intentions. The model was based on the SOR model and Cue Utilization Theory, as well as the task-relevant and affection-relevant cues of listing photos. Then, we collected the needed data with surveys. Finally, we employed SmartPLS 3.2 to examine the proposed model. [Results] Both the task-relevant and affection-relevant cues of listing photos had positive impacts on consumers’ perceived diagnosticity and mental imagery, which increased their intention to use the platform in the future. Product involvement had a significant positive effect on the relationship between the task-relevant cues of listing photos and consumers’ mental imagery. [Limitations] We did not include the impacts of other factors on consumers’ behavioral intention. Image recognition is needed in future research. [Conclusions] Listing photos on room-sharing platforms could influence consumers’ behavioral intention and product involvement.

Customer Segmentation with Adaptive Evolutionary Clustering

Hu Xiaoxue

Data Analysis and Knowledge Discovery. 2017, 1(12): 21-31. https://doi.org/10.11925/infotech.2096-3467.2017.0588

[Objective] This paper proposes an adaptive evolutionary clustering framework for contracted customer segmentation with changing cluster structure, aiming to solve the multi-period dynamic customer segmentation problem. [Methods] The proposed framework tracks customer segmentation results within a clustering cycle and updates the proximity matrix and clustering parameters dynamically. For each clustering period, we eliminated expired clusters from the latest adjacent period based on the contract termination date. Then, we calculated the estimated proximity matrix for current customers. We also changed the existing clusters’ structure according to data on new customers and developed guidelines for adding new clusters. Finally, we applied the proposed algorithm with the updated proximity matrix and parameters to obtain the final clustering results for a specific period. [Results] The proposed framework could significantly improve the efficiency of clustering by excluding the process of selecting and matching clusters. [Limitations] The proposed algorithm was not examined with other datasets. [Conclusions] The proposed framework could effectively track the evolutionary trajectories of customer groups and eliminate problems facing traditional methods. It could perform multi-period dynamic segmentation for contracted customers.

Research on Text Clustering Based on Requirements of Big Data Jobs

Liu Ruilun,Ye Wenhao,Gao Ruiqing,Tang Mengjia,Wang Dongbo

Data Analysis and Knowledge Discovery. 2017, 1(12): 32-40. https://doi.org/10.11925/infotech.2096-3467.2017.0817

[Objective] This study analyzes the requirements of big data related positions, aiming to help companies identify high-quality candidates. [Methods] We retrieved job postings in the field of big data from major recruitment websites during the first quarter of 2017. Then, we used the TF-IDF, word2vec, and k-means algorithms to cluster the texts semantically, and optimized the clustering with the help of the silhouette coefficient. [Results] We obtained good clustering results and divided the job requirements into three categories: capability, education background and work experience. [Limitations] First, the formats of job announcements posted on different websites were not unified, which affected the data cleaning and clustering. Second, the training set for word2vec was small due to insufficient data retrieved from the Web. [Conclusions] We found that big data related jobs do not require advanced degrees and that companies prefer experienced candidates. Applicants with no relevant experience will also be considered. The candidates’ professionalism is more important than their computer skills.
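The silhouette coefficient mentioned in this abstract scores how well each item sits inside its assigned cluster, so it can be used to compare candidate clusterings (for example, different values of k). The sketch below is a plain-Python illustration on toy 2-D points; the function names and data are assumptions, and the paper's actual vectors came from TF-IDF and word2vec rather than hand-written coordinates.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (needs at least 2 clusters)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # common convention: singleton clusters score 0
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two tight groups of points.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups kept together
bad = silhouette_score(points, [0, 1, 0, 1])   # groups split apart
print(good > bad)
```

A clustering that keeps the tight groups together scores higher than one that splits them, which is the signal used when tuning the number of clusters.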

Ranking Learning Method Based on Random Walk Model

He Wanying,Yang Jianlin

Data Analysis and Knowledge Discovery. 2017, 1(12): 41-48. https://doi.org/10.11925/infotech.2096-3467.2017.0625

[Objective] This paper tries to obtain tagged training data for supervised ranking learning tasks. [Methods] First, we proposed a ranking learning method based on the random walk model. Then, we used this method to automatically tag the training data, which also reduced the dependency of ranking on the tags. Finally, we examined our method with the OHSUMED data set. [Results] We finished the ranking learning tasks with only half of the samples tagged. Compared with algorithms based on fully tagged samples, the proposed method performed better than the RankNet algorithm but not as well as the ListNet one. [Limitations] Our method requires a random walk for each query, which is time-consuming in practice. [Conclusions] The proposed method can effectively rank the learning results of training data.
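The abstract does not specify how the random walk is constructed, so the following is a generic sketch rather than the authors' method: a random walk with restart over a hypothetical document-similarity graph, where score mass flows from one tagged document to untagged ones. The function name, the similarity matrix and the restart probability are all assumptions.

```python
def random_walk_with_restart(weights, seed_scores, restart=0.15, iterations=100):
    """Propagate seed scores over a weighted graph by power iteration."""
    n = len(weights)
    # Row-normalise edge weights into transition probabilities.
    transition = []
    for row in weights:
        total = sum(row)
        transition.append([w / total if total else 0.0 for w in row])
    total_seed = sum(seed_scores)
    if total_seed == 0:
        raise ValueError("at least one seed score must be positive")
    restart_dist = [s / total_seed for s in seed_scores]
    scores = list(restart_dist)
    for _ in range(iterations):
        # With probability `restart` jump back to a seed, otherwise follow an edge.
        scores = [
            restart * restart_dist[j]
            + (1 - restart) * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


# Hypothetical similarity graph for four documents retrieved for one query.
similarity = [
    [0.0, 0.9, 0.1, 0.0],
    [0.9, 0.0, 0.2, 0.1],
    [0.1, 0.2, 0.0, 0.8],
    [0.0, 0.1, 0.8, 0.0],
]
# Only document 0 carries a relevance tag.
scores = random_walk_with_restart(similarity, seed_scores=[1, 0, 0, 0])
print([round(s, 3) for s in scores])
```

Documents closely linked to the tagged one receive higher scores, which could then stand in for missing tags. Running such a walk once per query is also consistent with the time cost noted under Limitations.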

Construction of Aggregation Quality Predicting Model for Digital Resource in Library ——Based on Improved Genetic Algorithm and BP Neural Network

Yan Jing,Bi Qiang,Li Jie,Wang Fu

Data Analysis and Knowledge Discovery. 2017, 1(12): 49-62. https://doi.org/10.11925/infotech.2096-3467.2017.0786

[Objective] This paper proposes a model to predict the quality of library digital resource aggregation with the help of an improved BP neural network based on a genetic algorithm. [Methods] The genetic algorithm is computationally simple, depends little on the problem to be solved, and lends itself to fast parallel computation. First, we obtained the initial weight and threshold through increased population diversity, selection, crossover and variation. Second, we introduced the improved genetic algorithm to the BP neural network, which rapidly reached the fitness setting level by constantly adjusting the weight and threshold values. Finally, we further optimized the performance of the prediction model. [Results] We used the MATLAB R2014a platform to examine the proposed model, and the average prediction error was 2.74E-04, which was smaller than the actual data. It took the program 18.56 seconds or three steps to finish the task. The prediction accuracy and efficiency of the proposed model were better than those of the genetic or BP algorithms alone. [Limitations] The quality of sample data needs to be improved. We did not compare our training time and prediction accuracy with those of other quick training functions. The population numbers are limited due to computational complexity. [Conclusions] The proposed model could predict the quality of digital resource aggregation efficiently and objectively.

Hierarchical Classification Model for Invention Patents

Zhai Dongsheng,Hu Dengjin,Zhang Jie,He Xijun,Liu He

Data Analysis and Knowledge Discovery. 2017, 1(12): 63-73. https://doi.org/10.11925/infotech.2096-3467.2017.0820

[Objective] This paper proposes a new model to process patent information based on a machine learning classification algorithm, aiming to determine the level of invention. [Methods] First, we extracted the technology feature words from the patent texts. Then, we constructed the patent technology feature vector with a model trained with Word2Vec. Third, we calculated patent text indicators and backward references to build the training set. Finally, we constructed the new model with a machine learning classification algorithm. [Results] We retrieved patents in the field of speech recognition technology with the proposed model. We found that the proportion of advanced-level to entry-level patents was around 1:4, which was in line with the actual situation. [Limitations] Reliance on the WordNet dictionary limits the extraction results. [Conclusions] The proposed model could effectively identify the advanced patents and recommend them to business owners.

Identifying Useful Online Reviews with Semantic Feature Extraction

Zhang Yanfeng,Li He,Peng Lihui,Hou Litie

Data Analysis and Knowledge Discovery. 2017, 1(12): 74-83. https://doi.org/10.11925/infotech.2096-3467.2017.0866

[Objective] We propose a model to identify useful online Chinese reviews, which helps consumers make purchasing decisions. [Methods] First, we calculated six attributes affecting the usefulness of online reviews based on their form and content characteristics. Then, we constructed a usefulness evaluation system with the weighted grey relational degree analysis method. Finally, we created a model to retrieve useful online reviews with the k-means clustering method. [Results] We examined the effectiveness of our model with online reviews from Amazon.com. The recall, precision and F values showed that our method could effectively identify useful online reviews and classify them by polarity. [Limitations] The samples, metrics and e-commerce platforms could be further improved. [Conclusions] The proposed method could rank and classify online reviews accurately and reliably.
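Weighted grey relational degree analysis, as named in this abstract, compares each item's attribute sequence with a reference (ideal) sequence and combines the per-attribute grey relational coefficients using attribute weights. The sketch below uses the standard coefficient formula with a distinguishing coefficient of 0.5. The three attributes, weights and review values are hypothetical (the paper uses six attributes whose definitions are not given here), and inputs are assumed to be normalised to a common scale.

```python
def weighted_grey_relational_degree(reference, candidates, weights, rho=0.5):
    """Score each candidate sequence against a reference sequence.

    Coefficient for attribute k: (d_min + rho * d_max) / (d_k + rho * d_max),
    where d_k is the absolute gap to the reference and d_min / d_max are the
    smallest and largest gaps over all candidates and attributes.
    """
    deltas = [[abs(r - c) for r, c in zip(reference, cand)] for cand in candidates]
    flat = [d for row in deltas for d in row]
    d_min, d_max = min(flat), max(flat)
    if d_max == 0:
        # Every candidate equals the reference, so every coefficient is 1.
        return [sum(weights)] * len(candidates)
    degrees = []
    for row in deltas:
        coeffs = [(d_min + rho * d_max) / (d + rho * d_max) for d in row]
        degrees.append(sum(w * c for w, c in zip(weights, coeffs)))
    return degrees


# Hypothetical, normalised attribute values for two reviews (illustration only).
ideal = [1.0, 1.0, 1.0]
reviews = [[0.9, 0.8, 1.0], [0.4, 0.5, 0.3]]
attribute_weights = [0.5, 0.3, 0.2]  # assumed to sum to 1
degrees = weighted_grey_relational_degree(ideal, reviews, attribute_weights)
print([round(d, 3) for d in degrees])
```

The review closer to the ideal sequence receives the higher degree, which could then serve as a usefulness score for ranking or as an input to clustering.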

Uncertain Data Clustering Algorithm Based on Local Density

Luo Yanfu,Qian Xiaodong

Data Analysis and Knowledge Discovery. 2017, 1(12): 84-91. https://doi.org/10.11925/infotech.2096-3467.2017.0724

[Objective] This paper proposes a new algorithm to cluster uncertain data, aiming to reduce the shortcomings inherited from the classic ones. [Methods] First, we modified the measurement of uncertain distance and compared the probability differences between two existing uncertain objects. Then, we defined the cluster centers and proposed a new algorithm to group the data into the related clusters based on the concepts of maximum supporting points and density chain regions. [Results] We used two data sets from the UCI machine learning library to examine the proposed algorithm. We found that the F values of the two data sets increased by 13.23% and 23.44% compared to the traditional algorithms (UK-Means and FDBSCAN). The algorithm took longer to calculate the distance matrix. Therefore, the overall clustering time was only slightly shorter than that of the traditional algorithms. [Limitations] There was no appropriate method to define the parameter for the proposed algorithm, and the time complexity of clustering was high. [Conclusions] The proposed algorithm could quickly determine the clustering centers and complete the clustering tasks. The value of t (the only parameter) strongly influences the clustering results.

Self-Explainable Reduction Method for Mixed Feature Data Modeling

Jiang Siwei,Xie Zhenping,Chen Meijie,Cai Ming

Data Analysis and Knowledge Discovery. 2017, 1(12): 92-100. https://doi.org/10.11925/infotech.2096-3467.2017.0955

[Objective] This paper aims to mine data with both continuous numeric and label features. [Methods] We proposed a self-explainable reduction model to represent the data. The proposed model used the new reduction objective to create an adaptive discrete division for each continuous data dimension. [Results] We examined the new model with standard datasets and found that it performed better than the existing ones. [Limitations] The computational efficiency of the proposed method was limited and cannot meet the demands of large-scale data mining. [Conclusions] The proposed model is an innovative and practical way to model mixed feature data.

25 December 2017, Volume 1 Issue 12
