Data Analysis and Knowledge Discovery

Detecting Product Review Spam: A Survey

Jiafen Wu,Feicheng Ma

Data Analysis and Knowledge Discovery. 2019, 3(9): 1-15. https://doi.org/10.11925/infotech.2096-3467.2018.0959

[Objective] This paper reviews current studies on fighting product review spam. [Coverage] We searched for “review spam” in eight major scholarly databases (e.g., WoS, CNKI and EI) and retrieved a total of 90 relevant papers. [Methods] First, we adopted a systematic review procedure to identify and categorize the methods for detecting product review spam. Then, we compared the impacts of spam features on detection performance. [Results] Spam features and detection methods were the key issues in fighting product review spam. The acquisition of large-scale annotated data was a challenging task for current research. [Limitations] We did not examine the detection and classification methods for spammers. [Conclusions] This paper analyzes spam detection from the perspectives of data acquisition, spamming features and detection methods. It offers suggestions and directions for future research.

Review of Automatic Labeling for Topic Models

Hongfei Ling,Shiyan Ou

Data Analysis and Knowledge Discovery. 2019, 3(9): 16-26. https://doi.org/10.11925/infotech.2096-3467.2018.1127

[Objective] This paper reviews methods of automatic topic labeling, aiming to promote the development of topic modelling. [Coverage] We used “Topic Labeling OR Topic Labeling OR Topic Tagging OR Topic Indexing” as the search term for the Web of Science and CNKI databases. A total of 57 representative papers on topic labeling were retrieved. [Methods] We categorized the existing methods and then conducted a comparative analysis of them. [Results] Automatic topic labeling usually had two steps: generating candidate labels from a corpus and then ranking them. These methods can be divided into two categories: label generation based on an internal corpus or on an external corpus. [Limitations] We might not be able to cover everything in this field. [Conclusions] More research could be done in automatic labeling, e.g., for user-generated content from social media using deep learning technologies.

Determining Best Text Clustering Number with Mean Shift Algorithm

Huaming Zhao,Li Yu,Qiang Zhou

Data Analysis and Knowledge Discovery. 2019, 3(9): 27-35. https://doi.org/10.11925/infotech.2096-3467.2018.1259

[Objective] This paper explores an optimal method for determining the best number of text clusters, aiming to improve the effectiveness of related algorithms. [Methods] First, we combined the TF-IDF and Word2Vec algorithms to extract the top-N keyword vectors as the text feature representation of the corpus. Then, we determined the best number of text clusters with the mean shift algorithm, a clustering validity index (Silhouette) and the mean square error (MSE) index. [Results] We found that the top 4500 keyword vectors could better represent the text features. The best number of text clusters found by the mean shift algorithm matched the manually optimized results. [Limitations] The size of the experimental data sets needs to be expanded. Our results should be compared with those of other applications. [Conclusions] The proposed method could effectively determine the best number of text clusters in an unsupervised way.
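
As a rough illustration of why mean shift can suggest a cluster count without one being fixed in advance, here is a minimal sketch. It is not the authors' implementation: the 2-D toy points stand in for text vectors, and the flat kernel, the `bandwidth` value and the function name `mean_shift_modes` are all assumptions made for the example.

```python
import math

def mean_shift_modes(points, bandwidth, max_iter=100, tol=1e-6):
    """Flat-kernel mean shift (illustrative only).

    Each point is moved repeatedly to the mean of all points within
    `bandwidth` of it until it stops moving; end points closer than
    `bandwidth` to an already found mode are merged into that mode.
    """
    modes = []
    for start in points:
        current = tuple(start)
        for _ in range(max_iter):
            neighbours = [p for p in points if math.dist(p, current) <= bandwidth]
            shifted = tuple(sum(c) / len(neighbours) for c in zip(*neighbours))
            moved = math.dist(shifted, current)
            current = shifted
            if moved < tol:
                break
        if not any(math.dist(current, m) <= bandwidth for m in modes):
            modes.append(current)
    return modes

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
modes = mean_shift_modes(points, bandwidth=1.0)
print(len(modes))  # number of modes = suggested number of clusters
```

The number of modes depends on the bandwidth, which is why a validity index such as Silhouette is typically used alongside it to compare candidate settings.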

Assessing Data Integrity of OpenStreetMap Based on Night Lights

Fei Liu,Xiaoqiang Cheng,Huayi Wu

Data Analysis and Knowledge Discovery. 2019, 3(9): 36-44. https://doi.org/10.11925/infotech.2096-3467.2018.1473

[Objective] This paper aims to address data integrity issues facing the OpenStreetMap (OSM) datasets. [Methods] First, we retrieved remote sensing images of night-light brightness as an indicator for cities with strong comprehensive competitiveness. Then, we studied the correlation between night-light brightness and OSM completeness, which identified the distribution patterns of high-quality data. [Results] We established a regression model for OSM building density and night-light brightness. The correlation coefficient was 0.8522. We also found that 84.2% of the Chinese cities in our study had building densities close to the predicted values (the discrepancy was less than 0.5%). The building densities in the other cities were 2% to 7% lower than the expected values. [Limitations] More research is needed to evaluate the performance of this model with other cities. [Conclusions] Remote sensing images help us assess the quality of OSM data and also identify “ghost” or “empty” cities.
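
The kind of analysis described here (a regression of building density on brightness, a correlation coefficient, and the gap between observed and predicted density) can be sketched as follows. This is a minimal illustration assuming a simple linear fit, which the abstract does not specify; the numbers are made up and the function name `fit_line` is hypothetical.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line y = slope * x + intercept, plus Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r

# Hypothetical per-city values: night-light brightness and OSM building density (%).
brightness = [5.0, 12.0, 20.0, 31.0, 45.0]
density = [0.8, 2.1, 3.0, 5.2, 6.9]

slope, intercept, r = fit_line(brightness, density)

# Residual = observed density minus predicted density; a strongly negative
# residual would mark a city whose OSM buildings look under-mapped.
residuals = [d - (slope * b + intercept) for b, d in zip(brightness, density)]
print(round(r, 4), [round(e, 2) for e in residuals])
```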

A Text Vector Representation Model Merging Multi-Granularity Information

Weimin Nie,Yongzhou Chen,Jing Ma

Data Analysis and Knowledge Discovery. 2019, 3(9): 45-52. https://doi.org/10.11925/infotech.2096-3467.2018.1161

[Objective] This paper proposes a model to extract semantic features from texts more comprehensively and to improve the representation of semantics by text vectors. [Methods] We obtained the word-granularity, topic-granularity and character-granularity feature vectors with the help of convolutional neural networks. Then, the three feature vectors were combined by the “merging gate” mechanism to generate the final text vectors. Finally, we examined the model with a text classification experiment. [Results] The accuracy (92.56%), precision (92.33%), recall (92.07%) and F-score (92.20%) were 2.40%, 2.05%, 1.77% and 1.91% higher, respectively, than the results of Text-CNN. [Limitations] Long-distance dependency features need to be included and the corpus size needs to be expanded. [Conclusions] The proposed model could better represent the text semantics.

Measuring Patent Similarity with Word Embedding and Statistical Features

Yan Yu,Lei Chen,Jinde Jiang,Naixuan Zhao

Data Analysis and Knowledge Discovery. 2019, 3(9): 53-59. https://doi.org/10.11925/infotech.2096-3467.2018.1317

[Objective] This paper proposes a new method for measuring patent similarity, which explores the semantic relationships between words and improves the performance of this task. [Methods] First, we introduced a neural network-based word vector model to obtain semantic information from patent words. Then, we computed the statistical features of words to gauge their significance. Finally, we combined the word embeddings and statistical features to represent patent texts and measure their similarity. [Results] The accuracy of the proposed method was 13.92% higher than that of the traditional methods. [Limitations] More research is needed to study the selection strategy of auxiliary patent texts. [Conclusions] Combining word embedding and statistical features can effectively improve patent similarity measurement.
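
One common way to combine word embeddings with a statistical weight is a TF-IDF-weighted average of word vectors compared by cosine similarity. The sketch below shows that idea only; the abstract does not state which statistical features or combination rule the authors used, and the tiny `EMBEDDINGS` table, the toy documents and all function names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical 3-dimensional word vectors; a trained embedding model
# would supply real ones.
EMBEDDINGS = {
    "battery": (0.9, 0.1, 0.0),
    "cell": (0.8, 0.2, 0.1),
    "electrode": (0.7, 0.3, 0.0),
    "engine": (0.1, 0.9, 0.2),
    "piston": (0.0, 0.8, 0.3),
    "device": (0.4, 0.4, 0.4),
}

def idf(term, docs):
    """Smoothed inverse document frequency of `term` over tokenised `docs`."""
    df = sum(1 for d in docs if term in d)
    return math.log((1 + len(docs)) / (1 + df)) + 1.0

def doc_vector(doc, docs):
    """TF-IDF-weighted average of the embeddings of the words in `doc`."""
    dim = len(next(iter(EMBEDDINGS.values())))
    vec = [0.0] * dim
    total = 0.0
    for term, count in Counter(doc).items():
        if term not in EMBEDDINGS:
            continue  # skip out-of-vocabulary words
        weight = (count / len(doc)) * idf(term, docs)
        total += weight
        for i, x in enumerate(EMBEDDINGS[term]):
            vec[i] += weight * x
    return [v / total for v in vec] if total else vec

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# Toy "patent" texts, already tokenised.
docs = [
    ["battery", "cell", "electrode", "device"],
    ["battery", "electrode", "device"],
    ["engine", "piston", "device"],
]
vectors = [doc_vector(d, docs) for d in docs]
sim_same = cosine(vectors[0], vectors[1])  # two battery-related texts
sim_diff = cosine(vectors[0], vectors[2])  # battery text vs engine text
print(sim_same > sim_diff)
```

The IDF weight reduces the influence of words such as "device" that occur in every document, so the topical words dominate the averaged vector.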

Classifying Short-texts with Class Feature Extension

Yunfei Shao,Dongsu Liu

Data Analysis and Knowledge Discovery. 2019, 3(9): 60-67. https://doi.org/10.11925/infotech.2096-3467.2018.1423

[Objective] This paper proposes a short text classification method based on category feature extension, aiming to address the issue of sparse content in short texts. [Methods] We used the improved TF-IDF model and the LDA topic model to construct the keyword set and topic distribution set, both of which were based on category features. Then, we expanded the content and vector representations of short texts. Finally, we classified short texts with the help of a convolutional neural network. [Results] The classification precision of the proposed method was improved by 3.0%, and the recall was improved by 4.1%. [Limitations] We only examined the new method with a convolutional neural network. [Conclusions] The proposed method can improve the effectiveness of categorization procedures for short texts.

Automatic Classification of Ancient Classics with Entity Features

Heran Qin,Liu Liu,Bin Li,Dongbo Wang

Data Analysis and Knowledge Discovery. 2019, 3(9): 68-76. https://doi.org/10.11925/infotech.2096-3467.2019.0135

[Objective] This paper modifies the algorithm of traditional statistical feature words with entity features, aiming to classify ten classics from ancient China. [Methods] For the support vector machine model, we added the traditional TF-IDF, information gain, chi-square test and mutual information to calculate the feature words. Then, we used the named entities to evaluate the classification results. [Results] The highest accuracy of the proposed classifier reached 98.7%. The accuracy was improved by 12.4%, 12.4%, 12.3% and 22.8%, respectively, with the traditional information gain, TF-IDF, mutual information and chi-square test feature calculations. [Limitations] We need to re-label the recognized entities before applying entity features to other texts. [Conclusions] Entity features could improve the effectiveness of text categorization models.
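
Of the feature measures named above, the chi-square test is easy to show on a small example. The sketch below scores how strongly a term is associated with a class using the standard 2x2 chi-square statistic. The toy documents, labels and function names are hypothetical and are not taken from the paper's corpus.

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.

    n11: documents in the class that contain the term
    n10: documents outside the class that contain the term
    n01: documents in the class without the term
    n00: documents outside the class without the term
    """
    n = n11 + n10 + n01 + n00
    numerator = n * (n11 * n00 - n10 * n01) ** 2
    denominator = (n11 + n01) * (n11 + n10) * (n10 + n00) * (n01 + n00)
    return numerator / denominator if denominator else 0.0

def term_class_counts(docs, term, label):
    """Count the four cells of the contingency table for `term` and `label`."""
    n11 = n10 = n01 = n00 = 0
    for tokens, doc_label in docs:
        has_term = term in tokens
        in_class = doc_label == label
        if has_term and in_class:
            n11 += 1
        elif has_term:
            n10 += 1
        elif in_class:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

# Hypothetical toy corpus: (set of tokens, class label).
docs = [
    ({"emperor", "battle"}, "history"),
    ({"emperor", "court"}, "history"),
    ({"poem", "river"}, "poetry"),
    ({"poem", "court"}, "poetry"),
]

score_emperor = chi_square(*term_class_counts(docs, "emperor", "history"))
score_court = chi_square(*term_class_counts(docs, "court", "history"))
print(score_emperor, score_court)
```

In this toy corpus "emperor" appears only in the "history" class and scores high, while "court" is spread evenly across both classes and scores zero, so only the former would be kept as a feature word.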

Cross-Language Information Retrieval Based on Weighted Association Patterns and Rule Consequent Expansion

Mingxuan Huang,Shoudong Lu,Hui Xu

Data Analysis and Knowledge Discovery. 2019, 3(9): 77-87. https://doi.org/10.11925/infotech.2096-3467.2019.0301

[Objective] This paper proposes a new Cross-Language Information Retrieval (CLIR) model, aiming to address the issues facing natural language processing, such as query topic drift and word mismatch. [Methods] First, we explored the frequent item-sets with the weighted association patterns and the pruning strategies based on maximum item weight. Then, we used the confidence and relevance degrees to evaluate the weighted association rules, which helped us extract high-quality expansion terms. Finally, we combined the new terms with the original ones to create new queries for the final lists. [Results] Compared with the monolingual retrieval benchmark, the average increases (AIs) of R-prec and P@10 of the proposed model were 42.49% and 25.53%. Our results were 91.87% and 64.61% higher than those of the cross-language retrieval benchmark. Compared to the existing CLIR methods, the maximum AIs of R-prec and P@10 were 93.20% and 34.60%. [Limitations] The proposed model needs to be examined with more cross-language search engines. [Conclusions] Our model improves the performance of CLIR.
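
The idea of expanding a query with rule consequents can be sketched with ordinary (unweighted) support and confidence. This simplification is an assumption made for brevity: the paper uses weighted association patterns, pruning by maximum item weight and a relevance degree, none of which are reproduced here. The toy term sets, the `min_conf` threshold and the function names are hypothetical.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent."""
    base = support(antecedent, transactions)
    if not base:
        return 0.0
    return support(antecedent | consequent, transactions) / base

def expansion_terms(query, transactions, min_conf):
    """Terms t for which the rule query -> {t} reaches `min_conf`."""
    query = frozenset(query)
    candidates = set().union(*transactions) - query
    scored = [(confidence(query, frozenset([t]), transactions), t)
              for t in candidates]
    return sorted(((c, t) for c, t in scored if c >= min_conf), reverse=True)

# Hypothetical term sets of top-ranked documents from a first retrieval pass.
transactions = [
    frozenset({"solar", "panel", "energy"}),
    frozenset({"solar", "panel", "roof"}),
    frozenset({"solar", "energy", "grid"}),
    frozenset({"wind", "energy", "grid"}),
]

terms = expansion_terms({"solar"}, transactions, min_conf=0.6)
expanded_query = ["solar"] + [t for _, t in terms]
print(expanded_query)
```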

Automatic Triage of Online Doctor Services Based on Machine Learning

Ruojia Wang,Lu Zhang,Jimin Wang

Data Analysis and Knowledge Discovery. 2019, 3(9): 88-97. https://doi.org/10.11925/infotech.2096-3467.2019.0147

[Objective] This paper compares the performance of various machine learning algorithms for automatic triage, aiming to improve their effectiveness through analyzing misclassified data. [Methods] First, we retrieved 33,073 questions from real patients on a website named “chunyu doctor”. Then, we compared the accuracy of two text vectorization methods and six classification models. Finally, we analyzed the misclassified data and extracted new features to improve the performance of the models. [Results] The best automatic triage model used TF-IDF as the text vectorization method and a support vector machine as the classification algorithm. After adding age and gender characteristics, the classification accuracy reached 76.3%. The classifier had the lowest accuracy for the surgery department due to the setting of this platform’s categories. [Limitations] We assumed that the department selection of the patient was correct. [Conclusions] Machine learning techniques could improve the performance of automatic triage services on online health consulting platforms.

Impacts of Financial Media Information on Stock Market: An Empirical Study of Sentiment Analysis

Yonghua Cen,Zhihao Tan,Chengyao Wu

Data Analysis and Knowledge Discovery. 2019, 3(9): 98-114. https://doi.org/10.11925/infotech.2096-3467.2018.1223

[Objective] This paper aims to study the impacts of media coverage on the stock market. [Methods] We used LSTM deep neural networks to evaluate the sentiments of online news, forum posts and blogs from leading financial websites. Then, we established an autoregressive distributed lag model and a panel regression model to test the relationship between media information sentiments and stock market performance from the perspectives of the macro market and individual stocks. [Results] (I) In the short term, positive and negative sentiments significantly changed stock prices and led to overreaction. In the longer term, the stock market reversed. (II) There was a negative relationship between sentiment volatility/discrepancy and stock prices, and a U-shaped nonlinear correlation between sentiment discrepancy and trading. (III) Investors reacted more immediately and strongly to positive sentiments, and the rational correction of this overreaction was slower than that for negative information. (IV) High discrepancy of sentiments led to more over-trading than high consensus. [Limitations] The accuracy of sentiment analysis needs to be improved with more complex models. [Conclusions] Our research provides theoretical, methodological and practical implications for financial supervision and regulation.

Extracting Emotion Tags from Comments of Microblog Commodities

Bocheng Li,Yunqiu Zhang,Kaixi Yang

Data Analysis and Knowledge Discovery. 2019, 3(9): 115-123. https://doi.org/10.11925/infotech.2096-3467.2018.1429

[Objective] This paper proposes a new method to collect emotion tags from microblog comments, aiming to improve the performance of feature-level data extraction. [Methods] First, we divided the evaluation units and extracted the explicit tags based on dependency parsing and extraction rules. Then, we revealed the implicit expression relationships in comments with the NodeRank algorithm. Finally, we retrieved the implicit tags to improve the accuracy of emotion tag retrieval. [Results] We examined the proposed method with real online comments. The overall precision of the method was 83.6%, the recall was 87.1%, and the F value was 85.3%, which were better than those of the traditional methods. [Limitations] We did not fully utilize users’ general emotional expressions. [Conclusions] The proposed method based on dependency parsing and the NodeRank algorithm can extract emotion tags effectively.
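
The three reported figures can be cross-checked against each other. The sketch below assumes the "F value" is the balanced F1 score, that is, the harmonic mean of precision and recall; the abstract does not say which F-measure was used, and the function name `f1_score` is only illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the abstract, in percent.
f_value = f1_score(83.6, 87.1)
print(round(f_value, 1))  # 85.3
```

Under that assumption the computed value rounds to the 85.3% given in the abstract.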

Analyzing Textual Features of Excess-funded Agricultural Products——Case Study of Crowdfunding Website

Manyu Huang,Qi Yun,Hufeng Peng,Xuemeng Dou

Data Analysis and Knowledge Discovery. 2019, 3(9): 124-134. https://doi.org/10.11925/infotech.2096-3467.2018.1332

[Objective] This paper aims to identify the textual features of excess-funded agricultural products on crowdfunding services and the time evolution patterns of typical topics. [Methods] We used the TOT analysis model to analyze the texts of 1137 excess-funded agricultural products listed on the crowdfunding website between September 2013 and April 2018. Then, we obtained the probability distribution of the terms for each topic. Finally, we examined the time evolution trends of each topic. [Results] The excess-funded agricultural products were in the categories of tea, wine and honey. The text characteristics of these projects focused on customer value, the quality of the agricultural products and social benefits. The distribution of topic intensity for customer value and quality of agricultural products from 2014 to 2017 showed a U-shaped pattern. [Limitations] High-quality crowdfunding data on agricultural products is relatively limited. [Conclusions] Projects seeking more crowdfunding support should emphasize the high quality of the products and the unique experience of participation.

25 September 2019, Volume 3 Issue 9
