Data Analysis and Knowledge Discovery

Review of Methods and Applications of Text Sentiment Analysis

Zhong Jiawa,Liu Wei,Wang Sili,Yang Heng

Data Analysis and Knowledge Discovery. 2021, 5(6): 1-13. https://doi.org/10.11925/infotech.2096-3467.2021.0040

[Objective] This paper reviews the literature on text sentiment analysis and summarizes its technical development trends and applications. [Coverage] We searched the Web of Science Core Collection and the CNKI database for literature on the concepts, methods and techniques of sentiment analysis. A total of 69 papers from 2011 to 2020 were retrieved and analyzed. [Methods] We summarized the main models and applications of text sentiment analysis along the dimensions of time and theme. We also discussed the areas that need to be improved. [Results] There were three main approaches to text sentiment analysis: those based on sentiment lexicons and rules, on machine learning, and on deep learning. Each approach has advantages and disadvantages. Multi-strategy hybrid methods have become more popular in recent years. [Limitations] We reviewed previous literature on text sentiment analysis from the perspective of macro-technical methods. More research is needed to compare and elaborate the technical details of sentiment analysis algorithms. [Conclusions] The development of artificial intelligence technology (big data and deep learning) will further improve text sentiment analysis and benefit business decision-making applications.

Review of Key Technologies of High Performance Blockchain

Dong Zhenheng,Lv Xueqiang,Ren Weiping,Jiang Yang,Li Guolin

Data Analysis and Knowledge Discovery. 2021, 5(6): 14-24. https://doi.org/10.11925/infotech.2096-3467.2020.1210

[Objective] This paper examines the key technologies and major issues of high-performance blockchain, and then explores its research trends and future development. [Coverage] We searched for “Consensus Algorithm”, “Smart Contract”, and “Blockchain” in Chinese and English in Web of Science, Google Scholar, CNKI and other Internet resources. A total of 39 documents were selected for this review. [Methods] We summarized the evolution of consensus algorithms, as well as the advantages and disadvantages of smart contract applications and platforms. [Results] This study discussed the key issues and methods of consensus algorithms and smart contracts for high-performance blockchain. [Limitations] We only reviewed representative consensus algorithms and implementation platforms. [Conclusions] This paper summarizes the technologies of high-performance blockchain and provides ideas for future research.

Clustering User Groups of Public Opinion Events from Multi-dimensional Social Network

Wang Xiwei,Jia Ruonan,Wei Yanan,Zhang Liu

Data Analysis and Knowledge Discovery. 2021, 5(6): 25-35. https://doi.org/10.11925/infotech.2096-3467.2020.0077

[Objective] User groups are the main units that disseminate public opinion. This study identifies the characteristics of user groups through clustering techniques, which could help social network companies provide better services. [Methods] With the help of Group Theory, we clustered users based on their influence, sentiments, and behaviors. First, we collected user data from Sina Weibo. Then, we used the Canopy and K-Means algorithms to cluster users. Finally, we visualized our findings with Neo4j and Weka. [Results] User groups within the same public opinion event differed in emotion, influence, and behaviors, while user groups from different public opinion events shared common characteristics. [Limitations] Both public opinion events in this study happened at Chinese universities, and we only collected data from Sina Weibo. [Conclusions] Based on the clustering results, we could propose effective administration strategies for each user group in the same or different public opinion events.
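
The Canopy and K-Means combination mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each user is already a numeric feature vector (for example hypothetical influence, sentiment, and activity scores), uses a simplified Canopy pass that relies only on the tight threshold `t2` to pick seeds, and then runs plain K-Means from those seeds. All function and parameter names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def canopy_centers(points: Sequence[Point], t2: float) -> List[Point]:
    """Simplified Canopy pass: pick a center, then drop every point
    within the tight threshold t2 of it, and repeat until none remain."""
    remaining = list(points)
    centers: List[Point] = []
    while remaining:
        center = remaining.pop(0)
        centers.append(center)
        remaining = [p for p in remaining if math.dist(p, center) > t2]
    return centers


def k_means(points: Sequence[Point], seeds: Sequence[Point],
            iterations: int = 20) -> Tuple[List[Point], List[int]]:
    """Plain K-Means started from the given seeds (k = number of seeds)."""
    centroids = list(seeds)
    labels: List[int] = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members.
        updated: List[Point] = []
        for i, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                updated.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels
```

Seeding K-Means from Canopy centers removes the need to fix the number of clusters in advance; the value of `t2` is a tuning choice that the abstract does not report.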

Interpretable Recommendation of Reinforcement Learning Based on Talent Knowledge Graph Reasoning

Ruan Xiaoyun,Liao Jianbin,Li Xiang,Yang Yang,Li Daifeng

Data Analysis and Knowledge Discovery. 2021, 5(6): 36-50. https://doi.org/10.11925/infotech.2096-3467.2020.1218

[Objective] This paper proposes an interpretable reinforcement learning method for job recommendation based on talent knowledge graph reasoning, which addresses the difficulties of large-scale application, cold start, and lack of novelty. [Methods] First, we constructed a knowledge graph of the job applicants' social experience from their resume data. Then, we trained a policy agent with the knowledge graph and the theory of reinforcement learning. This algorithm, which divided the reasoning process into choosing directions and choosing nodes, could identify potential high-quality recommendation targets from the knowledge graph. [Results] The MRR@20 (81.7%), Hit@1 (74.8%), Hit@5 (92.2%) and Hit@10 (97.0%) of the proposed model were higher than those of the LR, BPR, JRL-int, JRL-rep and PGPR models. [Limitations] The size of the experimental datasets and the range of task types need to be further expanded. [Conclusions] Our model could effectively recommend jobs for applicants based on their previous experience or other successful recommendations. It also provides reasoning paths with the help of the knowledge graph.
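
For readers unfamiliar with the reported metrics, the following sketch shows how Hit@k and MRR@k are commonly computed. It assumes one correct target job per applicant and a ranked recommendation list per applicant; the abstract does not state the exact evaluation protocol, so the function names and data layout are illustrative.

```python
from typing import Dict, Hashable, Sequence


def hit_at_k(ranked: Sequence[Hashable], target: Hashable, k: int) -> float:
    """1.0 if the target appears among the top-k ranked items, else 0.0."""
    return 1.0 if target in ranked[:k] else 0.0


def reciprocal_rank_at_k(ranked: Sequence[Hashable], target: Hashable, k: int) -> float:
    """1 / rank of the target if it is within the top k, else 0.0."""
    for position, item in enumerate(ranked[:k], start=1):
        if item == target:
            return 1.0 / position
    return 0.0


def evaluate(rankings: Sequence[Sequence[Hashable]],
             targets: Sequence[Hashable], k: int) -> Dict[str, float]:
    """Average Hit@k and MRR@k over all (ranking, target) pairs."""
    n = len(targets)
    pairs = list(zip(rankings, targets))
    return {
        "hit": sum(hit_at_k(r, t, k) for r, t in pairs) / n,
        "mrr": sum(reciprocal_rank_at_k(r, t, k) for r, t in pairs) / n,
    }
```

Hit@k only records whether the target was retrieved, while MRR@k also rewards placing it near the top of the list.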

Predicting Prices and Analyzing Features of Online Short-Term Rentals Based on XGBoost

Cao Rui,Liao Bin,Li Min,Sun Ruina

Data Analysis and Knowledge Discovery. 2021, 5(6): 51-65. https://doi.org/10.11925/infotech.2096-3467.2020.1186

[Objective] This paper proposes a model to predict prices and analyze properties of online short-term rentals based on XGBoost, aiming to address the lack of a reasonable pricing suggestion mechanism for housing with different characteristics. [Methods] We collected data from the Airbnb platform and used Lasso to extract features from the raw data and reduce their dimensionality. Then, we fed the extracted features to XGBoost and iteratively trained the prediction model. Finally, we used SHAP values to interpret the model features. [Results] After hyperparameter tuning, the RMSE, MAE and R-squared values of the proposed model were 0.091, 0.065 and 0.798 respectively, which were better than those of the four existing models. [Limitations] Our new model could not incorporate features from real-time online business data, which affected the prediction accuracy. [Conclusions] The proposed model has good interpretability and could identify the key factors affecting housing prices, which helps landlords improve their services.
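
The three reported regression metrics follow standard definitions, sketched below in plain Python. This is only an illustration of how RMSE, MAE and R-squared are computed from true and predicted prices; it does not reproduce the authors' pipeline, and the function names are illustrative.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

Lower RMSE and MAE and a higher R-squared indicate a better fit; RMSE penalizes large errors more heavily than MAE.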

Patterns and Evolution of Public Opinion on Weibo During Natural Disasters: Case Study of Typhoons and Rainstorms

Ma Yingxue,Zhao Jichang

Data Analysis and Knowledge Discovery. 2021, 5(6): 66-79. https://doi.org/10.11925/infotech.2096-3467.2020.1258

[Objective] This study reveals the patterns and evolution of public opinion on Weibo during natural disasters from the perspectives of trending topics and information dissemination. [Methods] We proposed a machine learning approach to extract valid natural disaster data from Weibo. Then, we employed a deep learning model to cluster these textual posts. Finally, we investigated the information dissemination patterns with complex network analysis. [Results] The accuracy of our extractor for valid disaster information reached 0.82. The clusters of textual posts indicated changes in trending topics. The structure of information dissemination during disasters was sparse. The sizes of online communities expanded constantly while their distribution remained unchanged. Users in different regions had different preferences for information sources. [Limitations] We did not conduct experiments on data from different social platforms. [Conclusions] The proposed method could effectively identify public opinion events during natural disasters.

Comparing Technology Diffusion Structure of China and the U.S. to Countries Along the Belt and Road

Gao Yilin,Min Chao

Data Analysis and Knowledge Discovery. 2021, 5(6): 80-92. https://doi.org/10.11925/infotech.2096-3467.2020.1168

[Objective] This study explores the characteristics and structure of international technology diffusion in different fields by China and the United States, from the perspectives of patents and the Belt and Road Initiative. [Methods] First, we collected the needed data from PCT international patent cooperation and transnational patent applications. Then, we used social network QAP analysis to measure the synergy of the two technology diffusion channels, both technically and regionally. [Results] China’s patent strategy in countries along the Belt and Road has achieved some positive results, and the technology diffusion channels formed a high degree of synergy. However, there are some gaps between China and the United States in technology diffusion. [Limitations] The study only compared the technology diffusion of China and the United States. The characteristics of other technology diffusion channels, such as intellectual property trade, were not analyzed. [Conclusions] This study could help China implement the Belt and Road Initiative more effectively.

A Capsule Network Model for Text Classification with Multi-level Feature Extraction

Yu Bengong,Zhu Xiaojie,Zhang Ziwei

Data Analysis and Knowledge Discovery. 2021, 5(6): 93-102. https://doi.org/10.11925/infotech.2096-3467.2020.1273

[Objective] This paper proposes a structured method to extract text information hierarchically from bottom to top, aiming to improve the performance of existing shallow text classification models. [Methods] We built an MFE-CapsNet model for text classification based on the acquired global and high-level features. The model extracted context information with a bidirectional gated recurrent unit (BiGRU). It also introduced attention to encode the hidden layer vectors, which improved the feature extraction of the sequence model. We used the capsule network and dynamic routing to obtain high-level aggregated local information and build the MFE-CapsNet model. We also conducted comparative experiments on the performance of our new model. [Results] The F1 values of the MFE-CapsNet model were 96.21%, 94.17%, and 94.19% on Chinese datasets from three different fields. Our results were at least 1.28, 1.49, and 0.46 percentage points higher than those of popular text classification methods. [Limitations] We only conducted experiments on three corpora. [Conclusions] The proposed MFE-CapsNet model could effectively extract semantic features and improve the performance of text classification.

Sentiment Classification of Image-Text Information with Multi-Layer Semantic Fusion

Xie Hao,Mao Jin,Li Gang

Data Analysis and Knowledge Discovery. 2021, 5(6): 103-114. https://doi.org/10.11925/infotech.2096-3467.2020.1159

[Objective] This paper conducts sentiment analysis of images and text on social media data, aiming to better understand the public's emotions and opinion tendencies. [Methods] To fully explore the correlation and complementarity between images and text, this paper proposes an image-text sentiment classification model in social media based on multi-layer semantic fusion. There are three sub-models in our study: text-image semantic association model, image-text semantic association model, and multimodal semantic deep association fusion model. We used these sub-models to explore the bidirectional and multi-level semantic associations between images and text. Then, we obtained the final classification results using a weighting strategy on the sentiment classification scores generated by the three sub-models. [Results] We examined our model with real image-text data sets and found it achieved the best performance in all evaluation metrics. The accuracy and F1 values of our model were 1.0% and 1.2% better than those of the optimal baseline model. [Limitations] We only evaluated the model’s performance with one single dataset. More research is needed to examine the robustness and scalability of the model. [Conclusions] In the sentiment classification task, the proposed model could more effectively explore the correlation and complementarity between image and text information on social media.
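
The final weighting step described above can be pictured as a weighted average of per-class scores from the three sub-models. The sketch below is an assumption-laden illustration: the abstract does not specify how the weights are chosen or normalized, so the `weights` argument and the function name are hypothetical.

```python
from typing import List, Sequence, Tuple


def fuse_scores(sub_model_scores: Sequence[Sequence[float]],
                weights: Sequence[float]) -> Tuple[int, List[float]]:
    """Combine per-class scores from several sub-models by a weighted average.

    Returns the index of the winning class and the fused score vector.
    The weights are hypothetical tuning parameters.
    """
    if len(sub_model_scores) != len(weights):
        raise ValueError("one weight is required per sub-model")
    total = sum(weights)
    n_classes = len(sub_model_scores[0])
    fused = [
        sum(w * scores[c] for w, scores in zip(weights, sub_model_scores)) / total
        for c in range(n_classes)
    ]
    label = max(range(n_classes), key=fused.__getitem__)
    return label, fused
```

With scores ordered as (text-image, image-text, multimodal fusion), a larger weight on the third entry would let the fusion sub-model dominate the decision; in practice such weights would be tuned on a validation set.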

Expanding Queries Based on Word Embedding and Expansion Terms

Huang Mingxuan,Jiang Caoqing,Lu Shoudong

Data Analysis and Knowledge Discovery. 2021, 5(6): 115-125. https://doi.org/10.11925/infotech.2096-3467.2020.1312

[Objective] This paper proposes a query expansion model based on the intersection of word embedding and expansion terms, aiming to reduce word mismatch in information retrieval. [Methods] First, we trained word embeddings on the retrieved documents to obtain the Word Embedding Candidate Expansion Term set. Then, we mined association rules and generated the Mining Candidate Expansion Term set. Finally, we created the final expansion term set by merging the previous two sets and expanded the queries. [Results] The MAP and P@5 of the proposed model were higher than those of the benchmark models. Compared with similar query expansion methods developed in recent years, the average increases in MAP and P@5 were 0.96%-31.24% and 1.07%-13.55%, respectively. [Limitations] The proposed model needs to be examined with real-world information retrieval systems. [Conclusions] The proposed model can improve the quality of expansion terms and the performance of information retrieval systems, and it also reduces query topic drift and word mismatch.
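
The retrieval metrics reported above have standard definitions, sketched here for reference. The sketch assumes binary relevance judgments and one ranked result list per query; it illustrates P@k and MAP in general and is not tied to the authors' test collections. Function names are illustrative.

```python
from typing import Hashable, Sequence, Set


def precision_at_k(ranked: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Mean of the precision values at each rank where a relevant document
    appears, divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs: Sequence[Sequence[Hashable]],
                           judgments: Sequence[Set[Hashable]]) -> float:
    """MAP: average precision averaged over all queries."""
    return sum(average_precision(r, j) for r, j in zip(runs, judgments)) / len(runs)
```

P@5 looks only at the first five results, whereas MAP reflects ranking quality across the whole list, so an expansion method can improve one more than the other.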

Identifying Clickbait with BERT-BiGA Model

Yin Pengbo,Pan Weimin,Zhang Haijun,Chen Degang

Data Analysis and Knowledge Discovery. 2021, 5(6): 126-134. https://doi.org/10.11925/infotech.2096-3467.2021.0098

[Objective] This paper proposes an algorithm with BiGRU and an attention mechanism based on the Chinese BERT model, aiming to identify clickbait from online news titles. [Methods] First, we pre-trained our model as a text encoder using the Chinese BERT. Then, we extracted text features through the fusion attention mechanism, and used BiGRU to model news titles and contents. Finally, we identified clickbait based on their semantic correlation. [Results] This method addressed the issues of complex feature engineering and secondary error amplification in the text similarity calculation. The recognition accuracy rate was 81%, and a browser plug-in was developed to detect clickbait. [Limitations] The proposed model only examined news titles and contents, and did not include pageviews, likes, and comments in the calculation. [Conclusions] Our new method, whose recall is 4% higher than that of the existing methods, could effectively identify clickbait in online news.

A Multiple Pattern Matching Algorithm for Specifications of Incremental Metadata for Sci-Tech Literature

Dong Mei,Chang Zhijun,Zhang Runjie

Data Analysis and Knowledge Discovery. 2021, 5(6): 135-144. https://doi.org/10.11925/infotech.2096-3467.2020.1006

[Objective] This paper designs a multiple pattern matching algorithm to standardize the institutional information of sci-tech literature metadata. [Methods] First, we used a hash function to locate the pattern strings and reduce the system memory usage. Then, we extracted the first words of the pattern strings, which were combined with word-skipping matching. The new algorithm reduced the number of matches and increased the jump range, which improved the efficiency of multiple pattern matching. [Results] We examined our algorithm with the CSCD’s institutional library as the pattern string set. Compared with the Aho-Corasick (AC) algorithm, our method quickly constructed the dictionary corresponding to the pattern string sets. When the data volume reached about 10,000, our algorithm spent less time on the same tasks. For the English corpus, there was a 9.39% improvement in time performance. Compared with the Wu-Manber (WM) algorithm, our method was not restricted by the shortest pattern strings. [Limitations] The algorithm or data needs to be adjusted for different pattern strings and text strings. This algorithm and the extended headless mode are not suitable for small pattern string sets with large string sets. [Conclusions] The algorithm can be applied to Chinese, English, and Chinese-English mixed texts. The time performance of our algorithm is superior to that of the AC and WM algorithms when processing a large pattern string set (10⁶) and a small string set (about 10,000).
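
The idea of hashing on the first word of each pattern and then skipping whole words can be sketched roughly as follows. This is a simplified illustration rather than the authors' algorithm: it assumes whitespace-tokenized text (so it would need a word segmenter for Chinese), uses a Python dictionary as the hash table, and prefers the longest pattern at each position. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Tokens = Tuple[str, ...]


def build_index(patterns: Sequence[str]) -> Dict[str, List[Tokens]]:
    """Hash each pattern under its first word; longest patterns come first."""
    index: Dict[str, List[Tokens]] = defaultdict(list)
    for pattern in patterns:
        tokens = tuple(pattern.split())
        if tokens:
            index[tokens[0]].append(tokens)
    for candidates in index.values():
        candidates.sort(key=len, reverse=True)
    return dict(index)


def find_matches(text: str, index: Dict[str, List[Tokens]]) -> List[Tuple[int, str]]:
    """Scan the text word by word and return (word offset, matched pattern)."""
    tokens = text.split()
    matches: List[Tuple[int, str]] = []
    i = 0
    while i < len(tokens):
        step = 1  # no pattern starts with this word: move on by one word
        for candidate in index.get(tokens[i], ()):
            if tuple(tokens[i:i + len(candidate)]) == candidate:
                matches.append((i, " ".join(candidate)))
                step = len(candidate)  # jump past the whole matched pattern
                break
        i += step
    return matches
```

Because most words are not the first word of any pattern, the dictionary lookup rejects them in constant time, and only the few candidate patterns sharing that first word are compared in full.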

25 June 2021, Volume 5 Issue 6
