[Objective] This paper analyzes popular text similarity measures and discusses their latest developments. [Coverage] We retrieved 69 key articles from the CNKI and Web of Science databases by searching “TI: ‘text similarity’ or ‘semantic similarity’ or ‘lexical similarity’ ” in Chinese and English respectively. [Methods] We systematically reviewed the text similarity measures, focusing on their basic concepts, characteristics and future directions. [Results] We identified four types of text similarity measures: string-based, corpus-based, knowledge-based and others. Neural-network-based measures, knowledge-based measures and interdisciplinary measures could be future research directions. [Limitations] We did not discuss the applications of those measures. [Conclusions] This paper is a comprehensive review of text similarity measure research.
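As an illustration of the string-based family named in the abstract above, the following minimal sketch computes two widely used measures: a normalized edit (Levenshtein) similarity on characters and a Jaccard similarity on word sets. The abstract does not say which specific measures the review covers, so the choice of these two, the function names and the whitespace tokenization are assumptions made for illustration only.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard_similarity(a: str, b: str) -> float:
    """Overlap of the two word sets (assumed tokenization: lowercase + split)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


print(levenshtein("kitten", "sitting"))                  # 3
print(jaccard_similarity("the cat sat", "the cat ran"))  # 0.5
```

Both functions look only at surface forms, which is why corpus-based and knowledge-based measures are treated as separate categories.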
[Objective] This paper identifies the common features of existing data science curricula around the world. It also addresses the main challenges facing these courses as well as possible solutions. [Methods] We conducted an empirical study, using text analysis techniques, to examine data science curricula from China and abroad. [Results] We found common features of the retrieved curricula and the differences between them and other related courses. [Limitations] Our study focused on curriculum issues; more research is needed to discuss data science as a discipline. [Conclusions] This paper addresses the top ten challenges facing data science curricula and proposes some solutions.
[Objective] This paper summarizes the content characteristics and network evolution of social recommendation research based on bibliometrics and social network analysis. [Methods] First, we collected data on social recommendation research from the Web of Science database. Then we analyzed the data with manual interpretation, keyword co-occurrence analysis, bibliometrics, social network analysis and data visualization. [Results] A total of 3701 articles on social recommendation were retrieved, and the number published has been increasing recently. Based on a threshold on the number of papers published each year, we divided the development of social recommendation research into three distinct stages. [Limitations] We only used keywords to explore the characteristics of the relevant document contents, which could be improved with in-depth text mining. There is no uniform criterion for classifying the evolution stages of the related research. Our study only shows the changes in contents and development trends. [Conclusions] The international impact of Chinese scholars has been rising in social recommendation studies, which focus heavily on the topics of social media and collaborative filtering.
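The keyword co-occurrence step mentioned in the abstract above can be sketched as a count of how often two keywords appear on the same article. This is a minimal illustration rather than the authors' pipeline: the function name, the lowercasing step and the sample keyword lists are assumptions.

```python
from collections import Counter
from itertools import combinations


def keyword_cooccurrence(keyword_lists):
    """Count unordered keyword pairs that appear together on one article."""
    counts = Counter()
    for keywords in keyword_lists:
        # Deduplicate and sort so each pair has a single canonical order.
        unique = sorted({k.lower() for k in keywords})
        counts.update(combinations(unique, 2))
    return counts


# Hypothetical author keywords for two articles.
articles = [
    ["social media", "collaborative filtering"],
    ["Collaborative Filtering", "social media", "trust"],
]
pairs = keyword_cooccurrence(articles)
print(pairs[("collaborative filtering", "social media")])  # 2
```

The resulting pair counts can serve as edge weights of a keyword network, which is the usual input for the social network analysis and visualization steps.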
[Objective] This paper proposes a new model to recommend potentially similar users with the help of social tags and relation networks. [Methods] First, we explored the characteristics of users’ short- or long-term interests based on the social tagging system. Then, we built a user-clustering model using the multidimensional scaling method with the tag and relationship data. Finally, we recommended similar users based on the clustering results. The proposed model was examined with Weibo data. [Results] We found that the new model could effectively combine the characteristics of users’ interests and then identify potentially similar users. [Limitations] The sample data does not cover all aspects of user interests. Thus, we only examined the effectiveness of the proposed model with limited data. [Conclusions] The user recommendation model based on static tags and a dynamic relational network could improve personalized recommendation services.
[Objective] This study aims to identify phishing websites more effectively with the help of online evaluation data and abnormal URL features. [Methods] First, we used eight machine learning techniques to compare the performance of online evaluation data and abnormal URL features in identifying phishing websites. Then, we proposed a new method to improve the accuracy of the identification procedure. [Results] We found that the evaluation data performed better than the abnormal URL features. Combining the two data sets could improve the identification performance. [Limitations] We did not consider the difference between the numbers of phishing sites and legitimate ones. [Conclusions] Online evaluation data and abnormal URL features could help identify phishing websites effectively, which indicates a direction for future studies.
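The abstract above does not list which abnormal URL features were used. The sketch below therefore extracts a few features that are commonly cited as phishing indicators; the feature names and the selection itself are assumptions for illustration, not the study's feature set.

```python
import ipaddress
from urllib.parse import urlsplit


def url_features(url: str) -> dict:
    """Extract a few hypothetical 'abnormal URL' indicators from one URL."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)  # raises ValueError for non-IP hosts
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "length": len(url),                 # very long URLs can hide the real host
        "has_at_sign": "@" in url,          # text before '@' is not the host
        "host_is_ip": host_is_ip,           # raw IP address instead of a domain
        "dots_in_host": host.count("."),    # many subdomain levels
        "hyphen_in_host": "-" in host,      # look-alike names such as secure-bank
    }


print(url_features("http://192.0.2.10/login")["host_is_ip"])  # True
```

A feature dictionary like this could be turned into a numeric vector and combined with online evaluation data before training any of the classifiers being compared.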
[Objective] This paper aims to improve the performance of the Cosine text similarity computing method with the help of text semantic chunk features. [Methods] First, we retrieved project data from carbon nanotube studies, which were pre-processed with stemming and POS tagging. Then, we identified the semantic chunks of the text contents with a conditional random field model. Third, we calculated the similarity of texts based on the semantic chunk features. Finally, we compared our results with those generated from the unlabeled data. [Results] The proposed method improved the performance of Cosine similarity calculation by up to 26%. [Limitations] Our study relies on semantic chunks to annotate the computing performance. [Conclusions] The proposed method could effectively identify similar texts and reduce the dimensions of the vector space model, which improves the computing efficiency. The new method is robust and could be transferred to other fields.
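For reference, the Cosine similarity that the abstract above builds on can be sketched over simple count vectors. The conditional random field chunking step is not reproduced here; the sketch assumes each text has already been reduced to a list of features (stemmed words or chunk strings), and the sample feature lists are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(features_a, features_b) -> float:
    """Cosine of the angle between two bag-of-features count vectors."""
    vec_a, vec_b = Counter(features_a), Counter(features_b)
    # Only features present in both texts contribute to the dot product.
    dot = sum(vec_a[f] * vec_b[f] for f in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for empty texts
    return dot / (norm_a * norm_b)


# Hypothetical chunk-level features for two project descriptions.
doc_a = ["carbon nanotube", "thermal conductivity", "composite"]
doc_b = ["carbon nanotube", "composite", "tensile strength"]
print(round(cosine_similarity(doc_a, doc_b), 3))  # 0.667
```

Representing a text by a small set of chunk features instead of all its words shortens the vectors, which is consistent with the dimension reduction the abstract reports.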
[Objective] This paper aims to increase recommendation accuracy with the help of a modified Slope One algorithm. [Methods] We proposed a Slope One collaborative filtering algorithm based on multi-weights, which improved the item similarity measure, the attribute similarity measure and the users’ rating probability function. Then, we combined the item similarity measure with the number of users and the Pearson correlation coefficient, and the item attribute similarity measure with modified Laplacian smoothing and the Jaccard coefficient. We also identified users’ ratings with a new probability function. [Results] The proposed method reduced the MAE by 5.4%, which increased the recommendation accuracy. [Limitations] The new method did not examine the users’ comments, which might negatively affect the recommendation accuracy. [Conclusions] The proposed algorithm could effectively improve the service of recommendation systems.
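To make the starting point of the abstract above concrete, here is a minimal sketch of the standard Weighted Slope One predictor, in which each item pair is weighted only by the number of users who rated both items. It is the baseline scheme, not the multi-weight variant proposed in the paper: the Pearson, Jaccard, smoothing and probability components are not reproduced, and the function names and toy ratings are assumptions.

```python
from collections import defaultdict


def fit_slope_one(ratings):
    """ratings: {user: {item: rating}} -> (average deviations, co-rating counts)."""
    diff_sum = defaultdict(float)
    freq = defaultdict(int)
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    diff_sum[(i, j)] += r_i - r_j
                    freq[(i, j)] += 1
    dev = {pair: diff_sum[pair] / freq[pair] for pair in diff_sum}
    return dev, dict(freq)


def predict(user_ratings, target, dev, freq):
    """Predict the rating of `target` from the user's other ratings."""
    numerator, denominator = 0.0, 0
    for j, r_j in user_ratings.items():
        pair = (target, j)
        if j != target and pair in dev:
            numerator += (dev[pair] + r_j) * freq[pair]
            denominator += freq[pair]
    return numerator / denominator if denominator else None


def mean_absolute_error(actual, predicted):
    """MAE over paired lists of true and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


ratings = {
    "alice": {"A": 5, "B": 3, "C": 2},
    "bob": {"A": 3, "B": 4},
    "carol": {"B": 2, "C": 5},
}
dev, freq = fit_slope_one(ratings)
print(round(predict(ratings["carol"], "A", dev, freq), 2))  # 4.33
```

In this baseline the only weight is the co-rating count `freq`; a multi-weight variant would replace or combine it with further similarity terms.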
[Objective] This paper explores service optimization methods based on the sharing economy concept of “shared ownership without possession”. [Methods] First, we retrieved data from the website of “xiaozhu short-term rentals”. Then, we used the 2-mode network tool “Ucinet” to analyze the changes in users’ locations. Third, we studied the impacts of individual centrality on users’ behaviors through the fixed effect model and the relationship among the one-mode network users. [Results] We found that degree centrality positively influenced users’ behaviors. The betweenness centrality of the host agents was negatively correlated with the consumers’ behaviors, while the betweenness centrality of the key tenant agents positively affected the hosts’ offering behaviors. [Limitations] We focused on active users and did not investigate the characteristics of the entire network. [Conclusions] Business social network systems like xiaozhu.com should encourage their users to become both consumers and service providers, which will promote the development of the sharing economy.
[Objective] This paper builds a model to quantitatively measure the credibility of Web contents, aiming to improve the efficiency of removing disinformation. [Methods] We first constructed a credibility measurement model based on Bayesian inference theory, and then established a minimum error rate evaluation model for credibility measurement with Bayesian decision theory. [Results] As the number of social media users increased, the minimum error rate of the credibility degree went down, and the proposed model performed better than those based on traditional fuzzy theory. [Limitations] The influencing factors of the credibility measurement model only include the number of participants. More research is needed to examine other factors, such as the conditional attributes and the reference objects. [Conclusions] This paper reveals that the minimum error rate decreases as the number of participants increases.
[Objective] This paper aims to help the government manage online public opinion and social media profiles more effectively. [Methods] First, we retrieved data on the topic of “Draw up the Lifeline” from Sina Weibo. Then, we used centrality, cluster and K-core indicators to analyze the network structure and dissemination patterns of public opinion in new media. [Results] We found that online public opinion is disseminated through a scale-free network, and all communities had similar structures. The core network was relatively close but widely distributed, and mobile technology played a major role. [Limitations] The collected data was not comprehensive and inactive users were not removed, which might have biased the results. [Conclusions] This paper provides some new perspectives for research on social welfare movements. It also offers some practical guidance for regulating online public opinion.
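Two of the indicators named in the abstract above, degree centrality and the K-core, can be sketched for an undirected network in plain Python. The abstract does not state how the Weibo network was built, so the edge list, the account names and the treatment of the network as undirected are assumptions for illustration.

```python
def build_adjacency(edges):
    """Undirected adjacency sets from (u, v) pairs; self-loops are ignored."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def degree_centrality(adj):
    """Share of the other nodes that each node is directly connected to."""
    n = len(adj)
    if n <= 1:
        return {v: 0.0 for v in adj}
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def k_core(adj, k):
    """Nodes of the maximal subgraph in which every node has degree >= k."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    changed = True
    while changed:
        changed = False
        low = [v for v, nbrs in remaining.items() if len(nbrs) < k]
        for v in low:
            # Removing v lowers its neighbours' degrees; they are rechecked
            # on the next pass of the while loop.
            for u in remaining[v]:
                remaining[u].discard(v)
            del remaining[v]
            changed = True
    return set(remaining)


# Hypothetical repost relations between four accounts.
edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
adj = build_adjacency(edges)
print(sorted(k_core(adj, 2)))       # ['a', 'b', 'c']
print(degree_centrality(adj)["c"])  # 1.0
```

In this toy network, account `d` drops out of the 2-core because it has a single connection, while `a`, `b` and `c` form the tightly connected core.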