[Objective] This paper analyzes the trending topics generated by users from disaster-affected areas and by users from non-affected areas at different stages of a disaster, aiming to discover how these topics evolve. [Methods] First, we used geo-tags and users’ profiles to determine their locations. Second, we proposed a framework based on topic-word co-occurrence and community detection to identify trending topics, calculate topic strength, and analyze topic evolution. Third, we used an alluvial diagram to visualize the evolution of these topics. Finally, based on situational awareness theory, we compared the macro- and micro-evolutionary patterns of trending topics between the two user groups. [Results] During a disaster, the affected users mainly published tweets on the physical environment, while the non-affected users tended to express their emotions on Twitter. After a disaster, the affected users mainly published emotional topics, while the non-affected users posted tweets on the built and physical environments. [Limitations] Determining a user’s geographic location from his/her profile might not be reliable. More research is needed to optimize the measurement of topic strength. [Conclusions] The affected and non-affected users show different topic preferences at various stages of a disaster, which helps the related agencies identify people in need more effectively.
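The abstract does not detail the co-occurrence framework, but its first step can be illustrated with a minimal sketch: count how often two words appear in the same tweet, yielding the weighted word graph that a community detection algorithm would then partition into candidate topics. The toy tweets below are invented for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(tweets):
    """Count how often two distinct words appear in the same tweet.
    The resulting weighted edges form the topic-word co-occurrence
    graph that community detection would partition into topics."""
    edges = Counter()
    for tweet in tweets:
        words = sorted(set(tweet.lower().split()))
        for pair in combinations(words, 2):
            edges[pair] += 1
    return edges

tweets = [
    "flood water rising downtown",
    "flood water in downtown shelters",
    "stay safe everyone",
]
edges = cooccurrence_edges(tweets)
```

Sorting the word set makes each pair canonical, so ("downtown", "flood") and ("flood", "downtown") accumulate into one edge.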
[Objective] The paper explores the influence of sample size, the N value of N-grams, stop words, and word-frequency weighting methods on the automatic recognition of rhetorical moves in scientific papers, aiming to improve the move recognition method based on the support vector machine (SVM) model. [Methods] We retrieved a total of 1.1 million labeled moves from 720,000 structured abstracts of scientific papers as experimental data and constructed an SVM model for move recognition. Following the single-variable principle, we changed the sample size, the N value, the removal of stop words, and the word-frequency weighting method one at a time to analyze their impacts on the model’s performance. [Results] The model yielded the best results with a sample size of 600,000 abstracts, an N value range of [1,2], stop words kept, and TF-IDF word-frequency weighting. [Limitations] We only examined the model with structured abstracts, which might not be comparable with other studies. [Conclusions] The sample size and some fine-grained features have significant impacts on the performance of traditional machine learning models.
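A minimal sketch of the winning feature configuration, not the authors’ pipeline: extract uni- and bigrams (N in [1,2]), keep stop words, and weight by TF-IDF. The two toy documents are invented, and a real system would feed these weights into an SVM.

```python
import math
from collections import Counter

def ngrams(tokens, n_values=(1, 2)):
    """Extract all n-grams for each n in n_values (stop words kept)."""
    grams = []
    for n in n_values:
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams

def tfidf(docs, n_values=(1, 2)):
    """Per-document TF-IDF weights over uni- and bigram features."""
    gram_docs = [ngrams(d.lower().split(), n_values) for d in docs]
    df = Counter()
    for gd in gram_docs:
        df.update(set(gd))
    n_docs = len(docs)
    weights = []
    for gd in gram_docs:
        tf = Counter(gd)
        weights.append({g: tf[g] * math.log(n_docs / df[g]) for g in tf})
    return weights

docs = ["we propose a model", "we evaluate the model on benchmarks"]
w = tfidf(docs)
```

With this basic IDF variant, a feature occurring in every document (such as "we" here) gets weight zero, which is exactly why TF-IDF can outperform raw counts on move recognition.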
[Objective] This paper proposes a new supervised learning method to automatically identify road intersections from GPS trajectory data generated by travelers in mixed traffic modes. [Methods] First, we encoded and partitioned the original trajectory data and their active regions with the GeoHash algorithm. Then, the coded trajectories and the coding matrix of active regions were mapped into a binary fusion matrix capturing the characteristics of road intersections. Finally, we employed the K-nearest-neighbor classification algorithm with a sliding window to identify the intersections. [Results] The proposed method was more efficient than the systems based on latitude-longitude coordinates: encoding with the GeoHash algorithm reduced the volume of the datasets by 61%. It also outperformed the turning-angle based methods, and its F1 score was 0.82 under a distance measure of 50 meters. [Limitations] More real-life GPS data is needed to better evaluate our method’s performance. [Conclusions] The proposed method is robust to changes in sampling frequency and could effectively identify urban intersections from GPS trajectory data.
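GeoHash itself is a standard public algorithm, so the encoding step can be sketched exactly: longitude and latitude bits are interleaved by repeated interval bisection, then packed into base-32 characters, turning a coordinate pair into a short, prefix-comparable string. The matrix-fusion and KNN steps of the paper are not reproduced here.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # GeoHash alphabet (no a, i, l, o)

def geohash_encode(lat, lon, precision=9):
    """Encode a (lat, lon) pair into a GeoHash string.
    Even-indexed bits bisect the longitude range, odd-indexed bits the
    latitude range; every 5 bits become one base-32 character."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits, even = [], True
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if even else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        even = not even
    return "".join(
        BASE32[int("".join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

code = geohash_encode(57.64911, 10.40744, precision=11)
```

Because nearby points share long prefixes, trajectories can be bucketed by prefix, which is the property that lets the paper shrink its datasets.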
[Objective] This study modifies the TextRank algorithm with a word-node removal method, aiming to improve keyword extraction from Chinese documents. [Methods] We proposed RemoveRank, an updated algorithm for collecting Chinese keywords that alternately carries out sorting and removing steps. Exploiting the complex-network structure of the word graph, we used the removal queue as the ranking of word nodes to extract keywords. [Results] We examined the proposed method on a dataset with marked keywords from Southern Weekend. The new algorithm outperformed the traditional methods: when the number of extracted keywords was 3, 5, or 7, its F values were 4%, 6%, and 5% higher than those of TextRank, respectively. [Limitations] Our word graph did not include edge weights. [Conclusions] The RemoveRank method could effectively extract keywords from Chinese documents given appropriate sliding-window values.
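The alternate sort-and-remove idea can be sketched minimally, with degree centrality standing in for whatever node score the paper actually uses: repeatedly take the highest-scoring node out of the word graph, and read the removal queue as the keyword ranking. The toy graph is invented.

```python
def remove_rank(graph):
    """Alternately rank and remove: pick the highest-degree node,
    append it to the removal queue, delete it (and its edges), and
    re-rank the shrunken graph. The queue is the keyword ranking.
    Degree is a stand-in score; ties break lexicographically."""
    g = {u: set(vs) for u, vs in graph.items()}
    queue = []
    while g:
        top = max(g, key=lambda u: (len(g[u]), u))
        queue.append(top)
        for vs in g.values():
            vs.discard(top)
        del g[top]
    return queue

graph = {
    "economy": {"policy", "market", "growth"},
    "policy": {"economy", "market"},
    "market": {"economy", "policy"},
    "growth": {"economy"},
}
ranking = remove_rank(graph)
```

Unlike plain TextRank, re-scoring after each removal lets a node’s rank reflect the graph left behind by stronger keywords, which is the intuition behind the removal queue.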
[Objective] This paper constructs a time-series prediction model based on the fluctuation of users’ historical interests, aiming to improve recommendation results. [Methods] We added a time attenuation factor to the ratings of each type of user and linearly fitted the data fluctuation with a neural network. Then, we chose the optimal parameters to compare the effectiveness of the proposed method. [Results] We conducted five rounds of user simulation tests and found that the MAE and RMSE errors of the proposed method were reduced by 47.63% and 44.61%, respectively. [Limitations] The analysis of time fluctuation relies on users’ historical data; thus, an additional cold-start algorithm is needed to preprocess the data. [Conclusions] The proposed method could effectively analyze and predict users’ changing interests in different commodities and provide more accurate recommendation lists.
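The abstract does not give the attenuation formula; a common hedged choice is exponential half-life decay, where a rating’s weight halves every fixed number of days so that recent interests dominate the profile. The half-life value and ratings below are invented.

```python
def decayed_rating(rating, days_ago, half_life=30.0):
    """Exponential time attenuation: a rating's contribution halves
    every `half_life` days, privileging recent interest signals."""
    return rating * 0.5 ** (days_ago / half_life)

# (rating, days since the rating was given) -- toy history
history = [(5.0, 0), (5.0, 30), (4.0, 90)]
weighted = [decayed_rating(r, d) for r, d in history]
```

A today’s 5-star rating keeps full weight, while the same rating from a month ago counts half as much; the decayed series is what a downstream model would fit.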
[Objective] This paper builds a Q-LDA model to identify topics in an online health community, aiming to improve the quality of the information generated by the LDA model as well as its topic representation ability. [Methods] First, we evaluated and weighted the online health information. Then, we constructed the Q-LDA topic mining model on top of the LDA model. Finally, we examined the proposed model with real-world data. [Results] The Q-LDA model yielded better results than the traditional LDA model, improving the efficiency of topic extraction by 16%. [Limitations] We only examined the proposed model with textual data from online discussion boards on one disease. [Conclusions] Adding the quality of health information to data mining could help us better meet users’ needs.
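The abstract does not spell out how quality scores enter Q-LDA. One minimal interpretation, with invented posts and scores, is to scale each post’s word counts by its quality score before topic modeling, so that low-quality posts contribute less to the topic-word statistics:

```python
from collections import Counter

def quality_weighted_counts(posts):
    """Scale each post's word counts by its quality score in [0, 1],
    so low-quality posts contribute less to topic-word counts."""
    counts = Counter()
    for text, quality in posts:
        for word in text.lower().split():
            counts[word] += quality
    return counts

# (text, quality score) -- toy health-community posts
posts = [
    ("insulin dosage advice from a physician", 0.9),
    ("miracle cure click here", 0.1),
]
counts = quality_weighted_counts(posts)
```

These weighted counts would replace raw counts in the model’s word statistics; the paper’s actual integration into LDA inference may differ.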
[Objective] This paper analyzes the contributions of crowdsourcing community members, aiming to encourage them to share more knowledge. [Methods] We adopted a fuzzy-set qualitative comparative analysis to study members of a competitive knowledge crowdsourcing community. We chose condition variables from the community environment, motivation theory, and the sunk cost effect, and used the degree of knowledge contribution as the outcome variable. [Results] There are two configurations of high-level knowledge sharing: (I) communities with administrators could lead members to a high degree of knowledge sharing through currency rewards and sunk costs (time or money); (II) members who invested both money and time could also promote high-level knowledge sharing. There are two configurations of low-level knowledge sharing: (I) in communities without administrators, it is hard for members to share knowledge; (II) members without money or time investments showed a low level of knowledge sharing. [Limitations] We only studied one website, and the cross-sectional nature of the research data might influence our results. [Conclusions] The proposed method helps knowledge sharing communities improve member management as well as the quality of crowdsourcing tasks.
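The core fuzzy-set QCA quantity behind such configurations is consistency, which measures how well “condition X is sufficient for outcome Y” holds over cases: sum(min(x_i, y_i)) / sum(x_i) for memberships in [0, 1]. The membership scores below are invented for illustration.

```python
def consistency(x, y):
    """fsQCA consistency of 'X is sufficient for Y' over cases,
    with fuzzy membership scores x_i, y_i in [0, 1]."""
    return sum(min(a, b) for a, b in zip(x, y)) / sum(x)

# toy memberships per community: condition = "has administrators",
# outcome = "high-level knowledge sharing"
x = [0.8, 0.6, 0.9, 0.2]
y = [0.9, 0.7, 0.8, 0.1]
c = consistency(x, y)
```

Configurations whose consistency clears a threshold (often around 0.8) are retained as the sufficient paths reported in the results.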
[Objective] This study tries to improve the assessment of prison risks, such as violence, suicide, and abetting or being abetted. [Methods] We proposed a risk assessment system for prisoners based on the interval-valued fuzzy VIKOR method. First, from the 62-dimension sample data of more than 1,100 prisoner records, we established an optimized data set with the interval-valued fuzzy VIKOR method. Then, we trained the new model with multiple machine learning algorithms. Finally, we compared the performance of our model with the existing ones. [Results] The precision, recall and F1 values were improved by 8.9%, 11.1% and 0.1, respectively. [Limitations] We could not propose a universal algorithm for all types of risks. [Conclusions] Our model provides new directions for prison management and research.
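The paper uses an interval-valued fuzzy extension of VIKOR that the abstract does not specify; the plain (crisp) VIKOR core can still be sketched. For benefit criteria, each alternative gets a group utility S, an individual regret R, and a compromise index Q (lower is better). The matrix and weights are invented.

```python
def vikor(matrix, weights, v=0.5):
    """Crisp VIKOR sketch for benefit criteria: lower Q means a
    better compromise ranking. v trades group utility S against
    individual regret R."""
    cols = list(zip(*matrix))
    best = [max(c) for c in cols]
    worst = [min(c) for c in cols]
    S, R = [], []
    for row in matrix:
        terms = [w * (b - x) / (b - wst) if b != wst else 0.0
                 for x, w, b, wst in zip(row, weights, best, worst)]
        S.append(sum(terms))
        R.append(max(terms))
    s_best, s_worst = min(S), max(S)
    r_best, r_worst = min(R), max(R)
    return [v * (s - s_best) / (s_worst - s_best)
            + (1 - v) * (r - r_best) / (r_worst - r_best)
            for s, r in zip(S, R)]

# rows: prisoners; columns: toy indicators (higher = lower risk here)
matrix = [[0.9, 0.8, 0.7], [0.4, 0.5, 0.3], [0.6, 0.9, 0.5]]
Q = vikor(matrix, weights=[0.5, 0.3, 0.2])
```

The first row dominates on two criteria and ties for best on the third, so it ends up with Q = 0; such compromise scores are what feed the downstream machine learning step.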
[Objective] This study proposes a route recommendation method based on two-way link analysis of geographic name entities, aiming to improve the results with entity properties. [Methods] First, we collected data from the directed weighted network of different place-name entities in specific scenarios. Then, we calculated the chain-in and chain-out values of the trajectory chains belonging to the ideal set of place-name entities. Finally, based on Boolean logic and the position-qualifying elements of users’ queries, we applied a fuzzy search algorithm to match user queries and trajectory chains. [Results] The precision of the proposed algorithm was 0.75, higher than that of traditional recommendation methods, although the recall rate did not change significantly. As the scale of the weighted network increased, the precision and recall rates showed a clear inverse relationship. [Limitations] We did not examine the impacts of object attribute data on the recommendation results. [Conclusions] The proposed method combines recommendation algorithms based on statistical and semantic analysis, and can quickly generate alternative routes and a recommendation index.
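The paper’s exact chain-in/chain-out definitions are not given in the abstract; a minimal interpretation counts, across all trajectory chains, how often each place-name entity is entered from another entity (chain-in) and how often it leads on to one (chain-out). The toy trajectories are invented.

```python
from collections import Counter

def chain_scores(trajectories):
    """Chain-in / chain-out counts per place-name entity over a set
    of trajectory chains (ordered lists of visited entities)."""
    chain_in, chain_out = Counter(), Counter()
    for chain in trajectories:
        for src, dst in zip(chain, chain[1:]):
            chain_out[src] += 1
            chain_in[dst] += 1
    return chain_in, chain_out

trajectories = [
    ["station", "museum", "park"],
    ["hotel", "museum", "park"],
]
chain_in, chain_out = chain_scores(trajectories)
```

Entities with high chain-in and chain-out values act as hubs in the directed weighted network, which is the signal the two-way link analysis exploits when matching queries to chains.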
[Objective] This paper evaluates the influence of scholars in a more scientific and standardized way, aiming to find domain experts effectively. [Methods] First, we constructed a knowledge super-network model from four dimensions: author, literature, domain, and subject. Second, using measurement methods for super-networks and literature, the LDA model, and the PageRank algorithm, we presented a domain expert identification method based on the knowledge super-network. [Results] Taking library and information science as the test field, we found that the proposed model yielded better results than the h-index, the p-index, and social network analysis. [Limitations] We only retrieved papers from selected journals, so the results may differ with other data. The granularity of mining domain labels through the LDA topic model needs to be refined. [Conclusions] Based on the knowledge super-network of scientific and technological literature, the proposed method could assess academic impacts effectively and provides new ideas for identifying domain experts.
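Only the PageRank ingredient of the method is standard enough to sketch; the super-network measures and LDA labeling are not reproduced. Plain power iteration over a small directed graph (an invented toy citation network) looks like this:

```python
def pagerank(links, d=0.85, iters=50):
    """Plain PageRank by power iteration on a directed graph given as
    {node: [nodes it links to]}. d is the damping factor."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1 - d) / n for u in nodes}
        for src, outs in links.items():
            if not outs:  # dangling node: spread its mass evenly
                for u in nodes:
                    new[u] += d * rank[src] / n
            else:
                for dst in outs:
                    new[dst] += d * rank[src] / len(outs)
        rank = new
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
```

Node "c", which receives links from three others, outranks "d", which receives none; in the paper the analogous scores flow over author/literature links in the super-network.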
[Objective] This paper proposes a new classification method based on grammar rules, aiming to improve the accuracy of sentiment analysis for Chinese texts. [Methods] First, we combined Chinese grammar rules with the Bi-LSTM model in the form of constraints and standardized the adjacent positions of sentences from the experimental corpus. Then, we modeled the linguistic functions of non-emotional, emotional, negative, and degree words at the sentence level. [Results] Compared with the RNN, LSTM and Bi-LSTM models, our model reached an accuracy of up to 91.2%. [Limitations] The experimental data was only collected from hotel reviews; more research is needed to examine the performance of this model on other data sets. [Conclusions] The proposed method improves the accuracy of sentiment classification for Chinese texts.
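The paper injects such grammar rules as constraints into a Bi-LSTM, which cannot be reproduced in a few lines; what can be sketched is the rule behavior itself. In the invented mini-lexicon below, a negator flips the next sentiment word and a degree word scales it:

```python
def rule_score(tokens, lexicon,
               negators={"不", "没有"}, degrees={"很": 1.5, "非常": 2.0}):
    """Grammar-rule sentiment scoring for a tokenized sentence:
    a negator flips the polarity of the next sentiment word,
    a degree word scales its intensity; both reset afterwards."""
    score, flip, scale = 0.0, 1.0, 1.0
    for tok in tokens:
        if tok in negators:
            flip = -1.0
        elif tok in degrees:
            scale = degrees[tok]
        elif tok in lexicon:
            score += flip * scale * lexicon[tok]
            flip, scale = 1.0, 1.0
    return score

lexicon = {"好": 1.0, "差": -1.0}
s1 = rule_score(["服务", "很", "好"], lexicon)  # "service is very good"
s2 = rule_score(["房间", "不", "好"], lexicon)  # "room is not good"
```

These are exactly the negative- and degree-word functions the paper models at the sentence level, here as plain rules rather than as neural-network constraints.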
[Objective] This paper explores the relationship between stock market fluctuations and social media users’ interactive behaviors, aiming to predict stock prices with social data. [Methods] First, we set snapshots and constructed several social networks by crawling the quotes of Sina Finance Blogs. Then, we extracted the networks’ topological features and conducted correlation analysis between these features and the Shanghai Composite Index. Finally, we used the Granger causality test to further examine the relationship between the Shanghai Composite Index and the correlated features. [Results] There was a quadratic relationship between graph density and the Shanghai Composite Index, with an extreme point at 3,400. There was a positive correlation between blog nodes’ average number of likes and the Shanghai Composite Index (correlation coefficient = 0.486). With a first-order lag, the average number of likes is a Granger cause of the Shanghai Composite Index. [Limitations] We did not calculate the emotional scores of the blogs and only extracted basic topological features. [Conclusions] Users’ social network behaviors could help us predict changes in the stock market.
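The correlation-analysis step that precedes the Granger test can be sketched as a plain Pearson coefficient between a topological feature series and the index series; the Granger test itself is omitted. Both toy series below are invented, not the paper’s data.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# toy series: average likes per node vs. index level at each snapshot
likes = [10, 12, 15, 11, 18, 20]
index = [3100, 3150, 3230, 3120, 3300, 3390]
r = pearson(likes, index)
```

Features whose correlation is significant are then checked for predictive power by regressing today’s index on lagged values of both series, which is what the Granger causality test formalizes.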
[Objective] This study combines the external features and contents of Weibo posts, aiming to identify online opinion leaders with the help of text sentiment analysis. [Methods] First, we identified potential opinion leaders and introduced the Word2Vec algorithm to find new sentiment words. Then, we conducted sentiment analysis to categorize the texts as positive, negative, or neutral. Finally, we detected and removed bloggers who attracted too many negative comments. [Results] The proposed model optimized the ranking of opinion leaders; it outperformed the improved PageRank algorithm and was more consistent with the Weibo data. [Limitations] We only examined our model with one breaking news event. [Conclusions] This paper identifies three types of online opinion leaders from public reactions to emergencies.
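Finding new sentiment words with Word2Vec typically means flagging words whose embeddings lie close, by cosine similarity, to known seed sentiment words. A minimal sketch with invented 3-dimensional “embeddings” (real vectors would come from a trained Word2Vec model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def expand_lexicon(seed_words, vectors, threshold=0.8):
    """Flag a word as a new sentiment word when its embedding is close
    to any seed sentiment word."""
    return [word for word, vec in vectors.items()
            if word not in seed_words
            and any(cosine(vec, vectors[s]) >= threshold for s in seed_words)]

vectors = {
    "great": [0.9, 0.1, 0.0],
    "awesome": [0.85, 0.15, 0.05],
    "table": [0.0, 0.2, 0.9],
}
new = expand_lexicon({"great"}, vectors)
```

The similarity threshold controls how aggressively the sentiment lexicon grows; the newly found words then feed the positive/negative/neutral classification step.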