[Objective] This paper uses multi-dimensional information from social media user profiles to classify users automatically. [Methods] First, we defined four types of social media users: individual, media, government, and organization. Then, we extracted the following features from user profiles: demographic characteristics, naming conventions, and self-descriptions. Third, we created a user classification model based on machine learning algorithms and evaluated its performance with a real Twitter dataset. [Results] Both precision and recall of the proposed model were greater than 83%. The naming, demographic, and self-description features made increasingly large contributions to the classification model. [Limitations] The sample size needs to be expanded, which would help us better analyze the characteristics of different users. [Conclusions] The proposed method could accurately identify the four types of users, which benefits future social media user classification research.
[Objective] This paper proposes a tripartite network sentiment analysis method, aiming to reveal the indirect connections between nodes. [Methods] We constructed a “user-product-sentiment tag” tripartite network, which was split into three bipartite networks for network structure analysis. Then, we used the proposed tripartite network projection method to obtain the “two-sentiment one-mode” networks of users and products. [Results] We obtained associations of highly weighted related nodes from a NetEase Cloud Music dataset, along with information such as genre classifications, hot-rated songs, and fan groups. [Limitations] The large number of user nodes needs better visualization in future work. [Conclusions] Based on the formation, splitting, and projection of the sentiment tripartite network, we present the indirect connections between nodes and provide new perspectives for network sentiment analysis.
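The projection step described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the triples, helper name, and weighting rule (one unit of weight per shared product-sentiment pair) are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

def project_users(triples):
    """Project a user-product-sentiment tripartite network onto a
    weighted one-mode user network: two users are linked once for
    every (product, sentiment) pair they share."""
    members = defaultdict(set)            # (product, sentiment) -> users
    for user, product, sentiment in triples:
        members[(product, sentiment)].add(user)
    weights = defaultdict(int)            # (user_a, user_b) -> shared count
    for users in members.values():
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical listening records: (user, song, sentiment tag).
edges = [("u1", "songA", "joy"), ("u2", "songA", "joy"),
         ("u2", "songB", "sad"), ("u3", "songB", "sad"),
         ("u1", "songB", "joy")]
print(project_users(edges))  # {('u1', 'u2'): 1, ('u2', 'u3'): 1}
```

Splitting by sentiment before pairing is what makes the projection a “sentiment one-mode” network: u1 and u3 both touch songB, but with different tags, so they are not linked.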
[Objective] This paper measures the development and life cycle of a technology system with an improved technology entropy method, aiming to provide a theoretical foundation for predicting technology development and for government decision-making. [Methods] We constructed a model measuring technological entropy based on information entropy and multiple indicators for the patented technology system. Then, we conducted an empirical analysis with the new model on carbon capture technology in China. [Results] We found that the target technology has concluded the sprouting and slow-growth stages and is currently in the rapid-growth stage. [Limitations] The quality of the sample data needs to be improved. [Conclusions] The proposed method is an effective way to analyze the evolution trends of a patent technology system and provides a better solution for identifying the life cycle of technologies.
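The information-entropy core of such a measurement can be illustrated with a minimal sketch; the indicator counts and function name below are hypothetical, and the paper's full model combines multiple indicators rather than a single distribution.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a discrete distribution given raw
    counts, e.g. yearly patent counts across technology subclasses."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A concentrated system (one dominant subclass) has low entropy;
# an evenly spread system has high entropy.
print(shannon_entropy([97, 1, 1, 1]))     # ≈ 0.24 bits, concentrated
print(shannon_entropy([25, 25, 25, 25]))  # 2.0 bits, fully even
```

Tracking how such entropy values change year by year is one way a technology-entropy curve can distinguish sprouting, slow-growth, and rapid-growth stages.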
[Objective] This study tries to improve POI recommendation based on users’ geographic information and social relationships. [Methods] First, we proposed an MFDR model (MF with Distance-entropy and Refined-social-regularization), which introduced the concept of distance entropy to refine users’ preferences and the frequency-based user-interest matrix. Then, we applied the user-relationship-interest matrix to refine preferences with users’ social relationships. Finally, we used the regularization-based matrix factorization method to factorize the user-preference matrix and the user-relationship-interest matrix to ensure their consistency. [Results] We examined the new model with the Gowalla and Brightkite check-in datasets and found it outperformed existing POI recommendation algorithms. When the number of latent factors was 10 and the number of recommended POIs was 10, the precision and recall of MFDR on Gowalla reached 4.47% and 9.95%, respectively. These results were 30.71% and 28.93% higher than those of traditional POI recommendation models. [Limitations] The experimental datasets need to be expanded. [Conclusions] The proposed MFDR model, based on geographical preference refinement and implicit social-relationship preference analysis, is an effective way to recommend POIs.
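A minimal sketch of matrix factorization with a social regularization term, in the spirit of (but not identical to) the MFDR objective: besides the usual squared-error gradient with L2 decay, each user's latent vector is pulled toward the mean of their friends' vectors. All names, data, and hyperparameters below are illustrative.

```python
def sse(U, V, ratings):
    """Sum of squared rating errors for the current factors."""
    return sum((r - sum(pu * qi for pu, qi in zip(U[u], V[i]))) ** 2
               for (u, i), r in ratings.items())

def mf_social_step(U, V, ratings, friends, lr=0.01, reg=0.02, social=0.1):
    """One SGD pass: error gradient + L2 weight decay + a term pulling
    each user's vector toward the mean of their friends' vectors."""
    for (u, i), r in ratings.items():
        err = r - sum(pu * qi for pu, qi in zip(U[u], V[i]))
        flist = friends.get(u, [])
        fmean = [sum(U[f][k] for f in flist) / max(len(flist), 1)
                 for k in range(len(U[u]))]
        pu_old = U[u][:]
        U[u] = [pu + lr * (err * qi - reg * pu - social * (pu - fm))
                for pu, qi, fm in zip(U[u], V[i], fmean)]
        V[i] = [qi + lr * (err * pu - reg * qi)
                for pu, qi in zip(pu_old, V[i])]

# Tiny check-in style example with hypothetical users and POIs.
U = {"u1": [0.1] * 4, "u2": [0.1] * 4}
V = {"p1": [0.1] * 4, "p2": [0.1] * 4}
ratings = {("u1", "p1"): 5.0, ("u2", "p1"): 4.0, ("u1", "p2"): 1.0}
friends = {"u1": ["u2"], "u2": ["u1"]}
before = sse(U, V, ratings)
for _ in range(300):
    mf_social_step(U, V, ratings, friends)
print(before, "->", sse(U, V, ratings))  # error shrinks over the epochs
```

Factorizing both matrices with a shared user factor, as the abstract describes, would replace the single `ratings` loop with two coupled loss terms; the social pull above is the simplest form of that consistency constraint.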
[Objective] This paper explores evaluation methods for the information services of online health Q&A platforms, aiming to promote their sustainable development. [Methods] We introduced the SERVQUAL framework and established assessment indicators and an extension evaluation model. [Results] We examined the proposed model with Dingxiang Doctor, a health Q&A platform in China, to evaluate the quality of its information services. We found its quality grade was 3, and the characteristic value of the grade variable was 2.955. These results indicated that Dingxiang Doctor maintains good services; however, its reliability, assurance, and empathy need to be improved. [Limitations] The sample of this research is small, and the expert scoring method might be subjective. [Conclusions] The matter-element model and extension evaluation method can help us evaluate and improve the services of online health Q&A platforms.
[Objective] This paper uses active learning methods, structured abstracts, and a few annotations to create a classification model for sentence functions, aiming to reduce the dependence on manually labeled corpora. [Methods] First, we trained SVM, CNN, and Bi-LSTM classifiers with structured function sentences from abstracts. Then, with the help of active learning techniques, we predicted the functions of a large number of unlabeled common abstract sentences. Third, we automatically identified uncertain samples for manual annotation, which were used to optimize the initial classifiers. Finally, we iterated this active learning procedure to improve the classifiers’ performance. [Results] We examined the new method with Library and Information Science literature. The precision, recall, and F1 values were 84.65%, 84.49%, and 84.57%, which were 3.25%, 3.24%, and 3.25% higher than those of the traditional methods. [Limitations] We only conducted five iterations to avoid the massive work of manual corpus annotation. [Conclusions] The active learning method could effectively discover the differences between unlabeled corpora and the existing training corpus, reducing manual labeling costs. The proposed method might also be used in citation and full-text analysis.
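The uncertain-sample selection step can be illustrated with a least-confidence sketch, a common uncertainty-sampling criterion; the paper does not state which criterion it uses, and the class probabilities below are invented.

```python
def least_confident(probas, k):
    """Uncertainty sampling: return indices of the k unlabeled samples
    whose top predicted class probability is lowest. These are the
    samples sent for manual annotation and added to the training set."""
    ranked = sorted(range(len(probas)), key=lambda i: max(probas[i]))
    return ranked[:k]

# Hypothetical classifier outputs over three sentence-function classes.
probas = [[0.34, 0.33, 0.33],   # very uncertain
          [0.90, 0.05, 0.05],   # confident
          [0.50, 0.45, 0.05]]   # somewhat uncertain
print(least_confident(probas, 2))  # [0, 2]
```

Each active learning iteration would call this on the pool of unlabeled abstract sentences, collect human labels for the returned indices, and retrain the classifiers.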
[Objective] This paper proposes a modified collaborative filtering algorithm, aiming to improve the results of personalized recommendations. [Methods] First, we evaluated item quality and corrected user ratings based on their previous records. Then, we identified users with similar interests to generate better recommendations. [Results] We tested the new algorithm on the MovieLens dataset and found its MAE was 4.7% better than those of the traditional or other modified methods. [Limitations] The new algorithm does not address interest drifting issues. [Conclusions] The proposed algorithm could recommend products to consumers more effectively.
[Objective] This paper integrates topic information into the TextRank model, aiming to improve the precision and recall of automatic keyword extraction. [Methods] First, we used LDA to model document topics and obtained the topic distribution of the candidate keywords. Then, we calculated the node weights with the topic-word probability distribution features. Third, we weighted the probability distributions of document-topic and topic-word characteristics as each node’s random jump probability. Finally, we constructed a new transition matrix for word graph iteration to improve the TextRank model. [Results] We examined the proposed model with 1,559 news articles from the website of Southern Weekly. When the number of extracted keywords was three, the model’s keyword extraction precision values were 4.7% and 6.5% higher than those of the original TextRank and TF-IDF algorithms. [Limitations] The fusion algorithm increased computational complexity. [Conclusions] The proposed algorithm could extract keywords more effectively.
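The topic-biased random jump described above can be sketched as a small biased-PageRank iteration over the word graph. The toy graph, topic probabilities, and function name are illustrative assumptions, not the paper's exact transition matrix.

```python
def topic_textrank(graph, topic_prob, d=0.85, iters=50):
    """TextRank-style power iteration where the (1 - d) random-jump
    mass is distributed by each word's topic probability instead of
    uniformly.  graph: {word: {neighbor: edge weight}};
    topic_prob: {word: LDA-derived topic probability}."""
    nodes = list(graph)
    z = sum(topic_prob[n] for n in nodes)
    bias = {n: topic_prob[n] / z for n in nodes}     # normalized jump vector
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        score = {n: (1 - d) * bias[n] + d * sum(
                     score[m] * w[n] / sum(w.values())
                     for m, w in graph.items() if n in w)
                 for n in nodes}
    return score

# Hypothetical co-occurrence graph and topic probabilities.
graph = {"deep": {"learning": 1.0},
         "learning": {"deep": 1.0, "model": 1.0},
         "model": {"learning": 1.0}}
topic_prob = {"deep": 0.2, "learning": 0.6, "model": 0.2}
score = topic_textrank(graph, topic_prob)
print(max(score, key=score.get))  # "learning"
```

A word that is both well connected and topically central ends up ranked highest, which is the intended effect of fusing LDA with TextRank.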
[Objective] This paper tries to improve the recommendation algorithm, aiming to reduce the dependence on the number of groups (the k value) at the categorization stage. [Methods] We used the ISA algorithm to modify the collaborative filtering algorithm and finish the clustering tasks from the perspectives of users and items. Then, we created a virtual user representing group interests based on users’ expertise. Finally, we predicted the target users’ ratings with the new collaborative filtering algorithm. [Results] The algorithm removes the empirical dependence on k and improves the accuracy of the collaborative filtering recommendation algorithm. The MAE was reduced to 0.697 with 200 groups and to 0.693 with 500 groups on the FilmTrust dataset. The RMSE was reduced to 1.022 on the MovieLens dataset. [Limitations] Several rounds of repeated experiments are needed to improve the quality of this study. [Conclusions] The algorithm does not depend on an empirically chosen k and effectively improves the performance of the collaborative filtering recommendation algorithm.
[Objective] This paper proposes a model using machine learning techniques and various omics data, aiming to better predict the survival length of breast cancer patients. [Methods] The prediction model was established with the random forest algorithm. It merged four types of omics data of breast cancer cases from the TCGA database: gene expression, copy number variation, DNA methylation, and protein expression. [Results] On the test dataset, the model’s prediction precision reached 97.22%, and the recall was 98.13%. Compared with the existing models, the AUC value of our new algorithm was the highest (0.8393). [Limitations] The sample size needs to be expanded. [Conclusions] The proposed method is an effective way to predict breast cancer patients’ survival length.
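The random-forest idea behind the model can be illustrated with a toy majority-vote ensemble of depth-1 stumps trained on bootstrap samples. This is a didactic sketch only, not the authors' multi-omics model; the data and all names are invented.

```python
import random
from collections import Counter

def train_forest(X, y, n_trees=25, seed=7):
    """Toy random-forest flavor: each 'tree' is a depth-1 stump fit on
    a bootstrap sample, splitting a random feature at its sample mean
    and predicting the majority label on each side."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]     # bootstrap sample
        f = rng.randrange(len(X[0]))                 # random feature
        thr = sum(X[i][f] for i in idx) / len(idx)
        left = Counter(y[i] for i in idx if X[i][f] <= thr)
        right = Counter(y[i] for i in idx if X[i][f] > thr)
        stumps.append((f, thr,
                       left.most_common(1)[0][0] if left else y[0],
                       right.most_common(1)[0][0] if right else y[0]))
    return stumps

def predict(stumps, x):
    """Majority vote over all stumps."""
    votes = Counter(l if x[f] <= thr else r for f, thr, l, r in stumps)
    return votes.most_common(1)[0][0]

# Invented one-feature data standing in for a merged omics matrix;
# label 1 marks the (hypothetical) longer-survival class.
X = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]
y = [0, 0, 0, 1, 1, 1]
stumps = train_forest(X, y)
print(predict(stumps, [0.0]), predict(stumps, [1.0]))
```

Real random forests grow full trees over thousands of merged omics features, but bootstrap sampling, random feature choice, and majority voting are the same three ingredients.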
[Objective] This paper tries to predict the impacts of financial events/news on stock prices with financial, non-financial, and public opinion factors. [Methods] We designed a financial affairs ontology based on Rule-Based Reasoning (RBR) and Case-Based Reasoning (CBR). Then, we created an SWRL rule-based reasoning model, which performed the rule-based reasoning with the Drools engine. Third, we designed a topic case database to describe the structure of the financial cases. Finally, we used the model to describe, retrieve, reuse, revise, and retain cases. [Results] We conducted an empirical study with enterprise data to examine the reliability of rule-based and case-based reasoning. [Limitations] We did not compare our model with existing methods. [Conclusions] The proposed method could predict stock prices in a big data environment.
[Objective] This paper modifies the method for new word extraction, which is used to improve the performance of medical text segmentation models. [Methods] With the help of the traditional mutual information model, we obtained statistics on words and strings. Then, we established a logistic regression classification model with these data and built an algorithm for new word identification. [Results] A series of experiments was carried out on the texts of electronic medical records from the Dermatology Department of Xiangya Hospital. Compared with PMI, PMI², and PMI³, our model with logistic regression achieved the highest accuracy of new word extraction (0.803). [Limitations] To establish the logistic regression model for classification, we had to manually judge whether or not the training strings are words. [Conclusions] The proposed model and algorithm could effectively identify new words from medical records.
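The mutual information statistics underlying these baselines can be sketched as follows; the PMI^k generalization covers the PMI² and PMI³ variants mentioned above. The counts below are invented for illustration, and such scores would serve as features for the logistic regression classifier.

```python
import math

def pmi(c_xy, c_x, c_y, total, k=1):
    """PMI^k score of a candidate string xy:
    log2( p(xy)^k / (p(x) * p(y)) ).  k=1 is plain pointwise mutual
    information; k=2 and k=3 give the PMI^2 / PMI^3 variants, which
    are less biased toward very rare pairs."""
    p_xy, p_x, p_y = c_xy / total, c_x / total, c_y / total
    return math.log2(p_xy ** k / (p_x * p_y))

# A pair that nearly always co-occurs (a plausible new medical term)
# scores far above frequent characters that rarely meet by chance.
strong = pmi(50, 60, 55, 10_000)   # co-occurs almost every time
chance = pmi(5, 500, 400, 10_000)  # frequent parts, rare together
print(strong, chance)
```

Because p(xy) < 1, raising it to the power k lowers all scores, but it lowers rare-pair scores more, which is why the variants behave differently as ranking functions.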
[Objective] This paper analyzes the fine-grained characteristics of funding and paper data in English, aiming to identify the frontiers of scientific research. [Methods] We retrieved NSF-funded projects and WOS papers in the field of carbon nanotubes and identified their LDA topics. Then, we compared the topics’ novelty, intensity, and similarity. [Results] We found two trending topics, five emerging topics, four dying topics, and two potential topics. [Limitations] We did not evaluate our method with data in Chinese. [Conclusions] Compared with methods relying on a single data source or dimension, our method can identify the frontiers of scientific research more effectively.
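Topic similarity between funding topics and paper topics can be operationalized in several ways; one common choice is Jensen–Shannon divergence between their topic-word distributions. The abstract does not specify the exact measure, so the sketch below is an assumption, with toy two-word distributions.

```python
import math

def js_divergence(p, q):
    """Jensen–Shannon divergence (in bits) between two topic-word
    distributions; lower values mean the funding topic and the paper
    topic are more similar.  Symmetric and bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical topics diverge by 0; disjoint topics reach the maximum 1.
print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Unlike raw KL divergence, JS is symmetric and finite even when a word has zero probability in one topic, which matters when comparing LDA models trained on different corpora (here, NSF projects vs. WOS papers).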