[Objective] This paper uses multi-dimensional information from social media user profiles to classify users automatically. [Methods] First, we defined four types of social media users: individual, media, government, and organization. Then, we extracted the following features from user profiles: demographic characteristics, naming conventions, and self-descriptions. Third, we created a user classification model based on machine learning algorithms and evaluated its performance with a real Twitter dataset. [Results] Both precision and recall of the proposed model were greater than 83%. The naming, demographic, and self-description features made increasingly large contributions to the classification model. [Limitations] The sample size needs to be expanded, which would help us better analyze the characteristics of different users. [Conclusions] The proposed method could accurately identify the four types of users, which benefits future social media user classification research.
[Objective] This paper proposes a tripartite network sentiment analysis method, aiming to reveal the indirect connections between nodes. [Methods] We constructed a “user-product-sentiment tag” tripartite network, which was split into three bipartite networks for network structure analysis. Then, we used the proposed tripartite network projection method to obtain the “two-sentiment one-mode” networks of users and products. [Results] We obtained associations of highly weighted related nodes from a NetEase Cloud Music dataset, along with information such as genre classifications, hot-rated songs, and fan groups. [Limitations] The large number of user nodes needs better visualization in future work. [Conclusions] Based on the formation, splitting, and projection of the sentiment tripartite network, we present the indirect connections between nodes and provide new perspectives for network sentiment analysis.
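The projection step described above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation: the triples, helper name, and weighting rule (one unit of weight per shared product-sentiment pair) are all assumptions.

```python
from collections import defaultdict
from itertools import combinations

def project_users(triples):
    """Project a user-product-sentiment tripartite network onto a
    weighted one-mode user network: two users are linked once for
    every (product, sentiment) pair they share."""
    members = defaultdict(set)            # (product, sentiment) -> users
    for user, product, sentiment in triples:
        members[(product, sentiment)].add(user)
    weights = defaultdict(int)            # (user_a, user_b) -> shared count
    for users in members.values():
        for a, b in combinations(sorted(users), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical listening records: (user, song, sentiment tag).
edges = [("u1", "songA", "joy"), ("u2", "songA", "joy"),
         ("u2", "songB", "sad"), ("u3", "songB", "sad"),
         ("u1", "songB", "joy")]
print(project_users(edges))  # {('u1', 'u2'): 1, ('u2', 'u3'): 1}
```

Splitting by sentiment before pairing is what makes the projection a “sentiment one-mode” network: u1 and u3 both touch songB, but with different tags, so they are not linked.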
[Objective] This paper measures the development and life cycle of a technology system with an improved technology entropy method, aiming to provide a theoretical foundation for predicting technology development and for government decision-making. [Methods] We constructed a model measuring technological entropy based on information entropy and multiple indicators for the patented technology system. Then, we conducted an empirical analysis with the new model on carbon capture technology in China. [Results] We found that the target technology has concluded the sprouting and slow-growth stages and is currently in the rapid-growth stage. [Limitations] The quality of the sample data needs to be improved. [Conclusions] The proposed method is an effective way to analyze the evolution trends of a patent technology system and provides a better solution for identifying the life cycle of technologies.
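The information-entropy core of such a measurement can be illustrated with a minimal sketch; the indicator counts and function name below are hypothetical, and the paper's full model combines multiple indicators rather than a single distribution.

```python
import math

def shannon_entropy(counts):
    """Shannon entropy (in bits) of a discrete distribution given raw
    counts, e.g. yearly patent counts across technology subclasses."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

# A concentrated system (one dominant subclass) has low entropy;
# an evenly spread system has high entropy.
print(shannon_entropy([97, 1, 1, 1]))     # ≈ 0.24 bits, concentrated
print(shannon_entropy([25, 25, 25, 25]))  # 2.0 bits, fully even
```

Tracking how such entropy values change year by year is one way a technology-entropy curve can distinguish sprouting, slow-growth, and rapid-growth stages.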
[Objective] This study tries to improve POI recommendation based on users’ geographic information and social relationships. [Methods] First, we proposed an MFDR model (MF with Distance-entropy and Refined-social-regularization), which introduced the concept of distance entropy to refine users’ preferences and the frequency-based user-interest matrix. Then, we applied the user-relationship-interest matrix to refine preferences with users’ social relationships. Finally, we used the regularization-based matrix factorization method to factorize the user-preference matrix and the user-relationship-interest matrix to ensure their consistency. [Results] We examined the new model with the Gowalla and Brightkite check-in datasets and found it outperformed existing POI recommendation algorithms. When the number of latent factors was 10 and the number of recommended POIs was 10, the precision and recall of MFDR on Gowalla reached 4.47% and 9.95%, respectively. These results were 30.71% and 28.93% higher than those of traditional POI recommendation models. [Limitations] The experimental datasets need to be expanded. [Conclusions] The proposed MFDR model, based on geographical preference refinement and implicit social-relationship preference analysis, is an effective way to recommend POIs.
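A minimal sketch of matrix factorization with a social regularization term, in the spirit of (but not identical to) the MFDR objective: besides the usual squared-error gradient with L2 decay, each user's latent vector is pulled toward the mean of their friends' vectors. All names, data, and hyperparameters below are illustrative.

```python
def sse(U, V, ratings):
    """Sum of squared rating errors for the current factors."""
    return sum((r - sum(pu * qi for pu, qi in zip(U[u], V[i]))) ** 2
               for (u, i), r in ratings.items())

def mf_social_step(U, V, ratings, friends, lr=0.01, reg=0.02, social=0.1):
    """One SGD pass: error gradient + L2 weight decay + a term pulling
    each user's vector toward the mean of their friends' vectors."""
    for (u, i), r in ratings.items():
        err = r - sum(pu * qi for pu, qi in zip(U[u], V[i]))
        flist = friends.get(u, [])
        fmean = [sum(U[f][k] for f in flist) / max(len(flist), 1)
                 for k in range(len(U[u]))]
        pu_old = U[u][:]
        U[u] = [pu + lr * (err * qi - reg * pu - social * (pu - fm))
                for pu, qi, fm in zip(U[u], V[i], fmean)]
        V[i] = [qi + lr * (err * pu - reg * qi)
                for pu, qi in zip(pu_old, V[i])]

# Tiny check-in style example with hypothetical users and POIs.
U = {"u1": [0.1] * 4, "u2": [0.1] * 4}
V = {"p1": [0.1] * 4, "p2": [0.1] * 4}
ratings = {("u1", "p1"): 5.0, ("u2", "p1"): 4.0, ("u1", "p2"): 1.0}
friends = {"u1": ["u2"], "u2": ["u1"]}
before = sse(U, V, ratings)
for _ in range(300):
    mf_social_step(U, V, ratings, friends)
print(before, "->", sse(U, V, ratings))  # error shrinks over the epochs
```

Factorizing both matrices with a shared user factor, as the abstract describes, would replace the single `ratings` loop with two coupled loss terms; the social pull above is the simplest form of that consistency constraint.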
[Objective] This paper explores evaluation methods for the information services of online health Q&A platforms, aiming to promote their sustainable development. [Methods] We introduced the SERVQUAL framework and established assessment indicators and an extension evaluation model. [Results] We examined the proposed model with Dingxiang Doctor, a health Q&A platform in China, to evaluate the quality of its information services. We found its quality grade was 3, and the characteristic value of the grade variable was 2.955. These results indicated that Dingxiang Doctor maintains good services; however, its reliability, assurance, and empathy need to be improved. [Limitations] The sample of this research is small, and the expert scoring method might be subjective. [Conclusions] The matter-element model and extension evaluation method can help us evaluate and improve the services of online health Q&A platforms.
[Objective] This paper uses active learning methods, structured abstracts, and a few annotations to create a classification model for sentence functions, aiming to reduce the dependence on manually labeled corpora. [Methods] First, we trained SVM, CNN, and Bi-LSTM classifiers with structured function sentences from abstracts. Then, with the help of active learning techniques, we predicted the functions of a large number of unlabeled common abstract sentences. Third, we automatically identified uncertain samples for manual annotation, which were used to optimize the initial classifiers. Finally, we iterated this active learning procedure to improve the classifiers’ performance. [Results] We examined the new method with Library and Information Science literature. The precision, recall, and F1 values were 84.65%, 84.49%, and 84.57%, which were 3.25%, 3.24%, and 3.25% higher than those of the traditional methods. [Limitations] We only conducted five iterations to avoid the massive work of manual corpus annotation. [Conclusions] The active learning method could effectively discover the differences between unlabeled corpora and the existing training corpus, reducing manual labeling costs. The proposed method might also be used in citation and full-text analysis.
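The uncertain-sample selection step can be illustrated with a least-confidence sketch, a common uncertainty-sampling criterion; the paper does not state which criterion it uses, and the class probabilities below are invented.

```python
def least_confident(probas, k):
    """Uncertainty sampling: return indices of the k unlabeled samples
    whose top predicted class probability is lowest. These are the
    samples sent for manual annotation and added to the training set."""
    ranked = sorted(range(len(probas)), key=lambda i: max(probas[i]))
    return ranked[:k]

# Hypothetical classifier outputs over three sentence-function classes.
probas = [[0.34, 0.33, 0.33],   # very uncertain
          [0.90, 0.05, 0.05],   # confident
          [0.50, 0.45, 0.05]]   # somewhat uncertain
print(least_confident(probas, 2))  # [0, 2]
```

Each active learning iteration would call this on the pool of unlabeled abstract sentences, collect human labels for the returned indices, and retrain the classifiers.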
[Objective] This paper proposes a modified collaborative filtering algorithm, aiming to improve the results of personalized recommendations. [Methods] First, we evaluated item quality and corrected user ratings based on their previous records. Then, we identified users with similar interests to generate better recommendations. [Results] We tested the new algorithm on the MovieLens dataset and found its MAE was 4.7% better than those of the traditional or other modified methods. [Limitations] The new algorithm does not address interest drifting issues. [Conclusions] The proposed algorithm could recommend products to consumers more effectively.
[Objective] This paper integrates topic information into the TextRank model, aiming to improve the precision and recall of automatic keyword extraction. [Methods] First, we used LDA to model document topics and obtained the topic distribution of the candidate keywords. Then, we calculated the node weights with the topic-word probability distribution features. Third, we weighted the probability distributions of document-topic and topic-word characteristics as each node’s random jump probability. Finally, we constructed a new transition matrix for word graph iteration to improve the TextRank model. [Results] We examined the proposed model with 1,559 news articles from the website of Southern Weekly. When the number of extracted keywords was three, the model’s keyword extraction precision values were 4.7% and 6.5% higher than those of the original TextRank and TF-IDF algorithms. [Limitations] The fusion algorithm increased computational complexity. [Conclusions] The proposed algorithm could extract keywords more effectively.
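The topic-biased random jump described above can be sketched as a small biased-PageRank iteration over the word graph. The toy graph, topic probabilities, and function name are illustrative assumptions, not the paper's exact transition matrix.

```python
def topic_textrank(graph, topic_prob, d=0.85, iters=50):
    """TextRank-style power iteration where the (1 - d) random-jump
    mass is distributed by each word's topic probability instead of
    uniformly.  graph: {word: {neighbor: edge weight}};
    topic_prob: {word: LDA-derived topic probability}."""
    nodes = list(graph)
    z = sum(topic_prob[n] for n in nodes)
    bias = {n: topic_prob[n] / z for n in nodes}     # normalized jump vector
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        score = {n: (1 - d) * bias[n] + d * sum(
                     score[m] * w[n] / sum(w.values())
                     for m, w in graph.items() if n in w)
                 for n in nodes}
    return score

# Hypothetical co-occurrence graph and topic probabilities.
graph = {"deep": {"learning": 1.0},
         "learning": {"deep": 1.0, "model": 1.0},
         "model": {"learning": 1.0}}
topic_prob = {"deep": 0.2, "learning": 0.6, "model": 0.2}
score = topic_textrank(graph, topic_prob)
print(max(score, key=score.get))  # "learning"
```

A word that is both well connected and topically central ends up ranked highest, which is the intended effect of fusing LDA with TextRank.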
[Objective] This paper tries to improve the recommendation algorithm, aiming to reduce the dependence on the number of groups (the k value) at the categorization stage. [Methods] We used the ISA algorithm to modify the collaborative filtering algorithm and finish the clustering tasks from the perspectives of users and items. Then, we created a virtual user representing group interests based on users’ expertise. Finally, we predicted the target users’ ratings with the new collaborative filtering algorithm. [Results] The algorithm removes the empirical dependence on k and improves the accuracy of the collaborative filtering recommendation algorithm. The MAE was reduced to 0.697 with 200 groups and to 0.693 with 500 groups on the FilmTrust dataset. The RMSE was reduced to 1.022 on the MovieLens dataset. [Limitations] Several rounds of repeated experiments are needed to improve the quality of this study. [Conclusions] The algorithm does not depend on an empirically chosen k and effectively improves the performance of the collaborative filtering recommendation algorithm.
[Objective] This paper proposes a model using machine learning techniques and various omics data, aiming to better predict the survival length of breast cancer patients. [Methods] The prediction model was established with the random forest algorithm. It merged four types of omics data of breast cancer cases from the TCGA database: gene expression, copy number variation, DNA methylation, and protein expression. [Results] On the test dataset, the model’s prediction precision reached 97.22%, and the recall was 98.13%. Compared with the existing models, the AUC value of our new algorithm was the highest (0.8393). [Limitations] The sample size needs to be expanded. [Conclusions] The proposed method is an effective way to predict breast cancer patients’ survival length.
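The random-forest idea behind the model can be illustrated with a toy majority-vote ensemble of depth-1 stumps trained on bootstrap samples. This is a didactic sketch only, not the authors' multi-omics model; the data and all names are invented.

```python
import random
from collections import Counter

def train_forest(X, y, n_trees=25, seed=7):
    """Toy random-forest flavor: each 'tree' is a depth-1 stump fit on
    a bootstrap sample, splitting a random feature at its sample mean
    and predicting the majority label on each side."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]     # bootstrap sample
        f = rng.randrange(len(X[0]))                 # random feature
        thr = sum(X[i][f] for i in idx) / len(idx)
        left = Counter(y[i] for i in idx if X[i][f] <= thr)
        right = Counter(y[i] for i in idx if X[i][f] > thr)
        stumps.append((f, thr,
                       left.most_common(1)[0][0] if left else y[0],
                       right.most_common(1)[0][0] if right else y[0]))
    return stumps

def predict(stumps, x):
    """Majority vote over all stumps."""
    votes = Counter(l if x[f] <= thr else r for f, thr, l, r in stumps)
    return votes.most_common(1)[0][0]

# Invented one-feature data standing in for a merged omics matrix;
# label 1 marks the (hypothetical) longer-survival class.
X = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]
y = [0, 0, 0, 1, 1, 1]
stumps = train_forest(X, y)
print(predict(stumps, [0.0]), predict(stumps, [1.0]))
```

Real random forests grow full trees over thousands of merged omics features, but bootstrap sampling, random feature choice, and majority voting are the same three ingredients.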
[Objective] This paper tries to predict the impacts of financial events/news on stock prices with financial, non-financial, and public opinion factors. [Methods] We designed a financial affairs ontology based on Rule-Based Reasoning (RBR) and Case-Based Reasoning (CBR). Then, we created an SWRL rule-based reasoning model, which performed the rule-based reasoning with the Drools engine. Third, we designed a topic case database to describe the structure of the financial cases. Finally, we used the model to describe, retrieve, reuse, revise, and retain cases. [Results] We conducted an empirical study with enterprise data to examine the reliability of rule-based and case-based reasoning. [Limitations] We did not compare our model with existing methods. [Conclusions] The proposed method could predict stock prices in a big data environment.
[Objective] This paper modifies the method for new word extraction, which is used to improve the performance of medical text segmentation models. [Methods] With the help of the traditional mutual information model, we obtained statistics on words and strings. Then, we established a logistic regression classification model with these data and built an algorithm for new word identification. [Results] A series of experiments was carried out on the texts of electronic medical records from the Dermatology Department of Xiangya Hospital. Compared with PMI, PMI², and PMI³, our model with logistic regression achieved the highest accuracy of new word extraction (0.803). [Limitations] To establish the logistic regression model for classification, we had to manually judge whether or not the training strings are words. [Conclusions] The proposed model and algorithm could effectively identify new words from medical records.
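The mutual information statistics underlying these baselines can be sketched as follows; the PMI^k generalization covers the PMI² and PMI³ variants mentioned above. The counts below are invented for illustration, and such scores would serve as features for the logistic regression classifier.

```python
import math

def pmi(c_xy, c_x, c_y, total, k=1):
    """PMI^k score of a candidate string xy:
    log2( p(xy)^k / (p(x) * p(y)) ).  k=1 is plain pointwise mutual
    information; k=2 and k=3 give the PMI^2 / PMI^3 variants, which
    are less biased toward very rare pairs."""
    p_xy, p_x, p_y = c_xy / total, c_x / total, c_y / total
    return math.log2(p_xy ** k / (p_x * p_y))

# A pair that nearly always co-occurs (a plausible new medical term)
# scores far above frequent characters that rarely meet by chance.
strong = pmi(50, 60, 55, 10_000)   # co-occurs almost every time
chance = pmi(5, 500, 400, 10_000)  # frequent parts, rare together
print(strong, chance)
```

Because p(xy) < 1, raising it to the power k lowers all scores, but it lowers rare-pair scores more, which is why the variants behave differently as ranking functions.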
[Objective] This paper analyzes the fine-grained characteristics of funding and paper data in English, aiming to identify the frontiers of scientific research. [Methods] We retrieved NSF-funded projects and WOS papers in the field of carbon nanotubes and identified their LDA topics. Then, we compared the topics’ novelty, intensity, and similarity. [Results] We found two trending topics, five emerging topics, four dying topics, and two potential topics. [Limitations] We did not evaluate our method with data in Chinese. [Conclusions] Compared with methods relying on a single data source or dimension, our method can identify the frontiers of scientific research more effectively.
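Topic similarity between funding topics and paper topics can be operationalized in several ways; one common choice is Jensen–Shannon divergence between their topic-word distributions. The abstract does not specify the exact measure, so the sketch below is an assumption, with toy two-word distributions.

```python
import math

def js_divergence(p, q):
    """Jensen–Shannon divergence (in bits) between two topic-word
    distributions; lower values mean the funding topic and the paper
    topic are more similar.  Symmetric and bounded in [0, 1]."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Identical topics diverge by 0; disjoint topics reach the maximum 1.
print(js_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # 1.0
```

Unlike raw KL divergence, JS is symmetric and finite even when a word has zero probability in one topic, which matters when comparing LDA models trained on different corpora (here, NSF projects vs. WOS papers).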