Data Analysis and Knowledge Discovery

Visualizing Appropriation of Research Funding with t-SNE Algorithm

Chen Ting,Li Guopeng,Wang Xiaomei

Data Analysis and Knowledge Discovery. 2018, 2(8): 1-9. https://doi.org/10.11925/infotech.2096-3467.2018.0251

[Objective] This paper designs a visualization method for the appropriation of research funding, aiming to present the locations of funded projects more effectively. [Methods] First, we retrieved 4,669 funded projects from NSF’s Information and Intelligent Systems division. Then, we added topic tags to these projects using a clustering algorithm and human interpretation. Third, we extracted high-dimensional text features from the application documents with the TF-IDF and LSA models. Fourth, we used the t-SNE algorithm to project the high-dimensional features into two- or three-dimensional spaces for visualization. Finally, we examined the visualization results against the pre-classified topic labels. [Results] The proposed method created maps of funded projects in both two- and three-dimensional spaces. [Limitations] The algorithm parameters need to be adjusted manually. More research is needed to evaluate the proposed method with documents of projects funded by other agencies. [Conclusions] The proposed method can generate maps of funded projects, which makes it a helpful tool for scientific management.
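
The feature-extraction step named in the abstract above can be illustrated with a minimal, standard-library sketch of TF-IDF weighting. It is not the authors' implementation: the whitespace tokenizer, the unsmoothed `log(N / df)` weighting, the function name `tfidf_vectors`, and the three sample titles are all assumptions made for illustration. The later LSA and t-SNE steps are not shown; in practice they would usually come from a library (for example scikit-learn's `TruncatedSVD` and `TSNE`).

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return (vocabulary, vectors) with one TF-IDF weight list per document."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # guard against empty documents
        vectors.append(
            [(counts[t] / total) * math.log(n_docs / df[t]) for t in vocab]
        )
    return vocab, vectors


# Hypothetical project titles, not taken from the NSF data set.
docs = [
    "robot motion planning",
    "robot vision learning",
    "database query planning",
]
vocab, vectors = tfidf_vectors(docs)
```

Each row of `vectors` has one weight per vocabulary term, so terms that occur in fewer documents (such as "motion") receive a higher weight than terms shared across documents (such as "robot").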

Building Childhood Asthma Prediction Model with Artificial Neural Network and BRFSS Database

Ma Xiaoyu,Zhang Han,Zhao Yuhong

Data Analysis and Knowledge Discovery. 2018, 2(8): 10-15. https://doi.org/10.11925/infotech.2096-3467.2018.0205

[Objective] This study tries to identify highly correlated variables with significant impacts on childhood asthma, aiming to establish a predictive model that does not require invasive clinical indicators. [Methods] First, we used statistical methods to identify the needed variables from the BRFSS database. Second, we employed a back propagation (BP) artificial neural network to build the prediction model. Finally, we compared the performance of the new model with three other methods: traditional logistic regression, decision tree and support vector machine. [Results] The identified variables included history of asthma, correct use of inhaler, age of diagnosis, and family income. The proposed model has an accuracy of 0.723, a sensitivity of 0.697 and a specificity of 0.680. [Limitations] The BRFSS database has many missing values, which may influence the prediction accuracy. [Conclusions] The self-adaptable BP artificial neural network could help us establish better prediction models for childhood asthma.
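
For readers less familiar with the three reported measures, the sketch below shows how accuracy, sensitivity and specificity are derived from binary predictions. The function name `confusion_metrics` and the ten sample labels are hypothetical; they are not BRFSS data and do not reproduce the figures in the abstract.

```python
def confusion_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity and specificity for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of actual positives found
        "specificity": tn / (tn + fp),  # share of actual negatives found
    }


# Hypothetical labels: 1 = asthma, 0 = no asthma.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
metrics = confusion_metrics(y_true, y_pred)
```

When all three measures are computed on the same sample, accuracy is a prevalence-weighted average of sensitivity and specificity.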

Exploring the Influential Factors of Askers’ Intention to Pay in Knowledge Q&A Platforms

Zhao Yuxiang,Liu Zhouying,Song Shijie

Data Analysis and Knowledge Discovery. 2018, 2(8): 16-30. https://doi.org/10.11925/infotech.2096-3467.2018.0216

[Objective] This paper explores the factors influencing askers’ intention to pay, aiming to promote the development of knowledge Q&A platforms. [Methods] First, we compared the similarities and differences between the traditional and the new generation of knowledge Q&A platforms after conducting a comprehensive literature review. Then, we built a model and conducted an empirical study on the askers’ intention to pay based on the theories of social exchange and social capital. [Results] Perceived value had a significant positive impact on the askers’ intention to pay. The financial benefit, social support, self-enhancement and entertainment had significant positive impacts on perceived value, while the financial cost had a significant negative impact on the askers’ intention to pay. The positive reciprocity belief had a significant positive effect on the financial cost, and the askers’ trust in the answerer positively moderated the relationship between perceived value and intention to pay. [Limitations] The study only employed cross-sectional data, and most of the data were self-reported. [Conclusions] The research contributes to the theoretical foundation for examining askers’ intention to use payment-based knowledge Q&A platforms. It also offers some practical suggestions for the design and management of these new systems.

The Study on the Temporal and Spatial Distribution of Event Tourism Based on Large-scale Tourism Early Warning Platform

Wang Ling,Dai Qianjin,Wu Xiaojun

Data Analysis and Knowledge Discovery. 2018, 2(8): 31-40. https://doi.org/10.11925/infotech.2096-3467.2017.1002

[Objective] This paper visualizes the big data of festival visitors, aiming to analyze their movement patterns and influencing factors. [Methods] We used GIS tools to analyze the tourist flow data of 80 scenic spots during the Shanghai Tourism Festival, and constructed an econometric model to examine the influencing factors. [Results] We found that initiation tourism resources, which broke the obstacles facing event tourists, included the motivation of tourists gathering and rapid flows. The number of tourists declined from the multiple event centers to surrounding areas. The time distribution of tourist flow did not follow the classic “inverted U-shape”, which led to more agglomeration effects. Tourism resource endowment, traffic conditions, competitiveness of tourism products, and tourism reception could all promote tourist gathering, while facilities (i.e., capacity) were no longer the key element in attracting visitors. [Limitations] More research is needed to discuss the dynamic path of tourist flow. [Conclusions] GIS and big data technology can be used to present visitor flows.

Comparing Text Vector Generators for Weibo Short Text Classification

Li Xinlei,Wang Hao,Liu Xiaomin,Deng Sanhong

Data Analysis and Knowledge Discovery. 2018, 2(8): 41-50. https://doi.org/10.11925/infotech.2096-3467.2018.0322

[Objective] This paper uses the Word2Vec and Sent2Vec algorithms to generate vectors for the text posts of Sina Weibo, aiming to achieve lower computational cost and higher efficiency in text classification. [Methods] First, we classified the posts using a 0-1 word matrix and used the results as the baseline. Then, we used the Word2Vec algorithm to generate word vectors and built vector representations of the sentences in different ways. Third, we classified the Weibo posts using sentence vectors generated by the Sent2Vec algorithm. Finally, we comprehensively evaluated the advantages and disadvantages of the three methods. [Results] Both the Word2Vec and Sent2Vec algorithms reduced the number of text features significantly. The baseline used 30,000 words as features, whereas Word2Vec and Sent2Vec reduced the number of features to fewer than 1,000. The classification accuracy of the Word2Vec algorithm was 75.14%, which was 3% lower than the baseline. The accuracy of the Sent2Vec algorithm was far lower than that of the other two methods, at only 63.08%. [Limitations] The corpus size of this paper needs to be expanded. We found that the Word2Vec algorithm did not have enough semantic information to calculate word vectors, and Sent2Vec yielded poor classification results for Chinese sentence vectors. [Conclusions] The Word2Vec algorithm is suitable for large-scale corpus classification, while words should be used as classification features when text is scarce.
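
The abstract says sentence vectors were built from Word2Vec word vectors "in different ways" without naming them. One common choice, assumed here purely for illustration, is to average the vectors of the words in a post. The function name `average_sentence_vector` and the tiny two-dimensional vocabulary are hypothetical.

```python
def average_sentence_vector(tokens, word_vectors, dim):
    """Average the vectors of known tokens; return zeros if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


# Hypothetical 2-dimensional word vectors (real models use hundreds of dims).
word_vectors = {
    "good": [1.0, 0.0],
    "movie": [0.0, 1.0],
    "bad": [-1.0, 0.0],
}
sentence = average_sentence_vector(["good", "movie", "unknown"], word_vectors, 2)
```

Whatever the length of the post, the resulting vector has the same dimensionality as the word vectors, which is how a vocabulary of tens of thousands of word features can shrink to a few hundred dense features.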

Sentiment Analysis for Micro-blogs with LDA and AdaBoost

Zeng Ziming,Yang Qianwen

Data Analysis and Knowledge Discovery. 2018, 2(8): 51-59. https://doi.org/10.11925/infotech.2096-3467.2018.0060

[Objective] The paper aims to improve the performance of sentiment analysis for micro-blog texts with the help of the LDA model and the AdaBoost algorithm. [Methods] First, we used the LDA topic model to extract topics of micro-blog posts. Then, we merged the emotional and sentence pattern features. Finally, we trained the proposed sentiment analysis model with the AdaBoost ensemble classification method. [Results] The topic feature had a significant positive impact on emotion recognition; therefore, the model with topic and emotional features yielded the best results. The precision of the proposed model reached 84.512%, while the recall reached 83.160%. [Limitations] The sample size needs to be expanded, and the sentiment dictionary should be improved too. We did not study the emoticons from the micro-blog posts. [Conclusions] The proposed AdaBoost model with LDA could effectively identify emotional tendencies.

Sentiment Mining of Online Product Reviews Based on Domain Ontology

He Youshi,He Shufang

Data Analysis and Knowledge Discovery. 2018, 2(8): 60-68. https://doi.org/10.11925/infotech.2096-3467.2017.1043

[Objective] This paper studies the relationship between product attributes and the emotional attitudes of consumers, aiming to optimize the sentiment analysis of consumer reviews. [Methods] First, we constructed the product domain ontology to extract the needed attributes. Then, we built the product attribute hierarchy model, which combined the collocation weight of emotional words with attribute words to identify implicit attributes. Third, we created a dictionary to calculate the emotional orientation of product attributes at all levels for the sentiment analysis. [Results] We examined the proposed model with online reviews of smart phones and found that it improved the accuracy of emotion classification. [Limitations] The construction of the ontology needs to be further improved. [Conclusions] The proposed method could effectively identify the logical relationships among attributes, which improves the performance of sentiment analysis in real-world cases.

Semantic Changes of Queries from Cross-device Searching

Wu Dan,Lu Liuxing

Data Analysis and Knowledge Discovery. 2018, 2(8): 69-78. https://doi.org/10.11925/infotech.2096-3467.2018.0109

[Objective] This paper studies the changes of queries in cross-device searching, aiming to improve users’ experience. [Methods] With the help of a user experiment, log analysis and cluster analysis, we examined the cross-device search queries for their length, diversity, and number of keywords, as well as the changes in their semantic similarities. [Results] The length and the number of keywords of queries from desktop devices were much higher than those from mobile devices. However, the diversity of queries did not change significantly. There were W, M, and V patterns in the semantic similarities among cross-device search queries. [Limitations] The number of experiment participants needs to be increased, which could generate more queries for future studies. [Conclusions] The changing patterns of query semantic similarities reflect users’ searching strategies, which benefits cross-device searching services.

Impacts of Waiting on Mobile Users: Case Study of Digital Novels

Ma Yanyang,Liu Yulei,Xu Bochu,Zhi Jinyi

Data Analysis and Knowledge Discovery. 2018, 2(8): 79-87. https://doi.org/10.11925/infotech.2096-3467.2017.1249

[Objective] This paper analyzes the waiting perception of users reading digital novels, aiming to explore the ways and factors affecting users’ satisfaction. [Methods] We used the novel bookshelf of QQ browser to create various experiments. Then, we obtained data on users’ changing satisfaction levels under different waiting perceptions through video observation, task prompts, questionnaires and in-depth interviews. Finally, we identified the relationships among the factors influencing satisfaction. [Results] We found that the waiting time, the filling, the function, the manipulation, the scene, and the context factors affected users’ satisfaction to different degrees, and all of these effects were statistically significant. [Limitations] The sample size needs to be expanded to include the under-represented population. [Conclusions] This study could help service providers improve the user experience.

Matching Strategies for Institution Names in Literature Database

Sun Haixia,Wang Lei,Wu Yingjie,Hua Weina,Li Junlian

Data Analysis and Knowledge Discovery. 2018, 2(8): 88-97. https://doi.org/10.11925/infotech.2096-3467.2018.0178

[Objective] This paper designs and implements matching strategies for institution names in literature databases, aiming to regulate their storage and management. [Methods] We first established seven name matching rules based on the institutions’ regions, types and naming characteristics. Then, we designed four hybrid matching strategies combining rules and Levenshtein distance. Finally, we evaluated the four hybrid strategies with institution names from the papers indexed by the Chinese Biomedical Literature (CBM) database during 2006-2011. [Results] More than six million affiliation strings from CBM were matched, which included higher education institutions, hospitals and research institutes. We found that the hybrid matching strategy based on region, naming characteristics and Levenshtein distance obtained the highest precision (all above 80%), recall (64.82%), and F-value (71.66%). [Limitations] The rules and related dictionary were mainly constructed with human experience and their coverage is limited. There are some errors in identifying institution names. The proposed strategy cannot address the issues caused by the transformative actions of institutions. [Conclusions] The proposed strategies could improve the performance of scientific research literature databases.
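
The edit-distance component of the hybrid strategies can be sketched as follows. This is a generic Levenshtein matcher, not the authors' rule set: the length-normalized similarity, the 0.8 threshold, the function names, and the sample institution names are assumptions for illustration, and the region, type and naming rules described in the abstract are not modelled.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a, b):
    """Map edit distance to a 0..1 score, normalized by the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def best_match(name, candidates, threshold=0.8):
    """Return the most similar candidate, or None if below the threshold."""
    if not candidates:
        return None
    score, match = max((similarity(name, c), c) for c in candidates)
    return match if score >= threshold else None


# Hypothetical authority list and a misspelled affiliation string.
authority = ["Peking University", "Tsinghua University"]
matched = best_match("Peking Univeristy", authority)
```

A threshold keeps distant strings from being forced onto the nearest authority name; in a hybrid strategy, rule-based filters (for example by region) would typically narrow `candidates` before the distance is computed.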

Finding Association Between Diseases and Genes from Literature Abstracts

Mu Dongmei,Jin Shan,Ju Yuanhong

Data Analysis and Knowledge Discovery. 2018, 2(8): 98-106. https://doi.org/10.11925/infotech.2096-3467.2018.0142

[Objective] This study tries to find associations between genes and diseases from literature abstracts, aiming to provide evidence for the prevention and treatment of diseases. [Methods] First, we established the entity extraction rules with the help of thesaurus-based recognition techniques. Then, we proposed a model to discover the associations between disease and gene entities. Finally, we validated the new model with abstracts of diabetic nephropathy studies. [Results] A total of 656 diabetic nephropathy associated genes were obtained, which included high frequency, mid frequency and low frequency genes. [Limitations] More research is needed to explore other diabetic complications with the proposed model. [Conclusions] (I) The high frequency associated genes of a disease are possibly the theoretical foundations of current research. (II) Intermediate frequency associated genes are the focus of current research. (III) Low frequency associated genes could become new fields for knowledge discovery.
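
The abstract does not specify how associations are scored, so the sketch below assumes the simplest possible reading: a gene is counted once for every abstract in which both a disease term and the gene symbol from a thesaurus appear. The function name, the placeholder gene symbols (`GENEA`, `GENEB`, `GENEC`) and the sample sentences are invented for illustration.

```python
import re
from collections import Counter


def count_associated_genes(abstracts, disease_terms, gene_symbols):
    """Count, per gene, the abstracts that mention it alongside the disease."""
    symbols = {g.upper() for g in gene_symbols}
    counts = Counter()
    for text in abstracts:
        lowered = text.lower()
        if not any(term.lower() in lowered for term in disease_terms):
            continue  # abstract does not mention the disease
        # Whole-token matching avoids hits inside longer words.
        tokens = {tok.upper() for tok in re.findall(r"[A-Za-z0-9]+", text)}
        for symbol in symbols & tokens:
            counts[symbol] += 1
    return counts


# Invented sentences with placeholder gene symbols.
abstracts = [
    "GENEA polymorphism was examined in diabetic nephropathy.",
    "Expression of GENEB and GENEA in diabetic nephropathy patients.",
    "GENEC signalling in an unrelated condition.",
]
counts = count_associated_genes(
    abstracts, ["diabetic nephropathy"], ["GENEA", "GENEB", "GENEC"]
)
```

Sorting `counts.most_common()` would then give a ranking from which high, intermediate and low frequency bands could be cut, with the cut-offs left as a further assumption.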

25 August 2018, Volume 2 Issue 8