Data Analysis and Knowledge Discovery

Generating News Clues with Biterm Topic Model

Zhao Tianzi, Duan Liang, Yue Kun, Qiao Shaojie, Ma Zijuan

Data Analysis and Knowledge Discovery. 2021, 5(2): 1-13. https://doi.org/10.11925/infotech.2096-3467.2020.1025

[Objective] This paper modifies a topic model to improve the quality of extracted news clues. [Methods] We constructed a News-IBTM model based on IBTM (Incremental Biterm Topic Model) with a dynamic sliding window, which narrowed the extraction scope of biterms. Then, we used this model to extract topics and topic-word distributions from news, and inferred the document-topic distributions. Finally, we used the JS (Jensen-Shannon) divergence to measure the difference between document-topic distributions and generate news clues. [Results] We examined our News-IBTM model with news from People’s Daily Online and Weibo. The proposed model outperformed existing ones in perplexity, accuracy and efficiency. [Limitations] The accuracy of the News-IBTM algorithm needs to be further improved. [Conclusions] The proposed method could effectively extract quality news topics and clues.
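
The JS divergence step mentioned in the abstract can be illustrated with a minimal sketch. This is not the News-IBTM implementation: the function names and the three-topic distributions below are hypothetical, and the base-2 logarithm is an assumption chosen so that the divergence falls in [0, 1].

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded in [0, 1] with base-2 logs."""
    # m is the midpoint distribution; m_i > 0 wherever p_i > 0 or q_i > 0,
    # so the KL terms never divide by zero.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Hypothetical document-topic distributions over three topics.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.1, 0.2, 0.7]
doc_c = [0.7, 0.2, 0.1]

print(js_divergence(doc_a, doc_b))  # larger value: the documents differ in topic mix
print(js_divergence(doc_a, doc_c))  # identical distributions give zero divergence
```

A small divergence suggests two documents share a topic mix, which is one plausible basis for grouping them under the same clue; how the paper turns divergences into clues is not stated in the abstract.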

Identifying Relationship Between Pollution Sources and Cancer Cases with Spatial Ordered Pair Patterns

Xie Wang, Wang Lizhen, Chen Hongmei, Zeng Lanqing

Data Analysis and Knowledge Discovery. 2021, 5(2): 14-31. https://doi.org/10.11925/infotech.2096-3467.2020.1026

[Objective] This paper tries to identify the relationship between pollution sources and cancer cases, aiming to address the problem that methods based on spatial co-location patterns discover too many non-pertinent patterns. [Methods] First, we combined the properties of the Voronoi diagram and the star instance model. Then, we defined the proximity relationship between spatial instances and the concept of spatial ordered pair patterns. Third, we determined the prevalence and the influence of the spatial ordered pair patterns based on the distance attenuation and the influence superposition effects. Finally, we proposed a basic algorithm and an optimization algorithm to examine the spatial ordered pair patterns. [Results] The proposed algorithms revealed more pertinent relationships that the traditional algorithms cannot identify. The total number of results was also much smaller than that of the traditional algorithms. Compared with the basic algorithm, the pruning rate of the optimization algorithm surpassed 80%. The larger the data set, the better the results. [Limitations] The default data are all point-spatial objects, while the extended spatial objects merit more studies. [Conclusions] The spatial ordered pair patterns could effectively identify the relationship between pollution sources and cancer cases.

Identifying Leaders and Dissemination Paths of Public Opinion

Xu Yabin, Sun Qiutian

Data Analysis and Knowledge Discovery. 2021, 5(2): 32-42. https://doi.org/10.11925/infotech.2096-3467.2020.1027

[Objective] This study proposes a new method to monitor social media, aiming to limit or guide the spread of public opinion. [Methods] First, we constructed an OLMT model to identify opinion leaders based on the dissemination force and topological potential. Then, we modified the Transformer model to build a social media behavior prediction model (MF-Transformer) with high parallelism and an attention mechanism. [Results] The proposed models identified opinion leaders and their retweeting behaviors, as well as the main dissemination paths of online public opinion. The recall and accuracy of the predicted results were 92.17% and 99.07%, respectively, which were higher than those of the existing methods. [Limitations] We only examined our new models with data from Sina Weibo. [Conclusions] The proposed models could effectively identify online opinion leaders, as well as predict the dissemination paths of their comments and retweets.

Grouping Microblog Users of Trending Topics Based on Sentiment Analysis

Zhang Mengyao, Zhu Guangli, Zhang Shunxiang, Zhang Biao

Data Analysis and Knowledge Discovery. 2021, 5(2): 43-49. https://doi.org/10.11925/infotech.2096-3467.2020.1059

[Objective] This paper proposes a model to group users of Weibo trending topics. [Methods] First, we computed the sentiment of users’ texts with a sentiment dictionary. Then, we combined sentiment and text vector representations to determine the characteristics of user opinion. Finally, we grouped similar users with the K-means method. [Results] The proposed model divided users into three categories, and the value of the evaluation index (CA) reached 78.2%. [Limitations] Our model needs the number of categories to be defined before dividing user groups. [Conclusions] The proposed model could effectively group users with the same sentimental views.
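
The grouping step can be sketched with a plain K-means (Lloyd's algorithm). This is only an illustration under stated assumptions: the two-dimensional user vectors (a sentiment score plus one text feature) are made up, the `kmeans` helper is hypothetical, and initializing centroids from the first k points is a simplification rather than the paper's procedure. Only the choice of three groups comes from the abstract.

```python
import math

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; the first k points serve as initial centroids (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids

# Hypothetical user vectors: [sentiment score, text feature].
users = [[0.9, 0.8], [-0.8, 0.1], [0.0, 0.5],
         [0.8, 0.9], [-0.9, 0.2], [0.1, 0.4]]

labels, centroids = kmeans(users, k=3)
print(labels)
```

The number of groups `k` must be fixed in advance, which is the limitation the abstract itself points out.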

Topic Recognition and Key-Phrase Extraction with Phrase Representation Learning

Zhang Jinzhu, Yu Wenqian

Data Analysis and Knowledge Discovery. 2021, 5(2): 50-60. https://doi.org/10.11925/infotech.2096-3467.2020.0060

[Objective] This paper designs a topic recognition and key-phrase extraction method based on phrase representation learning, aiming to address these tasks from a more specific perspective. [Methods] First, we constructed sequences for the extracted phrases with dependency syntax analysis. Then, we modified the word representation learning model to process the phrase semantic vectors. Third, we developed a topic recognition method based on the vector clustering technique. Fourth, we constructed the sequence of phrase topics with the phrases and the corresponding topic category numbers. Finally, we proposed a Topic-Phrase to Vector (TP2Vec) model to extract topic-related phrases. [Results] Compared with the LDA model, the average similarity among topics of the proposed model was reduced by up to 0.27. The extracted representative words were semantically related to the topics, and the results were more readable and interpretable. [Limitations] More research is needed to examine the proposed method with data sets from other fields. [Conclusions] The proposed method could effectively identify research topics and related phrases, which might be applied to other fields.

Analyzing Highly Cited Papers Sponsored by National Natural Science Foundation of China

He Xueyao, Ma Tingcan, Yue Mingliang, Ou Guiyan

Data Analysis and Knowledge Discovery. 2021, 5(2): 61-69. https://doi.org/10.11925/infotech.2096-3467.2020.0691

[Objective] This study analyzes the Essential Science Indicators (ESI) of highly cited papers, aiming to assess the scientific outcomes of research projects sponsored by the National Natural Science Foundation of China (NSFC). [Methods] We compared the total number and citations of highly cited articles sponsored by NSFC, U.S. funding, other Chinese funding, or no funding. [Results] The number of highly cited papers sponsored by NSFC soared from 2009 to 2018, but remained lower than the number supported by U.S. funding. More than 80% of highly cited papers from China were funded by NSFC. [Limitations] We only studied highly cited papers in English. [Conclusions] NSFC plays a significant role in promoting scientific publication and expanding influence.

Health Information Readability Affects Users’ Cognitive Load and Information Processing: An Eye-Tracking Study

Ke Qing, Ding Songyun, Qin Qin

Data Analysis and Knowledge Discovery. 2021, 5(2): 70-82. https://doi.org/10.11925/infotech.2096-3467.2020.0666

[Objective] This paper analyzes the impacts of health information readability on users’ cognitive load and information processing. [Methods] We created two sets of health education webpages with high and low readability as experimental materials for the eye-tracking tests. Then, we explored the mediating effects of cognitive load and the moderating effects of gender and task complexity. [Results] We found that readability had significant impacts on saccade distance, as well as the total fixation duration and counts. Readability also significantly influenced the total duration of completed tasks and the accuracy of search results. Task complexity moderated the influence of readability on the time of first fixation. [Limitations] We did not consider the subjective factors of readability and the participants were mainly college students. A self-report method should be included in future studies. [Conclusions] This study promotes user information behavior research to the level of information processing. Improving readability visually could reduce users’ cognitive load, promote utilization efficiency of information, and optimize user searching experience.

Optimizing Quality Evaluation for Answers of Q&A Community

Shen Wang, Li Shiyu, Liu Jiayu, Li He

Data Analysis and Knowledge Discovery. 2021, 5(2): 83-93. https://doi.org/10.11925/infotech.2096-3467.2020.0626

[Objective] This paper tries to construct a new quality evaluation system for answers from a Q&A community (Zhihu in China). [Methods] First, we established quality criteria based on user evaluation and data characteristics. Then, we created vectors for the answers. Third, we used the SVM model to learn the label representation of texts and measured the accuracy of text classification. [Results] The proposed system yielded a classification accuracy of 85.32%, which is higher than that of a system using only user evaluation criteria (61.44%) and that of one using only data characteristics (79.10%). [Limitations] Our evaluation method might be biased due to the subjective annotations. [Conclusions] The proposed method is an effective way to evaluate answer quality of the Q&A community.

Music Recommendation Method Based on Multi-Source Information Fusion

Li Danyang, Gan Mingxin

Data Analysis and Knowledge Discovery. 2021, 5(2): 94-105. https://doi.org/10.11925/infotech.2096-3467.2020.0521

[Objective] This paper creates a musical feature system based on multi-source information, aiming to address the cold start issue facing music recommendation and provide personalized services. [Methods] We proposed a two-stage model with multi-source information fused by a neural network algorithm. Then, we built the musical feature system and predicted the potential factor vectors of music. Finally, we generated the TopN recommendation list for the users. [Results] We examined our model with the Million Song Dataset. Compared with other models such as CNN, the F₁ value was improved by 9.13%, and the RMSE and MAE values were reduced by 8.08% and 3.91%, respectively. [Limitations] Our new method faces more constraints than end-to-end training methods, and training with the Mel-frequency spectrum demands much more memory. [Conclusions] The proposed model improves the performance of music recommendation services.

Analyzing Knowledge Demand and Supply of Community Question Answering with TF-PIDF

Li Ming, Li Ying, Zhou Qing, Wang Jun

Data Analysis and Knowledge Discovery. 2021, 5(2): 106-115. https://doi.org/10.11925/infotech.2096-3467.2020.0395

[Objective] This paper proposes a new method to study the knowledge demand and supply of community question answering, aiming to make effective targeted interventions. [Methods] First, we constructed novel word weight calculation models (TF-PIDF) for the questions and answers. Then, we obtained the main categories of demanded and supplied knowledge by clustering questions and answers, as well as the popularity of topics. Third, we paired the categories of knowledge demand and their supply counterparts. Fourth, we proposed an algorithm to calculate the popularity of knowledge demands. [Results] The proposed model was examined with topics on influenza from the Zhihu community. We found six categories of topics for knowledge demand and supply. The trending one was “epidemic”, which represented the most popular real-time needs. [Limitations] The identified topics rely on the topic meaning from feature word clustering. [Conclusions] The proposed method could effectively manage the knowledge demand and supply of community question answering.

Predicting Diabetic Complications with Unbalanced Data

Qiu Yunfei, Guo Lei

Data Analysis and Knowledge Discovery. 2021, 5(2): 116-128. https://doi.org/10.11925/infotech.2096-3467.2020.0353

[Objective] This paper addresses the classification issues facing unbalanced sample data, aiming to find a better solution and improve the prediction results of diabetic complications. [Methods] At the data level, we used the improved SMOTE oversampling algorithm (F_SMOTE) to change the class distribution of unbalanced data. At the algorithm level, we adopted the balanced accuracy and the areas under the ROC and PR curves as evaluation criteria. Finally, we compared the performance of four single classifier learning models and four ensemble learning models. [Results] Compared with the traditional oversampling algorithm, our F_SMOTE algorithm improved the prediction accuracy, ROC and PR by 1.49%, 3.43% and 8.05%, respectively. Compared with the single classifier learning model, our method improved the accuracy, ROC and PR by 9.73%, 14.07% and 46.79%, respectively. The combined F_SMOTE algorithm and Random Forest model reached 97.64% in accuracy, 98.91% in ROC and 96.64% in PR for unbalanced data. [Limitations] The coverage and efficiency of our model training need to be further improved. [Conclusions] This method creates a predictive analysis framework for researchers, which could also help doctors in disease diagnosis and prevention.
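
For readers unfamiliar with SMOTE, the sketch below shows the interpolation idea behind the standard algorithm. It is not the paper's F_SMOTE variant, whose modifications are not described in the abstract; the `smote` helper, its parameters, and the sample points are all hypothetical.

```python
import math
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic minority samples by interpolating toward near neighbours.

    Assumes the minority class has at least two samples.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # The k nearest minority samples to x, excluding x itself.
        neighbours = sorted((p for p in minority if p is not x),
                            key=lambda p: math.dist(p, x))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between x and the chosen neighbour.
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic

# Hypothetical minority-class samples with two features each.
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.3], [1.1, 1.1]]
new_points = smote(minority, n_new=5)
print(new_points)
```

Because each synthetic point is a convex combination of two real minority samples, oversampling changes the class distribution without copying existing records verbatim.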

Framework for Computing Trust in Online Short-Rent Platform Using Feature Selection of Images and Texts

Liang Jiaming, Zhao Jie, Zheng Peng, Huang Liushen, Ye Minqi, Dong Zhenning

Data Analysis and Knowledge Discovery. 2021, 5(2): 129-140. https://doi.org/10.11925/infotech.2096-3467.2020.0690

[Objective] This paper proposes a novel framework to compute consumer trust on online short-rent platforms. It provides multiple groups of low-dimension feature subsets for users to present their personal information, which addresses the issue of missing data. [Methods] We used rough-set feature selection based on an evolutionary algorithm to extract information from images and texts. [Results] The proposed framework reduced the dimension to 5% of the original feature set while classification accuracy remained unchanged. [Limitations] More research is needed to examine our model with data from overseas platforms. [Conclusions] The proposed framework could effectively compute users’ trust while protecting their privacy.

25 February 2021, Volume 5 Issue 2
