Data Analysis and Knowledge Discovery

Current Issue: Volume 6, Issue 10

Recommendation Method for Potential Factor Model Based on Time Series Drift

Ding Hao, Hu Guangwei, Wang Ting, Suo Wei

2022, 6 (10): 1-8. DOI: 10.11925/infotech.2096-3467.2021.1464

[Objective] This paper proposes a latent factor decomposition model based on time series drift, aiming to capture users’ changing interests and improve recommendation accuracy. [Methods] First, we built a model that combines the temporal evolution of user preferences with the impact of previous behaviors on current ones. Then, we constructed an auxiliary matrix to capture how users evolve. Finally, we introduced a time impact factor to balance the influence of current and past behaviors. [Results] We examined our model on three experimental datasets. Compared with the baseline method, accuracy improved on average by 40.02%, 3.75% and 19.81%, respectively. [Limitations] The evolution analysis of interest drift relies on historical data. When historical data are too sparse, other user information is needed for a cold start. [Conclusions] The proposed model generalizes better to fluctuating interests, analyzes the evolution of user interests accurately, and effectively improves the recommendation performance of enterprises.

Topic Clustering for Social Media Texts with Heterogeneous Graph Neural Networks

Feng Xiaodong, Hui Kangxin

2022, 6 (10): 9-19. DOI: 10.11925/infotech.2096-3467.2022.0038

[Objective] This paper develops an effective topic clustering method to address the semantic sparsity and multiple interactions of social media texts. [Methods] We modeled the multiple interaction relationships between social media users and online content with a heterogeneous information network. First, we used a word embedding method to obtain text representations as the initial input features. Then, we propagated and aggregated node representations with a heterogeneous graph neural network. Finally, we trained the model with the representations of text nodes and conducted unsupervised clustering of the topics. [Results] We examined our model on the English benchmark dataset. Its NMI for original posts and comments reached 0.8372 and 0.8689 respectively, higher than those of traditional LDA and of direct clustering on word or text embedding vectors from Word2Vec, Doc2Vec, or GloVe. [Limitations] Due to data limits, we did not examine the social relationships among users or online multimedia content. [Conclusions] The proposed model can effectively improve topic clustering for social media texts.
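
As a rough illustration of the NMI score reported above, the following Python sketch computes normalized mutual information between two cluster labelings. It assumes the arithmetic-mean normalization, 2·I(U;V) / (H(U) + H(V)); the abstract does not state which normalization the authors used, and the function names are illustrative only.

```python
from collections import Counter
from math import log


def entropy(labels):
    """Shannon entropy (natural log) of a list of cluster labels."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information between two labelings of the same items."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    joint = Counter(zip(a, b))
    return sum(
        (n_ij / n) * log(n * n_ij / (count_a[i] * count_b[j]))
        for (i, j), n_ij in joint.items()
    )


def nmi(a, b):
    """NMI with arithmetic-mean normalization (an assumption, see lead-in)."""
    if len(a) != len(b):
        raise ValueError("labelings must have the same length")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0 and h_b == 0:
        return 1.0  # both labelings put everything in a single cluster
    return 2 * mutual_information(a, b) / (h_a + h_b)
```

NMI ignores the label names themselves, so a clustering that matches the reference up to a renaming of clusters scores 1, while a clustering that is statistically independent of the reference scores 0.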

Recommending Research Collaborators Based on Scholar Profiling

Dong Wenhui, Xiong Huixiang, Du Jin, Wang Niuniu

2022, 6 (10): 20-34. DOI: 10.11925/infotech.2096-3467.2021.1457

[Objective] This paper helps scholars quickly find suitable research partners, thereby promoting research output and enhancing academic exchange. [Methods] First, we explored four dimensions of scholars’ characteristics (natural, interest, ability and social attributes) with the LDA topic model, the PageRank algorithm and social network analysis. Then, we constructed scholars’ profiles and recommended collaborators based on their preferences. [Results] We examined the proposed method with 14,007 articles, 13,292 citations and 11,869 authors in the field of Library and Information Science from the CNKI and CSSCI databases. A total of 20 potential collaborators with similar and complementary research interests were recommended to the target scholars. [Limitations] More research is needed to address the cold start issue and the contribution of authors in different positions of the author list. The data of the empirical study also need to be expanded. [Conclusions] The proposed model can effectively recommend potential research collaborators for the target scholars and has good application value.

Predicting Popularity of Emerging Topics with Multivariable LSTM and Bibliometric Indicators

Chen Wen, Chen Wei

2022, 6 (10): 35-45. DOI: 10.11925/infotech.2096-3467.2022.0075

[Objective] This paper identifies emerging topics from multi-source data and constructs a multivariable LSTM with bibliometric indicators to predict their popularity. [Methods] First, we explored the topics of funded projects, papers and patents. Second, we identified the emerging ones based on their novelty, growth and persistence. Finally, we predicted these topics’ popularity with the multivariable LSTM model and four indicators: funding amounts, number of funding awards, average citation counts per article, and number of patent IPC subclasses. [Results] We examined our new model with studies on solid oxide fuel cells, where it performed better than BP, KNN, SVM and univariate LSTM. Our model had the lowest MAE (16.534) and RMSE (23.494), as well as the highest R² (0.642). [Limitations] We did not include each patent’s citation count because it was difficult to obtain specific data for each time window. [Conclusions] The modified LSTM could effectively predict the popularity of emerging topics.
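
For readers less familiar with the three error measures quoted above, here is a minimal Python sketch of their standard definitions. The toy values in the usage note are invented for illustration and have nothing to do with the paper's data.

```python
from math import sqrt


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

With `y_true = [1.0, 2.0, 3.0]` and `y_pred = [1.0, 2.0, 5.0]`, the single error of 2 gives an MAE of 2/3, and R² drops to -1 because the residual sum of squares (4) is twice the total sum of squares (2). Lower MAE and RMSE and higher R² indicate a better fit.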

Analyzing Structures of Medical Imaging Diagnosis Reports

Sheng Yu, Hu Huirong, Wang Congcong, Yang Shengyi

2022, 6 (10): 46-56. DOI: 10.11925/infotech.2096-3467.2022.0085

[Objective] This paper tries to turn medical imaging diagnosis reports into structured data, aiming to extract information effectively from these free-text reports. [Methods] First, we analyzed the text characteristics of medical imaging diagnosis reports and proposed a structuring method based on entity recognition and rule extraction. Then, we annotated 800 reports to construct datasets for model evaluation. [Results] The proposed method had a precision of 0.87 for all entities from the medical imaging diagnosis reports, which was 4.03% higher than that of BERT-BiLSTM-CRF. Its recall was also 2.81% higher than that of BERT-BiLSTM-CRF. Compared with the dependency analysis method, the proposed model improved the recognition precision of medical exam items and results by 5.62% and 2.31%, respectively. [Limitations] We only examined the proposed method with diagnostic PET-CT imaging reports from one hospital. [Conclusions] This study successfully converts the free text of medical imaging diagnosis reports into structured data. It not only optimizes the classification, storage, and retrieval of medical reports, but also supports future research on medical imaging.

Extracting Patent Keywords by Integrating Restriction Relationship

Yu Yan, Zhu Shengchen

2022, 6 (10): 57-67. DOI: 10.11925/infotech.2096-3467.2021.1458

[Objective] This paper tries to improve the accuracy of patent keyword extraction by using the characteristics of patent claims. [Methods] We examined the restriction relationships between the technical features of patent claims. Then, we integrated these relationships into a graph-based patent keyword extraction method. [Results] We examined our model with the USPTO and Baiten patent datasets. The MRR of our method was 31.79% (USPTO) and 33.81% (Baiten) higher than that of the traditional TextRank method. [Limitations] The data of our experimental analysis need to be further expanded. [Conclusions] The proposed method could significantly improve the accuracy of patent keyword extraction.

Quantifying Logical Relations of Financial Risks with BERT and Mutual Information

Jia Minghua, Wang Xiuli

2022, 6 (10): 68-78. DOI: 10.11925/infotech.2096-3467.2022.0009

[Objective] This paper tries to support the prevention and control of financial risks by quantifying their logical relationships, which also improves the reliability of word frequency processing for financial events. [Methods] We proposed a quantitative analysis method for the logical relations of financial risks based on BERT and mutual information combined with domain knowledge. Then, we quantified the relations with COPA and financial datasets. [Results] The proposed model effectively addressed the issue of unreliable word frequency quantization. Its accuracy reached 80.1%, which was 3.1%~37.4% higher than the benchmark models. [Limitations] More research is needed to examine our new model with non-financial and other corpora. [Conclusions] Our new method can reveal the evolutionary path of financial risk events and improve the quantitative presentation of their logical relationships.
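
The abstract does not give the exact mutual information formula or how it is combined with BERT. Purely as background, the sketch below shows pointwise mutual information (PMI), a common co-occurrence measure of this family, computed from raw counts. The function name and the count-based interface are assumptions made for this illustration, not the authors' method.

```python
from math import log2


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information, log2(p(x, y) / (p(x) * p(y))).

    count_xy: number of contexts in which events x and y co-occur
    count_x, count_y: number of contexts containing x (resp. y)
    total: total number of contexts
    """
    if count_xy == 0:
        return float("-inf")  # the two events never co-occur
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return log2(p_xy / (p_x * p_y))
```

A PMI of 0 means the two events co-occur exactly as often as independence would predict; positive values mean they co-occur more often than that. For example, `pmi(2, 2, 2, 4)` evaluates to 1.0 and `pmi(1, 2, 2, 4)` to 0.0.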

Prediction and Early Warning Model for Environmental Data and Circulatory System Disease Death with Machine Learning

Wang Yan, Xu Meimei, Tong Yujia, Gou Huan, Cai Rong, Shan Zhiyi, An Xinying

2022, 6 (10): 79-92. DOI: 10.11925/infotech.2096-3467.2022.0012

[Objective] This paper builds a prediction and early warning model for deaths from circulatory system diseases, aiming to improve disease prevention. [Methods] We retrieved death data on circulatory system diseases in a Chinese region from 2014 to 2018 and constructed the prediction model with GAM, RF and XGBoost. Then, we used the distributed lag nonlinear model to calculate the cumulative lag effects and built the early warning model. [Results] Sustained low and high temperatures, long sunshine hours and high concentrations of environmental pollutants increased the risk of death from circulatory system diseases. The cumulative weekly relative risks were 1.236, 1.130, 1.560, 1.062, 1.218, 1.153 and 1.796 respectively. The RMSEs of the RF and XGBoost models were 4.979 and 5.341, indicating good performance. Age, sex, temperature, sunshine hours, and SO₂, NO₂, CO, O₃, PM₁₀ and PM₂.₅ concentrations were the feature variables, and the early warning value was determined from the cumulative lag effect data. The early warning performed well: the sensitivity, specificity and area under the curve of the XGBoost prediction results were 0.948, 0.939 and 0.941 respectively. [Limitations] We need to add data on concomitant diseases and their progress. [Conclusions] The regional number of deaths is associated with increasing age, male sex, temperature, sunshine hours and pollutant concentration. The new prediction and early warning model could benefit disease prevention and intervention.
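
To make the sensitivity and specificity figures above concrete, the following Python sketch derives both from binary labels using their standard confusion-matrix definitions. Encoding a warning as 1 and no warning as 0 is an assumption made for this illustration.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels in {0, 1}.

    sensitivity = TP / (TP + FN): share of true warning cases that were flagged
    specificity = TN / (TN + FP): share of non-warning cases left unflagged
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)
```

For `y_true = [1, 1, 0, 0, 1]` and `y_pred = [1, 0, 0, 1, 1]` there are two true positives, one false negative, one true negative and one false positive, so the function returns a sensitivity of 2/3 and a specificity of 0.5.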

Classification Model for Scholarly Articles Based on Improved Graph Neural Network

Huang Xuejian, Liu Yuyang, Ma Tinghuai

2022, 6 (10): 93-102. DOI: 10.11925/infotech.2096-3467.2022.0071

[Objective] This paper tries to address the over-smoothing issue of traditional graph neural networks and to allocate weights adaptively across different depths and neighbors, aiming to improve the performance of academic literature classification. [Methods] We proposed an improved graph neural network model for academic paper classification. First, with the help of a multi-head attention mechanism, the new model learned a variety of related features among documents and adaptively distributed the weights of different neighbor nodes. Then, based on the residual network structure, the model aggregated the outputs of each layer’s nodes and learned an adaptive aggregation radius. Finally, with the improved graph neural network, the model learned the feature representation of each node in the paper citation graph, which was fed into a multi-layer fully connected network to obtain the final classification. [Results] We examined our model on large-scale real datasets. Its accuracy reached 0.61, which was 0.04 and 0.14 higher than those of the GCN and Transformer models, respectively. [Limitations] More research is needed to improve the classification accuracy for small categories and hard-to-distinguish samples. [Conclusions] The improved graph neural network can effectively classify academic articles.

Detecting Topics of Online News with Shared Nearest Neighbours and Markov Clustering

Wu Zhenfeng, Lan Tian, Wang Mengmeng, Pu Mo, Zhang Yu, Liu Zhihui, He Yanqing

2022, 6 (10): 103-113. DOI: 10.11925/infotech.2096-3467.2021.1170

[Objective] This paper proposes a topic detection method for online news, aiming to use the internal structure of the data more effectively. [Methods] First, we measured the association strength among online news items with the number and rank of their shared nearest neighbors. Then, we constructed a shared nearest neighbor graph, which improved the use of the data’s internal structure. Finally, we detected the topics of online news with dimension reduction, selection of the optimal number of topics, Markov clustering, and automatic topic description based on closeness centrality. [Results] We examined our new model with two datasets of online news. Its ARI values reached 0.86 and 0.97, while the ARI values of the LDA, K-means, and GMM models were all below 0.75 and 0.90, respectively. [Limitations] We need to evaluate the performance of the proposed method with datasets from other fields and with multilingual ones. [Conclusions] The proposed method could effectively detect the topics of online news and provides a new direction for future research.
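
The shared nearest neighbor idea in the first step can be sketched in Python as follows. This is a deliberately simplified version that counts only how many neighbors two items share; the paper also uses the rank of the shared neighbors, which the abstract does not specify and which is therefore omitted. Euclidean distance on small coordinate tuples stands in for whatever document representation and distance the authors used.

```python
from math import dist


def k_nearest_neighbors(points, k):
    """For each point, return the set of indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )
        neighbors.append(set(others[:k]))
    return neighbors


def snn_similarity(points, k):
    """Map each index pair (i, j), i < j, to the number of shared neighbors."""
    neighbors = k_nearest_neighbors(points, k)
    n = len(points)
    return {
        (i, j): len(neighbors[i] & neighbors[j])
        for i in range(n)
        for j in range(i + 1, n)
    }
```

Pairs with a nonzero count can then be linked as weighted edges of a graph and passed to a graph clustering step. Two items in the same dense group tend to share neighbors even when their direct distance is not the smallest, which is one reason this similarity reflects the internal structure of the data.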

Recommending Point-of-Interests with Real-Time Event Detection

Li Zhi, Sun Rui, Yao Yuxuan, Li Xiaohuan

2022, 6 (10): 114-127. DOI: 10.11925/infotech.2096-3467.2021.1461

[Objective] This paper constructs a point-of-interest (POI) recommendation system based on real-time event detection, appropriate time and POI characteristics. [Methods] First, we retrieved real-time events from a large number of geotagged tweets. Then, the system learned embedded feature representations of real-time events and time-aware information through a tree convolutional neural network. Third, we captured the perceptual features of each POI’s text and image content from comments and photos. Fourth, the system learned the text and image feature vector of each POI with a convolutional neural network. Finally, we used recall at top K and mean reciprocal rank (MRR) to evaluate the effectiveness of different recommendation systems. [Results] The MRR of the proposed model was 8.9% higher than that of the MP model and 57.9% higher than that of the non-negative matrix factorization (NMF) model. [Limitations] The POI characteristics only include textual and image features, which need to be expanded. [Conclusions] The proposed model could effectively recommend points of interest, which benefits location-based services such as search, transportation and environmental monitoring.
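
The two evaluation measures named in the last step, recall at top K and mean reciprocal rank, can be sketched in Python as below. The definitions are the standard ones; the POI names in the usage note are invented for illustration.

```python
def recall_at_k(ranked, relevant, k):
    """Share of the relevant items that appear in the top-k recommendations."""
    relevant = set(relevant)
    return len(set(ranked[:k]) & relevant) / len(relevant)


def mean_reciprocal_rank(rankings, targets):
    """Average of 1 / rank of each query's target item (0 if it is absent)."""
    total = 0.0
    for ranked, target in zip(rankings, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
    return total / len(rankings)
```

For example, `mean_reciprocal_rank([["cafe", "park", "museum"], ["park", "cafe", "museum"]], ["cafe", "cafe"])` is 0.75, because the target is ranked first for one user and second for the other. `recall_at_k(["cafe", "park", "museum"], {"park", "zoo"}, 2)` is 0.5, since one of the two relevant POIs appears in the top two.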

Improvement of Data Augment Algorithm for Named Entity Recognition with Small Samples

Liu Xingli, Fan Junjie, Ma Haiqun

2022, 6 (10): 128-141. DOI: 10.11925/infotech.2096-3467.2022.0261

[Objective] This paper proposes a strategy to improve the data augmentation algorithm for named entity recognition with small samples. [Methods] Taking domain named entity recognition (NER) as an example, we propose a multi-dimensional improvement strategy based on the easy data augmentation (EDA) algorithm: entity replacement using a mixture of multiple domain dictionaries, part-of-speech replacement using domain semantic classification dictionaries, random deletion with a semantic protection mechanism, random insertion with part-of-speech protection, and an improved combination of these four methods. An NER model is then trained with each strategy. [Results] The domain NER experiments with small samples show that each single improved EDA strategy raised performance: the F value increased by 3.2, 4.6, 4.5 and 2.5 percentage points respectively. In contrast, the F value was poor when two or more strategies were combined. In the extended experiments on the People’s Daily and Weibo datasets with small samples, the improvement was significant: the F value of the entity replacement strategy based on multi-domain dictionary mixing increased by up to 6.7 percentage points on the two datasets. [Limitations] In the multi-strategy combination experiments, tuning the parameters α and N becomes more difficult, which affects the NER improvement of the combined strategy. [Conclusions] The improved EDA strategy proposed in this paper effectively improves the results of named entity recognition models with small samples.
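
As a minimal sketch of what random deletion with a protection mechanism might look like, the Python function below drops each token with probability `p` unless it belongs to a protected set. The abstract does not describe the authors' semantic protection rule, so the `protected` set (for example, entity tokens) and the fallback for an empty result are assumptions made for this illustration.

```python
import random


def random_deletion(tokens, p, protected, rng=None):
    """Delete each unprotected token with probability p.

    tokens: list of token strings
    p: deletion probability in [0, 1]
    protected: tokens that must never be deleted (hypothetical rule)
    rng: optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    kept = [t for t in tokens if t in protected or rng.random() >= p]
    # Avoid returning an empty sentence: fall back to one random token.
    if not kept and tokens:
        kept = [rng.choice(tokens)]
    return kept
```

Keeping entity tokens out of reach of the deletion step means the augmented sentence still contains every labeled span, so the original NER annotations remain usable. With `p=1.0` and `protected={"Beijing"}`, the tokens `["He", "visited", "Beijing", "yesterday"]` reduce to `["Beijing"]`.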

Analyzing Public Opinion on Three-Child-Policy with Sentiment Classification and Keyword Extraction

Meng Fansi, Zhong Han, Shi Shuicai, Xie Zekun

2022, 6 (10): 142-150. DOI: 10.11925/infotech.2096-3467.2022.0067

[Objective] This paper studies public opinion on the three-child-policy in different Chinese provinces. [Context] Existing research on this issue addresses public opinion from the Web as a whole and ignores the demands or concerns of individual provinces. These studies use rather simple methods and a single data source. [Methods] First, we analyzed public opinion on the three-child-policy with a time series method from a statistical perspective. Then, we classified sentiments with the SVM model and extracted keywords from the negative opinions with the CRF model. Third, we created word clouds for these keywords. Finally, we studied public opinion in different provinces and generated word clouds for them. We also examined the ties between political or economic statistics and the negative keywords from different provinces. [Results] The three-child-policy was more popular than other policies during the same period. Public opinion was dominated by neutral sentiments (60.56%), followed by positive (35.15%) and negative ones (4.29%). Public concerns differed across provinces and correlated with political, economic and ecological factors. [Conclusions] Different provinces should adopt customized public opinion guidance to support the three-child-policy, which will address people’s concerns more effectively.
