Data Analysis and Knowledge Discovery

Review of Knowledge Elements Extraction in Scientific Literature Based on Deep Learning

Li Guangjian, Yuan Yue

Data Analysis and Knowledge Discovery. 2023, 7(7): 1-17. https://doi.org/10.11925/infotech.2096-3467.2023.0498

[Objective] This paper reviews research on extracting knowledge elements from scientific literature using deep learning techniques. [Coverage] We searched Web of Science, Google Scholar, and CNKI with keywords such as “knowledge elements” and “deep learning”, and manually selected 71 representative articles for the review. [Methods] First, we provided an overview of the relevant concepts and characteristics of knowledge elements in scientific literature. Then, we summarized the deep learning techniques used for knowledge element extraction in existing studies. [Results] Existing extraction methods target either word-level or sentence-level knowledge elements. The key to applying deep learning to knowledge extraction is learning and capturing the distinct characteristics of word-level and sentence-level knowledge elements. [Limitations] This paper is based on the selected sample literature, which might not fully reflect certain achievements in the field. [Conclusions] The application of deep learning techniques to knowledge element extraction has improved the accuracy, coverage, and robustness of the extraction process. Future studies should not only use the structured information of scientific literature but also focus on understanding its internal knowledge content and inherent logic.

Review of Latent Knowledge Discovery Methods Based on Association Between Scientific Papers and Technology Patents

Wang Shiwei, Chen Chun

Data Analysis and Knowledge Discovery. 2023, 7(7): 18-31. https://doi.org/10.11925/infotech.2096-3467.2022.0981

[Objective] This paper reviews latent knowledge discovery methods based on scientific papers and technology patents, identifying deficiencies in current studies and directions for future development. [Coverage] A total of 75 representative articles were retrieved using keywords such as “Patents and Papers”, “Science and Technology”, and “Knowledge Discovery” from the Web of Science, Springer Link, and CNKI. [Methods] Based on the science-technology association, we reviewed the literature from four aspects: data association, subject association, theme association, and multi-dimensional association. [Results] Existing research methods have several limitations. The corpora used for identification need more data sources, and heterogeneous data sources are not standardized. The identification methods need stronger semantic analysis and finer granularity. The knowledge system and measurement indicators based on papers and patents remain incomplete. The identification results need to be more comprehensive, dynamic, and exploratory. [Limitations] This review covers only a selection of representative literature, so the discussion is not sufficiently in-depth. At the content analysis level, multi-strategy comprehensive analysis of science-technology correlation is a current research hotspot, but this paper does not analyze it systematically. The selection of representative literature from the search results also involves a degree of subjectivity. [Conclusions] Future research should integrate multi-source databases and standardize heterogeneous data, enhance the semantic analysis ability of identification methods, and refine the identification granularity. We also need to improve the knowledge organization system, enrich the measurement indicators, and strengthen research on the dynamic evolution of latent knowledge discovery.

Interdisciplinary Subject Recognition Based on Feature Measurement and PhraseLDA Model: Case Study of Nanotechnology in Agricultural Environment

Zhang Zhenqing, Sun Wei

Data Analysis and Knowledge Discovery. 2023, 7(7): 32-45. https://doi.org/10.11925/infotech.2096-3467.2022.0651

[Objective] This paper aims to identify interdisciplinary subjects based on a feature measurement method and the PhraseLDA model. [Methods] First, we analyzed the interdisciplinary characteristics of subjects and constructed a measurement index system for them. Then, we identified the interdisciplinary subjects with the help of the PhraseLDA model. Finally, we conducted an empirical study of nanotechnology applications in agricultural environments. [Results] A total of 24 interdisciplinary topics were identified, including catalyst preparation and soil bioremediation. Compared with the traditional identification method, the interdisciplinary topic recognition rate of the proposed method increased by 71.40%, and the recognition rate of fine-grained topics increased by 42.86%. [Limitations] The number of topics in the PhraseLDA topic model and the thresholds of the interdisciplinary topic identification indicators were set after repeated calculation and tuning. Therefore, the proposed method depends on the rationality of these thresholds. [Conclusions] The proposed method can effectively identify interdisciplinary topics and support scientific decision-making and technological innovation research in related fields.

Identifying Abnormal Riding Behaviour in Urban Rail Transit with Multi-Source Data

Xue Gang, Liu Shifeng, Gong Daqing, Zhang Pei, Liu Zhongliang

Data Analysis and Knowledge Discovery. 2023, 7(7): 46-57. https://doi.org/10.11925/infotech.2096-3467.2022.0648

[Objective] This study constructs datasets and algorithms to identify abnormal riding behaviour in urban rail transit (theft, begging, performing arts, and unauthorized advertisement distribution). [Methods] By constructing a spatiotemporal matrix, we refined passengers’ spatiotemporal trajectories into a spatiotemporal feature map, which retains all travel records without increasing complexity. Then, we used the spatiotemporal feature map as input to an algorithm framework based on the attention mechanism and graph convolutional neural networks. The algorithm extracts passengers’ key trajectory pattern features and identifies abnormal behaviour within the regular passenger flow. [Results] The proposed method achieved a precision of 93.10%, a recall of 95.30%, and an F1 of 94.19%. All evaluation metrics improved by over 3% compared to the baseline model. [Limitations] More research is needed to expand the sample size of the dataset and address the false positive issues. Our model cannot identify abnormal passengers who frequently change their smart cards. [Conclusions] This study constructs a dataset for abnormal commuting behaviour with a larger sample size and reduced workload. The model can serve as a tool for accurately identifying abnormal commuting behaviour in rail transit systems.
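
The reported F1 can be checked against the reported precision and recall. The sketch below assumes the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.9310, 0.9530)
print(f"F1 = {f1:.2%}")
```

Under that assumption the result rounds to the 94.19% stated in the abstract.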

Mining Trajectory Hotspots Based on Co-location Patterns

Yan Ruibin, Yin Dechun, Gu Yijun

Data Analysis and Knowledge Discovery. 2023, 7(7): 58-73. https://doi.org/10.11925/infotech.2096-3467.2022.0704

[Objective] This paper proposes several Trajectory Traversal Hotspots Mining algorithms for different trajectory characteristics, based on N-Degree Trajectory Table Join, N-Degree Trajectory Table Traversal, and graph databases. These algorithms aim to reduce the time and space complexity of trajectory hotspots mining. [Methods] If the trajectory data does not form a complete graph structure, we use the N-Degree Trajectory Table Join algorithm or the N-Degree Trajectory Table Traversal algorithm to iterate over the path table multiple times and obtain the hotspots based on the distribution density of the trajectory data. If the trajectory data forms a graph structure, the Trajectory Traversal Hotspots Search algorithm performs traversal search and pruning optimization to obtain the trajectory hotspots. [Results] We conducted experiments with the ChoroChronos open-source dataset. Regarding time complexity, the running time of the Trajectory Traversal Hotspots Search algorithm was reduced by 25% compared with the best comparison algorithm. Regarding space complexity, the N-Degree Trajectory Table Join and N-Degree Trajectory Table Traversal algorithms consumed 67% less memory than the best comparison algorithm. [Limitations] The temporal features in the trajectory sequences are not yet fully utilized, and experiments should be conducted on a more comprehensive dataset. [Conclusions] Compared with other trajectory hotspots mining algorithms, the proposed algorithms effectively reduce the space and time complexity.

Feature Selection and Efficient Disease Early Warning Based on Optimized Ensemble Learning Model: Case Study of Geriatric Depression and Anxiety

Yan Ying, Huang Qi, Li Na

Data Analysis and Knowledge Discovery. 2023, 7(7): 74-88. https://doi.org/10.11925/infotech.2096-3467.2022.0718

[Objective] This paper selects key disease risk variables so that disease prediction models balance computational efficiency and prediction accuracy, aiming to help public health departments achieve efficient disease early warning. [Methods] We used the ensemble learning-based Random Forest and XGBoost models to learn from high-dimensional disease risk variable data for disease prediction. The models autonomously select subsets of variables that contribute to their prediction. To ensure that the selected subset has high prediction accuracy, we analyzed the ensemble strategies of Random Forest and XGBoost. By adjusting hyperparameters and cross-validating, we iteratively reduced the out-of-bag error rate of the Random Forest model and made the loss curve of the XGBoost model converge on different sub-training sets. Finally, we proposed a separate optimization solution for each model to enhance its disease prediction performance. [Results] We examined the optimized models with a dataset of geriatric depression and anxiety. They exhibited excellent and comparable disease prediction performance, achieving prediction accuracies of 88.6% and 89.7%, as well as AUC values of 0.936 and 0.940, respectively. However, the XGBoost model had a simpler and more efficient structure with the optimized feature selection. It selected only 17 key variables out of 54 geriatric depression and anxiety risk variables, achieving a prediction accuracy of 85.8% and an AUC of 0.917. [Limitations] We did not use the latest geriatric cohort data for experimentation. More research is needed to test the adaptability of the models in complex and heterogeneous data environments. [Conclusions] The feature selection of the optimized XGBoost model is superior in improving the efficiency of disease early warning and provides decision support for public health management.

Analyzing Tourist Satisfaction of Rural Scenic Attractions Based on IPA Model

Wu Jiang, Li Qiubei, Hu Zhongyi, Liu Yang

Data Analysis and Knowledge Discovery. 2023, 7(7): 89-99. https://doi.org/10.11925/infotech.2096-3467.2022.0667

[Objective] Based on online reviews, this paper constructs a framework for analyzing tourist satisfaction and provides a new research perspective for the sustainable development of rural tourism. [Methods] With the visitor comments about scenic areas, we built a tourist satisfaction analysis framework using the IPA model. Then, we used the unsupervised method to extract the visitors’ fine-grained attribute opinions about the scenic attractions. Third, we evaluated the visitors’ perceived emotions and the importance of different attributes with SnowNLP and XGBoost. Finally, we analyzed the satisfaction of attraction attributes with the IPA model. [Results] Empirical analysis demonstrates that the constructed framework can identify user opinions and analyze satisfaction levels for different attributes. The advantages of Hongcun scenic area include natural scenery and entertainment, which can be emphasized in promoting the area. On the other hand, consumer perception, commercialization, and tourism services need improvement. Furthermore, visitor flow, dining options, infrastructure, and scenic area management are low-priority development options that can be sequentially improved when sufficient resources are available. [Limitations] The experimental dataset has data imbalance issues in the ratings. [Conclusions] According to the analysis results of tourist satisfaction in the case study, this paper explores management and marketing strategies to promote the sustainable development of scenic areas, providing new insights into related issues in tourism.
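
The final IPA step can be illustrated with a small sketch. It assumes the conventional importance-performance grid, with the mean of each axis as the cut-off; the abstract does not state which cut-offs the authors used. The function `ipa_quadrants`, the attribute list, and all scores below are hypothetical and not taken from the paper.

```python
from statistics import mean


def ipa_quadrants(scores):
    """Map attribute -> quadrant label, given attribute -> (importance, performance).

    Assumption: the quadrant boundaries are the mean importance and mean performance.
    """
    imp_cut = mean(imp for imp, _ in scores.values())
    perf_cut = mean(perf for _, perf in scores.values())
    labels = {}
    for name, (imp, perf) in scores.items():
        if imp >= imp_cut and perf >= perf_cut:
            labels[name] = "keep up the good work"
        elif imp >= imp_cut:
            labels[name] = "concentrate here"
        elif perf >= perf_cut:
            labels[name] = "possible overkill"
        else:
            labels[name] = "low priority"
    return labels


# Hypothetical (importance, performance) scores on a 0-1 scale.
scores = {
    "natural scenery": (0.9, 0.8),
    "tourism services": (0.8, 0.3),
    "dining options": (0.2, 0.4),
    "souvenirs": (0.3, 0.9),
}
quadrants = ipa_quadrants(scores)
for attribute, label in quadrants.items():
    print(f"{attribute}: {label}")
```

In a pipeline like the one described, importance would presumably come from the XGBoost feature importances and performance from the SnowNLP sentiment scores, but that mapping is an assumption here.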

Domain Ambiguous Collocation Dictionary for Real-Time Financial Sentimental Analysis

Zhao Youlin, Xu Jingnan, Lu Yingjun

Data Analysis and Knowledge Discovery. 2023, 7(7): 100-110. https://doi.org/10.11925/infotech.2096-3467.2022.0696

[Objective] This study tries to address the problem of inaccurate sentiment analysis due to ignoring the dynamic polarity in ambiguous words. It aims to effectively identify sentiment-ambiguous words with economic characteristics and their collocations. [Methods] The study takes dynamic financial news information as the research object. First, we calculated the positive and negative sentiment scores of words in phrases to extract ambiguous seed words. Then, we retrieved their strongly related collocations with algorithms such as association rules and PMI. Third, we labeled the sentiment polarity of collocation pairs to build an ambiguous collocation lexicon. Finally, we measured the performance of sentiment mining on real-time updated news texts from a dynamic perspective. [Results] The accuracy, recall, and F-value of the sentiment analysis of the financial information text were 89.62%, 87.52%, and 88.57%, respectively, which were 5.79%, 15.89%, and 10.84% higher than the traditional models. [Limitations] Some collocation words cannot be identified due to their significant distance from the seed words. [Conclusions] The ambiguous collocation dictionary constructed in this paper effectively expands the sentiment lexicon in economics. It optimizes the lexicon in granularity and depth, significantly improving sentiment analysis accuracy.
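
As a rough illustration of the PMI step, the sketch below scores a seed word and a candidate collocate from co-occurrence counts. It assumes the standard base-2 definition of pointwise mutual information; the function `pmi` and all counts are hypothetical, and the paper's exact windowing and thresholds are not given in the abstract.

```python
import math


def pmi(pair_count: int, x_count: int, y_count: int, total: int) -> float:
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical counts over 1000 text windows: the seed word occurs in 50,
# the candidate collocate in 40, and both occur together in 20.
strong = pmi(pair_count=20, x_count=50, y_count=40, total=1000)

# If the two words co-occurred only as often as chance predicts (2 windows),
# the PMI would be approximately zero.
chance = pmi(pair_count=2, x_count=50, y_count=40, total=1000)

print(f"strongly associated: {strong:.2f}, chance level: {chance:.2f}")
```

A positive score well above zero indicates that the pair co-occurs more often than independence would predict, which is the usual basis for keeping it as a collocation candidate.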

Recognition of Emotions and Analysis of Emotional Changes in Chinese Folk Songs

Zhao Meng, Wang Hao, Li Xiaomin

Data Analysis and Knowledge Discovery. 2023, 7(7): 111-124. https://doi.org/10.11925/infotech.2096-3467.2022.0678

[Objective] This paper aims to automatically recognize the rich emotions in Chinese folk songs and to explore their emotional context and fluctuation patterns digitally. [Methods] We adopted Hevner's emotion model from the field of music and introduced external Chinese knowledge for the semantic enhancement of emotion words. Manually labelled tags were then mapped automatically by semantic distance calculation. We constructed a multimodal multitag emotion recognition model (MMERM) that fuses lyric and audio features for automatic emotion recognition. The model was also transferred to recognize changes of emotion within songs, which supports statistical analysis and visualization of emotional context and fluctuation patterns. [Results] The semantic enhancement and mapping effectively improve the concentration and differentiation of tags in emotion recognition. MMERM performs well on both complete songs and fragments, with a precision of 82.29%. Regularity analysis indicates a trend from lightness to sadness and sacredness from the beginning to the end of the songs. Furthermore, the fluctuation pattern of Chinese folk songs differs remarkably from that of Western music. [Limitations] The information on folk songs is insufficient, and emotional characteristics under different temporal and spatial conditions are not analyzed. [Conclusions] This paper provides a new paradigm for research on traditional music from the perspective of digital humanities.

Chinese-Tibetan Bilingual Named Entity Recognition for Traditional Tibetan Festivals

Deng Yuyang, Wu Dan

Data Analysis and Knowledge Discovery. 2023, 7(7): 125-135. https://doi.org/10.11925/infotech.2096-3467.2022.0698

[Objective] This paper examines the performance of pre-trained models in resource-scarce languages and assists in building Tibetan knowledge graphs and semantic retrieval. [Methods] We collected Chinese-Tibetan bilingual text data related to traditional Tibetan festivals from websites such as People's Daily and its Tibetan Edition. Then, we compared the performance of multiple pre-trained language models and word embeddings on named entity recognition tasks in a Chinese-Tibetan bilingual context. We also analyzed the impact of two feature processing layers (BiLSTM and CRF) in the named entity recognition model. [Results] Compared with word embeddings, the pre-trained language models for Chinese and Tibetan improved F1 by 0.0108 and 0.0590, respectively. When entities are scarce, the pre-trained model can extract more textual information than word embeddings, reducing the training time by 40%. [Limitations] The Tibetan and Chinese language data are not parallel corpora, and the Tibetan language data has fewer entities than the Chinese data. [Conclusions] The pre-trained models perform strongly not only on Chinese text but also on Tibetan, a language with scarce resources.

Hybrid Recommendation with Category Preferences and Item Timeliness Factor

Yang Huaizhen, Zhang Jing, Li Lei

Data Analysis and Knowledge Discovery. 2023, 7(7): 136-145. https://doi.org/10.11925/infotech.2096-3467.2022.0712

[Objective] This paper addresses the influence of historical data sparsity, category preference, and item timeliness on the performance of recommendation algorithms and improves their accuracy. [Methods] First, we used Huffman Coding to encode the rating data with category preference and item popularity. Then, we computed the rating similarity matrices of users and items. We also extracted their latent feature vectors using the DeepWalk model. Finally, we fused the user and item feature vectors and predicted the item ratings with Extreme Learning Machines. [Results] We examined the new model on the MovieLens and Yahoo! R3 datasets. As the proportion of the training set increased, the highest prediction accuracies reached 95.52% and 98.01%, respectively, with runtimes of only 19.93 s and 22.21 s. The proposed algorithm outperformed the XGB-CF algorithm in prediction accuracy by 0.84 and 2.10 percentage points, respectively, with runtime reductions of 7.92 s and 9.79 s. [Limitations] The proposed algorithm did not consider the textual information from user comments or diversified item categories. [Conclusions] Our new algorithm demonstrates higher prediction accuracy than the reference algorithm and can be used for personalized recommendations.
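
For readers unfamiliar with the encoding step, the sketch below builds a generic Huffman code from symbol frequencies. It only illustrates the textbook algorithm: how the authors combine the codes with category preference and item popularity is not specified in the abstract, and the function `huffman_codes` and the toy rating list are hypothetical.

```python
import heapq
from collections import Counter


def huffman_codes(freqs):
    """Return a prefix-free binary code per symbol, given symbol -> frequency."""
    if len(freqs) == 1:
        return {next(iter(freqs)): "0"}
    # Heap entries: (total frequency, unique tie-breaker, {symbol: partial code}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees, prefixing their codes with 0 and 1.
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]


# Hypothetical rating values; frequent ratings receive shorter codes.
ratings = [5, 5, 5, 5, 4, 4, 4, 3, 3, 1]
codes = huffman_codes(Counter(ratings))
for rating, code in sorted(codes.items()):
    print(rating, code)
```

The property that matters for any downstream use is that more frequent symbols get shorter codes and no code is a prefix of another.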

Literature Recommendation Algorithm Integrating High-Order Similarity of Motif Structure

Chen Liu, Guo Yuhong

Data Analysis and Knowledge Discovery. 2023, 7(7): 146-155. https://doi.org/10.11925/infotech.2096-3467.2022.0715

[Objective] This paper applies the collaborative filtering method to the field of literature recommendation. It incorporates high-order similarity features reflected by the Motif structure in the user cosine similarity network to improve the recommendation quality. [Methods] First, we constructed user preference data for literature from users' literature collection behavior and the citation relationships between documents. Second, in the user cosine similarity network built from the literature collection behavior, we captured high-order similarity using Motif structures (subgraphs) within the network. Finally, we integrated the user cosine similarity and the Motif-based high-order similarity into the matrix factorization recommendation algorithm to predict user preferences for literature. [Results] Compared with the traditional matrix factorization recommendation algorithms, this algorithm reduced RMSE and MAE by 0.0482 and 0.0379, respectively. [Limitations] The proposed algorithm does not consider the temporal decay of the literature. [Conclusions] The new algorithm reduces the prediction error of user preferences and improves the literature recommendation quality.
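
The idea of a Motif-based high-order similarity can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: it treats each user's collection as a binary vector, keeps an edge when the cosine similarity reaches an arbitrary threshold, and uses the triangle as the motif. The names `cosine`, `triangle_counts`, and `user_items`, the threshold, and the data are all hypothetical.

```python
import math
from itertools import combinations


def cosine(a: set, b: set) -> float:
    """Cosine similarity between two users' collected item sets (binary vectors)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def triangle_counts(edges):
    """For each edge (u, v), count the triangles that contain it (common neighbours)."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)
    return {(u, v): len(neighbours[u] & neighbours[v]) for u, v in edges}


# Hypothetical collection data: user -> ids of collected documents.
user_items = {
    "u1": {1, 2, 3},
    "u2": {2, 3, 4},
    "u3": {3, 4, 5},
    "u4": {9},
}

threshold = 0.3  # assumed cut-off for keeping an edge in the similarity network
edges = [
    (u, v)
    for u, v in combinations(sorted(user_items), 2)
    if cosine(user_items[u], user_items[v]) >= threshold
]
motif_similarity = triangle_counts(edges)
print(motif_similarity)
```

In a full system, a score like this would be combined with the pairwise cosine similarity before entering the matrix factorization objective; how the two are weighted is not described in the abstract.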

Hierarchical Multi-label Classification of Children's Literature for Graded Reading

Cheng Quan, Dong Jia

Data Analysis and Knowledge Discovery. 2023, 7(7): 156-169. https://doi.org/10.11925/infotech.2096-3467.2022.0649

[Objective] This study constructs a hierarchical multi-label classification model for children's literature, aiming to classify children's books automatically and guide young readers to books suited to their developmental needs. [Methods] We materialized the concept of graded reading into a hierarchical classification label system for children's literature. Then, we built the ERNIE-HAM model using deep learning techniques and applied it to the hierarchical multi-label text classification system. [Results] Compared with four pre-trained models, the ERNIE-HAM model performed well in the second and third hierarchical classification levels for children's books. Compared to the single-level algorithm, the hierarchical algorithm improved the $AU(\overline{PRC})$ values for the second and third levels by about 11%. Compared to two hierarchical multi-label classification models, HFT-CNN and HMCN, the ERNIE-HAM model improved the third-level classification results by 12.79% and 6.48%, respectively. [Limitations] The overall classification performance of the proposed model can be further improved, and future work should focus on expanding the dataset and refining the algorithm design. [Conclusions] The ERNIE-HAM model is effective in the hierarchical multi-label classification of children's literature.

25 July 2023, Volume 7 Issue 7