Data Analysis and Knowledge Discovery

Current Issue, Volume 8 Issue 2

A Survey on Session-Based Recommendation Methods with Graph Neural Network

Zhang Xiongtao, Zhu Na, Guo Yuhui

2024, 8 (2): 1-16. DOI: 10.11925/infotech.2096-3467.2022.1282

[Objective] This paper reviews session-based recommendation methods built on graph neural networks to provide a reference for future research. [Coverage] We used “session-based recommendation” and “graph neural network” as search terms and screened 82 Chinese and international publications from databases such as Web of Science and China National Knowledge Infrastructure. [Methods] From the perspectives of framework, evaluation, and trends, this paper summarizes and compares session-based recommendation methods based on graph neural networks, summarizes the existing evaluation resources, and discusses future research trends. [Results] Graph neural networks are the mainstream technology for implementing session-based recommender systems. Studies on session-based recommendation with graph neural networks mainly focus on three core problems: session graph construction, session graph learning, and session interest representation. [Limitations] New session-based recommendation methods with graph neural networks are constantly emerging; this review covers only typical studies rather than all of them. Future research can go deeper on interpretability, robustness, diversity, and fairness. [Conclusions] Graph neural networks are the mainstream technology for session-based recommender systems. Existing research has conducted preliminary exploration from various aspects and provided sufficient evaluation resources. Future research should combine the characteristics of session recommendation scenarios with advances in graph neural network technology to address the deficiencies of existing research.
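
To make the first of the three core problems concrete, the sketch below shows one common way to build a session graph: each clicked item becomes a node, and each pair of consecutive clicks becomes a directed edge weighted by how often it occurs. This is an illustrative assumption, not a construction taken from the paper; the function name `build_session_graph` and the sample session are hypothetical.

```python
from collections import defaultdict


def build_session_graph(session):
    """Build a directed, weighted item-transition graph from one click session.

    Hypothetical helper for illustration: nodes are the unique items,
    edges count how often one item is clicked right after another.
    """
    edges = defaultdict(int)
    for src, dst in zip(session, session[1:]):
        edges[(src, dst)] += 1
    # dict.fromkeys keeps first-seen order while dropping repeats
    nodes = list(dict.fromkeys(session))
    return nodes, dict(edges)


nodes, edges = build_session_graph(["a", "b", "c", "b", "d"])
print(nodes)
print(edges)
```

A graph neural network would then propagate information along these edges (session graph learning) and pool the node states into a single vector (session interest representation).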

An Overview of Research on Multi-Document Summarization

Bao Ritong, Sun Haichun

2024, 8 (2): 17-32. DOI: 10.11925/infotech.2096-3467.2022.1245

[Objective] This paper reviews the literature on multi-document summarization, aiming to examine its research frameworks and mainstream models. [Coverage] We searched the AI Open Index, Papers with Code, and CNKI databases with the queries “multi-document summarization” and “多文档摘要” (the Chinese term for multi-document summarization). A total of 76 representative articles were retrieved. [Methods] We summarized the mainstream research frameworks, the latest models, and the algorithms of multi-document summarization technology. We also presented prospects for future studies. [Results] We compared the strengths and weaknesses of the latest multi-document summarization models with those of traditional methods. We also summarized high-quality multi-document summarization datasets and current evaluation metrics. [Limitations] We only discussed the evaluation results of some popular models on the Multi-News dataset and did not compare all models on the same dataset. [Conclusions] Many challenges remain in multi-document summarization, including the low factual accuracy of generated summaries and the poor generality of the models.

Computing Patent Similarity Based on Hierarchical Feature of Claims

Xiang Shuxuan, Cao Yujie, Mao Jin

2024, 8 (2): 33-43. DOI: 10.11925/infotech.2096-3467.2022.1340

[Objective] This paper proposes a new model to compute patent similarity, which fully leverages the characteristics of patent texts and their structural and context features. [Methods] First, we used technical compound sentences, the weighting of information core degree, and information richness to represent patents. Then, we calculated patent-to-patent similarity with the representation. Finally, we conducted comparative experiments with correlation scores and patent classification. [Results] The proposed method outperformed benchmark methods in computing patent similarities. The technical compound sentences and weighting of information core degree and richness further improved the model's performance. [Limitations] We only examined the model with quantum computing. [Conclusions] Using a claim tree and technical compound sentences to organize patent information can improve the efficiency of patent text processing. The weighting of information core degree and richness based on hierarchical features of patents can improve their representation and patent similarity computing tasks.

Label Distribution Learning Based on Hierarchical Tag Structure

Liu Kan, You Meilin, Wei Lanxi

2024, 8 (2): 44-55. DOI: 10.11925/infotech.2096-3467.2022.1278

[Objective] This paper focuses on the complex hierarchical relationship between tokens in label distribution learning. It enhances performance by adding the hierarchical tag structure to the label distribution learning model. [Methods] We proposed a hierarchy-based label distribution learning algorithm (H-LDL), which used conditional probability to describe the extensive and intensive tag structural relationship. We also adjusted the exact distribution of each level with a hierarchical weighted loss function and its optimization strategy. [Results] We examined the new model on two public datasets. On the BU_3DFE dataset, the Euclidean, Squared, and K-L scores decreased by 3.99%, 1.07%, and 3.10% compared to the baseline model, while the Intersection and Fidelity scores improved by 4.24% and 0.67%. On the COMP dataset, the Euclidean score decreased by 0.48%, the Squared and K-L scores showed no significant decrease, and the Intersection and Fidelity scores increased by 0.45% and 0.02%. [Limitations] We only included two hierarchical relationships in the new model. Further research is needed for more complex hierarchical relationships. [Conclusions] A hierarchical label structure effectively improves the performance of label distribution learning.
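
The results mix two kinds of measures: distances, where lower is better (Euclidean, K-L), and similarities, where higher is better (Intersection, Fidelity). The sketch below uses the standard textbook definitions of these four measures for two discrete distributions; it is not the paper's evaluation code, the function names and sample distributions are hypothetical, and the "Squared" measure is left out because the abstract does not define it.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two distributions (lower is better)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) (lower is better).

    eps is an assumed smoothing constant that avoids log(0).
    """
    return sum(a * math.log((a + eps) / (b + eps)) for a, b in zip(p, q) if a > 0)


def intersection(p, q):
    """Intersection similarity (higher is better, 1.0 for identical inputs)."""
    return sum(min(a, b) for a, b in zip(p, q))


def fidelity(p, q):
    """Fidelity similarity (higher is better, 1.0 for identical inputs)."""
    return sum(math.sqrt(a * b) for a, b in zip(p, q))


true_dist = [0.5, 0.3, 0.2]       # hypothetical ground-truth label distribution
predicted = [0.4, 0.4, 0.2]       # hypothetical model prediction
for name, fn in [("Euclidean", euclidean), ("K-L", kl_divergence),
                 ("Intersection", intersection), ("Fidelity", fidelity)]:
    print(f"{name}: {fn(true_dist, predicted):.4f}")
```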

Chinese Named Entity Disambiguation Based on Multivariate Similarity Fusion

Shi Shuiqian, Jin Jing, Shen Gengyu, Wang Baojia, Ren Ni

2024, 8 (2): 56-64. DOI: 10.11925/infotech.2096-3467.2022.1190

[Objective] This paper aims to solve the ambiguity problems arising from mapping multiple entities of the same name with different meanings to a knowledge base. It improves the accuracy of entity disambiguation. [Methods] We proposed a multi-dimensional similarity fusion method. It utilizes the semantic similarity of entity context, the entity attributes' background similarity, and the topic words' semantic similarity to characterize entities. [Results] We examined the new model on the agricultural dataset from Wikipedia. The proposed method achieved an accuracy of 89.7%, outperforming traditional methods. [Limitations] The proposed method is only applicable in specific fields. [Conclusions] The new method addresses the entity disambiguation issues in specific fields. It can be applied to a broader range of entity disambiguation scenarios.
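
The abstract does not say how the three similarities are combined. As a minimal sketch, assuming a weighted linear fusion, a mention could be linked to the candidate entity with the highest fused score. The weights, function names, and candidate scores below are all hypothetical.

```python
def fuse_similarities(context_sim, attribute_sim, topic_sim, weights=(0.5, 0.3, 0.2)):
    """Combine three similarity scores into one (assumed weighted sum)."""
    w_ctx, w_attr, w_topic = weights
    return w_ctx * context_sim + w_attr * attribute_sim + w_topic * topic_sim


def disambiguate(candidates):
    """Return the candidate entity with the highest fused similarity.

    candidates maps an entity name to a tuple of
    (context similarity, attribute similarity, topic similarity).
    """
    return max(candidates, key=lambda name: fuse_similarities(*candidates[name]))


# Hypothetical scores for the mention "apple" in an agricultural text
candidates = {
    "Apple (fruit)": (0.8, 0.7, 0.9),
    "Apple Inc.": (0.4, 0.2, 0.1),
}
best = disambiguate(candidates)
print(best)
```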

Constructing Automatic Structured Synthesis Tool for Sci-Tech Literature Based on Move Recognition

Liu Yi, Zhang Zhixiong, Wang Yufei, Li Xuesi

2024, 8 (2): 65-73. DOI: 10.11925/infotech.2096-3467.2022.1330

[Objective] This paper utilizes AI technology to construct an automatic structured synthesis tool, which organizes sci-tech research frameworks structurally and reveals their main points. [Methods] The new tool was developed based on move recognition. First, we identified the research question, methodology, and progress keywords to extract the most important knowledge points from each paper. Then, we employed hierarchical clustering and cluster label generation methods to synthesize the knowledge. Finally, we designed a tree structure for the synthesis outputs. [Results] The proposed tool could automatically synthesize the literature contents and reveal their framework with a “research question, methodology, and progress” tree structure. [Limitations] Insufficient clustering accuracy and the difficulty of determining the number of clusters reduce the tool's synthesis performance. [Conclusions] The synthesis tool based on move recognition could automatically retrieve structured literature contents.

Identifying Moves in Full-Text Chinese Academic Papers

Du Xinyu, Li Ning

2024, 8 (2): 74-83. DOI: 10.11925/infotech.2096-3467.2022.1284

[Objective] This paper investigates the recognition of moves in full-text academic papers. It establishes a solid foundation for automatically understanding paper contents. Existing research on move recognition in academic papers only processes a small number of moves with coarse granularity, and there are few open datasets for move classification. [Methods] Based on the BERT model, we constructed a move classification dataset of academic papers with multi-stage fine-tuning. Then, we proposed a move recognition model incorporating the section titles to recognize moves at a fine-grained level. [Results] For the 22-class classification, the overall accuracy of the RoBERTa-wwm-ext model increased by 0.031 to 0.909, and the Micro-F1 improved by 0.022 to 0.837. [Limitations] There is a small amount of unbalanced data in the constructed corpus, and the quality of the papers affects the proposed model's performance. [Conclusions] The proposed model benefits the automatic understanding of academic papers, research quality evaluation, and semantic content retrieval, which play important roles in using scientific and technological literature.

Support for Cross-Domain Methods of Identifying Fake Comments of Chinese

Gu Yan, Zheng Kaihong, Hu Yongjun, Song Yishan, Liu Dongping

2024, 8 (2): 84-98. DOI: 10.11925/infotech.2096-3467.2022.1347

[Objective] This paper constructs a cross-domain Chinese fake review identification model (CFEE) for multi-domain datasets. It extracts the semantic information of the comment texts and addresses the problems of traditional recognition models. [Methods] First, we established 11 rules for constructing fake review datasets and created a multi-domain dataset. Then, we designed the CFEE model to identify Chinese fake comments across domains. Third, it extracted the deep semantic information with the ERNIE pre-training model. The model identified the hidden comments based on the texts' emotional attributes. Finally, it projected the text information to the word relation dimension with the convolutional neural network and realized classification based on features of neural network fusion. [Results] The CFEE model's F₁ value reached 91.52% on the multi-domain Chinese fake comment datasets. The model's F₁ values were 85.71%, 79.59%, 85.71%, and 85.00% on single-domain datasets for mobile phones, food, clothing, and household appliances, respectively. It outperformed the existing models significantly. [Limitations] There is subjectivity in the manual annotation. [Conclusions] The proposed method can effectively identify Chinese fake reviews across domains.

Identifying Important Topics and Knowledge Flow Paths with Topic-Citation Fusion

Liang Shuang, Liu Xiaoping, Chai Wenyue

2024, 8 (2): 99-113. DOI: 10.11925/infotech.2096-3467.2022.1335

[Objective] This paper explores the internal mechanism and direction of knowledge flow to provide references for science and technology innovation, scientific evaluation, and decision-making. [Methods] We established a topic-based knowledge network and constructed the topic importance indicators with their impact factors and node intersection degrees. Starting from the important topics identified in this way, we used the maximum path search algorithm to construct the knowledge inflow and outflow paths. [Results] The new method could effectively identify the important topics. We also identified the knowledge flow paths and the domains with the most significant knowledge dissemination. [Limitations] The measurement of knowledge flow intensity between nodes needs to consider citation motivations and types. [Conclusions] This paper identifies two-way knowledge flows between topics. Topic groups communicate closely with each other within each discipline. Knowledge flow paths provide valuable references for grasping the development of research topics as a whole.

Influence of Network Structure Changes on Co-word Network Link Prediction

Chen Zhuo, Jiang Xixi, Zhang Xiaojuan

2024, 8 (2): 114-130. DOI: 10.11925/infotech.2096-3467.2022.1311

[Objective] This article studies the impacts of co-word network structure changes on link prediction using similarity metrics. [Methods] Firstly, we randomly retrieved the ISLS, LAW, BSS, COM, and Ocean literature from the core collection of Web of Science (2015 to 2020). Secondly, according to the diverse keyword frequencies, we constructed co-word networks with various topological features, such as the number of nodes and edges, the Average Clustering Coefficient, the Density, the Network Transitivity, and the Average Degree. Finally, we chose 15 traditional link prediction similarity metrics (e.g., AA, CN, RWR, and Katz) to conduct link prediction experiments on various co-word networks. [Results] We compared and analyzed the prediction effects of different similarity metrics as the network structure changed. (1) In different disciplines, in most cases, the larger the overall frequency of keywords in the co-word network, the smaller the average clustering coefficient, the larger the density, network transitivity, average degree, average degree centrality, average betweenness centrality, and average closeness, and the more likely the link prediction effect is to be poor. Conversely, the larger the average clustering coefficient, the smaller the other network topological features, and the better the link prediction effect. (2) Among the 15 selected similarity metrics, the RWR metric performed the best in co-word networks with different topological characteristics. The prediction performance of the Katz metric is the most stable across different co-word networks. The prediction results of each metric in the LAW discipline are most affected by the change in keyword frequency. [Limitations] Due to limited computing space, we only used one classification method and one evaluation index in this study. In addition, we did not explore some node similarity indicators (i.e., likelihood analysis-based metrics and probability model-based metrics). [Conclusions] This study provides a theoretical foundation for selecting similarity metrics of co-word networks of different disciplines.
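
Two of the similarity metrics named above, Common Neighbors (CN) and Adamic-Adar (AA), are simple enough to sketch directly from their standard definitions. The example below is illustrative only: the tiny co-word network and the function names are hypothetical, and RWR and Katz are omitted because they need matrix computations.

```python
import math


def common_neighbors(adj, u, v):
    """CN score: number of neighbors shared by nodes u and v."""
    return len(adj[u] & adj[v])


def adamic_adar(adj, u, v):
    """AA score: shared neighbors weighted by 1 / log(degree).

    Shared neighbors with few links of their own count for more.
    Degree-1 neighbors are skipped to avoid dividing by log(1) = 0.
    """
    return sum(1.0 / math.log(len(adj[z])) for z in adj[u] & adj[v] if len(adj[z]) > 1)


# Hypothetical undirected co-word network: keyword -> co-occurring keywords
adj = {
    "graph": {"network", "learning"},
    "neural": {"network", "learning"},
    "network": {"graph", "neural", "learning"},
    "learning": {"graph", "neural", "network"},
}

cn_score = common_neighbors(adj, "graph", "neural")
aa_score = adamic_adar(adj, "graph", "neural")
print(cn_score, round(aa_score, 4))
```

A higher score for an unlinked keyword pair suggests that the two keywords are more likely to co-occur in the future.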

Identifying Trending Events Based on Time Series Anomaly Detection

Yang Xinyi, Ma Haiyun, Zhu Hengmin

2024, 8 (2): 131-142. DOI: 10.11925/infotech.2096-3467.2022.1316

[Objective] This study aims to discover information topics and identify real-world events that stimulate public discussions. It helps us establish timely responses and reduce risks. [Methods] We first constructed a co-word network to detect communities representing topics. Then, we calculated the document topic vectors based on the overlaps between the document words and topic community words. Third, we built the topic popularity time series according to the document time. Finally, we used STL to decompose the topic popularity time series and employed the 3σ rule to detect anomalies. We identified real-world events stimulating discussion by examining high-frequency words and highly correlated documents at anomalous time points. [Results] We examined the new model with posts from Sina Weibo about the heavy rainstorm in Henan. We discovered topics related to disaster situations, emergency management, and social response. Anomaly detection and analysis show that the topics about disaster situations received the highest public attention, with rainfall warnings and flood control actions being hot events. In emergency management, rescue and relief efforts and accident investigation can stimulate discussions. Regarding social response, stories of victims' mutual aid and public donations attract attention. [Limitations] The dataset of this study is relatively small, so we had to manually set the threshold of anomaly detection. An automatic method is needed for larger datasets. [Conclusions] Anomaly detection in topic time series can identify the trending events on social platforms. In crisis response, government agencies need to address rescue, prevention, and recovery aspects, issue timely warnings, provide information on disaster relief and accident investigations to address public concerns, and guide positive or healthy public opinion by promoting rescue, mutual aid, and donation activities.
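
The final detection step can be sketched as follows, assuming the STL decomposition has already produced a residual series (for example with the `STL` class in `statsmodels`, which is not called here). The function name, the multiplier parameter `k`, and the sample residuals are hypothetical; the study's own threshold setting is not given in the abstract.

```python
import statistics


def three_sigma_anomalies(residual, k=3.0):
    """Return indices whose residual lies more than k standard deviations from the mean.

    residual is assumed to be the remainder component of an STL decomposition.
    k=3.0 corresponds to the 3-sigma rule; it is exposed as a parameter
    because a threshold may need manual tuning on small datasets.
    """
    mean = statistics.fmean(residual)
    std = statistics.pstdev(residual)
    if std == 0:
        return []
    return [i for i, r in enumerate(residual) if abs(r - mean) > k * std]


# Hypothetical residuals of a topic popularity series with one burst at index 14
residual = [0.1, -0.2, 0.0, 0.3, -0.1, 0.2, -0.3, 0.1, 0.0, -0.1,
            0.2, -0.2, 0.1, 0.0, 9.0, -0.1, 0.2, 0.0, -0.3, 0.1]
anomalies = three_sigma_anomalies(residual)
print(anomalies)
```

One design caveat: with very short series a single spike can never exceed three population standard deviations, so the rule needs a reasonable number of time points to be useful.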

Predicting User Pay Conversion Intention Based on Stacking Ensemble Learning: Case Study of Free Value-Added Games

Li Meiyu, Liu Yang, Wang Yixuan, Zhu Qinghua

2024, 8 (2): 143-154. DOI: 10.11925/infotech.2096-3467.2022.1261

[Objective] This paper proposes a model based on the Stacking ensemble learning method to predict users' intention to convert to paid services, aiming to identify potential paying users accurately. [Methods] We constructed a model for predicting payment intention based on Stacking ensemble learning. First, we determined the combination of base models by their prediction performance. Then, we examined the performance and portability of the proposed model with a dataset of game players' behavior. [Results] The prediction accuracy of our model reached 90.88%, with an F1 value of 90.71% and an AUC value of 0.9602. Compared to the Bayesian model, which had the worst performance, our model improved by 4.15%, 4.50%, and 0.1062, respectively. [Limitations] Our model cannot predict whether players will engage in irrational spending. [Conclusions] This study verifies the applicability of the Stacking ensemble learning method in game payment scenarios. The fusion of multiple models can obtain stable and accurate prediction results of payment intention. The proposed model could predict users' payment intentions in different fields.

Predicting Drug-Target Relationship Based on Relation Fusion and Bidirectional Mass Diffusion Model

Zhang Yunqiu, Huang Qifei, Zhu Xiang

2024, 8 (2): 155-167. DOI: 10.11925/infotech.2096-3467.2022.1225

[Objective] This study proposes a new method to predict the relationship between drugs and targets to improve the prediction performance. [Methods] Firstly, we used the SNF, AVG, and MAX methods to fuse multiple semantic relationships in drug and target similarity networks, which further enriched the semantic information of the networks. Then, we constructed a bidirectional diffusion model based on the fused similarity networks and the existing drug-target interaction network to predict the drug-target relationship. [Results] Compared with mainstream prediction models, our method's AUC value improved by 2.2% and 12.8%. In a retrospective study of the drug-target pairs ranked in the top 10, 20, and 30 by prediction score, clues and evidence for 3, 8, and 11 pairs, respectively, could be found in the literature. The SNF method had the best fusion effect and yielded the best prediction performance. [Limitations] We did not fuse similarities in objective attributes of drugs or targets, such as the chemical structure of drugs or sequence structure similarities of targets. The cold start problem in the relationship between new drugs and new targets still needs to be solved. [Conclusions] The prediction method proposed in this study could provide some references for the research on drug repositioning and relationship prediction of other biomedical entities.
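
Of the three fusion methods, AVG and MAX can be read as element-wise operations over several similarity matrices that share the same rows and columns. The sketch below rests on that assumption; the function name and the sample matrices are hypothetical, and SNF is omitted because it is an iterative procedure that does not fit in a few lines.

```python
def fuse_similarity(matrices, mode="avg"):
    """Fuse several same-sized similarity matrices element by element.

    mode="avg" takes the mean of each cell, mode="max" takes the largest value.
    """
    if mode not in ("avg", "max"):
        raise ValueError("mode must be 'avg' or 'max'")
    rows, cols = len(matrices[0]), len(matrices[0][0])
    fused = []
    for i in range(rows):
        row = []
        for j in range(cols):
            values = [m[i][j] for m in matrices]
            row.append(max(values) if mode == "max" else sum(values) / len(values))
        fused.append(row)
    return fused


# Two hypothetical drug-drug similarity matrices from different semantic relations
sim_a = [[1.0, 0.2], [0.2, 1.0]]
sim_b = [[1.0, 0.6], [0.6, 1.0]]

avg_fused = fuse_similarity([sim_a, sim_b], mode="avg")
max_fused = fuse_similarity([sim_a, sim_b], mode="max")
print(avg_fused)
print(max_fused)
```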
