Data Analysis and Knowledge Discovery

Constructing Semantic Association Model for Narrative-Oriented Archaeological Excavation Data

Han Muzhe, Gao Jinsong, Fang Xiaoyin, Li Shuaike, Sun Yanling, Li Yu

Data Analysis and Knowledge Discovery. 2024, 8(5): 1-17. https://doi.org/10.11925/infotech.2096-3467.2023.0409

[Objective] This paper aims to ensure the shareability of knowledge in Archaeological Excavation Data (AED) and promote knowledge integration across humanities disciplines. It constructs an ontology model based on narrative logic analysis of multi-dimensional semantic decomposition to achieve a multi-dimensional associative combination and narrative representation of AED knowledge. [Methods] Firstly, we thoroughly analyzed the knowledge structure and narrative logic within AED to determine a plan for ontology construction. Secondly, we examined the widely-used CIDOC CRM ontology model and its expanded CRM ontology family in cultural heritage to assess the reusability of related ontology. Thirdly, we semantically aligned the knowledge from archaeological sites, remains, and relics to define entity classes. Finally, targeting the narrative logic in AED, we determined each entity class’s object and data properties to construct the ontology model. [Results] Using the AED data from the Yanbulake cemetery in Hami, Xinjiang, we identified the semantic association between the site and archaeological excavation activities. We also explored extensive semantic association methods for burial relics and unearthed artifacts with knowledge-mining value, resulting in a series of narrative displays. [Limitations] Although the data from the Yanbulake cemetery is representative, the site is relatively small, and the complexity of actual application scenarios may be higher. [Conclusions] The semantic association model constructed in this paper can achieve knowledge representation that aligns with the archaeological data’s knowledge structure and narrative logic at the knowledge unit level.

Examining Dialogue Consistency Based on Chapter-Level Semantic Graph

Li Fei, Deng Kaifang, Fan Maohui, Teng Chong, Ji Donghong

Data Analysis and Knowledge Discovery. 2024, 8(5): 18-28. https://doi.org/10.11925/infotech.2096-3467.2023.0431

[Objective] This paper integrates chapter-level semantic graphs to improve the accuracy of dialogue consistency detection. [Methods] First, we used the pre-trained language model BERT to encode the dialogue context and knowledge base. Then, we constructed a dialogue chapter-level semantic graph containing coreference chains and abstract meaning representations. Third, we captured the semantic information of the constructed graph using a multi-relation graph convolutional network. Finally, we built multiple classifiers to predict dialogue inconsistency. [Results] We examined our new model on the CI-ToD benchmark dataset and compared its performance with the existing models. The proposed model’s F1 value improved by more than 1% over the optimal models. [Limitations] The proposed model cannot address the co-referential entity omission in dialogues. [Conclusions] Integrating various types of semantic information, such as coreference chains and abstract meaning representations, can effectively improve the performance of dialogue consistency detection.

Fusion of Organization Authority Files from Multiple Sources

Fan Yunman, Chen Ying, Tang Xiaoli

Data Analysis and Knowledge Discovery. 2024, 8(5): 29-37. https://doi.org/10.11925/infotech.2096-3467.2023.0475

[Objective] This paper aims to improve the selection and evaluation of the organization authority files (OAF) and address the mapping issues between OAF and redundant relationships. [Methods] First, we examined the existing OAF and related studies. Then, we constructed a fusion model with six steps: data collection and analysis, metadata framework fusion, organization relationship fusion, alias fusion, OAF data model construction, and verification of fusion results. Finally, we examined the new model using data from Dimensions, Scopus, and Web of Science. [Results] Our new model’s F1 value reached 0.97 or above for first-, second-, and third-level organizations, and Dimensions made the most significant contribution. We constructed an OAF containing 5,128 organizations. [Limitations] The organization relationship only included the parent-child relations. Cross-reference relations and the choice of standard organization names need to be studied. We also need to verify the proposed model with more data. [Conclusions] The new model could effectively integrate OAF from multiple sources.

Learning with Dual-graph for Concept Prerequisite Discovering

Xu Guolan, Bai Rujiang

Data Analysis and Knowledge Discovery. 2024, 8(5): 38-45. https://doi.org/10.11925/infotech.2096-3467.2023.0099

[Objective] This paper fully utilizes fine-grained information, such as the mention of concepts in learning resources, to more effectively identify prerequisite relationships. [Methods] First, we explored prerequisite relationships using a dual-graph neural network. Then, we constructed a concept semantic graph and a concept prerequisite graph based on the connections between learning resources and concepts. Third, we obtained the representations of concepts with a graph neural network and predicted the unknown prerequisite relationships. [Results] We extensively examined our model on four classic prerequisite relationship mining datasets. Our method achieved promising results, surpassing existing methods. It outperformed the second-best method by 0.059, 0.037, 0.073, and 0.042 regarding the F1 score on each dataset. [Limitations] This method shows weak predictive ability for concepts not appearing in the learning resources. [Conclusions] The proposed dual-graph neural network method can effectively leverage semantic information in learning resources to enhance prerequisite relationship mining.

Analyzing Compliance of Privacy Policy with Knowledge-Enhanced Deep Learning Model: From the Perspective of Integrity and Semantic Conflict

Zhu Hou, Luo Yingjia, Chen Menglei, Ouyang Jiaxiang, Xiao Ying, Cai Yinan

Data Analysis and Knowledge Discovery. 2024, 8(5): 46-58. https://doi.org/10.11925/infotech.2096-3467.2023.0446

[Objective] The paper aims to detect the compliance of privacy policies at the semantic level by integrating legal and regulatory knowledge. [Methods] We constructed a compliance evaluation index system from the integrity and semantic conflict perspective based on the Information Security Technology—Personal Information Security Specification (GB/T 35273-2020) and annotated the corpus. Then, we used the K-BERT model embedded with a knowledge graph to build an integrity evaluation model and a consistency evaluation model to detect semantic conflicts. Finally, we analyzed the compliance of app privacy policies in 15 fields with the integrity and consistency evaluation models. [Results] We constructed a Chinese privacy policy corpus that passed the Kendall's W test, and the F1 Score of the integrity and consistency evaluation models reached 0.92 and 0.87, respectively. We analyzed 1,762 app privacy policies and found that policies in the fields of Audio-Video Entertainment, Purchase Comparison, Financial Planning, Sports and Health, and Automotive are better in integrity, while those in the fields of Social Communication and Purchase Comparison are more semantically compliant with legal and regulatory requirements. [Limitations] The content in hyperlinks that may appear in a few privacy policies is ignored, which may cause bias in the compliance testing of some privacy policies. [Conclusions] The proposed model achieves the goal of automated analysis of privacy policy compliance in various fields, which is significant for China in enhancing the regulatory capacity for mobile apps handling user privacy data.
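
The abstract above mentions a Kendall's W test on the annotated corpus without giving details. As a rough illustration only, the textbook coefficient of concordance for m raters who each rank the same n items (no tie correction) is W = 12S / (m^2 (n^3 - n)), where S is the sum of squared deviations of the per-item rank sums from their mean. The function name `kendalls_w` and the toy rankings are assumptions for this sketch and are not taken from the paper.

```python
def kendalls_w(rankings):
    """Kendall's W for a list of rankings (one list of ranks per rater).

    Assumes every rater ranks the same n items with ranks 1..n and no ties.
    Illustrative helper only; not the authors' annotation procedure.
    """
    m = len(rankings)          # number of raters
    n = len(rankings[0])       # number of items
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_rank_sum = m * (n + 1) / 2
    s = sum((x - mean_rank_sum) ** 2 for x in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# Three raters in full agreement, then two raters in full disagreement.
agree = kendalls_w([[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]])
disagree = kendalls_w([[1, 2, 3], [3, 2, 1]])
```

W ranges from 0 (no agreement) to 1 (complete agreement), so a value close to 1 is what one would expect from a corpus whose annotators agree.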

Intelligent Completion of Ancient Texts Based on Pre-trained Language Models

Li Jiajun, Ming Can, Guo Zhihao, Qian Tieyun, Peng Zhiyong, Wang Xiaoguang, Li Xuhui, Li Jing

Data Analysis and Knowledge Discovery. 2024, 8(5): 59-67. https://doi.org/10.11925/infotech.2096-3467.2023.0163

[Objective] This paper proposes a new method based on pre-trained language models for completing ancient texts, utilizing representations obtained from pre-training models at different semantic levels and for simplified and traditional Chinese characters. The method constructs a mixture-of-experts system and a simplified-traditional Chinese fusion model to complete ancient texts. [Methods] We designed the mixture-of-experts system-based model for transmitted texts and constructed the simplified-traditional Chinese character fusion model for excavated literature. We fully integrated and explored the model’s capabilities in different scenarios to improve its ability to complete ancient texts. [Results] We examined the new models with self-constructed datasets of transmitted and excavated texts. The models achieved accuracy of 70.14% and 57.13% for the completion task. [Limitations] We only utilized natural language processing approaches. Future improvements involve leveraging multimodal techniques, combining computer vision with natural language processing, and integrating image and semantic information to yield better results. [Conclusions] The proposed models achieve high accuracy on the constructed datasets of ancient literature, providing a competitive solution for completing ancient texts.

Expert Recommendation in Q&A Community Based on Topic Interest and Domain Authority

Li Mingzhu, Mi Chuanmin, Gou Xiaoyi, Xiao Lin

Data Analysis and Knowledge Discovery. 2024, 8(5): 68-79. https://doi.org/10.11925/infotech.2096-3467.2023.0433

[Objective] This paper aims to enhance the accuracy of expert recommendations in Q&A communities based on topics of users’ historical Q&A texts and contextual information. [Methods] First, we combined the BERT model with the Labeled-LDA model. Then, we utilized the label information to vectorize users’ historical Q&A texts. Third, we identified contextual topics with dimension reduction and topic clustering. We also obtained the probability distribution of the expert’s topic interests. Fourth, based on the results of topic interest mining, we constructed the Topic Sensitive PageRank Algorithm (TSPR). We used the users’ quality weight to calculate their domain authority iteratively. From this, we proposed the TIDARank algorithm for expert recommendation. [Results] Based on the Stack Exchange public dataset, the BERT-LLDA model outperformed TF-IDF, BERT, and BERT-LDA models on silhouette coefficient (0.5756) and topic coherence (0.4766). The ACC@20 and MRR@20 of TIDARank reached 0.5807 and 0.2430, respectively, improved by 0.145 and 0.081 compared with the best-performing Bi-LSTM+TSPR baseline algorithm. [Limitations] We did not consider user activity in link analysis. [Conclusions] The BERT-LLDA model could optimize topic clustering for question-answering texts and improve the performances of expert recommendations in Q&A communities.
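
The general idea behind topic-sensitive PageRank can be sketched as ordinary PageRank whose teleport step is biased toward nodes with high weight on a topic. The sketch below is generic and is not the TIDARank algorithm: the function name, the asker-to-answerer link direction, the damping factor of 0.85, and the toy weights are all assumptions made for illustration.

```python
def topic_sensitive_pagerank(links, topic_weight, damping=0.85, iterations=100):
    """Power iteration for PageRank with a topic-biased teleport vector.

    links: {node: [nodes it points to]}, e.g. asker -> answerer (assumed).
    topic_weight: {node: non-negative interest weight for one topic}.
    """
    nodes = sorted(set(links)
                   | {v for targets in links.values() for v in targets}
                   | set(topic_weight))
    total = sum(topic_weight.get(n, 0.0) for n in nodes)
    if total <= 0:
        raise ValueError("topic_weight must contain a positive weight")
    teleport = {n: topic_weight.get(n, 0.0) / total for n in nodes}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {n: (1 - damping) * teleport[n] for n in nodes}
        for n in nodes:
            targets = links.get(n, [])
            if targets:
                share = damping * rank[n] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: hand its rank back via the teleport vector.
                for v in nodes:
                    new_rank[v] += damping * rank[n] * teleport[v]
        rank = new_rank
    return rank


toy_links = {"asker1": ["expertA"], "asker2": ["expertA", "expertB"]}
toy_topic = {"asker1": 0.1, "asker2": 0.1, "expertA": 0.6, "expertB": 0.2}
ranks = topic_sensitive_pagerank(toy_links, toy_topic)
```

In this toy graph `expertA` receives more links and more topic weight than `expertB`, so it ends up with the higher score; the scores sum to 1 because each iteration redistributes the full rank mass.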

Identifying Critical Nodes of Collaboration Networks Based on Improved K-shell Decomposition

Zhang Dayong, Men Hao, Su Zhan

Data Analysis and Knowledge Discovery. 2024, 8(5): 80-90. https://doi.org/10.11925/infotech.2096-3467.2023.0485

[Objective] This paper proposes an improved K-shell decomposition algorithm based on semi-local centrality, aiming to address the degradation issue of critical node identification. [Methods] First, we constructed a semi-local centrality index based on the nodes’ first-order neighbor information. Then, we determined the final key node set by recursive removal, with the semi-local information of the remaining and removed nodes. [Results] We examined our algorithm with six groups of cooperative networks. It could effectively eliminate the degradation issue of the original algorithm with high computational accuracy and low computational complexity. [Limitations] Due to the influence of network structures, the calculation accuracy of some sample networks was lower than that of the betweenness centrality algorithm. [Conclusions] The new algorithm can improve the stability of the collaboration network and identify key node sets in large-scale practical networks.
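
For context, the classic K-shell decomposition that the paper improves on can be sketched as repeated peeling of the lowest-degree nodes. This is the baseline algorithm only, not the semi-local variant described above; the function name `k_shell` and the toy collaboration graph are assumptions for illustration.

```python
from collections import defaultdict


def k_shell(edges):
    """Return {node: shell index} using classic K-shell decomposition."""
    adjacency = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adjacency[u].add(v)
            adjacency[v].add(u)

    degree = {n: len(neighbors) for n, neighbors in adjacency.items()}
    remaining = set(adjacency)
    shell = {}
    while remaining:
        # The current shell index is the smallest remaining degree.
        k = min(degree[n] for n in remaining)
        queue = [n for n in remaining if degree[n] <= k]
        while queue:
            node = queue.pop()
            if node not in remaining:
                continue
            remaining.discard(node)
            shell[node] = k
            # Removing a node lowers its neighbors' degrees, which may
            # pull them into the same shell.
            for neighbor in adjacency[node]:
                if neighbor in remaining:
                    degree[neighbor] -= 1
                    if degree[neighbor] <= k:
                        queue.append(neighbor)
    return shell


# A triangle (a, b, c) with a two-node tail (d, e) hanging off a.
shells = k_shell([("a", "b"), ("b", "c"), ("a", "c"), ("a", "d"), ("d", "e")])
```

Here `a`, `b`, and `c` all land in shell 2 even though `a` has more connections, which is a small instance of many nodes sharing one shell value and becoming indistinguishable.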

Multimodal Sentiment Analysis Model Integrating Multi-features and Attention Mechanism

Lyu Xueqiang, Tian Chi, Zhang Le, Du Yifan, Zhang Xu, Cai Zangtai

Data Analysis and Knowledge Discovery. 2024, 8(5): 91-101. https://doi.org/10.11925/infotech.2096-3467.2023.0026

[Objective] This paper proposes a multimodal sentiment analysis model integrating multiple features and attention mechanisms. It addresses the insufficient extraction of multimodal features and inadequate interaction of intra-modal and inter-modal information in existing models. [Methods] In multimodal feature extraction, we enhanced the features of body movements, gender, and age of individuals in the video modality. For the text modality, we integrated BERT-based character-level and word-level semantic vectors. Therefore, we enriched the low-level features of multimodal data. We also utilized self-attention and cross-modal attention mechanisms to integrate intra-modal and inter-modal information. We concatenated the modal features and employed a soft attention mechanism to allocate attention weight to each feature. Finally, we generated the sentiment classification results through fully connected layers. [Results] We examined the proposed model on the public dataset (CH-SIMS) and the Hot Public Opinion Comments Videos (HPOC) dataset constructed in this paper. Compared with the Self-MM model, our model improved the binary classification accuracy, tri-class classification accuracy, and F1 value by 1.83%, 1.74%, and 0.69% on the CH-SIMS dataset, and 1.03%, 0.94%, and 0.79% on the HPOC dataset. [Limitations] The person’s scene in the video may change constantly, and different scenes may contain different emotional information. Our model does not integrate the scene information of the person. [Conclusions] The proposed model enriches the low-level features of multimodal data and improves the effectiveness of sentiment analysis.

Sentiment Analysis of User Reviews Integrating Margin Sampling and Tri-training

Jiang Yiping, Zhang Ting, Xia Zhengming, Li Yuhua, Zhang Zhaotong

Data Analysis and Knowledge Discovery. 2024, 8(5): 102-112. https://doi.org/10.11925/infotech.2096-3467.2023.0519

[Objective] This paper proposes a sentiment analysis method for user reviews integrating margin sampling and tri-training. It addresses the issues of the large volume of user reviews, ambiguous sentiment tendencies, and short content. [Methods] First, we constructed a multi-class support vector machine based on a one-vs-all decomposition strategy. Then, we integrated a margin sampling strategy considering cosine similarity to create an initial set. Finally, we proposed a Tri-training algorithm combining a soft voting mechanism. [Results] The proposed algorithm improved the voting mechanism in the Tri-training algorithm, which further reduced the probability of misjudgment in sample classification by multiple classifiers. All categories achieved precision rates above 79%. [Limitations] The proposed method does not consider extracting information from multimedia data. [Conclusions] Compared with traditional and recently improved semi-supervised learning algorithms, the proposed algorithm demonstrates superior classification accuracy and efficiency.
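
The difference between majority (hard) voting and soft voting can be shown with a few lines of code. This is a generic sketch of the two voting rules, not the paper's Tri-training procedure; the helper names `hard_vote` and `soft_vote` and the probability values are assumptions chosen so that the two rules disagree.

```python
from collections import Counter


def hard_vote(probabilities):
    """Each classifier votes for its most probable class; majority wins."""
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probabilities]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(probabilities):
    """Average the class probabilities and pick the highest average."""
    n_classes = len(probabilities[0])
    averaged = [
        sum(p[c] for p in probabilities) / len(probabilities)
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=lambda c: averaged[c])
    return best, averaged


# Three classifiers, two classes: one is confident in class 0,
# two lean only slightly toward class 1.
predictions = [[0.90, 0.10], [0.45, 0.55], [0.45, 0.55]]
hard_label = hard_vote(predictions)
soft_label, soft_scores = soft_vote(predictions)
```

Hard voting picks class 1 because two of the three classifiers prefer it, while soft voting picks class 0 because the averaged probability for class 0 is higher. Taking confidence into account in this way is one plausible reason a soft vote could reduce misjudgments from weakly confident classifiers.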

Analyzing the Evolution of Internet Public Opinion Based on Short-Video Network

Wei Hongcheng, Zhu Hengmin, Wei Jing, Ye Dongyu

Data Analysis and Knowledge Discovery. 2024, 8(5): 113-126. https://doi.org/10.11925/infotech.2096-3467.2023.0506

[Objective] Short videos have become a new medium for spreading Internet public opinion. This paper proposes a method based on a short-video network to reveal how public opinion spread through short videos evolves. [Methods] First, we calculated the similarity of short videos’ titles, covers, and video content to construct a short-video network. Then, we detected topics from the network with hierarchical clustering and measured the sentiment of the videos’ audio and titles. We also classified video accounts into different categories of stakeholders. Finally, we analyzed the evolution of public opinion spread through short videos along three dimensions: topics, sentiment, and stakeholders. [Results] The multimodal features of videos and the relationships between videos can effectively describe the evolution of public opinion. The SSE value of short-video topics under the combination of “titles + video covers + video content” is 6.708, which is better than that of any single modality or other combination of modalities tested in the paper. [Limitations] The audio of short videos crawled from the Douyin platform includes background music, which introduces some deviation into the analysis of the audio modality. [Conclusions] The study helps to understand the evolution of topics and group emotions in public opinion spread through short videos, discover the concerns and sentiment evolution of different video accounts, and promptly regulate and guide public opinion.

Evaluating Innovation Quality of Academic Papers: Case Study of Pluripotent Stem Cells

Wang Xuefeng, Yu Huiyan, Zheng Sijia, Lei Ming

Data Analysis and Knowledge Discovery. 2024, 8(5): 127-138. https://doi.org/10.11925/infotech.2096-3467.2023.0242

[Objective] This study constructs an evaluation model for academic paper innovation quality. It explores a new method combining quantitative and qualitative approaches and promotes the progressive innovation of scientific research. [Methods] Balancing the innovative novelty and impact characteristics, we utilized the Doc2Vec algorithm to convert unstructured textual content into a vector space model. Then, we used cosine similarity to measure text content’s similarity. Simultaneously, we constructed a calculation method for the innovation impact index using the local citation network of the paper under evaluation. Third, we mapped the novelty and impact measurements onto a two-dimensional scatter plot. Finally, we constructed a model for evaluating the innovation quality of academic papers based on regional division. [Results] Empirical results on pluripotent stem cell technology showed that the proposed method is consistent with the F1000 recommendation results and can partly compensate for the deficiencies in the current evaluation of the innovation quality of academic papers. [Limitations] We only discussed the impacts of academic papers’ novelty and innovation. There are many other factors influencing the quality of academic paper innovation. [Conclusions] Our new model can provide quantitative data support for qualitative peer review and represents a beneficial exploration of quantitative evaluation of the innovation quality of academic papers.
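
The cosine-similarity step and the two-dimensional "regional division" can be illustrated with a small sketch. The vectors below are toy stand-ins for Doc2Vec embeddings, and the novelty definition (1 minus the highest similarity to any earlier paper), the 0.5 cut-offs, and the region labels are all assumptions for illustration rather than the paper's actual indices.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def novelty(target, prior_vectors):
    """Hypothetical novelty: 1 minus the highest similarity to prior work."""
    if not prior_vectors:
        return 1.0
    return 1.0 - max(cosine_similarity(target, p) for p in prior_vectors)


def region(novelty_score, impact_score, novelty_cut=0.5, impact_cut=0.5):
    """Assign a paper to one of four regions of the novelty-impact plane."""
    high_novelty = novelty_score >= novelty_cut
    high_impact = impact_score >= impact_cut
    if high_novelty and high_impact:
        return "high novelty, high impact"
    if high_novelty:
        return "high novelty, low impact"
    if high_impact:
        return "low novelty, high impact"
    return "low novelty, low impact"


prior = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
paper = [0.0, 0.0, 1.0]          # orthogonal to all prior toy vectors
score = novelty(paper, prior)
label = region(score, impact_score=0.2)
```

In practice the thresholds would need to be set from the data, for example from the distribution of scores across the evaluated papers.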

Conceptual Framework for Cultural Impacts of Scientific Research and REF2021 Cases

Zeng Yan, Zan Tingting, Yang Xiao, Qu Mingjian

Data Analysis and Knowledge Discovery. 2024, 8(5): 139-150. https://doi.org/10.11925/infotech.2096-3467.2023.0272

[Objective] This study analyzes the cultural impact cases of foreign scientific research. It provides references for the cultural value assessment of scientific achievements in China. [Methods] We established a conceptual framework for 16 cultural impact categories and four types of research outcomes. Then, we analyzed 29 cultural impact cases in medical science and technology from the UK REF2021 impact assessment. Finally, we conducted a structured analysis of case texts using the Notion AI tool. [Results] This study identified the rich diversity of cultural impact categories in medical science and technology. The most prominent category is “participation or application in various media or cultural carriers”. The cultural impact of different disciplines and types of research outcomes exhibits differences. [Limitations] The study has a limited number of cases. The conceptual framework for cultural impact needs further improvement. It does not include an analysis of the impact on cultural ideologies. [Conclusions] The proposed cultural impact framework helps to interpret case texts. The differences in cultural impact between different disciplines and types of research outcomes highlight the necessity and significance of categorical evaluation. The conceptual framework of cultural impacts needs to be expanded to better support decision-making for assessment.

Constructing Smart Consulting Q&A System Based on Machine Reading Comprehension

Wang Yihu, Bai Haiyan

Data Analysis and Knowledge Discovery. 2024, 8(5): 151-162. https://doi.org/10.11925/infotech.2096-3467.2023.0324

[Objective] This paper aims to improve the smart consulting systems to effectively answer academic questions. [Methods] We utilized deep learning, machine reading comprehension, data augmentation, information retrieval, and semantic similarity techniques to construct datasets and an academic knowledge question-answering system. Additionally, we designed a multi-paragraph recall metric to address the characteristics of academic literature and enhance retrieval accuracy with multidimensional features. [Results] Our new model’s ROUGE-L score reached 0.7338, with a question-answering accuracy of 88.65% and a multi-paragraph recall metric accuracy of 88.38%. [Limitations] We only examined the new model with single-domain content, which may limit the system’s performance in dealing with complex issues involving multiple domains. [Conclusions] The deep integration of machine reading comprehension technology with reference services can enhance the efficiency and sharing of academic resources and provide more comprehensive and accurate information support for researchers.

25 May 2024, Volume 8 Issue 5
