    Current Issue: Volume 7, Issue 10 (2023)
    Knowledge Diffusion Characteristics of Highly Disruptive Patents Based on Citation Network
    Pan Yiru, Mao Jin, Li Gang
    2023, 7 (10): 1-14.  DOI: 10.11925/infotech.2096-3467.2022.0927

    [Objective] This paper explores the knowledge diffusion of disruptive patents. [Methods] First, we used the disruption index to identify highly disruptive patents from the USPTO database. Then, we matched control-group patents by citation counts and co-citation coupling. Third, we analyzed the knowledge diffusion characteristics of highly disruptive patents in terms of citation distribution and citation network characteristics. Finally, we built a regression model to reveal the core features. [Results] The citation take-off point of highly disruptive patents appeared 1 to 3 years after authorization; the citation growth rate peaked 3 to 5 years after authorization and declined from the 6th year. Significant differences exist between highly disruptive and control-group patents in knowledge diffusion intensity, efficiency, and local and global knowledge diffusion capabilities. Metrics such as the first-citation year, the interval to the first citation peak, citation counts in the first peak year, average path length, average clustering coefficient, and connectivity of low-citation generations help identify highly disruptive patents. [Limitations] The disruption index fluctuates over time; because this study selects highly disruptive patents within a fixed time window, their disruption index values may not yet be stable. [Conclusions] The study reveals the knowledge diffusion characteristics of disruptive technologies from the perspective of patent citations and provides theoretical support for identifying disruptive technologies.
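
    As context for the disruption index mentioned above, the sketch below computes a CD-type disruption index for a focal patent from a citation network. It is a minimal illustration assuming a NetworkX digraph whose edges point from citing to cited patents; the toy graph and patent IDs are hypothetical, not the paper's data.

```python
import networkx as nx

def cd_index(g: nx.DiGraph, focal) -> float:
    """CD (disruption) index of a focal node in a citation digraph.

    Edges point from citing patent to cited patent. For every patent that
    cites the focal patent or any of its references:
      f_i = 1 if it cites the focal patent, b_i = 1 if it cites a reference.
    CD = mean(f_i - 2 * f_i * b_i)  ->  +1 fully disruptive, -1 consolidating.
    """
    references = set(g.successors(focal))          # what the focal patent cites
    citers_of_focal = set(g.predecessors(focal))   # patents citing the focal patent
    citers_of_refs = {p for r in references for p in g.predecessors(r)} - {focal}
    scope = citers_of_focal | citers_of_refs
    if not scope:
        return 0.0
    total = 0
    for p in scope:
        f = 1 if p in citers_of_focal else 0
        b = 1 if p in citers_of_refs else 0
        total += f - 2 * f * b
    return total / len(scope)

# Toy example: A is cited by X and Y; A cites R, which Y also cites.
g = nx.DiGraph([("A", "R"), ("X", "A"), ("Y", "A"), ("Y", "R")])
print(cd_index(g, "A"))  # (1 + (1 - 2)) / 2 = 0.0
```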

    Recognition and Visual Analysis of Interdisciplinary Semantic Drift
    Li Nan, Wang Bo
    2023, 7 (10): 15-24.  DOI: 10.11925/infotech.2096-3467.2022.0635

    [Objective] This paper analyzes the semantic drift of domain terms with machine learning techniques. It recognizes and visualizes interdisciplinary semantic drift and explores its patterns and causes. [Methods] We designed a framework for identifying and visualizing the semantic drift of domain terms with deep learning algorithms. The framework combines “SBERT model+word embedding optimization+hierarchical clustering” to identify interdisciplinary semantic drift, and uses Bokeh and principal component analysis to visualize the phenomenon. [Results] The proposed framework can accurately identify interdisciplinary semantic drift; the overall recognition accuracy (p) on the DT-Sentence dataset reached 86.15%. [Limitations] The framework needs to be verified on datasets from more disciplines. [Conclusions] This study supports the mining and visualization of semantic drift and lays a technical foundation for research on semantic evolution, understanding, and modeling.
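
    The “SBERT + hierarchical clustering” step can be illustrated with a minimal sketch: encode sentence-level contexts of a term with a Sentence-BERT model and cluster them so that usages from different disciplines fall into separate sense clusters. The sentence-transformers model name, example sentences, and cluster count below are illustrative assumptions, not the paper's configuration.

```python
from sentence_transformers import SentenceTransformer
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Sentences using the same term ("virus") in two disciplines (illustrative data).
sentences = [
    "The virus replicates inside the host cell.",                  # biology
    "Vaccines train the immune system to recognise the virus.",
    "The virus spread through the email attachment.",              # computing
    "Antivirus software quarantined the virus on the disk.",
]

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
embeddings = model.encode(sentences)

# Agglomerative clustering on cosine distance; contexts of a term that fall
# into different clusters across disciplines indicate semantic drift.
dist = pdist(embeddings, metric="cosine")
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2]: the two usages separate into two sense clusters
```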

    A Decentralized Classification Algorithm for Online Consumers Based on Improved LPA
    Liu Zhu, Qian Xiaodong
    2023, 7 (10): 25-36.  DOI: 10.11925/infotech.2096-3467.2022.0795
    [Objective] This paper proposes a classification algorithm based on an improved LPA model, aiming to improve label propagation distance and node similarity judgment when classifying consumers in decentralized e-commerce networks. [Methods] First, we introduced the cosine similarity formula to measure node similarity and constructed a similarity adjacency matrix, improving the measurement of node distance based on shared relationships in the LPA algorithm. We also introduced the principle of a back lookup table to conform to the locality characteristic and reduce time complexity. Second, we selected the initial center points with the degree centrality index and used the clustering coefficient index to update the label rules. We proposed a label propagation distance optimization formula so that the LPA algorithm meets the locality requirement. [Results] The category structure modularity Q of the improved LPA algorithm was 0.054 and 0.145 higher than that of the traditional LPA algorithm in networks with two neighbor similarity thresholds. The modularity Q value increased by up to 0.092 on data of different scales. [Limitations] The algorithm requires two parameters to be set and relies on the back lookup table; its time complexity grows quadratically with network size. [Conclusions] The improved LPA can more effectively limit label propagation, producing higher intra-category node similarity and lower inter-category node similarity. It is suitable for analyzing decentralized e-commerce consumer networks.
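
    A rough sketch of weighting label propagation by node similarity follows: each neighbour's vote is weighted by the cosine similarity of the two nodes' feature vectors, a simplified stand-in for the similarity adjacency matrix and propagation-distance rules described above. The graph and feature vectors are toy data, not the paper's algorithm.

```python
import numpy as np
import networkx as nx
from numpy.linalg import norm

def cosine(u, v):
    return float(u @ v / (norm(u) * norm(v) + 1e-12))

def lpa_cosine(g, features, iters=20, seed=0):
    """Label propagation where each neighbour's vote is weighted by the
    cosine similarity of the two nodes' feature vectors."""
    rng = np.random.default_rng(seed)
    labels = {n: n for n in g}            # every node starts as its own label
    nodes = list(g)
    for _ in range(iters):
        rng.shuffle(nodes)
        changed = False
        for n in nodes:
            votes = {}
            for m in g.neighbors(n):
                w = cosine(features[n], features[m])
                votes[labels[m]] = votes.get(labels[m], 0.0) + w
            if votes:
                best = max(votes, key=votes.get)
                if best != labels[n]:
                    labels[n], changed = best, True
        if not changed:
            break
    return labels

g = nx.karate_club_graph()
feats = {n: np.random.default_rng(n).random(8) for n in g}   # toy feature vectors
print(lpa_cosine(g, feats))
```
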
    Influencing Mechanisms of Triadic Closure in Pharmaceutical — Opportunity, Trust, and Motivation
    Wu Shengnan, Sun Yidan, Pu Hongjun, Dong Jizong, Gao Jian, Tian Ruonan, Li Lin
    2023, 7 (10): 37-49.  DOI: 10.11925/infotech.2096-3467.2022.0895

    [Objective] Opportunity, trust, and motivation are three mechanisms influencing the formation of triadic closure in directed social networks. This paper explores the mechanisms affecting triadic closure in the pharmaceutical domain, aiming to provide a foundation for drug knowledge discovery. [Methods] First, we used social network indices to measure the three mechanisms of opportunity, trust, and motivation. Then, we examined the Pearson correlation coefficients between these mechanisms and the clustering coefficient and the number of closed triads. Third, we introduced additional node attributes, network characteristics, and econometric methods to examine the relationships between node attributes/network characteristics and the three mechanisms. [Results] The opportunity of a node pair was strongly and positively correlated with the clustering coefficient of the edge between them (r1>0.5). Trust and motivation were strongly and positively correlated with the number of closed triads the node pair belongs to (r3, r5>0.5). The closeness centrality of node pairs negatively impacted opportunity and trust but positively impacted motivation. The betweenness centrality and eigenvector centrality of node pairs positively impacted opportunity, trust, and motivation. The average path length of the network negatively affected the opportunity of node pairs but positively impacted their trust and motivation. [Limitations] More literature needs to be included for empirical analysis in the future. [Conclusions] The proposed method illustrates the circumstances under which triadic closure forms for node pairs. Node attributes and network characteristics influence the three mechanisms, which provides a new direction for drug knowledge discovery.
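
    The measurement-and-correlation step can be sketched with standard tools: NetworkX centrality and clustering indices as proxies for the mechanisms, and SciPy's Pearson correlation. The network below is random toy data, and the mapping of indices to mechanisms is illustrative rather than the paper's exact operationalization.

```python
import networkx as nx
from scipy.stats import pearsonr

# Toy directed collaboration-like network (illustrative only).
g = nx.gnp_random_graph(60, 0.08, seed=42, directed=True)
und = g.to_undirected()

# Node-level indices of the kind used here to proxy the three mechanisms.
closeness = nx.closeness_centrality(g)
betweenness = nx.betweenness_centrality(g)
eigenvector = nx.eigenvector_centrality(und, max_iter=1000)
clustering = nx.clustering(und)   # how often a node's neighbour pairs are closed

nodes = list(g)
for name, index in [("closeness", closeness), ("betweenness", betweenness),
                    ("eigenvector", eigenvector)]:
    r, p = pearsonr([index[n] for n in nodes], [clustering[n] for n in nodes])
    print(f"{name} vs clustering: r = {r:.3f}, p = {p:.3f}")
```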

    Identifying Styles of Cross-Language Classics with Pre-Trained Models
    Zhang Yiqin, Deng Sanhong, Hu Haotian, Wang Dongbo
    2023, 7 (10): 50-62.  DOI: 10.11925/infotech.2096-3467.2022.0926

    [Objective] This paper uses pre-trained language models to explore the linguistic style of canonical texts, aiming to better reveal their connotations. [Methods] We compared the performance of five pre-trained language models and the deep learning model Bi-LSTM-CRF on a cross-lingual ancient Chinese-English corpus of canonical texts. The selected works include The Analects of Confucius, The Tao Te Ching, The Book of Rites, The Shangshu, and The Stratagems of the Warring States. We also examined the lexicon-based language style of the classics. [Results] The SikuBERT pre-trained language model achieved 91.29% precision, 91.76% recall, and an F1 score of 91.52% for recognizing canonical words. Compared with the original canonical words, the modern Chinese translations carried deeper semantic meaning, clearer ideographic referents, and more vivid and flexible word combinations. [Limitations] This study only chose specific pre-Qin classical texts and their translations. More research is needed to examine the models’ performance in other domains. [Conclusions] The pre-trained language model SikuBERT can effectively analyze language style differences in cross-lingual canonical texts, which promotes the dissemination of classic Chinese works.
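
    A minimal sketch of applying SikuBERT to token-level recognition of canonical words follows, using the Hugging Face transformers API. The checkpoint name SIKU-BERT/sikubert and the BIO label set are assumptions for illustration; the classification head is randomly initialized until fine-tuned on the annotated corpus.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Assumed checkpoint name for the SikuBERT pre-trained model on the Hub.
MODEL = "SIKU-BERT/sikubert"
LABELS = ["O", "B-WORD", "I-WORD"]   # illustrative BIO tags for word recognition

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=len(LABELS))

text = "学而时习之不亦说乎"            # opening line of The Analects
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits    # shape: (1, seq_len, num_labels)

# Before fine-tuning these predictions are random; after fine-tuning on the
# annotated ancient-Chinese corpus, decode the argmax tag per character.
pred = logits.argmax(-1)[0].tolist()
print([LABELS[i] for i in pred])
```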

    Extracting Product Features and Analyzing Customer Needs from Chinese Online Reviews with Hybrid Neural Network
    Shi Lili, Lin Jun, Zhu Guiyang
    2023, 7 (10): 63-73.  DOI: 10.11925/infotech.2096-3467.2022.0872

    [Objective] This study aims to extract product features and analyze customer needs from Chinese online reviews. [Methods] First, we proposed a hybrid neural network (HNN) to extract product features. Then, we applied the critical incident technique (CIT) and the analysis of complaints and compliments (ACC) within the Kano model to classify and prioritize product features. [Results] The F1 value of the HNN model reached 94.85%, on average 10.52 percentage points higher than its variant benchmark models and 9.47 percentage points higher than other leading models. [Limitations] The proposed model relies on supervised learning, and the need for labeled data restricts its application. [Conclusions] The proposed method improves the accuracy of product feature extraction and classifies and prioritizes product features based on customer needs, laying a foundation for managers to develop product improvement strategies.
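
    The ACC-based Kano classification step can be sketched as thresholding, per product feature, the rate of complaints in negative reviews and the rate of compliments in positive reviews. The threshold, feature names, and rates below are illustrative and not the paper's exact procedure.

```python
def kano_category(complaint_rate: float, compliment_rate: float,
                  threshold: float = 0.5) -> str:
    """Rough Kano classification from complaint/compliment rates (ACC-style).

    complaint_rate:  share of negative mentions of the feature that complain
    compliment_rate: share of positive mentions of the feature that compliment
    """
    high_c, high_p = complaint_rate >= threshold, compliment_rate >= threshold
    if high_c and high_p:
        return "one-dimensional"   # absence hurts, presence delights
    if high_c:
        return "must-be"           # only its absence is noticed
    if high_p:
        return "attractive"        # only its presence is noticed
    return "indifferent"

# Illustrative rates extracted from review mentions of each product feature.
features = {"battery life": (0.8, 0.3), "camera": (0.6, 0.7), "packaging": (0.1, 0.2)}
for name, (c, p) in features.items():
    print(name, "->", kano_category(c, p))
```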

    Multi-task & Multi-modal Sentiment Analysis Model Based on Aware Fusion
    Wu Sisi, Ma Jing
    2023, 7 (10): 74-84.  DOI: 10.11925/infotech.2096-3467.2022.1019

    [Objective] This paper develops a multi-task, multi-modal sentiment analysis model based on aware fusion, aiming to make full use of context information and to address modality-invariant and modality-specific issues. [Methods] We established multi-modal, text, acoustic, and image sentiment analysis tasks. We extracted their features using the BERT, wav2vec 2.0, and OpenFace 2.0 models, processed them with a self-attention layer, and sent them to the aware fusion layer for multi-modal feature fusion. Finally, we classified the single-modal and multi-modal information with Softmax. We also introduced a homoscedastic uncertainty loss function to assign weights to the different tasks automatically. [Results] Compared with the baseline methods, the proposed model improved accuracy and F1 by 1.59% and 1.67% on CH-SIMS, and by 0.55% and 0.67% on CMU-MOSI. The ablation experiment showed that the accuracy and F1 of multi-task learning were 4.08% and 4.18% higher than those of single-task learning. [Limitations] We need to examine the new model’s performance on large-scale datasets. [Conclusions] The model can effectively reduce noise and improve multi-modal fusion, and the multi-task learning framework achieves better performance.
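
    The homoscedastic-uncertainty task weighting can be sketched in PyTorch as a small module that learns one log-variance per task and combines the task losses accordingly (in the spirit of Kendall et al.). The number of tasks and the dummy loss values are illustrative.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Weights each task loss by a learned homoscedastic uncertainty:
    total = sum_i( exp(-s_i) * L_i + s_i ), with s_i = log(sigma_i^2)."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, losses):
        losses = torch.stack(losses)
        return torch.sum(torch.exp(-self.log_vars) * losses + self.log_vars)

# Illustrative: four tasks (multi-modal, text, acoustic, and visual sentiment).
criterion = UncertaintyWeightedLoss(num_tasks=4)
task_losses = [torch.tensor(0.9), torch.tensor(0.4),
               torch.tensor(0.7), torch.tensor(0.6)]
print(criterion(task_losses))
```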

    Detecting Multimodal Sarcasm Based on ADGCN-MFM
    Yu Bengong, Ji Xiaohan
    2023, 7 (10): 85-94.  DOI: 10.11925/infotech.2096-3467.2022.0987

    [Objective] This paper proposes a sarcasm detection model based on an affective dependency graph convolutional network with modality fusion (ADGCN-MFM). It aims to improve multimodal sarcasm detection by incorporating the sentiment information and syntactic dependencies of texts. [Methods] The model enhances the sentiment and syntactic information of the text modality using sentiment graphs and syntactic dependency graphs, applies graph convolutional networks to obtain text representations with rich sentiment semantics, and then fuses multimodal features through modality fusion. Finally, it uses a self-attention mechanism to filter redundant information and performs sarcasm detection on the fused representation. [Results] The new model’s accuracy reached 85.85%, which is 3.46%, 2.25%, 1.83%, and 0.95% higher than the baseline models HFM, Res-BERT, D&R Net, and IIMI-MMSD, respectively. Its F1 value reached 84.80%, 1.44% higher than the baseline models. [Limitations] More research is needed to validate the generalization and robustness of the model on more datasets. [Conclusions] The proposed model thoroughly exploits the sentiment and syntactic dependencies of the text and effectively detects multimodal sarcasm.
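
    The graph-convolution step over a sentence's syntactic dependency (or affective) graph can be sketched as a single GCN layer applied to token embeddings with a normalized adjacency matrix. The dimensions and the toy adjacency below are illustrative, not the ADGCN-MFM architecture itself.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        a = adj + torch.eye(adj.size(-1))            # add self-loops
        d_inv_sqrt = a.sum(-1).clamp(min=1).pow(-0.5)
        a_norm = d_inv_sqrt.unsqueeze(-1) * a * d_inv_sqrt.unsqueeze(-2)
        return torch.relu(self.linear(a_norm @ h))

# Five tokens with 768-d embeddings and a toy dependency adjacency matrix.
tokens = torch.randn(5, 768)
adj = torch.tensor([[0, 1, 0, 0, 0],
                    [1, 0, 1, 1, 0],
                    [0, 1, 0, 0, 0],
                    [0, 1, 0, 0, 1],
                    [0, 0, 0, 1, 0]], dtype=torch.float)
print(GCNLayer(768, 128)(tokens, adj).shape)  # torch.Size([5, 128])
```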

    Examining Topics and Sentiments of Chronic Disease Patients’ Online Reviews — Case Study of “Sweet Homeland”
    Yu Jiaqi, Zhao Doudou, Liu Rui
    2023, 7 (10): 95-108.  DOI: 10.11925/infotech.2096-3467.2022.0891

    [Objective] This paper constructs a model for collaborative topic-sentiment mining, aiming to better understand chronic disease patients at different stages. [Methods] First, we added sentiment and time features to the LDA model to create the dUTSU (dynamic unsupervised topic and sentiment unification) model. Then, we retrieved posts by diabetes patients from an online health community. Finally, we assessed the dUTSU model’s performance through topic-sentiment analysis and topic-sentiment evolution analysis. [Results] The dUTSU model achieved better perplexity, average topic similarity, and sentiment classification accuracy than the JST, ASUM, and UTSU models. It identified 15 topics and captured trending topics, sentiments, and their intensity across seven distinct periods, including the disease diagnosis stage and the complication stage, and revealed the topic-sentiment evolution over time. [Limitations] The experiment only used reviews from diabetes patients and did not consider patients’ geographical locations, personal attributes, or social relationships. [Conclusions] The dUTSU model can effectively perform collaborative topic-sentiment mining on reviews from patients with chronic diseases. The findings can serve as valuable references for online health communities, medical institutions, and patients in delivering health services.
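
    A much simpler baseline than dUTSU conveys the idea of topic-sentiment evolution analysis: slice posts by period, fit a standard LDA per slice, and average each post's sentiment score over its dominant topic. The gensim-based sketch below uses invented posts and sentiment scores and is not the joint dUTSU model.

```python
from gensim import corpora
from gensim.models import LdaModel

# (period, tokenised post, sentiment score in [-1, 1]) - illustrative data.
posts = [
    ("diagnosis",    ["blood", "sugar", "test", "doctor"],  -0.4),
    ("diagnosis",    ["insulin", "dose", "confused"],       -0.6),
    ("complication", ["foot", "pain", "numb", "worried"],   -0.8),
    ("complication", ["exercise", "diet", "improved"],       0.5),
]

for period in ["diagnosis", "complication"]:
    docs = [(toks, s) for p, toks, s in posts if p == period]
    dictionary = corpora.Dictionary(t for t, _ in docs)
    corpus = [dictionary.doc2bow(t) for t, _ in docs]
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=1)
    # Attach each post's sentiment to its dominant topic and average per topic.
    topic_sent = {}
    for bow, (_, s) in zip(corpus, docs):
        topic = max(lda.get_document_topics(bow), key=lambda x: x[1])[0]
        topic_sent.setdefault(topic, []).append(s)
    print(period, {t: sum(v) / len(v) for t, v in topic_sent.items()})
```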

    Extracting Value Elements and Constructing Index System for Calligraphy Works Based on Hyperplane-BERT-Louvain Optimized LDA Model
    Pan Xiaoyu, Ni Yuan, Jin Chunhua, Zhang Jian
    2023, 7 (10): 109-118.  DOI: 10.11925/infotech.2096-3467.2022.0915

    [Objective] This paper uses big data and artificial intelligence to identify the value elements of calligraphy works and provides technical support for their trading activities. It addresses the lack of standards in the assessment of calligraphy works. [Methods] First, we combined the hyperplane algorithm and the BERT model to preprocess calligraphy documents by eliminating stop words and expanding semantics, creating an optimized corpus with high recognition. Second, we constructed a complex semantic network for the calligraphy literature and introduced the Louvain algorithm to determine the optimal number of topics by maximizing the modularity of the community network. Finally, we developed a new method based on “Hyperplane-BERT-Louvain-LDA” (HBL-LDA) to construct an assessment index system of calligraphy value. [Results] Compared with LDA, the precision and F value of topic recognition with HBL-LDA increased by 45.00% and 29.46%, respectively. The average topic quality rate was reduced by 0.96, with more high-quality topics identified. We also used regression models to verify the evaluation index system with representative calligraphy works, with a highest accuracy rate of 84.00%. [Limitations] This paper only constructed an evaluation system for calligraphy works, which cannot be directly applied to other artworks. The BERT model lacks topic semantic information, which makes it challenging to expand similar feature words. [Conclusions] The new model for calligraphy value evaluation provides new directions for constructing index systems in other fields.
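
    The “Louvain determines the topic number” step can be sketched as follows: build a word co-occurrence network, detect communities with the Louvain algorithm, and pass the community count to LDA as the topic number. The sketch assumes networkx 2.8+ (louvain_communities) and gensim, with invented documents rather than the paper's corpus.

```python
import itertools
import networkx as nx
from networkx.algorithms.community import louvain_communities
from gensim import corpora
from gensim.models import LdaModel

docs = [
    ["brush", "stroke", "ink", "rhythm"],
    ["seal", "script", "stroke", "structure"],
    ["collector", "auction", "price", "provenance"],
    ["auction", "market", "price", "value"],
]  # illustrative tokenised calligraphy documents

# Word co-occurrence network: edge weight = number of documents sharing both words.
g = nx.Graph()
for d in docs:
    for u, v in itertools.combinations(sorted(set(d)), 2):
        w = g.get_edge_data(u, v, {"weight": 0})["weight"]
        g.add_edge(u, v, weight=w + 1)

# Louvain maximises modularity; the community count serves as the topic number.
communities = louvain_communities(g, weight="weight", seed=1)
k = len(communities)

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = LdaModel(corpus, num_topics=k, id2word=dictionary, passes=20, random_state=1)
print(k, lda.print_topics())
```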

    Analyzing Social Effects of Olympic Games on Winter Sports with Baidu Index
    Zhang Yan, Wang Ziwei, Ye Pengqian
    2023, 7 (10): 119-130.  DOI: 10.11925/infotech.2096-3467.2022.1240

    [Objective] This paper explores the impact of the Olympic Games on public attention to ice and snow sports, aiming to support the promotion of winter sports. [Methods] We collected data from the Baidu Index to conduct Wilcoxon rank-sum tests and correlation analysis. Then, we built a two-way fixed effects model with the LSDV method to test the related hypotheses. [Results] The Olympic Winter Games Beijing significantly increased the Baidu Index of ice and snow sports and related equipment and venues. Winter Olympic medals, especially gold medals, significantly increased the Baidu Index of the associated sports. Public preferences and star athletes were positively related to attention to winter sports and also amplified the social impact of the Winter Olympics and its medals. [Limitations] We only examined the Baidu Index around the Olympic Winter Games Beijing, which needs to be expanded. [Conclusions] Major winter games and their results, public preferences, and star athletes can promote winter sports in China.
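
    The statistical side can be sketched with SciPy's rank-sum test and a least-squares dummy variable (LSDV) regression in statsmodels, where sport and time dummies absorb the two-way fixed effects. The panel data below are randomly generated placeholders, not Baidu Index data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import ranksums

rng = np.random.default_rng(0)

# Wilcoxon rank-sum test: search index of a sport before vs. after the Games.
before = rng.normal(100, 15, 60)
after = rng.normal(130, 20, 60)
print(ranksums(before, after))

# LSDV two-way fixed effects: sport and week dummies absorb unit/time effects.
panel = pd.DataFrame({
    "search_index": rng.normal(100, 20, 200),
    "olympics":     rng.integers(0, 2, 200),      # 1 during the Games period
    "sport":        rng.choice(["skiing", "skating", "curling", "hockey"], 200),
    "week":         rng.integers(1, 11, 200),
})
model = smf.ols("search_index ~ olympics + C(sport) + C(week)", data=panel).fit()
print(model.params["olympics"])
```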

    Constructing and Evaluating Chinese Reading Comprehension Corpus for Anti-Terrorism Field
    Gao Feng, Yang Zihang, Hou Jin, Gu Jinguang, Cheng Junjun
    2023, 7 (10): 131-143.  DOI: 10.11925/infotech.2096-3467.2022.0334

    [Objective] This paper develops a corpus for Chinese machine reading comprehension in the security field (SecMRC), providing professional data support for related studies. [Methods] First, we constructed a keyword search engine to retrieve domain news. Then, we automatically generated questions for pre-annotation with the ERNIE-GEN model. Third, we created a domain vocabulary using temporal feature words and domain keyword-matching algorithms to support accurate word segmentation. Finally, we formed the final dataset with manually annotated question-answer pairs and proposed a new baseline model (SecMT5). [Results] The dataset contains 2,100 anti-terrorism and security-related news articles, 7,300 extractive question-answer pairs, 2,100 generative Q&A pairs, and 4,796,264 characters. We tested advanced reading comprehension models on SecMRC: the F1 of the extractive task reached 72.05% (6.13% higher than the baseline model), and the average ROUGE-L of the generative task was 37.62%. Both remain significantly below human performance. [Limitations] The number of Q&A pairs in the dataset needs to be expanded, and their difficulty and diversity need to be improved. [Conclusions] The SecMRC dataset highlights domain knowledge and is challenging. It can effectively support research on machine reading comprehension, and its construction method can be applied in other fields.
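
    The extractive-task F1 quoted above is conventionally the token-overlap F1 between predicted and gold answer spans; for Chinese it is usually computed at the character level. A minimal sketch with invented strings:

```python
from collections import Counter

def span_f1(prediction: str, gold: str) -> float:
    """Character-level overlap F1 between predicted and gold answer spans
    (the usual extractive machine-reading-comprehension metric for Chinese)."""
    pred_chars, gold_chars = list(prediction), list(gold)
    common = Counter(pred_chars) & Counter(gold_chars)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(gold_chars)
    return 2 * precision * recall / (precision + recall)

print(span_f1("首都国际机场", "北京首都国际机场"))  # ~0.857
```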

    Constructing Multimodal Corpus of Chinese Vocabulary for Sign Language Linguistics
    Zhang Yanqiong, Zhu Zhaosong, Zhao Xiaochi
    2023, 7 (10): 144-155.  DOI: 10.11925/infotech.2096-3467.2022.1262

    [Objective] This paper extracts and organizes knowledge from multimodal sign language resources and constructs a corpus for related research, meeting the public’s urgent demand for sign language knowledge. [Context] The new multimodal corpus is suitable for mining sign language knowledge and addresses the low informatization level, disordered organization, and difficult utilization of sign language resources. [Methods] First, we constructed a multimodal feature annotation system for sign language vocabulary. Second, we formulated a feature coding scheme for the vocabulary and implemented multi-level annotation. Finally, we established a graph model for sign language vocabulary and used a Neo4j database for storage and visualization. [Results] The vocabulary data come from the national sign language vocabulary corpus. Multimodal annotation of over 10,000 sign language vocabulary items has been completed, realizing the whole process of constructing a multimodal corpus. [Conclusions] The new corpus enables knowledge retrieval of hand shape, movement, expression, and posture, which greatly improves the usability of the sign language corpus.
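
    The Neo4j storage step can be sketched with the official neo4j Python driver (v5 API): each sign becomes a node linked to nodes for its hand shape, movement, and expression features. The connection details, node labels, and properties below are illustrative assumptions, not the paper's schema, and running the sketch requires a live Neo4j instance.

```python
from neo4j import GraphDatabase

# Connection details are placeholders.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def add_sign(tx, word, handshape, movement, expression):
    """Create a sign-vocabulary node linked to its multimodal feature nodes."""
    tx.run(
        "MERGE (s:Sign {word: $word}) "
        "MERGE (h:Handshape {name: $handshape}) "
        "MERGE (m:Movement {name: $movement}) "
        "MERGE (e:Expression {name: $expression}) "
        "MERGE (s)-[:HAS_HANDSHAPE]->(h) "
        "MERGE (s)-[:HAS_MOVEMENT]->(m) "
        "MERGE (s)-[:HAS_EXPRESSION]->(e)",
        word=word, handshape=handshape, movement=movement, expression=expression,
    )

with driver.session() as session:
    session.execute_write(add_sign, "谢谢", "extended thumb", "downward nod", "smile")
driver.close()
```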
