[Objective] This paper quantifies the diffusion of technology topics based on patent data, aiming to predict their dissemination in advance. [Methods] First, we constructed the technology diffusion relationships with patent citation data. Then, we constructed a comprehensive measuring index for technology diffusion based on its strength, speed, and breadth. Finally, we built a model measuring technology topic diffusion. [Results] We examined our model with 100 topics in the graphene field, and it quickly identified topics with high comprehensive diffusion scores. We also found the diffusion directions of graphene patents. [Limitations] We only normalized the three measuring indices for technology diffusion with Min-Max Scaling, and their weights were not optimized for specific applications. [Conclusions] The proposed model could help us gather intelligence effectively with the help of multiple measurements.
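The Min-Max normalization and weighted aggregation of the three indices can be sketched as follows. This is a minimal illustration only: the equal default weights and function names are assumptions, not the paper's actual weighting scheme.

```python
def min_max_scale(values):
    """Min-Max scaling of one diffusion index to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant index: no spread to normalize
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def composite_diffusion_score(strength, speed, breadth, weights=(1/3, 1/3, 1/3)):
    """Weighted sum of the three normalized indices per topic.

    Equal weights are a placeholder; the paper notes the weights
    were not optimized for specific applications.
    """
    s, v, b = (min_max_scale(x) for x in (strength, speed, breadth))
    w1, w2, w3 = weights
    return [w1 * a + w2 * c + w3 * d for a, c, d in zip(s, v, b)]
```

Topics can then be ranked by the composite score to surface high-diffusion topics quickly.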
[Objective] This paper explores the impacts of message framing (gain-loss framing and time framing) on individuals’ intentions to change health behaviors in the pre-contemplative stage, aiming to identify the information guiding people’s health behavioral intentions. [Methods] First, we designed four types of health information based on message framing theory. Then, we used an eye-tracking experiment to record participants’ eye movements while reading health information, and analyzed these data with ANOVA. Third, we used semi-structured interviews to explore the mechanism between message framing and health behavioral intentions. Finally, we identified participants’ health information needs. [Results] Loss framing information attracted more attention than gain framing information (Total Fixation Duration: 49.456>32.633, P=0.045; Average Fixation Duration: 0.314>0.223, P=0.003). There was no significant difference between short-term framing and long-term framing information (Total Fixation Duration: P=0.524; Fixation Count: P=0.291; Average Fixation Duration: P=0.240). Health information influenced behavioral intention by increasing perceived risks, perceived benefits, and self-efficacy. Six types of health information acquired by individuals in the pre-contemplative stage were identified through the interview data. [Limitations] The diversity and number of subjects, as well as the types of message framing, need to be expanded. [Conclusions] This study provides a reference for future personalized health information intervention studies.
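The framing comparisons above rest on one-way ANOVA over fixation metrics. As a minimal sketch of the underlying statistic (the study presumably used standard statistical software; this is for illustration only), the F value compares between-group to within-group variance:

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over k groups of observations.

    F = (between-group mean square) / (within-group mean square);
    larger F means group means differ more than chance variation suggests.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

The reported P values would then come from the F distribution with (k-1, n-k) degrees of freedom.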
[Objective] This paper proposes a new method to explore consumer psychology and preferences based on online comments, aiming to address the difficulties of drawing personality-based consumer portraits. [Methods] First, we mapped the relationships among experience levels, product features, and aspect words. Then, we extracted aspect words from user comments to examine users’ attention at different experience levels. Third, we categorized users by their instinctual, behavioral, and reflective preferences. Finally, we utilized deep learning-based aspect sentiment analysis to examine users’ preferences for products. [Results] We evaluated our new model with more than 900,000 reviews of mobile phones from JD.com. Among the users, those with instinctual preferences accounted for 41.60%, which was higher than behavioral preferences (33.01%) and reflective preferences (25.39%). We also analyzed their purchasing behaviors from the perspectives of brands and prices. [Limitations] We only collected review data on mobile phones sold by JD.com. More products and platforms need to be examined with our new model in the future. [Conclusions] The new model for creating user portraits can identify the preferences of different groups of consumers.
[Objective] This paper constructs a clustering model for the sentimental time series of bullet screen texts, aiming to predict video communication effects. [Methods] First, we used Word2Vec to expand the sentiment dictionary and optimize the performance of the sentiment classifiers. Then, we added comprehensive weights to make the sentiment sequences smooth and stable. Finally, we constructed the SBD measurement and K-shape clustering model to analyze sentiment sequence patterns, characteristics, and communication effects. [Results] The optimized model had F1 values of 0.89 and 0.79 on the multi-classification indicators (subjective/objective and polarity classification). The performance of the subjective/objective classifier was improved by 123%. Compared with existing time series clustering algorithms using multiple distance measures, the proposed model generated better Davies-Bouldin Index and Silhouette Index values. [Limitations] The new algorithm did not fully utilize Internet buzzwords or sentences without central adjectives. The description and interpretation of the sentimental time series clustering results need to be further explored. [Conclusions] The proposed model could reduce the irregular noise and timing phase shift of bullet screen texts, and the clustering results are the basis for identifying different communication effects.
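The shape-based distance (SBD) underlying K-shape is defined as one minus the maximum normalized cross-correlation over all shifts, which is what makes the clustering robust to the timing phase shifts mentioned above. A simplified O(n·m) pure-Python sketch (production implementations such as K-shape's use FFT-based cross-correlation and z-normalized series):

```python
import math

def sbd(x, y):
    """Shape-based distance: 1 - max normalized cross-correlation.

    Returns 0 for identical sequences and for sequences that are
    pure shifted copies of one another (phase-shift invariance).
    """
    n, m = len(x), len(y)
    denom = math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in y))
    if denom == 0:
        return 1.0
    # Cross-correlation of x with y shifted by s, for every shift s.
    best = max(
        sum(x[i] * y[i - s] for i in range(max(0, s), min(n, m + s)))
        for s in range(-(m - 1), n)
    )
    return 1.0 - best / denom
```

K-shape then alternates between assigning series to the cluster with the smallest SBD and recomputing shape centroids.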
[Objective] This study improves the traditional session sequence recommendation algorithm, whose one-time modeling technique can neither represent products’ comprehensive information nor capture users’ global and short-term interests. [Methods] First, we constructed a directed session graph from the historical sessions and used a GNN to learn node information representations and enrich the node embeddings. Then, we captured users’ global and short-term interests in session sequences with Bi-GRU and an attention mechanism to generate recommendation lists. [Results] We examined our new algorithm with the Yoochoose and Diginetica datasets. Compared with the suboptimal model, the Mean Reciprocal Rank of our algorithm improved by 1.02% and the precision by 2.11%. [Limitations] The proposed model did not work well with long sequences. [Conclusions] Our new algorithm can more effectively model user behavior sequences, predict users’ possible actions, and improve recommendation lists.
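Mean Reciprocal Rank, the evaluation metric reported above, is the average of 1/rank of the ground-truth next item in each generated recommendation list. A minimal sketch (illustrative only; function names are ours, not the paper's):

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """MRR over sessions: mean of 1/rank of each target item
    in its recommendation list (0 when the target is absent)."""
    total = 0.0
    for preds, target in zip(ranked_lists, targets):
        if target in preds:
            total += 1.0 / (preds.index(target) + 1)
    return total / len(targets)
```

Session-based recommenders typically report MRR@K and precision/recall@K over a fixed list length K.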
[Objective] Based on the user-review-shop (URS) relationship and fake degrees, this paper proposes a model based on user deviation, aiming to effectively identify fake accounts. [Methods] First, we measured users’ deviations in contents and behaviors with the means method, JS divergence, and KL divergence, respectively. Then, we constructed the URS-FDIRM model to identify fake users with experimental data from mafengwo.com. [Results] The proposed model effectively measured users’ deviations in contents and behaviors. The F1 value of the URS-FDIRM model reached 92.57%. [Limitations] This method mainly uses conventional measurements to extract the deviation indices and did not include more deviation measurements based on user behaviors. [Conclusions] The proposed method could help us reveal the false relationships among users, reviews, and shops, and monitor abnormal user behaviors.
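The KL and JS divergences used for the deviation measurements compare a user's distribution (e.g. of review contents or behaviors) against a reference distribution. A minimal sketch of the standard definitions (how the paper builds the distributions themselves is not specified here):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; asymmetric, requires
    q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2,
    defined via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

A large divergence between a user's distribution and the population's flags the account as a candidate fake user.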
[Objective] This paper addresses the missing semantics of existing document clustering algorithms. [Methods] Based on the traditional deep variational inference algorithm, we proposed a Semantic Supplemented Variational Text Clustering Model (SSVAE), which adds text semantic information to the clustering process. [Results] The SSVAE effectively addressed the missing semantics issue. Compared with the best existing models, SSVAE’s NMI on the BBC, Reuters-1500, Abstract, Reuters-10k, and 20news-l datasets improved by 8.92%, 7.43%, 8.73%, 4.80%, and 6.14%, respectively. [Limitations] During the semantic supplementation process, the SSVAE inevitably introduced some noise, which affected the clustering performance. [Conclusions] The new SSVAE model effectively improves the accuracy of text clustering.
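The NMI scores above measure agreement between predicted clusters and true labels. As a minimal sketch of the standard definition (mutual information normalized by the geometric mean of the two entropies; production code would use scikit-learn's `normalized_mutual_info_score`):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two label assignments,
    in [0, 1]; 1 means identical partitions up to relabeling."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log(n * c / (ca[a] * cb[b]))
             for (a, b), c in cab.items())
    ha = -sum(c / n * math.log(c / n) for c in ca.values())
    hb = -sum(c / n * math.log(c / n) for c in cb.values())
    return mi / math.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

Because NMI is invariant to cluster relabeling, it is a standard choice for evaluating unsupervised text clustering.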
[Objective] This paper addresses the input length limit of pre-trained language models, aiming to improve the accuracy of long text classification. [Methods] We designed an algorithm that uses the punctuation in natural texts to segment sentences and feed them into the pre-trained language model in order. Then, we compressed and encoded the classification feature vectors with the average pooling method and a weighted attention mechanism. Finally, we examined the new algorithm with multiple pre-trained language models. [Results] Compared to methods that directly truncate the text contents, the classification accuracy of the proposed method improved by up to 3.74%. After applying the attention mechanism, the classification F1-scores on two datasets increased by 1.61% and 0.83%, respectively. [Limitations] The improvements are not significant on some pre-trained language models. [Conclusions] The proposed model can effectively classify long texts without changing the pre-trained language model’s architecture.
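The punctuation-based segmentation and average pooling steps can be sketched as follows. This is our own illustrative approximation, not the paper's code: the punctuation set, character-based `max_len`, and greedy packing are assumptions, and in practice each chunk would be encoded by the pre-trained model before pooling.

```python
import re

def segment_by_punct(text, max_len=128):
    """Split text at sentence-ending punctuation, then greedily pack
    sentences into chunks of at most max_len characters (a single
    over-long sentence is kept whole as its own chunk)."""
    sents = [s for s in re.split(r"(?<=[.!?;。！？；])\s*", text) if s]
    chunks, cur = [], ""
    for s in sents:
        if cur and len(cur) + 1 + len(s) > max_len:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip() if cur else s
    if cur:
        chunks.append(cur)
    return chunks

def average_pool(vectors):
    """Mean-pool equal-length per-chunk feature vectors into one
    document vector for the classifier head."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]
```

The weighted attention variant would replace the uniform 1/n weights in `average_pool` with learned per-chunk weights.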
[Objective] This paper summarizes interactive contents from new media for government affairs based on topic clustering, aiming to help the government effectively control public opinion events. [Methods] First, we analyzed the textual features of the interactive contents. Then, we generated abstracts of the contents with the Top2Vec, TextRank, and Transformer-Copy algorithms. [Results] The proposed model’s ROUGE-1, ROUGE-2, and ROUGE-L values reached 22.05%, 6.93%, and 20.96%, respectively, which were better than those of the Seq2Seq and Seq2Seq-Attention models. [Limitations] We only examined the new model with interactive contents on 10 draft laws and regulations from Sina Microblog. [Conclusions] The proposed method can summarize the topics and public opinion of specific events.
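The ROUGE-N scores above measure n-gram overlap between generated and reference summaries. A minimal sketch of ROUGE-1 F-measure (illustration only; published evaluations use the standard ROUGE toolkit, and ROUGE-L uses longest common subsequence instead of unigrams):

```python
from collections import Counter

def rouge_1_f(candidate, reference):
    """Unigram-overlap F1 between a generated summary and a reference,
    with counts clipped to the reference (multiset intersection)."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 is the same computation over bigrams rather than unigrams.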
[Objective] This paper proposes a distantly supervised model to extract medical entity relationships based on medical domain-specific knowledge, aiming to reduce the data labeling costs and potential errors of the existing models. [Methods] First, we used a multi-instance strategy to reduce the noise of the distantly supervised labeled data. Then, we utilized a pre-trained language model (MedicalBERT) to encode the labeled texts. Third, using the descriptions of the entities in the medical knowledge base, we provided supervision signals for medical relationship extraction and improved the accuracy of the semantic encoding. [Results] Compared with the existing models, the performance of our new algorithm was up to 5.4% higher in Precision, 2.5% higher in Recall, and 4.1% higher in F1. In addition, the F1-score for the complicated extraction tasks reached 93.8%. [Limitations] More research is needed to examine the proposed method with more sentences. [Conclusions] Our new model could effectively extract medical entity relationships and benefit related research.
[Objective] This study investigates the similarities and differences between the image projected by official marketing activities and the image perceived from user-generated contents. [Methods] First, we retrieved the marketing data of festival events and related user-generated contents with a web crawler. Then, we used grounded theory to construct a model for festival event images. Third, we utilized compositional distance analysis to examine the distance between the projected and perceived images. Finally, we collected quantitative data to evaluate the proposed model and the compositional distance analysis results. [Results] We found that festival images had three dimensions: event, social, and location. The compositional distance of the location dimension was the largest, while that of the social dimension was the smallest. [Limitations] We only collected data from the Strawberry Music Festival. More research is needed to examine the proposed model with other festival events. [Conclusions] This research provides an effective data-driven method for tracking and analyzing official marketing strategies, i.e., the difference between the projected and perceived images.
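Compositional distance analysis typically uses the Aitchison distance, which compares proportion vectors (e.g. category shares of image attributes) in log-ratio space. A minimal sketch under that assumption (the paper's exact distance formulation is not stated here):

```python
import math

def clr(v):
    """Centered log-ratio transform of a strictly positive composition."""
    g = math.exp(sum(math.log(x) for x in v) / len(v))
    return [math.log(x / g) for x in v]

def aitchison_distance(x, y):
    """Euclidean distance between two compositions in clr space;
    invariant to overall scale, so raw counts and proportions agree."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))
```

Computing this distance per dimension (event, social, location) between the projected and perceived attribute shares yields the ranking reported in the results.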
[Objective] This paper integrates tax-related data from multiple sources and uses machine learning methods to identify illegal corporate tax evasion. [Methods] First, we used web scraping, text mining, and other methods to collect business financial data, executive information, and media coverage of the corporations. Second, we used the random forest method for feature selection and established indicators for the candidate companies. Third, we built a discriminatory model with multi-task sparse structure learning based on an improved focal loss function. Finally, we trained the model with different types of tax audits to identify the needed candidates. [Results] We examined our model with real-world datasets and found it performed well for various applications. Its mean recall rate reached 0.8309, which was 0.1351 and 0.1033 higher than those of the logistic method and traditional multi-task sparse structure learning. [Limitations] The model needs to be examined with datasets from unlisted companies. [Conclusions] The new model could identify target enterprises with various dishonest tax evasion behaviors. This study provides new directions for smart tax audits by the government.
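The standard binary focal loss that the improved loss builds on down-weights easy, well-classified examples so training focuses on the hard (rare) tax-evasion cases. A minimal sketch of that standard form only; the paper's specific improvement is not reproduced here:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction p in (0, 1) with label y in {0, 1}.

    FL(pt) = -alpha_t * (1 - pt)^gamma * log(pt); with gamma = 0 and
    alpha = 1 this reduces to plain cross-entropy.
    """
    pt = p if y == 1 else 1 - p
    a = alpha if y == 1 else 1 - alpha
    return -a * (1 - pt) ** gamma * math.log(pt)
```

The `(1 - pt)^gamma` factor shrinks the loss of confident correct predictions, which helps with the heavy class imbalance between compliant and evading companies.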