[Objective] This paper quantifies the diffusion of technology topics based on patent data, aiming to predict their dissemination in advance. [Methods] First, we constructed the technology diffusion relationships with patent citation data. Then, we constructed a comprehensive measuring index for technology diffusion based on its strength, speed, and breadth. Finally, we built a model measuring technology topic diffusion. [Results] We examined our model with 100 topics in the graphene field, and it quickly identified topics with high comprehensive diffusion scores. We also found the diffusion directions of graphene patents. [Limitations] We only normalized the three measuring indices for technology diffusion with Min-Max Scaling, and their weights were not optimized for specific applications. [Conclusions] The proposed model could help us gather intelligence effectively with the help of multiple measurements.
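The Min-Max normalization and weighted aggregation of the three indices can be sketched as follows. This is a minimal illustration only: the equal default weights and function names are assumptions, not the paper's actual weighting scheme.

```python
def min_max_scale(values):
    """Min-Max scaling of one diffusion index to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant index: no spread to normalize
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]

def composite_diffusion_score(strength, speed, breadth, weights=(1/3, 1/3, 1/3)):
    """Weighted sum of the three normalized indices per topic.

    Equal weights are a placeholder; the paper notes the weights
    were not optimized for specific applications.
    """
    s, v, b = (min_max_scale(x) for x in (strength, speed, breadth))
    w1, w2, w3 = weights
    return [w1 * a + w2 * c + w3 * d for a, c, d in zip(s, v, b)]
```

Topics can then be ranked by the composite score to surface high-diffusion topics quickly.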
[Objective] This paper explores the impacts of message framing (gain-loss framing and time framing) on individuals’ intentions to change health behaviors in the pre-contemplative stage, aiming to identify the information guiding people’s health behavioral intentions. [Methods] First, we designed four types of health information based on message framing theory. Then, we used an eye-tracking experiment to record participants’ eye movements while reading health information, and analyzed these data with ANOVA. Third, we used semi-structured interviews to explore the mechanism between message framing and health behavioral intentions. Finally, we identified participants’ health information needs. [Results] Loss framing information attracted more attention than gain framing information (Total Fixation Duration: 49.456>32.633, P=0.045; Average Fixation Duration: 0.314>0.223, P=0.003). There was no significant difference between short-term framing and long-term framing information (Total Fixation Duration: P=0.524; Fixation Count: P=0.291; Average Fixation Duration: P=0.240). Health information influenced behavioral intention by increasing perceived risks, perceived benefits, and self-efficacy. Six types of health information acquired by individuals in the pre-contemplative stage were identified through the interview data. [Limitations] The diversity and number of subjects, as well as the types of message framing, need to be expanded. [Conclusions] This study provides a reference for future personalized health information intervention studies.
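The framing comparisons above rest on one-way ANOVA over fixation metrics. As a minimal sketch of the underlying statistic (the study presumably used standard statistical software; this is for illustration only), the F value compares between-group to within-group variance:

```python
def one_way_anova_f(groups):
    """F statistic for a one-way ANOVA over k groups of observations.

    F = (between-group mean square) / (within-group mean square);
    larger F means group means differ more than chance variation suggests.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

The reported P values would then come from the F distribution with (k-1, n-k) degrees of freedom.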
[Objective] This paper proposes a new method to explore consumer psychology and preferences based on online comments, aiming to address the difficulties of drawing personality-based consumer portraits. [Methods] First, we mapped the relationships among experience levels, product features, and aspect words. Then, we extracted aspect words from user comments to examine users’ attention at different experience levels. Third, we categorized users by their instinctual, behavioral, and reflective preferences. Finally, we utilized deep learning-based aspect sentiment analysis to examine users’ preferences for products. [Results] We evaluated our new model with more than 900,000 reviews of mobile phones from JD.com. Among the users, those with instinctual preferences accounted for 41.60%, which was higher than behavioral preferences (33.01%) and reflective preferences (25.39%). We also analyzed their purchasing behaviors from the perspectives of brands and prices. [Limitations] We only collected review data on mobile phones sold by JD.com. More products and platforms need to be examined with our new model in the future. [Conclusions] The new model for creating user portraits can identify the preferences of different groups of consumers.
[Objective] This paper constructs a clustering model for the sentimental time series of bullet screen texts, aiming to predict video communication effects. [Methods] First, we used Word2Vec to expand the sentiment dictionary and optimize the performance of the sentiment classifiers. Then, we added comprehensive weights to make the sentiment sequences smooth and stable. Finally, we constructed the SBD measurement and K-shape clustering model to analyze sentiment sequence patterns, characteristics, and communication effects. [Results] The optimized model had F1 values of 0.89 and 0.79 on the multi-classification indicators (subjective/objective and polarity classification). The performance of the subjective/objective classifier was improved by 123%. Compared with existing time series clustering algorithms using multiple distance measures, the proposed model generated better Davies-Bouldin Index and Silhouette Index values. [Limitations] The new algorithm did not fully utilize Internet buzzwords or sentences without central adjectives. The description and interpretation of the sentimental time series clustering results need to be further explored. [Conclusions] The proposed model could reduce the irregular noise and timing phase shift of bullet screen texts, and the clustering results are the basis for identifying different communication effects.
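The shape-based distance (SBD) underlying K-shape is defined as one minus the maximum normalized cross-correlation over all shifts, which is what makes the clustering robust to the timing phase shifts mentioned above. A simplified O(n·m) pure-Python sketch (production implementations such as K-shape's use FFT-based cross-correlation and z-normalized series):

```python
import math

def sbd(x, y):
    """Shape-based distance: 1 - max normalized cross-correlation.

    Returns 0 for identical sequences and for sequences that are
    pure shifted copies of one another (phase-shift invariance).
    """
    n, m = len(x), len(y)
    denom = math.sqrt(sum(v * v for v in x)) * math.sqrt(sum(v * v for v in y))
    if denom == 0:
        return 1.0
    # Cross-correlation of x with y shifted by s, for every shift s.
    best = max(
        sum(x[i] * y[i - s] for i in range(max(0, s), min(n, m + s)))
        for s in range(-(m - 1), n)
    )
    return 1.0 - best / denom
```

K-shape then alternates between assigning series to the cluster with the smallest SBD and recomputing shape centroids.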
[Objective] This study improves the traditional session sequence recommendation algorithm, whose one-time modeling technique can neither represent products’ comprehensive information nor capture users’ global and short-term interests. [Methods] First, we constructed a directed session graph from the historical sessions and used a GNN to learn node information representations and enrich the node embeddings. Then, we captured users’ global and short-term interests in session sequences with Bi-GRU and an attention mechanism to generate recommendation lists. [Results] We examined our new algorithm with the Yoochoose and Diginetica datasets. Compared with the suboptimal model, the Mean Reciprocal Rank of our algorithm improved by 1.02% and the precision by 2.11%. [Limitations] The proposed model did not work well with long sequences. [Conclusions] Our new algorithm can more effectively model user behavior sequences, predict users’ possible actions, and improve recommendation lists.
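Mean Reciprocal Rank, the evaluation metric reported above, is the average of 1/rank of the ground-truth next item in each generated recommendation list. A minimal sketch (illustrative only; function names are ours, not the paper's):

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """MRR over sessions: mean of 1/rank of each target item
    in its recommendation list (0 when the target is absent)."""
    total = 0.0
    for preds, target in zip(ranked_lists, targets):
        if target in preds:
            total += 1.0 / (preds.index(target) + 1)
    return total / len(targets)
```

Session-based recommenders typically report MRR@K and precision/recall@K over a fixed list length K.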
[Objective] Based on the user-review-shop (URS) relationship and fake degrees, this paper proposes a model based on user deviation, aiming to effectively identify fake accounts. [Methods] First, we measured users’ deviations in contents and behaviors with the means method, JS divergence, and KL divergence, respectively. Then, we constructed the URS-FDIRM model to identify fake users with experimental data from mafengwo.com. [Results] The proposed model effectively measured users’ deviations in contents and behaviors. The F1 value of the URS-FDIRM model reached 92.57%. [Limitations] This method mainly uses conventional measurements to extract the deviation indices and did not include more deviation measurements based on user behaviors. [Conclusions] The proposed method could help us reveal the false relationships among users, reviews, and shops, and monitor abnormal user behaviors.
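The KL and JS divergences used for the deviation measurements compare a user's distribution (e.g. of review contents or behaviors) against a reference distribution. A minimal sketch of the standard definitions (how the paper builds the distributions themselves is not specified here):

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; asymmetric, requires
    q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln 2,
    defined via the mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

A large divergence between a user's distribution and the population's flags the account as a candidate fake user.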
[Objective] This paper addresses the missing semantics of existing document clustering algorithms. [Methods] Based on the traditional deep variational inference algorithm, we proposed a Semantic Supplemented Variational Text Clustering Model (SSVAE), which adds text semantic information to the clustering process. [Results] The SSVAE effectively addressed the missing semantics issue. Compared with the best existing models, SSVAE’s NMI on the BBC, Reuters-1500, Abstract, Reuters-10k, and 20news-l datasets improved by 8.92%, 7.43%, 8.73%, 4.80%, and 6.14%, respectively. [Limitations] During the semantic supplementation process, the SSVAE inevitably introduced some noise, which affected the clustering performance. [Conclusions] The new SSVAE model effectively improves the accuracy of text clustering.
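The NMI scores above measure agreement between predicted clusters and true labels. As a minimal sketch of the standard definition (mutual information normalized by the geometric mean of the two entropies; production code would use scikit-learn's `normalized_mutual_info_score`):

```python
import math
from collections import Counter

def nmi(labels_a, labels_b):
    """Normalized mutual information between two label assignments,
    in [0, 1]; 1 means identical partitions up to relabeling."""
    n = len(labels_a)
    ca, cb = Counter(labels_a), Counter(labels_b)
    cab = Counter(zip(labels_a, labels_b))
    mi = sum(c / n * math.log(n * c / (ca[a] * cb[b]))
             for (a, b), c in cab.items())
    ha = -sum(c / n * math.log(c / n) for c in ca.values())
    hb = -sum(c / n * math.log(c / n) for c in cb.values())
    return mi / math.sqrt(ha * hb) if ha > 0 and hb > 0 else 0.0
```

Because NMI is invariant to cluster relabeling, it is a standard choice for evaluating unsupervised text clustering.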
[Objective] This paper addresses the input length limit of pre-trained language models, aiming to improve the accuracy of long text classification. [Methods] We designed an algorithm that uses the punctuation in natural texts to segment sentences and feed them into the pre-trained language model in order. Then, we compressed and encoded the classification feature vectors with the average pooling method and a weighted attention mechanism. Finally, we examined the new algorithm with multiple pre-trained language models. [Results] Compared to methods that directly truncate the text contents, the classification accuracy of the proposed method improved by up to 3.74%. After applying the attention mechanism, the classification F1-scores on two datasets increased by 1.61% and 0.83%, respectively. [Limitations] The improvements are not significant on some pre-trained language models. [Conclusions] The proposed model can effectively classify long texts without changing the pre-trained language model’s architecture.
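The punctuation-based segmentation and average pooling steps can be sketched as follows. This is our own illustrative approximation, not the paper's code: the punctuation set, character-based `max_len`, and greedy packing are assumptions, and in practice each chunk would be encoded by the pre-trained model before pooling.

```python
import re

def segment_by_punct(text, max_len=128):
    """Split text at sentence-ending punctuation, then greedily pack
    sentences into chunks of at most max_len characters (a single
    over-long sentence is kept whole as its own chunk)."""
    sents = [s for s in re.split(r"(?<=[.!?;。！？；])\s*", text) if s]
    chunks, cur = [], ""
    for s in sents:
        if cur and len(cur) + 1 + len(s) > max_len:
            chunks.append(cur)
            cur = s
        else:
            cur = f"{cur} {s}".strip() if cur else s
    if cur:
        chunks.append(cur)
    return chunks

def average_pool(vectors):
    """Mean-pool equal-length per-chunk feature vectors into one
    document vector for the classifier head."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]
```

The weighted attention variant would replace the uniform 1/n weights in `average_pool` with learned per-chunk weights.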
[Objective] This paper summarizes interactive contents from new media for government affairs based on topic clustering, aiming to help the government effectively control public opinion events. [Methods] First, we analyzed the textual features of the interactive contents. Then, we generated abstracts of the contents with the Top2Vec, TextRank, and Transformer-Copy algorithms. [Results] The proposed model’s ROUGE-1, ROUGE-2, and ROUGE-L values reached 22.05%, 6.93%, and 20.96%, respectively, which were better than those of the Seq2Seq and Seq2Seq-Attention models. [Limitations] We only examined the new model with interactive contents on 10 draft laws and regulations from Sina Microblog. [Conclusions] The proposed method can summarize the topics and public opinion of specific events.
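The ROUGE-N scores above measure n-gram overlap between generated and reference summaries. A minimal sketch of ROUGE-1 F-measure (illustration only; published evaluations use the standard ROUGE toolkit, and ROUGE-L uses longest common subsequence instead of unigrams):

```python
from collections import Counter

def rouge_1_f(candidate, reference):
    """Unigram-overlap F1 between a generated summary and a reference,
    with counts clipped to the reference (multiset intersection)."""
    c, r = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((c & r).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 is the same computation over bigrams rather than unigrams.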
[Objective] This paper proposes a distantly supervised model to extract medical entity relationships based on medical domain-specific knowledge, aiming to reduce the data labeling costs and potential errors of the existing models. [Methods] First, we used a multi-instance strategy to reduce the noise of the distantly supervised labeled data. Then, we utilized a pre-trained language model (MedicalBERT) to encode the labeled texts. Third, using the descriptions of the entities in the medical knowledge base, we provided supervision signals for medical relationship extraction and improved the accuracy of the semantic encoding. [Results] Compared with the existing models, the performance of our new algorithm was up to 5.4% higher in Precision, 2.5% higher in Recall, and 4.1% higher in F1. In addition, the F1-score for the complicated extraction tasks reached 93.8%. [Limitations] More research is needed to examine the proposed method with more sentences. [Conclusions] Our new model could effectively extract medical entity relationships and benefit related research.
[Objective] This study investigates the similarities and differences between the image projected by official marketing activities and the image perceived from user-generated contents. [Methods] First, we retrieved the marketing data of festival events and related user-generated contents with a web crawler. Then, we used grounded theory to construct a model for festival event images. Third, we utilized compositional distance analysis to examine the distance between the projected and perceived images. Finally, we collected quantitative data to evaluate the proposed model and the compositional distance analysis results. [Results] We found that festival images had three dimensions: event, social, and location. The compositional distance of the location dimension was the largest, while that of the social dimension was the smallest. [Limitations] We only collected data from the Strawberry Music Festival. More research is needed to examine the proposed model with other festival events. [Conclusions] This research provides an effective data-driven method for tracking and analyzing official marketing strategies, i.e., the difference between the projected and perceived images.
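Compositional distance analysis typically uses the Aitchison distance, which compares proportion vectors (e.g. category shares of image attributes) in log-ratio space. A minimal sketch under that assumption (the paper's exact distance formulation is not stated here):

```python
import math

def clr(v):
    """Centered log-ratio transform of a strictly positive composition."""
    g = math.exp(sum(math.log(x) for x in v) / len(v))
    return [math.log(x / g) for x in v]

def aitchison_distance(x, y):
    """Euclidean distance between two compositions in clr space;
    invariant to overall scale, so raw counts and proportions agree."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))
```

Computing this distance per dimension (event, social, location) between the projected and perceived attribute shares yields the ranking reported in the results.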
[Objective] This paper integrates tax-related data from multiple sources and uses machine learning methods to identify illegal corporate tax evasion. [Methods] First, we used web scraping, text mining, and other methods to collect business financial data, executive information, and media coverage of the corporations. Second, we used the random forest method for feature selection and established indicators for the candidate companies. Third, we built a discriminatory model with multi-task sparse structure learning based on an improved focal loss function. Finally, we trained the model with different types of tax audits to identify the needed candidates. [Results] We examined our model with real-world datasets and found it performed well for various applications. Its mean recall rate reached 0.8309, which was 0.1351 and 0.1033 higher than those of the logistic method and traditional multi-task sparse structure learning. [Limitations] The model needs to be examined with datasets from unlisted companies. [Conclusions] The new model could identify target enterprises with various dishonest tax evasion behaviors. This study provides new directions for smart tax audits by the government.
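The standard binary focal loss that the improved loss builds on down-weights easy, well-classified examples so training focuses on the hard (rare) tax-evasion cases. A minimal sketch of that standard form only; the paper's specific improvement is not reproduced here:

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction p in (0, 1) with label y in {0, 1}.

    FL(pt) = -alpha_t * (1 - pt)^gamma * log(pt); with gamma = 0 and
    alpha = 1 this reduces to plain cross-entropy.
    """
    pt = p if y == 1 else 1 - p
    a = alpha if y == 1 else 1 - alpha
    return -a * (1 - pt) ** gamma * math.log(pt)
```

The `(1 - pt)^gamma` factor shrinks the loss of confident correct predictions, which helps with the heavy class imbalance between compliant and evading companies.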