Data Analysis and Knowledge Discovery

Review of Studies Analyzing Interdisciplinary Dynamics

Chen Shiji, Cui Tengteng, Qiu Junping

Data Analysis and Knowledge Discovery. 2022, 6(5): 1-9. https://doi.org/10.11925/infotech.2096-3467.2021.0976

[Objective] This paper summarizes studies analyzing interdisciplinary dynamics, aiming to lay out their research framework, contents, and latest developments. [Coverage] A total of 46 representative papers were retrieved from the Web of Science Core Collection and CNKI. We searched for interdisciplinary dynamics and related research topics such as interdisciplinary knowledge transfer, diffusion, integration, and topic evolution, and then expanded the search to include more related literature. [Methods] From the perspectives of the definition of and theoretical research on interdisciplinary dynamics, we summarized the analytical framework of interdisciplinary dynamics. Then, we described the methods and technologies based on this framework. Finally, we summarized the development trends of interdisciplinary dynamics in terms of their formation mechanism and process. [Results] Interdisciplinary dynamics includes three areas of research: interdisciplinary development dynamics, interdisciplinary formation mechanism, and interdisciplinary formation process. The development of bibliometrics and scientometrics provides methods and techniques for quantitative analysis of interdisciplinary dynamics. [Limitations] Many studies address transplantation and topic evolution in interdisciplinary dynamics analysis; however, this paper reviewed only some typical documents. [Conclusions] At present, studies on interdisciplinary dynamics mainly focus on theory and mechanism, and relatively few examine the formation mechanism and process of interdisciplinarity from a quantitative point of view. With the development of data science and bibliometrics, research on interdisciplinary dynamics will tend to reveal the development and evolution process of related fields from a quantitative perspective.

Review of Studies on Incremental Name Disambiguation

Cao Simeng, Li Chunwang

Data Analysis and Knowledge Discovery. 2022, 6(5): 10-19. https://doi.org/10.11925/infotech.2096-3467.2021.0189

[Objective] This paper analyzes the research on incremental author name disambiguation, aiming to provide a reference for future studies. [Coverage] We used “author” and “name disambiguation” as keywords to search Google Scholar, ACM, IEEE, Elsevier, Springer, CNKI and VIP databases. After manual screening and an extended citation search based on seed documents, a total of 58 articles were retrieved, including 30 papers directly discussing incremental disambiguation and 28 other related studies. [Methods] We discussed the developments, technical frameworks, and basic principles of incremental disambiguation. We also analyzed the development of incremental disambiguation in terms of similarity comparison strategies, author assignment methods, and other issues. [Results] Popular areas include feature selection and representation, similarity calculation, and author assignment methods. However, fragment merging, multi-topic recognition of the same author, and error correction need to be strengthened. [Limitations] There were limited studies on direct incremental disambiguation of author names, which could not fully support our results. [Conclusions] The research on incremental disambiguation should be strengthened. Combining traditional feature engineering methods with deep learning and artificial intelligence technology could address more practical issues.
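
The abstract names similarity comparison and author assignment as the core steps of incremental disambiguation. The sketch below illustrates that general idea only and is not a method taken from the reviewed papers: a newly arrived paper is compared with existing author clusters and either joins the most similar one or opens a new cluster. The Jaccard similarity, the feature strings, the `author_N` id pattern, and the threshold of 0.3 are all assumptions made for illustration.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_incrementally(
    clusters: Dict[str, Set[str]],
    paper_features: Set[str],
    threshold: float = 0.3,  # hypothetical cut-off, not from the reviewed papers
) -> str:
    """Assign a new paper to the most similar author cluster, or open a new one."""
    best_id, best_sim = None, 0.0
    for author_id, features in clusters.items():
        sim = jaccard(features, paper_features)
        if sim > best_sim:
            best_id, best_sim = author_id, sim
    if best_id is None or best_sim < threshold:
        # Assumes existing ids follow the "author_N" pattern, so this id is unused.
        best_id = f"author_{len(clusters) + 1}"
        clusters[best_id] = set()
    # Update the cluster profile so that later papers see the new evidence.
    clusters[best_id] |= paper_features
    return best_id


# Hypothetical clusters described by co-author, venue, and topic features.
clusters = {
    "author_1": {"coauthor:li", "venue:kdd", "topic:graphs"},
    "author_2": {"coauthor:zhao", "venue:acl", "topic:parsing"},
}
first = assign_incrementally(clusters, {"coauthor:li", "topic:graphs", "venue:www"})
second = assign_incrementally(clusters, {"coauthor:sun", "venue:cvpr"})
```

Because existing clusters are only extended and never recomputed, an early wrong assignment persists, which is one way to see why the abstract singles out fragment merging and error correction as open issues.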

Mining Policy Text Relevance with Syntactic Structure and Semantic Information

Wu Kaibiao, Lang Yuxiang, Dong Yu

Data Analysis and Knowledge Discovery. 2022, 6(5): 20-33. https://doi.org/10.11925/infotech.2096-3467.2021.0606

[Objective] This paper proposes a new method to analyze policy text relevance, aiming to retrieve more in-depth semantic information. [Methods] First, we built a new algorithm combining dependency parsing and a word embedding model. Then, we analyzed the semantic relevance of policy texts from the perspective of sentence and word meaning information. Our method fully utilized the language characteristics of the policy texts to establish the extraction rules for dependency syntax. [Results] For a test dataset with a relatively low degree of policy text association, our new algorithm’s F1 value reached 0.857, which was 22.78% higher than the algorithm fusing TF-IDF and cosine similarity. We also described policy text relevance with subtle word differences. [Limitations] For semantic information mining, more research is needed to train word vector models for specific policy domains to further improve their accuracy. For sentence information mining, the accuracy of existing dependency parsing tools could be improved. [Conclusions] The proposed algorithm could effectively reveal the policy text association and bring new research perspectives and tools to quantitative research on policy texts.
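
For orientation, the baseline named in the results (TF-IDF fused with cosine similarity) can be sketched roughly as follows. This is a minimal illustration of the baseline family, not the proposed dependency-based algorithm and not the authors' code. The smoothed IDF form and the toy token lists are assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one sparse TF-IDF vector per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Relative term frequency times a smoothed IDF (assumed variant).
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical, already tokenized policy snippets.
docs = [
    ["support", "rural", "education", "funding"],
    ["increase", "rural", "education", "funding"],
    ["regulate", "urban", "traffic", "emissions"],
]
vectors = tfidf_vectors(docs)
related = cosine(vectors[0], vectors[1])
unrelated = cosine(vectors[0], vectors[2])
```

A bag-of-words baseline like this scores texts only on shared terms, so two policies that express the same measure with different wording look unrelated. That gap is what the syntactic and word-embedding signals in the abstract aim to close.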

Item Categorization Algorithm Based on Improved Text Representation

Tu Zhenchao, Ma Jing

Data Analysis and Knowledge Discovery. 2022, 6(5): 34-43. https://doi.org/10.11925/infotech.2096-3467.2021.0958

[Objective] This paper proposes a new model to improve traditional text classifiers, which tend to misclassify commodity titles that have different labels but similar modifiers. [Methods] First, we designed a text discriminator as an auxiliary task, which took the normalized Euclidean distance between text vectors with different labels as the loss function. Then, we applied the cross-entropy loss function of traditional text classification to the new text encoder. Finally, we generated text representations with sufficient discrimination for different categories of commodity texts, and constructed the ITR-BiLSTM-Attention model. [Results] Compared with the BiLSTM-Attention model without the text discriminator, the proposed model’s accuracy, precision, recall and F1 values improved by 1.84%, 2.31%, 2.88% and 2.82%, respectively. Compared with the Cos-BiLSTM-Attention model, our new model improved accuracy, precision, recall and F1 values by 0.53%, 0.54%, 1.21% and 1.01%, respectively. [Limitations] The impacts of different sampling methods on the model were not tested. We did not conduct experiments on a larger data set. [Conclusions] The text discriminator auxiliary task designed in this paper can improve the text representation generated by the text encoder. The item categorization model based on improved text representation was more effective than the traditional ones.

Question Generation Based on Sememe Knowledge and Bidirectional Attention Flow

Duan Jianyong, Xu Lishan, Liu Jie, Li Xin, Zhang Jiaming, Wang Hao

Data Analysis and Knowledge Discovery. 2022, 6(5): 44-53. https://doi.org/10.11925/infotech.2096-3467.2021.0857

[Objective] This paper proposes a question generation model based on sememe knowledge and bidirectional attention flow, aiming to improve the semantics of the questions. [Methods] We developed two strategies to enhance semantics: (I) By integrating the external knowledge of sememes in the embedding layer, we captured semantic knowledge at a finer granularity than word vectors, and thus enhanced the semantic features of the text itself. In addition, we obtained an expanded sememe knowledge base that is more in line with the semantics of the contextual text through the cosine similarity algorithm. It helped us filter out the sememes creating semantic noise in the original knowledge base, and recommended semantically compliant sememe sets for words labeled with non-semantic origins. (II) We enhanced the semantic representation between texts and answers by incorporating a bidirectional attention flow after the encoding layer. [Results] We evaluated our model with the SQuAD1.1 dataset, and the Bleu_1, Bleu_2, Bleu_3, and Bleu_4 scores reached 46.70%, 31.07%, 22.90%, and 17.48%, respectively. The proposed model outperformed the baseline models. [Limitations] With the bidirectional attention flow, the model needs to extract features of paragraph texts and questions, which doubles the memory and time required to train the model. [Conclusions] Sememe knowledge and bidirectional attention flow could help the proposed model generate higher-quality questions more in line with human language habits.
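
The Bleu_n scores reported above combine clipped n-gram precisions up to order n with a brevity penalty. As a rough illustration, the sketch below computes a sentence-level BLEU against a single reference without smoothing. The paper presumably reports corpus-level scores from a standard evaluation script, so this is an assumption-laden simplification, and the example sentences are hypothetical.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate: List[str], reference: List[str], max_n: int = 4) -> float:
    """Sentence-level BLEU with uniform weights, one reference, no smoothing."""
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        # Clip each candidate n-gram count by its count in the reference.
        clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # without smoothing, one empty order zeroes the score
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) > len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision_sum)


# Hypothetical generated question and reference question.
reference = "what is the capital of france".split()
candidate = "what is the capital city of france".split()
bleu_1 = bleu(candidate, reference, max_n=1)
bleu_2 = bleu(candidate, reference, max_n=2)
bleu_4 = bleu(candidate, reference, max_n=4)
```

Each higher order multiplies in a precision that is at most 1, which is consistent with the falling sequence from Bleu_1 to Bleu_4 in the reported results.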

Extracting Keywords from Government Work Reports with Multi-feature Fusion

Pan Huiping, Li Baoan, Zhang Le, Lv Xueqiang

Data Analysis and Knowledge Discovery. 2022, 6(5): 54-63. https://doi.org/10.11925/infotech.2096-3467.2021.0700

[Objective] This paper proposes a modified BiLSTM-CRF model to automatically extract keywords from government work reports with the help of BERT word vectors, Wubi features, domain synonyms, and word frequencies. [Methods] First, we used the BERT and Wubi vectors to capture the semantic and font features of the input sequence. Then, we captured the category features of the input sequence with a domain synonym table for government work reports. Third, we assigned the word frequency features as weights to the word vectors to capture context features of the input sequence. Finally, we used the BiLSTM-CRF model to retrieve more semantic information and automatically extract keywords from government work reports. [Results] We examined the proposed model on a self-built corpus of government work reports. The precision, recall and F1 values reached 86.14%, 91.56%, and 88.42%, respectively. We also evaluated the validity of each feature in the model with an ablation experiment. [Limitations] More research is needed to apply the model to other texts. [Conclusions] The proposed method could effectively extract keywords from Chinese texts.

Mining Uninteresting Items with Visibility of User Time Points and Collaborative Filtering Recommendation Method

Shi Lei, Li Shuqing

Data Analysis and Knowledge Discovery. 2022, 6(5): 64-76. https://doi.org/10.11925/infotech.2096-3467.2021.0842

[Objective] This paper proposes a new method to improve the collaborative filtering algorithm based on explicit feedback, aiming to address data sparsity and user selection bias. [Methods] First, we retrieved the negative preferences of users who had seen items but did not interact with them. Then, we measured the visibility of items using user activity, item popularity, and time factors. Third, we introduced the concept of pre-use preferences to construct a weighted matrix factorization model based on user time point visibility. Finally, we identified items that users were not interested in and marked them with low values. [Results] We examined our model with the MovieLens datasets and found that the recommendation accuracy of ItemCF and BiasSVD increased by an average of 2 to 2.5 times. [Limitations] There may be empirical bias in modeling pre-use preferences based on the users’ negative preferences from the “seen-but-not-interacted items”. [Conclusions] The proposed model could effectively reduce the impacts of data sparsity and user selection bias and produce accurate recommendations.

Point-of-Interest Recommendation with Spectral Clustering and Multi-Factors

Guo Lei, Liu Wenju, Wang Ze, Ren Yueqiang

Data Analysis and Knowledge Discovery. 2022, 6(5): 77-88. https://doi.org/10.11925/infotech.2096-3467.2021.1047

[Objective] This paper tries to improve the recommendation algorithm for Location-Based Social Networks (LBSN) and reduce the impacts of sparse data on recommendation precision. [Methods] First, we used the adaptive spectral clustering technique to group the users. Then, we built recommendation candidate sets from the points of interest (POIs) visited by the users. Finally, we calculated the attraction scores of the candidate sets and recommended the POIs with higher scores. [Results] We examined the new model with two real LBSN data sets, Gowalla and Foursquare, and set the number of recommended POIs to 2. Our model’s precision reached 11.4% and 7.4%, which were 3.2% and 1.1% higher than those of the Lore model. The new model’s running time was reduced to 50 644.53 s and 406 224.7 s (16 961.49 s and 227 248.6 s shorter than the benchmark model). [Limitations] The clustering algorithm could influence the screening of POIs. [Conclusions] The proposed model could effectively improve the recommendation precision of heterogeneous networks (i.e., LBSN).
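
The precision figures above are reported for two recommended POIs per user. A common way to compute such a metric is precision@k averaged over users, sketched below. The abstract does not give the paper's exact evaluation protocol, so the averaging scheme, the function names, and the toy data are assumptions.

```python
from typing import Dict, List, Set


def precision_at_k(recommended: List[str], visited: Set[str], k: int) -> float:
    """Fraction of the top-k recommended POIs that the user actually visited."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for poi in recommended[:k] if poi in visited)
    return hits / k


def mean_precision_at_k(
    recommendations: Dict[str, List[str]],
    ground_truth: Dict[str, Set[str]],
    k: int,
) -> float:
    """Average precision@k over all users that received recommendations."""
    scores = [
        precision_at_k(pois, ground_truth.get(user, set()), k)
        for user, pois in recommendations.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical ranked recommendations and held-out check-ins.
recommendations = {
    "u1": ["cafe", "museum", "park"],
    "u2": ["bar", "zoo"],
}
ground_truth = {
    "u1": {"museum", "gym"},
    "u2": {"library"},
}
score = mean_precision_at_k(recommendations, ground_truth, k=2)
```

With sparse check-in data most users have few held-out visits, so values near 10% at k = 2, as in the reported results, are plausible for this kind of metric.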

User Community Partition Based on Multi-layer Information Fusion in E-commerce Heterogeneous Network

Feng Yong, Xu Wentao, Wang Rongbing, Xu Hongyan, Zhang Yonggang

Data Analysis and Knowledge Discovery. 2022, 6(5): 89-98. https://doi.org/10.11925/infotech.2096-3467.2021.1068

[Objective] This paper proposes a new algorithm based on multi-layer information fusion in an e-commerce heterogeneous network, aiming to improve the accuracy of user community division. [Methods] First, we processed the e-commerce heterogeneous network hierarchically and constructed user node embeddings based on different relationship types. Then, we merged the user embeddings from different layers to obtain their representations in the e-commerce heterogeneous network. Third, we used the objective function to optimize the relevant parameters of the user nodes. Finally, we clustered these users with an improved K-means algorithm and produced a reasonable community division. [Results] The NMI and Sim@5 indicators of the proposed algorithm were 6.4% and 1.7% higher than those of the existing algorithms based on DeepWalk, Node2Vec, and GCN. The model effectively characterized user nodes and accurately divided their communities. [Limitations] We did not examine the time information and noise points from the heterogeneous network. [Conclusions] The proposed algorithm could improve the performance of friend prediction, group recommendation and other applications.
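
NMI (normalized mutual information), one of the two indicators reported above, compares a predicted community division with reference labels. The sketch below uses the arithmetic-mean normalization, 2·I(U;V) / (H(U) + H(V)). The abstract does not say which normalization the paper uses, so that choice and the toy labels are assumptions. Sim@5 is not covered here.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (natural log) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def nmi(true_labels: Sequence[Hashable], pred_labels: Sequence[Hashable]) -> float:
    """Normalized mutual information with arithmetic-mean normalization."""
    if len(true_labels) != len(pred_labels):
        raise ValueError("label sequences must have the same length")
    n = len(true_labels)
    joint = Counter(zip(true_labels, pred_labels))
    count_true = Counter(true_labels)
    count_pred = Counter(pred_labels)
    # I(U;V) = sum over cells of p(u,v) * log(p(u,v) / (p(u) * p(v))).
    mutual_info = sum(
        (c / n) * math.log(c * n / (count_true[t] * count_pred[p]))
        for (t, p), c in joint.items()
    )
    denom = entropy(true_labels) + entropy(pred_labels)
    # Both partitions trivial (a single cluster each): treat as perfect agreement.
    return 2 * mutual_info / denom if denom > 0 else 1.0


# Hypothetical community labels for six users.
reference = [0, 0, 0, 1, 1, 1]
same_up_to_renaming = [1, 1, 1, 0, 0, 0]
partly_wrong = [0, 0, 1, 1, 1, 1]
nmi_perfect = nmi(reference, same_up_to_renaming)
nmi_partial = nmi(reference, partly_wrong)
```

NMI ignores how clusters are numbered, which suits community detection, where K-means cluster ids are arbitrary.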

Identifying R&D Teams and Innovations with Patent Collaboration Networks

Guan Peng, Wang Yuefen, Fu Zhu, Jin Jialin

Data Analysis and Knowledge Discovery. 2022, 6(5): 99-111. https://doi.org/10.11925/infotech.2096-3467.2021.0772

[Objective] This paper tries to identify technology R&D teams based on the patent holders’ collaboration networks, aiming to analyze factors influencing these teams’ innovations. [Methods] First, we identified the core R&D personnel and their team members. Then, we used the number of patents as the quantity index of innovation outputs, and the number of patent citations and claims as the quality index of innovation outputs. Finally, we used the negative binomial regression model to analyze the impacts of team characteristics on their innovations. [Results] We conducted an empirical study in the field of speech recognition technology, and the proposed algorithm effectively identified 566 evolutionary sequences of R&D teams, including 1 827 R&D teams in each snapshot, with an average size of 16.670. These teams form a small-world sub-network with an average clustering coefficient of 0.856 and an average shortest path length of 1.646. [Limitations] The proposed algorithm could not effectively find technology R&D teams in fields with few well-known experts. The sample size also needs to be expanded. [Conclusions] The team size and the average shortest path length of the team network have significant positive impacts on the quantity and quality of innovations. The persistence, stability and network density of these teams have significant negative effects on the quantity and quality of innovations. The team clustering coefficient has significant negative effects on the quantity of innovations, but no significant impacts on the quality of innovations.
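
The small-world description above rests on two standard graph measures: the average clustering coefficient and the average shortest path length. The sketch below computes both for an undirected, unweighted collaboration network stored as adjacency sets. The toy network and all names are hypothetical, and the paper's own tooling is not stated in the abstract.

```python
from collections import deque
from itertools import combinations
from typing import Dict, Set


def average_clustering(adj: Dict[str, Set[str]]) -> float:
    """Mean local clustering coefficient; nodes with degree < 2 count as 0."""
    total = 0.0
    for neighbors in adj.values():
        k = len(neighbors)
        if k < 2:
            continue
        # Count edges that exist among the node's neighbors.
        links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_shortest_path_length(adj: Dict[str, Set[str]]) -> float:
    """Mean BFS distance over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Hypothetical team: inventors a, b, c co-patent with each other; d only with c.
team_network = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
clustering = average_clustering(team_network)
path_length = average_shortest_path_length(team_network)
```

A high clustering coefficient together with a short average path, as in the reported 0.856 and 1.646, is the usual signature of a small-world network.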

Evaluating Privacy Policy for Mobile Health APPs with Machine Learning

Zhao Yang, Yan Zhouzhou, Shen Qiqi, Li Zhonghang

Data Analysis and Knowledge Discovery. 2022, 6(5): 112-126. https://doi.org/10.11925/infotech.2096-3467.2021.0897

[Objective] This paper analyzes privacy policies for mobile health APPs in China with machine learning, aiming to improve the efficiency and accuracy of compliance evaluation. [Methods] First, we constructed an evaluation system for the privacy policy compliance of mobile health APPs according to relevant policies and regulations. Then, based on a hard voting classifier, we established a compliance evaluation model integrating three machine learning algorithms: CNN, RNN and LSTM. Finally, we examined our model using 1210 mobile health APPs from the Android APP market, and evaluated the compliance of their privacy policies. [Results] The overall compliance of the privacy policies for mobile health APPs was poor, with many violations across the six evaluation criteria. The compliance scores of online medical APPs, medical service APPs, health management APPs, and medical information APPs were 0.63, 0.59, 0.61, and 0.66, respectively. [Limitations] Due to the limited amount of annotated privacy policy data, the proposed model may not be able to fully learn the features of evaluation indicators. [Conclusions] The proposed model could conduct large-scale, fine-grained automatic evaluation of the compliance of APP privacy policies. It also provides new ideas and methods for government agencies and APP operators to improve decision making.
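
Hard voting, the ensemble rule named in the methods, takes the label predicted by the majority of the base models for each sample. The sketch below shows only that combination step. The per-model predictions are hypothetical stand-ins for the outputs of the CNN, RNN and LSTM classifiers, which are not reproduced here.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def hard_vote(predictions: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Combine per-model label predictions by majority vote, sample by sample."""
    if len({len(model_preds) for model_preds in predictions}) > 1:
        raise ValueError("all models must predict the same number of samples")
    voted = []
    for labels in zip(*predictions):
        # most_common breaks ties in favor of the label seen first,
        # i.e. the earliest model in the list (only relevant for multi-class ties).
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical compliance predictions (1 = compliant, 0 = violation)
# for four policy clauses from three base models.
cnn_preds = [1, 0, 1, 1]
rnn_preds = [1, 1, 0, 1]
lstm_preds = [0, 0, 1, 1]
ensemble = hard_vote([cnn_preds, rnn_preds, lstm_preds])
```

With three models and binary labels a tie cannot occur, so each clause receives the label that at least two models agree on.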

Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering

Zhou Qian, Yao Zhen, Sun Bo

Data Analysis and Knowledge Discovery. 2022, 6(5): 127-136. https://doi.org/10.11925/infotech.2096-3467.2021.0847

[Objective] This study tries to reduce the impacts of imbalanced data on classification accuracy. [Methods] First, we used the adaptive k-means clustering algorithm to process the majority class and remove the outliers. Then, we calculated the weighted distance between each data point and the cluster centers and sorted the points by this distance. We also sequentially sampled the majority class according to the density of the clusters. Finally, we trained the classification algorithm on the combination of the sampled data and the minority class. [Results] The average max AUC value reached 0.912 on 25 imbalanced datasets, which was at least 0.014 higher than other methods. Our new algorithm’s average running time was 1.377 s, and it worked well with imbalanced big data sets. [Limitations] The proposed model could not address multi-class classification issues. [Conclusions] This new algorithm could identify the optimal k-value, detect and remove the outliers, solve the class imbalance problem, and improve classification accuracy. It is capable of processing imbalanced large data sets quickly and cost-effectively.
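
AUC, the evaluation measure reported above, is the probability that a randomly chosen minority (positive) sample receives a higher score than a randomly chosen majority (negative) sample. The sketch below computes it by direct pairwise comparison. It illustrates the metric only, not the under-sampling algorithm, and the labels and scores are hypothetical.

```python
from typing import List, Sequence


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives: List[float] = [s for y, s in zip(labels, scores) if y == 1]
    negatives: List[float] = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical imbalanced test set: two minority samples, four majority samples.
labels = [1, 0, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.2, 0.6, 0.1]
area = auc(labels, scores)
```

Unlike plain accuracy, this measure does not depend on the class ratio, which is why it is a common choice for imbalanced benchmarks. The pairwise form is quadratic in the sample count, so large data sets would call for a rank-based computation instead.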

25 May 2022, Volume 6 Issue 5
