Current Issue, Volume 6 Issue 5
    Review of Studies Analyzing Interdisciplinary Dynamics
    Chen Shiji, Cui Tengteng, Qiu Junping
    2022, 6 (5): 1-9.  DOI: 10.11925/infotech.2096-3467.2021.0976

    [Objective] This paper summarizes studies analyzing interdisciplinary dynamics, aiming to outline their research framework, contents, and latest developments. [Coverage] A total of 46 representative papers were retrieved from the Web of Science core collection and CNKI. We searched for interdisciplinary dynamics and related research topics such as interdisciplinary knowledge transfer, diffusion, integration, and topic evolution. We also expanded our search to include more related literature. [Methods] From the perspectives of the definition and theoretical research of interdisciplinary dynamics, we summarized the analytical framework of interdisciplinary dynamics. Then, we described the methods and technologies based on this framework. Finally, we summarized the developing trends of interdisciplinary dynamics in terms of their formation mechanism and process. [Results] Interdisciplinary dynamics includes three areas of research: interdisciplinary development dynamics, interdisciplinary formation mechanism, and interdisciplinary formation process. The development of bibliometrics and scientometrics provides methods and techniques for quantitative analysis of interdisciplinary dynamics. [Limitations] There is much research on transplantation and topic evolution in interdisciplinary dynamics analysis; however, this paper reviewed only some typical documents. [Conclusions] At present, studies on interdisciplinary dynamics mainly focus on theory and mechanism, and relatively few examine the formation mechanism and process of interdisciplinarity from a quantitative point of view. With the development of data science and bibliometrics, interdisciplinary dynamics research will tend to reveal the development and evolution process of related fields from a quantitative perspective.

    Review of Studies on Incremental Name Disambiguation
    Cao Simeng, Li Chunwang
    2022, 6 (5): 10-19.  DOI: 10.11925/infotech.2096-3467.2021.0189

    [Objective] This paper analyzes the research on incremental author name disambiguation, aiming to provide a reference for future studies. [Coverage] We used “author” and “name disambiguation” as keywords to search Google Scholar, ACM, IEEE, Elsevier, Springer, CNKI and VIP databases. After manual screening and an extended citation search based on seed documents, a total of 58 articles were retrieved, which included 30 papers directly discussing incremental disambiguation and 28 other related studies. [Methods] We discussed the developments, technical frameworks, and basic principles of incremental disambiguation. We also analyzed the development of incremental disambiguation in terms of similarity comparison strategies, author assignment methods, and other issues. [Results] Popular areas include feature selection and representation, similarity calculation and author assignment methods. However, fragment merging, multi-topic recognition of the same author, and error correction need to be strengthened. [Limitations] There were limited studies on direct incremental disambiguation of author names, which could not fully support our results. [Conclusions] The research on incremental disambiguation should be strengthened. Combining traditional feature engineering methods with deep learning and artificial intelligence technology could address more practical issues.

    Mining Policy Text Relevance with Syntactic Structure and Semantic Information
    Wu Kaibiao, Lang Yuxiang, Dong Yu
    2022, 6 (5): 20-33.  DOI: 10.11925/infotech.2096-3467.2021.0606

    [Objective] This paper proposes a new method to analyze policy text relevance, aiming to retrieve more in-depth semantic information. [Methods] First, we built a new algorithm combining dependency parsing and a word embedding model. Then, we analyzed the semantic relevance of policy texts from the perspective of sentence and word meaning information. Our method fully utilized the language characteristics of the policy texts to establish the extraction rules for dependency syntax. [Results] On a test dataset with a relatively low degree of policy text association, our new algorithm’s F1 value reached 0.857, which was 22.78% higher than the algorithm fusing TF-IDF and cosine similarity. We also described policy text relevance with subtle word differences. [Limitations] For semantic information mining, more research is needed to train word vector models for specific policy domains to further improve their accuracy. In sentence information mining, the accuracy of existing dependency syntactic analysis tools could be improved. [Conclusions] The proposed algorithm could effectively reveal the policy text association, as well as bring new research perspectives and tools for quantitative research on policy texts.
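For context, the baseline named in the results (TF-IDF fused with cosine similarity) can be sketched roughly as below. This is a minimal illustration and not the authors' implementation: the smoothed IDF formula, the pre-tokenized input, and the function names `tfidf_vectors` and `cosine` are all assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} TF-IDF vector per tokenized document.

    `docs` is a list of token lists (tokenization is assumed to be done elsewhere).
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an assumed variant) so shared terms keep a non-zero weight.
    idf = {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy policy snippets, already tokenized.
docs = [
    ["promote", "rural", "education"],
    ["promote", "rural", "healthcare"],
    ["reduce", "corporate", "tax"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

A purely lexical score of this kind ignores sentence structure and word meaning, which is the gap the dependency parsing and word embedding components described in the abstract are meant to address.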

    Item Categorization Algorithm Based on Improved Text Representation
    Tu Zhenchao, Ma Jing
    2022, 6 (5): 34-43.  DOI: 10.11925/infotech.2096-3467.2021.0958

    [Objective] This paper proposes a new model to improve traditional text classifiers, which tend to misclassify commodity titles with different labels and similar modifiers. [Methods] First, we designed the text discriminator as an auxiliary task, which took the normalized Euclidean distance of different label text vectors as the loss function. Then, we applied the cross-entropy loss function of traditional text classification to the new text encoder. Finally, we generated text representations with sufficient discrimination for different categories of commodity texts, and constructed the ITR-BiLSTM-Attention model. [Results] Compared with the BiLSTM-Attention model without the text discriminator, the proposed model’s accuracy, precision, recall and F1 values improved by 1.84%, 2.31%, 2.88% and 2.82%, respectively. Compared with the Cos-BiLSTM-Attention model, our new model improved accuracy, precision, recall and F1 values by 0.53%, 0.54%, 1.21% and 1.01%, respectively. [Limitations] The impacts of different sampling methods on the model were not tested. We did not conduct experiments on a larger data set. [Conclusions] The text discriminator auxiliary task designed in this paper can improve the text representation generated by the text encoder. The item categorization model based on improved text representation was more effective than the traditional ones.

    Question Generation Based on Sememe Knowledge and Bidirectional Attention Flow
    Duan Jianyong, Xu Lishan, Liu Jie, Li Xin, Zhang Jiaming, Wang Hao
    2022, 6 (5): 44-53.  DOI: 10.11925/infotech.2096-3467.2021.0857

    [Objective] This paper proposes a question generation model based on sememe knowledge and bidirectional attention flow, aiming to improve the semantics of the questions. [Methods] We developed two strategies to enhance semantics: (I) By integrating the external knowledge of sememes in the embedding layer, we captured semantic knowledge with a smaller granularity than word vectors, and then enhanced the semantic features of the text itself. In addition, we obtained an expanded sememe knowledge base that is more in line with the semantics of the contextual text through the cosine similarity algorithm. It helped us filter out the sememes creating semantic noise in the original knowledge base, and recommended semantically compliant sememe sets for words labeled with non-semantic origins. (II) We enhanced the semantic representation between texts and answers by incorporating a bidirectional attention flow after the encoding layer. [Results] We evaluated our model with the SQuAD1.1 dataset, and the Bleu_1, Bleu_2, Bleu_3, and Bleu_4 scores reached 46.70%, 31.07%, 22.90%, and 17.48%, respectively. The proposed model outperformed the baseline models. [Limitations] With the bidirectional attention flow, the model needs to extract features of paragraph texts and questions, which demands double the memory and time to train the model. [Conclusions] Sememe knowledge and bidirectional attention flow could help the proposed model generate higher-quality questions more in line with human language habits.

    Extracting Keywords from Government Work Reports with Multi-feature Fusion
    Pan Huiping, Li Baoan, Zhang Le, Lv Xueqiang
    2022, 6 (5): 54-63.  DOI: 10.11925/infotech.2096-3467.2021.0700

    [Objective] This paper proposes a modified BiLSTM-CRF model to automatically extract keywords from government work reports with the help of BERT word vectors, Wubi features, domain synonyms, and word frequencies. [Methods] First, we used the BERT and Wubi vectors to capture the semantic and font features of the input sequence. Then, we captured the category features of the input sequence with the domain synonym table for the government work reports. Third, we assigned the word frequency features as weights to the word vectors to capture context features of the input sequence. Finally, we used the BiLSTM-CRF model to retrieve more semantic information and automatically extract keywords from government work reports. [Results] We examined the proposed model on a self-built corpus of government work reports. The precision, recall and F1 values reached 86.14%, 91.56%, and 88.42%, respectively. We also evaluated the validity of each feature in the model with an ablation experiment. [Limitations] More research is needed to apply the model to other texts. [Conclusions] The proposed method could effectively extract keywords from Chinese texts.

    Mining Uninteresting Items with Visibility of User Time Points and Collaborative Filtering Recommendation Method
    Shi Lei, Li Shuqing
    2022, 6 (5): 64-76.  DOI: 10.11925/infotech.2096-3467.2021.0842

    [Objective] This paper proposes a new method to improve the collaborative filtering algorithm based on explicit feedback, aiming to address data sparsity and user selection bias issues. [Methods] First, we retrieved the negative preferences of users who have seen the items but did not interact with them. Then, we measured the visibility of items along with user activity, item popularity and time factors. Third, we introduced the concept of pre-use preferences to construct a weighted matrix factorization model based on user time point visibility. Finally, we identified items that users were not interested in, and marked them with low values. [Results] We examined our model with the MovieLens datasets, and found the recommendation accuracy of ItemCF and BiasSVD increased by an average of 2 to 2.5 times. [Limitations] There may be empirical bias in modeling pre-use preferences based on the users’ negative preferences from the “seen-but-not-interacted items”. [Conclusions] The proposed model could effectively reduce the impacts of data sparsity and user selection bias, and produce accurate recommendation results.

    Point-of-Interest Recommendation with Spectral Clustering and Multi-Factors
    Guo Lei, Liu Wenju, Wang Ze, Ren Yueqiang
    2022, 6 (5): 77-88.  DOI: 10.11925/infotech.2096-3467.2021.1047

    [Objective] This paper tries to improve the recommendation algorithm for Location-Based Social Networks (LBSN) and reduce the impacts of sparse data on recommendation precision. [Methods] First, we used the adaptive spectral clustering technique to group the users. Then, we created the recommendation candidates from the points of interest (POIs) visited by the users. Finally, we calculated the attraction scores of the candidate sets and recommended the POIs with higher scores. [Results] We examined the new model with two real LBSN data sets, Gowalla and Foursquare, and set the number of recommended POIs to 2. Our model’s precision reached 11.4% and 7.4%, which were 3.2% and 1.1% higher than the Lore model. The new model’s running time was reduced to 50 644.53 s and 406 224.7 s (16 961.49 s and 227 248.6 s shorter than the benchmark model). [Limitations] The clustering algorithm could influence the screening of POIs. [Conclusions] The proposed model could effectively improve the recommendation precision of heterogeneous networks (i.e., LBSN).
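The precision figures above are reported for a recommendation list of length 2. One common way to compute such a score is precision@k, sketched below. The abstract does not state the exact formula used, so this definition, the function name `precision_at_k`, and the toy POI identifiers are assumptions for illustration only.

```python
def precision_at_k(recommended, visited, k):
    """Fraction of the top-k recommended POIs that the user actually visited.

    recommended: list of POI ids ranked by score, best first.
    visited: set of POI ids found in the user's held-out check-ins.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = recommended[:k]
    hits = sum(1 for poi in top_k if poi in visited)
    return hits / k


# Hypothetical example: one of the two recommended POIs was visited.
ranked = ["cafe_12", "museum_3", "park_7"]
held_out = {"museum_3", "gym_9"}
print(precision_at_k(ranked, held_out, k=2))
```

In an evaluation over many users, the per-user values would typically be averaged to give a single figure per data set.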

    User Community Partition Based on Multi-layer Information Fusion in E-commerce Heterogeneous Network
    Feng Yong, Xu Wentao, Wang Rongbing, Xu Hongyan, Zhang Yonggang
    2022, 6 (5): 89-98.  DOI: 10.11925/infotech.2096-3467.2021.1068

    [Objective] This paper proposes a new algorithm based on multi-layer information fusion in an e-commerce heterogeneous network, aiming to improve the accuracy of user community division. [Methods] First, we conducted hierarchical processing of the e-commerce heterogeneous networks and constructed user node embeddings based on different relationship types. Then, we merged users of different layers and obtained their embedding characterization in e-commerce heterogeneous networks. Third, we used the objective function to optimize the relevant parameters of the user nodes. Finally, we clustered these users with an improved K-means algorithm, and created a reasonable community division. [Results] The NMI and Sim@5 indicators of the proposed algorithm were 6.4% and 1.7% higher than those of the existing algorithms based on DeepWalk, Node2Vec, and GCN. The model effectively characterized user nodes and accurately divided their communities. [Limitations] We did not examine the time information and noise points from the heterogeneous network. [Conclusions] The proposed algorithm could improve the performance of friend prediction, group recommendation and other applications.

    Identifying R&D Teams and Innovations with Patent Collaboration Networks
    Guan Peng, Wang Yuefen, Fu Zhu, Jin Jialin
    2022, 6 (5): 99-111.  DOI: 10.11925/infotech.2096-3467.2021.0772

    [Objective] This paper tries to identify technology R&D teams based on the patent holders’ collaboration networks, aiming to analyze factors influencing these teams’ innovations. [Methods] First, we identified the core R&D personnel and their team members. Then, we used the number of patents as the quantity index of innovation outputs, and the number of patent citations and claims as the quality index of innovation outputs. Finally, we used the negative binomial regression model to analyze the impacts of team characteristics on their innovations. [Results] We conducted an empirical study in the field of speech recognition technology, and the proposed algorithm effectively identified 566 evolutionary sequences of R&D teams, including 1 827 R&D teams in each snapshot, with an average size of 16.670. These teams form a small-world sub-network with an average clustering coefficient of 0.856 and an average shortest path length of 1.646. [Limitations] The proposed algorithm could not effectively find technology R&D teams in fields with few well-known experts. The sample size also needs to be expanded. [Conclusions] The team size and average shortest path length of the team network have significant positive impacts on the quantity and quality of innovations. The persistence, stability and network density of these teams have significant negative effects on the quantity and quality of innovations. The team clustering coefficient has significant negative effects on the quantity of innovations, but no significant impacts on the quality of innovations.
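The two small-world indicators quoted above, average clustering coefficient and average shortest path length, can be computed on an undirected collaboration graph roughly as follows. This is a sketch under stated assumptions rather than the authors' code: the adjacency-dict representation, the function names, and the toy team are hypothetical, and the graph is assumed to be unweighted with at least one edge.

```python
from collections import deque
from itertools import combinations


def average_clustering(adj):
    """Mean local clustering coefficient; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue
        # Count edges that exist between pairs of this node's neighbours.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)


def average_shortest_path_length(adj):
    """Mean hop distance over all ordered pairs of reachable nodes, via BFS from each node."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs


# Hypothetical four-inventor team: a triangle plus one peripheral member.
team = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
print(average_clustering(team), average_shortest_path_length(team))
```

A high clustering coefficient combined with a short average path length is the usual signature of a small-world network, which is how the reported values of 0.856 and 1.646 are read in the abstract.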

    Evaluating Privacy Policy for Mobile Health APPs with Machine Learning
    Zhao Yang, Yan Zhouzhou, Shen Qiqi, Li Zhonghang
    2022, 6 (5): 112-126.  DOI: 10.11925/infotech.2096-3467.2021.0897

    [Objective] This paper analyzes privacy policies for mobile health APPs in China with machine learning, aiming to improve the efficiency and accuracy of compliance evaluation. [Methods] First, we constructed the evaluation system for the privacy policy compliance of mobile health APPs according to relevant policies and regulations. Then, based on the hard voting classifier, we established the compliance evaluation model integrating three machine learning algorithms: CNN, RNN and LSTM. Finally, we examined our model using 1210 mobile health APPs from the Android APP market, and evaluated the compliance of their privacy policies. [Results] The overall compliance of the privacy policies for mobile health APPs was poor. There were many violations across the six evaluation criteria. The compliance scores of online medical APPs, medical service APPs, health management APPs, and medical information APPs were 0.63, 0.59, 0.61 and 0.66, respectively. [Limitations] Due to the limited amount of annotated privacy policy data, the proposed model may not be able to fully learn the features of evaluation indicators. [Conclusions] The proposed model could conduct large-scale, fine-grained automatic evaluation of the compliance of APP privacy policies. It also provides new ideas and methods for government agencies and APP operators to improve decision making.
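Hard voting, as mentioned in the methods above, means that each base classifier casts one vote per sample and the majority label wins. The sketch below shows only that combination step; the three prediction lists stand in for the outputs of the CNN, RNN and LSTM models, and the function name `hard_vote` and the toy labels are assumptions.

```python
from collections import Counter


def hard_vote(predictions):
    """Combine per-model label lists by majority vote.

    predictions: list of equal-length label lists, one list per base model.
    With an odd number of binary classifiers no ties can occur; otherwise the
    label seen first among the tied ones is returned.
    """
    voted = []
    for labels in zip(*predictions):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical outputs for four policy clauses (1 = compliant, 0 = non-compliant).
cnn_pred = [1, 0, 1, 0]
rnn_pred = [1, 1, 0, 0]
lstm_pred = [0, 1, 1, 0]
print(hard_vote([cnn_pred, rnn_pred, lstm_pred]))
```

Using three voters keeps the binary decision tie-free, which may be one reason an odd number of base models is a common choice for this kind of ensemble.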

    Under-sampling Algorithm with Weighted Distance Based on Adaptive K-Means Clustering
    Zhou Qian, Yao Zhen, Sun Bo
    2022, 6 (5): 127-136.  DOI: 10.11925/infotech.2096-3467.2021.0847

    [Objective] This study tries to reduce the impacts of imbalanced data on classification accuracy. [Methods] First, we used the adaptive k-means clustering algorithm to process the majority class and remove the outliers. Then, we calculated the weighted distance between the data and the centers of the clusters and sorted the weighted distances. We also sequentially sampled the majority class according to the density of the clusters. Finally, we trained the classification algorithm on the combination of the sampled data and the minority class. [Results] The average max AUC value reached 0.912 with 25 imbalanced datasets, which was at least 0.014 higher than other methods. Our new algorithm’s average running time was 1.377 s, and it worked well with imbalanced big data sets. [Limitations] The proposed model could not address multi-classification issues. [Conclusions] This new algorithm could identify the optimal k-value, detect and remove the outliers, solve the class imbalance problem, and improve classification accuracy. It is capable of processing imbalanced large data sets faster and more cost-effectively.

  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn