[Objective] This paper reviews recent research progress in citation content analysis and clarifies its research directions and technology development trends. [Coverage] CNKI, Scopus, Semantic Scholar, and other search platforms were queried with keywords such as “citation full text”, “citation context”, and “citation content”, followed by manual screening. [Methods] Research on citation content analysis is summarized and compared from four aspects: discrimination of related concepts, main research directions, key technologies, and analysis tools and platforms; existing problems and future research directions are also discussed. [Results] New ideas and methods are emerging in citation content analysis research directions such as citation motivation, citation evaluation, knowledge flow, and paper recommendation. Key common technologies for citation content analysis have made considerable progress in citation extraction, citation location identification, citation sentiment analysis, and knowledge point identification. [Limitations] This paper mainly summarizes and analyzes the relevant research at the macro level and does not elaborate on every aspect in depth. [Conclusions] Citation content analysis has unique advantages over traditional citation analysis. With the rapid iteration of natural language processing technology, it has broad development prospects.
[Objective] This paper examines the significance, perspectives, and related technical methods of identifying scholars’ research interests, aiming to provide references for future studies. [Coverage] We used “scholar profile” and “research interest” as keywords to search CNKI, Web of Science and DBLP, retrieving 62 representative articles. [Methods] We reviewed these studies from the perspectives of words, topics and networks, and analyzed their developments and future trends. [Results] Research at the word and topic levels is well developed and can effectively identify scholars’ research interests and their evolutionary characteristics. However, research at the network level merits more attention. [Limitations] This paper did not thoroughly discuss the technical details of the relevant algorithms. [Conclusions] More studies are needed on the association and semantic recognition of scholars’ research interests, as well as their semantic description.
[Objective] This paper proposes a news classification scheme combining semi-supervised learning and active learning, aiming to improve intelligence monitoring based on news mining. [Methods] First, we performed K-means clustering on learned news text representations and selected a small number of representative samples from the clusters for manual judgment; the resulting categories were merged and adjusted into sub-field categories. Then, we used the representative samples as the training set for several ensemble classification algorithms and trained an initial classifier. Finally, we utilized active learning to optimize the initial classifier. [Results] We tested the new model on news about tanks and armored vehicles. After active learning, text classification improved: precision, recall and F1 reached 83.68%, 83.35% and 83.17%, increases of 2.71%, 2.52% and 2.81% respectively. [Limitations] To reduce manual labeling work, we only conducted 2 iterations. [Conclusions] The proposed method can effectively classify news with little corpus annotation and no pre-trained classifier, and could also be applied in other fields.
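The cluster-then-label pipeline summarized above can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the "documents" are plain 2-D vectors, k-means is initialized deterministically rather than with k-means++, and a nearest-labeled-example rule stands in for the ensemble classifier:

```python
def dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, iters=20):
    # deterministic init spread across the data (k-means++ in real use); k >= 2
    centroids = [points[i * (len(points) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: dist(centroids[i], p))].append(p)
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters

def representative(centroid, cluster):
    # the sample closest to the centroid is sent for manual labeling
    return min(cluster, key=lambda p: dist(centroid, p))

def classify(x, labeled):
    # toy stand-in for the trained classifier: nearest labeled representative
    return min(labeled, key=lambda pair: dist(pair[0], x))[1]

def most_uncertain(points, centroids):
    # active learning via margin sampling: smallest gap between two nearest centroids
    def margin(p):
        d = sorted(dist(c, p) for c in centroids)
        return d[1] - d[0]
    return min(points, key=margin)

docs = [(0.0, 0.2), (0.1, 0.0), (0.9, 1.1), (1.0, 0.9), (0.5, 0.5)]
centroids, clusters = kmeans(docs, k=2)
reps = [(representative(c, cl), f"class_{i}")   # labels come from a human judge
        for i, (c, cl) in enumerate(zip(centroids, clusters))]
query = most_uncertain(docs, centroids)         # boundary point (0.5, 0.5)
```

In a real run, the representatives would be labeled by human annotators, and the point returned by margin sampling would be the next item sent for manual judgment in an active-learning iteration.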
[Objective] This paper designs a new deep learning algorithm to improve recommendation results. [Methods] Our model evaluated user and item quality features from the consistency of user ratings and item quality, the numerical distribution of ratings, and the time-period-based numerical distribution of ratings. [Results] We examined our model on the MovieLens dataset and found that MAE and MSE improved by up to 3.71% and 4.24%, respectively. [Limitations] More research is needed to explore a quality index evaluation method that includes attribute features of users and items. [Conclusions] The proposed model generates more accurate rating predictions and effectively improves recommendation quality.
[Objective] This paper proposes a feature fusion method for patent text classification, aiming to address the low recall of existing methods, which do not utilize unregistered (out-of-vocabulary) words. [Methods] First, we fused the sentence vector pre-trained by BERT with the proper noun vector. Then, we used the TF-IDF values of the proper nouns as the weights assigned to their vectors. [Results] We examined our model on a self-built patent text corpus. Its accuracy, recall and F1 values were 84.43%, 82.01% and 81.23% respectively, with the F1 value about 5.7% higher than other methods. [Limitations] The experimental data were mainly collected from the field of new energy vehicles and need to be expanded. [Conclusions] The proposed method can effectively process unbalanced data and unregistered words in patent texts.
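The fusion step can be sketched as follows. The tokens and vectors are hypothetical toys standing in for the BERT sentence embedding and proper-noun vectors, and the smoothed TF-IDF formula is one common variant, not necessarily the paper's exact weighting:

```python
import math

def tf_idf(term, doc_tokens, corpus):
    # smoothed TF-IDF used as the fusion weight for a proper noun
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)
    return tf * (math.log(len(corpus) / (1 + df)) + 1)

def fuse(sentence_vec, noun_vecs, weights):
    # TF-IDF-weighted average of the proper-noun vectors,
    # concatenated onto the sentence vector
    total = sum(weights) or 1.0
    pooled = [sum(w * v[i] for w, v in zip(weights, noun_vecs)) / total
              for i in range(len(noun_vecs[0]))]
    return list(sentence_vec) + pooled

# toy corpus of tokenized patent snippets (hypothetical)
corpus = [["solid", "state", "battery", "range"],
          ["battery", "recall"],
          ["charging", "pile"]]
doc = corpus[0]
w_battery = tf_idf("battery", doc, corpus)   # 0.25 * (log(3/3) + 1) = 0.25
fused = fuse([0.2, 0.8], [[1.0, 0.0], [0.0, 1.0]],
             [w_battery, tf_idf("range", doc, corpus)])
```

Weighting by TF-IDF lets rare domain-specific proper nouns contribute more to the fused representation than common ones.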
[Objective] This paper tries to improve author name disambiguation with entity relationship data from academic literature. [Methods] First, we extracted multi-type nodes and their relationships from the literature to construct a heterogeneous information network (HIN). Then, we applied representation learning to obtain latent vectors for authors and used clustering analysis to get a preliminary division. Finally, we merged several clusters based on strong rule matching to obtain the disambiguation result. [Results] We examined the new model on a dataset from the Web of Science. The mean K-Metric value was 0.842, a 63.18% increase over the baseline model; without strong rule matching, the improvement still reached 34.69%. [Limitations] The proposed model requires citation information, which limits its application scenarios. [Conclusions] Our new method can effectively improve the performance of author name disambiguation.
[Objective] This study develops a method to identify disease subtypes based on BERT-TextCNN, which could facilitate cohort selection for clinical trials. [Methods] We transformed disease subtype identification into a single-label classification task based on BERT-TextCNN. Then, we examined the new model with clinical trial data on strokes from ClinicalTrials.gov. [Results] The BERT-TextCNN based on the LP (Label Powerset) method yielded the best weighted macro-average F1 value of 0.9053, and identified stroke subtypes for participants of a clinical trial. [Limitations] More research is needed to evaluate our model on other diseases and datasets. [Conclusions] The proposed method could be an effective approach to identifying complex disease subtypes.
[Objective] This paper builds a knowledge graph for the business environment to improve the utilization of resources, aiming to discover the internal entity relationships among development factors and support government decision-making. [Methods] We constructed the knowledge graph from Beijing's business environment policies and proposed a knowledge extraction method integrating dependency syntax analysis and semantic role labeling. Then, we built a combined classifier to identify entity relationship triples, calculated semantic similarity, and performed relationship name fusion and alignment. We also designed an experiment to explore the performance of the TransR model on different link prediction tasks. Finally, we identified the main influencing factors and used adjustment strategies to complete knowledge reasoning. [Results] The newly constructed knowledge graph contains 31,955 entities, 1,847 relationships and 45,682 triples. The data was stored and visualized with Neo4j and Gephi, which also support knowledge queries via Cypher statements. [Limitations] Due to complex contextual information, more research is needed to model unclear entities in order to improve the performance of knowledge extraction and the quality of the knowledge graph triples. [Conclusions] Our new knowledge graph could help build an effective Q&A system and improve government decision-making to optimize the business environment.
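The TransR link prediction mentioned above scores a triple (h, r, t) by projecting the entities into a relation-specific space and measuring ||M_r h + r - M_r t||, with lower scores indicating more plausible links. A minimal sketch, with hypothetical hand-set embeddings in place of trained ones:

```python
import math

def transr_score(h, r, t, m_r):
    # project head/tail entity vectors into the relation space via m_r,
    # then measure how well relation vector r translates head to tail
    def project(v):
        return [sum(m_r[i][j] * v[j] for j in range(len(v)))
                for i in range(len(m_r))]
    hr, tr = project(h), project(t)
    return math.sqrt(sum((hr[i] + r[i] - tr[i]) ** 2 for i in range(len(r))))

# hypothetical 2-D embeddings; identity projection matrix as a toy
m_r = [[1.0, 0.0], [0.0, 1.0]]
good = transr_score([1.0, 0.0], [1.0, 0.0], [2.0, 0.0], m_r)  # consistent triple
bad = transr_score([1.0, 0.0], [1.0, 0.0], [0.0, 3.0], m_r)   # inconsistent one
```

In link prediction, candidate tail entities are ranked by this score, so a well-trained model places true triples near zero.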
[Objective] This paper comprehensively reviews the current literature, aiming to address the inconsistency issues facing empirical studies on the factors influencing social media users’ intention to disclose privacy. [Methods] First, we retrieved 55 relevant empirical studies published in China and abroad. Then, we used the CMA 3.0 software to conduct heterogeneity tests, publication bias tests and effect value analysis, which helped us identify the significant influencing factors. [Results] Among the 8 influencing factors included in the meta-analysis, habit (r=0.520) was strongly correlated with privacy disclosure intention, while perceived benefit (r=0.426) and trust (r=0.309) were moderately correlated with it. Perceived control (r=0.221), anonymity (r=0.175), privacy concern (r=-0.166), and perceived risk (r=-0.135) were weakly correlated with privacy disclosure intention, while subjective norms were not related to it. [Limitations] This paper only studied the simple paths from the influencing factors to disclosure intention, which might leave some mediating or moderating effects unidentified. [Conclusions] The meta-analysis-based model can more effectively reveal the factors affecting social media users’ privacy disclosure, providing theoretical guidance for improving services and for future studies.
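The effect value analysis behind such pooled r values typically combines study-level correlations through Fisher's r-to-z transform. A minimal fixed-effect sketch (the study figures below are hypothetical, and CMA 3.0 additionally performs the heterogeneity and publication-bias tests):

```python
import math

def fisher_z(r):
    # Fisher's r-to-z transform stabilizes the variance of correlations
    return 0.5 * math.log((1 + r) / (1 - r))

def pooled_r(studies):
    # studies: list of (r, n); fixed-effect inverse-variance weights w = n - 3
    num = sum((n - 3) * fisher_z(r) for r, n in studies)
    den = sum(n - 3 for _, n in studies)
    return math.tanh(num / den)    # back-transform the pooled z to r

# hypothetical studies of trust vs. disclosure intention: (correlation, sample size)
studies = [(0.35, 203), (0.28, 153), (0.30, 303)]
r_pooled = pooled_r(studies)
```

Larger studies get larger weights, so the pooled r always lands between the smallest and largest study-level correlations, closest to the best-powered studies.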
[Objective] This research designs a new model for profiling big data users, aiming to address the fusion issue facing qualitative and quantitative methods. [Methods] We combined qualitative and quantitative methods to design the new model, which includes a user value map based on sociological and psychological theories. Then, we used the Look-alike algorithm to build a map data label system and the K-Means clustering algorithm to process the data. Finally, we interpreted the clustered data. [Results] We examined our model with 200 million data points and successfully divided young users into 20 groups. The total amount of data reached 17 million with 606 labels, outperforming the survey data. [Limitations] More research is needed to extract more original data, improve the subjective control of the user value map, and conduct heterogeneous data profiling. [Conclusions] The proposed model is of significance for related studies.
[Objective] This paper proposes a metaphor identification model based on a graph convolutional neural network and Transformer, aiming to effectively detect metaphorical expressions consisting of multiple words. [Methods] We used the graph convolutional neural network to extract structural information from the syntactic dependency tree. Then, we combined this structure with the deep semantic representation produced by the Transformer. Finally, we calculated the probability of metaphorical expression for the target words through SoftMax. [Results] Compared with existing algorithms, the F1 values of our model increased by 1.9% and 1.7% on the VUA VERB and VUA ALL POS datasets, by 1.1% and 1.9% on TOEFL VERB and TOEFL ALL POS, and by 1.2% on the Chinese CCL dataset. [Limitations] If a sentence contains ambiguity or ambiguous referential information, our model cannot effectively identify the metaphorical expressions. [Conclusions] The graph convolutional network and syntactic dependency tree can enrich the semantics of target words, which improves the recognition of single- and multi-word metaphors.
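A single graph-convolution step over the dependency structure, followed by the SoftMax decision, can be sketched as follows. The adjacency matrix, features, and identity weight matrix are toys; the paper's model learns these parameters and combines the result with a Transformer encoder:

```python
import math

def gcn_layer(adj, feats, weight):
    # one propagation step: H' = ReLU(D^-1 (A + I) H W)
    n = len(adj)
    a = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a]
    agg = [[sum(a[i][k] * feats[k][j] for k in range(n)) / deg[i]
            for j in range(len(feats[0]))] for i in range(n)]
    return [[max(0.0, sum(agg[i][k] * weight[k][j] for k in range(len(weight))))
             for j in range(len(weight[0]))] for i in range(n)]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# two tokens linked by one dependency edge; identity weight matrix (toy)
adj = [[0, 1], [1, 0]]
feats = [[1.0, 0.0], [0.0, 1.0]]
h = gcn_layer(adj, feats, [[1.0, 0.0], [0.0, 1.0]])
probs = softmax(h[0])   # literal vs. metaphorical scores for the first token
```

The neighborhood averaging is what lets each target word absorb the semantics of the words it depends on, which is the intuition behind using the dependency tree for multi-word metaphors.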
[Objective] This paper reviews the development of metaphor research in China over the past 40 years, aiming to provide references for linguists and narrow the gap between Chinese and foreign researchers. [Methods] First, we used keyword extraction algorithms to map metaphor documents into keyword sets. Then, we chose effective features as parameters for regression models, which helped us predict the frequency of trending words in the following year. Finally, we analyzed the development of metaphor research diachronically and synchronically. [Results] Our study compared the results of five regression models. Among them, the GBR model, which had the best goodness of fit, achieved the highest prediction accuracy for the next year’s trending words. A feature ablation experiment also confirmed that the selected features were effective. [Limitations] The accuracy of the keyword extraction algorithm could be further optimized. [Conclusions] Metaphor research is developing in a cross-domain and interdisciplinary direction. Our feature selection method provides references for research on prediction models.