[Objective] This paper explores how information maps across languages, aiming to monitor public opinion worldwide and guide domestic audiences effectively. [Methods] We proposed CLOpin, a cross-lingual knowledge-mapping framework for public opinion analysis and early warning. The framework provides several toolsets for different scenarios to process cross-lingual data sets. CLOpin efficiently integrates data from various sources and constructs a knowledge graph to support cross-lingual public opinion analysis and early warning. [Results] Within the first hour after breaking news, the knowledge integrity of our model was 13.9% higher than that of single-language knowledge graph models; at 24 hours, it was 5.2% lower than theirs. [Limitations] The construction of our model was constrained by the scarcity of domain experts, which is the bottleneck for building knowledge graphs of less-common languages. [Conclusions] The CLOpin framework helps us accurately grasp public opinion and issue early warnings accordingly.
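The cross-lingual integration step in the Methods can be sketched minimally: per-language extractors emit subject-relation-object triples, the framework unions them into one graph, and "knowledge integrity" is read as coverage of a reference triple set. This is an illustration only, not CLOpin's actual implementation; the function names and the coverage proxy are assumptions, and cross-lingual entity alignment is assumed to be done upstream.

```python
def merge_triples(sources):
    """Union subject-relation-object triples from per-language extractors
    into one cross-lingual graph (entity alignment assumed already done)."""
    graph = set()
    for triples in sources.values():
        graph |= set(triples)
    return graph

def knowledge_integrity(graph, reference):
    """Share of reference triples present in the graph -- an illustrative
    proxy for the knowledge-integrity metric reported in the Results."""
    return len(graph & reference) / len(reference)
```

Merging sources early is what lets a multilingual graph outpace single-language ones right after breaking news, when each language covers only part of the event.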
[Objective] This paper tries to identify the information users need and then make timely, accurate recommendations. [Methods] First, we generated the candidate set with a content-based recommendation algorithm and an item-based collaborative filtering algorithm. Then, we used the parallel MapReduce technique to improve the parallel data mining performance of the proposed method. Finally, we adopted machine learning algorithms to increase the accuracy of recommended candidates and delivered personalized documents to the users. [Results] We created the recommendation list based on articles viewed by each individual user. The model's evaluation accuracy was 78.5%, and its mean squared error was 0.22. [Limitations] The user and text features need further investigation. The accuracy of word segmentation and the model training algorithm needs to be optimized. [Conclusions] The proposed model generates personalized recommendation lists for users and provides good support for related services.
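The item-based collaborative filtering step can be sketched with a small in-memory example: compute cosine similarity between items from user ratings, then score a user's unseen items by similarity-weighted ratings of seen ones. This is a single-machine sketch, not the paper's MapReduce implementation, and all names are illustrative assumptions.

```python
from math import sqrt

def item_similarity(ratings):
    """Cosine similarity between items from a user -> {item: rating} dict."""
    items = {}
    for user, prefs in ratings.items():
        for item, r in prefs.items():
            items.setdefault(item, {})[user] = r
    sims = {}
    for a in items:
        for b in items:
            if a >= b:
                continue
            common = set(items[a]) & set(items[b])
            if not common:
                continue
            num = sum(items[a][u] * items[b][u] for u in common)
            den = sqrt(sum(v * v for v in items[a].values())) * \
                  sqrt(sum(v * v for v in items[b].values()))
            sims[(a, b)] = sims[(b, a)] = num / den
    return sims

def recommend(ratings, user, sims, top_n=3):
    """Score unseen items by similarity-weighted ratings of seen items."""
    seen = ratings[user]
    scores = {}
    for item, r in seen.items():
        for (a, b), s in sims.items():
            if a == item and b not in seen:
                scores[b] = scores.get(b, 0.0) + s * r
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

In a MapReduce setting, the pairwise similarity computation is the part that parallelizes naturally (one reducer per item pair).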
[Objective] The paper constructs mathematical and content prediction models based on the external and internal characteristics of academic articles, aiming to analyze the evolution of trending research topics. [Methods] With the help of the LDA model, we identified the relevant topics and constructed their time series. Then, we determined the popular topics by mean values and linear regression fitting. Finally, we predicted the trending topics with ARIMA and Word2Vec models based on topic intensity and content. [Results] We conducted an empirical study to evaluate our models with stem cell research in the United States. We identified popular topics and predicted their development trends. [Limitations] There might be ambiguity in interpreting the documents, because the Word2Vec model analyzes trends of theme contents based on single words. [Conclusions] The proposed method provides better prediction results than methods based on manual interpretation.
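The popular-topic detection step (mean values plus linear regression fitting) can be sketched as: fit a least-squares slope to each topic's intensity series, then keep topics whose mean intensity clears a threshold and whose slope is positive. The threshold value and all names are illustrative assumptions, not the paper's parameters.

```python
def linear_trend(series):
    """Least-squares slope of a topic-intensity time series."""
    n = len(series)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(series) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, series))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

def trending_topics(intensity, min_mean=0.05):
    """Topics whose mean intensity passes a threshold and whose fitted
    slope is positive (i.e., popular and rising)."""
    out = []
    for topic, series in intensity.items():
        mean = sum(series) / len(series)
        if mean >= min_mean and linear_trend(series) > 0:
            out.append(topic)
    return out
```

Topics flagged this way would then feed the ARIMA (intensity) and Word2Vec (content) forecasting stages.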
[Objective] This study proposes a size-adaptive template matching algorithm to quickly construct large-scale data sets of figure and table positions in academic literature. [Methods] First, we used the PubMed Open Access database to retrieve documents with figure/table images and parsed their contents. Then, we matched document pages against the pictures to extract their features. Finally, we identified the figure/table positions based on the matched feature points. [Results] The proposed method's precision and F1 value reached 98.87% and 97.44%, respectively. [Limitations] We only used simple keywords to match literature pages with figure/table pictures. [Conclusions] The proposed algorithm could quickly construct data sets of figure and table positions in academic literature.
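Template matching itself can be illustrated with a minimal normalized cross-correlation search; the paper's size-adaptive, feature-point-based variant is more elaborate, so this is only a sketch of the underlying idea, with assumed names and tiny dense grids in place of page images.

```python
def ncc(patch, template):
    """Normalized cross-correlation between two equal-size 2-D grids."""
    flat_p = [v for row in patch for v in row]
    flat_t = [v for row in template for v in row]
    mp = sum(flat_p) / len(flat_p)
    mt = sum(flat_t) / len(flat_t)
    num = sum((p - mp) * (t - mt) for p, t in zip(flat_p, flat_t))
    dp = sum((p - mp) ** 2 for p in flat_p) ** 0.5
    dt = sum((t - mt) ** 2 for t in flat_t) ** 0.5
    return num / (dp * dt) if dp and dt else 0.0

def best_match(page, template):
    """Slide the template over the page; return (row, col) of best score."""
    th, tw = len(template), len(template[0])
    best, pos = -2.0, (0, 0)
    for r in range(len(page) - th + 1):
        for c in range(len(page[0]) - tw + 1):
            patch = [row[c:c + tw] for row in page[r:r + th]]
            score = ncc(patch, template)
            if score > best:
                best, pos = score, (r, c)
    return pos
```

A size-adaptive variant would additionally rescale the template before matching, since figure thumbnails and their in-page renderings rarely share dimensions.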
[Objective] This paper tries to generate contrastive sentences from two related paragraphs, aiming to establish a new model for creating contrastive paragraphs. [Methods] We generated contrastive sentences automatically from contrastive text sequences. We designed a deep learning model based on Seq2seq, which incorporated contrast features with character vectors to represent texts. Both the Encoder and Decoder layers of our model used a BiLSTM structure with an attention mechanism. [Results] We examined the proposed model with manually annotated search lists and scientific papers, adopting BLEU as the evaluation index. The final evaluation score was 12.1, which was 6.5 higher than that of the benchmark model using BiLSTM + Attention. [Limitations] Due to the complexity of manual labeling, the data size in our experiments was small. [Conclusions] The proposed model could be used to build a new model for generating contrastive paragraphs.
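BLEU, the evaluation index used in the Results, can be sketched at sentence level as the geometric mean of modified n-gram precisions with a brevity penalty. This is a simplified form (uniform weights, no smoothing) and not necessarily the exact variant the paper used.

```python
from collections import Counter
from math import exp, log

def bleu(candidate, reference, max_n=2):
    """Sentence-level BLEU: geometric-mean n-gram precision times a
    brevity penalty, for token lists (simplified, tie to one reference)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(tuple(candidate[i:i + n])
                       for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n])
                      for i in range(len(reference) - n + 1))
        overlap = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        precisions.append(overlap / total if total else 0.0)
    if min(precisions) == 0:
        return 0.0  # any empty n-gram precision zeroes the score
    bp = 1.0 if len(candidate) > len(reference) \
        else exp(1 - len(reference) / len(candidate))
    return bp * exp(sum(log(p) for p in precisions) / max_n)
```

Scores like the reported 12.1 are this quantity scaled by 100 and averaged over a test set.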
[Objective] This research proposes a method for automatically transferring e-mails received by government websites to the appropriate departments, aiming to reduce the labor costs of managing public mailboxes. [Methods] First, we chose four representative classification algorithms, including Naïve Bayes, Decision Tree, Random Forest and Multi-Layer Perceptron, and compared their classification results on e-mails received by the websites of the Mayor's Offices in Beijing, Hefei and Shenzhen. Then, we designed a method for automatically transferring these e-mails. Finally, we gave suggestions on applying our method in real-world settings. [Results] Multi-Layer Perceptron yielded the best performance in our study, with macro average precision and recall both above 0.85 and all micro average indicators above 0.93. Naïve Bayes took second place. Random Forest had a high macro average precision but a poor recall score. Decision Tree had average precision and recall results. [Limitations] We did not examine the impacts of the skewed distribution of received e-mails, and eliminated the departments receiving few e-mails. [Conclusions] The proposed method streamlines the handling of public e-mail, which improves the efficiency of online government and reduces administrative costs.
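The Naïve Bayes classifier among the four compared algorithms can be sketched as a multinomial model with add-one smoothing over tokenized e-mails; the class name, tokens and department labels below are invented for illustration, not the paper's data.

```python
from collections import Counter, defaultdict
from math import log

class NaiveBayesRouter:
    """Multinomial Naive Bayes with add-one smoothing: assigns an
    e-mail's tokens to the most probable department."""
    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)
        self.label_counts = Counter(labels)
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total = sum(self.label_counts.values())
        best, best_lp = None, float("-inf")
        for label, n in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            lp = log(n / total)  # log prior
            for w in tokens:
                lp += log((counts[w] + 1) / denom)  # smoothed likelihood
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

The add-one smoothing matters for this task: departmental e-mail vocabularies are sparse, so unseen words would otherwise zero out an entire class.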
[Objective] This study develops a new argumentative zoning method based on a deep learning language representation model to achieve better performance. [Methods] We adopted the pre-trained deep learning language representation model BERT, improved the model input with a sentence position feature, and conducted transfer learning on training data from biochemistry journals. The learned sentence representations were then fed into a neural network classifier for argumentative zoning classification. [Results] The experiment indicated that for the eleven-class task, the method achieved significant improvement for most classes. The accuracy reached 81.3%, a 29.7% improvement over the best performance from previous studies. For the seven core classes, the model achieved an accuracy of 85.5%. [Limitations] Due to limitations of the experimental environment, our refined model was trained from pre-trained parameters, which could limit the classification performance. [Conclusions] The proposed method showed significant improvement over shallow machine learning schemas and the original BERT model, and avoids the tedious work of feature engineering. The method is language-independent, hence also suitable for research articles in Chinese.
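The sentence position feature added to the model input can be illustrated very simply: append a normalized position scalar to a sentence representation before classification. The concatenation scheme shown is an assumption for illustration; the paper's exact input construction for BERT may differ.

```python
def add_position_feature(sentence_vec, index, total):
    """Append the sentence's normalized position in the document
    (0.0 = first sentence, 1.0 = last) to its embedding vector."""
    position = index / (total - 1) if total > 1 else 0.0
    return sentence_vec + [position]
```

Position is informative for argumentative zoning because zones such as Background or Conclusion cluster at predictable places in a paper.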
[Objective] This study automatically analyzes the resources of a virtual learning community, aiming to address the issue of information overload. [Methods] We proposed a hyper-network LDA model based on the user-document-word cube. Then, we modified this LDA model with the help of word and user analysis. Finally, we improved the cohesiveness of topics in the hyper-network LDA model by increasing the distribution probability of closely connected words or users for the same topics. [Results] Compared to traditional social network analysis methods, the proposed LDA model can identify important users, key topics and the relationships among them, as well as user preferences via the user-vocabulary frequency matrix and the user-topic distribution probability. [Limitations] Hyper-network analysis theory is still developing, and we only studied the weighted undirected network, which does not include posting and replying relationships. [Conclusions] The hyper-network LDA model effectively analyzes the topics of short texts and online interactions, which is of significance to users and online learning community managers.
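The user-document-word cube underlying the model can be sketched as a sparse count tensor, from which the user-vocabulary frequency matrix mentioned in the Results is obtained by collapsing the document axis. Names and data are illustrative assumptions, not the paper's code.

```python
from collections import defaultdict

def build_cube(posts):
    """Tally (user, document, word) counts from (user, doc_id, text) posts."""
    cube = defaultdict(int)
    for user, doc, text in posts:
        for word in text.split():
            cube[(user, doc, word)] += 1
    return cube

def user_word_matrix(cube):
    """Collapse the cube over documents into a user-word frequency matrix."""
    matrix = defaultdict(int)
    for (user, _doc, word), n in cube.items():
        matrix[(user, word)] += n
    return matrix
```

The hyper-network LDA model would additionally reweight topic assignments so that closely connected users and words share topics more often.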
[Objective] This paper analyzes customer loan information and extracts its characteristics, aiming to predict customer defaults on online loans more effectively. [Methods] First, we collected customer credit data from Lending Club. Then, we integrated the characteristic variables from four aspects of customer information and created a grayscale map. Finally, we established a customer credit evaluation model based on convolutional neural networks. [Results] The proposed model had a specificity of 99.4%, sensitivity of 68.7%, G-mean value of 82.7%, F1 value of 81.4% and AUC value of 99.5%. Its performance was much better than that of credit models based on feature processing. [Limitations] We only investigated the performance of a few models. More research is needed on the impacts of unbalanced data. [Conclusions] The proposed model effectively predicts the probability of customer defaults.
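The grayscale-map step can be sketched as min-max scaling a customer's characteristic variables to 0-255 pixel values laid out in a square grid, which a CNN can then consume as an image. The layout and zero-padding here are assumptions; the abstract does not specify how the four aspects of customer information are arranged.

```python
import math

def to_grayscale_map(features, size=None):
    """Min-max scale feature values to 0..255 pixels and lay them out
    row-major in a square grid, zero-padding the tail."""
    lo, hi = min(features), max(features)
    span = (hi - lo) or 1.0  # avoid division by zero for constant input
    pixels = [round(255 * (v - lo) / span) for v in features]
    n = size or math.ceil(math.sqrt(len(pixels)))
    pixels += [0] * (n * n - len(pixels))
    return [pixels[i * n:(i + 1) * n] for i in range(n)]
```

Encoding tabular variables as an image lets convolutional filters pick up local interactions between adjacent features, which is the design choice that distinguishes this model from feature-processing baselines.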
[Objective] This paper measures the quality of index terms from research topics in academic databases and explores their distribution characteristics. [Methods] We collected the index terms of research topics in the humanities, social sciences and natural sciences from Web of Science and CNKI. Then, we constructed terminology spaces based on research topics, domains and databases. Third, we used term discriminative capacity (TDC) to evaluate their quality. Finally, we conducted ANOVA tests to explore the distribution characteristics of index term quality across databases and domains. [Results] The index term quality of research topics followed the rule "Abstract" > average level > "Keyword". The "Title" in CNKI ("Keyword Plus" in Web of Science) was lower than "Abstract", while the "Title" in WoS was lower than the average. [Limitations] The number of research topics in this study needs to be expanded. [Conclusions] The TDC measure is stable and reliable, which helps us improve information retrieval services and term quality.
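The ANOVA step rests on the one-way F statistic, which can be computed directly from the per-database/per-domain quality scores; this sketch omits the p-value lookup and assumes equal-variance groups.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for lists of group measurements:
    between-group variance over within-group variance."""
    all_vals = [v for g in groups for v in g]
    grand = sum(all_vals) / len(all_vals)
    k = len(groups)
    n = len(all_vals)
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))
```

A large F indicates that mean TDC differs more across databases or domains than chance within-group variation would explain.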
[Objective] This paper applies word embeddings and word semantic knowledge to improve sense prediction for Chinese Out-Of-Vocabulary (OOV) words. [Methods] First, we crawled webpages containing OOV words. Then, we trained Word2Vec and other embedding methods on the retrieved corpus. Finally, we improved the precision of OOV sense prediction with semantic knowledge of word formation, such as central-word and part-of-speech (POS) filtering. [Results] We examined our method with datasets from the People's Daily and found it achieved 87.5% precision on OOV sense prediction. Our result was much better than those of models only adopting word embeddings or only based on semantic knowledge. [Limitations] The proposed model could not effectively predict semantically opaque OOV words. [Conclusions] Combining external and internal information (i.e., word embeddings and semantic knowledge) could remarkably improve the sense prediction of OOV words.
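A common way to combine embeddings with word-formation knowledge, consistent with the Methods only at a high level, is to average the embeddings of an OOV word's known components and pick the nearest sense vector. The filtering steps are omitted here, and all names and vectors are illustrative assumptions.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    num = sum(x * y for x, y in zip(a, b))
    return num / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def predict_oov_sense(components, embeddings, sense_vectors):
    """Average the embeddings of the OOV word's known components and
    return the closest sense vector's label."""
    vecs = [embeddings[c] for c in components if c in embeddings]
    dim = len(vecs[0])
    avg = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
    return max(sense_vectors, key=lambda s: cosine(avg, sense_vectors[s]))
```

This compositional strategy is exactly what fails for semantically opaque OOV words, matching the Limitations: their meaning is not a function of their parts.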
[Objective] This paper tries to find similar doctors and improve the descriptions of their characteristics. [Methods] We generated a vector representation of each doctor's consulting texts, article titles and service scopes with the Word2Vec model, which helped us identify similar doctors. Then, we analyzed their common characteristics and collaboratively tagged these doctors. [Results] The accuracy of tagging results based on doctors' consulting texts, article titles and services was 0.667, 0.252 and 0.708, respectively. The accuracy of tagging results based on mixed texts was 1.000. [Limitations] The performance of single-text-based tagging needs to be improved. [Conclusions] Tags based on consultation texts are closely related to the immediate needs of patients, while tags based on article titles are strongly related to doctors' interests. Tags obtained from services and mixed texts are more accurate.
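The similar-doctor search can be sketched as nearest-neighbor ranking by cosine similarity over each doctor's text vector; the 2-D vectors below are toy stand-ins for averaged Word2Vec representations, and the names are invented for illustration.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def most_similar(target, doctor_vectors, top_k=2):
    """Rank the other doctors by cosine similarity to the target's vector."""
    tv = doctor_vectors[target]
    others = [d for d in doctor_vectors if d != target]
    others.sort(key=lambda d: cosine(tv, doctor_vectors[d]), reverse=True)
    return others[:top_k]
```

Running the same ranking over vectors built from consulting texts, article titles, services, or their concatenation is what produces the per-source accuracies reported in the Results.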
[Objective] This paper uses sentiment analysis to mine and quantify the sentiment contained in citation content, providing a more scientific theoretical basis and data support for discovering the intrinsic value of academic literature. [Methods] Taking journal papers retrieved from CNKI as an example, we conducted fine-grained sentiment analysis and quantification of the citation content in citing literature, explored the intrinsic academic value of the cited literature, and proposed a new academic evaluation method. [Results] Experiments showed that the dispersion coefficient of the citation-sentiment-based method was 0.12 higher than that of the traditional method based on citation frequency, and the Spearman correlation coefficient reached 0.981. [Limitations] Because there is no full-text citation database in China, experimental data are difficult to obtain, and the sample size in the experiment is small. [Conclusions] The academic evaluation method based on fine-grained citation sentiment quantification has a higher degree of discrimination and can more effectively measure the intrinsic academic value of literature.
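The Spearman coefficient reported in the Results is Pearson correlation computed on ranks; for tie-free data it reduces to the closed form below, here applied to two score lists such as sentiment-based and frequency-based rankings.

```python
def spearman(x, y):
    """Spearman rank correlation via the closed form 1 - 6*sum(d^2)/(n(n^2-1)),
    valid when neither list contains ties."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))
```

A value of 0.981 means the sentiment-based ranking almost reproduces the frequency-based one while, per the dispersion coefficient, spreading the scores more widely.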