[Objective] This paper constructs a deep learning model for automatic word segmentation and part-of-speech (POS) tagging of ancient literature, aiming to build an automatic annotation solution for Chinese books from multiple fields. [Methods] First, we used 25 Pre-Qin texts as the training corpus, covering Confucian classics, history, philosophy and miscellaneous works. Then, we constructed a unified model with BERT for word segmentation and POS tagging without adding new features. Third, we examined our model with The Records of the Grand Historian, which was not included in the training corpus. Finally, we analyzed the four basic parts constituting historical events (names, locations, time, actions) with statistics and case studies. [Results] The proposed model’s F-scores for word segmentation and POS tagging reached 95.98% and 88.97%, respectively. [Limitations] The confusion heat map of POS tagging shows that mislabeling, which is caused by the imbalanced part-of-speech distribution, the similar syntactic features of some POS categories, and multi-category words, needs further research and resolution. [Conclusions] Our deep learning model is stable and applicable for word segmentation and POS tagging of Pre-Qin literature.
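The unified treatment of segmentation and POS tagging can be illustrated with a small sketch (not the authors' code): each character receives one combined label such as "B-n" (begins a noun) or "S-v" (single-character verb), so a single sequence labeler solves both tasks at once. The BMES-plus-POS tag scheme below is an assumption for illustration.

```python
def decode_joint_tags(chars, tags):
    """Turn per-character joint tags into (word, pos) pairs.

    Assumes a BMES-style scheme: B=begin, M=middle, E=end, S=single-character
    word, each combined with a POS label after a hyphen (e.g. "B-n", "E-n").
    """
    words, buffer, pos = [], "", None
    for ch, tag in zip(chars, tags):
        boundary, _, p = tag.partition("-")
        if boundary in ("B", "S"):
            if buffer:                     # flush an unfinished word, if any
                words.append((buffer, pos))
            buffer, pos = ch, p
        else:                              # M or E: extend the current word
            buffer += ch
        if boundary in ("E", "S"):
            words.append((buffer, pos))
            buffer, pos = "", None
    if buffer:
        words.append((buffer, pos))
    return words

# Example: a two-character noun followed by a single-character verb.
print(decode_joint_tags("天下定", ["B-n", "E-n", "S-v"]))
# → [('天下', 'n'), ('定', 'v')]
```

In this formulation, the BERT encoder only needs one softmax layer over the combined tag set; no extra segmentation-specific features are required.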
[Objective] This paper aims to promote the national administration of food safety and to strengthen the prediction of, warning about, and response to related emergencies. It not only facilitates research, but also informs the public on food safety issues concisely and intuitively. [Methods] We collected news reports on food safety incidents from leading websites and constructed a corpus of food safety incident entities through data cleansing, annotation, and organization. Then, we compared the performance of the Bi-LSTM, Bi-LSTM-CRF, IDCNN, IDCNN-CRF and BERT models on entity recognition. [Results] In 10-fold cross validation, the highest F-score of the BERT model reached 81.39%, while its average F-score was 5.50% and 2.58% higher than those of the IDCNN-CRF and Bi-LSTM-CRF models, respectively. We built an integrated presentation platform for food safety incident entities based on the Bi-LSTM-CRF model. [Limitations] More research is needed to identify location entities from complex administrative regions. [Conclusions] The constructed platform supports policy formulation and food industry administration.
[Objective] This study establishes an annotation system with a cascaded deep learning model, aiming to automatically conduct sentence segmentation and punctuation for ancient Chinese literature. [Methods] First, we created a massive corpus of Chinese books from “Siku Quanshu”. Then, we treated automatic sentence segmentation and punctuation as sequence labeling issues and adopted a cascaded approach. Third, we obtained the results of automatic sentence segmentation for the unpunctuated sentences based on the BERT-LSTM-CRF model. Fourth, we processed these results with the multi-feature LSTM-CRF model and obtained the final punctuation marks after iterative learning. [Results] We built an application platform with the trained model and the Django framework. The average F-scores of the proposed method for automatic sentence segmentation and punctuation were 86.41% and 90.84%, respectively. [Limitations] The punctuation system needs to be refined. [Conclusions] The proposed model and platform significantly improve the sentence segmentation and punctuation of ancient Chinese literature, which benefits digital humanities and social science projects in China.
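The cascaded idea can be sketched minimally (this is not the authors' implementation): stage one labels each character with whether a sentence boundary follows it, and stage two then chooses a concrete punctuation mark for each boundary. The "S"/"O" boundary tags and the mark list passed to the second stage are illustrative stand-ins for the outputs of the BERT-LSTM-CRF and multi-feature LSTM-CRF models.

```python
def segment(chars, boundary_tags):
    """Stage 1: split an unpunctuated string at predicted boundaries.

    boundary_tags: one tag per character, "S" = a boundary follows this
    character, "O" = no boundary.
    """
    spans, start = [], 0
    for i, tag in enumerate(boundary_tags):
        if tag == "S":
            spans.append(chars[start:i + 1])
            start = i + 1
    if start < len(chars):
        spans.append(chars[start:])
    return spans

def punctuate(spans, marks):
    """Stage 2: attach one predicted punctuation mark per segmented span."""
    return "".join(s + m for s, m in zip(spans, marks))

spans = segment("学而时习之不亦说乎", ["O", "O", "O", "O", "S", "O", "O", "O", "S"])
print(spans)                           # → ['学而时习之', '不亦说乎']
print(punctuate(spans, ["，", "？"]))  # → 学而时习之，不亦说乎？
```

Separating the two stages lets the second model condition on complete segments rather than raw characters, which is the point of the cascade.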
[Objective] This paper uses word embedding technology to better discover the implicit associations among topics of medical science and technology reports, aiming to improve analysis methods for medical topic evolution. [Methods] We adopted the TWE (Topical Word Embeddings) model to analyze the latent semantic associations among topics of oncology studies, as well as their evolution. [Results] We found a splitting correlation of topics in 2006 and 2007, as well as a merging correlation of topics in 2011 and 2012. However, these TWE correlation results were not fully reflected in the topic evolution generated by the traditional LDA method. In 2009 and 2010, the results yielded by traditional LDA and word embeddings were completely different. [Limitations] Our sample size is limited because we only collected Chinese reports. More research is needed to examine the proposed method with other medical research topics. [Conclusions] Topic mining and evolution analysis based on the word embedding model could highlight the impacts of deep learning on topic association. It provides better results for topic evolution analysis of medical sci-tech reports.
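One way to read the splitting/merging correlations above is as a cross-year matching problem. The sketch below (hypothetical toy vectors, not the paper's TWE output) compares every topic in year t with every topic in year t+1 by cosine similarity of their vector representations: a one-to-many match suggests a split, a many-to-one match suggests a merge. The threshold value is an assumption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def links(topics_t, topics_t1, threshold=0.7):
    """Return (i, j) index pairs of sufficiently similar topics across years."""
    return [(i, j)
            for i, u in enumerate(topics_t)
            for j, v in enumerate(topics_t1)
            if cosine(u, v) >= threshold]

# Toy vectors: topic 0 of year t resembles both topics of year t+1 (a split).
year_t = [[1.0, 1.0, 0.0]]
year_t1 = [[1.0, 0.8, 0.1], [0.9, 1.0, 0.0]]
pairs = links(year_t, year_t1)
print(pairs)                                 # → [(0, 0), (0, 1)]
print(len([p for p in pairs if p[0] == 0]))  # 2 links from one topic: a split
```

The same matching applied to LDA word distributions instead of embeddings is what makes the two methods' evolution pictures directly comparable.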
[Objective] This paper reviews the methods, features and evaluation procedures of keyword extraction research, aiming to provide a reference for future studies. [Coverage] We searched the Web of Science, DBLP, Engineering Index, Google Scholar, CNKI and Wanfang Data with “Keyword Extraction”, “Keyword Generation”, “Keyphrase Extraction”, “Keyphrase Generation”, etc., and retrieved a total of 89 representative studies. [Methods] First, we analyzed the development of keyword extraction techniques. Then, we summarized related studies from the perspectives of research methods, characteristics and evaluation process. [Results] Keyword extraction methods have gradually shifted from feature-driven models to data-driven models with the development of machine learning, but still face problems such as data labeling and evaluation criteria. [Limitations] We only examined the more mainstream methods for keyword extraction. [Conclusions] This paper summarizes the developing trends of keyword extraction methods, as well as the disadvantages of existing evaluation mechanisms.
[Objective] This paper reviews research on detecting patent infringements, aiming to provide theoretical frameworks and development trends for future studies. [Coverage] We retrieved 53 representative studies from CNKI and Bing Scholar using the keywords “Patent Infringement” or “Patent Similarity”. [Methods] First, we summarized the methods for detecting patent infringement based on clustering, the vector space model, SAO (Subject-Action-Object) structures, deep learning and patent structure. Then, we compared the advantages and disadvantages of popular methods for detecting patent infringements. Finally, we explored some possible optimization solutions for the existing methods. [Results] Patent infringement detection aims to retrieve a small number of patents with higher risks of infringement from a large number of patent documents, which reduces the number of patents requiring manual judgment. These methods assess the risk of patent infringement by calculating patent similarities based on statistical information of different granularities. [Limitations] Due to the lack of standard datasets, we could not quantitatively compare the methods for detecting patent infringements. [Conclusions] Patent infringement detection could be optimized with pre-training models, calculating the similarity of different patent components, and constructing high-quality datasets.
[Objective] This research addresses the storage and online access of massive full-text documents, the governance of large-scale data, and low service performance, aiming to build a big data platform for sci-tech literature. [Methods] First, we analyzed the characteristics of distributed big data services for science and technology. Then, we adopted a co-tenant deployment strategy based on the servers and networks. Finally, we designed a big data platform for sci-tech literature with a “5+2” overall architecture. [Results] We established a PB-level big data platform for sci-tech literature. It has a data storage capacity of 200TB and holds 320 million document entities as well as 6 billion entity relationships. The metadata processing performance based on MapReduce increased threefold, forming a knowledge service architecture based on new technologies. [Limitations] We did not adequately process streaming data, thus the system cannot offer prompt responses to new data. [Conclusions] The new platform supports the knowledge discovery services of the National Science Library, Chinese Academy of Sciences, as well as its intelligent scientific research system. It offers good online services and improves the processing and service capabilities of sci-tech literature.
[Objective] This paper improves the matrix factorization algorithm with neighboring users’ comments, aiming to address the sparse-comment issue and improve recommendation accuracy. [Methods] First, we used a multi-layer perceptron to improve the matrix decomposition algorithm and obtain deep nonlinear features of users and commodities. Then, we processed the reviews and integrated the characteristics of users and their neighbors. Third, we identified users’ features in line with their preferences. Finally, we made recommendations based on the predicted scores of the features. [Results] We compared the performance of our new algorithm with other models on the Amazon dataset. The accuracy, recall, and normalized discounted cumulative gain (NDCG) of the proposed model increased by up to 8.3%, 22.8%, and 14.9%, respectively. [Limitations] We neither included the time factor of users’ comments, nor excluded fake comments. [Conclusions] Our new algorithm could effectively improve recommendation results.
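The core neural matrix-factorization step can be sketched as follows (fixed illustrative weights, not trained parameters, and not the authors' code): instead of a dot product between user and item vectors, the two vectors are concatenated and passed through a small multi-layer perceptron to capture nonlinear interactions; review-derived neighbor features would be appended to the user vector in the same way.

```python
import math

def mlp_score(user_vec, item_vec, w1, b1, w2, b2):
    """One hidden ReLU layer plus a sigmoid output: a stand-in for the MLP
    that replaces the dot product of plain matrix factorization."""
    x = user_vec + item_vec                       # concatenate the two vectors
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]          # hidden layer with ReLU
    z = sum(wi * hi for wi, hi in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-z))             # predicted preference in (0,1)

score = mlp_score([0.1, 0.4], [0.3, 0.2],         # toy user/item embeddings
                  w1=[[0.5, -0.2, 0.1, 0.3], [0.2, 0.4, -0.1, 0.2]],
                  b1=[0.0, 0.1], w2=[0.6, -0.3], b2=0.05)
print(0.0 < score < 1.0)                          # → True
```

In practice the weights are learned end to end; ranking items by this score yields the recommendation list.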
[Objective] This paper proposes a method to predict scientific collaboration based on network representation learning and the author-topic model. [Methods] First, we established embedding vector representations of authors with a network representation learning method. Then, we calculated the structural similarity of authors with cosine similarity. Third, we obtained the topic representations of authors with the author-topic model, and computed the authors’ topic similarity with the Hellinger distance. Finally, we linearly merged the two similarity measures, and used Bayesian optimization for hyperparameter selection. [Results] We examined the proposed method with the NIPS dataset and found the best node2vec+ATM model after Bayesian parameter selection. It had an AUC value of 0.9271, which was 0.1856 higher than that of the benchmark model. [Limitations] We did not add the authors’ institutions and geographic locations to the model. [Conclusions] The proposed model utilizes structure and content features to improve the prediction results of network representation learning.
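The similarity fusion described above can be sketched directly (illustrative numbers, not the paper's data): cosine similarity over the structural embeddings, Hellinger distance over the author-topic distributions converted to a similarity, then a linear merge with weight alpha. The paper tunes alpha by Bayesian optimization; here it is a fixed assumption.

```python
import math

def cosine_sim(u, v):
    """Cosine similarity between two author embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2
                         for a, b in zip(p, q))) / math.sqrt(2)

def fused_similarity(u, v, p, q, alpha=0.5):
    structural = cosine_sim(u, v)        # structure: embedding similarity
    topical = 1.0 - hellinger(p, q)      # content: distance -> similarity
    return alpha * structural + (1 - alpha) * topical

score = fused_similarity([0.2, 0.9], [0.3, 0.8],   # author embeddings
                         [0.7, 0.3], [0.6, 0.4])   # author topic distributions
print(round(score, 4))
```

Because the Hellinger distance lies in [0, 1] for probability distributions, 1 minus the distance is a natural similarity on the same scale as cosine, which is what makes the linear merge well defined.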
[Objective] This paper incorporates part-of-speech information and the CRF model into a BiLSTM network to realize automatic extraction of journal keywords, utilizing the CRF model’s strengths in sequence labeling. [Methods] We treated keyword extraction as a sequence labeling problem. First, we pre-processed journal texts with word segmentation and part-of-speech tagging. Then, we vectorized the pre-processed texts with the Word2Vec model to obtain vector expressions of words. Finally, we used the BiLSTM-CRF model for automatic keyword extraction. [Results] In experiments with the part-of-speech features and the BiLSTM-CRF network on texts collected from the China National Knowledge Infrastructure, the accuracy on simple words improved by 3% over the original BiLSTM model, and the accuracy on complex words improved by 12%. [Limitations] The model cannot accurately extract complex keywords; in future work, its performance on complex keywords needs to be further improved. [Conclusions] Compared with traditional methods, the BiLSTM-CRF model with part-of-speech information achieves higher recognition accuracy and is an effective keyword extraction method.
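Casting keyword extraction as sequence labeling means turning gold keywords into per-token tags for training. A minimal sketch (not the authors' code; the B/I/O tag names are the usual convention, assumed here) of that labeling step:

```python
def bio_labels(tokens, keywords):
    """Label each token B (keyword start), I (inside a keyword), or O.

    keywords are whitespace-separated token sequences; every occurrence in
    the token list is tagged.
    """
    labels = ["O"] * len(tokens)
    for kw in keywords:
        parts = kw.split()
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                labels[i] = "B"
                for j in range(1, len(parts)):
                    labels[i + j] = "I"
    return labels

tokens = "we study neural keyword extraction methods".split()
print(bio_labels(tokens, ["keyword extraction"]))
# → ['O', 'O', 'O', 'B', 'I', 'O']
```

At inference time the BiLSTM-CRF predicts such a tag sequence, and contiguous B/I runs are read back out as keywords; the POS tag of each token is fed in alongside its Word2Vec vector as an extra feature.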
[Objective] This paper proposes a topic recognition method for news datasets with imbalanced numbers of reports on different topics, aiming to address the inaccurate topic recognition of the traditional LDA model. [Methods] First, we modified the LDA model with three feature detection methods: independence detection, variance detection and information entropy detection. Then, we identified news topics with the proposed model. [Results] We examined our model with a dataset of 10,000 news reports. Compared with the traditional LDA topic recognition method, the recall, precision and F1 values of the proposed method improved by 0.2121, 0.0407 and 0.1520, respectively. [Limitations] Due to the large number of new words, the word segmentation accuracy was not very satisfactory, which affected the performance of news topic recognition. [Conclusions] The proposed method could effectively identify news topics from reports with imbalanced contents.
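Of the three detections, the information entropy one is the simplest to illustrate (toy numbers, hypothetical scoring; not the authors' exact formula): a word spread evenly across topics has high entropy and little discriminative power, while a word concentrated in one topic has low entropy, so entropy can be used to filter or re-weight features before LDA.

```python
import math

def entropy(dist):
    """Shannon entropy (bits) of a word's distribution over topics."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Word A appears almost only under one topic; word B is spread over four.
word_a = [0.97, 0.01, 0.01, 0.01]
word_b = [0.25, 0.25, 0.25, 0.25]
print(entropy(word_a) < entropy(word_b))   # → True: concentrated scores lower
```

Keeping low-entropy (topic-discriminative) words counteracts the dominance of frequent but uninformative words, which is one way an imbalanced corpus distorts plain LDA.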
[Objective] This paper studies the polarity of dynamic political sentiments in U.S. politicians’ tweets, aiming to analyze the future directions of U.S. politics and China-US relations. [Methods] First, we proposed a framework combining multiple deep learning models. Then, we constructed a tweet dataset from politicians and obtained a multi-class classifier for sentiment polarity. Third, we added the tweets’ time characteristics to find the dynamic political sentiment polarity. [Results] We examined our framework with tweets from 20 U.S. governors and senators. Its accuracy reached 80.66%, which was 8.07% higher than that of the traditional artificial neural network method. The success rate of sentiment polarity analysis was 75%. [Limitations] The analysis of dynamic political sentiment polarity depends on regular updates and iterations of the dataset; otherwise, the model’s accuracy and effectiveness will decrease over time. Moreover, political sentiment polarity is affected by many factors, and the emotional content of politicians’ tweets may differ from the real political tendencies they represent, which can lead to a certain degree of misjudgment. [Conclusions] The proposed method helps intelligence analysts effectively obtain the polarity of dynamic political sentiments from massive Twitter text data.
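Adding the time characteristic amounts to aggregating per-tweet polarity labels over time windows. A minimal sketch (fabricated toy tweets, not the paper's data; monthly windows are an assumption) of that aggregation:

```python
from collections import defaultdict

def monthly_polarity(tweets):
    """tweets: (date 'YYYY-MM-DD', polarity in {-1, 0, +1}) pairs,
    e.g. from a sentiment classifier. Returns {month: mean polarity}
    so the dynamic trend can be read off per month."""
    buckets = defaultdict(list)
    for date, polarity in tweets:
        buckets[date[:7]].append(polarity)        # group by 'YYYY-MM'
    return {month: sum(v) / len(v) for month, v in sorted(buckets.items())}

sample = [("2020-01-05", 1), ("2020-01-20", -1),
          ("2020-02-10", 1), ("2020-02-11", 1)]
print(monthly_polarity(sample))
# → {'2020-01': 0.0, '2020-02': 1.0}
```

The resulting time series is what the paper's limitation refers to: without regular dataset updates, later months simply have no buckets to aggregate.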