[Objective] This paper summarizes key issues, algorithms, and models from the field of Chinese word segmentation, aiming to provide a theoretical basis and practical guidance for future research.[Coverage] We reviewed a total of 109 papers from CNKI, Wanfang Data Knowledge Service Platform, and DBLP Computer Science Bibliography.[Methods] First, we discussed the developments and critical issues facing Chinese word segmentation. Then, we explored algorithms and models for Chinese word segmentation. Finally, we identified popular research topics and trends.[Results] The main challenge facing researchers is creating a multi-criteria learning model for Chinese word segmentation from multiple annotated datasets. The most popular research topic is building multi-task joint models that perform Chinese word segmentation alongside other natural language processing tasks.[Limitations] More research is needed to review studies on unsupervised learning approaches for Chinese word segmentation.[Conclusions] The existing methods of Chinese word segmentation still face challenges in building joint models with multi-perspective, multi-task, and multi-criterion features.
[Objective] This paper tries to predict the number of retweets of government microblogs, aiming to evaluate the important features affecting retweets and public opinions.[Methods] First, we used the Convolutional Neural Network (CNN) and Gradient Boosting Decision Tree (GBDT) to combine user, time and content features. Then, we predicted the retweet numbers of government microblogs. Finally, we ranked the importance of every feature to find the most important one for retweets.[Results] The proposed model improved the accuracy of retweet prediction to 0.933. The semantic feature of microblog texts is the most important one.[Limitations] We did not study the impacts of indirect retweeting behaviors.[Conclusions] The CNN-GBDT model for deep-combined features could effectively predict retweets of government microblogs.
[Objective] This paper tries to find potential trending topics from the online data, aiming to help government or enterprises monitor and guide public opinion.[Methods] First, we collected topics of public opinion with microblog’s real-time data stream. Then, we identified features of trending topics. Finally, we compared the performance of the Logistic Regression and SVM models for predicting potential trending topics.[Results] The Logistic Regression model is more capable of finding potential trending topics (recall=0.89) than SVM.[Limitations] More research is needed to examine our model with other social media platforms.[Conclusions] The proposed model could effectively identify potential trending topics of online public opinion.
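The Logistic Regression versus SVM comparison above hinges on recall over the trending class. A minimal sketch of that evaluation, assuming a tiny logistic-regression classifier trained by gradient descent; the two toy features (post growth rate, scaled repost count) and the data are invented placeholders, not the paper's microblog dataset:

```python
# Minimal logistic regression + recall evaluation (illustrative only).
import math

def train_logreg(X, y, lr=0.5, epochs=500):
    """Stochastic gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - yi  # gradient of the loss w.r.t. z
            w = [wj - lr * g * xj for wj, xj in zip(w, xi)]
            b -= lr * g
    return w, b

def predict(w, b, xi):
    return 1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else 0

def recall(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0

# Toy features: [growth rate of posts, scaled repost count]; 1 = trending.
X = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1], [0.7, 0.7], [0.3, 0.2]]
y = [1, 1, 0, 0, 1, 0]
w, b = train_logreg(X, y)
preds = [predict(w, b, xi) for xi in X]
print(recall(y, preds))
```

The same recall function applied to SVM predictions would reproduce the paper's comparison setup.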
[Objective] This study proposes a new convolutional neural network model, aiming to process the imbalanced data of online patient reviews.[Methods] First, we established the new model with mixed sampling and transfer learning techniques. Then we used an end-to-end deep learning architecture based on Word2Vec and a convolutional neural network for the distributed representation, feature extraction and topic classification of online patient reviews.[Results] Compared with traditional machine learning algorithms represented by SVM and with a single convolutional neural network, the proposed model significantly improved the accuracy, recall and F1 values.[Limitations] The imbalanced data of this study was only from online patient reviews.[Conclusions] The proposed model could effectively improve the recognition results of imbalanced data.
[Objective] The paper tries to eliminate the ambiguity of author names in the document system, aiming to solve the problem of incorrect document aggregation.[Methods] First, we constructed three types of networks for authors, documents and author-documents, with structured document data. Then we combined different network embedding methods to obtain the representation of document nodes. Finally, we employed the unsupervised learning model and the hierarchical agglomerative clustering to process the documents.[Results] We conducted empirical studies on datasets from ArnetMiner, CiteSeerX and DBLP. Our method performed well on sparse networks and the macro-F1 value increased by 6%.[Limitations] We only explored author name disambiguation in English.[Conclusions] The proposed method could effectively reduce the ambiguity of author names. It is of great significance for scientific collaboration and citation recommendation, as well as knowledge network related research.
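The final step above clusters document-node embeddings with hierarchical agglomerative clustering. A minimal sketch of single-linkage agglomeration, assuming tiny 2-D vectors as stand-ins for the learned network embeddings; the data and cluster count are illustrative:

```python
# Single-linkage hierarchical agglomerative clustering (illustrative only).
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def single_linkage(vectors, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between closest members.
                d = min(euclidean(vectors[a], vectors[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return [sorted(c) for c in clusters]

# Two authors sharing one name: embeddings fall into two tight groups.
docs = [(0.0, 0.1), (0.1, 0.0), (0.05, 0.05), (1.0, 1.1), (1.1, 1.0)]
print(single_linkage(docs, 2))
```

Each resulting cluster of document nodes would then be read as one distinct real-world author.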
[Objective] This study tries to detect funding topics and their evolution based on data from NASA’s Small Business Innovation Research Program.[Methods] First, we created funding maps with two time windows for topics of funding applications. Then, we identified areas with a higher number of topics in the map. Finally, we determined the trends by comparing the changes of hotspots between the two maps.[Results] The proposed method identified the disappeared, continuous and emerging funding topics from the maps.[Limitations] The algorithm parameters and results need to be adjusted and evaluated manually.[Conclusions] The proposed method could effectively detect funding topics and their evolution, which helps scientific management and policy decision making.
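The trend-comparison step above reduces to set operations over the hotspot topics of the two time windows. A minimal sketch, with invented topic names standing in for the real funding topics:

```python
# Classify topics as disappeared / continuous / emerging between two
# time windows (illustrative only).
def compare_windows(earlier, later):
    earlier, later = set(earlier), set(later)
    return {
        "disappeared": sorted(earlier - later),   # only in the first window
        "continuous": sorted(earlier & later),    # present in both
        "emerging": sorted(later - earlier),      # only in the second window
    }

window_1 = {"propulsion", "sensors", "materials"}
window_2 = {"sensors", "materials", "autonomy"}
print(compare_windows(window_1, window_2))
```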
[Objective] This paper aims to predict online customers’ future purchases based on their previous shopping behaviors.[Methods] We proposed a new product recommendation approach based on multi-head self-attention neural networks. Our method captured the relationship and attributes of items checked out by specific customers. Finally, we generated the recommended lists using recurrent neural networks with attentions.[Results] We examined the proposed approach on three real-world data sets and yielded better F1 values than existing methods (2% higher).[Limitations] The diversity of the recommended lists needs more analysis.[Conclusions] The multi-head self-attention mechanism is an effective way to model shopping behaviors and create better recommendations for the consumers.
[Objective] This study tries to extract named entities from the text, such as fragile ecological governance technology, implementation site, and implementation time.[Methods] We combined the Bi-LSTM+CRF model and a feature-based named entity knowledge base to automatically extract needed data from CNKI documents.[Results] For the extraction of entities on ecological governance technology, the P, R and F1 values were 74.34%, 64.04% and 68.81%, respectively. Compared to the classic CRF method, our new model improved the P and F1 values by 9.41% and 4.26%, while the R value was basically the same.[Limitations] The accuracy of Chinese word segmentation tools may affect the performance of our model. More research is needed to study the relationship among entities.[Conclusions] The proposed model could be used for resource and environment information analysis based on fine-grained contents.
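The CRF layer of a Bi-LSTM+CRF tagger decodes with Viterbi search over per-token label scores plus label-transition scores. A minimal sketch of that decoding step; the emission scores, transition scores and the "TECH" label set below are invented for a 3-token illustration, not the paper's trained model:

```python
# Viterbi decoding over toy emission and transition scores (illustrative).
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence."""
    scores = {l: emissions[0][l] for l in labels}
    backpointers = []
    for em in emissions[1:]:
        new_scores, pointers = {}, {}
        for cur in labels:
            # Best previous label for reaching `cur` at this step.
            prev = max(labels,
                       key=lambda p: scores[p] + transitions[(p, cur)])
            new_scores[cur] = scores[prev] + transitions[(prev, cur)] + em[cur]
            pointers[cur] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Backtrack from the best final label.
    best = max(labels, key=lambda l: scores[l])
    path = [best]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

labels = ["O", "B-TECH", "I-TECH"]
transitions = {(a, b): 0.0 for a in labels for b in labels}
transitions[("O", "I-TECH")] = -10.0   # forbid I- without a preceding B-
emissions = [{"O": 0.1, "B-TECH": 2.0, "I-TECH": 0.0},
             {"O": 0.5, "B-TECH": 0.2, "I-TECH": 1.8},
             {"O": 2.0, "B-TECH": 0.1, "I-TECH": 0.3}]
print(viterbi(emissions, transitions, labels))
```

In the full model, the emission scores come from the Bi-LSTM outputs and the transition scores are learned CRF parameters.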
[Objective] This study tries to reduce the dimension of customs declaration texts, aiming to improve the efficiency of customs platforms.[Methods] We collected the declaration texts from a Chinese customs office over four months as the corpus. Then, we evaluated the quality of the word vectors from the microscopic perspectives of word similarity and relevance. We also combined the traditional 0-1 matrix, frequency reduction and information gain with the SVM algorithm. Finally, we compared the results of these methods with the performance of Word2Vec word vectors.[Results] The Word2Vec word vector is an ideal dimension reduction method for customs declaration texts: classification performed best when the word vector dimension reached 500, with an accuracy rate of 93.01%.[Limitations] We only studied the five categories with larger data volume.[Conclusions] The proposed method ensures data accuracy and integrity, which significantly reduces feature dimensions.
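The "microscopic" evaluation of word similarity and relevance above is typically done with cosine similarity between word vectors. A minimal sketch, assuming tiny 3-D vectors as stand-ins for the 500-dimensional Word2Vec embeddings; the words and values are invented:

```python
# Cosine similarity between toy word vectors (illustrative only).
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

vec = {
    "export":  [0.9, 0.1, 0.0],
    "import":  [0.8, 0.2, 0.1],
    "ceramic": [0.0, 0.1, 0.9],
}
print(round(cosine(vec["export"], vec["import"]), 3))   # near 1: related
print(round(cosine(vec["export"], vec["ceramic"]), 3))  # near 0: unrelated
```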
[Objective] This paper reveals the evolution of patents from the knife-scissors industry in Guangdong Province, China.[Methods] Firstly, we proposed a new classification scheme. Secondly, we created a topic model with TRIZ features based on LDA. Thirdly, we extracted the top n high-probability words for different years and fields. Finally, we predicted the patent evolution path in the next three years.[Results] The new classification method reduced the noise of manual annotation to less than 10%. We also found that patents from knife-scissors enterprises in Guangdong mainly focused on the TRIZ rules, such as shapes, structures, movement modes, and materials.[Limitations] We only studied the knife-scissors industry.[Conclusions] The proposed method identifies key technical developing trends of the knife-scissors industry in Guangdong and gives suggestions on its upgrading in the future.
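The "top n high-probability words" step above is a straightforward ranking over an LDA topic-word distribution. A minimal sketch, with an invented toy distribution standing in for a trained topic for one year:

```python
# Pick the n highest-probability words from a topic-word distribution
# (illustrative only; the probabilities are invented).
def top_n_words(topic_word_probs, n):
    ranked = sorted(topic_word_probs.items(),
                    key=lambda kv: kv[1], reverse=True)
    return [word for word, _ in ranked[:n]]

year_2018 = {"blade": 0.30, "handle": 0.25, "coating": 0.20, "hinge": 0.05}
print(top_n_words(year_2018, 2))
```

Comparing these top-word lists across years and fields yields the evolution paths the abstract describes.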
[Objective] This paper explores the dissemination laws of online public opinion during emergencies (OPOE), aiming to help governments guide and regulate such information.[Methods] First, we used the “Xiangshui Explosion Accident in Jiangsu Province” as an example and introduced unique variables for this type of events. Then, we constructed a system dynamics model for OPOE. Third, we simulated and analyzed the proposed model with Vensim software. Finally, we adopted the government-related variables as control variables to discuss the impact of government behavior on online public opinion.[Results] For the simulation experiment, the MAPE values of the online posts and news were 18% and 27%. Thus, the simulation model is feasible and could effectively describe the developing trends of online public opinions. More importantly, government reactions also had significant effects on the dissemination of public opinions.[Limitations] Some of our data were from questionnaires and expert scoring, which might be biased.[Conclusions] The OPOE generally rises rapidly to the peak and then slowly declines. The government response time, level of reactions and transparency of official news had positive, negative and negative effects, respectively, on the evolution of public opinions.
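The rise-then-decline pattern in the conclusion can be illustrated with a stock-and-flow simulation integrated by Euler steps, in the spirit of a Vensim run. This is a minimal sketch with a single invented "public attention" stock and made-up rates, not the paper's actual system dynamics model:

```python
# Toy stock-and-flow simulation of opinion attention (illustrative only).
def simulate(steps=60, dt=1.0, spread=0.5, decay=0.1, fade=0.92):
    attention = 1.0
    inflow_strength = spread
    series = []
    for _ in range(steps):
        inflow = inflow_strength * attention   # reposts breed reposts
        outflow = decay * attention            # interest wears off
        attention += dt * (inflow - outflow)   # Euler integration step
        inflow_strength *= fade                # novelty fades each step
        series.append(attention)
    return series

series = simulate()
peak = series.index(max(series))
print(f"attention peaks at step {peak}, then declines")
```

With these toy rates, attention grows while the fading inflow still exceeds the decay, peaks, and then declines, matching the qualitative trajectory the abstract reports.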
[Objective] This study aims to explore the resonance patterns of micro-blog users on different topics from similar public health emergencies.[Methods] We constructed a stochastic resonance model for sub-topics of public health emergencies based on the Langevin equation and collected more than 170,000 microblog entries on the Shandong vaccine incident and the Changchun Changsheng vaccine incident from the Sina Weibo platform. We analyzed the resonance pattern of micro-blog topics by calculating topic factors, geography factors, attitude values and topic salience.[Results] The topics about the progress of events, the public opinion, and the government response generated obvious resonance. However, the topics on the background knowledge and post-measures failed to cause resonance from similar public health emergencies.[Limitations] We only analyzed the resonance patterns with micro-blogging topics on two similar events. More research is needed to examine our findings with other cases.[Conclusions] Resonances exist between the topics of similar public health emergencies, which are related to the number of relevant micro-blog entries, topic contents and other factors.
[Objective] The paper tries to predict the remaining execution time of an ongoing business process, aiming to provide better decision making support for process optimization.[Methods] We proposed a transfer learning framework for remaining time prediction, which constructed the prediction model with multi-layer recurrent neural networks. Then, we used a representation learning method for events to pre-train the prediction model.[Results] We examined our model with five publicly available datasets and found the proposed approach outperforms the existing ones by 11% on average.[Limitations] The proposed model is of low interpretability, which limits its applications for real business management cases.[Conclusions] The proposed approach could help us predict remaining task processing time.
[Objective] This paper explores the granularity of Chinese terms from different fields, and then measures the Term Discriminative Capacity (TDC).[Methods] First, we used TDC to evaluate the quality of terms from four indexes. Then, we detected the differences in TDC among disciplines, fields and term granularity.[Results] In the control group, the order of mean TDC was Title > Abstract > Keywords Plus > Keywords. In the experimental group, the performance of Keywords Plus was improved, thus Title > Keywords Plus > Abstract > Keywords.[Limitations] We only collected data from five disciplines in Humanities and Social Sciences.[Conclusions] Both Chinese term granularity and source fields influence the Term Discriminative Capacity. We should standardize term granularity to reduce the impact of fields.
[Objective] This paper investigates the decision-making mechanism of patients choosing doctors, aiming to build a better physician recommendation system.[Methods] First, we used Word2Vec to train the word vector model, and calculated the similarity between patients and doctors. Then, we analyzed the decision-making behaviors of patients choosing doctors. Finally, we combined the scores of doctors based on their similarity with patient needs and the latter’s decision mechanism to generate a recommended list.[Results] We conducted an empirical study with data from “Hao Daifu (Great Doctors)”. The proposed algorithm could help patients find doctors meeting their needs.[Limitations] The patient’s decision-making history needs to be analyzed. Our recommendation algorithm is for a single patient, which is costly.[Conclusions] The proposed method could recommend appropriate doctors meeting patient’s needs.
[Objective] This study tries to reconstruct tourists’ itineraries based on their travel notes and scenic information.[Methods] Firstly, we combined the TF-IDF and Word2Vec models. Then, we built a recognition method for named entities based on text similarity, which helped us identify scenic spots from travel notes. Finally, we proposed a model based on the Markov property, prior knowledge and spatial characteristics to reconstruct tour itineraries.[Results] The recall, precision and F1 values of the proposed method were 90.72%, 89.65%, and 90.18%, which were all better than those of the methods based on Conditional Random Fields. The degree of similarity between the reconstructed routes and the actual ones was 83.27%.[Limitations] The completeness of scenic information might impact the performance of our model.[Conclusions] The proposed method can automatically identify scenic spots, and reconstruct travel itineraries effectively.
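The Markov-based reconstruction step can be illustrated as a search for the ordering of recognized spots with the highest transition probability. A minimal brute-force sketch; the spot names and transition probabilities are invented, and the real model additionally uses prior knowledge and spatial characteristics:

```python
# Most probable ordering of recognized scenic spots under a toy Markov
# transition model (illustrative only).
from itertools import permutations

def best_itinerary(spots, trans):
    best_path, best_p = None, -1.0
    for path in permutations(spots):
        p = 1.0
        for a, b in zip(path, path[1:]):
            p *= trans.get((a, b), 0.01)   # small floor for unseen moves
        if p > best_p:
            best_path, best_p = list(path), p
    return best_path

trans = {("gate", "lake"): 0.7, ("lake", "temple"): 0.6,
         ("gate", "temple"): 0.2, ("temple", "lake"): 0.3}
print(best_itinerary(["lake", "temple", "gate"], trans))
```

Brute force is only viable for short itineraries; a dynamic-programming search would be needed for longer ones.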
[Objective] This paper analyzes the geographic distributions of popular online topics, aiming to provide decision-making support for public opinion management and social governance.[Methods] First, we introduced location parameters of comments into the LDA model, and proposed a region-oriented topic recognition model (RO-LDA). Then, we used this model to label texts, topics, locations and vocabularies with location tags. Third, we created text-topics, topic-words and topic-locations matrices. Finally, we identified trending topics and their geographic distributions with the help of topic-words and topic-locations distributions.[Results] We examined the proposed model with a real-world dataset. The F value reached 80.05%, which is higher than those of existing models.[Limitations] The location tags were set manually, which impacted the accuracy of region recognition.[Conclusions] The proposed method could identify geographic features of trending topics effectively.
[Objective] This paper proposes a method for automatically annotating the knowledge points of test questions from online education resources.[Methods] First, we introduced the concept of text semantics to establish new association rules. Then, considering the semantic matching degrees between the target questions and the rules, we proposed an automatic method for knowledge point annotation. Finally, we presented a personalized question recommendation mechanism.[Results] We examined the proposed method with test questions from middle school mathematics and high school history courses. We also compared our model’s labeling accuracy with naive Bayes, K-nearest neighbor, random forest and support vector machine, and yielded better results.[Limitations] The understanding of the semantics of test questions and the labeling accuracy could be further improved.[Conclusions] The knowledge point annotation and the personalized question recommendation methods could improve smart teaching and online learning.
[Objective] This paper predicts airfare on routes with fewer daily average flights and incomplete or even no historical data, aiming to help passengers choose better ticketing time.[Methods] We used historical data of multiple routes to predict airfares of the targets. Based on previous research and data, we extracted characteristic variables related to airfare fluctuations. We also classified these variables to establish the airfare forecasting model.[Results] When the model contains variables like the distance and the socio-economic characteristics of the route, the prediction error was significantly reduced.[Limitations] We did not include transit flights and local residents’ income data in our study. More research is needed to evaluate the performance of predicting algorithms.[Conclusions] The characteristics related to the year, the distance between the two places and the socio-economic factors of the routes are the main reasons for airfare fluctuations.
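The forecasting model above relates airfare to route-level characteristic variables. As a minimal sketch of that idea, the toy example below fits a one-variable least-squares line from route distance to fare; the numbers are invented, and the paper's actual model uses several classified variable groups rather than a single regressor:

```python
# Ordinary least squares on one toy variable (distance -> fare).
def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Toy data: fare (yuan) grows roughly linearly with route distance (km).
distance = [500, 800, 1200, 1500, 2000]
fare = [400, 620, 910, 1130, 1490]
slope, intercept = fit_line(distance, fare)
print(round(slope * 1000 + intercept))  # predicted fare for a 1000 km route
```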
[Objective] The paper analyzed the feasibility of using Bayesian networks for topic tracking, and proposed a new method to improve its performance.[Methods] We constructed two topic tracking models, one with a Bayesian Network, and the other with an Extended Bayesian Network. The nodes in the models represent terms, events and topics, while the arcs represent relationships among nodes. Finally, we calculated the similarity among topics, events and reports with the Propagation and Evaluation method.[Results] We examined our models on the TDT4 data set and found the DET curve of the Bayesian Network model was below that of the vector space topic model, indicating better performance. The result of the Extended Bayesian Network topic tracking model was 1.7% higher than that of the first one.[Limitations] The Extended Bayesian Network topic tracking model is a static topic model, while events are generated by the evolution of topics, so the performance improvement was limited.[Conclusions] The new models can describe the structural relationships among topics, events and stories, and conduct probability inference, which improves the performance of topic tracking effectively.
[Objective] This paper tries to extract product attributes, aiming to cluster these words and analyze users’ sentiments.[Methods] Firstly, we identified the attributes of products with the CRF technique. Then, we analyzed the sentiment of extracted terms with attention-based LSTM. Finally, we clustered these terms into appropriate categories with the help of Word2Vec and conducted fine-grained sentiment analysis of the products.[Results] The F1 values of term extraction and sentiment analysis were 0.76 and 0.78.[Limitations] We only retrieved explicit terms for this study and the sample size needs to be expanded.[Conclusions] The proposed method could effectively explore users’ preferences for products.
[Objective] This paper conducts group recommendation using the relationship among users, tags and books.[Methods] First, we used the K-means algorithm to cluster users and books. Then, we calculated cosine similarity of the two groups. Third, we compared various books based on their reviews. Finally, we sorted and clustered books to personalize the recommendation results.[Results] We examined the proposed model with data from “Douban Net” and our model recommended better resources for user groups.[Limitations] The sample data size needs to be expanded.[Conclusions] The proposed model improves the personalized recommendation of books.
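The first step above clusters users and books with K-means. A minimal sketch of Lloyd's algorithm; the 2-D points are invented stand-ins for user-tag frequency vectors, which would be much higher-dimensional in practice:

```python
# K-means (Lloyd's algorithm) on toy 2-D points (illustrative only).
def kmeans(points, centers, iters=10):
    groups = [[] for _ in centers]
    for _ in range(iters):
        # Assign each point to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c))
                     for c in centers]
            groups[dists.index(min(dists))].append(p)
        # Move each center to the mean of its assigned points.
        centers = [
            [sum(coord) / len(g) for coord in zip(*g)] if g else c
            for g, c in zip(groups, centers)
        ]
    return centers, groups

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (2.0, 2.0), (2.1, 1.9)]
centers, groups = kmeans(points, centers=[(0.0, 0.0), (1.0, 1.0)])
print(groups)
```

Cosine similarity between the resulting user-group and book-group centroids would then drive the group-level recommendation described above.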
[Objective] This paper aims to reduce the amount of textual training data and shorten the training time of our models.[Methods] We proposed a new filtering algorithm for sample selection based on a covariance estimator and then applied the data forgettable property to the embedded algorithm.[Results] In training a model for Chinese reading comprehension, the two proposed algorithms reduced training time by more than 50%. Compared with the Term Frequency-Inverse Document Frequency algorithm, our new algorithms increased the recall rate and F-score by 0.018 and 0.012, and by 0.017 and 0.029, respectively.[Limitations] More training data is needed to improve the accuracy evaluation index of the model.[Conclusions] Our algorithms reduce the model’s training time and improve the evaluation indexes. They are also suitable for parallel operations on large-scale data sets.
[Objective] This paper designs and develops a modular scientometrics system, aiming to meet researchers’ needs for real-time processing tasks. [Context] Relational database systems cannot manage the vast amount of literature resources, while distributed technology provides highly efficient computing ability for scientometrics data.[Methods] We designed a general indicator model and a standard task workflow. Then, we built the proposed system based on ES, Redis and modular indicator designs.[Results] Our platform provides a standard workflow for users to conduct scientometrics tasks and receive results in almost real time.[Conclusions] Distributed technology and modular design could help us build highly efficient and universal scientometrics and decision-making systems.
[Objective] This paper studies the domain discrimination for public opinions of online communities, aiming to improve the knowledge base, as well as the effectiveness of the machine learning models.[Methods] We retrieved 478,303 pieces of textual data from multiple online communities for college students. Then, we created a semantic relationship graph with a total of 5,248 nodes and 16,488 edges, which could also be extended automatically. Finally, we proposed a short text analysis model to conduct domain analysis for the texts.[Results] The F value of the proposed model reached 83.94%, which was 8.56%, 5.97% and 4.27% higher than those of the SVM, NB and CNN methods.[Limitations] The sample size needs to be expanded and the parameter feedback mechanism needs to be modified.[Conclusions] Compared with methods based on machine learning, the proposed model’s accuracy is improved. It could also conduct real-time analysis.