[Objective] This paper tries to detect topics of continuous group chats with various types of messages, aiming to address the topic entanglement issue of group chats and reduce the influence of sparse text features on clustering. [Methods] We proposed a multi-strategy detection model for group chat topics. The model resolves the topic crossover issue with topic sequences, and improves clustering results with data on users, time, and message types. [Results] We examined our model with the plain texts of three group chats. The new method’s F values were 2.9%, 6.1% and 3.0% higher than those of the existing algorithms, and its speed was about 27.6%, 32.1% and 47.1% faster. The method also processed mixed types of data that traditional algorithms cannot handle, with speed improvements of about 29.4%, 27.1%, and 22.5% respectively. [Limitations] We did not fully utilize the text features of group chat messages and set too many thresholds for the algorithm. [Conclusions] The proposed method could identify group chat topics and improve the efficiency of public opinion analysis.
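A minimal sketch of the multi-strategy idea behind this abstract, assuming a greedy single-pass assignment and illustrative weights and thresholds (none of which are specified in the abstract): sparse text similarity is combined with user and time signals to decide whether a new message continues an existing topic thread or starts a new one.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Message:
    user: str
    time: datetime
    msg_type: str   # e.g., "text", "image", "link"
    tokens: set     # pre-segmented words (empty for non-text messages)

def jaccard(a: set, b: set) -> float:
    """Word-overlap similarity; returns 0 when either side has no text."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def topic_score(msg: Message, thread: list, half_life=timedelta(minutes=10),
                w_text=0.6, w_user=0.2, w_time=0.2) -> float:
    """Score how well `msg` continues a topic thread by mixing sparse text
    similarity with user and time signals (weights are illustrative)."""
    text_sim = max(jaccard(msg.tokens, m.tokens) for m in thread)
    same_user = 1.0 if any(m.user == msg.user for m in thread) else 0.0
    # exponential time decay relative to the most recent message in the thread
    gap = (msg.time - thread[-1].time) / half_life
    time_sim = 0.5 ** max(gap, 0.0)
    return w_text * text_sim + w_user * same_user + w_time * time_sim

def assign(messages, threshold=0.35):
    """Attach each message to the best-scoring thread, or start a new topic
    when no existing thread scores above the threshold."""
    threads = []
    for msg in messages:
        scores = [(topic_score(msg, t), i) for i, t in enumerate(threads)]
        best_score, best_idx = max(scores, default=(0.0, -1))
        if best_score >= threshold:
            threads[best_idx].append(msg)
        else:
            threads.append([msg])
    return threads
```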
[Objective] This paper analyzes sentences on future work from scientific papers, aiming to automatically generate academic innovation ideas. [Methods] First, we combined rule matching with BERT to extract sentences on future work from papers. Then, we performed expansion calculations on papers in related fields and identified keywords and papers on future directions. Finally, these innovative raw materials were fed to a UniLM-based model to generate topics of innovation concepts. [Results] The average innovation score of the generated results is 6.04 points, and the average interest level score is 6.01 points. [Limitations] The topic generation model neither includes prior semantic knowledge nor uses large-scale data for experiments, and the quality of generated topics needs to be improved. [Conclusions] The proposed method provides a new idea for expanding technological innovation.
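A sketch of the first (rule-matching) stage described above, under the assumption that cue-phrase filtering produces candidates later verified by a fine-tuned BERT classifier; the cue list is illustrative, not the paper's actual rules.

```python
import re

# illustrative cue phrases for future-work statements
FUTURE_CUES = [r"\bfuture work\b", r"\bin the future\b", r"\bwe plan to\b",
               r"\bfurther research\b", r"\bwill be explored\b"]

def rule_candidates(sentences):
    """Rule-matching stage: keep sentences containing future-work cue phrases;
    a fine-tuned BERT classifier would then filter these candidates."""
    pattern = re.compile("|".join(FUTURE_CUES), flags=re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]
```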
[Objective] This study aims to detect fake news on social media earlier and curb the dissemination of misinformation and disinformation. [Methods] Based on the features of news images and texts, we mapped the images to semantic tags and calculated the semantic consistency between images and texts. Then, we constructed a model to detect fake news. Finally, we examined our new model with the FakeNewsNet dataset. [Results] The F1 value of our model was up to 0.775 on PolitiFact data and 0.879 on GossipCop data. [Limitations] Due to the limits of existing annotation methods for image semantics, we could not accurately describe image contents or calculate semantic consistency. [Conclusions] The constructed model could effectively detect fake news on social media.
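A minimal sketch of the image-text consistency idea, assuming the image has already been mapped to semantic tags and that `embed` is some pretrained word-embedding lookup (both assumptions; the abstract does not specify the similarity measure).

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def image_text_consistency(image_tags, text_tokens, embed):
    """Average the best cosine match between each predicted image tag and any
    text token; a low score suggests the image does not support the text.
    `embed` maps a word to a vector (e.g., pretrained word embeddings)."""
    tag_vecs = [embed(t) for t in image_tags]
    tok_vecs = [embed(t) for t in text_tokens]
    if not tag_vecs or not tok_vecs:
        return 0.0
    return float(np.mean([max(cosine(tv, wv) for wv in tok_vecs) for tv in tag_vecs]))

# The consistency score can then be concatenated with separate image and text
# features as one extra input to the downstream fake-news classifier.
```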
[Objective] This paper uses text mining techniques to extract China’s economic image from news published by Western media. [Methods] First, we analyzed how an image is represented in textual messages based on human cognitive schemas. Then, we extracted the image from topics, viewpoints and sentiment. Finally, we developed a text mining process and methods to retrieve China’s image from Western reports. [Results] China’s economic image in Western media coverage of the Davos Forum was summarized as that of a developing country full of vitality, with great achievements, bringing opportunities and challenges to the world, and possibly affecting the world order. [Limitations] The human interpretation of LDA models inevitably leads to individual differences. [Conclusions] The proposed method could benefit research and practice on extracting the image of a country, a region, or a city from news reports.
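Since the limitations mention LDA, a short sketch of the topic-modeling step is given below; the topic count and pass settings are illustrative, and the resulting top words would then be interpreted manually as the abstract describes.

```python
from gensim import corpora
from gensim.models import LdaModel

def topic_model(tokenized_docs, num_topics=10):
    """LDA step of the pipeline: train topics on the tokenized news corpus and
    return the top words per topic for manual interpretation."""
    dictionary = corpora.Dictionary(tokenized_docs)
    corpus = [dictionary.doc2bow(doc) for doc in tokenized_docs]
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=num_topics,
                   random_state=0, passes=5)
    return lda.print_topics(num_words=8)
```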
[Objective] This paper proposed a fuzzy community partition algorithm based on node vector representation, aiming to address the poor efficiency and accuracy of existing fuzzy overlapping community partition algorithms. [Methods] First, a random walk strategy guided by node importance is used to generate walk sequences, and the skip-gram model is used to train node vectors. Then, the Gaussian mixture model is introduced into community partition to fit multi-peak node data. Finally, the optimal number of communities is obtained by maximizing the modularity. [Results] Compared with classical community detection methods, the EQ values of the algorithm on the real network Jazz and the artificial network N1 (mu = 0.5) increased by 7.0% and 9.7% respectively, so it can more accurately detect the community structure in the network. [Limitations] The vector representation learning only considers the topological structure of the complex network, ignoring node attribute information and edge label information. [Conclusions] The fuzzy overlapping community detection algorithm based on node vector representation can effectively complete the community partition task for complex networks.
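A compact sketch of the pipeline described above, with assumptions stated in the comments: degree-proportional sampling stands in for the node-importance guidance, the community count is fixed rather than selected by modularity, and all hyperparameters are illustrative.

```python
import random
import networkx as nx
import numpy as np
from gensim.models import Word2Vec
from sklearn.mixture import GaussianMixture

def importance_guided_walks(G, walk_len=40, walks_per_node=10):
    """Biased random walks: the next hop is sampled in proportion to neighbor
    degree, a simple stand-in for node-importance-guided walking."""
    walks = []
    for _ in range(walks_per_node):
        for start in G.nodes():
            walk = [start]
            for _ in range(walk_len - 1):
                nbrs = list(G.neighbors(walk[-1]))
                if not nbrs:
                    break
                weights = [G.degree(n) for n in nbrs]
                walk.append(random.choices(nbrs, weights=weights)[0])
            walks.append([str(n) for n in walk])
    return walks

G = nx.karate_club_graph()
walks = importance_guided_walks(G)
# skip-gram (sg=1) node embeddings trained on the walk corpus
model = Word2Vec(walks, vector_size=64, window=5, sg=1, min_count=1, epochs=5)
X = np.array([model.wv[str(n)] for n in G.nodes()])

# Gaussian mixture soft assignment gives fuzzy (overlapping) memberships;
# the number of components would in practice be chosen by maximizing modularity.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
membership = gmm.predict_proba(X)   # row i: soft community membership of node i
```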
[Objective] This paper designs a semi-supervised model for sentiment analysis based on multi-level data augmentation, aiming to generate high-quality labeled data for natural language processing in Chinese. [Methods] First, we generated a large amount of unlabeled data with simple data augmentation and back-translation text enhancement techniques. Then, we extracted the data signals of unlabeled samples by calculating their consistency norms. Third, we calculated pseudo-labels for the weakly augmented samples, and constructed the supervised training signal from the strongly augmented samples together with the pseudo-labels. Finally, we set a confidence threshold for the model to generate prediction results. [Results] We examined the proposed model with three publicly available datasets for sentiment analysis. With only 1 000 labeled documents from the Waimai and Weibo datasets, the performance of our model was 2.311% and 6.726% better than that of BERT. [Limitations] We did not evaluate the model’s performance with vertical domain datasets. [Conclusions] The proposed method fully utilizes the information of unlabeled samples to address the issue of insufficient labeled data, and shows strong predictive stability.
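A FixMatch-style sketch of the pseudo-labeling step described above, assuming `model` returns class logits and that the two batches are weak and strong augmentations of the same unlabeled texts, already encoded as model inputs (these interfaces and the threshold value are assumptions, not the paper's exact design).

```python
import torch
import torch.nn.functional as F

def pseudo_label_consistency_loss(model, weak_batch, strong_batch, threshold=0.95):
    """Consistency loss on unlabeled data: pseudo-labels come from the weakly
    augmented view and are kept only above a confidence threshold; the strongly
    augmented view is then trained toward those pseudo-labels."""
    with torch.no_grad():
        probs = F.softmax(model(weak_batch), dim=-1)
        conf, pseudo = probs.max(dim=-1)
        mask = (conf >= threshold).float()          # keep only confident samples
    logits_strong = model(strong_batch)
    per_sample = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (per_sample * mask).sum() / mask.sum().clamp(min=1.0)
```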
[Objective] This paper creates a new model combining the statistical characteristics of audio and image properties, aiming to address the classification issues facing music retrieval. [Methods] First, we extracted statistical features of the audio and Mel spectrogram image features with the help of machine learning methods. Then, we transformed the audio classification task into image categorization. Finally, we constructed a deep learning method combining audio statistics and Mel spectrogram image features. [Results] In vocal music classification, the F1 value of the new method based on image features was about 6 percentage points higher than that of classic machine learning methods. The F1 value of the deep learning model based on feature fusion exceeded 69%, which is 3.4 percentage points higher than that of the model using only image features. [Limitations] The experimental dataset is small, and the advantages of deep learning methods were not fully utilized. [Conclusions] The sampling parameters of the Mel spectrogram influence the experimental results. The new feature fusion method can effectively improve the performance of vocal music classification.
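A sketch of the two feature views used for fusion, assuming librosa-based extraction; the sampling parameters and the particular statistical descriptors are illustrative (the abstract itself notes that Mel spectrogram sampling parameters affect the results).

```python
import numpy as np
import librosa

def audio_features(path, sr=22050, n_mels=128):
    """Extract both views used for fusion: (i) a log-Mel spectrogram treated as
    an image, and (ii) simple statistical descriptors of the audio."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)          # image-like 2-D input
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    stats = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1),
                            [librosa.feature.zero_crossing_rate(y).mean()]])
    return mel_db, stats

# In a fusion model, mel_db would feed a CNN branch and `stats` a dense branch,
# with the two branch outputs concatenated before the final classifier.
```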
[Objective] This paper proposes a matrix factorization method (TCMF) integrating tags and contents, aiming to address the issue of heterogeneous information fusion in recommendation systems. It tries to reduce prediction errors, overcome the problem of data sparsity, and improve the robustness of the matrix factorization algorithm. [Methods] We transformed textual messages into structured data with the help of embeddings. Then, we extracted hidden features with a CNN. Third, we merged the features of movie contents and tags with a DNN to obtain comprehensive features. Finally, we proposed the TCMF based on the matrix factorization algorithm and evaluated its performance with the movie rating dataset MovieLens-20M. [Results] The TCMF reduced the error of movie rating predictions (with the lowest RMSE of 0.829 5 and the lowest MAE of 0.618 9). Compared with the existing methods, the maximum reductions of RMSE and MAE were 9.62% and 14.17%. [Limitations] Due to the lack of information, the TCMF cannot characterize users’ personalized features. [Conclusions] The proposed model not only reduces the error of rating prediction, but also improves the robustness of the algorithm.
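A minimal sketch of the general idea behind a content-and-tag-enriched matrix factorization model (layer sizes, names, and the simple dense feature branch are assumptions; the paper uses a CNN/DNN fusion branch, which is abstracted here).

```python
import torch
import torch.nn as nn

class ContentTagMF(nn.Module):
    """Latent user/item factors from matrix factorization are combined with an
    item feature vector learned from content/tag inputs; their dot product
    gives the rating prediction."""
    def __init__(self, n_users, n_items, n_feat, k=32):
        super().__init__()
        self.user = nn.Embedding(n_users, k)
        self.item = nn.Embedding(n_items, k)
        # stand-in for the CNN/DNN branch that fuses content and tag features
        self.feat = nn.Sequential(nn.Linear(n_feat, 64), nn.ReLU(), nn.Linear(64, k))

    def forward(self, u, i, item_feats):
        p = self.user(u)                          # user latent factors
        q = self.item(i) + self.feat(item_feats)  # item factors enriched by features
        return (p * q).sum(dim=-1)                # predicted rating

model = ContentTagMF(n_users=1000, n_items=500, n_feat=300)
pred = model(torch.tensor([0]), torch.tensor([42]), torch.randn(1, 300))
loss = nn.functional.mse_loss(pred, torch.tensor([4.0]))
```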
[Objective] This paper proposes a normalization model for Chinese disease names based on multi-feature fusion, aiming to address the issue of multiple alternative disease names for online health communities. [Methods] First, we constructed a normalized dataset for Chinese disease names used by online health communities. Second, we conducted experiments in Chinese and English with the LSTM, GRU and CNN models. Third, we generated external semantic feature vectors with Word2vec and GloVe. Finally, we developed the normalization model MFCF-CNN for Chinese disease names based on the multi-feature fusion and self-attention mechanism. [Results] We examined the proposed model with
[Objective] The paper represents the category units of classification documents as AND-OR logical expressions with semantic features, which provides data for category semantic matching and retrieval. [Methods] We constructed a seq2seq generation model using UniLM based on the AND-OR logical semantic annotation of category unit descriptions. The model learns speech features and explicit AND-OR logical text features to improve the sorting strategy of Beam Search. The proposed method could generate AND-OR logical expressions of the semantic features within category units. By integrating context-level semantics, we extended the external semantics of category units. [Results] We examined our method with manually annotated International Patent Classification data. The evaluation score of the experimental result was 87.2 points, which was 11.5 points higher than that of the benchmark model (BiLSTM-Attention). [Limitations] More research is needed to examine the model’s performance with other datasets. [Conclusions] The proposed semantic representation method could effectively generate AND-OR logical expressions for patent data, integrating the internal semantic features of category units and the semantic features at the contextual level.
[Objective] This paper tries to automatically summarize the contents of first-instance civil judgment documents, aiming to provide concise, readable, coherent, accurate and efficient knowledge services. [Methods] We proposed an automatic abstracting method for judgment documents, which includes an extractive summary stage and an abstractive summary stage. We first added expanded residual gated convolutions to the pre-training model to extract key sentences from the judgment documents. Then, we fed the extractive summary to a sequence-to-sequence model and generated the final abstracts of the judgment documents. [Results] The ROUGE indicators of the proposed model were 50.31, 36.60, and 48.86 on the experimental datasets of judgment documents, which were 25.00, 23.25, and 24.66 higher than the results of the benchmark model (LEAD-3). [Limitations] The extractive summary obtained in the first stage is used as the input of the second-stage abstractive model, which creates a cumulative error issue. The overall performance of the proposed model is determined by the extractive model of the first stage. [Conclusions] The proposed model could summarize judgment texts automatically, which alleviates the information overload issue and helps users quickly read judgment documents.
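A structural sketch of the two-stage pipeline described above; `extractor` and `abstractor` are placeholders for the trained sentence-scoring and sequence-to-sequence models (assumptions, not the paper's exact interfaces), and the fixed top-k selection is illustrative.

```python
def two_stage_summary(document_sentences, extractor, abstractor, top_k=5):
    """Two-stage pipeline: an extractive model scores and selects key sentences,
    and an abstractive seq2seq model rewrites them into the final summary."""
    scores = extractor(document_sentences)                      # one score per sentence
    top = {s for _, s in sorted(zip(scores, document_sentences), reverse=True)[:top_k]}
    # keep the original document order of the selected sentences
    selected = [s for s in document_sentences if s in top]
    return abstractor(" ".join(selected))
```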
[Objective] This paper optimizes an existing question answering system, aiming to provide a more accurate disease knowledge query tool for the public. [Methods] Based on the disease knowledge graph, we obtained disease symptom entities with the help of the AC algorithm and semantic similarity calculation. Then, we categorized users’ questions with manual annotation and AC matching. Finally, we encapsulated the matched words into a dictionary, which was converted to a database query language to retrieve relevant answers to the questions. [Results] We examined our new system with a Chinese medical question answering dataset. It had an average accuracy of 86.0% in answering five types of questions on COVID-19, which is higher than that of the existing Q&A system. [Limitations] There are many missing values in the data on “checkup” and “infection”, which affects the performance of our new system. [Conclusions] The optimized automatic question answering system is an effective knowledge retrieval tool for epidemic-related diseases.
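A toy sketch of the matching-and-querying flow, with naive substring matching standing in for the AC multi-pattern matching step; the category keywords, question types, and query dictionary shape are all illustrative assumptions.

```python
def match_entities(question, symptom_dict):
    """Stand-in for AC matching: find every known entity mention in the question."""
    return [term for term in symptom_dict if term in question]

def build_query(question, entities):
    """Map a coarse question category plus the matched entities to a structured
    query dictionary, which would then be converted to a database query."""
    if any(k in question for k in ("symptom", "症状")):
        qtype = "disease_symptom"
    elif any(k in question for k in ("prevent", "预防")):
        qtype = "disease_prevention"
    else:
        qtype = "disease_description"
    return {"type": qtype, "entities": entities}

q = "What are the symptoms of COVID-19?"
print(build_query(q, match_entities(q, {"COVID-19", "fever", "cough"})))
```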
[Objective] This study optimizes the Lingo3G algorithm with the help of Solr scoring rules, aiming to realize cross-database knowledge integration and knowledge fingerprint services for the institutional repository. [Methods] First, we analyzed user needs and constructed a functional framework for knowledge integration analysis and visualization. Then, we selected key technologies and methods to build a platform, and explored the feasibility of knowledge integration. [Results] The proposed method calculated the characteristics of knowledge fingerprints in the institutional knowledge base. It organized and visualized knowledge fingerprints, and integrated cross-database knowledge through clustering. [Limitations] Due to differences in database structures and cross-database retrieval methods (i.e., no public resource API), we did not address all limits of cross-database retrieval. [Conclusions] The proposed method could help institutional knowledge repositories effectively integrate their knowledge resources and improve service capabilities.