Current Issue
    Volume 5, Issue 5
    Detecting Topics of Group Chats with Multiple Strategies
    Wu Xu,Chen Chunxu
    2021, 5 (5): 1-9.  DOI: 10.11925/infotech.2096-3467.2020.0718

    [Objective] This paper tries to detect topics in continuous group chats containing various types of messages, aiming to address topic entanglement in group chats and reduce the influence of sparse text features on clustering. [Methods] We proposed a multi-strategy detection model for group chat topics. The model resolves topic crossover with topic sequences, and improves clustering results with data on users, time, and message types. [Results] We examined our model with plain texts from three group chats. The new method’s F value was 2.9%, 6.1% and 3.0% higher than those of the existing algorithms, and it ran about 27.6%, 32.1% and 47.1% faster. The method also processed mixed types of data that traditional algorithms cannot handle, with speed improvements of about 29.4%, 27.1%, and 22.5% respectively. [Limitations] We did not fully utilize the text features of group chat messages, and the algorithm requires too many thresholds. [Conclusions] The proposed method could identify group chat topics and improve the efficiency of public opinion analysis.

    Identifying Academic Creative Concept Topics Based on Future Work of Scientific Papers
    Song Ruoxuan,Qian Li,Du Yu
    2021, 5 (5): 10-20.  DOI: 10.11925/infotech.2096-3467.2020.1275

    [Objective] This paper analyzes the sentences on future work from scientific papers, aiming to automatically generate academic innovation ideas. [Methods] First, we combined rule matching with BERT to extract sentences on future work from papers. Then, we conducted the expansion calculation on papers in related fields, and identified keywords and papers on future directions. Finally, these innovative raw materials were fed to the UniLM-based model to create topics of innovation concepts. [Results] The average innovation score of the generated results is 6.04 points, and the average interest level score is 6.01 points. [Limitations] The topic generation model neither includes prior semantic knowledge nor uses large-scale data for experiment, and the quality of generated topics needs to be improved. [Conclusions] The proposed method provides a new idea to expand technological innovation.

    Detecting Social Media Fake News with Semantic Consistency Between Multi-model Contents
    Zhang Guobiao,Li Jie
    2021, 5 (5): 21-29.  DOI: 10.11925/infotech.2096-3467.2020.0884

    [Objective] This study aims to detect fake news on social media earlier and curb the dissemination of mis/dis-information. [Methods] Based on the features of news images and texts, we mapped the images to semantic tags and calculated the semantic consistency between images and texts. Then, we constructed a model to detect fake news. Finally, we examined our new model with the FakeNewsNet dataset. [Results] The F1 value of our model was up to 0.775 on PolitiFact data and 0.879 on GossipCop data. [Limitations] Due to the limits of existing annotation methods for image semantics, we could not accurately describe image contents or calculate semantic consistency. [Conclusions] The constructed model could effectively detect fake news from social media.
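The consistency check described in this abstract can be illustrated with a toy score. The sketch below assumes image tags are already available as short English strings and measures consistency as the share of tag words that also occur in the news text. The function names and this overlap measure are illustrative assumptions, not the similarity measure used in the paper.

```python
import re


def tokenize(text):
    """Lowercase a string and return the set of its alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def tag_text_consistency(image_tags, news_text):
    """Share of image-tag words that also appear in the news text (0.0 to 1.0).

    Hypothetical measure: a low score suggests the image and the text
    describe different things, which a detector could use as one feature.
    """
    tag_tokens = set()
    for tag in image_tags:
        tag_tokens |= tokenize(tag)
    if not tag_tokens:
        return 0.0
    text_tokens = tokenize(news_text)
    return len(tag_tokens & text_tokens) / len(tag_tokens)
```

A real system would likely compare embeddings rather than exact words, since tags and text often use different vocabulary for the same concept.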

    Extracting China’s Economic Image from Western News
    Xu Guang,Ren Ming,Song Chengyu
    2021, 5 (5): 30-40.  DOI: 10.11925/infotech.2096-3467.2020.1190

    [Objective] This paper uses text mining techniques to extract China’s economic image from news published by Western media. [Methods] First, we analyzed how textual messages represent an image, based on human cognitive schemas. Then, we extracted the image from topics, viewpoints and sentiment. Finally, we developed a text mining process and methods to retrieve China’s image from Western reports. [Results] China’s economic image in Western media coverage of the Davos Forum was summarized as a developing country full of vitality, with great achievements, bringing opportunities and challenges to the world, and possibly affecting the world order. [Limitations] The human interpretation of LDA models inevitably leads to individual differences. [Conclusions] The proposed method could benefit research and practice on extracting the image of a country, a region, or a city from news reports.

    Fuzzy Overlapping Community Detection Algorithm Based on Node Vector Representation
    Chen Wenjie,Wen Yi,Yang Ning
    2021, 5 (5): 41-50.  DOI: 10.11925/infotech.2096-3467.2020.1208

    [Objective] This paper proposes a fuzzy community partition algorithm based on node vector representation, aiming to address the poor efficiency and accuracy of existing fuzzy overlapping community partition algorithms. [Methods] First, a random walk strategy guided by node importance generates the walk sequences, and the skip-gram model is used to train the node vectors. Then, a Gaussian mixture model is introduced into the community partition to fit the multi-peak node data. Finally, the optimal number of communities is obtained by maximizing the modularity. [Results] Compared with the classical community detection method, the EQ values of the algorithm on the real network jazz and the artificial network N1 (mu = 0.5) increased by 7.0% and 9.7% respectively, indicating that it detects the community structure in the network more accurately. [Limitations] The vector representation learning considers only the topological structure of the complex network, and ignores node attribute and edge label information. [Conclusions] The fuzzy overlapping community detection algorithm based on node vector representation can effectively complete the community division task for complex networks.
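The last step of the method, choosing the number of communities by maximizing modularity, can be sketched as follows. This is a minimal illustration that uses standard Newman modularity for a hard (non-overlapping) partition of an undirected, unweighted graph; the paper reports EQ, an extended modularity for overlapping communities, which is not reproduced here. The function names and the candidate partitions are hypothetical.

```python
def modularity(edges, community_of):
    """Newman modularity Q of a hard partition.

    edges: list of (u, v) pairs for an undirected, unweighted graph.
    community_of: dict mapping each node to its community label.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    internal = {}    # number of edges with both endpoints in the community
    degree_sum = {}  # sum of node degrees per community
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


def best_partition(edges, candidate_partitions):
    """Return the candidate partition with the highest modularity."""
    return max(candidate_partitions, key=lambda p: modularity(edges, p))
```

In the setting of the abstract, each candidate would come from fitting a Gaussian mixture model with a different number of components to the node vectors.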

    A Semi-Supervised Sentiment Analysis Method for Chinese Based on Multi-Level Data Augmentation
    Liu Tong,Liu Chen,Ni Weijian
    2021, 5 (5): 51-58.  DOI: 10.11925/infotech.2096-3467.2020.1170

    [Objective] This paper designs a semi-supervised model for sentiment analysis based on multi-level data augmentation, aiming to generate high-quality labeled data for natural language processing in Chinese. [Methods] First, we obtained a large amount of unlabeled data with the help of two text enhancement techniques, simple data enhancement and back translation. Then, we extracted the data signals of unlabeled samples by calculating their consistency norms. Third, we calculated the pseudo-label of the weakly enhanced samples, and constructed the supervised training signal from the strongly enhanced sample together with the pseudo-label. Finally, we set a confidence threshold for the model to generate prediction results. [Results] We examined the proposed model with three publicly available datasets for sentiment analysis. With only 1 000 labeled documents from the Waimai and Weibo datasets, the performance of our model was 2.311% and 6.726% better than that of BERT. [Limitations] We did not evaluate the model’s performance with vertical domain datasets. [Conclusions] The proposed method fully utilizes the information of unlabeled samples to address the issue of insufficient labeled data, and shows strong predicting stability.
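The pseudo-labeling step with a confidence threshold can be sketched in a few lines. The example below assumes the model has already produced class probabilities for a weakly and a strongly augmented view of the same unlabeled sample. The threshold value of 0.95, the cross-entropy form of the loss, and all function names are assumptions for illustration; the abstract does not specify them.

```python
import math


def pseudo_label(weak_probs, threshold):
    """Return the predicted class index if the model is confident, else None."""
    best = max(range(len(weak_probs)), key=lambda i: weak_probs[i])
    return best if weak_probs[best] >= threshold else None


def unsupervised_loss(weak_probs, strong_probs, threshold=0.95):
    """Cross-entropy of the strong view against the weak view's pseudo-label.

    Samples whose weak-view confidence is below the threshold contribute 0,
    so uncertain predictions do not produce a training signal.
    """
    label = pseudo_label(weak_probs, threshold)
    if label is None:
        return 0.0
    # Clamp to avoid log(0) for degenerate probabilities.
    return -math.log(max(strong_probs[label], 1e-12))
```

Raising the threshold trades fewer training signals for cleaner pseudo-labels.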

    Vocal Music Classification Based on Multi-category Feature Fusion
    Meng Zhen,Wang Hao,Yu Wei,Deng Sanhong,Zhang Baolong
    2021, 5 (5): 59-70.  DOI: 10.11925/infotech.2096-3467.2020.0902

    [Objective] This paper creates a new model combining the statistical characteristics of audio and image properties, aiming to address the classification issues facing music retrieval. [Methods] First, we extracted the statistical characteristics of audios and the Mel spectrogram characteristics of images with the help of machine learning methods. Then, we transformed the audio classification tasks to image categorization. Finally, we constructed a deep learning method combining audio statistics and Mel spectrogram image features. [Results] In vocal music classification, the F1 value of the new method based on image features was about 6 percentage points higher than that of the classic machine learning methods. The F1 value of the deep learning model based on feature fusion was more than 69%, which is 3.4 percentage points higher than that of the model with image features. [Limitations] The size of experimental data is small, and the advantages of deep learning methods were not fully utilized. [Conclusions] The setting of the sampling parameters of the Mel spectrogram influences the experimental results. The new feature fusion method can effectively improve the performance of vocal music classification.

    A Matrix Factorization Recommendation Method with Tags and Contents
    Ma Yingxue,Gan Mingxin,Xiao Kejun
    2021, 5 (5): 71-82.  DOI: 10.11925/infotech.2096-3467.2020.1050

    [Objective] This paper proposes a matrix factorization method (TCMF) integrating tags and contents, aiming to address heterogeneous information fusion in recommender systems. It tries to reduce prediction errors, overcome data sparsity, and improve the robustness of the matrix factorization algorithm. [Methods] First, we transformed textual messages into structured data with the help of embedding. Then, we extracted hidden features with a CNN. Third, we merged the features of movie contents and tags with a DNN to obtain comprehensive features. Finally, we proposed the TCMF based on the matrix factorization algorithm and evaluated its performance with a movie rating dataset (MovieLens-20m). [Results] The TCMF reduced the error of movie rating predictions (with the lowest RMSE of 0.829 5 and the lowest MAE of 0.618 9). Compared with the existing methods, the maximum reductions of RMSE and MAE were 9.62% and 14.17%. [Limitations] Due to the lack of information, the TCMF cannot characterize users’ personalized features. [Conclusions] The proposed model not only reduces the error of rating prediction, but also improves the robustness of the algorithm.
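For readers unfamiliar with the two error measures reported above, the sketch below shows how RMSE, MAE, and a relative error reduction are commonly computed from predicted and observed ratings. The function names are illustrative, and the example does not reproduce the paper's data or baselines.

```python
import math


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def mae(predicted, actual):
    """Mean absolute error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error
```

RMSE penalizes large individual errors more heavily than MAE, which is why the two are usually reported together.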

    Normalizing Chinese Disease Names with Multi-feature Fusion
    Han Pu,Zhang Zhanpeng,Zhang Mingtao,Gu Liang
    2021, 5 (5): 83-94.  DOI: 10.11925/infotech.2096-3467.2020.1211

    [Objective] This paper proposes a normalization model for Chinese disease names based on multi-feature fusion, aiming to address the issue of multiple alternative disease names in online health communities. [Methods] First, we constructed a normalized dataset for Chinese disease names used by online health communities. Second, we conducted experiments in Chinese and English with the LSTM, GRU and CNN models. Third, we generated external semantic feature vectors with Word2vec and GloVe. Finally, we developed the normalization model MFCF-CNN for Chinese disease names based on multi-feature fusion and the self-attention mechanism. [Results] We evaluated the proposed model with the Accuracy@10 metric. The accuracy of our MFCF-CNN model reached 85.48%, which is 8.84% higher than the basic CNN model. Our model made better use of global and local semantic features. [Limitations] The amount of the experiment data needs to be expanded. [Conclusions] The proposed model promotes the normalization of Chinese disease names, which benefits the medical knowledge graph construction and natural language understanding in Chinese.
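Accuracy@10 is usually read as the share of queries whose correct standard name appears among the top 10 ranked candidates. The sketch below illustrates that common definition; the function name and the toy English names are hypothetical, and the paper's exact evaluation protocol is not given in the abstract.

```python
def accuracy_at_k(ranked_candidates, gold_names, k=10):
    """Share of queries whose gold name is within the top-k candidates.

    ranked_candidates: one list of candidate names per query, best first.
    gold_names: the correct standard name for each query, in the same order.
    """
    if not gold_names:
        return 0.0
    hits = sum(
        1
        for ranked, gold in zip(ranked_candidates, gold_names)
        if gold in ranked[:k]
    )
    return hits / len(gold_names)
```

With k set to 1 the measure reduces to ordinary top-1 accuracy.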

    Generating AND-OR Logical Expressions for Semantic Features of Categorical Documents
    Xu Zheng,Le Xiaoqiu
    2021, 5 (5): 95-103.  DOI: 10.11925/infotech.2096-3467.2021.0023

    [Objective] The paper represents category unit of the categorical document as an AND-OR logical expression with semantic features, which provides data for category semantic matching and retrieval. [Methods] We constructed the seq2seq generation model using UniLM based on the AND-OR logical semantic annotation of category unit descriptions. This model learns the speech features and explicit AND-OR logical text features, to improve the sorting strategy of Beam Search. The proposed method could generate AND-OR logical expression of semantic features within category unit. By integrating context-level semantics, we extended the external semantics of category unit. [Results] We examined our method with the manually annotated International Patent Classification data. The evaluation score of the experimental result was 87.2 points, which was 11.5 points higher than the benchmark model (BiLSTM-Attention). [Limitations] More research is needed to examine the model’s performance with other datasets. [Conclusions] The proposed semantic representation method could effectively generate AND-OR logical expressions for patent data, which integrates the internal semantic features of category unit and the semantic features at the contextual level.

    Automatic Abstracting Civil Judgment Documents with Two-Stage Procedure
    Wang Yizhen,Ou Shiyan,Chen Jinju
    2021, 5 (5): 104-114.  DOI: 10.11925/infotech.2096-3467.2020.1109

    [Objective] This paper tries to automatically summarize the contents of first-instance civil judgment documents, aiming to provide concise, readable, coherent, accurate and efficient knowledge services. [Methods] We proposed an automatic abstracting method for judgment documents, which includes an extractive summary stage and an abstractive summary stage. We first added the expanded residual gate convolution to the pre-training model to extract key sentences from the judgment documents. Then, we input the extractive summary to the sequence-to-sequence model and generated the final judgment document abstracts. [Results] The ROUGE indicators of the proposed model were 50.31, 36.60, and 48.86 on the experimental data sets of judgment documents, which were 25.00, 23.25, and 24.66 higher than the results of the benchmark model (LEAD-3). [Limitations] The extractive summary obtained in the first stage is used as the input of the second-stage abstractive model, which accumulates errors. The overall performance of the proposed model therefore depends on the first-stage extractive model. [Conclusions] The proposed model could summarize judgment texts automatically, which eases information overload and helps users quickly read judgment documents.
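The LEAD-3 benchmark and the ROUGE comparison can be illustrated with a small sketch. LEAD-3 simply takes the first three sentences of a document as its summary, and ROUGE-1 measures unigram overlap with a reference summary. The code below assumes whitespace-tokenized English text and a naive sentence splitter; it is a simplified stand-in for the official ROUGE toolkit and for the Chinese tokenization the paper would need.

```python
import re
from collections import Counter


def lead_3(document):
    """Return the first three sentences of a document as a baseline summary."""
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return " ".join(s for s in sentences[:3] if s)


def rouge_1_f1(candidate, reference):
    """ROUGE-1 F1: harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

LEAD-3 is a strong baseline for news but tends to be weaker on documents such as judgments, where the key findings appear late in the text.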

    Optimizing Automatic Question Answering System Based on Disease Knowledge Graph
    Li He,Liu Jiayu,Li Shiyu,Wu Di,Jin Shuaiqi
    2021, 5 (5): 115-126.  DOI: 10.11925/infotech.2096-3467.2020.1263

    [Objective] This paper optimizes one existing question answering system, aiming to provide a more accurate disease knowledge query tool for the public. [Methods] Based on the disease knowledge graph, we obtained the disease symptom entities with the help of AC algorithm and semantic similarity calculation. Then, we categorized users’ questions with manual annotation and AC. Finally, we encapsulated the matched words into a dictionary, which was converted to database query language to retrieve relevant answers to the questions. [Results] We examined our new system with the Chinese medical question and answering data set. It had an average accuracy of 86.0% by answering five types of questions on COVID-19, which is higher than the existing Q&A system. [Limitations] There are many missing values of data on “checkup” and “infection”, which affects the performance of our new system. [Conclusions] The optimized automatic question answering system is an effective knowledge retrieval tool for epidemic related diseases.
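Assuming "AC algorithm" here refers to Aho-Corasick multi-pattern matching, the entity lookup step can be sketched as below: a trie of known entity names is built once, and each user question is then scanned in a single pass. The entity list, function names, and example question are hypothetical, and the semantic similarity step from the abstract is not covered.

```python
from collections import deque


def build_automaton(patterns):
    """Build Aho-Corasick goto, failure, and output tables for the patterns."""
    goto, fail, output = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(pattern)
    # Breadth-first pass: depth-1 states keep failure link 0 (the root).
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            # Inherit patterns that end at the failure state (proper suffixes).
            output[nxt] = output[nxt] + output[fail[nxt]]
    return goto, fail, output


def find_entities(text, automaton):
    """Return (start_index, pattern) for every pattern occurrence in text."""
    goto, fail, output = automaton
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in output[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches
```

Because the automaton reports overlapping matches such as "cough" inside "dry cough", a downstream step would typically keep the longest match before building the database query.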

    Cross-database Knowledge Integration and Fingerprint of Institutional Repositories with Lingo3G Clustering Algorithm
    Lu Linong,Zhu Zhongming,Zhang Wangqiang,Wang Xiaochun
    2021, 5 (5): 127-132.  DOI: 10.11925/infotech.2096-3467.2020.0882

    [Objective] This study optimizes the Lingo3G algorithm with the help of Solr scoring rules, aiming to realize cross-database knowledge integration and knowledge fingerprint services for the institutional repository. [Methods] First, we analyzed user needs, and constructed a functional framework for knowledge integration analysis and visualization. Then, we selected key technologies and methods to build a platform, and explored the feasibility of knowledge integration. [Results] The proposed method calculated the characteristics of knowledge fingerprints in the institutional knowledge base. It organized and visualized knowledge fingerprints, and integrated cross-database knowledge through clustering. [Limitations] Due to differences in database structures and cross-database retrieval methods (i.e., no public resource API), we did not address all limits of cross-database retrieval. [Conclusions] The proposed method could help institutional knowledge repositories effectively integrate their knowledge resources and improve service capabilities.

  Copyright © 2016 Data Analysis and Knowledge Discovery   Tel/Fax:(010)82626611-6626,82624938   E-mail:jishu@mail.las.ac.cn