Current Issue: Volume 6, Issue 2/3 (2022)
    Review of Technology Term Recognition Studies Based on Machine Learning
    Hu Yamin, Wu Xiaoyan, Chen Fang
    2022, 6 (2/3): 7-17.  DOI: 10.11925/infotech.2096-3467.2021.1066

    [Objective] This paper reviews the status quo and future directions of technology term recognition studies based on machine learning. [Coverage] We searched “technology term* recognition” in Chinese and English with the Web of Science and CNKI. Then, we expanded our search to include the relevant algorithm literature. A total of 62 representative papers were chosen for this review. [Methods] We summarized the applications and differences of machine learning in technology term recognition, and then examined the field from four perspectives: the classification of algorithms, general procedures, existing problems, and downstream applications. Finally, we discussed the development trends and future studies. [Results] The algorithms can be divided into single statistical machine learning, single deep learning, and hybrid algorithms. The most widely used approach is the hybrid method, i.e., the BiLSTM-CRF model. Transfer learning is an important research direction for the future. [Limitations] With the rapid progress of deep learning, hybrid models are constantly emerging; this paper only summarizes the popular ones. [Conclusions] Many issues still need to be addressed. In the future, research on fine-grained entity recognition, feature representation, evaluation, and open-source toolkits should be strengthened.

    Technology Evolution Analysis Framework Based on Two-Layer Topic Model and Application
    Lv Lucheng, Zhou Jian, Wang Xuezhao, Liu Xiwen
    2022, 6 (2/3): 18-32.  DOI: 10.11925/infotech.2096-3467.2021.0908

    [Objective] This paper constructs a new analysis framework for technology evolution, aiming to address the problems of topic similarity calculation and of manually setting thresholds to judge correlations between technology topics across time windows. [Methods] We established the new framework based on a two-layer topic model, which identifies dynamic topics using LDA and NMF. Then, we evaluated the technical topic identification effects with indicators of inner consistency and outer difference of the topics. Finally, we analyzed the evolution of technical topics from the perspectives of topic growth and importance. [Results] We examined our new method with data from the field of resources and environment. The two-layer topic model based on NMF is more effective in dynamic topic recognition, and the analysis results of technology evolution can be verified against the list of breakthrough technologies released by MIT Technology Review. [Limitations] This paper only studies the development of technology from emergence to extinction, and does not examine the division, derivation and integration of technology. [Conclusions] The proposed method can automatically identify dynamic topics and analyze their evolution tracks using the literature. It has application value in scientific and technological information analysis.
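    As an editorial illustration (not code from the paper), a minimal Python sketch of the two-layer idea with scikit-learn is given below; `window_docs`, the topic count, and the vectorizer settings are assumptions, and the paper's consistency/difference indicators and evolution analysis are not reproduced.

        # Sketch: fit LDA and NMF topic models on one time window and list the top terms per topic.
        from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
        from sklearn.decomposition import LatentDirichletAllocation, NMF

        def window_topics(window_docs, n_topics=10, n_top_words=10):
            # LDA works on raw term counts; NMF is usually fit on TF-IDF weights.
            count_vec = CountVectorizer(max_df=0.95, min_df=2)
            tfidf_vec = TfidfVectorizer(max_df=0.95, min_df=2)
            counts = count_vec.fit_transform(window_docs)
            tfidf = tfidf_vec.fit_transform(window_docs)

            lda = LatentDirichletAllocation(n_components=n_topics, random_state=0).fit(counts)
            nmf = NMF(n_components=n_topics, init="nndsvd", random_state=0).fit(tfidf)

            def top_words(model, feature_names, k):
                return [[feature_names[i] for i in comp.argsort()[-k:][::-1]]
                        for comp in model.components_]

            return {"lda": top_words(lda, count_vec.get_feature_names_out(), n_top_words),
                    "nmf": top_words(nmf, tfidf_vec.get_feature_names_out(), n_top_words)}

    Repeating this per time window and linking topics across windows by similarity is the step the framework then builds on.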

    Predicting Values of Technology Convergence with Multi-Feature Fusion
    Zhang Jinzhu, Han Yongliang
    2022, 6 (2/3): 33-44.  DOI: 10.11925/infotech.2096-3467.2021.0962

    [Objective] This paper proposes a new method to predict technology convergence relationships and their values based on the patent classification network and text semantic features. [Methods] First, we calculated the correlation between patents and their classifications to construct the co-occurrence network and obtain its structural similarity features. Then, we connected patent texts and their classification schema with the correlation strength. We also obtained the text semantic similarity features using text representation learning. Second, we constructed similarity indicators with the network structure and text semantic features, which were fused to create a feature vector. Third, we used the random forest model to learn the weights and contributions of different indicators and calculated the technology fusion probability. We also generated the candidate technology fusion relationship set. Fourth, based on the network characteristics and bibliometric characteristics of patent classification and citation, as well as their influence and potential growth, we created the evaluation indices for their technical, commercial and strategic values. Finally, we used the proposed method to evaluate the technology convergence relationships. [Results] The accuracy of the proposed method is at least 20% higher than that of single-feature prediction. In addition, the top 10 pairs of high-value technology convergence relations identified by the proposed method differ little from the real ranking, with an MAE of only 3.2. [Limitations] Some data sets are inconsistent, and more machine learning methods need to be explored. [Conclusions] The feature fusion method has higher prediction accuracy than traditional methods. The proposed method can also effectively evaluate the value of technology convergence relationships.
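    As an editorial illustration (not the authors' code), the following sketch shows feature-level fusion with a random forest; the file names, feature matrices, and labels are placeholders.

        # Sketch: fuse network-structure and text-semantic similarity features for candidate
        # technology pairs and score convergence probability with a random forest.
        import numpy as np
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        structure_feats = np.load("structure_feats.npy")    # (n_pairs, d1), hypothetical file
        semantic_feats = np.load("semantic_feats.npy")      # (n_pairs, d2), hypothetical file
        labels = np.load("labels.npy")                      # 1 = pair converged, 0 = not

        X = np.hstack([structure_feats, semantic_feats])    # simple feature-level fusion
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)

        rf = RandomForestClassifier(n_estimators=300, random_state=0)
        rf.fit(X_tr, y_tr)
        proba = rf.predict_proba(X_te)[:, 1]                # convergence probability per pair
        print(rf.feature_importances_)                      # learned indicator contributions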

    Clustering Technology Topics Based on Patent Multi-Attribute Fusion
    Liu Xiaoling, Tan Zongying
    2022, 6 (2/3): 45-54.  DOI: 10.11925/infotech.2096-3467.2021.1086

    [Objective] Reasonable, effective and accurate classification of technology topics is of great significance. This article integrates multiple attributes of patents to improve the division of technology topics. [Methods] First, we constructed the patent text vector, the patent citation vector and the patent classification vector based on text contents, citation relationship and classification information of the patents. Then, we obtained a new patent vector based on multi-attribute fusion of the three vectors. Finally, we identified technology topics through patent clustering analysis. [Results] Compared with the patent vector representation method based on single or two attributes, our method had higher patent classification precision, recall rate and F1 value on different IPC classification levels and sample sizes. Our measurement of patent similarity was also more accurate. [Limitations] We used automatic classification for patents rather than direct methods to evaluate the effect of technology topic division. [Conclusions] The proposed method improves the accuracy of patent similarity measurement and technology topic division.
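    A minimal sketch (not from the paper) of the fusion-then-cluster step, assuming the three per-patent vector matrices have already been built, with equal attribute weights and a fixed cluster count:

        # Sketch: fuse text, citation and classification vectors for each patent, then cluster.
        import numpy as np
        from sklearn.preprocessing import normalize
        from sklearn.cluster import KMeans

        text_vec = np.load("patent_text_vec.npy")    # hypothetical (n_patents, d_text)
        cite_vec = np.load("patent_cite_vec.npy")    # hypothetical (n_patents, d_cite)
        ipc_vec = np.load("patent_class_vec.npy")    # hypothetical (n_patents, d_ipc)

        # Normalise each attribute so no single one dominates, then concatenate.
        fused = np.hstack([normalize(text_vec), normalize(cite_vec), normalize(ipc_vec)])
        topics = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(fused)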

    Identifying Emerging Technology with LDA Model and Shared Semantic Space——Case Study of Autonomous Vehicles
    Zhou Yunze, Min Chao
    2022, 6 (2/3): 55-66.  DOI: 10.11925/infotech.2096-3467.2021.0926

    [Objective] This paper proposes a new method to identify emerging technologies using a shared semantic space and multi-source data. [Methods] We used the LDA model to detect topics in multi-source data. Then, we utilized the Word2Vec model to create vectors for these topics based on their representative words and weights. Third, we merged the topics and used topic strength and novelty to identify emerging technologies. [Results] We found seven emerging technologies in the field of Autonomous Vehicles, including Driver Switching, Selection and Control of Travel Path, Lane Change Safety, Motion Estimation and Risk Aversion, Structure Design, Perception of the Environment, as well as Communication Technology and Communication Security. [Limitations] More research is needed to explore better ways to determine the threshold and find fine-grained topics. [Conclusions] The proposed method is able to detect emerging topics using data from multiple sources, which optimizes the existing methods.
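    A minimal sketch (not from the paper) of embedding LDA topics into a shared Word2Vec space, assuming a Word2Vec model trained on the combined corpus; the topic word lists and the 0.8 merge threshold are illustrative.

        # Sketch: embed each LDA topic as the weight-averaged Word2Vec vector of its top words,
        # then compare topics from different sources by cosine similarity before merging.
        import numpy as np
        from numpy.linalg import norm
        from gensim.models import Word2Vec

        w2v = Word2Vec.load("w2v.model")     # hypothetical model trained on papers + patents

        def topic_vector(topic_words):
            # topic_words: list of (word, weight) pairs from one LDA topic
            vecs, weights = [], []
            for word, weight in topic_words:
                if word in w2v.wv:
                    vecs.append(w2v.wv[word])
                    weights.append(weight)
            return np.average(vecs, axis=0, weights=weights)

        def cosine(a, b):
            return float(a @ b / (norm(a) * norm(b)))

        # Toy topic word lists; real ones come from LDA models fit on each source.
        paper_topic = [("lidar", 0.12), ("perception", 0.10), ("obstacle", 0.08)]
        patent_topic = [("sensor", 0.11), ("lidar", 0.09), ("detection", 0.07)]
        merge = cosine(topic_vector(paper_topic), topic_vector(patent_topic)) > 0.8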

    Social Media Image Classification for Emergency Portrait
    Li Gang, Zhang Ji, Mao Jin
    2022, 6 (2/3): 67-79.  DOI: 10.11925/infotech.2096-3467.2021.0952

    [Objective] This study proposes a new classification method for social media images that fuses the related texts, aiming to efficiently construct emergency portraits. [Methods] First, we analyzed the general process of creating emergency portraits based on social media. Then, we designed a two-layer image classification system for the dimensions of the emergency portrait. Third, we proposed a deep learning model (Unimodal and Crossmodal Transformer Model, UCTM) for image classification, which integrates image and text multimodal semantics. We constructed the emergency portrait with our model on the dataset of Super Typhoon Mangkhut, and compared its performance with existing methods. [Results] The MAP score of our UCTM was 0.021 higher than those of single-modal classification methods and bilinear fusion methods. For the preparation and rescue information, the F1 scores of our algorithm were 0.017 and 0.018 better than those of direct classification. [Limitations] Our model does not investigate the inconsistency between textual and graphic semantics, and the types of emergencies need to be expanded. [Conclusions] The proposed method enriches the dimensions and contents of emergency portraits, which improves preparation and response for crises.

    Predicting Churners of Online Health Communities Based on the User Persona
    Wang Ruojia, Yan Chengxi, Guo Fengying, Wang Jimin
    2022, 6 (2/3): 80-92.  DOI: 10.11925/infotech.2096-3467.2021.1062

    [Objective] This paper tries to predict user behaviors in online health community based on user persona technology, aiming to identify and keep the potential churners. [Methods] We constructed a multi-dimensional label system for user persona with the help of statistical analysis, social network analysis, natural language processing and LDA topic clustering. Then, we used the decision tree and ensemble learning models to predict the potential churners. [Results] We examined our new model with the Huaxia Traditional Chinese Medicine Forum and its F1 value reached 88.77%. [Limitations] More research is needed to examine our algorithm with other online health communities. [Conclusions] User persona technology could help us predict potential user churns.

    Research on User Roles Based on OHCs-UP in Public Health Emergencies
    Qian Danmin, Zeng Tingting, Chang Shiyi
    2022, 6 (2/3): 93-104.  DOI: 10.11925/infotech.2096-3467.2021.0946

    [Objective] To explore the development trend of online health communities during public health emergencies, this paper constructs a post popularity evaluation model based on the TOPSIS approach and uses user portraits to define user roles. [Methods] We crawled epidemic-related posts from Dingxiangyuan and obtained 4,972 valid records, ranked post popularity with the entropy-weighted TOPSIS method, reduced dimensionality with factor analysis, and finally constructed user portraits based on K-means clustering. [Results] During the epidemic, Dingxiangyuan users posted in four major sections: postgraduate entrance examination, news hotspots, mood station, and preventive medicine. Using user portraits, we divided users into 7 categories: high-influence users, professional users, long-term users, high-volume users, high-potential users, institutional users, and strongly interactive users. [Limitations] Because the selected website only allows crawling of the first 14 pages of data, the constructed data set is small, and a horizontal comparison of different OHCs has not been performed. [Conclusions] The research shows that accurate user positioning helps to understand the differences between user groups and accurately grasp user needs during public health emergencies, so as to provide more evidence and suggestions for communities to carry out their work under similar incidents.
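    A minimal sketch (not the authors' code) of entropy-weighted TOPSIS ranking, assuming an (n_posts × n_indicators) matrix of benefit-type popularity indicators such as views, replies, and likes:

        # Sketch: entropy-weighted TOPSIS ranking of post popularity.
        import numpy as np

        def entropy_topsis(X):
            # 1. Min-max normalise each indicator column.
            Z = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-12)
            # 2. Entropy weights: lower entropy -> more discriminative -> larger weight.
            P = Z / (Z.sum(0) + 1e-12)
            e = -(P * np.log(P + 1e-12)).sum(0) / np.log(len(X))
            w = (1 - e) / (1 - e).sum()
            # 3. TOPSIS: distances to the ideal and anti-ideal solutions.
            V = Z * w
            d_plus = np.sqrt(((V - V.max(0)) ** 2).sum(1))
            d_minus = np.sqrt(((V - V.min(0)) ** 2).sum(1))
            return d_minus / (d_plus + d_minus + 1e-12)   # closeness: higher = hotter post

        scores = entropy_topsis(np.random.rand(100, 5))    # toy data for illustration
        ranking = scores.argsort()[::-1]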

    A Multi-Task Text Classification Model Based on Label Embedding of Attention Mechanism
    Xu Yuemei, Fan Zuwei, Cao Han
    2022, 6 (2/3): 105-116.  DOI: 10.11925/infotech.2096-3467.2021.0912

    [Objective] This paper adjusts the text classification algorithm according to task-specific features, aiming to improve the accuracy of text classification for different tasks. [Methods] We proposed a text classification algorithm based on a label attention mechanism. Through label embedding learning of both the word vectors and the TF-IDF classification matrix, we extracted task-specific features by assigning different weights to the words, which improves the effectiveness of the attention mechanism. [Results] The accuracy of the proposed method increased by 3.78%, 5.43%, and 11.78% compared with the existing LSTMAtt, LEAM and SelfAtt methods. [Limitations] We did not study the impacts of different vector models on the performance of text classification. [Conclusions] This paper presents an effective method to improve and optimize multi-task text classification algorithms.

    Joint Extraction Model for Entities and Events with Multi-task Deep Learning
    Yu Chuanming, Lin Hongjun, Zhang Zhengang
    2022, 6 (2/3): 117-128.  DOI: 10.11925/infotech.2096-3467.2021.0965

    [Objective] The study tries to improve the performance of entity and event extraction with the help of their correlation. [Methods] Based on the multi-task deep learning, we proposed a joint entity and event extraction model (MDL-J3E), which had the shared layer, the private layer, and the decoding layer. The shared layer generated common features. The private layer had the named entity recognition and event detection modules, which extracted features of the two subtasks based on their general features. The decoding layer analyzed features of each task and generated tag sequence following the constraint rules. [Results] We examined our model with the ACE2005 dataset. The F1 values were 84.15% in the named entity recognition task and 70.96% in the event detection task. [Limitations] We did not evaluate the proposed model with other information extraction scenarios. [Conclusions] Compared with the single task model, our multi-task model has better performance in both named entity recognition and event detection tasks.
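    A minimal PyTorch sketch (not the authors' code) of the shared/private layout: one shared encoder feeds an NER head and an event-detection head. The layer sizes, tag counts, and the omission of the CRF decoding layer are simplifications.

        # Sketch: shared BiLSTM features plus task-private BiLSTM heads for joint training.
        import torch
        import torch.nn as nn

        class SharedPrivateTagger(nn.Module):
            def __init__(self, vocab_size, emb_dim=128, hidden=256,
                         n_entity_tags=9, n_event_tags=34):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, emb_dim)
                # Shared layer: common features for both subtasks.
                self.shared = nn.LSTM(emb_dim, hidden // 2, bidirectional=True, batch_first=True)
                # Private layers: one module per subtask.
                self.ner_lstm = nn.LSTM(hidden, hidden // 2, bidirectional=True, batch_first=True)
                self.event_lstm = nn.LSTM(hidden, hidden // 2, bidirectional=True, batch_first=True)
                self.ner_out = nn.Linear(hidden, n_entity_tags)
                self.event_out = nn.Linear(hidden, n_event_tags)

            def forward(self, token_ids):
                shared, _ = self.shared(self.embed(token_ids))
                ner_feats, _ = self.ner_lstm(shared)
                event_feats, _ = self.event_lstm(shared)
                return self.ner_out(ner_feats), self.event_out(event_feats)

        # Joint training sums the two task losses so the shared layer learns from both.
        model = SharedPrivateTagger(vocab_size=20000)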

    Multi-label Patent Classification with Pre-training Model
    Tong Xinyu, Zhao Ruijie, Lu Yonghe
    2022, 6 (2/3): 129-137.  DOI: 10.11925/infotech.2096-3467.2021.0930

    [Objective] This paper tries to improve the automatic patent classification method and accurately match patent applications with one or more suitable IPC classification numbers. [Methods] We constructed a large-scale Chinese patent dataset (CNPatents), and used the first four digits of IPC classification numbers as labels. Then, we utilized BERT, RoBERTa, and RBT3 models for training and testing. [Results] For our classification task with more than 600 labels, the best model reached an accuracy of 75.6% and a Micro-F1 value of 59.7%. After high-frequency label screening, the accuracy and the Micro-F1 value increased to 91.2% and 71.7%. [Limitations] The patent documents used as the training set suffer from an extreme data imbalance issue, and more research is needed to improve high-frequency label screening for training. [Conclusions] This paper realizes the automatic classification of multi-label patents and further improves the performance of the classification model with high-frequency label screening.
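    A minimal sketch (not the paper's implementation) of multi-label classification with a Chinese BERT encoder; the checkpoint name, label count, and threshold are assumptions, and high-frequency label screening is not shown.

        # Sketch: one sigmoid output per IPC label, trained with binary cross-entropy.
        import torch
        import torch.nn as nn
        from transformers import BertModel, BertTokenizer

        NUM_LABELS = 600   # illustrative: one label per retained 4-digit IPC code
        tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
        encoder = BertModel.from_pretrained("bert-base-chinese")
        classifier = nn.Linear(encoder.config.hidden_size, NUM_LABELS)
        loss_fn = nn.BCEWithLogitsLoss()   # independent sigmoids -> multi-label output

        def step(texts, label_matrix):
            # label_matrix: float tensor of shape (batch, NUM_LABELS) with 0/1 entries
            enc = tokenizer(texts, padding=True, truncation=True, max_length=512,
                            return_tensors="pt")
            pooled = encoder(**enc).pooler_output          # (batch, hidden)
            logits = classifier(pooled)
            return loss_fn(logits, label_matrix), torch.sigmoid(logits) > 0.5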

    Simulating Dynamics Prediction with Collaborative Allocation System for Blockchain Resources: Case Study of Guangdong-Hong Kong-Macao Greater Bay Area
    Wang Xiaoqing, Chen Dong
    2022, 6 (2/3): 138-150.  DOI: 10.11925/infotech.2096-3467.2021.0967

    [Objective] This paper analyzes the interaction mechanism among key resource allocations in each blockchain, aiming to construct better regional economic blockchains and promote coordinated economic development. [Methods] Based on the analysis of resource allocation elements in the blockchain and on system dynamics theories and methods, we used the Vensim system to simulate and analyze the related blockchain industries. [Results] (Ⅰ) Sensitivity analysis showed: industry chain = capital chain > talent chain > innovation chain; (Ⅱ) For the industry chain, the year 2030 is a key node; (Ⅲ) For the capital chain, the years 2021 to 2025 are the key period; (Ⅳ) For the talent chain, the years 2025 to 2035 are the key period; (Ⅴ) For the innovation chain, the entire period is critical. [Limitations] More research is needed to improve the selection of influencing factors for the “five chains” and examine their internal mechanisms thoroughly. [Conclusions] The proposed method provides guidance for predicting the results of collaborative resource allocation.

    An Analysis Framework for Job Demands from Job Postings
    Yue Tieqi, Fu Youfei, Xu Jian
    2022, 6 (2/3): 151-166.  DOI: 10.11925/infotech.2096-3467.2021.0947

    [Objective] This paper proposes a complete and systematic framework to analyze qualifications from online job postings. It then examines the requirements of Internet-related jobs with the framework. [Methods] First, we retrieved recruitment advertisements for the Internet industry. Then, we constructed an LDA model for topic mining and classification of job descriptions. Finally, we used the Word2Vec model and dependency syntax analysis to obtain the topic-word and degree-word lists to construct the topic ontology. [Results] The empirical analysis revealed the status quo of the Internet industry positions, such as the regional and category distributions, as well as the required qualification for different types of positions. [Limitations] There were few data samples for campus recruitment, which led to deviations between the analysis results and the actual situation. The word-segmentation is not perfect for the LDA model, and some topics were not representative. [Conclusions] The proposed framework could effectively analyze job postings.

    Identifying Metaphors and Association of Chinese Idioms with Transfer Learning and Text Augmentation
    Zhang Wei, Wang Hao, Chen Yuetong, Fan Tao, Deng Sanhong
    2022, 6 (2/3): 167-183.  DOI: 10.11925/infotech.2096-3467.2021.1020

    [Objective] This paper tries to identify sentiment metaphors from Chinese idioms and build an idiom knowledge graph integrating external things (source) and users’ internal attitudes or sentiments (target). [Methods] We proposed a recognition scheme for metaphors of Chinese idioms based on transfer learning and text augmentation. First, we retrieved the idioms and their external categories to obtain the external knowledge and the learning corpus with the help of sentiment dictionary. Then, we matched idioms with the dictionary, which were used for the first round of transfer learning. All other sentiment words in the sentiment dictionary were the training set for the second round of transfer. Third, we introduced Chinese language knowledge to augment the texts with the weak sentiment semantics due to the metaphorical characteristics. Fourth, we compared the CLS of the BERT text embedding with the average pooling schemes using mainstream deep learning models. Finally, we hierarchically classified the un-matched idioms with the optimal model and merged them with the matched idioms to obtain internal knowledge. [Results] The average pooling accuracy was 4.69% higher than the [CLS], which was further improved by 13% by adding idiom interpretation. The sentiment accuracy at all levels of the second transfer reached 80%, and the highest improvement was up to 6.25% for small corpus. [Limitations] The classification accuracy of sentiment categories could be improved with larger corpus. [Conclusions] Our scheme can effectively identify the sentiment metaphor knowledge of Chinese idioms, and the association of internal and external knowledge lays the foundation for better knowledge services.

    Constructing Knowledge Graph for Financial Securities and Discovering Related Stocks with Knowledge Association
    Liu Zhenghao, Qian Yuxing, Yi Tianlong, Lv Huakui
    2022, 6 (2/3): 184-201.  DOI: 10.11925/infotech.2096-3467.2021.0609

    [Objective] This paper constructs domain knowledge graph based on knowledge association and discovers industry characteristics and related stocks, aiming to improve investors’ decision making. [Methods] Firstly, we constructed the “seed” knowledge graph with stock data. Then, we conducted entity extraction and relationship classification with unstructured text data based on FinBERT pre-training model to generate the triples. Third, we merged the seed graph and the triples to create the knowledge graph for financial securities. Fourth, based on the graph, link prediction, similarity calculation and other data mining algorithms, we discovered the related stocks and their hidden characteristics. Our findings were preliminarily verified by statistical methods. [Results] Our new knowledge graph was constructed with 111,845 entities and 163,370 relationships. We analyzed 10 cross-industry stocks having the highest similarity with “Northeast Securities”. We also examined the potential nonlinear correlation between stocks using “Sihuan Biology”. [Limitations] The constructed knowledge graph only included the impacts of static information (e.g., industry and shareholder ownership) on stock correlation. [Conclusions] Our new knowledge graph provides strong data analytics support for investors to make effective portfolio strategies and predict stock trends.

    Question Comprehension and Answer Organization for Scientific Education of Epidemics
    Cheng Zijia, Chen Chong
    2022, 6 (2/3): 202-211.  DOI: 10.11925/infotech.2096-3467.2021.1057

    [Objective] This study constructs a KGQA system based on a knowledge graph of epidemics, which improves the comprehension of user questions and the organization of answers, aiming to effectively disseminate professional knowledge to the public. [Methods] First, we summarized users’ information needs based on multiple health information systems. Then, we combined the AC algorithm with a BERT model to understand user queries and map the elements of questions to structured query statements. Third, we retrieved answers from the pre-constructed epidemic knowledge graph. Finally, we organized the answers with the Flask framework and a variety of JS packages, which improved the front-end interaction and presentation. [Results] The average accuracy of our new Q&A system was more than 90%, and the proposed method is practical for specific domains. [Limitations] The knowledge of epidemic diseases was retrieved from the public dataset of the AMiner platform, and the Q&A coverage as well as the question types should be expanded. [Conclusions] The proposed model optimizes question comprehension and answer organization, which helps the public understand professional knowledge effectively.
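    A minimal sketch (not the system's code) of the entity-matching step with the pyahocorasick package; the entity list and sample question are illustrative, and the BERT-based question understanding and query templates are not shown.

        # Sketch: spot knowledge-graph entity mentions in a user question with Aho-Corasick.
        import ahocorasick

        entities = ["新型冠状病毒", "流感", "潜伏期", "传播途径"]   # hypothetical KG entity names
        automaton = ahocorasick.Automaton()
        for idx, name in enumerate(entities):
            automaton.add_word(name, (idx, name))
        automaton.make_automaton()

        question = "新型冠状病毒的潜伏期是多久？"
        mentions = [name for _, (_, name) in automaton.iter(question)]
        # mentions -> ['新型冠状病毒', '潜伏期']; these slots fill a structured query template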

    Mining Enterprise Associations with Knowledge Graph
    Hou Dang, Fu Xiangling, Gao Songfeng, Peng Lei, Wang Youjun, Song Meiqi
    2022, 6 (2/3): 212-221.  DOI: 10.11925/infotech.2096-3467.2021.0948

    [Objective] This paper explores relationships among enterprises in production and operation with the help of a knowledge graph, aiming to provide new directions for risk management and valuation. [Context] In production and operation, there are enormous and complex relationships containing valuable information. [Methods] We used structured enterprise data tables to construct the enterprise knowledge graph, which helped us search for associations between enterprises and find the actual controllers of enterprises and their affiliated groups. [Results] The constructed knowledge graph included more than 1.4 million entities, such as companies and individuals, and more than 3 million relationships covering equity, guarantees, senior management, investment and so on. Based on path and search algorithms over the graph, we identified enterprise associations, actual controllers and affiliated groups. [Conclusions] The proposed algorithm can effectively identify hidden enterprise association relationships.
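    A minimal sketch (not the production system) of association search and actual-controller tracing with networkx; the toy graph and the 50% control threshold are illustrative, and the real graph lives at far larger scale in a graph database.

        # Sketch: association paths between enterprises and an actual-controller walk over equity edges.
        import networkx as nx

        g = nx.DiGraph()
        g.add_edge("PersonA", "HoldingCo", relation="equity", ratio=0.8)
        g.add_edge("HoldingCo", "TargetCo", relation="equity", ratio=0.6)
        g.add_edge("TargetCo", "SupplierCo", relation="guarantee")

        # Association discovery: any short undirected path links two enterprises.
        path = nx.shortest_path(g.to_undirected(), "PersonA", "SupplierCo")

        def actual_controller(graph, company, threshold=0.5):
            # Walk backwards along controlling equity edges until no controlling shareholder remains.
            while True:
                holders = [(u, d["ratio"]) for u, _, d in graph.in_edges(company, data=True)
                           if d.get("relation") == "equity" and d.get("ratio", 0) > threshold]
                if not holders:
                    return company
                company = max(holders, key=lambda x: x[1])[0]

        print(path, actual_controller(g, "TargetCo"))   # controller chain ends at 'PersonA'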

    Clustering and Characterizing Depression Patients Based on Online Medical Records
    Nie Hui, Wu Xiaoyan, Lin Yun
    2022, 6 (2/3): 222-232.  DOI: 10.11925/infotech.2096-3467.2021.0883

    [Objective] This study examines the online consultation records of depression patients, aiming to thoroughly understand their situation. [Methods] First, we retrieved depression consultation records from haodf.com, an online medical platform. Then, we modeled the patients with word vectors and identified patient groups with the K-means clustering algorithm. Third, we used visualization techniques, such as t-SNE, heat maps, and word clouds, to analyze the group structure and the relationships among groups. Finally, we identified the emotional-psychological, social, and behavioral differences of the groups and determined their treatment needs with the LDA topic model. [Results] We found six depression groups with different emotional-psychological, social relationship, and behavioral characteristics. The depression patients’ needs include seeking suggestions on offline medical treatment, multi-faceted consultation, and inquiries about medication. [Limitations] We analyzed the differences in group characteristics by selecting keywords in each dimension based on part-of-speech and manual analysis. [Conclusions] The proposed method could help us understand patients and their needs, and thus construct better online medical platforms.

    Automatic Classification with Unbalanced Data for Electronic Medical Records
    Zhang Yunqiu, Li Bocheng, Chen Yan
    2022, 6 (2/3): 233-241.  DOI: 10.11925/infotech.2096-3467.2021.0954

    [Objective] This paper proposes an automatic classification method for electronic medical records with unbalanced data, aiming to further improve the classification performance for clinical electronic medical records. [Methods] First, we used MC-BERT to enhance the semantic representation of electronic medical records. Then, we designed a deep neural network framework to improve the model’s semantic extraction capabilities. Finally, we designed a new loss function from the perspectives of unbalanced sample categories and classification difficulty, incorporating category proportions, a gradient harmonizing mechanism, and category similarity into the model. [Results] We examined the new model with real electronic medical records. Its accuracy reached 81.37%, the macro-average F1 value was 65.89%, and the micro-average F1 value was 81.47%, all better than those of existing methods. [Limitations] We only retrieved medical records from one department. [Conclusions] The proposed method can effectively improve the classification of unbalanced data.
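    The paper's loss combines category proportions, a gradient harmonizing mechanism, and category similarity; the sketch below (an editorial illustration, not the paper's loss) shows only the simpler class-weighted, focal-style part of that idea.

        # Sketch: class-weighted focal-style loss that down-weights easy, majority-class examples.
        import torch
        import torch.nn.functional as F

        def weighted_focal_loss(logits, targets, class_counts, gamma=2.0):
            # Inverse-frequency class weights from the training-set label distribution.
            counts = torch.as_tensor(class_counts, dtype=torch.float, device=logits.device)
            alpha = counts.sum() / (len(counts) * counts)
            log_p = F.log_softmax(logits, dim=-1)
            ce = F.nll_loss(log_p, targets, weight=alpha, reduction="none")
            p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
            return ((1 - p_t) ** gamma * ce).mean()     # hard examples keep larger gradients

        loss = weighted_focal_loss(torch.randn(8, 5), torch.randint(0, 5, (8,)),
                                   class_counts=[500, 120, 60, 30, 10])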

    Identifying Named Entities of Chinese Electronic Medical Records Based on RoBERTa-wwm Dynamic Fusion Model
    Zhang Yunqiu, Wang Yang, Li Bocheng
    2022, 6 (2/3): 242-250.  DOI: 10.11925/infotech.2096-3467.2021.0951

    [Objective] This paper proposes an entity recognition model based on RoBERTa-wwm dynamic fusion, aiming to improve entity identification in Chinese electronic medical records. [Methods] First, we merged the semantic representations generated by each Transformer layer of the pre-trained language model RoBERTa-wwm. Then, we fed the fused representation into a bi-directional long short-term memory network and a conditional random field module to recognize the entities in the electronic medical records. [Results] We examined our new model with the dataset of the “2017 National Knowledge Graph and Semantic Computing Conference (CCKS 2017)” and self-annotated electronic medical records. The F1 values reached 94.08% and 90.08%, which were 0.23% and 0.39% higher than those of the RoBERTa-wwm-BiLSTM-CRF model. [Limitations] The RoBERTa-wwm used in this paper was pre-trained on a non-medical corpus. [Conclusions] The proposed method could improve the results of entity recognition tasks.
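    A minimal sketch (not the authors' code) of the dynamic layer-fusion idea, using the public hfl/chinese-roberta-wwm-ext checkpoint and learnable softmax weights over all Transformer-layer outputs; the downstream BiLSTM-CRF decoder is omitted.

        # Sketch: weighted fusion of every hidden layer of RoBERTa-wwm before sequence labeling.
        import torch
        import torch.nn as nn
        from transformers import BertModel, BertTokenizer

        class DynamicFusionEncoder(nn.Module):
            def __init__(self, name="hfl/chinese-roberta-wwm-ext"):
                super().__init__()
                self.bert = BertModel.from_pretrained(name, output_hidden_states=True)
                n_layers = self.bert.config.num_hidden_layers + 1   # embeddings + 12 layers
                self.layer_weights = nn.Parameter(torch.zeros(n_layers))

            def forward(self, **inputs):
                hidden_states = self.bert(**inputs).hidden_states    # tuple of (B, L, H)
                stacked = torch.stack(hidden_states, dim=0)           # (n_layers, B, L, H)
                w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
                return (w * stacked).sum(dim=0)                       # fused (B, L, H) for BiLSTM-CRF

        tok = BertTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
        fused = DynamicFusionEncoder()(**tok("患者出现持续性头痛三天。", return_tensors="pt"))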

    Named Entity Recognition for Chinese EMR with RoBERTa-WWM-BiLSTM-CRF
    Zhang Fangcong, Qin Qiuli, Jiang Yong, Zhuang Runtao
    2022, 6 (2/3): 251-262.  DOI: 10.11925/infotech.2096-3467.2021.0910

    [Objective] This study tries to address the issues of polysemy and incomplete words facing entity recognition for Chinese Electronic Medical Records (EMR). [Methods] We constructed a deep learning model, RoBERTa-WWM-BiLSTM-CRF, to improve the named entity recognition of Chinese EMR, and conducted four rounds of comparative experiments to examine the impacts on entity recognition. [Results] The highest F1 value of the new model reached 0.8908. [Limitations] The experimental data set is small, and the entity recognition results for some departments were not very impressive; for example, the F1 value for the respiratory department was only 0.8111. [Conclusions] The RoBERTa-WWM-BiLSTM-CRF model could effectively conduct named entity recognition for Chinese electronic medical records.

    Knowledge Description Framework for Foreign Patent Documents Based on Knowledge Meta
    Fu Zhu, Ding Weike, Guan Peng, Ding Xuhui
    2022, 6 (2/3): 263-273.  DOI: 10.11925/infotech.2096-3467.2021.0921

    [Objective] This paper proposes a new knowledge description framework (KDF) for foreign patent documents based on knowledge meta, aiming to generate better full-text features of these documents from the fine-grained perspective. [Methods] First, we analyzed the U.S. and European patents to compare their differences to Chinese documents. Then, we used knowledge meta to describe the full-text features of foreign patents with external and content features to construct the KDF. Finally, we analyzed the semantic relationships of the contents from this new framework. [Results] The KDF generated eight core knowledge elements and their description rules, which had four types of semantic relationships between patent documents and knowledge elements, as well as five types of relationships between different knowledge elements. [Limitations] The adaptability of the KDF needs to be strengthened. [Conclusions] The proposed KDF could describe the full-text knowledge features of foreign patent documents effectively and reveal the semantic relationship between the knowledge features, which provides new directions for knowledge organization, mining and services of patent documents.

    Cross-domain Transfer Learning for Recognizing Professional Skills from Chinese Job Postings
    Yi Xinhe, Yang Peng, Wen Yimin
    2022, 6 (2/3): 274-288.  DOI: 10.11925/infotech.2096-3467.2021.0963

    [Objective] This paper analyzes online job postings and identifies the demands of employers accurately, aiming to address the skill gaps between supply and demand in the labor market. [Methods] We proposed a model with cross-domain transfer learning to recognize professional skill words (CDTL-PSE). In CDTL-PSE, the task is treated as sequence tagging, like named entity recognition or term extraction, and the SIGHAN corpus is decomposed into three source domains. A domain adaptation layer was inserted between the Bi-LSTM and the CRF layers, which helps transfer knowledge from each source domain to the target domain. Then, we used the parameter transfer approach to train each sub-model. Finally, we obtained the predicted label sequence by majority vote. [Results] On the self-built online recruitment data set, compared with the baseline method, the proposed model improved the F1 value by 0.91% and reduced the required labeled samples by about 50%. [Limitations] The interpretability of CDTL-PSE needs to be further improved. [Conclusions] CDTL-PSE can automatically extract professional skill words and effectively expands the labeled samples in the target domain.

    Extracting Weapon Attributes Based on Word Completion
    Ding Shengchun, You Weijing, Wang Xiaoying
    2022, 6 (2/3): 289-297.  DOI: 10.11925/infotech.2096-3467.2021.0969

    [Objective] This paper addresses the limitation of dependency-syntax-based extraction, which can only extract single-noun attributes of military equipment. [Methods] First, we analyzed the features of texts describing the technology and performance of weapons and equipment. Then, we wrote regular expressions to obtain the attribute values. Third, we extracted the attribute words based on dependency parsing. Finally, we completed the attribute word lists with part-of-speech information. [Results] We examined our new model with military news data sets and found that the accuracy and recall rates for extracting open-source attribute words reached 91.53% and 72.78%. The accuracy of attribute word completion was up to 96.95%, and the accuracy for each category of attribute words was higher than 90%. [Limitations] This paper did not try to extract weapon attributes such as the owning country and service status. [Conclusions] The proposed method could effectively extract explicit attribute words.
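    A minimal sketch (not the paper's rule set) of value extraction with regular expressions; the pattern, units, and sample sentence are illustrative.

        # Sketch: pull numeric attribute values and units out of weapon-description text.
        import re

        text = "该导弹最大射程约3000公里，巡航速度2.5马赫。"
        value_pattern = re.compile(r"(约?\d+(?:\.\d+)?)(公里|千米|马赫|吨|米|公斤)")
        for match in value_pattern.finditer(text):
            number, unit = match.groups()
            print(number + unit)          # -> 约3000公里, 2.5马赫

        # The attribute words the values modify (e.g. 最大射程, 巡航速度) would then be located
        # with dependency parsing and completed using part-of-speech rules, as in the paper.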

    Identifying Moves from Scientific Abstracts Based on Paragraph-BERT-CRF
    Guo Hangcheng, He Yanqing, Lan Tian, Wu Zhenfeng, Dong Cheng
    2022, 6 (2/3): 298-307.  DOI: 10.11925/infotech.2096-3467.2021.0973

    [Objective] This paper tries to automatically identify the moves in scientific paper abstracts, aiming to locate the purpose, method, results, and conclusion of a paper. It also helps readers quickly grasp the main contents of the literature and supports semantic retrieval. [Methods] We proposed a neural network model for abstract move recognition based on the Paragraph-BERT-CRF framework, which fully uses the contextual information. We also added an attention mechanism and the transfer relationships between sequential move labels. [Results] We examined our model with 94,456 abstracts of scientific papers. The weighted average precision, recall and F1 values were 97.45%, 97.44% and 97.44%, respectively. Compared with the ablation results of CRF, BiLSTM, BiLSTM-CRF, BERT, BERT-CRF and Paragraph-BERT, our new model is effective. [Limitations] We only used the basic BERT pre-trained language model. More research is needed to optimize the model parameters and incorporate more pre-trained language models into move recognition. [Conclusions] The attention mechanism and paragraph-level contextual information can effectively improve the proposed model’s scores and help identify move information in abstracts.

    Extracting Chinese Patent Keywords with LSTM and Logistic Regression
    Wei Tingting, Jiang Tao, Zheng Shuling, Zhang Jiantao
    2022, 6 (2/3): 308-317.  DOI: 10.11925/infotech.2096-3467.2021.0972

    [Objective] This paper constructs a new method to extract keywords from Chinese patents based on LSTM and logistic regression, aiming to identify low-frequency and long-tail keywords effectively. [Methods] First, we combined the LSTM neural network and the logistic regression model to extract candidate keywords. Then, we reconstructed the filtering rules to retrieve the target keywords. [Results] The extraction accuracies for all keywords, low-frequency keywords, long-tail keywords, and low-frequency long-tail keywords were 5%, 24%, 11% and 26% higher than those of existing methods. [Limitations] The proposed model classifies keywords by setting thresholds, which is imprecise for words near the thresholds. [Conclusions] Our new model could effectively discover key terms with low frequency and long character strings from texts, which benefits patent analysis and other services.

    Extracting Relationship Among Characters from Local Chronicles with Text Structures and Contents
    Wang Yongsheng, Wang Hao, Yu Wei, Zhou Zeyu
    2022, 6 (2/3): 318-328.  DOI: 10.11925/infotech.2096-3467.2021.0922

    [Objective] This study proposes a new method to extract relationships among characters from local chronicles, aiming to explore the cultural and historical information embedded in Yiwu Local Chronicles—Chapter of Persons. [Methods] We constructed the relationship extraction model based on text structures and contents. For text structures, we used rule templates and word features to extract relationships from the original texts, which were also categorized at different granularities. For the text contents, we introduced a remotely supervised approach to extract relationships. Then, we combined the BERT+Bi-GRU+ATT and BERT+FC deep learning models to transform relationship extraction into a multi-label classification task. Finally, we reduced the impact of noise from remote supervision on the model’s accuracy by correcting relationship labels. [Results] The proposed method achieved a high degree of automation and yielded better extraction results. The BERT+FC models improved the F1 values by up to 27%, while different relationship categories showed some affinity. The F1 value of the “strong co-occurrence relationship” increased by 3% after label correction. [Limitations] We only investigated the relationships among characters in local chronicles. [Conclusions] The new method could effectively extract relationships among entities of the same type in historical Chinese documents.

    Classifying Images of Intangible Cultural Heritages with Multimodal Fusion
    Fan Tao, Wang Hao, Li Yueyan, Deng Sanhong
    2022, 6 (2/3): 329-337.  DOI: 10.11925/infotech.2096-3467.2021.0911

    [Objective] This paper proposes a new method combining images and textual descriptions, aiming to improve the classification of Intangible Cultural Heritage (ICH) images. [Methods] We built the new model with multimodal fusion, which includes a fine-tuned deep pre-trained model for extracting visual semantic features, a BERT model for extracting textual features, a fusion layer for concatenating visual and textual features, and an output layer for predicting labels. [Results] We examined the proposed model with the national ICH project of New Year Prints, classifying Mianzhu Prints, Taohuawu Prints, Yangjiabu Prints, and Yangliuqing Prints. We found that fine-tuning the convolutional layers strengthened the visual semantic features of the ICH images, and the F1 value for classification reached 72.028%. Compared with the baseline models, our method yielded the best results, with an F1 value of 77.574%. [Limitations] The proposed model was only tested on New Year Prints, which needs to be expanded to more ICH projects in the future. [Conclusions] Adding textual description features can improve the performance of ICH image classification, and fine-tuning the convolutional layers of the deep pre-trained image model can improve the extraction of visual semantic features.
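    A minimal sketch (not the paper's exact model) of concatenation-based multimodal fusion; ResNet-50 stands in as a placeholder visual backbone, and the checkpoint names, class count, and sample inputs are assumptions.

        # Sketch: CNN visual features + BERT text features, concatenated and classified.
        import torch
        import torch.nn as nn
        from torchvision import models
        from transformers import BertModel, BertTokenizer

        class ImageTextClassifier(nn.Module):
            def __init__(self, n_classes=4):
                super().__init__()
                cnn = models.resnet50(weights="IMAGENET1K_V1")
                self.visual = nn.Sequential(*list(cnn.children())[:-1])   # drop the fc head
                self.text = BertModel.from_pretrained("bert-base-chinese")
                self.fusion = nn.Linear(2048 + self.text.config.hidden_size, n_classes)

            def forward(self, image, text_inputs):
                v = self.visual(image).flatten(1)                 # (B, 2048)
                t = self.text(**text_inputs).pooler_output        # (B, 768)
                return self.fusion(torch.cat([v, t], dim=1))      # fused logits

        tok = BertTokenizer.from_pretrained("bert-base-chinese")
        logits = ImageTextClassifier()(torch.randn(2, 3, 224, 224),
                                       tok(["杨柳青年画，人物故事", "桃花坞年画，花鸟"],
                                           padding=True, return_tensors="pt"))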

    Classification Model for Chinese Traditional Embroidery Based on Xception-TD
    Zhou Zeyu, Wang Hao, Zhang Xiaoqin, Tao Fao, Ren Qiutong
    2022, 6 (2/3): 338-347.  DOI: 10.11925/infotech.2096-3467.2021.0909

    [Objective] This paper introduces artificial intelligence methods to the field of digital humanities, aiming to address the issues of small data sets, insufficient image feature representation, and low recognition accuracy facing traditional Chinese embroidery image classification. It also tries to provide methodological support for the digitalization of intangible cultural heritage protection. [Methods] We utilized deep learning techniques to analyze embroidery images and extract their features. Then, we fine-tuned the Xception model with a transfer learning approach and constructed the Xception-TD method to classify traditional Chinese embroidery. Finally, we explored the impacts of the number and dimensions of fully connected layers, as well as the dropout value, on the model’s performance. [Results] We found that increasing the number and dimensions of fully connected layers improved the embroidery image feature representation. The accuracy of our new model reached 0.96863, which was better than the benchmark model. In multi-classification tasks, the model’s accuracy was also better than that of the benchmarks. [Limitations] The experimental data set was constructed only with Baidu images and involved a small amount of manual tagging. [Conclusions] The proposed model based on transfer learning could improve the accuracy of embroidery classification.

    Trust Information Fusion and Expert Opinion for Large Group Emergency Decision-Making Based on Complex Network
    Xu Xuanhua, Huang Li
    2022, 6 (2/3): 348-363.  DOI: 10.11925/infotech.2096-3467.2021.0941

    [Objective] This paper proposes an information fusion approach to describe the complex relationship among decision-making experts and improve large group emergency response. [Methods] First, we identified and constructed a network for the relationship of the expert groups with information fusion, complex network analysis, experts’ opinion and trust information. Then, we clustered the group members, calculated expert weights, and reached personalized consensus. [Results] The proposed model visualized relationship among experts, which could be used in large group emergency decision-making. Compared to the traditional approaches, the proposed method reduced the cost of consensus adjustment by about 47% and improved consensus reaching efficiency by 40% while considering experts’ willingness. [Limitations] Experts’ complex relationships can be obtained from other dimensions. Trust needs to be additionally provided by experts. [Conclusions] This study enriches the group relationship analysis and provides innovative ideas for using complex relationships to support large group decision-making in the social network environment.

    End-to-End Aspect-Level Sentiment Analysis for E-Government Applications Based on BRNN
    Shang Rongxuan, Zhang Bin, Mi Jianing
    2022, 6 (2/3): 364-375.  DOI: 10.11925/infotech.2096-3467.2021.0945

    [Objective] This paper proposes an end-to-end aspect-level sentiment analysis method based on BRNN, aiming to conduct fine-grained sentiment analysis for reviews of government APPs. [Methods] First, we built a neural network containing a two-layer BRNN structure and three functional modules. Then, we recognized the boundary and sentiment tendency of the government APP reviews, as well as extracted aspect entities. [Results] The proposed E2E-ALSA model had excellent classification and generalization ability. Its precision, recall and F1-score all exceeded 0.93. [Limitations] The model can only jointly extract explicit aspect entities, while the implicit aspect extraction needs to be performed independently. The sample size needs to be expanded. [Conclusions] The proposed method could identify the users’ emotional needs and reactions to the e-government systems.

    Identifying “Fake News” with Multi-Perspective Evidence Fusion
    Li Baozhen, Chen Ke
    2022, 6 (2/3): 376-384.  DOI: 10.11925/infotech.2096-3467.2021.0964

    [Objective] This paper proposes a multi-perspective evidence fusion model for identifying fake news, aiming to address the issues of lacking evidence and inaccurate classification in traditional models. [Methods] With the help of the subjective logic model and uncertainty measurements of the classifications from different perspectives, we modified the Dempster-Shafer evidence theory. Then, we used different weights to combine the evidence from multiple perspectives, and obtained the uncertainty measurements of the overall evidence and classification. [Results] We examined our model with two public data sets, and found that its accuracy and F1 values were significantly higher than those of traditional models. [Limitations] Evidence fusion from multiple perspectives generated some noise, which might reduce the accuracy of the results. [Conclusions] Multi-perspective evidence fusion could effectively identify fake news.
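    A minimal sketch (not the modified rule from the paper) of Dempster's classic rule of combination over the frame {real, fake}, with toy mass values; the paper additionally weights the sources and handles uncertainty via subjective logic.

        # Sketch: combine two mass functions (each leaving some mass on the full frame = uncertainty).
        def dempster_combine(m1, m2, frame=("real", "fake")):
            theta = frozenset(frame)                      # full frame = "don't know"
            combined, conflict = {}, 0.0
            for a, pa in m1.items():
                for b, pb in m2.items():
                    inter = (a if a != theta else b) if (a == theta or b == theta or a == b) \
                            else None
                    if inter is None:
                        conflict += pa * pb               # conflicting evidence mass
                    else:
                        combined[inter] = combined.get(inter, 0.0) + pa * pb
            return {k: v / (1 - conflict) for k, v in combined.items()}   # normalise

        theta = frozenset({"real", "fake"})
        m_text = {frozenset({"fake"}): 0.6, frozenset({"real"}): 0.1, theta: 0.3}
        m_user = {frozenset({"fake"}): 0.5, frozenset({"real"}): 0.2, theta: 0.3}
        print(dempster_combine(m_text, m_user))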

    Automatically Identifying Abnormal Behaviors of International Journals
    Wu Jinhong, Mu Keliang
    2022, 6 (2/3): 385-395.  DOI: 10.11925/infotech.2096-3467.2021.0949

    [Objective] This paper creates an early warning mechanism for international journals, aiming to predict their quality changes and help researchers choose better publishing platforms. [Methods] We constructed an early-warning index system for scholarly journals covering their impact strength, influencing timeline, characteristics, and author demographics. Then, we combined the Pearson correlation coefficient with the importance values of XGBoost to select features. Third, we analyzed the features with XGBoost, SVM, logistic regression, and Stacking fusion to identify abnormal behaviors. Finally, we ranked these features by XGBoost information gain. [Results] We examined our method with three sample datasets from medical and scientific journals. Feature screening could improve the generalization of the model while slightly reducing the early warning performance, and feature screening and expansion could improve the accuracy of the early warning model. The self-citation and submission acceptance rates play significant roles in the model. [Limitations] Due to data availability, the range of disciplines covered is narrow, the training data are limited, and journal features related to article processing charges are not included. [Conclusions] The proposed model could help institutions and researchers improve decision making on the quality of international journals.
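    A minimal sketch (not the paper's pipeline) of combining Pearson screening with XGBoost importance ranking; the file name, column names, and thresholds are placeholders.

        # Sketch: correlation-based feature screening, then an XGBoost warning model.
        import pandas as pd
        from xgboost import XGBClassifier

        df = pd.read_csv("journal_indicators.csv")            # hypothetical indicator table
        y = df.pop("is_abnormal")

        # 1. Drop features barely correlated with the warning label (Pearson by default).
        corr = df.apply(lambda col: col.corr(y))
        kept = corr[corr.abs() > 0.1].index
        X = df[kept]

        # 2. Fit the warning model and rank the remaining features by importance.
        clf = XGBClassifier(n_estimators=300, max_depth=4)
        clf.fit(X, y)
        ranking = pd.Series(clf.feature_importances_, index=kept).sort_values(ascending=False)
        print(ranking.head(10))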

    Predicting Public Opinion Reversal Based on Evolution Analysis of Events and Improved KE-SMOTE Algorithm
    Wang Nan, Li Hairong, Tan Shuru
    2022, 6 (2/3): 396-408.  DOI: 10.11925/infotech.2096-3467.2021.0800

    [Objective] This paper tries to accurately predict online public opinion reversals. [Methods] First, we retrieved the features of public opinion events based on their evolution characteristics and development process before the reversal points. Then, we used the improved KE-SMOTE algorithm to create an automatic optimization process, which balanced the event set with very skewed positive and negative samples. We also constructed a neural network ensemble learning model using the balanced event set. Finally, we examined our model with 30 trending public opinion events from 2021, discussed the causes of errors for the inconsistent prediction results, and provided corresponding countermeasures and suggestions for avoiding public opinion reversals. [Results] The prediction accuracy of the proposed model on the test sets reached 99.7%, and all reversal events were predicted. [Limitations] As the time interval between the occurrence and reversal of public opinion events becomes much shorter, more research is needed to examine the proposed model with smaller data sets. [Conclusions] Our new model can accurately identify public opinion reversal events in advance.
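    A minimal sketch (not the improved KE-SMOTE from the paper): plain SMOTE from imbalanced-learn stands in for the balancing step, and a bagged MLP stands in for the neural-network ensemble; the feature matrix is a toy placeholder.

        # Sketch: oversample the rare reversal class, then train a small neural-network ensemble.
        import numpy as np
        from imblearn.over_sampling import SMOTE
        from sklearn.ensemble import BaggingClassifier
        from sklearn.neural_network import MLPClassifier

        X = np.random.rand(200, 12)               # toy pre-reversal event features
        y = np.r_[np.ones(20), np.zeros(180)]     # reversal events are the rare class

        X_bal, y_bal = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
        ensemble = BaggingClassifier(                      # scikit-learn >= 1.2 API
            estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
            n_estimators=10, random_state=0)
        ensemble.fit(X_bal, y_bal)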
