[Objective] This paper aims to construct a common data circulation infrastructure for a national unified data factor market: establish a unified data identification coding and resolution system for basic data objects such as people, enterprises, vehicles, objects, and places; guide all parties to strengthen data classification, grading, and identification calibration throughout the data factor circulation process; support common public services such as data registration and filing, supply-demand matching, credit evaluation, compliance notarization, and asset evaluation; promote the interconnection and integrated development of nationwide cross-regional and cross-industry data factor circulation and trading platforms; and provide a safe and credible circulation environment and common public services for the market entities involved in data transactions. [Methods] We reviewed the literature on recent technologies promoting data circulation and transaction at home and abroad. Given the common difficulties encountered in building the domestic data factor market, we propose a national “data networking” root service system. [Results] This paper clarifies the construction ideas of the national “data networking” root service system, with data identification fusion, blockchain interoperability, and privacy-preserving computing platform interoperability as the basic support, and the public service system for data circulation and transaction as the carrier. [Limitations] Further research is needed to demonstrate the completeness, scalability, and robustness of the proposed technical path. [Conclusions] The proposed national “data networking” root service system plays an important role in building the common data circulation infrastructure, providing safe and credible common public services, and cultivating the data factor market and its industrial ecology.
[Objective] To promote the safe and compliant development of data exchanges, this paper proposes a comprehensive framework for data transaction security management and control that addresses both the technical path and mechanism guarantees, covering data transaction risk, data security risk, and infrastructure security risk in data exchange scenarios. [Methods] Using the literature research method, this paper reviews current literature on technology and management in the field of data transaction security at home and abroad. Combining the practice of data exchanges, it puts forward the “TID-MOP” data transaction security management and control framework, which covers both technology and mechanisms. [Results] The “TID-MOP” framework designs a core technology architecture for transaction security: separate the business flow, computing flow, and capital flow, and converge the circulation environment through blockchain; separate the experimental environment from the production computing environment, and link the computing environments through model management and data management; and separate data computing from safety supervision, managing the supervision environment uniformly through a control and management center. This architecture improves the safety of data circulation and transaction, and realizes safety control and unified supervision of the whole process. [Limitations] Further research is needed to verify the operational efficiency of the framework. [Conclusions] The “TID-MOP” framework takes the data transaction process as its core and provides an effective reference for the development and innovation of data transactions.
[Objective] In the context of data transactions, to strengthen data circulation management and improve data circulation trading rules, this paper constructs a comprehensive data quality management system and technical framework for data circulation transaction scenarios, focusing on data product quality evaluation and management. [Methods] Using the literature research method, we reviewed current literature on data quality assessment and commonly used data quality inspection methods at home and abroad. Combining industry experience and specific data transaction scenarios, we proposed a quality evaluation model covering raw data sets, desensitized data sets, modeled data, and AI-based data, along with a management system to improve data quality before, during, and after data transactions. [Results] This paper proposes a data quality evaluation model for the transaction context based on the “6543” structure, namely six types of main indicators, five types of subjects, four types of products, and three types of evaluation methods. It provides testing and optimization solutions for data normativeness and completeness in the pre-transaction phase, data accuracy and consistency during the transaction phase, and data timeliness and accessibility in the post-transaction phase. [Limitations] The data quality model and management system have not been systematically applied in real transaction scenarios and lack actual testing. [Conclusions] The proposed quality evaluation model and quality management system play an important role in evaluating and improving the quality of data products throughout the data transaction process.
[Objective] To ensure the safe circulation of data and promote the development of the data circulation trading market, this paper constructs a standardized, unified privacy computing framework for the interconnection of privacy computing platforms in data circulation scenarios. [Methods] This paper summarizes the development of privacy computing technologies and platforms in recent years, and proposes a unified privacy computing framework based on data circulation scenarios, with reference to current data circulation problems and data exchange practices. [Results] The unified privacy computing framework proposes a “three-layer architecture, two types of interoperability, and one ecology” to achieve, respectively, business linkage with data exchange platforms, unified supervision of the circulation process, and standardized interconnection management. The two types of interoperability realize the interconnection between data exchange platforms and privacy computing platforms, as well as among different privacy computing platforms. The one ecology realizes the circulation and transaction ecology of data elements. [Limitations] Privacy computing technologies remain untested for large-scale commercial use and have yet to strike a balance between computing security and computing efficiency. [Conclusions] The unified privacy computing framework proposed in this paper, based on data circulation transaction scenarios, is conducive to the close combination of privacy computing technology and data circulation, maximizes the value of data, and provides a reference for realizing privacy computing interconnection.
[Objective] This paper analyzes the research progress and application scenarios of data traceability through a literature review, to provide a reference for the construction of data trading platforms, industrial data governance, and digital government governance. [Methods] Data traceability models, methods, and applications are summarized and analyzed, and on this basis the research status and shortcomings are discussed. [Results] Whether in content description, model construction, or scenario application, data traceability research has achieved rich results, such as improving the quality, safety, and efficiency of data traceability. [Limitations] Research on data traceability from the perspective of factor circulation started relatively late; the results are not yet rich, a research system has not formed, and the focus is biased toward empirical research. [Conclusions] We can actively promote the normalization of data delivery and use in combination with the data factor market; speed up work on data traceability standards to promote the institutionalization of data use; continuously improve the quality of data traceability information to raise the quality of data services; attach great importance to data traceability information security to promote the standardized use of data information; and build a high-standard data traceability platform to promote the healthy development of the data factor market.
[Objective] This paper measures netizens' trust in government microblogs during public health emergencies, and explores the reasons for its changes. [Methods] First, we calculated trust from comments on government microblogs using the comment objects, the topic similarity between comments and microblogs, and their sentiments. Then, we incorporated the numbers of likes and forwards/retweets to determine netizens' comprehensive trust in the government microblogs. [Results] We examined our model with microblog data on COVID-19 and found that topics related to industrial and government efforts fighting the pandemic enhanced trust in government microblogs. There were great differences in the development trends of, and reasons for, trust in government microblogs from different fields. [Limitations] We only used the events and the microbloggers as the objects of comments. [Conclusions] The proposed model could help government agencies improve decision making, build public trust, and guide online opinion during public health emergencies.
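The two-stage calculation described above — per-comment trust from topic similarity and sentiment, then engagement counts folded in — can be sketched roughly as follows. The weights, the log-scaling of likes and forwards, and all function names are illustrative assumptions, not the paper's actual formulation.

```python
import math

def comment_trust(topic_similarity, sentiment):
    """Trust contribution of one comment: topic similarity in [0, 1]
    scaled by sentiment polarity in [-1, 1] (illustrative formula)."""
    return topic_similarity * sentiment

def comprehensive_trust(comments, likes, forwards, w_comment=0.6, w_engage=0.4):
    """Combine average per-comment trust with log-scaled engagement.
    comments: list of (topic_similarity, sentiment) pairs."""
    if not comments:
        return 0.0
    avg = sum(comment_trust(s, e) for s, e in comments) / len(comments)
    # Log-scale likes/forwards into [0, 1) so heavy engagement saturates.
    raw = math.log1p(likes + forwards)
    engage = raw / (1 + raw)
    return w_comment * avg + w_engage * engage
```

For example, `comprehensive_trust([(0.8, 0.5), (0.6, -0.2)], likes=120, forwards=30)` yields a score between 0 and 1, with the negative-sentiment comment pulling it down.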
[Objective] This paper studies a classification scheme for opinion leaders and evaluates their characteristics from multiple perspectives. [Methods] We proposed a method to classify opinion leaders by community division. Then, we comprehensively analyzed their influence along the dimensions of network diffusion ability and emotional dominance. We conducted an empirical analysis with Twitter data and compared the influence of different types of opinion leaders through network analysis and text mining. [Results] Opinion leaders were grouped into three communities, which rank differently in network diffusion ability and emotional dominance. The two dimensions show no correlation, with an absolute correlation coefficient of less than 0.3. Compared with the traditional weighted-summing method, the two-dimensional matrix analysis reflects influence characteristics more comprehensively. [Limitations] In evaluating emotional influence, we only analyzed the original texts; future studies will include the comments. [Conclusions] The proposed methods could analyze the degree and characteristics of opinion leaders' influence, helping us understand different kinds of opinion leaders and guide public opinion more effectively in risk management.
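A two-dimensional matrix analysis like the one contrasted above with weighted summing can be sketched as a quadrant split: each opinion leader is placed by comparing their two scores against the median of each axis. The quadrant labels and helper names are assumptions for illustration, not the paper's scheme.

```python
from statistics import median

def quadrant_classify(leaders):
    """leaders: dict name -> (diffusion_ability, emotional_dominance).
    Returns dict name -> quadrant label, splitting each axis at its median."""
    diff_med = median(d for d, _ in leaders.values())
    emo_med = median(e for _, e in leaders.values())
    labels = {}
    for name, (d, e) in leaders.items():
        if d >= diff_med and e >= emo_med:
            labels[name] = "high-diffusion / high-dominance"
        elif d >= diff_med:
            labels[name] = "high-diffusion / low-dominance"
        elif e >= emo_med:
            labels[name] = "low-diffusion / high-dominance"
        else:
            labels[name] = "low-diffusion / low-dominance"
    return labels
```

Unlike a single weighted sum, this keeps the two uncorrelated dimensions separate, so a leader who spreads widely but dominates little emotionally is not conflated with the opposite profile.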
[Objective] This paper explores the dissemination mechanism and equity issues of online medical crowdfunding. [Methods] First, we analyzed the diffusion process of medical crowdfunding. Then, we used the SEIR model to study participants' characteristics and decision-making behaviors. Finally, we developed a diffusion model of medical crowdfunding based on NetworkX and examined it with simulation experiments. [Results] We found that the node degree of the initiator, the appeal of the project, and the network structure affected the propagation speed and scope of the project. Regarding financing equity, the node degree of the initiator, the appeal of the project, and individual wealth affected the amount of fundraising, with correlation coefficients of 0.49, 0.47, and 0.63. We also found “rich get richer” effects, and individual contributions did not follow a fixed-ratio pattern. [Limitations] The model considers few demographic characteristics, such as gender, age, and occupational background. [Conclusions] The proposed model could effectively simulate the diffusion process of medical crowdfunding projects on social media, and explores their fundraising ability and financing equity issues.
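An SEIR-style diffusion over a social graph, as used above, can be sketched as follows. The paper builds on NetworkX; this sketch uses a plain adjacency dict to stay dependency-free, and all state names and transition probabilities are illustrative assumptions.

```python
import random

def seir_diffusion(adj, seed_node, p_expose=0.3, p_infect=0.5, p_recover=0.2,
                   steps=20, rng=None):
    """SEIR-style spread of a crowdfunding project over a social graph.
    adj: dict node -> list of neighbours.  States: S (unaware),
    E (has seen the project), I (actively shares/donates),
    R (no longer spreads)."""
    rng = rng or random.Random(42)  # fixed seed for reproducible runs
    state = {n: "S" for n in adj}
    state[seed_node] = "I"
    for _ in range(steps):
        new_state = dict(state)
        for n, s in state.items():
            # Susceptible nodes may be exposed by any infectious neighbour.
            if s == "S" and any(state[m] == "I" for m in adj[n]):
                if rng.random() < p_expose:
                    new_state[n] = "E"
            elif s == "E" and rng.random() < p_infect:
                new_state[n] = "I"
            elif s == "I" and rng.random() < p_recover:
                new_state[n] = "R"
        state = new_state
    return state
```

Running this over graphs with different initiator degrees and topologies is how one would probe, in miniature, the paper's finding that initiator degree and network structure shape propagation speed and scope.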
[Objective] This paper improves the mono-color technology topic maps generated with clustering techniques, aiming to enrich visualization tools. [Methods] We proposed a new model to create technology topic maps with clustering. It used a layout algorithm to arrange the topic words, and established functions for pixel density, class density, and color intensity. We then rendered colors based on class density and color intensity to obtain the technology topic maps. [Results] We embedded the new algorithm in ItgInsight, a text mining and visualization tool, and examined it with quantum cryptography communication patent data. The proposed method is simple and effective. [Limitations] The generated topic map is not a vector image, and the algorithm's efficiency can be further optimized. [Conclusions] The proposed method integrates clustering information and enhances topic discrimination, helping us create better technology topic maps.
[Objective] This paper designs a text classification method based on the BERT model and multi-channel feature extraction, aiming to accurately classify e-commerce comments automatically. The new model also addresses polysemy and the sparse information of comments from public online forums and enterprise data warehouses. [Methods] First, we used a BERT-TextCNN channel to reduce the polysemy of Chinese words. Then, our model utilized a BERT-linked Bi-LSTM channel to capture long-distance context semantics. Third, we used BERT's fine-tuning mechanism to adjust the word vector encoding with the extracted features. Finally, the model fused the feature vectors and performed the text classification. [Results] The accuracy of the MFFMB (Multi-Features Fusion Model BERT-based) reached 90.07% on public e-commerce comment data sets. Compared with popular baseline models, its accuracy improved by 2.36, 8.55, 4.61, and 5.11 percentage points. Meanwhile, combining BERT with the attention mechanism improved our model's accuracy by 1.48 and 4.81 percentage points over the best baseline counterparts. [Limitations] The attention mechanism was only used with the BiLSTM channel. Future research is needed to examine our model on more data sets. [Conclusions] The proposed model could effectively improve the accuracy of text classification.
[Objective] This paper proposes a method to discover new Chinese words based on multi-sense word embedding, aiming to improve the word segmentation of social media texts. [Methods] First, we trained the MWEC with social media texts, as well as data from Chinese HowNet and a Chinese character stroke database, to reduce semantic confusion. Then, we used n-gram frequent string mining to identify highly related sub-word sets and create the candidate set of new words. Finally, we used the semantic similarity of multi-sense word embeddings to evaluate the candidates and identify the new words. [Results] We examined the model with datasets of finance, sports, tourism, and music. The MWEC improved the F1 value by 2.0, 3.0, 2.6, and 11.3 percentage points respectively compared with existing methods. [Limitations] We generated candidate words based on the popularity of sub-words, which makes it difficult to identify low-frequency new words. [Conclusions] The multi-sense word embedding algorithm could effectively discover new words from Chinese social media texts.
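The two-step pipeline above — mine frequent n-grams as candidates, then keep those whose sub-words are semantically close — can be sketched roughly as follows. A single-sense cosine similarity stands in for the paper's multi-sense embeddings, and the threshold, toy vectors, and function names are illustrative assumptions.

```python
from collections import Counter
import math

def frequent_ngrams(tokens, n=2, min_count=2):
    """Collect n-grams whose frequency meets min_count: the candidate
    new-word strings."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return {g for g, c in counts.items() if c >= min_count}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def filter_candidates(candidates, embed, threshold=0.5):
    """Keep candidates whose adjacent sub-words are semantically close,
    mimicking the embedding-similarity check (single-sense stand-in)."""
    kept = set()
    for gram in candidates:
        if all(w in embed for w in gram):
            sims = [cosine(embed[a], embed[b]) for a, b in zip(gram, gram[1:])]
            if min(sims) >= threshold:
                kept.add("".join(gram))
    return kept
```

Because candidates come only from frequency counting, a rare-but-valid new word never enters `candidates` in the first place, which mirrors the low-frequency limitation the abstract notes.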
[Objective] This paper provides more accurate and intelligent auxiliary references for the diagnosis and treatment of Traditional Chinese Medicine (TCM), aiming to reduce their uncertainty and the difficulty of quantification. [Methods] First, we collected TCM medical records for diabetes. Then, we created an auxiliary diagnosis and treatment scheme integrating multiple NLP tasks, i.e., emotion recognition and text matching. Finally, we examined the new model on quantitative assessment of diabetes, symptom information matching, automatic symptom summarization, disease type discrimination, and TCM recommendation. [Results] We conducted ten rounds of tests with the fuzzy comprehensive evaluation method. The average membership degrees of the four evaluation indices were 0.1949, 0.3140, 0.2173, and 0.2738 respectively. The maximum membership degree indicated the effectiveness of the proposed method. [Limitations] Due to the scarcity of clinical medical records, it is difficult to improve the performance of each subtask significantly. More research is needed to examine the model with data from other fields. [Conclusions] This method can effectively help doctors reduce uncertainty and evaluate diagnosis and treatment.
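The fuzzy comprehensive evaluation used above can be sketched in its standard form: a weight vector is combined with each indicator's membership vector, and the maximum-membership principle picks the winning grade. The weights and membership matrix below are illustrative assumptions, not the paper's data.

```python
def fuzzy_evaluate(weights, membership_matrix):
    """Fuzzy comprehensive evaluation.
    weights: one weight per indicator (summing to 1).
    membership_matrix: one row per indicator, one column per grade,
    giving that indicator's membership degree in each grade.
    Returns (combined membership vector, index of the winning grade)."""
    grades = len(membership_matrix[0])
    combined = [sum(w * row[j] for w, row in zip(weights, membership_matrix))
                for j in range(grades)]
    best = max(range(grades), key=lambda j: combined[j])  # max membership
    return combined, best
```

Averaging such combined membership vectors over repeated test rounds is what yields per-grade average membership degrees like those reported in the abstract.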
[Objective] This study discovers knowledge from high-level evidence-based disease literature indexed by PubMed, aiming to provide a reference for clinical diagnosis, treatment, and routine prevention and control of diseases. [Methods] We proposed a disease knowledge discovery model based on SPO predications, using the semantic extraction tool SemRep. We then selected diabetes-related literature to evaluate the model, and discovered knowledge based on SPO visualization and clinical knowledge. [Results] We obtained 1,258 SPO predications and 16 semantic relationships, which identified diabetes-related genes, common complications, and detection and treatment methods. [Limitations] We only examined the model with publicly accessible literature. More research is needed to include knowledge bases and electronic medical records. [Conclusions] The disease knowledge discovery model based on SPO predications could identify biomedical knowledge from the literature, providing potential research hypotheses and ideas for biomedical researchers.
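SemRep emits subject-predicate-object (SPO) predications; a minimal sketch of the downstream step — grouping extracted predications by their semantic relation so relation types can be counted and inspected — might look like this. The example triples are invented for illustration, not actual SemRep output or the paper's results.

```python
from collections import defaultdict

def group_predications(predications):
    """Group SPO triples by predicate so semantic relation types
    (e.g. TREATS, CAUSES) can be inspected and counted."""
    by_relation = defaultdict(list)
    for subj, pred, obj in predications:
        by_relation[pred].append((subj, obj))
    return dict(by_relation)

# Hypothetical triples of the kind a SemRep run might produce:
triples = [
    ("metformin", "TREATS", "diabetes"),
    ("diabetes", "CAUSES", "retinopathy"),
    ("insulin", "TREATS", "diabetes"),
]
relations = group_predications(triples)
```

Counting the distinct keys of `relations` across a corpus is the kind of tally behind figures such as the 16 semantic relationships reported above.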