Data Analysis and Knowledge Discovery

A Review on Main Optimization Methods of BERT

Liu Huan,Zhang Zhixiong,Wang Yufei

Data Analysis and Knowledge Discovery. 2021, 5(1): 3-15. https://doi.org/10.11925/infotech.2096-3467.2020.0965

[Objective] This paper analyzes and summarizes the main optimization methods for BERT, the language representation model released by Google, to provide a reference for future BERT-based studies. [Coverage] A total of 41 major publications or models related to the optimization of BERT were reviewed and analyzed. [Methods] The optimization routes are explained from four aspects: pre-training target optimization, external knowledge base fusion, Transformer structure evolution and pre-training model compression. [Results] The optimization of pre-training targets and the improvement of the Transformer structure attracted researchers' attention earliest and became the main routes for optimizing BERT. Subsequently, pre-training model compression and the integration of external knowledge bases also became new research directions. [Limitations] Research on BERT has developed extremely rapidly, and some related work may not be covered. [Conclusions] Researchers can focus on pre-training target optimization and Transformer structure improvement, and choose optimization routes according to the application scenario.

Understanding Serendipity in Science: A Survey

Yu Shuo,Hayat Dino Bedru,Chu Xinbei,Yuan Yuyuan,Wan Liangtian,Xia Feng

Data Analysis and Knowledge Discovery. 2021, 5(1): 16-35. https://doi.org/10.11925/infotech.2096-3467.2020.1088

[Objective] This paper summarizes the components and definitions of serendipity, reviews representative supporting technologies and applications of serendipity in science, and discusses challenges and future directions in this field. [Coverage] We searched for relevant keywords such as “serendipity”, “novelty” and “diversity” in research repositories such as Microsoft Academic and Google Scholar. A total of 102 carefully selected references are cited. [Methods] We reviewed serendipitous discoveries in various scenarios and discussed the concept of serendipity in the context of science. Relevant tools and applications are categorized. [Results] Tools that support serendipity are conducive to scientific research. However, there is no uniform definition of serendipity, which makes it difficult to measure serendipity in science. [Limitations] The factors affecting serendipity in science are complex and remain to be explored. [Conclusions] Serendipity is an indispensable factor in scientific advances. However, the exploration of serendipity in science faces many challenges, such as the lack of metrics and the difficulty of controlling it.

Review of Cultural Heritage Crowdsourcing in the Domain of Digital Humanities

Zhao Yuxiang,Lian Jingwen

Data Analysis and Knowledge Discovery. 2021, 5(1): 36-55. https://doi.org/10.11925/infotech.2096-3467.2020.0906

[Objective] This paper systematically reviews the development of research and practice in cultural heritage crowdsourcing in the domain of digital humanities. [Coverage] We searched sources including SSCI, SCIE, EI, A&HCI, CPCI-S, CPCI-SSH, Google Scholar, CNKI, Wanfang Data and VIP with the keywords “cultural heritage crowdsourcing”, “crowdsourcing AND digital humanities”, “cultural heritage AND collaboration”, “cultural heritage AND user generated content”, “GLAM AND crowdsourcing”, etc. We then collected 110 representative publications through topic screening combined with a backward and forward search approach. [Methods] First, we clarified the connotation and extension of cultural heritage crowdsourcing and gave a loose definition of the concept. Then, we investigated the current status of research and practice in cultural heritage crowdsourcing from three key elements: data resources, digital technology and platform system. [Results] This paper explores the concept of cultural heritage crowdsourcing, proposes a classification of cultural heritage crowdsourcing projects, examines the data life cycle and the digital technology classification system of cultural heritage crowdsourcing, and summarizes the research results and experience relevant to the construction and operation management of cultural heritage crowdsourcing platforms. [Limitations] Future research will further refine the integrated framework of the cultural heritage crowdsourcing model for the research and practice of digital humanities. [Conclusions] In recent years, cultural heritage crowdsourcing has emerged as a new model in the field of public cultural services in terms of data collection and analysis, construction of information resources and innovation of knowledge services. It is a new initiative in response to the deep integration of technology and culture in the digital era, and also a new direction for digital humanities exploration in the discipline of library, information and archives management.

Consensus Mechanisms of Consortium Blockchain: A Survey

Leng Jidong,Lv Xueqiang,Jiang Yang,Li Guolin

Data Analysis and Knowledge Discovery. 2021, 5(1): 56-65. https://doi.org/10.11925/infotech.2096-3467.2020.0981

[Objective] This paper analyzes the application of the Byzantine problem and reviews related research on the consensus mechanisms of consortium blockchains. [Coverage] We searched the titles or topics in the WoS, ResearchGate, arXiv and CNKI databases using the keyword “consensus mechanisms”. A total of 74 papers were retrieved. [Methods] We reviewed the consensus mechanisms and classification methods of blockchains. Then, we explored the applications of the Byzantine problem and discussed the strong and permissioned consensus mechanisms. [Results] We summarized the developments of, and ties among, the Byzantine problem, Brewer’s theorem, Byzantine systems and Byzantine fault-tolerant mechanisms. We also proposed basic procedures and evaluation criteria for the consensus mechanisms of consortium blockchains. Finally, we divided the consensus mechanisms into four categories based on security and time delay. [Limitations] This paper did not cover all consensus mechanisms for consortium blockchains. [Conclusions] Research on consensus mechanisms promotes the implementation of blockchains. Future studies could improve these mechanisms in terms of fault tolerance, communication delay and conversion efficiency.

Identifying Citation Texts with Unsupervised Method

Hyonil Kim,Ou Shiyan

Data Analysis and Knowledge Discovery. 2021, 5(1): 66-77. https://doi.org/10.11925/infotech.2096-3467.2020.0548

[Objective] This paper proposes a method to automatically identify citation texts and compare the contents of citation sentences. [Methods] We developed an unsupervised method to find implicit citation sentences and then compared the similarity between these sentences and the citing/cited papers. We combined the vector space and word embedding models to calculate the similarity precisely. [Results] We identified the implicit citation sentences of two highly cited papers from 200 citing articles and found that the proposed method’s F-value was above 92%. By comparing the contents of the explicit and implicit citation sentences, we noticed significant differences in their citation functions and sentiments. There were more implicit than explicit citation sentences for research background and technical basis, and fewer implicit than explicit citation sentences for research basis and comparison. 45.3% of the explicit citation sentences were positive references, while 78.8% of the implicit citation sentences were neutral. [Limitations] We only investigated citation texts at the sentence level. More research is needed on clause-level and phrase-level identification. [Conclusions] The proposed method could effectively identify implicit citation sentences.
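
As a rough illustration of how a vector space score and a word embedding score might be combined into one sentence similarity, here is a minimal sketch. It is not the authors' method: the equal weighting `alpha=0.5`, the averaging of word vectors, the function names and the toy embedding table `TOY_EMBEDDINGS` are all assumptions made for this example.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bow_similarity(tokens_a, tokens_b):
    """Vector space score: cosine over raw term counts."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    vocab = sorted(set(counts_a) | set(counts_b))
    return cosine([counts_a[w] for w in vocab], [counts_b[w] for w in vocab])


def mean_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def embedding_similarity(tokens_a, tokens_b, embeddings):
    """Embedding score: cosine between averaged word vectors."""
    vec_a = mean_vector(tokens_a, embeddings)
    vec_b = mean_vector(tokens_b, embeddings)
    if vec_a is None or vec_b is None:
        return 0.0
    return cosine(vec_a, vec_b)


def combined_similarity(tokens_a, tokens_b, embeddings, alpha=0.5):
    """Weighted sum of the two scores; alpha is a hypothetical mixing weight."""
    return (alpha * bow_similarity(tokens_a, tokens_b)
            + (1 - alpha) * embedding_similarity(tokens_a, tokens_b, embeddings))


# Hypothetical 2-dimensional embeddings, for illustration only.
TOY_EMBEDDINGS = {
    "method": [0.9, 0.1],
    "approach": [0.8, 0.2],
    "results": [0.1, 0.9],
}

if __name__ == "__main__":
    a = ["this", "method", "works"]
    b = ["their", "approach", "works"]
    print(combined_similarity(a, b, TOY_EMBEDDINGS))
```

The embedding term lets near-synonyms such as "method" and "approach" contribute to the score even though they share no surface form, which a pure term-count comparison would miss.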

Detecting Rumor Dissemination and Sources with SIDR Model

Chen Yixin,Chen Xinyue,Liu Yi,Wang Hanzhen,Lai Yongqing,Xu Yang

Data Analysis and Knowledge Discovery. 2021, 5(1): 78-89. https://doi.org/10.11925/infotech.2096-3467.2020.0715

[Objective] This paper explores the characteristics of rumor sources and dissemination patterns, aiming to reduce their negative effects. [Methods] First, we added “fact checkers” to the traditional infectious disease model and set rules for changing node status based on the characteristics of rumor dissemination. Then, we constructed an SIDR model that incorporates node interaction in social networks. Third, we proposed an algorithm based on the SIDR model to detect rumor sources. Finally, we optimized the proposed model with the beam search algorithm. [Results] We examined the new model with real-world cases and found that it accurately simulated the propagation of rumors. Identifying rumor sources could constrain their spread. The accuracy of our algorithm was up to 83% at the early stage. [Limitations] This paper does not consider the dynamic changes of social networks, and more representative cases should be included. [Conclusions] The proposed model could help us identify rumor sources and predict their development.

Public Health Risk Forecasting with Multiple Machine Learning Methods Combined: Case Study of Influenza Forecasting in Lanzhou, China

Chai Guorong,Wang Bin,Sha Yongzhong

Data Analysis and Knowledge Discovery. 2021, 5(1): 90-98. https://doi.org/10.11925/infotech.2096-3467.2020.0754

[Objective] This study explores the practicability and effectiveness of forecasting public health risks with machine learning, taking influenza as an example. [Methods] First, we collected data on influenza and meteorological factors from 2009 to 2016 in Lanzhou, China. Data from 2009 to 2015 were used for training and data from 2016 for testing. Then, we put forward three machine learning methods for influenza prediction, based on SARIMA, Kalman Filter and VAR, respectively. Moreover, we designed two multi-method combined forecasting strategies. Finally, the forecasting performance of these methods and strategies was evaluated and compared. [Results] SARIMA, VAR and Kalman Filter achieved the best predictive performance in the whole period (WP), outbreak period (OP) and stabilization period (SP), with RMSE of 11.68, 19.23 and 1.60 and R² of 0.932, 0.923 and 0.956, respectively. Our multi-method combined strategies improved the forecasting performance in all three scenarios; Comb_2 performed better, with RMSE of 10.82, 14.68 and 1.38 and R² of 0.942, 0.934 and 0.963, respectively. [Limitations] Limited by the data, this study considered only meteorological factors as external factors. [Conclusions] Predicting public health risks (such as influenza) with machine learning is practicable and effective, and has great potential. However, the lack of multi-source data is the major obstacle. Therefore, to promote the open exchange and sharing of data, barriers should be broken down at the technical, organizational and institutional levels.
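
For readers unfamiliar with the two metrics reported above, the following minimal sketch shows how RMSE and R² are commonly computed for a forecast, together with one very simple way of combining several forecasts. The equal-weight average in `combine_forecasts` is an assumption for illustration only; the abstract does not describe how the Comb_1 and Comb_2 strategies are built, and the sample numbers are invented.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between observed and forecast values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def combine_forecasts(forecasts):
    """Equal-weight average of several forecast series (hypothetical strategy)."""
    return [sum(values) / len(values) for values in zip(*forecasts)]


if __name__ == "__main__":
    # Invented weekly case counts and two invented model forecasts.
    observed = [10.0, 20.0, 30.0, 40.0]
    model_a = [12.0, 18.0, 33.0, 41.0]
    model_b = [9.0, 23.0, 28.0, 38.0]
    combined = combine_forecasts([model_a, model_b])
    for name, series in [("A", model_a), ("B", model_b), ("combined", combined)]:
        print(name, rmse(observed, series), r_squared(observed, series))
```

A lower RMSE and an R² closer to 1 both indicate a better fit, which is the sense in which the combined strategies are reported as improving on the single methods.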

Relationship Between Financial News and Stock Market Fluctuations

Lv Huakui,Liu Zhenghao,Qian Yuxing,Hong Xudong

Data Analysis and Knowledge Discovery. 2021, 5(1): 99-111. https://doi.org/10.11925/infotech.2096-3467.2020.0063

[Objective] This paper explores the impacts of financial news on stock market fluctuations. [Methods] We used a “Word2Vec+k-means” method to cluster news texts and a VAR model to analyze the relationship between different types of news and stock market performance. [Results] The sentiments and information of news significantly affected the trading volumes, amplitudes and returns of the stock market. Meanwhile, stock market fluctuations also influenced the emotion and length of news reports. [Limitations] We did not analyze the relationship between individual stocks and news reports. [Conclusions] There are interactions and time-lag effects between news and the stock market, and the news category plays a key role.

Developments of Tech-Innovation Network for Patent Cooperation: Case Study of Speech Recognition in China

Guan Peng,Wang Yuefen,Jin Jialin,Fu Zhu

Data Analysis and Knowledge Discovery. 2021, 5(1): 112-127. https://doi.org/10.11925/infotech.2096-3467.2020.0337

[Objective] This paper examines the evolution of the tech-innovation network from the perspective of patent cooperation. [Methods] First, we proposed a dynamic framework to analyze the patent cooperation network. Then, we conducted an evolutionary analysis of the network’s topological structure based on network size, clustering, component analysis, degree distribution and small-world theory. Third, we analyzed the centrality and structural holes of core network members. [Results] We evaluated the performance of our framework with an empirical study on speech recognition, which also investigated the impacts of network evolution on innovation. [Limitations] More research is needed to investigate the proposed model’s performance in other fields, as well as the influence of network structure on individual companies’ innovations. [Conclusions] The domestic cooperative network for speech recognition patents, which has evolved from a fragmented network into a multi-center small-world network, plays an important role in innovation. The three core types of members in this network are companies, universities and research institutes, which follow the “rich get richer” rule. This paper also discusses technology innovation management issues from the perspectives of cooperation and regional development.

Forecasting Car Sales Based on Consumer Attention

Jiang Cuiqing,Wang Xiangxiang,Wang Zhao

Data Analysis and Knowledge Discovery. 2021, 5(1): 128-139. https://doi.org/10.11925/infotech.2096-3467.2020.0418

[Objective] This study constructs a forecasting model for car sales based on consumer attention. [Methods] First, we defined consumer attention with consumer opinion and search data. Then, we used the Word2Vec algorithm to extract the initial keyword lists, while using time difference correlation analysis to identify the core keywords. Finally, we generated the user attention data with PCA and built Attention_LSTM model to predict car sales. [Results] The RMSE and MAPE indices of our model were reduced by 2.02 and 0.96%. The average percentage error of the new model was 6.52%, 3.42%, 2.56%, and 0.81% less than those of the ARIMA, SVR, BP neural network, and LSTM models. [Limitations] We did not include other social media data to analyze consumers’ online behaviors. [Conclusions] The Attention_LSTM model based on consumer attention could effectively forecast auto sales.

Locating Academic Literature Figures and Tables with Geometric Object Clustering

Yu Fengchang,Cheng Qikai,Lu Wei

Data Analysis and Knowledge Discovery. 2021, 5(1): 140-149. https://doi.org/10.11925/infotech.2096-3467.2020.0630

[Objective] This paper tries to improve the recall of figures/tables from academic literature. [Methods] First, we extracted geometric objects from the PDF files of the literature. Then, we obtained prior information on the scopes of figures/tables from the perspectives of underlying coding analysis and image comprehension. Third, we merged the geometric objects using K-means. Finally, we reconstructed the text contents using a heuristic algorithm to determine the locations of figures/tables. [Results] On the experimental dataset, the precision of the proposed algorithm reached 0.915 and the recall was 0.918. The precision was close to that of state-of-the-art algorithms, and the recall was improved by 0.193 (26.6% better than the existing ones). [Limitations] Documents with complex layouts and irregular use of symbols can cause errors. The determination of the clustering k value and the algorithm for text filtering could be improved. [Conclusions] The proposed algorithm effectively increases the recall of figures/tables from academic literature.
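
The two recall figures quoted above (an absolute gain of 0.193 and a relative gain of 26.6%) can be reconciled with a short calculation. The sketch below assumes that the 26.6% is measured relative to the previous best recall, which the abstract implies but does not state; the function name is made up for this example.

```python
def relative_improvement(new_value, absolute_gain):
    """Return (baseline, relative gain) given a new score and its absolute gain."""
    baseline = new_value - absolute_gain
    return baseline, absolute_gain / baseline


if __name__ == "__main__":
    # Reported recall and absolute gain, taken from the abstract.
    baseline_recall, gain = relative_improvement(0.918, 0.193)
    print(f"implied baseline recall: {baseline_recall:.3f}")
    print(f"relative improvement: {gain * 100:.1f}%")
```

Under that assumption the implied baseline recall is about 0.725, and 0.193 / 0.725 is about 26.6%, so the two reported numbers are consistent.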

25 January 2021, Volume 5 Issue 1
