Data Analysis and Knowledge Discovery

Principles on Constructing National Economic Brain

Wang Jiandong,Yu Shiyang

Data Analysis and Knowledge Discovery. 2020, 4(7): 2-17. https://doi.org/10.11925/infotech.2096-3467.2020.0325

[Objective] This paper identifies principles for building a National Economic Brain, aiming to monitor and forecast macroeconomic developments with big data. [Context] The National Development and Reform Commission’s Big Data Center is developing ontology construction rules based on strategies, policies, projects, enterprises, and natural persons. [Methods] We integrated complex network algorithms, natural language processing, and spatio-temporal analysis to create a macro-meso-micro analysis system. [Results] At the micro level, we integrated government and social data to build a dynamic ontology library. We established a unified association based on corporate social credit codes, which includes 30 million enterprises and 50 million individual businesses across the country, as well as 78 categories and 1,828 indicators. At the meso level, we built a simulation analysis platform based on the three dependencies of complex systems. At the macro level, we monitored economic power (investment, consumption, and trade), along with industrial operation and regional developments. We also put forward 15 big data monitoring indicators, and then combined traditional prediction, complexity prediction, behavior prediction, and space-time prediction to strengthen risk identification. [Conclusions] We constructed a framework comprising a microscopic dynamic ontology, mesoscopic simulation analysis, and a macroscopic monitoring and forecasting system. It effectively addresses the theoretical dilemma of the disconnect between macroeconomics and microeconomics, and supports macroeconomic decision-making.

Forecasting Poultry Turnovers with Machine Learning and Multiple Factors

Chen Dong,Wang Jiandong,Li Huiying,Cai Sihang,Huang Qianqian,Yi Chengqi,Cao Pan

Data Analysis and Knowledge Discovery. 2020, 4(7): 18-27. https://doi.org/10.11925/infotech.2096-3467.2020.0323

[Objective] This paper tries to forecast trends in the poultry market under the influence of multiple factors, aiming to strengthen decision making and policies for livestock and poultry production. [Methods] We chose 50 variables to construct machine learning models for predicting daily turnovers of dressed chicken. Our models were built on popular machine learning algorithms. [Results] We found that GBRT, Random Forest, and Elastic Net yielded stable prediction results, with MAEs of 25.30, 26.67, and 28.21 respectively. Predictions improved with larger training sets and longer training time. We could forecast turnovers three periods in advance. [Limitations] The training sets need to include more features and historical data. [Conclusions] The proposed models could quantitatively assess and forecast the impacts of emergencies on industrial output, which improves governmental policy making.
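
The mean absolute error (MAE) used to compare the models above is the average absolute gap between actual and predicted values. The following is a minimal sketch of that metric and of picking the model with the lowest reported MAE. The daily turnover figures are hypothetical and chosen only for illustration; the three MAE values are those reported in the abstract.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one observation is required")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical daily turnover figures (illustration only, not from the paper).
actual_turnover = [120.0, 135.0, 150.0, 110.0]
predicted_turnover = [110.0, 140.0, 145.0, 130.0]
example_mae = mean_absolute_error(actual_turnover, predicted_turnover)

# MAEs reported in the abstract; a lower value means a smaller average error.
reported_mae = {"GBRT": 25.30, "Random Forest": 26.67, "Elastic Net": 28.21}
best_model = min(reported_mae, key=reported_mae.get)
```

Because MAE is expressed in the same unit as the forecast target, the reported values can be read directly as average daily turnover errors.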

Research on Public Policy Support Based on Character-level CNN Technology

Qiu Erli,He Hongwei,Yi Chengqi,Li Huiying

Data Analysis and Knowledge Discovery. 2020, 4(7): 28-37. https://doi.org/10.11925/infotech.2096-3467.2020.0324

[Objective] This paper proposes an index for classifying Internet users’ sentiment that is better suited to public policy evaluation, and explores an automatic method for detecting Internet users’ stance based on deep learning. [Methods] Three important public policies of different types and in different fields were selected as research objects. After collecting, cleaning, and labeling the related data from Sina Weibo, this paper analyzed online support for the three policies and constructed a text classification model based on a character-level convolutional neural network (CNN). This paper also compared and interpreted the effectiveness and efficiency of the experimental results. [Results] Our model achieved good accuracy and recall on the three datasets. Two datasets had F1 values above 0.8 and one had an F1 value above 0.6. The model also took less time than the recurrent neural network (RNN) model, with training times differing by a factor of dozens. [Limitations] The data sample size and policy coverage are limited, and the calculation method for Internet users’ support needs further study. [Conclusions] The stance classification method and the character-level CNN perform well in both effectiveness and efficiency for public policy evaluation, and may play a significant role especially in the evaluation of emergency policies.
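
A character-level model consumes text as a sequence of character indices rather than segmented words. The sketch below shows one common way to prepare such input, assuming a vocabulary built from the training texts, index 0 for padding, and index 1 for unseen characters. The function names and sample posts are hypothetical, and the abstract does not describe the paper's own preprocessing.

```python
PAD_ID = 0      # assumed index for padding
UNKNOWN_ID = 1  # assumed index for characters not seen in training


def build_char_vocab(texts):
    """Map every distinct character to an integer index starting at 2."""
    chars = sorted({ch for text in texts for ch in text})
    return {ch: i + 2 for i, ch in enumerate(chars)}


def encode_text(text, vocab, max_len):
    """Convert text to a fixed-length list of character indices."""
    ids = [vocab.get(ch, UNKNOWN_ID) for ch in text[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


# Hypothetical training posts (illustration only).
train_texts = ["支持新政策", "反对这项政策"]
vocab = build_char_vocab(train_texts)

encoded = encode_text("支持", vocab, max_len=6)      # short text is padded
unseen = encode_text("abc", vocab, max_len=6)        # unseen characters
truncated = encode_text("反对这项政策", vocab, max_len=4)  # long text is cut
```

The fixed-length index sequences can then be fed to an embedding layer followed by convolutional layers in a deep learning framework of choice.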

Measuring Enterprise’s Offline Resumption with Mobile Device Positioning Data

Nie Lei,Fu Juan,Yi Chengqi,Yang Daoling

Data Analysis and Knowledge Discovery. 2020, 4(7): 38-49. https://doi.org/10.11925/infotech.2096-3467.2020.0322

[Objective] This paper quantifies the level of offline work resumption after public emergencies, aiming to provide data support for making and implementing policies. [Methods] First, we used manual and automated POI fence delineation strategies to obtain the number of mobile devices in 931 areas. Then, we measured offline resumption levels based on the number of mobile devices within each company’s physical premises. Finally, we evaluated the measurements against facts and related data. [Results] We found that in the days immediately following the 2020 Spring Festival, the average level of offline resumption in the sampled companies was about 30% of that of the same period in 2019. At the end of February 2020, about half of the employees of the sampled companies had returned to work offline. [Limitations] The sample size needs to be expanded. [Conclusions] The proposed method could dynamically monitor offline work resumption after public emergencies.
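
The comparison with the same period in 2019 suggests a ratio-style measure. The sketch below assumes the resumption level of a company is its current device count divided by its device count in the baseline period, and that the overall level is the plain average across companies. Both are assumptions, since the abstract does not give the exact formula, and the company names and counts are hypothetical.

```python
def resumption_levels(current_counts, baseline_counts):
    """Per-company ratio of current device count to baseline device count.

    Companies with no baseline devices are skipped because the ratio
    would be undefined.
    """
    levels = {}
    for company, baseline in baseline_counts.items():
        if baseline <= 0:
            continue
        levels[company] = current_counts.get(company, 0) / baseline
    return levels


def average_level(levels):
    """Unweighted mean of the per-company resumption levels."""
    if not levels:
        raise ValueError("no companies with a usable baseline")
    return sum(levels.values()) / len(levels)


# Hypothetical device counts inside each company's fence (illustration only).
baseline_2019 = {"company_a": 200, "company_b": 100, "company_c": 0}
current_2020 = {"company_a": 50, "company_b": 40, "company_c": 5}

levels = resumption_levels(current_2020, baseline_2019)
overall = average_level(levels)
```

An unweighted mean treats every company equally; weighting by baseline headcount would be an alternative design if large employers should count for more.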

Big Data Technology Stack Shifting: From SQL Centric to Graph Centric

Shen Zhihong,Zhao Zihao,Wang Haibo

Data Analysis and Knowledge Discovery. 2020, 4(7): 50-65. https://doi.org/10.11925/infotech.2096-3467.2020.0452

[Objective] The traditional SQL-centric technology stack cannot handle multivariant and heterogeneous data management, large-scale network management, or complex network analysis. Therefore, we propose a new graph-centric technology stack for big data. [Methods] First, we analyzed the advantages of the graph-based data model and established a new graph-centric technology stack. Then, we developed PandaDB, an intelligent fusion data management system. [Results] The new technology stack performed well in applications involving a biological data network and a scholar knowledge graph. PandaDB could manage the fusion of structured and unstructured data. [Limitations] It is difficult to promote this technology stack further because supporting tools and a complete application ecosystem are lacking. [Conclusions] Our new technology stack will play a greater role in big data applications.

Classification of Health Questions Based on Vector Extension of Keywords

Tang Xiaobo,Gao Hexuan

Data Analysis and Knowledge Discovery. 2020, 4(7): 66-75. https://doi.org/10.11925/infotech.2096-3467.2019.1299

[Objective] This paper proposes a classification model for health questions based on keyword vector expansion, aiming to improve the user experience of medical question-answering communities. [Methods] First, we extracted keywords from the questions using TF-IDF and LDA models. Then, we extended the word vector features with Word2Vec and applied them to the classification of health questions. [Results] The proposed method yielded better classification results with TF-IDF as the keyword extraction method and the complete questions and answers as the training corpus. The number of words in the reserved dictionary was 600, and the language model was CBOW. The P, R, and F values of our optimal model were 0.9872, 0.9725, and 0.9798 respectively. [Limitations] We did not extract keywords with semantic depth from short medical texts. [Conclusions] Our new classification model performs better than the existing ones.
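
The TF-IDF keyword extraction step can be sketched with the standard library alone. The version below assumes pre-tokenized questions, relative term frequency, and an unsmoothed logarithmic inverse document frequency. These are common textbook choices rather than the paper's confirmed settings, and the sample questions are hypothetical. The Word2Vec expansion step is not shown.

```python
import math
from collections import Counter


def tfidf_keywords(docs, top_k=3):
    """Return the top_k highest-scoring TF-IDF terms for each tokenized doc."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    results = []
    for tokens in docs:
        if not tokens:
            results.append([])
            continue
        term_freq = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Sort by descending score, then alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:top_k])
    return results


# Hypothetical tokenized health questions (illustration only).
questions = [
    ["headache", "fever", "cough", "fever"],
    ["headache", "diet", "diet", "exercise"],
    ["headache", "sleep", "insomnia", "sleep"],
]
keywords = tfidf_keywords(questions, top_k=2)
```

A term that appears in every question, such as "headache" here, receives a score of zero and drops out of the keyword list, which is the behavior that makes TF-IDF useful for picking discriminative terms.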

Extracting Key-phrases from Chinese Scholarly Papers

Xia Tian

Data Analysis and Knowledge Discovery. 2020, 4(7): 76-86. https://doi.org/10.11925/infotech.2096-3467.2020.0071

[Objective] This paper proposes a new method to extract key-phrases from Chinese scholarly articles, aiming to provide concept representation at the phrase level for academic text mining. [Methods] First, we introduced the concepts of cohesion and freedom to measure the internal tightness of phrases and the free collocation ability of boundary words. These helped us compute the authority of bi-word phrases. Then, we merged our list with phrases extracted by a position-weighted method. Finally, the top-N elements were retrieved as the final key-phrases. [Results] We examined the proposed PhraseRank method with Chinese academic papers, and found that its precision, recall, and R-MAP values were significantly higher than those of the traditional WordRank algorithm. Among them, the R-MAP value increased by more than 128%. [Limitations] Our method could not identify key-phrases with three or more words. [Conclusions] The key-phrases extracted by PhraseRank, which are more consistent with manually labeled results than keywords, effectively describe the characteristics of Chinese scholarly papers.
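
Cohesion and freedom are often computed as pointwise mutual information and boundary entropy in phrase discovery. The sketch below takes that reading as an assumption: cohesion is the log ratio of the bigram probability to the product of its word probabilities, and freedom is the smaller of the left-neighbor and right-neighbor entropies. The abstract does not give the paper's exact formulas, and the sentences are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a frequency counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def phrase_stats(sentences):
    """Return {(w1, w2): (cohesion, freedom)} for every adjacent word pair."""
    unigrams, bigrams = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for words in sentences:
        unigrams.update(words)
        for i in range(len(words) - 1):
            pair = (words[i], words[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left[pair][words[i - 1]] += 1   # word before the pair
            if i + 2 < len(words):
                right[pair][words[i + 2]] += 1  # word after the pair

    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    stats = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        cohesion = math.log(p_xy / (p_x * p_y))
        freedom = min(entropy(left[pair]), entropy(right[pair]))
        stats[pair] = (cohesion, freedom)
    return stats


# Hypothetical segmented sentences (illustration only).
sentences = [
    ["we", "use", "topic", "model", "here"],
    ["a", "topic", "model", "works"],
    ["the", "topic", "model", "fits"],
]
stats = phrase_stats(sentences)
cohesion, freedom = stats[("topic", "model")]
```

A pair that scores high on both measures sticks together internally while appearing in varied contexts, which is the intuition behind treating it as a candidate phrase.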

Classification and Indexing Method with CNN for Imbalanced Datasets

Weng Mengjuan,Yao Changqing,Han Hongqi,Wang Lijun,Ran Yaxin

Data Analysis and Knowledge Discovery. 2020, 4(7): 87-95. https://doi.org/10.11925/infotech.2096-3467.2020.0137

[Objective] This paper proposes a new classification method based on a Convolutional Neural Network (CNN), aiming to improve the indexing accuracy of skewed datasets. [Methods] Unlike conventional stacking fusion methods, we stacked each base model’s probability distribution over the classification labels as the CNN input. Our method does not need the weight of each base model to be set manually. We examined the proposed model with the third-level categories of the Chinese Library Classification (CLC). [Results] The accuracy of our method was up to 60%, which was 19% higher than that of the baseline models. [Limitations] Our method requires the design of convolution kernels, which can only be determined through experiments. Meanwhile, the complexity of classifier training at the fusion stage depends on the number of categories and base models. [Conclusions] The proposed method can effectively improve the indexing accuracy of imbalanced datasets. With the help of a hierarchical classification strategy, it can automatically complete CLC classification and indexing tasks.
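
One way to picture the fusion step is as a matrix with one row per base model and one column per category, over which a kernel slides. The sketch below is a simplified illustration under that assumption, using a single kernel that spans all models and covers one category at a time. The kernel values are hypothetical stand-ins for weights a CNN would learn during training, and the paper's actual architecture is not described in the abstract.

```python
def stack_probabilities(model_outputs):
    """Stack per-model probability vectors into a models x categories matrix."""
    n_categories = len(model_outputs[0])
    for probs in model_outputs:
        if len(probs) != n_categories:
            raise ValueError("all base models must score the same categories")
    return [list(probs) for probs in model_outputs]


def convolve_columns(matrix, kernel):
    """Apply a (n_models x 1) kernel to each category column."""
    if len(kernel) != len(matrix):
        raise ValueError("kernel height must equal the number of base models")
    n_categories = len(matrix[0])
    return [
        sum(kernel[m] * matrix[m][c] for m in range(len(matrix)))
        for c in range(n_categories)
    ]


# Hypothetical probabilities from three base models over three categories.
base_outputs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
]
matrix = stack_probabilities(base_outputs)

# Hypothetical kernel; in a trained CNN these values are learned, not set by hand.
kernel = [0.5, 0.3, 0.2]
scores = convolve_columns(matrix, kernel)
predicted = max(range(len(scores)), key=scores.__getitem__)
```

With a kernel of this shape the convolution reduces to a weighted vote across base models, which shows why learned kernels can replace manually chosen model weights.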

Classification of Academic Papers for Periodical Selection

Wang Xinyun,Wang Hao,Deng Sanhong,Zhang Baolong

Data Analysis and Knowledge Discovery. 2020, 4(7): 96-109. https://doi.org/10.11925/infotech.2096-3467.2020.0232

[Objective] We constructed a hierarchical system for papers published in academic journals and proposed submission guidance based on the similarity between articles and journals. [Methods] First, we studied journals in the field of Library and Information Science and used hierarchical clustering to construct a two-layer architecture. Then, we employed SVM, CNN, and RNN to classify these papers. Next, we compared the results of different feature combinations and selected the most suitable algorithm. To optimize the classification results, we combined journals with similar coverage. [Results] With the feature combination that best reflected article content, we achieved the highest accuracy, 81.84%. [Limitations] The data size needs to be expanded. [Conclusions] The deep learning algorithms classified papers better than the traditional machine learning algorithm. Combining journals with similar content improves the classification results.

Studying Content Interaction Data with Topic Model and Sentiment Analysis

Xu Hongxia,Yu Qianqian,Qian Li

Data Analysis and Knowledge Discovery. 2020, 4(7): 110-117. https://doi.org/10.11925/infotech.2096-3467.2018.1362

[Objective] This paper explores data mining techniques for identifying confrontational opinions in the interaction data of online communities. [Methods] First, we constructed a new algorithm to analyze emotional confrontations based on sentiment analysis and a topic model. Then, we incorporated the characteristics of knowledge, topic, and interaction data into the new model. Finally, we conducted an empirical study on the topic of AlphaGo. [Results] There were significant “Pro-AlphaGo” and “Anti-AlphaGo” confrontations online. The “Pro-AlphaGo” topics included human intelligence, competition, and ability. The “Anti-AlphaGo” opinions covered AI companies, products, and comprehension abilities. [Limitations] We only examined the proposed model with the topic of AlphaGo. [Conclusions] The proposed method benefits intelligence analysis.

Retrieving Mathematical Expressions Based on Hesitant Fuzzy Weight

Xu Yicong,Tian Xuedong,Li Xinfu,Yang Fang,Shi Qingxuan

Data Analysis and Knowledge Discovery. 2020, 4(7): 118-126. https://doi.org/10.11925/infotech.2096-3467.2019.1294

[Objective] This paper proposes a retrieval method for mathematical expressions, aiming to find items matching the queries in a large collection of math expressions. [Methods] First, we extracted the characteristic subformulas of each mathematical expression and introduced the theory of hesitant fuzzy sets (HFSs) to compute their weights. Second, we summed the weight values of all subformulas belonging to the same expression to obtain the similarity score between the index and the query. Finally, we ranked the retrieved results by their similarity scores. [Results] The proposed method had higher retrieval efficiency and better results than traditional methods, with the highest NDCG value reaching 0.88. [Limitations] Our method did not fully address the semantics of mathematical expressions. [Conclusions] The proposed method could retrieve the needed mathematical expressions more accurately.
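
The scoring and ranking steps above can be sketched as follows. The sketch assumes each subformula carries a hesitant fuzzy element, that is, a set of candidate membership degrees, and reduces it to one weight with the mean-based score function commonly used for hesitant fuzzy sets. Summing the weights follows the abstract, while restricting the sum to subformulas shared with the query is an assumption. The membership degrees, expression names, and subformulas are hypothetical, since the abstract does not say how the paper derives them.

```python
def hfe_score(memberships):
    """Reduce a hesitant fuzzy element to one weight: the mean of its degrees."""
    if not memberships:
        raise ValueError("a hesitant fuzzy element needs at least one degree")
    return sum(memberships) / len(memberships)


def similarity(query_subformulas, indexed_subformulas):
    """Sum the weights of indexed subformulas that also occur in the query."""
    return sum(
        hfe_score(degrees)
        for subformula, degrees in indexed_subformulas.items()
        if subformula in query_subformulas
    )


def rank_expressions(query_subformulas, index):
    """Order indexed expressions by descending similarity to the query."""
    scored = [
        (similarity(query_subformulas, subformulas), name)
        for name, subformulas in index.items()
    ]
    return [name for _, name in sorted(scored, key=lambda s: (-s[0], s[1]))]


# Hypothetical index: expression -> {subformula: membership degrees}.
index = {
    "E1": {"x^2": [0.8, 0.6], "a+b": [0.4, 0.2]},
    "E2": {"x^2": [0.5, 0.3], "\\sqrt{y}": [0.9]},
}
query = {"x^2", "a+b"}

e1_score = similarity(query, index["E1"])
e2_score = similarity(query, index["E2"])
ranking = rank_expressions(query, index)
```

Keeping several membership degrees per subformula lets the index record disagreement between evaluation criteria before collapsing them into a single weight.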

25 July 2020, Volume 4 Issue 7
