[Objective] This paper investigates the issues facing science and technology knowledge services. It designs a big-data-based smart knowledge service, which provides semantic retrieval, precise information push, collective intelligence and intelligent analysis services. [Methods] The proposed system was driven by “data and scene”. It used natural language processing and artificial intelligence technologies to build the Knowledge Graph, Precision Service and Intelligent Informatics components. It also supported the development of new-generation smart knowledge service platforms. [Results] We successfully built a Science and Technology Big Data Center, which helped us develop a knowledge discovery platform. We also created an intelligent research assistant, launched an academic evaluation system for scientific and technological institutions, and constructed a panoramic observation platform for visualizing scientific and technological big data. [Limitations] The knowledge graph and the precision service need to be further improved. [Conclusions] The smart knowledge service platforms provide analysis tools for scientific and technological intelligence.
[Objective] This paper tries to extract information from sci-tech big data and build an academic knowledge network, aiming to develop smart knowledge services. [Methods] We proposed an ontology schema and a framework to construct a knowledge graph based on the distributed storage and high-performance computing of a big data platform. The proposed model helped us extract and align research entities for relationship discovery. We also adopted knowledge merging and enrichment, semantic storage, and quality management techniques. [Results] We created a huge knowledge graph including more than 300 million entities and 1.1 billion relations. It also supported a knowledge discovery platform and smart personal research assistant apps for scientific big data. [Limitations] More research is needed to improve the quality management of the knowledge graph, as well as the precision of entity alignment. [Conclusions] The proposed method improves the knowledge management of science and technology big data.
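The semantic storage step can be sketched minimally, assuming a plain subject-predicate-object triple representation (the identifiers below are invented; the actual system uses distributed big data storage rather than an in-memory set):

```python
# Toy triple store for research entities and their relations. The entity
# identifiers are hypothetical; a production knowledge graph of 300M+
# entities would sit on a distributed platform, not in a Python set.
triples = {
    ("author:42", "wrote", "paper:7"),
    ("paper:7", "cites", "paper:3"),
    ("author:42", "worksAt", "inst:5"),
}

def query(s=None, p=None, o=None):
    """Return triples matching a pattern; None matches anything."""
    return {t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)}
```

A pattern query such as `query(s="author:42")` then returns everything known about one entity, which is the primitive that relationship discovery and navigation build on.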
[Objective] This paper aims to construct name authority files for authors, institutions, journals, and funding. [Methods] First, we loaded, cleansed, transformed, integrated and merged names from multiple sources to create uniformly structured data with unique identifiers. Then, we used the metadata model for name authority to extract research entities and the relationships among them. Finally, we proposed disambiguation algorithms, such as Levenshtein distance, Jaccard similarity, word2vec and CNN, for different research entities. [Results] Our study created name authority databases for authors (23 million records), institutions (2.6 million records), journals (30,000 records), and funding (2 million records). We chose six institutions’ names from NSTL and compared them with those from InCites, finding an average precision of 86.8%. [Limitations] The proposed disambiguation strategies and algorithms need to be further refined to deal with the diverse expressions of the selected disambiguation features. Analysis of data from different sources is also needed in order to apply appropriate algorithms. [Conclusions] The proposed method and disambiguation strategies could improve the performance and comprehensiveness of name authority databases.
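Two of the string-similarity measures named above, Levenshtein distance and Jaccard similarity, can be sketched as follows. The token-level Jaccard variant and the combination thresholds are assumptions for illustration; the paper's actual feature choices are not specified:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity of two names."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def same_entity(a: str, b: str, max_edit=2, min_jaccard=0.5) -> bool:
    """Hypothetical match rule: either measure may trigger a merge."""
    return levenshtein(a, b) <= max_edit or jaccard(a, b) >= min_jaccard
```

In practice each entity type (author, institution, journal, funder) would get its own thresholds and features, per the strategy the abstract describes.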
[Objective] This paper tries to extract fine-grained knowledge units from texts with a deep learning model based on a modified bootstrapping method. [Methods] First, we built a lexicon for each type of knowledge unit with the help of search engines and keywords from Elsevier. Second, we created a large annotated corpus based on the bootstrapping method. Third, we controlled the annotation quality with estimation models for patterns and knowledge units. Finally, we trained the proposed LSTM-CRF model with the annotated corpus and extracted new knowledge units from texts. [Results] We retrieved four types of knowledge units (study scope, research method, experimental data, as well as evaluation criteria and their values) from 17,756 ACL papers. The average precision, calculated manually, was 91%. [Limitations] The model parameters were pre-defined and adjusted manually. More research is needed to evaluate the performance of this method on texts from other domains. [Conclusions] The proposed model effectively addresses the issue of semantic drift. It could extract knowledge units precisely, which makes it an effective solution for the big data acquisition stage of intelligence analysis.
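The core bootstrapping loop can be sketched on toy data: seed knowledge units induce context patterns, and the patterns in turn harvest new candidate units. The window-based patterns and the corpus below are illustrative assumptions; the paper's pattern/unit quality-estimation models and the LSTM-CRF tagger are not shown:

```python
# Toy bootstrapping: seeds -> context patterns -> new candidate units.
corpus = [
    "we evaluate our approach with the BLEU metric",
    "results are reported with the ROUGE metric",
    "we adopt the F1 metric for evaluation",
]
seeds = {"BLEU"}

def induce_patterns(sentences, units, window=2):
    """Collect (left-context, right-context) pairs around seed units."""
    patterns = set()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            if t in units:
                patterns.add((" ".join(toks[max(0, i - window):i]),
                              " ".join(toks[i + 1:i + 1 + window])))
    return patterns

def apply_patterns(sentences, patterns, window=2):
    """Any token whose context matches a pattern becomes a candidate unit."""
    found = set()
    for s in sentences:
        toks = s.split()
        for i, t in enumerate(toks):
            ctx = (" ".join(toks[max(0, i - window):i]),
                   " ".join(toks[i + 1:i + 1 + window]))
            if ctx in patterns:
                found.add(t)
    return found

new_units = seeds | apply_patterns(corpus, induce_patterns(corpus, seeds))
```

Note that "F1" is not extracted: its context differs from the seed pattern. Scoring how reliably a pattern selects true units, as the estimation models above do, is what keeps this loop from drifting semantically.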
[Objective] This paper aims to identify innovative topics from massive volumes of text. [Methods] First, we extracted knowledge points with heavier weights from a scholarly knowledge graph. Then, these knowledge points were labeled as innovative seeds from the perspectives of “popularity”, “novelty” and “authority”. Third, we computed the knowledge correlations of the innovative seeds. Finally, the results were input to a sequence-to-sequence Bi-LSTM model trained on large amounts of sci-tech papers to generate innovative topics. [Results] We used Chinese research papers on artificial intelligence as the experimental data and found the average innovation score of the retrieved topics was 6.52, as evaluated manually by experts. [Limitations] At present, the contents of the knowledge graph and the training datasets need to be improved. [Conclusions] The proposed model, which identifies innovative topics from scholarly papers, could be optimized in the future.
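The seed-labeling step can be illustrated with a toy scoring function over the three named dimensions. The weights, normalizations and statistics below are all assumptions for illustration; the paper does not give its actual formulas:

```python
# Hypothetical seed score combining "popularity", "novelty" and "authority".
def seed_score(freq, max_freq, year, latest_year, span, citations, max_cit,
               w=(0.4, 0.3, 0.3)):
    popularity = freq / max_freq                 # relative occurrence count
    novelty = 1 - (latest_year - year) / span    # recency within the window
    authority = citations / max_cit              # citation strength of sources
    return w[0] * popularity + w[1] * novelty + w[2] * authority

# Invented statistics for two candidate knowledge points.
candidates = {
    "graph neural network": (120, 120, 2019, 2019, 20, 900, 1000),
    "expert system":        (40, 120, 2001, 2019, 20, 1000, 1000),
}
ranked = sorted(candidates, key=lambda k: seed_score(*candidates[k]),
                reverse=True)
```

Under this sketch a frequent, recent, well-cited point outranks an older one even when the latter is slightly more cited, which matches the intent of favoring "popular, novel, authoritative" seeds.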
[Objective] This paper tries to create a big data platform for sci-tech knowledge discovery, aiming to transform keyword-based literature retrieval into knowledge retrieval. [Methods] First, we extracted and annotated scientific research entities and calculated their relationships with data mining techniques. Then, we created distributed indexes based on the entity knowledge graph, which enabled multi-dimensional knowledge retrieval and correlated navigation. [Results] This study generated knowledge graphs for 10 types of research entities, such as papers, projects, scholars and institutions. The proposed platform could conduct intelligent semantic search and multi-dimensional knowledge discovery with these knowledge graphs. [Limitations] Our study works at the entity level, and more research is needed on semantic retrieval. [Conclusions] The proposed platform organizes data at the knowledge level, which meets users’ demands for precise knowledge retrieval and improves the user experience.
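A minimal sketch of entity-level indexing for correlated navigation, assuming entities typed as in the abstract (papers, scholars, institutions); the entities and edges are invented and the real platform's indexes are distributed:

```python
from collections import defaultdict

# Toy entity store plus an index keyed by (relation, target entity), so a
# query can pivot from one entity type to another ("correlated navigation").
entities = {
    "p1": {"type": "paper", "title": "knowledge graph survey"},
    "a1": {"type": "scholar", "name": "Alice"},
    "i1": {"type": "institution", "name": "NSTL"},
}
edges = [("a1", "authored", "p1"), ("a1", "affiliated_with", "i1")]

index = defaultdict(set)                 # (relation, target) -> sources
for src, rel, dst in edges:
    index[(rel, dst)].add(src)

def related(rel, dst):
    """Which entities stand in relation `rel` to entity `dst`?"""
    return index[(rel, dst)]
```

Chaining such lookups (paper → authors → institutions) is what turns a keyword hit into multi-dimensional knowledge retrieval.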
[Objective] This paper proposes a framework for precision services over scholarly big data, aiming to improve researchers’ knowledge acquisition. [Methods] First, we analyzed the status quo of online precision services. Then we summarized and compared the methods of precision services from the perspectives of data organization, technical methods and application scenarios. Finally, we designed a framework for the academic eco-chain of scientific research. [Results] The framework connected data production, technology research and application development, which supported the precise search and recommendation of sci-tech data. [Limitations] More research is needed to evaluate the framework with real-world cases. [Conclusions] The proposed framework could help us build better academic precision search systems.
[Objective] This study tries to utilize patterns in HS codes to provide effective knowledge services for China customs taxation. [Methods] We proposed two machine learning-based automatic classification schemes. The first one directly used the original HS codes as risk identifiers, while the other relied on the correctness of the HS codes. We also built an SVM prediction model and examined the two schemes from the perspectives of target structures and features, as well as text length. [Results] We found that the second model required less training effort and processing time while reaching better accuracy. [Limitations] We only used four months of data to train the new models. [Conclusions] This study finds an effective way to forecast customs risks and indicates directions for applicable products.
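The difference between the two schemes can be sketched as feature construction, assuming 8-digit HS codes. The keyword-to-heading lookup and the codes below are invented, and the SVM classifier itself is not shown:

```python
# Hypothetical mapping from goods-description keywords to expected HS heading.
EXPECTED_HEADING = {"laptop": "8471", "t-shirt": "6109"}

def scheme1_features(declared_code: str) -> dict:
    """Scheme 1: the raw declared HS code itself is the risk identifier."""
    return {"hs_code": declared_code}

def scheme2_features(declared_code: str, description: str) -> dict:
    """Scheme 2: rely on the correctness of the declared code, i.e. check
    it against the heading expected from the goods description."""
    expected = next((h for kw, h in EXPECTED_HEADING.items()
                     if kw in description.lower()), None)
    return {
        "heading": declared_code[:4],
        "code_matches_description": expected is not None
                                    and declared_code.startswith(expected),
    }
```

The correctness flag in scheme 2 collapses thousands of raw code values into a compact signal, which is consistent with the abstract's finding that the second model trains faster yet classifies better.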
[Objective] This paper tries to improve the effectiveness and efficiency of acquiring decision-making knowledge from financial institutions. [Methods] First, we built a framework for a financial decision-making knowledge acquisition system, which used neighborhood rough sets to remove redundant attributes. Then, we adopted the SMOTE method to balance the data. We also applied the grid search method to optimize the parameters of the ensemble classifiers. Third, we trained and used the new model to identify the optimal reduction group. Finally, we acquired the needed knowledge through the optimal reduction and stored it in the database. [Results] We examined the proposed method with 4,521 financial records, which yielded a sensitivity of 83.55%, a specificity of 80.74% and an AUC of 0.8214. [Limitations] We did not run the proposed model on insurance or consumer loan data. [Conclusions] The proposed method could improve the classification performance of financial decision-making systems, which could identify and acquire knowledge of key customers effectively.
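The SMOTE balancing step can be sketched in pure Python: a synthetic minority sample is created by interpolating between a minority point and one of its nearest minority neighbors. This is a didactic re-implementation on toy points, not the library implementation the authors may have used:

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by interpolating between a
    random minority point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        neigh = sorted((p for p in minority if p is not x),
                       key=lambda p: sum((a - b) ** 2
                                         for a, b in zip(x, p)))[:k]
        nb = rng.choice(neigh)
        gap = rng.random()                     # position along the segment
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1)]
new_points = smote(minority, n_new=3)
```

Each synthetic point lies on a segment between two real minority points, so the oversampled class stays inside its original region of feature space.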
[Objective] This paper proposes a fine-grained sentiment analysis method based on Convolutional Neural Networks (CNN). [Methods] First, we incorporated attribute features into the word vector model. Then, we statistically extracted keyword sets from the comments based on the fine-grained attributes of products or services. Third, we constructed eigenvectors of the comments with the attributes of the target objects. Finally, we trained the modified CNN model, which adds an affective clustering layer for the input text vectors. [Results] Compared with the traditional sentiment classification model, the new CNN model was significantly better in terms of precision, recall and F-score. [Limitations] We only examined the new model with comments from one field. [Conclusions] The fine-grained sentiment analysis method based on convolutional neural networks can dramatically improve the precision of sentiment classification.
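The statistical keyword-extraction step can be sketched by counting the words that co-occur with each fine-grained attribute. The comments, attributes and stopword list are toy data, and simple frequency stands in for whatever statistic the paper uses; the CNN itself is not shown:

```python
from collections import Counter

comments = [
    "battery life is great",
    "battery drains fast",
    "screen is bright and great",
]
STOP = {"is", "and", "the", "a"}

def attribute_keywords(comments, attribute, top_n=2):
    """Most frequent non-stopwords in comments mentioning `attribute`."""
    counts = Counter(w for c in comments if attribute in c.split()
                     for w in c.split() if w != attribute and w not in STOP)
    return [w for w, _ in counts.most_common(top_n)]
```

These per-attribute keyword sets are what would then feed the comment eigenvectors the abstract describes.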
[Objective] This paper proposes a new method to discover collaboration opportunities around emerging issues. [Methods] We used a corpus of deep learning literature as the research object. First, we explored the intrinsic characteristics of this literature with the LDA topic model. Then, we calculated the topics’ weights and used the topics as nodes to build a topic co-occurrence network. Finally, we applied link prediction to find potential opportunities. [Results] The optimal link prediction index for the topic co-occurrence network in deep learning was the Adamic-Adar (AA) index. Big data analysis research in deep learning was more likely to be associated with biomedical studies and the improvement of related algorithms. [Limitations] Link prediction generated poor results for poorly connected networks. [Conclusions] The LDA topic model and link prediction could help us find new collaboration opportunities around emerging issues.
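The AA index found optimal in the experiment scores a candidate topic pair by their shared neighbors, discounting high-degree hubs. A minimal sketch on an invented toy network:

```python
import math
from collections import defaultdict

# Toy topic co-occurrence network (edges = topics appearing together).
edges = [("big data", "deep learning"), ("deep learning", "biomedicine"),
         ("big data", "algorithms"), ("biomedicine", "algorithms")]

adj = defaultdict(set)
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

def adamic_adar(u, v):
    """AA(u, v) = sum over common neighbours z of 1 / log(degree(z))."""
    return sum(1 / math.log(len(adj[z])) for z in adj[u] & adj[v])

score = adamic_adar("big data", "biomedicine")
```

A high AA score for an unlinked pair such as ("big data", "biomedicine") is exactly the kind of signal the abstract reads as a potential collaboration opportunity. Note the 1/log(degree) term also shows why the method degrades on poorly connected networks: few shared neighbors means little evidence to sum.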
[Objective] This study aims to build a model for effectively predicting the ratings of user reviews and analyzing consumer behaviors. [Methods] First, we applied the Latent Dirichlet Allocation model to set the topic features of user reviews as independent variables and user ratings as the dependent variable. Then, we built a user rating prediction model based on the eXtreme Gradient Boosting (XGBoost) algorithm. Finally, we added disturbances of samples and attributes to the proposed model for rating prediction. [Results] We used the new model to predict users’ comments on a domestic automobile online portal and identified their automobile preferences. Compared with the Logistic Regression and Random Forest algorithms, the proposed model had better precision and efficiency. [Limitations] We need to include data from other fields to describe users’ behaviors more comprehensively. [Conclusions] The proposed model could quantify users’ reviews and then predict their ratings effectively.
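The modeling setup (per-review topic proportions as independent variables, star rating as the dependent variable) can be illustrated on toy data. XGBoost is not re-implemented here; an ordinary least-squares fit via the normal equations stands in for the boosted trees, and the topic proportions are invented:

```python
def ols(X, y):
    """Solve (X^T X) beta = X^T y by Gaussian elimination (tiny, dense)."""
    n = len(X[0])
    A = [[sum(X[r][i] * X[r][j] for r in range(len(X))) for j in range(n)]
         for i in range(n)]
    b = [sum(X[r][i] * y[r] for r in range(len(X))) for i in range(n)]
    for i in range(n):                       # forward elimination with pivoting
        p = max(range(i, n), key=lambda r: abs(A[r][i]))
        A[i], A[p] = A[p], A[i]
        b[i], b[p] = b[p], b[i]
        for r in range(i + 1, n):
            f = A[r][i] / A[i][i]
            for c in range(i, n):
                A[r][c] -= f * A[i][c]
            b[r] -= f * b[i]
    beta = [0.0] * n
    for i in reversed(range(n)):             # back substitution
        beta[i] = (b[i] - sum(A[i][c] * beta[c]
                              for c in range(i + 1, n))) / A[i][i]
    return beta

# Toy reviews: two topic proportions ("engine praise", "price complaint")
# against the star rating the user gave.
X = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
y = [5.0, 4.0, 2.0, 1.0]
beta = ols(X, y)
predict = lambda x: sum(b * v for b, v in zip(beta, x))
```

The learned coefficients directly expose which topics drive high ratings, which is how topic features support the consumer-preference analysis the abstract reports.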