Data Analysis and Knowledge Discovery

Current Issue, Volume 5 Issue 10

Chinese Text Classification with Feature Fusion

Wang Yan, Wang Huyan, Yu Bengong

2021, 5 (10): 1-14. DOI: 10.11925/infotech.2096-3467.2021.0228

[Objective] This paper proposes a new classification model for Chinese texts, aiming to address the issues of weak structure, spelling errors or homonyms in the texts. [Methods] First, we constructed a multi-feature fusion method based on the traditional feature fusion model for text classification. Then, we combined word-level features, part-of-speech feature extensions, Chinese character features and Pinyin letters to create a multi-feature semantic representation. Third, we fed this representation into a BiGRU to obtain the context semantics, which were processed with a multi-channel CNN to generate the main features. Finally, we merged these features and passed them to the softmax layer to predict the required category labels. [Results] The accuracy of our multi-feature fusion model reached 83.3% and 91.1% on two datasets, which was 7% higher than that of the existing model. [Limitations] More research is needed to examine the model with larger datasets. [Conclusions] The proposed model could effectively complete Chinese text classification tasks.

Mining Topics of Social Appeals and Interprovincial Differences in Government-People Interaction——Case Study of E-mail Corpus of Provincial Leaders

Hu Guangwei, Teng Jie, Liu Lu

2021, 5 (10): 15-27. DOI: 10.11925/infotech.2096-3467.2021.0142

[Objective] This research proposes a method to explore the topics of public emails sent to provincial leaders, aiming to effectively respond to residents’ demands and provide support for social governance and services. [Methods] First, we retrieved text messages from public emails in 27 provinces and 4 municipalities directly under the Central Government. A total of 106,810 valid items were collected. Then, we applied the LDA modeling method to extract the topics of these emails and identify the public appeals. Finally, we conducted a comparative analysis of these appeals in different provinces and cities to understand the differences in social governance. [Results] The public paid more attention to livelihood services, social development, education services, health issues, legal services and resource ecology. The public appeals showed significant inter-provincial differences. For example, Shanxi people paid more attention to employment; Jiangxi people attached great importance to travel; Henan people focused on education; Shanghai people cared about housing issues; and Guangxi people paid more attention to enterprises. We also built a panoramic view of social concerns to support local governments’ decision making. [Limitations] This study did not comprehensively examine public appeals from multiple channels. More research is needed to optimize the algorithms and analysis methods. [Conclusions] This study could help local governments better understand people’s demands or concerns, and improve decision making.

Multi-layer Cascade Classifier for Credit Scoring with Multiple-Support Vector Machines

Feng Hao, Li Shuqing

2021, 5 (10): 28-36. DOI: 10.11925/infotech.2096-3467.2021.0096

[Objective] This paper proposes a new multi-layer cascade classifier based on multiple support vector machines, aiming to address the credit scoring issues of financial institutions. [Methods] The proposed hybrid model combines the ideas of genetic algorithms, machine learning and ensemble learning. The framework includes a support vector machine classifier, a normalization method, feature extraction, parameter optimization, 10-fold cross-validation and other techniques. We tested the layer deepening strategy, the attribute reuse method, and fitness function diversification through experiments. [Results] We examined the support vector machine optimized by the genetic algorithm with the Australian Credit Approval dataset. The prediction accuracy improved as the number of layers increased, and the overall prediction accuracy of the framework reached 93.33%. [Limitations] The proposed method only uses SVM as the base classifier, which needs to be expanded. The framework contains many classifiers, which take a long time to train and optimize. [Conclusions] The proposed classifier could effectively improve credit scoring services, and complete similar binary classification tasks.

Identifying Cross-Region Patent Collaboration Opportunities Using LDA and Decision Trees——Case Study of Universities from Guangdong and Wuhan

Chen Hao, Zhang Mengyi, Cheng Xiufeng

2021, 5 (10): 37-50. DOI: 10.11925/infotech.2096-3467.2021.0194

[Objective] This paper proposes an algorithm to identify potential collaboration opportunities for patents with the LDA and decision tree models, aiming to enhance cross-region innovation. [Methods] First, we retrieved 22,855 patents from the incoPat database, which were developed by higher education institutions from Guangdong Province and Wuhan City. Then, we used the LDA to extract and cluster patent topics. Third, we constructed a decision tree to identify the best potential cooperative relations by adjusting the decision boundaries. Finally, we chose the optimal data mining strategy based on the effective size of the inventors’ network, which helped us identify and recommend cooperative relationships. [Results] We found 18 pairs of potential cross-regional partners from the top four patent categories in the dataset, which was much better than the result of the link prediction method. [Limitations] The coverage of patent data needs to be expanded. More research is also needed to study the impacts of universities and industry on the innovation ecology. [Conclusions] The proposed method could identify potential cross-region partners for patents and innovation.

Topic Analysis of LIS Big Data Research with Overlay Mapping

Chen Shiji, Qiu Junping, Yu Bo

2021, 5 (10): 51-59. DOI: 10.11925/infotech.2096-3467.2021.0113

[Objective] This paper explores the topics of big data research in Library and Information Science (LIS), aiming to reveal their developing trends. [Methods] We used “big data” as the keyword to search the Web of Science and then constructed a test collection with the retrieved documents. Based on citation analysis, we removed the irrelevant documents. Then, we used the Leiden algorithm and VOSviewer to construct a science map of LIS big data research. Finally, we created an overlay map of research topics. [Results] According to the citation analysis, LIS big data research focuses on big data and social media analysis, followed by cloud computing, machine learning, big data technologies (such as Hadoop and MapReduce), health information, precision medicine, Industry 4.0 and the Internet of Things. [Limitations] We only analyzed the themes and development trends of LIS big data research from the macro perspective. [Conclusions] Big data is an important LIS research topic. Popular studies focus on big data and social media analysis. Machine learning, health information, precision medicine, Industry 4.0 and the Internet of Things are important directions for Library and Information Science.

Extracting Hypernym-Hyponym Relationship for Financial Market Applications

Dai Zhihong, Hao Xiaoling

2021, 5 (10): 60-70. DOI: 10.11925/infotech.2096-3467.2020.1261

[Objective] This paper proposes a new method to extract hypernym-hyponym relationships from a knowledge graph, and then explores its effectiveness in a practical application. [Methods] First, we constructed the mapping matrix for hypernym-hyponym words and their context semantics. Then, we combined word vector similarity with the matrix to extract the relationships. [Results] We examined our method with datasets of listed companies and found its F1 value was more than 3% higher than those of the existing methods. The new model could help us study the association between company similarity and stock performance. [Limitations] More research is needed to improve relationship extraction with the help of clustering techniques and pattern matching methods. [Conclusions] The proposed method can effectively identify the relationships between entities, and help study the related listed companies and stocks. It also helps us construct better knowledge graphs in the financial field.

Position-Aware Stepwise Tagging Method for Triples Extraction of Entity-Relationship

Wang Yuan, Shi Kaize, Niu Zhendong

2021, 5 (10): 71-80. DOI: 10.11925/infotech.2096-3467.2021.0302

[Objective] This paper designs a joint model for overlapping scenes, aiming to effectively extract triples from unstructured texts. [Methods] We designed a position-aware stepwise tagging method. First, the main entities were determined by tagging their start and end positions. Then, we tagged the corresponding objects under each predefined relation. We also added multiple types of position-aware information to the tagging procedures. Finally, we shared the encoded sequences with the pre-order results and the attention mechanism. [Results] We examined our new model with DuIE, a public Chinese dataset. Our method outperformed the baseline models, with an F1 value of 0.886. We also verified the effectiveness of the model’s components through ablation studies. [Limitations] More research is needed to investigate the occasionally nested entities. [Conclusions] The proposed method could effectively address the issues facing triple extraction for overlapping scenes, and provide a reference for future studies.

Constructing Degree Lexicon for STI Policy Texts

Zheng Xinman, Dong Yu

2021, 5 (10): 81-93. DOI: 10.11925/infotech.2096-3467.2021.0148

[Objective] This paper constructs a sentiment lexicon for STI policy texts, aiming to identify and quantify the embedded attitudes of policy makers. It tries to address a gap in existing studies, which ignore the semantic intensity of words. [Methods] First, we summarized the characteristics of policy texts and proposed a method to construct a degree lexicon. This method chose seed words from expert knowledge, expanded domain degree words with the PMI algorithm, and screened these words with Tongyi Cilin. Finally, we combined the TextRank algorithm with the new lexicon and conducted an experimental validation. [Results] The constructed degree lexicon yielded better results in policy text analysis than the traditional single text mining algorithm. [Limitations] The weights of our lexicon need to be refined. [Conclusions] The degree words in STI policy texts are abundant, standardized and stable. The new lexicon can effectively utilize degree words, and learn more semantic features of policy texts.
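
The PMI-based expansion step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes document-level co-occurrence counts and a fixed threshold; the function names (`pmi_scores`, `expand_seeds`), the toy corpus and the threshold value are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(docs):
    """Return PMI (base 2) for every word pair that co-occurs in a document.

    Probabilities are estimated from document frequencies, which is one
    common choice; the paper may count co-occurrence differently.
    """
    n_docs = len(docs)
    word_df = Counter()
    pair_df = Counter()
    for doc in docs:
        words = sorted(set(doc))  # sorted so each pair has one canonical key
        word_df.update(words)
        pair_df.update(combinations(words, 2))
    scores = {}
    for (a, b), n_ab in pair_df.items():
        p_ab = n_ab / n_docs
        p_a = word_df[a] / n_docs
        p_b = word_df[b] / n_docs
        scores[(a, b)] = math.log2(p_ab / (p_a * p_b))
    return scores


def expand_seeds(docs, seeds, threshold=0.5):
    """Collect non-seed words whose PMI with any seed word exceeds threshold."""
    candidates = set()
    for (a, b), score in pmi_scores(docs).items():
        if score > threshold:
            if a in seeds and b not in seeds:
                candidates.add(b)
            elif b in seeds and a not in seeds:
                candidates.add(a)
    return candidates


# Hypothetical toy corpus of tokenized policy sentences.
toy_docs = [
    ["strongly", "vigorously", "promote"],
    ["strongly", "vigorously", "support"],
    ["promote", "support"],
    ["budget", "report"],
]
expanded = expand_seeds(toy_docs, seeds={"strongly"})
```

In this toy corpus only "vigorously" co-occurs with the seed more often than chance would predict, so it is the only candidate returned; a real lexicon would still need the screening step the abstract describes.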

Recommendation Strategy Based on Users’ Preferences for Fine-Grained Attributes

Yang Chen, Chen Xiaohong, Wang Chuhan, Liu Tingting

2021, 5 (10): 94-102. DOI: 10.11925/infotech.2096-3467.2021.0291

[Objective] This study proposes an improved recommendation model based on the users’ preferences for fine-grained attributes, aiming to address the data sparsity issues of the existing algorithms. [Methods] First, we constructed models for the project-attribute relationship and user-attribute preference. Then, we built similar clusters for users and projects respectively. Finally, we used the collaborative filtering algorithm to generate recommendation lists based on user or project clusters. [Results] We examined the new method with a dataset from Douban.com. Compared with the suboptimal models, the proposed approach significantly improved the Precision and Recall of the recommendation tasks (up to 19.7% and 44.6% respectively). [Limitations] More research is needed to further improve the representation and modeling of multi-dimensional fine-grained attributes. [Conclusions] The proposed model could effectively represent users’ interests and improve the performance of recommendation.

Topic Evolution of Online Reviews for Crowdfunding Campaigns

Wang Wei, Gao Ning, Xu Yuting, Wang Hongwei

2021, 5 (10): 103-123. DOI: 10.11925/infotech.2096-3467.2021.0029

[Objective] This paper reveals the changes in users’ interests in crowdfunding projects and analyzes the dynamic evolution of their online comments on these projects. [Methods] First, we retrieved 497,936 online comments on 6,537 technology-related projects from Kickstarter as the corpus. Then, with the help of the LDA model, we analyzed the topic evolution of these comments. Finally, we obtained the dynamic evolution model of the topics with the help of cosine similarity. [Results] In the initial stage of financing, the comments were mainly on basic project information. Then, the comments focused on return on investment and product information. In the final stage, the comments were on shipping issues. For successful projects, the topics developed from project description to waiting time for products and deliveries. For failed projects, the comments gradually evolved into the possible relaunch and prospects of a new project. [Limitations] We did not distinguish the project categories, which need to be analyzed in the future. This paper only examined the reward-based crowdfunding model, which also needs to be expanded. [Conclusions] This article analyzes reviews of online crowdfunding projects and expands the application of LDA in the field of crowdfunding, which provides practical suggestions for platforms, project sponsors and investors.
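
One common way to trace topic evolution with cosine similarity, as the abstract above mentions, is to link each topic in one time window to its most similar topic in the next. The sketch below assumes topics are given as word-probability vectors over a shared vocabulary; the function names (`cosine`, `link_topics`), the toy vectors and the 0.5 threshold are hypothetical and do not reproduce the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def link_topics(prev_topics, next_topics, threshold=0.5):
    """Link each earlier topic to its most similar later topic.

    Returns (prev_index, next_index, similarity) tuples; an earlier topic
    with no later topic above the threshold gets no link.
    """
    links = []
    for i, p in enumerate(prev_topics):
        best_j, best_sim = None, threshold
        for j, q in enumerate(next_topics):
            sim = cosine(p, q)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None:
            links.append((i, best_j, best_sim))
    return links


# Hypothetical topic-word distributions over the vocabulary
# ["price", "reward", "shipping", "delay"] for two financing stages.
stage_1 = [[0.6, 0.4, 0.0, 0.0]]
stage_2 = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
links = link_topics(stage_1, stage_2)
```

Here the single early topic is linked to the first later topic, while the shipping-related topic has no predecessor and would be read as newly emerging.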

Integrating Expert Reviews for Government Information Projects with Knowledge Fusion

Hua Bin, Wu Nuo, He Xin

2021, 5 (10): 124-136. DOI: 10.11925/infotech.2096-3467.2021.0137

[Objective] This paper proposes a new method to integrate the short texts of multi-expert reviews for the same government information project, aiming to generate a comprehensive opinion with knowledge fusion at the cognitive level. [Methods] First, we extracted knowledge from the reviews through content mining. Then, we analyzed the semantics of these reviews with a target knowledge concept tree and a customized method. Third, we completed knowledge fusion at the micro and macro levels based on the text structure model and domain ontology. [Results] Compared with the original texts, the amount of information provided by our method increased by 0.19, while the average ratio of knowledge elements reached 115.38%. [Limitations] The proposed method could be affected by the language of expert reviews and the integrity of domain knowledge. [Conclusions] Our new method could effectively integrate short texts from various fields.
