Data Analysis and Knowledge Discovery

Entity Linking Method for Short Texts with Multi-Knowledge Bases: Case Study of Wikipedia and Freebase

Zhou Pengcheng,Wu Chuan,Lu Wei

New Technology of Library and Information Service. 2016, 32(6): 1-11. https://doi.org/10.11925/infotech.1003-3513.2016.06.01

[Objective] This paper proposes an entity linking method that uses multiple knowledge bases to address the low coverage of entity linking with a single knowledge base. [Methods] First, we generated n-grams of the input text and obtained candidate mentions using part-of-speech information and a multi-mention-entity dictionary. Second, we generated mention combinations and retained those with the highest coverage that were not contained in other mention combinations. Third, we generated entity sequences and calculated their relevance using information from the multiple knowledge bases. We took the entity sequence with the highest relevance as the final result. [Results] The case study showed that the Precision, Recall, and F-value of entity linking based on Wikipedia+Freebase reached 71.81%, 76.86%, and 74.25%, respectively. [Limitations] Filtering n-grams by part of speech lacked a theoretical foundation, and the FACC1 dataset featured high precision but low recall. [Conclusions] Using entity information from multiple knowledge bases can improve the performance of entity linking.
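
The first two steps of the method can be illustrated with a rough sketch. The dictionary contents, the function names, and the whitespace tokenization below are all assumptions made for illustration; part-of-speech filtering is omitted, and the "mention combination" step is simplified to dropping single spans that are contained in a longer candidate span. It is not the authors' implementation.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start token index, end token index, text)

# Hypothetical mention -> candidate-entity dictionary (toy data, not from the paper).
MENTION_DICT: Dict[str, List[str]] = {
    "new york": ["New_York_City", "New_York_(state)"],
    "york": ["York"],
    "new york times": ["The_New_York_Times"],
}


def ngrams(tokens: List[str], max_n: int = 3) -> List[Span]:
    """Return every n-gram of up to max_n tokens as a (start, end, text) span."""
    spans = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_n, len(tokens)) + 1):
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans


def candidate_mentions(text: str, max_n: int = 3) -> List[Span]:
    """Keep only the n-grams that appear in the mention dictionary."""
    tokens = text.lower().split()
    return [s for s in ngrams(tokens, max_n) if s[2] in MENTION_DICT]


def drop_contained(spans: List[Span]) -> List[Span]:
    """Drop any span that lies inside another candidate span."""
    kept = []
    for s in spans:
        contained = any(
            o != s and o[0] <= s[0] and s[1] <= o[1] for o in spans
        )
        if not contained:
            kept.append(s)
    return kept


mentions = candidate_mentions("the new york times reported")
maximal = drop_contained(mentions)
# maximal == [(1, 4, "new york times")]
```

Each surviving mention would then be expanded into its candidate entities from `MENTION_DICT`, and the entity sequences scored for relevance, which this sketch does not attempt.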

Classifying Chinese News Texts with Denoising Auto Encoder

Liu Hongguang,Ma Shuanggang,Liu Guifeng

New Technology of Library and Information Service. 2016, 32(6): 12-19. https://doi.org/10.11925/infotech.1003-3513.2016.06.02

[Objective] This paper proposes a new method to improve the classification accuracy of Chinese news texts using deep learning. [Methods] First, we used a denoising autoencoder to construct a deep network that learns a compressed, distributed representation of Chinese news texts. Second, we used the SVM algorithm to classify these texts. [Results] As the number of samples increased, the precision, recall, and F-value of the proposed method also increased. The results were better than those obtained with the KNN, BP, and SVM algorithms. The average precision was higher than 95%. [Limitations] The dataset was relatively small, so the proposed method did not fully use the parallel data processing capacity of deep learning. [Conclusions] The proposed method improves the performance of Chinese news text classification.

Using Word2vec with TextRank to Extract Keywords

Ning Jianfei,Liu Jiangzhen

New Technology of Library and Information Service. 2016, 32(6): 20-27. https://doi.org/10.11925/infotech.1003-3513.2016.06.03

[Objective] This study extracts keywords by combining the internal structure of each single document with word vectors trained on the corpus. [Methods] First, we used Word2vec to generate vectors for all words in the document corpus and calculated their similarities. Second, we modified the TextRank algorithm to assign weights to candidate keywords according to their similarities and adjacency relations. Finally, we built a probability transition matrix for the iterative calculation of the lexical graph model and extracted keywords. [Results] Integrating Word2vec with TextRank extracted keywords effectively. [Limitations] The proposed method requires extensive training on the corpus to build the word vectors and relation matrix. [Conclusions] The relationships among words across the document set can be used to adjust the word relationships within a single document, which increases the accuracy of keyword extraction from individual documents.
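
As a minimal sketch of the general idea (not the authors' exact weighting scheme), the code below runs TextRank on a co-occurrence graph whose edge weights are cosine similarities between word vectors. The toy two-dimensional vectors stand in for trained Word2vec embeddings, and the window size, damping factor, and function names are assumptions.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def weighted_textrank(
    tokens: List[str],
    vectors: Dict[str, List[float]],
    window: int = 2,
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Score words on a graph weighted by adjacency and vector similarity."""
    words = sorted(set(tokens))
    weight: Dict[str, Dict[str, float]] = {w: {} for w in words}
    # Adjacency: connect words that co-occur within the window.
    # Similarity: each co-occurrence adds the (non-negative) cosine similarity.
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            v = tokens[j]
            if v == w:
                continue
            sim = max(cosine(vectors[w], vectors[v]), 0.0)
            weight[w][v] = weight[w].get(v, 0.0) + sim
            weight[v][w] = weight[v].get(w, 0.0) + sim
    out_sum = {w: sum(weight[w].values()) for w in words}
    score = {w: 1.0 / len(words) for w in words}
    # Power iteration over the row-normalized (transition) weights.
    for _ in range(iterations):
        new_score = {}
        for w in words:
            rank = sum(
                score[v] * weight[v][w] / out_sum[v]
                for v in weight[w]
                if out_sum[v] > 0
            )
            new_score[w] = (1 - damping) / len(words) + damping * rank
        score = new_score
    return score


# Toy vectors standing in for Word2vec output (illustrative values only).
tokens = ["graph", "model", "keyword", "graph", "rank", "keyword"]
vectors = {
    "graph": [1.0, 0.2],
    "model": [0.8, 0.4],
    "keyword": [0.3, 1.0],
    "rank": [0.9, 0.1],
}
scores = weighted_textrank(tokens, vectors)
top = sorted(scores, key=scores.get, reverse=True)[:2]
```

Dividing each edge weight by the source word's total outgoing weight plays the role of the probability transition matrix mentioned in the abstract.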

Extracting Chinese Metallurgy Patent Terms with Conditional Random Fields

Wang Miping,Wang Hao,Deng Sanhong,Wu Zhixiang

New Technology of Library and Information Service. 2016, 32(6): 28-36. https://doi.org/10.11925/infotech.1003-3513.2016.06.04

[Objective] This paper proposes a model to extract Chinese metallurgy patent terms effectively. [Methods] We built a model that automatically identifies Chinese metallurgy patent terms using conditional random fields (CRFs). The model was tested with an incomplete core corpus. We discussed the development process and compared the impacts of various CRF factors on this character-role-labeling model. [Results] The new model combined the character sequences, level features, areal features and temperature features of the patent terms. Its precision was 94.26%, its recall was 94.37%, and its F1 value was 94.5% when the length of the proximity window and the values of parameters c and f were 3, 1, and 1, respectively. [Limitations] Some term labels were not accurate enough because of the incomplete core corpus. We did not compare our model with other methods to assess the reliability of the CRFs. [Conclusions] The CRF model can effectively identify Chinese metallurgy patent terms under appropriate conditions.
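
Character-role labeling can be pictured as assigning each character a tag that marks its position inside or outside a term; a CRF is then trained on such sequences. The sketch below prepares labels from a list of known terms. The B/I/E/S/O tag set, the function name, and the example sentence are assumptions for illustration; the abstract does not specify the authors' label scheme or features.

```python
from typing import List, Tuple


def label_characters(sentence: str, terms: List[str]) -> List[Tuple[str, str]]:
    """Tag each character as B(egin), I(nside), E(nd), S(ingle) or O(utside)."""
    labels = ["O"] * len(sentence)
    # Match longer terms first so that nested shorter terms do not overwrite them.
    for term in sorted(terms, key=len, reverse=True):
        if not term:
            continue
        start = sentence.find(term)
        while start != -1:
            end = start + len(term)
            # Only label spans that no earlier (longer) term has claimed.
            if all(tag == "O" for tag in labels[start:end]):
                if len(term) == 1:
                    labels[start] = "S"
                else:
                    labels[start] = "B"
                    for k in range(start + 1, end - 1):
                        labels[k] = "I"
                    labels[end - 1] = "E"
            start = sentence.find(term, start + 1)
    return list(zip(sentence, labels))


# Hypothetical example: "高炉炼铁" (blast furnace ironmaking) as a known term.
labeled = label_characters("采用高炉炼铁工艺", ["高炉炼铁"])
# labeled == [("采", "O"), ("用", "O"), ("高", "B"), ("炉", "I"),
#             ("炼", "I"), ("铁", "E"), ("工", "O"), ("艺", "O")]
```

Sequences in this form, together with per-character features taken from a context window, are the usual input to a CRF toolkit.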

Web Users’ Group Attitudes to Online Rumors

Shen Chao,Zhu Qinghua,Shen Hongzhou

New Technology of Library and Information Service. 2016, 32(6): 37-45. https://doi.org/10.11925/infotech.1003-3513.2016.06.05

[Objective] This study examines the motivations and patterns of Web users’ (specifically, college students’) group attitudes toward various online rumors. [Methods] First, we used a survey to collect data on participants’ attitudes and behaviors during rumor spreading. Second, we analyzed the data with a classification algorithm. [Results] We found that Web users tried to verify the rumors, that the content of the rumors determined the dissemination channels, and that the interaction among Web users changed dynamically. The main factors affecting Web users’ attitudes included initial awareness, group behavior, and information acquisition and communication channels. However, there was no significant correlation between changes in Web users’ attitudes and the rumors’ contents. [Limitations] We only investigated college students’ reactions to rumors, which might affect the comprehensiveness of the conclusions. [Conclusions] Analyzing online rumors with empirical research and data mining offers practical insights for building online rumor models.

Properties of Scholarly Papers and Number of Citations

Xiao Xuebin,Chai Yanju

New Technology of Library and Information Service. 2016, 32(6): 46-53. https://doi.org/10.11925/infotech.1003-3513.2016.06.06

[Objective] This study examines the relationship between properties of scholarly papers and the number of citations they receive. [Methods] First, we adopted several measures to reduce the influence of irrelevant factors. Second, we drew trend lines to analyze the relationship between the target properties and the number of citations over a period of three years. [Results] The number of citations was positively correlated with some properties, such as the numbers of authors, pages and references, and the length of the abstract. In contrast, there was no relationship between the number of keywords and the number of citations. Titles had mixed effects on the number of citations. [Limitations] All samples were collected from the SCIE database in the Engineering and Mechanical fields, so the results might not generalize to other areas. [Conclusions] Specific properties of a paper have positive effects on the number of citations.

Analyzing Food Community with Recipes and Weibo User Reviews

Wu Xiaolan,Zhang Chengzhi

New Technology of Library and Information Service. 2016, 32(6): 54-62. https://doi.org/10.11925/infotech.1003-3513.2016.06.07

[Objective] This study examines the structure of the online food community using large-scale real-world data. [Methods] First, we collected recipes from meishij.net (a popular online food website) and user reviews from Sina Weibo (a micro-blog service). Second, we identified the Weibo users who mentioned recipes from meishij.net and mapped them to province and cuisine coordinate systems. Finally, we used a community discovery algorithm to analyze the structure of the food community. [Results] The province and cuisine networks showed clear community structures. [Limitations] Demographic disparities might affect the conclusions. [Conclusions] The tastes of consumers from different provinces can be classified as “freshly salty”, “hot and spicy”, and “others”. “Sichuan” or “Yungui” dishes are rarely ordered together, while “Jing”, “Hu”, “Lu” and “Dongbei” dishes are often ordered together. In addition, regional cuisines show some geographical proximity.

Content Using Behavior of Academic Social Network System: Case Study of Popular Blogs from Sciencenet.cn

Wang Yuefen,Jia Xinlu,Fu Zhu

New Technology of Library and Information Service. 2016, 32(6): 63-72. https://doi.org/10.11925/infotech.1003-3513.2016.06.08

[Objective] This paper examines content usage behavior on an academic social network system. [Methods] First, we collected popular blog posts from Sciencenet.cn. Second, we classified these posts by their features. Finally, we explored the user-content relationship and the characteristics of content contributors using analysis of variance and correlation tests. [Results] We found that users were more interested in posts exchanging opinions and in those sharing teaching and research experience. Meanwhile, there was a significant correlation between the numbers of comments and recommendations that most posts received. [Limitations] We studied only one Chinese academic social network system and analyzed only its users’ reading behaviors. More research is needed to investigate other behaviors. [Conclusions] Many users exchange views on the academic social network system. These readers are more likely to recommend posts with their own comments to others.

A Collaborative Filtering Recommendation Algorithm Based on Item Probability Distribution

Wang Yong,Deng Jiangzhou,Deng Yongheng,Zhang Pu

New Technology of Library and Information Service. 2016, 32(6): 73-79. https://doi.org/10.11925/infotech.1003-3513.2016.06.09

[Objective] This study tries to reduce the reliance on co-rated items in traditional item similarity measures and thereby improve prediction precision on sparse datasets. [Methods] First, we adapted the Kullback-Leibler (KL) divergence from the signal processing domain to compute item similarities. Second, we calculated the similarity using the density distribution of ratings, which identified neighboring items more effectively. [Results] We evaluated the proposed algorithm on MovieLens, and the F1 measure exceeded 0.65. The accuracy, efficiency and error rates of the new prediction mechanism were much better than those of traditional item similarity measures. [Limitations] The proposed algorithm considered the density of ratings but did not use the absolute values of item ratings. [Conclusions] The proposed algorithm effectively uses rating information to address the sparse dataset issue, and thus has strong practical potential.
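
One way to picture a similarity based on rating distributions is sketched below. It compares the smoothed rating histograms of two items with a symmetrized KL divergence and maps the result into (0, 1]. The additive smoothing, the symmetrization, the `1 / (1 + D)` mapping, and the 1-to-5 rating scale are assumptions chosen for illustration; the abstract does not give the authors' modified formula.

```python
import math
from typing import List

RATING_LEVELS = [1, 2, 3, 4, 5]  # assumed rating scale


def rating_distribution(ratings: List[int], smoothing: float = 1.0) -> List[float]:
    """Smoothed probability of each rating level for one item."""
    counts = [
        smoothing + sum(1 for r in ratings if r == level)
        for level in RATING_LEVELS
    ]
    total = sum(counts)
    return [c / total for c in counts]


def kl_divergence(p: List[float], q: List[float]) -> float:
    """KL(p || q); smoothing upstream guarantees all entries are positive."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kl_similarity(ratings_a: List[int], ratings_b: List[int]) -> float:
    """Similarity in (0, 1]: 1.0 for identical rating distributions."""
    p = rating_distribution(ratings_a)
    q = rating_distribution(ratings_b)
    symmetric = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
    return 1.0 / (1.0 + symmetric)


# The two items need no common raters; only their rating histograms are used.
liked_a = [5, 5, 4, 5]
liked_b = [4, 5, 5]
disliked = [1, 1, 2, 1]
close = kl_similarity(liked_a, liked_b)
far = kl_similarity(liked_a, disliked)
```

Because only each item's own rating histogram is involved, the measure stays defined even when two items share no raters, which is the sparse-data situation the abstract targets.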

Xie Qi,Cui Mengtian

New Technology of Library and Information Service. 2016, 32(6): 80-87. https://doi.org/10.11925/infotech.1003-3513.2016.06.10

[Objective] This paper tries to address the lack of similar services or users in Web service computing caused by data sparsity in Quality of Service (QoS) recommendation. [Methods] First, we created personalized groups of similar users and similar services according to the similarity distance of the target users and services. Second, we used the group-center similarities of the user and service groups to design a new hybrid recommendation algorithm (GHQR), which was tested with real-world data of 1.97 million QoS records. [Results] Compared with two traditional recommendation algorithms, GHQR reduced the Normalized Mean Absolute Error (NMAE) by 31% and 69%, and increased the Coverage by 105% and 163%, respectively. [Limitations] Our study only examined the response time property of QoS; more research is needed to investigate other QoS properties. [Conclusions] Compared with WSRec and CFBUGI, GHQR reduced the NMAE by 26% and 7.7%, and increased the Coverage by 188% and 4%, respectively. GHQR not only enhances prediction accuracy but also increases coverage significantly.
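
For readers unfamiliar with the metric, the sketch below computes NMAE under one common definition from QoS prediction work (MAE divided by the mean of the observed values) and the relative reduction between two methods. The abstract does not state the exact definition used, so the formula, function names, and toy values are assumptions.

```python
from typing import List


def mae(actual: List[float], predicted: List[float]) -> float:
    """Mean absolute error between observed and predicted QoS values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def nmae(actual: List[float], predicted: List[float]) -> float:
    """MAE normalized by the mean observed value (assumed definition)."""
    return mae(actual, predicted) / (sum(actual) / len(actual))


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` lowers the error of `baseline`."""
    return (baseline - improved) / baseline


# Toy response times in seconds (illustrative values, not from the paper).
observed = [1.0, 2.0, 3.0]
predicted = [1.5, 2.0, 2.0]
error = nmae(observed, predicted)        # 0.25
gain = relative_reduction(0.5, error)    # 0.5, i.e. a 50% reduction
```

A statement such as "reduced the NMAE by 31%" corresponds to `relative_reduction` returning 0.31 for the baseline and GHQR errors.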

ng-info-chart: The Visualization Component Based on Customized HTML Tags

Chen Ting,Wang Xiaomei,Lv Weimin

New Technology of Library and Information Service. 2016, 32(6): 88-95. https://doi.org/10.11925/infotech.1003-3513.2016.06.11

[Objective] This research designs and implements ng-info-chart, a front-end visualization component based on the MVC framework AngularJS. [Context] A good information analysis system requires multiple complex visualization charts to present its results, so new systems need advanced, interactive Web-based visualization charts. [Methods] We integrated visualization charts into ng-info-chart using AngularJS directive packaging. The new component can invoke a chart directive directly through a customized HTML tag. [Results] The ng-info-chart visualization component integrates 5 third-party visualization libraries covering 11 types of charts. It supports IE9+, Firefox and other popular Web browsers. [Conclusions] The new visualization component implements asynchronous data loading, automatic detection of data changes, and real-time online visualization. It also simplifies complex visualization tasks for information analysis systems.

Building a National System for the Reimbursable Prescription Drugs

Li Yazi,Zheng Jianli,Zhou Yiyang,Li Guolei

New Technology of Library and Information Service. 2016, 32(6): 96-101. https://doi.org/10.11925/infotech.1003-3513.2016.06.12

[Objective] This paper examines the current reimbursable prescription drug list and creates a national prescription drug catalog for the New Rural Cooperative Medical System (NCMS). [Methods] We adapted the technical framework of the Unified Medical Language System and used mapping algorithms to aggregate the multi-source lists of reimbursable prescription drugs. [Results] We designed the data structure and directory-mapping algorithm for the integrated NCMS drug catalog. [Limitations] More research is needed to analyze interactions among these drugs. [Conclusions] The proposed method helps develop a list of reimbursable drugs from multiple sources. The new system solves existing problems of data dictionary aggregation.

Discovering Knowledge from Electronic Medical Records with Three Data Mining Algorithms

Mu Dongmei,Ren Ke

New Technology of Library and Information Service. 2016, 32(6): 102-109. https://doi.org/10.11925/infotech.1003-3513.2016.06.13

[Objective] This empirical study tries to identify risk factors for diseases from heterogeneous Electronic Medical Records (EMRs). [Methods] First, we collected EMRs with various data structures. Second, we built models to predict risk factors for diseases using three algorithms (decision tree, logistic regression and neural network). Finally, we compared and evaluated these models statistically. [Results] The decision tree model achieved higher recall and precision than the logistic regression and neural network models. However, the differences among them were not significant. [Limitations] We did not optimize the EMR properties. [Conclusions] The decision tree model performs better than the logistic regression and neural network models in discovering risk factors to predict diseases. The framework of knowledge discovery based on data mining algorithms provides some directions for future research.

25 June 2016, Volume 32 Issue 6
