Data Analysis and Knowledge Discovery

Study of Sustainable Support Mechanisms for Long Term Preservation of Digital Publications

Jiancheng Zheng, Xiaolin Zhang, Yan Zhao, Zhenxin Wu, Gaolei Yin, Man Xiao, Xiujuan Chen

Data Analysis and Knowledge Discovery. 2016, 32(12): 1-8. https://doi.org/10.11925/infotech.1003-3513.2016.12.01

[Objective] This paper analyzes the challenges facing long-term preservation of digital publications and aims to promote the development of sustainable support mechanisms. [Methods] Based on a systematic literature analysis, the paper develops a framework of sustainability issues and tools. Building on previous analysis, it presents the needs, standards, and processes for trustworthy auditing and certification, and summarizes cost models and investment models for digital preservation. [Results] The paper puts forth specific suggestions for sustainable support mechanisms for long-term preservation of digital publications. [Limitations] The paper provides only a brief overview of economic support models and related research. [Conclusions] Long-term sustainability in digital preservation comprises format sustainability, system sustainability, and service sustainability; service sustainability in turn covers managerial, financial, and political sustainability. The paper offers recommendations for developing sustainable support mechanisms for digital preservation.

A New Text Clustering Method Based on Semantic Similarity

Qiang Bi, Jian Liu, Yulai Bao

Data Analysis and Knowledge Discovery. 2016, 32(12): 9-16. https://doi.org/10.11925/infotech.1003-3513.2016.12.02

[Objective] This paper proposes an algorithm based on semantic similarity to extract more information from textual resources. [Methods] First, we calculated the semantic similarity of words with the Extended Dictionary of Synonyms (Tongyici Cilin Extended Edition) and created a semantic similarity matrix. Second, we clustered the texts based on the new semantic similarity matrix. [Results] The proposed algorithm was examined with text corpora from Fudan University and the search engine Sogou. Compared to traditional methods, the proposed algorithm achieved the highest precision and purity values (cluster number = 10). [Limitations] Some similarity calculation results were manually adjusted due to the incomplete coverage of the Tongyici Cilin Extended Edition. [Conclusions] The proposed algorithm could extract more latent information from the texts and is an effective method for clustering and recommending textual documents.
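The two steps in this abstract (a word-level semantic similarity matrix, then clustering on it) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `SYNONYM_GROUPS` stands in for a real thesaurus, the 0.8 score for synonyms is arbitrary, and the threshold-based single-link clustering is a simple substitute, not the clustering procedure used in the paper.

```python
from itertools import combinations

# Hypothetical synonym groups standing in for a real thesaurus.
SYNONYM_GROUPS = [
    {"car", "automobile", "vehicle"},
    {"doctor", "physician"},
]

def word_similarity(a, b):
    """1.0 for identical words, 0.8 (arbitrary) for synonyms, else 0.0."""
    if a == b:
        return 1.0
    for group in SYNONYM_GROUPS:
        if a in group and b in group:
            return 0.8
    return 0.0

def text_similarity(t1, t2):
    """Symmetric average of each word's best match in the other text."""
    def directed(src, dst):
        return sum(max(word_similarity(w, v) for v in dst) for w in src) / len(src)
    return (directed(t1, t2) + directed(t2, t1)) / 2

def similarity_matrix(texts):
    """Pairwise text similarities as a symmetric n x n list of lists."""
    n = len(texts)
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = text_similarity(texts[i], texts[j])
    return matrix

def cluster(matrix, threshold=0.5):
    """Single-link clustering: merge texts whose similarity >= threshold."""
    n = len(matrix)
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(n), 2):
        if matrix[i][j] >= threshold:
            parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

With tokenized texts such as `["car", "repair"]` and `["automobile", "repair"]`, the two land in the same cluster even though they share only one literal word, which is the effect a purely lexical similarity would miss.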

A CA-LDA Model for Chinese Topic Analysis: Case Study of Transportation Law Literature

Hong Ma, Yongming Cai

Data Analysis and Knowledge Discovery. 2016, 32(12): 17-26. https://doi.org/10.11925/infotech.1003-3513.2016.12.03

[Objective] This paper aims to improve the effectiveness of extracting topics from Chinese literature with the help of the LDA model and co-word network analysis. [Methods] First, we added keywords to the word segmentation dictionary for the abstracts, which improved the semantic recognition of topic analysis. Second, we proposed a Latent Dirichlet Allocation Model with Co-word Analysis (CA-LDA), in which the generated topic distribution is controlled by weights derived from co-word network topology parameters (i.e., Betweenness Centrality). Finally, we extracted the words with high connectivity (Betweenness Centrality) and high frequency. [Results] The CA-LDA model retrieved high-frequency and high-connectivity words simultaneously, both of which were important for subject analysis. The proposed algorithm could also identify key-node technical vocabulary with the help of co-word analysis. [Limitations] The K value (number of topics) was obtained by cross-validation with perplexity, so it was difficult to classify the document topics with larger K values. More research is needed to deal with this issue. [Conclusions] The proposed model effectively analyzes the topics of Chinese literature on transportation law and could also process literature data from other fields automatically.

Classifying Short Texts with Word Embedding and LDA Model

Qun Zhang, Hongjun Wang, Lunwen Wang

Data Analysis and Knowledge Discovery. 2016, 32(12): 27-35. https://doi.org/10.11925/infotech.1003-3513.2016.12.04

[Objective] This paper proposes a short text classification method based on word embedding and the LDA model, aiming to address the topic-focus and feature sparsity issues. [Methods] First, we built short text semantic models at the “word” and “text” levels. Second, we trained the word embedding with Word2Vec and created a short text vector at the “word” level. Third, we trained the LDA model with Gibbs sampling and then expanded the features of short texts in accordance with the maximum LDA topic probability. Fourth, we calculated the weights of the expanded features based on word embedding similarity to obtain a short text vector at the “text” level. Finally, we merged the “word” and “text” vectors into an integral short text vector and classified the texts with the k-Nearest Neighbors classifier. [Results] Compared to the traditional singleton-based methods, the precision, recall, and F1 of the new method increased by 3.7%, 4.1%, and 3.9%, respectively. [Limitations] Our method was only examined with the k-Nearest Neighbors classifier. More research is needed to study its performance with other classifiers. [Conclusions] The proposed method could effectively improve the performance of short text classification systems.
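The final merging and classification steps can be sketched as follows. This is an illustrative reading of the abstract, not the authors' implementation: the embeddings are assumed to be already trained (for example with Word2Vec), the `expansion` words are assumed to come from each text's most probable LDA topic, and the weighting rule (maximum cosine similarity to any original word) is a hypothetical choice, since the abstract does not give the formula.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def average(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def text_vector(words, embeddings, expansion):
    """Concatenate a 'word'-level vector with a 'text'-level vector.

    words:     tokens of the short text
    expansion: topic words assumed to come from the text's top LDA topic
    """
    word_level = average([embeddings[w] for w in words])
    dim = len(word_level)
    # Hypothetical weighting: each expansion word counts in proportion to
    # its best embedding similarity with the original words.
    weights = [max(cosine(embeddings[t], embeddings[w]) for w in words)
               for t in expansion]
    total = sum(weights)
    if total > 0:
        text_level = [sum(wt * embeddings[t][d] for wt, t in zip(weights, expansion)) / total
                      for d in range(dim)]
    else:
        text_level = [0.0] * dim
    return word_level + text_level

def knn_classify(query, labelled, k=3):
    """Majority vote among the k labelled vectors most similar to query."""
    ranked = sorted(labelled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

Concatenation keeps the two levels separate in the final vector, so a classifier can draw on the expanded topic features even when the short text itself contains very few words.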

Recognizing Chinese Organization Names Based on Deep Learning: A Recurrent Network Model

Danhao Zhu, Lei Yang, Dongbo Wang

Data Analysis and Knowledge Discovery. 2016, 32(12): 36-43. https://doi.org/10.11925/infotech.1003-3513.2016.12.05

[Objective] Chinese organization names are difficult for computers to recognize because of their complex structures and use of rare words. Successful recognition of these names plays a significant role in information extraction and retrieval, knowledge mining, and institutional research evaluation. [Methods] First, we redefined the input and output of organization names based on the recurrent neural network method and the nature of Chinese words and phrases. Second, we proposed a new model at the word level. [Results] Compared to recurrent network models at the phrase level, the proposed method significantly improved the precision, recall, and F value. The F value increased by 1.54% overall and by 11.05% for organization names with rare words. [Limitations] We adopted a greedy strategy to find locally optimal values. A conditional random field method would yield better results from a global perspective. [Conclusions] The proposed method, which uses Chinese word-level features, is easy to implement and generates better results than its phrase-based counterparts.

New Collaborative Filtering Algorithm Based on Relative Similarity

Shuhao Jiang, Liyi Zhang, Zhixin Zhang

Data Analysis and Knowledge Discovery. 2016, 32(12): 44-49. https://doi.org/10.11925/infotech.1003-3513.2016.12.06

[Objective] The purpose of this study is to improve the overall diversity of recommendation results. The proposed algorithm reduces errors caused by the uneven distribution and sparsity of user rating data, and thereby improves recommendation accuracy and diversity. [Methods] First, we generated a relative similarity index based on the number of common ratings and individual weights. Second, we modified the similarity calculation method and the rating prediction algorithm. The proposed model improved the aggregated diversity while maintaining recommendation accuracy, which improved the marketing effects. [Results] On the MovieLens data, compared with results generated by the traditional cosine similarity calculation, the aggregated diversity index increased by 114 and the accuracy improved by 6.5% (with a rating threshold of 3.5 and 20 nearest neighbors). [Limitations] This method is only applicable to nearest-neighbor collaborative filtering and does not cover other recommendation techniques. [Conclusions] The proposed method effectively improves the diversity and accuracy of recommendation results, which significantly improves the user experience.
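The abstract does not give the formula for the relative similarity index, so the sketch below substitutes a well-known heuristic with a similar motivation: cosine similarity over co-rated items, discounted when two users share only a few ratings. The function names, the `cap` parameter, and the toy ratings are all assumptions for illustration, not the paper's method.

```python
import math

def cosine_on_common(u, v):
    """Cosine similarity over co-rated items, plus the number of such items."""
    common = set(u) & set(v)
    if not common:
        return 0.0, 0
    dot = sum(u[i] * v[i] for i in common)
    nu = math.sqrt(sum(u[i] ** 2 for i in common))
    nv = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (nu * nv), len(common)

def weighted_similarity(u, v, cap=5):
    """Discount similarity when it rests on fewer than `cap` common ratings."""
    sim, n_common = cosine_on_common(u, v)
    return sim * min(n_common, cap) / cap

def predict(target, item, ratings, k=20):
    """Similarity-weighted mean rating of the k nearest neighbours for item."""
    neighbours = [
        (weighted_similarity(ratings[target], r), r[item])
        for user, r in ratings.items()
        if user != target and item in r
    ]
    neighbours = sorted(neighbours, reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return None  # no usable neighbour
    return sum(sim * rating for sim, rating in neighbours) / total
```

On sparse data, plain cosine similarity gives a perfect score to two users who agree on a single item; the discount lowers the influence of such thinly supported neighbours on the predicted rating.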

Mining Document Topics Based on Association Rules

Guangce Ruan, Lei Xia

Data Analysis and Knowledge Discovery. 2016, 32(12): 50-56. https://doi.org/10.11925/infotech.1003-3513.2016.12.07

[Objective] This study aims to accurately identify potential knowledge correlations in textual information and thereby enrich the methodology of knowledge mining. [Methods] First, we combined the topic model with association rules. Second, we used the LDA model to extract a topic set from the texts, which not only reduced the textual dimension but also provided a semantic space representation. Finally, we analyzed the semantic ties among the topics with association rules. [Results] We effectively found potential knowledge associations in the document texts with reasonable degrees of support and confidence, and improved the model’s “understanding” of the textual message. [Limitations] During data preprocessing, the self-defined dictionary had some negative effects on the results. [Conclusions] The proposed method could extract latent semantic associations from unstructured textual information and thereby improve the performance of knowledge discovery systems.
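The last step, mining association rules among topics, can be sketched with the standard definitions of support and confidence. The sketch assumes that each document has already been reduced to a set of topic labels (for example, the LDA topics whose probability exceeds some cutoff); the labels, thresholds, and function name are hypothetical, and only pairwise rules are considered.

```python
from itertools import permutations

def mine_rules(doc_topics, min_support=0.3, min_confidence=0.6):
    """Return rules (a, b, support, confidence) meaning 'topic a => topic b'.

    support    = share of documents containing both a and b
    confidence = share of documents containing a that also contain b
    """
    n = len(doc_topics)
    topics = sorted(set().union(*doc_topics))
    rules = []
    for a, b in permutations(topics, 2):
        has_a = sum(1 for d in doc_topics if a in d)
        both = sum(1 for d in doc_topics if a in d and b in d)
        support = both / n
        confidence = both / has_a if has_a else 0.0
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules
```

Because confidence is directional, a pair of topics can yield a strong rule in one direction and a weak one in the other, which is why both orderings of each pair are checked.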

Analyzing Return of Investment for New Energy Project with Big Data: Case Study of SG-ERP System in Y City

Qian Gao, Yang Yang, Guangwei Hu, Chao Xu, Gaofeng Shen, Jian Zhao

Data Analysis and Knowledge Discovery. 2016, 32(12): 57-65. https://doi.org/10.11925/infotech.1003-3513.2016.12.08

[Objective] This paper establishes a data extraction model and an evaluation mechanism for return-on-investment analysis of new energy projects. All data come from the State Grid Jiangsu Electric Power Company in China. [Methods] First, we proposed a new big data management framework based on the SG-ERP system architecture of the State Grid Jiangsu Electric Power Company. Second, we extracted evaluative data based on Golden Gate technology and constructed an evaluation system covering the economic, social, and environmental aspects of the target projects at different development stages (i.e., decision-making, construction, and operation). Finally, we examined the proposed system with the Delphi method. [Results] We obtained the variation coefficient weights and the economic, social, and environmental benefits of new energy projects in Y City in 2015. [Limitations] The classification schemes for the evaluation criteria could be further refined. [Conclusions] The proposed system can evaluate the return on investment of new energy power grid projects. The data extraction method, evaluation system, and weight algorithm could be used in other studies.

Managing Patent Semantic Knowledge with Graph Database

Dongsheng Zhai, He Liu, Jie Zhang, Liwei Cai

Data Analysis and Knowledge Discovery. 2016, 32(12): 66-75. https://doi.org/10.11925/infotech.1003-3513.2016.12.09

[Objective] This paper designs and implements a semantic knowledge management system for the Derwent patent data. [Context] The proposed system collects the patent data as well as the semantic relations among them, and can retrieve patent information through these semantic relations. [Methods] First, we analyzed the Derwent patent data and the semantic relations among the data. Second, we modified the ontology-based method of patent semantic representation. Third, we proposed a Derwent patent graph data model based on the property graph model. Finally, we used the Neo4j graph database to store the instantiated patent data. [Results] We built a semantic knowledge management system using patents on cloud computing technology. The new system showed stronger semantic integrity and faster retrieval speed than traditional ones. [Conclusions] The proposed patent semantic knowledge management system offers a stable and efficient solution for organizing and storing patent data.

Content Authentication for Video Resources of Libraries, Museums and Archives with Semi-fragile Watermarking

Guang Zhu, Mining Feng

Data Analysis and Knowledge Discovery. 2016, 32(12): 76-84. https://doi.org/10.11925/infotech.1003-3513.2016.12.10

[Objective] This study designs a more secure semi-fragile watermarking algorithm for the big data environment, which protects the authenticity and integrity of online video resources of libraries, archives, and museums (LAM). [Context] The algorithm improves the robustness of video resources under normal operations and meets the real-time demands of content authentication. [Methods] First, we embedded a binary watermark image into the videos by quantization modulation to protect their copyright. Second, we inserted an index watermark into the key frames of videos to detect inter-frame modifications. Finally, we generated an authentication watermark with an XOR operation on the least significant bit to detect intra-frame tampering. [Results] The proposed algorithm was robust and transparent under normal operations, with a Peak Signal to Noise Ratio above 33. Tamper localization took around 5 seconds. [Conclusions] The proposed algorithm protects the authenticity and integrity of LAM videos and thereby promotes information sharing and service integration.
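The idea of an LSB-based authentication watermark for locating intra-frame tampering can be illustrated with a simplified sketch. It is not the paper's algorithm: the block layout, the parity rule, and the function names are assumptions, and this toy scheme is fragile (any change to the upper bits is flagged) rather than semi-fragile. Pixels are treated as a flat list of 8-bit grayscale values.

```python
from functools import reduce

def block_bit(block):
    """XOR the upper seven bits of every pixel, then reduce to one parity bit."""
    folded = reduce(lambda acc, p: acc ^ (p >> 1), block, 0)
    return bin(folded).count("1") & 1

def embed(pixels, block_size=4):
    """Store each block's parity bit in the LSB of the block's first pixel."""
    out = list(pixels)
    for start in range(0, len(out), block_size):
        block = out[start:start + block_size]
        # block_bit ignores LSBs, so overwriting an LSB does not change it.
        out[start] = (out[start] & ~1) | block_bit(block)
    return out

def tampered_blocks(pixels, block_size=4):
    """Indices of blocks whose stored bit disagrees with the recomputed one."""
    bad = []
    for index, start in enumerate(range(0, len(pixels), block_size)):
        block = pixels[start:start + block_size]
        if (block[0] & 1) != block_bit(block):
            bad.append(index)
    return bad
```

Embedding changes each marked pixel by at most one gray level, and verification needs no copy of the original frame: the check bit is recomputed from the received pixels and compared with the stored LSB, which localizes a mismatch to a block.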

Public Opinion Dissemination over Social Media: Case Study of Sina Weibo and “8.12 Tianjin Explosion”

Haihan Liao, Yuefen Wang

Data Analysis and Knowledge Discovery. 2016, 32(12): 85-93. https://doi.org/10.11925/infotech.1003-3513.2016.12.11

[Objective] This paper studies the dissemination of public opinion over social media, with the purpose of improving government management and decision making. [Methods] We set hypotheses about information dissemination using the 5W communication model and agenda-setting theory, and then conducted correlation analysis on data from Sina Weibo. [Results] We found that opinion leaders had a greater impact on the communication results. There was a positive correlation between the attributes of micro-blog posters and the communication results, while the correlation between the volume of disseminated information and the results was negative. [Limitations] We chose only a single topic from a specific period of time for the empirical analysis. [Conclusions] This study could help the government, news agencies, and large enterprises understand the impacts and influencing factors of public opinion dissemination.

25 December 2016, Volume 32 Issue 12
