[Objective] This study investigates how fixity checks can be used to ensure the consistent fixity of digital objects in preservation practice. [Methods] By reviewing standards and specifications, comparing fixity check tools, and surveying current preservation projects and systems, this paper gives a comprehensive analysis based on the preservation lifecycle. [Results] According to actual needs, diversified fixity check methods and strategies can be applied at the key nodes (ingest, AIP creation, storage, delivery) of the entire lifecycle of digital preservation. The paper also examines how to store fixity information and how to build fixity check workflows. [Conclusions] This study can help follow-up researchers and developers quickly understand and master fixity check methods and strategies, and develop appropriate tools and strategies according to actual needs, so that preservation systems can effectively ensure the consistent fixity of digital objects.
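The record-then-recheck pattern behind fixity checking can be sketched with a standard cryptographic digest. This is a minimal illustration using Python's `hashlib` (SHA-256 is one common choice among the algorithms preservation tools support), not any particular tool's implementation:

```python
import hashlib

def compute_fixity(data: bytes, algorithm: str = "sha256") -> str:
    """Compute a fixity value (checksum) for a digital object's bytes."""
    h = hashlib.new(algorithm)
    h.update(data)
    return h.hexdigest()

def verify_fixity(data: bytes, recorded: str, algorithm: str = "sha256") -> bool:
    """Re-compute the checksum and compare it with the stored fixity value."""
    return compute_fixity(data, algorithm) == recorded

# At ingest: record the fixity value alongside the object.
obj = b"example digital object"
recorded = compute_fixity(obj)

# At later lifecycle nodes (storage audit, delivery): re-check.
assert verify_fixity(obj, recorded)
assert not verify_fixity(obj + b"corrupted", recorded)
```

In a real repository the recorded digest would be stored as preservation metadata (e.g., alongside the AIP) and the verification run on a schedule or at delivery.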
[Objective] Discuss the role of social networks in solving problems such as data sparseness and cold start in traditional personalized recommendation systems. [Coverage] This paper reviews Chinese and international research literature on trust-based recommendation published since 2004, retrieved from Springer and Google Scholar. [Methods] It summarizes the related literature from the perspectives of trust and distrust. [Results] Based on this summary, the paper identifies existing problems such as deficient methods for calculating trust and the lack of in-depth study of distrust. [Limitations] Other factors in social networks should be combined with trust in an in-depth comparative analysis. [Conclusions] Context-aware trust recommendation and mining the value of weak relationships in social networks can be valuable new research directions.
[Objective] Summarize research on the technical methods and tools of RDB-to-RDF conversion and extract the key technologies. [Coverage] English and Chinese literature related to RDB-to-RDF is retrieved from the Elsevier, Springer, and CNKI databases. [Methods] A literature survey is conducted, summarized by research topic. [Results] The paper summarizes and analyzes mapping ideas, techniques, and implementation methods, compares important features and application scenarios of mapping tools, and enumerates typical applications. [Limitations] Specific quantitative evaluation is lacking in the comparison of mapping tools. [Conclusions] This study is helpful for understanding the key techniques, tools, and main application scenarios of the RDB-to-RDF process.
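The "direct mapping" idea common to RDB-to-RDF tools (each row becomes a subject URI, each column a predicate) can be sketched as follows; the table, base URI, and vocabulary namespace below are invented for illustration:

```python
# Hypothetical relational rows from a "person" table.
rows = [(1, "Alice", "alice@example.org"), (2, "Bob", "bob@example.org")]
columns = ("id", "name", "email")

BASE = "http://example.org/person/"   # assumed subject base URI
VOCAB = "http://example.org/vocab#"   # assumed predicate namespace

def row_to_triples(row, columns):
    """Direct mapping: the key column forms the subject URI; every other
    column/value pair becomes a (subject, predicate, literal) triple."""
    subject = f"<{BASE}{row[0]}>"
    return [(subject, f"<{VOCAB}{col}>", f'"{value}"')
            for col, value in zip(columns[1:], row[1:])]

triples = [t for row in rows for t in row_to_triples(row, columns)]
for s, p, o in triples:
    print(s, p, o, ".")  # N-Triples-like serialization
```

Real tools (following, e.g., W3C R2RML) add customizable mappings, datatype handling, and foreign-key-to-object-property conversion on top of this basic scheme.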
[Objective] Improve the degree of user participation, realize efficient management of mass data and quick querying of information, and improve the information organization and representation of enterprise websites. [Context] Masses of confusing information and many counterfeit products in online trading make users more dependent on enterprise websites for quality assurance; as a result, new requirements arise for the information organization and representation of enterprise websites. [Methods] This paper proposes a method that uses RDF to express the resources, tags, and users of a Folksonomy and allows users to assign tags freely by implementing the storage and querying of RDF data; the method is then applied to enterprise websites. [Results] The storage and querying of resources, tags, and users are realized, enabling users to tag freely and to perform related queries. [Conclusions] The method strengthens communication between enterprises and users, provides a means for fully opening and sharing information among users, and widens the application range of Folksonomy.
[Objective] In order to improve the accuracy of network-based text feature extraction, this paper builds a more accurate text network by dependency parsing. [Methods] The method determines the semantic associations between feature words according to the results of dependency parsing, and the direction of the edges according to the dependency direction between feature words. An improved PageRank algorithm is then used to calculate node importance in the network to complete feature extraction. [Results] Experimental results show that, compared with a co-word network, text feature extraction based on a dependency parsing network can improve document clustering to some extent. [Limitations] The paper does not distinguish between different dependency types when determining edge direction between feature words. [Conclusions] The proposed method based on a dependency parsing network is effective for text feature extraction.
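The node-importance step can be illustrated with a plain power-iteration PageRank over a small directed word network. The words and edges below are invented for illustration, and this is the standard algorithm, not the paper's improved variant:

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a directed word network.
    edges: list of (source, target) pairs, e.g. derived from dependency relations."""
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for s, t in edges:
        out[s].append(t)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for s in nodes:
            if out[s]:
                share = damping * rank[s] / len(out[s])
                for t in out[s]:
                    new[t] += share
            else:  # dangling node: distribute its rank evenly
                for t in nodes:
                    new[t] += damping * rank[s] / len(nodes)
        rank = new
    return rank

# Hypothetical directed edges between dependency-linked feature words.
scores = pagerank([("improve", "accuracy"), ("extract", "feature"),
                   ("feature", "accuracy")])
top_feature = max(scores, key=scores.get)  # the most-pointed-to word ranks highest
```

Top-ranked nodes would then be selected as the extracted features of the document.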
[Objective] To address the semantic deficiency of text representation based on the Vector Space Model, this paper proposes a Chinese text representation algorithm based on complex networks. [Methods] Word relevance is calculated from the concept pages, link structure, and category system extracted from Wikipedia. Feature words of texts are then represented as nodes, semantic relevance relations between words as edges, and word relevance values as edge weights of a weighted complex network. [Results] Experimental results show that the proposed text representation method improves the calculation of text similarity and the performance of text categorization. [Limitations] The selection rules for the co-occurrence window and span draw on existing research. [Conclusions] This text representation method better preserves structural information and the correlations between words. Moreover, the Wikipedia-based computation of word relevance makes the semantic information represented by the text network more accurate.
[Objective] This paper explores the influence of combining social tagging with text content. [Methods] Taking English and Chinese blogs as examples, TF×IDF, TextRank, and TextRank×IDF are used as text feature extraction methods; tags are combined with text content using two weighting methods; and the AP clustering algorithm is used to cluster the samples. [Results] One of the three feature extraction methods performs best in clustering. Weighting content with tags improves the clustering of English blogs to varying degrees, but not of Chinese blogs under the Sigmoid method. Of the two similarity weighting schemes, the linear method performs better than the Sigmoid method. [Limitations] The authors could not find the best weighting coefficients for tag similarity and content similarity. The AP clustering algorithm cannot be applied to big data, and a large number of clusters interferes with the visualization of results. [Conclusions] The weighted combination of social tag similarity and text content similarity can improve the clustering of Web text.
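As one plausible reading of the two similarity weighting schemes (the abstract does not give the exact formulas), the linear and Sigmoid combinations might look like this; the coefficient `alpha` and the particular sigmoid form are assumptions for illustration:

```python
import math

def linear_combine(sim_content, sim_tags, alpha=0.5):
    """Linear weighting of content similarity and tag similarity.
    alpha is an assumed mixing coefficient in [0, 1]."""
    return alpha * sim_content + (1 - alpha) * sim_tags

def sigmoid_combine(sim_content, sim_tags):
    """One possible 'Sigmoid method': derive the mixing weight from the tag
    similarity via a logistic function, then blend the two similarities."""
    weight = 1.0 / (1.0 + math.exp(-sim_tags))
    return weight * sim_content + (1 - weight) * sim_tags
```

Either combined similarity could then feed a clustering algorithm such as AP, which takes a pairwise similarity matrix as input.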
[Objective] Because the differences among open Web scientific and technical information are minor, general rule-based and statistical learning methods cannot classify such information effectively for practical application demands. [Methods] By analyzing the content and structure of Web pages, and utilizing open resources (such as domain ontologies and thesauri) for self-learning of domain features, this paper proposes a semi-supervised classification model for scientific and technical information. [Results] Experimental results show that the proposed method achieves a precision of 0.9016, recall of 0.8756, and F1 score of 0.8884, which are superior to Naive Bayes classification. [Limitations] When applying the proposed method to a new domain, domain seed features still need to be supplied. [Conclusions] The proposed method can classify scientific and technical information effectively and satisfies the demands of deep information analysis and processing.
[Objective] Collect and collate new words to expand the current dictionary, which can improve the accuracy of Chinese word segmentation and promote the development of Chinese information processing. [Methods] A new word recognition method based on context extension is proposed, relying on the features of query strings and new words. First, a seed collection is built from query-string features and candidate new words are obtained through full extension. Second, candidates are selected according to word time span. Finally, candidates are filtered using an improved left-right entropy based on word boundary information. [Results] Experiments on a Sogou log show that the precision P@100 reaches 89.60%. [Limitations] The scale of the contrast strings affects the accuracy of new word recognition to a certain extent. [Conclusions] The experimental results demonstrate that the method is suitable for search logs that lack the context information needed to identify new words.
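The left-right entropy filter in the final step can be sketched as follows. The corpus and candidate below are toy examples, and only the standard boundary-entropy idea is shown, not the paper's improved variant:

```python
import math
from collections import Counter

def boundary_entropy(neighbors):
    """Shannon entropy of the neighboring-character distribution.
    High entropy means the candidate appears in varied contexts."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())

def left_right_entropy(corpus, candidate):
    """Collect the left/right neighbor characters of `candidate` in `corpus`
    and return (left_entropy, right_entropy)."""
    left, right = Counter(), Counter()
    start = corpus.find(candidate)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(candidate)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(candidate, start + 1)
    return boundary_entropy(left), boundary_entropy(right)

# A candidate with varied neighbors on both sides is likely a free-standing word.
left_e, right_e = left_right_entropy("ab词x cd词y ef词z", "词")
```

Candidates whose left and right entropies both exceed a threshold would be kept as new words; candidates glued to fixed neighbors (low entropy on one side) would be filtered out.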
[Objective] In order to reduce the impact of noisy samples on classification performance during the construction of training samples, this paper proposes a noise-processing method based on the distribution characteristics of category data in training samples. [Methods] The method represents the degree of similarity among documents in a category by defining Category Cluster Density, and then normalizes the similarity distribution. Combined with statistical methods, it adopts approximate confidence interval estimation to obtain document pairs that contain noisy samples. Based on the relative entropy of the distribution and Category Cluster Density, the correctness of noisy-document recognition is verified. [Results] Classification performance on specialized and self-built corpora improves by 1.21% to 4.83%. [Limitations] Future work will expand the richness of the samples and test them in various fields and multi-type data environments. [Conclusions] The method is feasible and can effectively detect noisy documents. Meanwhile, it can screen a large number of samples in a single pass.
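The approximate confidence-interval step can be illustrated with a simple z-score test on the similarity distribution. This is a sketch under a normality assumption; the paper's exact estimator, Category Cluster Density, and relative-entropy verification are not reproduced:

```python
import statistics

def detect_noise(similarities, z_threshold=1.96):
    """Flag documents whose within-category similarity falls outside an
    approximate 95% confidence interval (normal assumption, sketch only).
    Returns the indices of suspected noisy documents."""
    mean = statistics.mean(similarities)
    stdev = statistics.pstdev(similarities)
    return [i for i, s in enumerate(similarities)
            if stdev > 0 and abs(s - mean) / stdev > z_threshold]

# A document far less similar to its category than the rest is flagged.
suspects = detect_noise([0.8] * 20 + [0.1])
```

Flagged documents would then go through the verification step before being removed from the training set.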
[Objective] Aiming at Chinese news reports on medical research published in top journals, design an automatic gathering system that can collect news from different medical news websites, extract content and keywords, and realize subject classification and journal navigation. [Context] Provide a source of foreign academic research information for active push and subject services. [Methods] HttpClient and HtmlParser are used to build a Web page collector that acquires news list pages and content. IK Analyzer 2012 and MeSH are used for medical keyword extraction and subject classification. [Results] The system achieves automatic gathering, keyword extraction, and subject classification of news from specified websites. [Conclusions] Librarians can use this system to provide effective medical academic information push services for medical researchers.
[Objective] A software tool is designed to implement duplicate checking and data fusion for papers indexed by SCI and by EI. [Context] The software can help paper analysts obtain a dataset in a unified format and meet the demands of micro-analysis of subject information. [Methods] Two automatic algorithms and one semi-automatic algorithm are used to complete accurate duplicate checking of papers indexed by SCI and EI. Data fusion is based on a detailed analysis of the text features of SCI and EI data fields. [Results] The tool can mark papers that are duplicated between SCI and EI and create a de-duplicated, fused data sheet. [Conclusions] The construction problem of datasets from different data sources is solved effectively, and the design ideas can also be applied to other databases.
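The abstract does not describe its algorithms in detail, but the title-matching step that such automatic duplicate checking typically relies on might be sketched as follows; the similarity threshold and record fields are assumptions:

```python
import difflib
import re

def normalize_title(title):
    """Normalize a record title: lowercase, strip punctuation, collapse spaces."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", title.lower())).strip()

def is_duplicate(sci_record, ei_record, threshold=0.9):
    """Flag a SCI/EI record pair as a duplicate when their normalized titles
    are near-identical. A real tool would combine this with checks on DOI,
    authors, year, etc."""
    a = normalize_title(sci_record["title"])
    b = normalize_title(ei_record["title"])
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

sci = {"title": "A Study of Fixity Checks in Digital Preservation"}
ei = {"title": "A study of fixity checks in digital preservation."}
# The pair above differs only in case and punctuation, so it is flagged.
```

Records flagged as duplicates would then have their SCI and EI fields merged into one row of the fused data sheet.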
[Objective] Peking University Library (PKULib) conducted a series of usability studies to provide valuable usability improvement suggestions for its new home page design. [Context] With researchers' growing dependence on digital libraries, a library website is no longer merely an image or window of the library, but a tool with excellent usability that provides quick access to required resources. A user-friendly interface has a direct impact on website usability. [Methods] This paper focuses on a heuristic evaluation of the old PKULib website and a peer study of other library websites. [Results] Based on the results of the heuristic evaluation and the peer study, the paper introduces the PKULib website redesign. [Conclusions] The paper derives common practices in library website design, which provide valuable interface design recommendations for improving the usability of library websites.