[Objective] This paper examines the implementation details of data curation in order to promote the establishment of data curation policy. [Methods] Based on a review of research achievements related to data curation, the main policy elements are systematically summarized from three aspects: data selection standards, data storage standards, and data communication and sharing mechanisms. [Results] The main elements of data curation policy include: data selection standards (compliance with the data submission process, the priority selection principle, statements of data credibility and availability, and uncontroversial data sources), data storage standards (following the policy, guaranteeing data integrity, meeting common technical standards, and guaranteeing long-term sustainable development), and data communication and sharing mechanisms (compliance with laws and industrial directories, open access licensing, disclaimers on propagation behavior, and data reuse documentation). [Limitations] Future work needs to take the realities of China into account to complete the policy framework in detail. [Conclusions] Research organizations, associations and funding agencies should actively promote and develop data curation policy.
[Objective] To propose a detailed structure specification for scientific data management plans, and to construct a corresponding data curation model from an operational perspective. [Methods] This paper surveys and statistically analyzes the data management plan specifications of the major research and management agencies worldwide, and supplements them according to the requirements and characteristics of current scientific research data management. [Results] The paper presents a detailed structure specification for data management plans with 8 major basic elements and 39 sub-elements, and constructs a data curation model with the data management plan as its core driver. [Conclusions] The detailed structure specification can completely and accurately regulate and guide scientific data management activities, and can effectively control and constrain the data curation process across the whole life cycle of scientific research at the operational level.
[Objective] To summarize the fundamental strategies and core issues of correlation-based Cross-Modal Information Retrieval (CMIR), and to investigate the pros and cons of using partial least squares for feature subspace projection in order to improve retrieval performance. [Methods] On the Wikipedia CMIR dataset, LDA and BOW models are used as the feature representations of text and image resources, cosine distance is used as the similarity measure, and the partial least squares method replaces canonical correlation analysis for learning the subspace projection functions. [Results] The influence of three feature subspace projection methods, canonical correlation analysis, partial least squares regression, and partial least squares correlation, on CMIR results is compared under three retrieval evaluation indicators (P@K, MAP and NDCG); partial least squares correlation obtains the best results. [Limitations] The partial least squares method assumes a linear relationship between the data and an orthogonal relationship between the basis vectors, so non-linear and non-orthogonal problems cannot be solved. [Conclusions] Feature subspace projection learned by partial least squares correlation is more consistent with the original spatial information, and the CMIR results are more stable.
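A minimal sketch of the partial least squares correlation step described above: PLSC can be computed as the SVD of the cross-covariance between the two centered feature spaces, after which cross-modal matching uses cosine similarity in the shared subspace. The toy feature matrices below are invented stand-ins for the paper's LDA text features and BOW image features.

```python
import numpy as np

def plsc_projections(X, Y, k):
    """Partial least squares correlation: SVD of the cross-covariance
    between centered text features X and image features Y."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    return U[:, :k], Vt[:k].T  # one projection basis per modality

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 4))        # shared semantics of 100 doc-image pairs
X = latent @ rng.normal(size=(4, 20))     # toy "text" features (e.g. LDA topics)
Y = latent @ rng.normal(size=(4, 30))     # toy "image" features (e.g. visual BOW)
Wx, Wy = plsc_projections(X, Y, k=4)

# A text query is matched against all images in the shared subspace.
q = (X[0] - X.mean(axis=0)) @ Wx
scores = [cosine_sim(q, (y - Y.mean(axis=0)) @ Wy) for y in Y]
print(int(np.argmax(scores)))
```

Ranking all images for each text query in this way is what the P@K, MAP and NDCG indicators would then be computed over.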
[Objective] By building a simple data sample, this paper addresses the low efficiency of traditional page recognition methods. [Methods] The method uses URL features as the basis of recognition, and uses a Support Vector Machine (SVM) to recognize the page type. [Results] The precision of the method is 91.2%, and its efficiency is improved by nearly 60%. [Limitations] When the URL features are not obvious, or even run contrary to the page type, the recognition accuracy is greatly reduced. [Conclusions] The experimental results show that the method has a clear advantage in efficiency and can improve the efficiency of the collection system.
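A minimal sketch of URL-based page-type recognition with an SVM. The URLs, labels, and character n-gram features below are illustrative assumptions, not the paper's actual data or feature set; the point is that classification needs only the URL string, which is the source of the speed advantage over downloading and parsing each page.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical training sample: list pages vs. content pages.
urls = [
    "http://example.com/news/list_1.html",
    "http://example.com/news/list_2.html",
    "http://example.com/forum/index.php?page=3",
    "http://example.com/news/2016/0412/detail_8871.html",
    "http://example.com/news/2016/0413/detail_8892.html",
    "http://example.com/article/show.php?id=1024",
]
labels = ["list", "list", "list", "content", "content", "content"]

# Character n-grams capture URL patterns such as "list_" or "detail_"
# without fetching any page content.
clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(3, 5)),
                    LinearSVC())
clf.fit(urls, labels)
print(clf.predict(["http://example.com/news/2016/0415/detail_9001.html"])[0])
```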
[Objective] To solve the problem of fraudulent transactions on e-commerce platforms. [Methods] This paper proposes a method that combines Deep Belief Networks and fuzzy sets, based on consumers' purchase histories and reviews, and recognizes fraudulent transactions by identifying the users involved in them (cheaters). [Results] In experiments on data crawled from Taobao.com, the method achieves an accuracy of 89%; compared with shallow machine learning models, its comprehensive performance improves significantly. [Limitations] Compared with the huge populations of normal users and of users involved in fraudulent transactions, the experimental dataset is relatively small, and the test data come only from Taobao.com, without validation on data from other e-commerce platforms. [Conclusions] The method can identify the users involved in fraudulent transactions, and thereby reduce fraud in e-commerce.
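A rough stand-in for the deep part of the pipeline above: scikit-learn has no Deep Belief Network, so this sketch stacks a BernoulliRBM feature learner with a logistic classifier, which is the classic shallow approximation of RBM-based pretraining. The behavioural features (review burst rate, repeat-buy ratio, and so on) and the toy data are hypothetical, not the paper's actual feature set.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(42)
n = 200
# Toy behavioural features scaled to [0, 1]; cheaters (label 1) are
# simulated with uniformly higher activity scores than normal users.
normal = rng.uniform(0.0, 0.4, size=(n, 4))
cheats = rng.uniform(0.6, 1.0, size=(n, 4))
X = np.vstack([normal, cheats])
y = np.array([0] * n + [1] * n)

model = Pipeline([
    ("rbm", BernoulliRBM(n_components=8, learning_rate=0.05,
                         n_iter=20, random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X, y)
print(model.score(X, y))  # training accuracy on the separable toy data
```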
[Objective] To help online group-buying consumers find high-quality merchants quickly, and to help merchants improve their credit efficiently. [Methods] Similarity weights are used to distribute the weights of the index system, the resulting composite indicator variables are taken as the parameters of an ant colony algorithm, and a credit evaluation model based on ACO and the Similarity Weight Algorithm is established. [Results] Empirical results show that the model can effectively find the shortest path, saving time and money while obtaining high-quality merchants. [Limitations] The impact of special trades, such as refunds and fictitious trading, on online group-buying credit evaluation is not considered, and the other ACO parameters are taken directly from previous research conclusions. [Conclusions] The results can help merchants improve credit, promote consumer satisfaction, and provide references for further research on online group-buying problems.
[Objective] To reveal the relationships among the contents, topics and authors of documents, this paper presents the Dynamic Author Topic (DAT) model, which extends the LDA model. [Context] Extracting features from large-scale texts is an important task for informatics researchers. [Methods] First, collect NIPS conference papers as the dataset and preprocess them. Then divide the dataset into parts by publication time, forming a first-order Markov chain, and use perplexity to determine the number of topics. Finally, use Gibbs sampling to estimate the author-topic and topic-word distributions in each time slice. [Results] Experiments show that each document is represented as probability distributions over topic-words and author-topics, and the evolution of authors and topics can be observed along the time dimension. [Conclusions] The DAT model can efficiently integrate contents and extra-textual features to accomplish text mining.
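A minimal sketch of one step in the pipeline above: choosing the number of topics by perplexity. scikit-learn's variational LDA stands in for the paper's Gibbs-sampled model, and the toy corpus is invented for illustration; in practice each time slice of the NIPS collection would be scored the same way.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "neural network learning gradient descent",
    "gradient descent neural network training",
    "topic model latent dirichlet allocation",
    "latent topic model gibbs sampling inference",
    "reinforcement learning reward policy agent",
    "policy agent reward reinforcement learning",
]
X = CountVectorizer().fit_transform(docs)

# Fit LDA for several candidate topic counts K and keep the one with
# the lowest perplexity on the corpus.
scores = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    scores[k] = lda.perplexity(X)
best_k = min(scores, key=scores.get)
print(best_k, scores[best_k])
```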
[Objective] This paper establishes a BA network model of the public opinion transfer process, taking the "Bandwagon Effect" and "Threshold Effect" as its starting point and drawing on special inspections of public opinion. [Methods] Real online data of a public opinion transfer network are collected, and link prediction is used to predict the unknown links between public opinion nodes that will appear in the forthcoming transfer process, on both the simulated BA network data and the real public opinion data. [Results] The analysis shows that, among the many similarity-index algorithms, the LP link prediction algorithm gives the best prediction, which indicates that it is suitable for link prediction in such public opinion delivery networks. [Limitations] The link prediction similarity index itself is not improved. [Conclusions] From a data point of view, this paper proposes an effective method for predicting public opinion trends, providing theoretical support for the control of online public opinion.
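A minimal sketch of the Local Path similarity index, assuming that is the "LP" index the abstract refers to: candidate links are scored by counting 2-step paths plus a small weight on 3-step paths, S = A² + εA³. The toy adjacency matrix is invented for illustration; the real input would be the public opinion transfer network.

```python
import numpy as np

def local_path_scores(A, eps=0.01):
    """Local Path index: 2-step path counts plus eps-weighted 3-step counts."""
    A2 = A @ A
    return A2 + eps * (A2 @ A)

# 5-node undirected toy network (edges: 0-1, 0-2, 1-2, 1-3, 3-4).
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0],
    [1, 1, 0, 0, 0],
    [0, 1, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

S = local_path_scores(A)
# Rank non-existing links by score, skipping existing edges and self-loops.
candidates = [(i, j, S[i, j]) for i in range(5) for j in range(i + 1, 5)
              if A[i, j] == 0]
candidates.sort(key=lambda t: -t[2])
print(candidates[0][:2])  # → (0, 3), the most likely future link
```

The top-ranked pairs are the predicted future links; P@K-style indicators can then compare them against the links that actually appear.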
[Objective] This paper analyzes the structural features of Linked Open Data (LOD); the results can be used to guide the organization of linked data in practice. [Methods] The LOD network is described with degree distribution, average path length, clustering coefficient and other indexes, and compared with the scale-free and small-world networks of complex network theory. [Results] The degree distribution of the LOD network follows a power law, approximating a scale-free network, while the Publication subnet of LOD shows a relatively homogeneous exponential distribution; both networks have a short average path length and a high clustering coefficient. [Limitations] Key nodes are not assigned greater weight. [Conclusions] The small-world phenomenon of LOD can improve retrieval efficiency, while the scale-free feature reduces the stability of the entire network.
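A minimal sketch of the two small-world indexes named above, computed on a toy undirected graph; the real analysis would run on the LOD dataset-level network rather than this invented example.

```python
from collections import deque
from itertools import combinations

def avg_path_length(adj):
    """Mean shortest-path length over all connected node pairs (BFS per node)."""
    total, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        total += sum(d for n, d in dist.items() if n > s)
        pairs += sum(1 for n in dist if n > s)
    return total / pairs

def avg_clustering(adj):
    """Mean local clustering coefficient: realized vs. possible neighbor links."""
    cs = []
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            cs.append(0.0)
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        cs.append(2 * links / (k * (k - 1)))
    return sum(cs) / len(cs)

# Toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(avg_path_length(adj), avg_clustering(adj))  # → 1.333... 0.5833...
```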
[Objective] To discuss how to obtain terminology taxonomic relations from unstructured Chinese domain text. [Methods] Based on Digital Library domain text from CNKI, a terminology hierarchy is constructed through terminology extraction, terminology Vector Space Model construction, BIRCH clustering, and cluster tag distribution. [Results] The terminology taxonomic relations of the Digital Library domain are obtained and their effectiveness is evaluated: the clustering accuracy reaches 80.88%, and the accuracy of cluster tag extraction reaches 89.71%. [Limitations] The effectiveness is evaluated by random sampling, and only one baseline method is compared. [Conclusions] Compared with the K-means clustering method, the BIRCH algorithm has an obvious advantage for constructing terminology taxonomic relations, with higher execution efficiency and clustering effectiveness.
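A minimal sketch of the clustering step: BIRCH on toy "term vectors", with K-means as the comparison baseline mentioned above. The real input would be the Vector Space Model built from the extracted terminology, not these invented 2-D points.

```python
import numpy as np
from sklearn.cluster import Birch, KMeans

rng = np.random.default_rng(0)
# Three well-separated groups of 2-D toy term vectors, 20 terms each.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.1, size=(20, 2)),
    rng.normal(loc=(5, 5), scale=0.1, size=(20, 2)),
    rng.normal(loc=(0, 5), scale=0.1, size=(20, 2)),
])

birch_labels = Birch(n_clusters=3).fit_predict(X)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Both recover the three groups on such clean data; BIRCH additionally builds
# its CF-tree incrementally, which is its efficiency advantage on large corpora.
print(len(set(birch_labels)), len(set(kmeans_labels)))
```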
[Objective] To construct the project website of "Open Resources Development" with Drupal, in order to manage and distribute the project's outputs. [Context] The "Open Resources Development" project needs to build an output distribution platform under limited time and technical conditions, and Drupal can satisfy these needs through its flexibility and simplicity. [Methods] Using both the basic and extended modules of Drupal, the website is constructed at the data layer and the presentation layer, and key technical problems in theme design and website upgrading are solved. [Results] The website construction and content development are completed with Drupal in a short time and at low cost, releasing the project outputs in a timely manner. [Conclusions] Drupal can well satisfy librarians' needs for constructing small project websites or service platforms.
[Objective] To extract information from Chinese plant species diversity description text. [Methods] Taking the plant species diversity domain ontology as the foundation, a strategy of stepwise selection and annotation at the paragraph, sentence and concept levels is adopted. [Results] On a test sample of 4 734 information points, the extraction precision, recall and F-measure reach 0.86, 0.85 and 0.85 respectively. [Limitations] The rule set should be improved in the future to better handle the problems of extracting information from description text. [Conclusions] The proposed scheme can effectively extract information from Chinese plant species diversity description text.
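A minimal sketch of the stepwise selection-and-annotation strategy, using a tiny hypothetical term list in place of the real domain ontology, invented sample text, and simple regular-expression rules standing in for the paper's rule set.

```python
import re

# Hypothetical ontology concept triggers: altitude, flowering period, distribution.
ontology_terms = {"海拔", "花期", "分布"}

text = "本种分布于云南。生于海拔1000-2000米的山坡。花期5-7月。本段与物种多样性无关。"

# Step 1 (paragraph/sentence level): keep only sentences containing an ontology term.
sentences = [s for s in re.split(r"[。；]", text) if s]
selected = [s for s in sentences if any(t in s for t in ontology_terms)]

# Step 2 (concept level): annotate concepts and extract their values by rule.
rules = {
    "海拔": re.compile(r"海拔([\d\-]+)米"),
    "花期": re.compile(r"花期([\d\-]+)月"),
}
points = {}
for s in selected:
    for concept, pat in rules.items():
        m = pat.search(s)
        if m:
            points[concept] = m.group(1)
print(points)  # → {'海拔': '1000-2000', '花期': '5-7'}
```

Each extracted concept-value pair corresponds to one "information point" in the evaluation above.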
[Objective] To apply third-party source metadata, such as Web of Science metadata, to the NSTL joint data processing system. [Context] Under the NSTL Development Program, NSTL needs to move beyond processing metadata by itself and acquire metadata in various ways, such as purchasing third-party metadata. [Methods] The Web of Science and Scopus Schemas are mapped to the NSTL Schema, and the characteristics of Web of Science metadata are analyzed to revise the NSTL Schema. Based on the mapping results, third-party metadata are exported in NSTL Schema format and integrated into the NSTL joint data processing system. [Results] The third-party metadata are integrated into the NSTL joint data processing system rapidly and efficiently. [Conclusions] The application of Web of Science metadata in the NSTL joint data processing system has improved the data processing speed, and the targeted revision of the existing NSTL Schema widens the framework for adding other third-party metadata.
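A minimal sketch of the schema-mapping step: a field-to-field mapping table, applied record by record on export. The NSTL-side field names here are hypothetical stand-ins, since the actual NSTL Schema element set is not given in the abstract; the source tags follow common Web of Science export conventions.

```python
# Hypothetical mapping from Web of Science export tags to NSTL-style fields.
WOS_TO_NSTL = {
    "TI": "title",         # article title
    "AU": "creator",       # authors
    "SO": "source_title",  # journal name
    "PY": "pub_year",
    "DI": "doi",
}

def map_record(wos_record, mapping=WOS_TO_NSTL):
    """Export a Web of Science-style record as an NSTL-style record,
    dropping source fields that have no mapping."""
    return {mapping[k]: v for k, v in wos_record.items() if k in mapping}

rec = {"TI": "Linked data study", "AU": ["Li, M."], "PY": "2016", "XX": "ignored"}
print(map_record(rec))
# → {'title': 'Linked data study', 'creator': ['Li, M.'], 'pub_year': '2016'}
```

Unmapped fields surfaced by this step are exactly what would motivate the targeted revision of the NSTL Schema described above.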