[Objective] To investigate and summarize typical semantic retrieval systems for scientific literature. [Coverage] Literature on semantic search retrieved via Web of Knowledge or Google Scholar, together with references and research reports of semantic retrieval systems. [Methods] This paper classifies current systems into four categories according to the degree of semantic processing: semantic query expansion retrieval systems, concept- or entity-centered retrieval systems, relation-centered retrieval systems, and retrieval systems for knowledge discovery. [Results] The authors propose a basic framework of semantic retrieval systems for scientific literature and summarize their features. [Limitations] Performance evaluation of the semantic retrieval systems is not covered. [Conclusions] This study provides a useful guide for developing a semantic retrieval system for scientific literature.
[Objective] Provide evidence for the choice and application of contributor identifier systems from the perspective of the construction mechanism of the Open Researcher & Contributor ID (ORCID) system. [Methods] Analyze the ORCID system from three aspects, namely construction model, claim/authentication pattern and metadata specification, and compare it with other contributor identifier systems based on documents and cases. [Results] ORCID's construction mechanism and characteristics are obtained. [Conclusions] ORCID is a bottom-up, jointly constructed system driven by users. It mixes different types of claims and authentications and disambiguates based on authority and trust degree. Its metadata control improves the degree of authority through identifier linkage and resolution.
[Objective] This article aims to extract concept attribute instances from innovation sentences and then to explore the relationships between concepts. [Methods] A method for recognizing core concepts and concept attribute instances from a dependency tree is presented. The method builds on the results of semantic role labeling and dependency parsing, and takes advantage of the properties of classes in a domain Ontology. Considering the features of dependency parsing, a concept combination module and a conjunction relationship detection module are designed to improve the recognition of concept attribute instances. [Results] The results show that the F value of core concept recognition is 77.94%, and the average F value of concept attribute instance recognition is around 90%. [Limitations] The Stanford parsing tool produces some wrong parses, which may lead to inaccurate recognition. The Properties or Attributes classes in NCIt are not well filtered and standardized. [Conclusions] This method can effectively extract core concepts and concept attribute instances from innovation sentences.
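The dependency-tree mining idea above can be sketched in a few lines. This is a minimal illustration on a toy parse: the token format, Universal-Dependencies-style relation labels, and function names are assumptions, and the paper's actual pipeline (Stanford parsing, semantic role labeling, NCIt class properties) is not reproduced here.

```python
# Minimal sketch: pick the root of a dependency tree as the core-concept
# candidate, take its noun modifiers as attribute instances, and expand
# coordinated attributes via 'conj' (the conjunction-detection idea).
# Token schema and relation labels here are illustrative assumptions.

def children(tokens, head_id, rels=None):
    """Return tokens whose head is head_id, optionally filtered by relation."""
    return [t for t in tokens
            if t["head"] == head_id and (rels is None or t["rel"] in rels)]

def extract(tokens):
    """Core concept = root token; attributes = its modifiers plus their conjuncts."""
    root = next(t for t in tokens if t["head"] == 0)
    attrs = children(tokens, root["id"], {"amod", "compound", "nmod"})
    # conjunction-relation detection: a modifier's conjuncts are modifiers too
    for a in list(attrs):
        attrs.extend(children(tokens, a["id"], {"conj"}))
    return root["word"], [a["word"] for a in attrs]

# Toy parse of "novel lightweight battery and electrode design"
toy = [
    {"id": 1, "word": "novel",       "head": 4, "rel": "amod"},
    {"id": 2, "word": "lightweight", "head": 4, "rel": "amod"},
    {"id": 3, "word": "battery",     "head": 4, "rel": "compound"},
    {"id": 4, "word": "design",      "head": 0, "rel": "root"},
    {"id": 5, "word": "electrode",   "head": 3, "rel": "conj"},
]
core, attrs = extract(toy)
```

In this toy example the root "design" is the core-concept candidate and the coordinated modifier "electrode" is recovered through the `conj` link.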
[Objective] By building a mathematical model, this paper studies the information interaction between micro-blog and other network media against the background of big data. [Methods] Analyze the information interaction features of micro-blog public opinion, define the information interaction coefficient, and establish a differential equation model of micro-blog information interaction. [Results] Using Matlab numerical simulation and six cases of network public opinion to analyze the features of the model and validate it, the paper concludes that building an information interaction mechanism is the key for the government to respond to network public opinion in the big data era. [Limitations] The research only builds the regular model of micro-blog information interaction; it does not consider situations in which negative public opinion, such as Internet rumors, spreads rapidly and widely. [Conclusions] The results can help the government take measures when facing complex micro-blog public opinion, and also provide references for further research on the information interaction of public opinion.
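To make the modeling approach concrete, here is a hedged sketch of a coupled differential equation system in the spirit described above. The logistic terms, the parameter values, and the interaction coefficients `c12`/`c21` are assumptions for illustration, not the paper's actual equations.

```python
# Illustrative sketch only: two media "heat" levels grow logistically and
# feed each other through interaction coefficients. All parameters are
# assumed values, not taken from the paper. Euler integration is used so
# the sketch needs no external libraries.

def simulate(x0=0.01, y0=0.01, r1=0.8, r2=0.6, K1=1.0, K2=1.0,
             c12=0.05, c21=0.05, dt=0.01, steps=3000):
    """Euler-integrate the coupled system:
       dx/dt = r1*x*(1 - x/K1) + c12*y   (micro-blog heat)
       dy/dt = r2*y*(1 - y/K2) + c21*x   (other network media heat)"""
    x, y = x0, y0
    for _ in range(steps):
        dx = r1 * x * (1 - x / K1) + c12 * y
        dy = r2 * y * (1 - y / K2) + c21 * x
        x, y = x + dt * dx, y + dt * dy
    return x, y

x_end, y_end = simulate()
```

With positive interaction coefficients, both heat levels settle slightly above their standalone carrying capacities, which is the kind of mutual-amplification behavior a numerical simulation would examine.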
[Objective] To optimize the diversity of the recommendation list through clustering weight redistribution. [Methods] This paper presents an algorithm to improve recommendation diversity. Items are clustered based on their scores; a clustering weight redistribution algorithm reassigns each cluster's weight, and the final recommendation list is generated from each cluster according to its weight. [Results] Experimental results show that the z-diversity values of the generated recommendation lists increase by 0.46, 0.65 and 1.88 respectively for three algorithms on the MovieLens data set, and by 0.38, 0.49 and 0.76 respectively on the Book-Crossing data set, when the threshold is reduced from 20 to 1. [Limitations] The algorithm only improves the diversity of the recommendation list and does not address aggregate diversity. [Conclusions] The algorithm effectively improves diversity while ensuring accuracy and lower time complexity compared with the bounded greedy algorithm.
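The weight-redistribution idea can be sketched as follows. The exponent-based flattening rule (`w**alpha`) is an assumption standing in for the paper's redistribution algorithm, and the cluster data are fabricated examples.

```python
# Hedged sketch: score-based clusters get weights proportional to total
# predicted score; flattening with alpha < 1 (an assumed rule) shrinks
# dominant clusters, so the final list draws items from more clusters.

def diversified_list(clusters, n, alpha=0.5):
    """clusters: {cluster_id: [(item, predicted_score), ...]}
    Returns n items drawn across clusters by redistributed weight."""
    # original weight: total predicted score per cluster
    w = {c: sum(s for _, s in items) for c, items in clusters.items()}
    # redistribution: w**alpha flattens the weight distribution
    flat = {c: w[c] ** alpha for c in w}
    total = sum(flat.values())
    quota = {c: max(1, round(n * flat[c] / total)) for c in flat}
    out = []
    for c, items in clusters.items():
        best = sorted(items, key=lambda p: -p[1])[:quota[c]]
        out.extend(item for item, _ in best)
    return out[:n]

recs = diversified_list(
    {"drama":  [("d1", 4.9), ("d2", 4.8), ("d3", 4.7), ("d4", 4.6)],
     "sci-fi": [("s1", 3.1), ("s2", 3.0)],
     "comedy": [("c1", 2.5)]},
    n=5)
```

A plain top-5 by score would take four drama items; here the redistributed quotas force the sci-fi and comedy clusters into the list while keeping the highest-scored items within each cluster.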
[Objective] To improve the classification performance of bibliographic information such as books and academic journals, this paper combines the structural characteristics of bibliographic texts and proposes a new feature selection method based on weighted Latent Dirichlet Allocation (wLDA) and multi-granularity. [Methods] On the basis of the Pointwise Mutual Information (PMI) model, the method improves feature weights with respect to location and part of speech, and extends the features generated by the LDA model to obtain more expressive words. A strategy combined with the TF-IDF model obtains fine-granularity features, and the multi-granularity features serve as the core feature set to represent bibliographic texts. Bibliographic text classification is realized with the KNN and SVM algorithms. [Results] Compared with the LDA model and traditional feature selection methods, classification performance on the self-built corpora of books and journals increases by an average of 3.60% and 4.79% respectively. [Limitations] The experimental materials need to be expanded, and more weighting strategies need to be explored to improve classification performance. [Conclusions] Experimental results show that the method is effective and feasible; it increases the expressive ability of the selected feature sets and thereby improves text classification.
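A small sketch of the PMI weighting step with location and part-of-speech boosts. The boost factors (title ×2, noun ×1.5) are illustrative assumptions, not the paper's weights.

```python
# Sketch of PMI-based feature weighting augmented by location and
# part-of-speech factors, in the spirit of the wLDA step. The boost
# factors below are assumed for illustration.

import math

def pmi(count_xy, count_x, count_y, n):
    """Pointwise mutual information (base 2) of word x with class y."""
    return math.log((count_xy * n) / (count_x * count_y), 2)

def weighted_score(word_stats, in_title, is_noun):
    base = pmi(word_stats["xy"], word_stats["x"], word_stats["y"], word_stats["n"])
    weight = 1.0
    if in_title:
        weight *= 2.0   # location boost (assumed factor)
    if is_noun:
        weight *= 1.5   # part-of-speech boost (assumed factor)
    return base * weight

# word co-occurs with the class twice as often as chance would predict
s = weighted_score({"xy": 40, "x": 100, "y": 200, "n": 1000}, True, True)
```

Here the base PMI is log2(2) = 1, and the title and noun boosts raise the feature's weight to 3, so structurally salient words rank higher in feature selection.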
[Objective] Explore the data issues and methods of data preprocessing in paper similarity detection. [Methods] This paper deeply analyzes the original data and briefly introduces three data preprocessing methods, namely the rule-based method, the statistics-based method and the semantic-based method. [Results] Many data problems exist in the original data, on the basis of which a model of data preprocessing is described. [Limitations] The number of corpora is limited, and the preprocessing of figures and tables is not included. [Conclusions] Data preprocessing helps improve the accuracy of paper similarity detection, and using the three methods together improves the effect of data preprocessing.
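The rule-based method can be illustrated with a few regular-expression rules that strip noise before similarity comparison. The specific rules below are illustrative assumptions, not the paper's actual rule set.

```python
# Minimal sketch of rule-based preprocessing: ordered regex rules remove
# citation markers and figure/table references, then normalize whitespace.
# The rule list is illustrative, not the paper's exact rules.

import re

RULES = [
    (re.compile(r"\[\d+(?:[-,]\d+)*\]"), ""),             # citation markers [12], [3-5]
    (re.compile(r"(?i)(fig\.|figure|table)\s*\d+"), ""),  # figure/table references
    (re.compile(r"\s+"), " "),                            # collapse whitespace
]

def preprocess(text):
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    return text.strip()

clean = preprocess("Prior work [3,5] (see Fig. 2)  reports   similar results.")
```

Ordering matters: noise removal runs before whitespace normalization so that gaps left by deleted markers are collapsed in one pass.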
[Objective] Improve hotspot detection to address the lack of semantic understanding and the limitations of the clustering algorithms in traditional microblog hotspot methods. [Methods] This paper uses Information Gain and Latent Semantic Analysis to construct a word-document matrix; a two-step clustering algorithm is then proposed, which uses an improved K-means algorithm for hotspot detection and an incremental clustering algorithm for hotspot refreshing. Meanwhile, similarity strength is adopted to address the low accuracy of traditional methods in which the number of hot topics is determined before the topics are detected. [Results] Compared with previous methods, the recall of the presented method is 91.3% and the precision is 92.9%, with an improved clustering effect. The method can also update data to reduce the complexity of the experiment. [Limitations] The experimental data cover a small time span, so the effect of hotspot refreshing is not prominent. [Conclusions] Experimental results show that the proposed method has good accuracy.
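Two of the steps above can be sketched briefly: the LSA projection of a term-document matrix, and the incremental refresh rule that assigns a new document to an existing topic only if its similarity exceeds a threshold. The threshold value, the toy matrix, and the omission of Information Gain term selection and the improved K-means are all simplifying assumptions.

```python
# Sketch of the LSA and incremental-refresh steps. SVD projects documents
# into a latent space; a new document joins the nearest topic only when
# cosine similarity clears a threshold (0.7 here is an assumed value),
# otherwise it opens a new topic.

import numpy as np

def lsa(term_doc, k=2):
    """Project documents into a k-dimensional latent space via truncated SVD."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    return (np.diag(s[:k]) @ Vt[:k]).T        # one row per document

def incremental_assign(centroids, doc_vec, threshold=0.7):
    """Return the index of the nearest topic, or -1 to start a new topic."""
    sims = [float(c @ doc_vec / (np.linalg.norm(c) * np.linalg.norm(doc_vec)))
            for c in centroids]
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else -1

X = np.array([[3., 0., 1.],
              [2., 0., 0.],
              [0., 4., 3.]])                  # toy terms x docs counts
docs = lsa(X, k=2)
```

The refresh step never re-clusters the whole corpus, which is how the incremental stage keeps the update cost low.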
[Objective] This paper aims to find user groups (influential clusters in a social network) that have great influence on others for particular topics. These user groups can be employed as spread media to support enterprises' marketing decisions. [Methods] With data collected from Sina micro-blog, the pedigree method is used to mine influential clusters in the social network, analyzing information distribution and interaction among individuals. [Results] The proposed method can find user groups with high influence in the social network. Enterprises can utilize these groups to distribute marketing information and enhance the guiding rate of product sales. [Limitations] Only the factors that compose individuals' influential ability are considered; unconventional behaviors of micro-blog users are not taken into account. [Conclusions] This paper provides a theoretical basis and practical method to support enterprises' social marketing decisions.
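The pedigree method is hierarchical agglomerative clustering; a minimal single-linkage version is sketched below. Real input would be multi-feature vectors built from information distribution and interaction data; the one-dimensional influence scores here are fabricated for illustration.

```python
# Minimal single-linkage agglomerative clustering sketch (the pedigree
# method). Starts from singletons and repeatedly merges the closest pair
# of clusters until the desired number remains. Scores are illustrative.

def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]
    def dist(a, b):  # single linkage: distance between nearest members
        return min(abs(x - y) for x in a for y in b)
    while len(clusters) > n_clusters:
        # find the closest pair of clusters and merge them
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

# toy influence scores: two low-influence users, two high, one middling
groups = agglomerate([0.1, 0.15, 0.9, 0.95, 0.5], n_clusters=3)
```

The two high-influence users end up in one cluster, which is the kind of influential group the method is meant to surface.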
[Objective] To explore new ideas and methods and accumulate first-hand experience in importing, storing, retrieving and bulk-exporting large-scale biomedical data. [Methods] Analyze the characteristics of large-scale biomedical data, and compare the technologies, advantages and disadvantages of the traditional relational database (represented by Oracle) and the NoSQL database (represented by HBase) for the big data problem, from both theoretical analysis and test results. Take a drug database of genomic data storage systems as an example and test the performance of Oracle and HBase. [Results] In practical application, HBase has a large advantage over Oracle when processing large data. [Limitations] Deep mining and analysis of the pharmacogenomics data are lacking; future research needs in-depth technical optimization of Hadoop/HBase. [Conclusions] In this experiment, HBase can meet the storage requirements of large-scale biomedical data.
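The shape of such a bulk-import benchmark can be sketched generically: time N inserts against any backend behind a minimal `put()` interface. The in-memory store and the row schema below are stand-ins; real runs would plug in Oracle and HBase clients, which are not shown here.

```python
# Sketch of a bulk-import benchmark harness. The DictStore backend and
# the genomic row schema are illustrative stand-ins; actual measurements
# would swap in real Oracle/HBase client objects with the same put()
# interface.

import time

class DictStore:
    """Toy backend implementing the put() interface used by the harness."""
    def __init__(self):
        self.rows = {}
    def put(self, key, value):
        self.rows[key] = value

def benchmark_bulk_import(store, n):
    """Insert n synthetic drug-genomics rows; return elapsed seconds."""
    start = time.perf_counter()
    for i in range(n):
        store.put("snp:%08d" % i,
                  {"gene": "G%d" % (i % 100), "drug": "D%d" % (i % 7)})
    return time.perf_counter() - start

store = DictStore()
elapsed = benchmark_bulk_import(store, 10000)
```

Running the same harness against both backends, at increasing row counts, yields the kind of import-performance comparison the experiment reports.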
[Objective] This paper proposes a user model to understand mobile user behaviors. [Methods] Mobile user behaviors are analyzed based on communication records from a Chinese telecom, covering 10 thousand mobile users over one week, with 40 thousand calls and 2 million network requests with location information. Fourteen fundamental indicators are adopted from the data in four categories, namely consumption level, call volume, network requests, and amount of movement. [Results] Using the K-means clustering method, four user types are deduced: regular motion with heavy conversation, erratic motion with heavy network access, stay-in with economization, and erratic motion with high consumption. [Limitations] Because of the limited number of users and quantity of data, complex machine learning methods are not used to create the user model. [Conclusions] The results are valuable references for improving personalized services in mobile applications.
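The user-typing step can be sketched as z-scoring the behavior indicators and clustering with K-means. The tiny pure-Python K-means, the 4-indicator rows (consumption, calls, network requests, movement), and K=2 are illustrative simplifications of the study's 14-indicator, K=4 setup.

```python
# Sketch of the clustering step: standardize indicator columns, then run
# a tiny K-means. Rows are fabricated 4-indicator users standing in for
# the study's 14 indicators.

def zscore(rows):
    """Standardize each column to zero mean and unit variance."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [(sum((v - m) ** 2 for v in c) / len(c)) ** 0.5 or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def kmeans(rows, k, iters=20):
    """Lloyd's algorithm with the first k rows as initial centers."""
    centers = rows[:k]
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(r, centers[c]))) for r in rows]
        for c in range(k):
            members = [r for r, l in zip(rows, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(col) for col in zip(*members)]
    return labels

users = [[300, 80, 5, 60], [280, 75, 8, 55],   # heavy callers, mobile
         [50, 5, 900, 10], [60, 8, 950, 12]]   # heavy data, stay-in
labels = kmeans(zscore(users), k=2)
```

Standardization matters here: without it, the large-magnitude network-request column would dominate the Euclidean distances and swamp the other indicators.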
[Objective] A real-time statistical analysis system for DSpace logs is designed and implemented to meet the different needs of users and to make up for the lack of statistical functions in DSpace itself. [Context] Owing to design limitations, DSpace's built-in statistical functions are simple, rigid in presentation, and cannot support interactive statistical analysis. [Methods] Logstash is used to collect and parse DSpace logs, and ElasticSearch to index them. Query DSL requests are built through the ElasticSearch Java API to implement the different statistical functions, and the graphical results are shown with the ECharts component. [Results] The system can produce browse rankings of items, collections and communities, download rankings of bitstreams, regional rankings of website access, and so on. The statistics period can be customized by the user, and the results can be shown in different forms. [Conclusions] Using Logstash and ElasticSearch for statistical analysis of DSpace logs has many advantages: no need to modify DSpace code, simple installation and deployment of the components, interactive querying, fast real-time response, and rich forms of result presentation.
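One statistical function, the bitstream download ranking, can be sketched as an ElasticSearch Query DSL body: filter events to the user-chosen time window, then bucket by bitstream with a terms aggregation. The index and field names (`dspace-logs`, `action`, `bitstream_id`) are assumptions about the Logstash output, not DSpace's real schema; the paper uses the Java API, while Python is used here only to build the same JSON.

```python
# Sketch of a Query DSL body for the bitstream download ranking. Field
# and index names are assumed; the actual system builds equivalent
# requests through the ElasticSearch Java API.

import json

def download_ranking_query(start, end, size=10):
    """Filter log events to a time window, then count downloads per bitstream."""
    return {
        "size": 0,                                # aggregation only, no hits
        "query": {"bool": {"filter": [
            {"term": {"action": "download"}},
            {"range": {"@timestamp": {"gte": start, "lte": end}}},
        ]}},
        "aggs": {"top_downloads": {
            "terms": {"field": "bitstream_id", "size": size}}},
    }

body = download_ranking_query("2015-01-01", "2015-12-31")
payload = json.dumps(body)   # body that would be POSTed to /dspace-logs/_search
```

The customizable time window in the [Results] section corresponds directly to the `range` filter's `gte`/`lte` bounds, and the ranking size to the terms aggregation's `size`.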
[Objective] Inspired by paper index and citation services, improve the development of institutional repositories through the connections between the institutional repository and the paper index and citation service. [Methods] Design a data model, develop an institutional repository based on a paper-entity relationship model, and propose a new pattern for author claims. [Results] The new pattern is practiced via batch claims by subject librarians and email marketing to authors, and implements data correlation between paper and author entities even when the data are inaccurate. [Limitations] Because of data problems, the paper index and citation service still needs verification against the databases. [Conclusions] This study reduces the difficulty of managing and operating an institutional repository, and also provides support for the paper index and citation service.