[Objective] This article builds the W2R framework for converting Web data to the RDF format. [Methods] The bottom infrastructure of the framework is built with the W2R vocabulary, and Web data are converted to RDF with a mapping file that consists of the system Ontology and Web page elements extracted with XPath expressions. The Virtuoso database serves as the persistent storage for the RDF data. [Results] With the W2R framework, it is convenient to convert Web data to RDF, merge data from different sources, store them in named graphs and implement simple inferences without changing any source code. [Limitations] The system Ontology is currently made up of public namespaces that describe bibliographies, and RDF data is stored only in the Virtuoso database. [Conclusions] The W2R framework provides a new way of generating standardized RDF data for Semantic Web and Linked Data applications.
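The core mapping step the abstract describes, pairing XPath expressions with ontology properties to emit RDF triples, can be sketched as follows. The mapping entries, property URIs and sample page are illustrative assumptions, not the paper's actual mapping file; the standard library's limited XPath subset stands in for a full XPath engine.

```python
# Hypothetical W2R-style mapping step: one XPath expression per page element,
# paired with an ontology property, producing triples in N-Triples form.
import xml.etree.ElementTree as ET

# Illustrative mapping (property URI -> XPath); a real mapping file would be larger.
MAPPING = {
    "http://purl.org/dc/terms/title": ".//h1",
    "http://purl.org/dc/terms/creator": ".//span[@class='author']",
}

def page_to_triples(subject_uri, xhtml):
    """Convert one well-formed page to N-Triples lines using the mapping."""
    root = ET.fromstring(xhtml)
    triples = []
    for prop, xpath in MAPPING.items():
        for el in root.findall(xpath):
            text = (el.text or "").strip()
            if text:
                triples.append(f'<{subject_uri}> <{prop}> "{text}" .')
    return triples

page = """<html><body>
  <h1>Linked Data Basics</h1>
  <span class="author">Li Wei</span>
</body></html>"""
print("\n".join(page_to_triples("http://example.org/book/1", page)))
```

Because the mapping lives in data rather than code, adding a new source only means adding mapping entries, which matches the abstract's "without changing any source code" claim.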
[Objective] Find the semantic relations among large-scale Oracle Bone Inscription (OBI) data in order to provide a semantic analysis function for OBI research. [Methods] Based on text mining and combined with Semantic Web technology, semantic search is implemented on a data set of RDF-based entities and their relationships, and Ontology relationships and Ontology reasoning are used to extract explicit or implicit semantic relations among RDF objects. [Results] Experimental results show that the F-measure reaches 74.49% on OBI literature semantic mining and 70.61% on OBI semantic mining, which satisfies the needs of OBI information processing. [Limitations] Semantic mining is based on three separate Ontologies instead of an integrated one. [Conclusions] RDF can provide a structured semantic specification description, and the LarKC system is suitable for large-scale OBI semantic processing.
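The kind of Ontology reasoning mentioned above, deriving implicit relations from explicit RDF-style triples, can be illustrated with a toy transitive-closure rule. The entity and relation names are hypothetical; real OBI ontologies and the LarKC platform are far richer than this sketch.

```python
# Toy rule-based reasoning over RDF-style triples: whenever (a, p, b) and
# (b, p, c) hold for a transitive predicate p, infer the implicit (a, p, c).
explicit = {
    ("oracle_bone_script", "subClassOf", "ancient_script"),
    ("ancient_script", "subClassOf", "writing_system"),
}

def transitive_closure(triples, predicate="subClassOf"):
    """Repeatedly apply the transitivity rule until no new triples appear."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        snapshot = list(inferred)  # avoid mutating the set while iterating
        for a, p1, b in snapshot:
            for b2, p2, c in snapshot:
                if p1 == p2 == predicate and b == b2:
                    t = (a, predicate, c)
                    if t not in inferred:
                        inferred.add(t)
                        changed = True
    return inferred

result = transitive_closure(explicit)
print(("oracle_bone_script", "subClassOf", "writing_system") in result)  # True
```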
[Objective] This paper constructs a human-annotated collection on the basis of Sogou query logs, aiming at feature analysis and automatic identification of query specificity, and evaluates and compares the identification results. [Methods] The queries' basic features and content features are selected and analyzed, and then decision tree, SVM and Naive Bayes classifiers are built and trained to achieve automatic query specificity classification. [Results] Using these features, effective query specificity identification is obtained: the macro-average F-measures of the identification results are all above 0.8. [Limitations] Users' clickthrough information is not included in the feature selection, and whether ignoring the conditional independence assumption of the Naive Bayes classifier matters in this particular experiment should be further verified. [Conclusions] The queries' basic features and content features, by themselves, can well distinguish broad, medium and specific queries.
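As an illustration of the "basic features" a query-specificity classifier might consume, the sketch below extracts a few surface statistics from a query string. The feature names are assumptions; the paper's exact feature set is not given in the abstract.

```python
# Hypothetical basic-feature extraction for query specificity classification:
# surface statistics that tend to separate broad from specific queries.
def basic_features(query):
    terms = query.split()
    return {
        "char_len": len(query),                               # total characters
        "term_count": len(terms),                             # number of terms
        "avg_term_len": sum(map(len, terms)) / max(len(terms), 1),
        "has_digit": any(ch.isdigit() for ch in query),       # e.g. model numbers
    }

f = basic_features("hotels in beijing")
print(f)
```

Feature dictionaries of this shape can be fed directly to standard decision tree, SVM or Naive Bayes implementations after vectorization.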
[Objective] This paper studies the acquisition of synonyms from patent query logs. [Methods] A method based on the analysis of user behavior is proposed: a logic expression parser generates candidate synonym pairs, and features such as pinyin, Chinese character patterns, abbreviations, and traditional versus simplified Chinese forms are combined to generate a synonym dictionary. [Results] Experimental results show that the precision rate reaches 74.5%. The method generates 17 495 synonym pairs, and the scale of the dictionary exceeds that of some existing methods. [Limitations] The method is feasible only for library and information retrieval scenarios with complex query expressions. [Conclusions] This research provides a significant reference for log-based knowledge acquisition.
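One of the features listed above, the abbreviation relation between candidate terms, can be approximated by a subsequence test: the short form's characters must appear in order inside the long form. This is a hypothetical single-feature sketch; the paper combines it with pinyin and traditional/simplified-form features not reproduced here.

```python
# Abbreviation check for candidate synonym pairs: is `short` an in-order
# subsequence of `long`? E.g. 北大 abbreviates 北京大学.
def is_abbreviation(short, long):
    it = iter(long)
    # `ch in it` advances the iterator, so characters must match in order.
    return all(ch in it for ch in short)

print(is_abbreviation("北大", "北京大学"))   # True
print(is_abbreviation("大北", "北京大学"))   # False (wrong order)
```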
[Objective] This paper aims to implement feature extension of short texts and improve short-text classification performance. [Methods] The high-frequency words and topic core words of each class in the training set are extracted as domain keyword sets at two feature granularities, words and latent topics. The topic probability distribution of a testing text is derived with an LDA model; when a topic's probability is greater than a threshold, the keywords of that topic are extended into the testing text. The semantic similarity between the testing text and the domain keyword set of each class is then calculated with HowNet. [Results] Compared with the short-text classification method based on the LDA model, the proposed algorithm increases Macro F1 by an average of 4.9%, 5.9% and 4.2% on the Fudan, Sogou and Micro-blog corpora respectively, and Micro F1 by 4.6%, 6.2% and 4.6%. Compared with the method based on the VSM model, it increases the F-measure by more than 13% on all three corpora. Experiments also show that the extension method combining high-frequency words and topic core words performs better than extension using only one of them. [Limitations] Many words are not included in HowNet, so their similarity cannot be calculated with HowNet, which affects the classification results. [Conclusions] The proposed method can effectively improve short-text classification performance.
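The thresholded extension step can be sketched as follows. The topic keyword lists, the threshold value and the hard-coded topic distribution are illustrative assumptions; in the paper the distribution comes from an LDA model and the keywords from the training set.

```python
# Sketch of short-text feature extension: append the keywords of every topic
# whose probability in the testing text exceeds a threshold.
TOPIC_KEYWORDS = {0: ["finance", "stock"], 1: ["football", "match"]}  # illustrative
THRESHOLD = 0.3  # illustrative cut-off

def extend_short_text(tokens, topic_dist):
    """tokens: testing text; topic_dist: {topic_id: probability} from LDA."""
    extended = list(tokens)
    for topic, prob in topic_dist.items():
        if prob > THRESHOLD:
            extended.extend(TOPIC_KEYWORDS.get(topic, []))
    return extended

print(extend_short_text(["market", "rises"], {0: 0.72, 1: 0.08}))
# → ['market', 'rises', 'finance', 'stock']
```

The extended token list is then compared against each class's domain keyword set (with HowNet similarity in the paper) to pick the class.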
[Objective] This paper proposes an algorithm for improving the precision of Chinese text classification, which calculates the similarity between Chinese texts more accurately. [Methods] With the TF-IDF algorithm calculating term weights and HowNet analyzing the semantic relationships between lexical items, a text similarity weighting algorithm based on HowNet semantic similarity is proposed and evaluated in a Chinese text classification experiment. [Results] The experimental results show that the proposed method improves text categorization performance compared with traditional ones. [Limitations] The algorithm's time complexity is quite high, and its speed of text classification needs to be improved. [Conclusions] Analyzing the semantic relationships between feature items is proved to be effective for enhancing the classification accuracy of Chinese text.
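The TF-IDF weighting that the algorithm builds on can be shown in a minimal form. The HowNet semantic-similarity component is not reproduced here, since it requires the HowNet lexical resource.

```python
# Minimal TF-IDF: weight(t, d) = tf(t, d) * log(N / df(t)), where N is the
# number of documents and df(t) is how many documents contain term t.
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))  # document frequency
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()}
            for d in docs]

docs = [["data", "mining"], ["data", "analysis"]]
weights = tf_idf(docs)
print(weights[0]["mining"])   # log(2/1) ≈ 0.693
print(weights[0]["data"])     # log(2/2) = 0.0: appears everywhere, so no weight
```

The zero weight for "data" illustrates why pure TF-IDF ignores semantics: two texts sharing only ubiquitous terms look dissimilar, which is the gap the HowNet-based weighting aims to close.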
[Objective] In a big data environment, this paper aims to detect bursty events from text streams accurately and quickly. [Methods] Using Kleinberg burst detection and the LDA topic model, the method is extended to the MapReduce framework to achieve parallel corpus preprocessing, parallel detection of bursty words, parallel filtering of bursty documents and parallel extraction of topics. [Results] Simulation experiments on a news text stream show that the parallel method reaches 87.50% precision, 77.78% recall and an F-measure of 82.35% when detecting bursty events in specific areas. [Limitations] The MapReduce parallel method can hardly achieve online, real-time detection of bursty events on large-scale dynamic text streams. [Conclusions] Compared with the traditional serial method of detecting bursty events, the distributed parallel method not only guarantees the accuracy of detection results but also has good scalability.
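The bursty-word detection stage can be illustrated with a deliberately simplified signal: flag a word as bursty in a time window when its relative frequency exceeds its overall baseline by some factor. This is not the Kleinberg two-state automaton the paper uses (which models bursts with a hidden-state cost minimization); it only conveys the idea of comparing window frequency to baseline.

```python
# Simplified burst signal (NOT the full Kleinberg automaton): a word is bursty
# in a window if its in-window frequency exceeds `factor` times its overall
# corpus frequency. All data below are illustrative.
from collections import Counter

def bursty_words(windows, factor=1.5):
    """windows: list of token lists, one per time window."""
    total = Counter(t for w in windows for t in w)
    n_total = sum(total.values())
    out = []
    for w in windows:
        counts, n = Counter(w), len(w)
        out.append({t for t, c in counts.items()
                    if c / n > factor * total[t] / n_total})
    return out

windows = [["rain", "news"], ["rain", "news"], ["quake", "quake", "news", "rain"]]
print(bursty_words(windows))  # only "quake" bursts, in the last window
```

In the MapReduce version, the per-window counting is what mappers do in parallel; the global baseline is computed in the reduce phase.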
[Objective] This paper proposes a review credibility sorting model to help customers make better shopping decisions. [Methods] The review credibility indexes are adjusted and optimized on the Visual Studio application development platform; index scores are obtained through a questionnaire survey, and the credibility sorting model is constructed with the Fuzzy Analytic Hierarchy Process. [Results] The experimental results show that, compared with the original Web reviews, the new review sorting method is more scientific and reasonable. Reviews without a “helpful vote” are not necessarily unreliable, so the “helpful vote” is important to review credibility but is not the only factor that determines it. [Limitations] People weight the factors differently, so future work should attach more importance to the expertise behind factor ratings. [Conclusions] The sorting model synthesizes several indexes and adjustment methods, and thus provides a new credibility sorting method for Chinese online customer reviews that considers both objective information and semantic features.
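The final scoring step one would expect from such a model, a weighted aggregation of credibility indexes, can be sketched as below. The index names and weights are purely illustrative; in the paper the weights are derived via the Fuzzy Analytic Hierarchy Process from questionnaire data.

```python
# Illustrative credibility scoring: weighted sum of normalized index scores.
# Weights are hypothetical stand-ins for the Fuzzy-AHP-derived values.
WEIGHTS = {"helpful_votes": 0.3, "reviewer_rank": 0.3, "content_quality": 0.4}

def credibility(scores):
    """scores: {index_name: value in [0, 1]}."""
    return sum(WEIGHTS[k] * scores.get(k, 0.0) for k in WEIGHTS)

reviews = [
    {"helpful_votes": 0.0, "reviewer_rank": 0.8, "content_quality": 0.9},  # no votes
    {"helpful_votes": 0.9, "reviewer_rank": 0.2, "content_quality": 0.3},
]
ranked = sorted(reviews, key=credibility, reverse=True)
print([round(credibility(r), 2) for r in ranked])  # [0.6, 0.45]
```

Note that the first review scores higher despite having no helpful votes, mirroring the abstract's finding that the “helpful vote” is important but not decisive.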
[Objective] A centralized identity authentication model is proposed to solve the user identity management problem. [Context] In the National Public Culture Digital Platform, identity authentication needs to consider the characteristics of the platform's topological structure and the autonomy of users from member libraries. [Methods] The model uses an implicit or explicit global identity and mapping relations to autonomous identities in order to unify the autonomous identities of the member libraries. [Results] With this model, users do not need to remember multiple identities, member libraries can share user information and realize user-centered services, and new member libraries can join easily. [Conclusions] The model is feasible, but it still has problems such as efficiency, identity disambiguation and security, and should be tested and adjusted when implemented.
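The central data structure the model implies, one global identity mapped to a user's autonomous identities at each member library, can be sketched minimally. All identifiers and library names are hypothetical.

```python
# Hypothetical global-to-autonomous identity mapping: the centralized store
# records, for each global identity, the user's local identity per library.
mapping = {
    "global:u1001": {"lib-a": "alice_w", "lib-b": "a.wang"},
}

def resolve(global_id, library):
    """Return the user's autonomous identity at a member library, or None."""
    return mapping.get(global_id, {}).get(library)

print(resolve("global:u1001", "lib-b"))  # a.wang
print(resolve("global:u1001", "lib-c"))  # None: user has no identity there
```

Adding a new member library only requires inserting its column into the mapping, which is why new libraries can join easily under this model.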
[Objective] The goal of constructing the Institutional Repository of the Chinese Academy of Agricultural Sciences (CAAS-IR) is to promote the preservation, dissemination and utilization of digital assets. [Context] With the rapid development of domestic and foreign IR construction and the open access movement, CAAS-IR will become an important knowledge infrastructure of the Chinese Academy of Agricultural Sciences. [Methods] CAAS-IR uses DSpace as the prototype system and is optimized through Java programming and the application of Solr. [Results] The CAAS-IR platform extends DSpace-core with faceted search, retrieval, statistical analysis and other functions. [Conclusions] Practice on CAAS-IR raises awareness of IR among the scientific research personnel and the science and technology management departments of CAAS. IR construction involves many aspects, such as technology, resource construction, management and service; effective incentive mechanisms and value-added services will help the implementation of IR.
[Objective] A new Network Public Opinion (NPO) classification method based on a parallel Naive Bayesian Classification Algorithm (NBCA) in a Hadoop environment is proposed. [Context] NPO data are high-volume, highly distributed and highly varied information assets, so accurate and fast classification is difficult to achieve. [Methods] Exploiting the distributed storage and parallel processing features of the Hadoop platform, the NBCA is encapsulated in parallel form: the NPO documents are stored locally under the HDFS framework and classified in parallel in the MapReduce process. [Results] Tests of the MapReduce-packaged parallel NBCA show that the execution efficiency of the proposed algorithm improves by 82% compared with the centralized method, and its classification accuracy reaches more than 85%. [Conclusions] The proposed algorithm can effectively improve NPO classification efficiency and capability.
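The MapReduce decomposition of Naive Bayes training can be simulated in plain Python: mappers emit ((label, term), 1) pairs from locally stored documents, and a reducer sums them into the class-conditional counts the classifier needs. Hadoop specifics (HDFS paths, job configuration) are omitted, and the documents are illustrative.

```python
# Simulated map/reduce phases for parallel Naive Bayes training:
# mapper -> ((label, term), 1) pairs; reducer -> summed counts per key.
from collections import defaultdict

def mapper(doc):
    """One labeled document -> a stream of ((label, term), 1) pairs."""
    label, text = doc
    for term in text.split():
        yield (label, term), 1

def reduce_counts(mapped):
    """Sum the 1s per (label, term) key, as a MapReduce reducer would."""
    counts = defaultdict(int)
    for key, v in mapped:
        counts[key] += v
    return dict(counts)

docs = [("politics", "election vote vote"), ("sports", "match goal")]
counts = reduce_counts(p for d in docs for p in mapper(d))
print(counts[("politics", "vote")])  # 2
```

Because each mapper only needs its local document shard, the counting scales with the number of nodes, which is the source of the reported efficiency gain.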
[Objective] To improve the efficiency of finding books in the library, this article provides a library book location and navigation system based on smart phones. [Context] Readers often find books in the library in a low-efficiency way and need a new method for fast book positioning and navigation. [Methods] A landmark system is set up, and a mapping table between books' call numbers and their locations is created; users can search for books and their locations by mobile phone, and the system provides a navigation path with the HEAA algorithm. [Results] Readers can search for books and find their locations in half the time needed before. [Conclusions] The system surpasses others in low cost, easy deployment and convenience, and has good accuracy in location and navigation.
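The abstract does not describe the HEAA algorithm itself; as a generic stand-in for the path-finding step, the sketch below runs breadth-first search over a hypothetical graph of shelf landmarks to return a shortest navigation path.

```python
# Generic shortest-path navigation over a landmark graph (BFS stand-in for the
# unspecified HEAA algorithm). Graph and landmark names are illustrative.
from collections import deque

GRAPH = {"entrance": ["A1"], "A1": ["entrance", "A2", "B1"],
         "A2": ["A1"], "B1": ["A1", "B2"], "B2": ["B1"]}

def navigate(start, goal):
    """Return a shortest landmark path from start to goal, or None."""
    queue, seen = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in GRAPH.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(navigate("entrance", "B2"))  # ['entrance', 'A1', 'B1', 'B2']
```

In the system described, the call-number-to-location mapping table would first resolve a book to its target landmark (e.g. "B2"); the navigator then produces the walking route.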