Based on an analysis of several state-of-the-art knowledge extraction systems, namely MnM, KIM, Text2Onto, Amilcare and Melita, this paper shows that two kinds of technologies, machine learning and natural language analysis, have developed in parallel and benefited from each other. On the machine learning side, new methods such as Adaptive Information Extraction and Open Information Extraction have been put forward, with a trend toward Ontology Learning. On the natural language analysis side, Pattern-Based Annotation and Semantic Annotation methods have received more attention than ever, with a trend toward Ontology-Based Information Extraction. In addition, Controlled Language Information Extraction is introduced to reduce the cost of ontology construction and to allow non-specialists to create or edit ontological data using simple natural language.
Automatic Term Recognition (ATR) is a key process in knowledge technologies such as knowledge extraction and text mining. To enrich term-recognition-based text mining theories and methods and to support the construction of related systems, this paper reviews the main existing ATR methods and identifies the key problems in the process. By studying related programs and systems as well as existing term resources, developers can choose the most suitable approach for their own ATR systems.
Entity relation extraction is an important task in the text information extraction domain. This paper first summarizes the development of entity relation extraction in the context of MUC and ACE, and then points out that the main difficulties in relation extraction are the acquisition of training datasets, the acquisition of templates, and co-reference resolution. Based on an analysis of recent literature, systems and projects, it classifies entity relation extraction methods as follows: template-based methods, lexicon-driven methods, machine learning methods, ontology-driven methods, and hybrid methods. This analysis can help to build more efficient entity relation extraction systems in the future.
Text visualization is a method that uses computer technology to present specific text resources graphically. This paper analyzes the characteristics of current text visualization through an analysis of typical text visualization systems. Text visualization can be divided into four classes: vocabulary-based, article-based, time-series-based and topic-based, which reflect the main text visualization techniques. The final part discusses how text visualization is used in the current information environment.
Since keywords and key phrases can represent the features of a text, keyword extraction and filtration is of great significance for information retrieval, information extraction and knowledge discovery. This paper first surveys current keyword extraction methods. It then proposes a method for keyword extraction and filtration from medical texts, using existing thesauri and tools in the medical field together with the BM25F model. The proposed method mainly solves two key problems: the identification and extraction of keywords, and the evaluation of keyword value for filtration. The method is applied to documents in the field of osteoarthritis from 2001 to 2007 and its effectiveness is verified, offering an effective way to extract keywords for knowledge discovery.
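The BM25F weighting mentioned above combines per-field term frequencies (e.g., title versus body) with field boosts, length normalization and an IDF factor. A minimal sketch follows; the field names, boost values and parameter settings are illustrative assumptions, not taken from the paper.

```python
import math

def bm25f_score(term_freqs, field_lens, avg_field_lens, boosts, b, df, n_docs, k1=1.2):
    """Score one term in one document with BM25F.

    term_freqs: per-field raw frequency of the term in this document
    field_lens / avg_field_lens: this document's and the collection's field lengths
    boosts: per-field weights; b: per-field length-normalization strength
    df: number of documents containing the term; n_docs: collection size
    """
    # Weighted, length-normalized pseudo-frequency accumulated over fields.
    tf = 0.0
    for field, freq in term_freqs.items():
        norm = 1.0 - b[field] + b[field] * field_lens[field] / avg_field_lens[field]
        tf += boosts[field] * freq / norm
    # Standard BM25 IDF, then saturation with k1.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    return idf * tf / (k1 + tf)
```

In a keyword-filtration setting, candidate terms can be ranked by this score and those below a threshold discarded; a title match with a high boost then outweighs the same frequency in the body.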
This paper puts forward a model that eliminates ambiguity in Chinese word segmentation. The model first segments text with both MM (forward maximum matching) and RMM (reverse maximum matching), then compares the two segmentation results and outputs the more accurate one. The process is divided into three parts: discovery, extraction and disambiguation. Test results show that the model reduces the segmentation error rate caused by segmentation ambiguity.
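The MM/RMM comparison above can be sketched as follows. This is a minimal illustration with a set-based vocabulary; the disambiguation heuristic (prefer the result with fewer single-character words) is an assumed stand-in for the paper's own disambiguation step.

```python
def max_match(text, vocab, max_len=4, reverse=False):
    """Greedy maximum matching segmentation; reverse=True gives RMM."""
    result = []
    if reverse:
        i = len(text)
        while i > 0:
            # Try the longest suffix first; fall back to a single character.
            for l in range(min(max_len, i), 0, -1):
                piece = text[i - l:i]
                if l == 1 or piece in vocab:
                    result.append(piece)
                    i -= l
                    break
        return list(reversed(result))
    i = 0
    while i < len(text):
        # Try the longest prefix first; fall back to a single character.
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if l == 1 or piece in vocab:
                result.append(piece)
                i += l
                break
    return result

def disambiguate(mm, rmm):
    """Assumed heuristic: prefer the segmentation with fewer single-character words."""
    if mm == rmm:
        return mm
    singles = lambda seg: sum(1 for w in seg if len(w) == 1)
    return mm if singles(mm) < singles(rmm) else rmm
```

For the classic ambiguous string 研究生命起源, MM yields [研究生, 命, 起源] while RMM yields [研究, 生命, 起源]; comparing the two exposes the ambiguous span so it can be resolved.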
This paper proposes a personalized Web page recommendation model based on sequential patterns. First, the model extracts the Web transaction set through Web usage data preparation. Second, it applies a sequential pattern mining algorithm to discover frequent (contiguous) sequences. Finally, it uses a frequent (contiguous) sequence tree to generate the user interest view and to provide the personalized recommendation set.
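The mining and recommendation steps above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it counts contiguous subsequences of sessions directly (rather than building the sequence tree) and recommends the pages that most often follow the user's recent path.

```python
from collections import defaultdict

def frequent_contiguous(sessions, min_support, max_len=3):
    """Count contiguous subsequences (once per session) and keep the frequent ones."""
    counts = defaultdict(int)
    for s in sessions:
        seen = set()
        for l in range(1, max_len + 1):
            for i in range(len(s) - l + 1):
                seq = tuple(s[i:i + l])
                if seq not in seen:  # support = number of sessions, not occurrences
                    seen.add(seq)
                    counts[seq] += 1
    return {seq: c for seq, c in counts.items() if c >= min_support}

def recommend(patterns, recent):
    """Recommend pages whose frequent pattern extends the user's recent contiguous path."""
    recs = {}
    for seq, c in patterns.items():
        if len(seq) > 1 and seq[:-1] == tuple(recent[-(len(seq) - 1):]):
            recs[seq[-1]] = max(recs.get(seq[-1], 0), c)
    return sorted(recs, key=recs.get, reverse=True)
```

Given sessions such as a→b→c, a→b→d, a→b→c with a support threshold of 2, the path a→b leads to a recommendation of c, since b→c and a→b→c are both frequent while b→d is not.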
This paper proposes a new query expansion method that combines user modeling research with ontology-based query expansion research to realize personalized semantic query expansion. It divides the process of personalized semantic query expansion into two stages: the mapping from keywords to the concepts contained in the user model, and semantic expansion at the ontology level; the algorithm for each stage is given in the paper. Experiments indicate that this method can improve both the precision and the recall of information retrieval, and meets personalized needs to a certain extent.
To address the problem that traditional information retrieval models cannot handle uncertain knowledge well, the author combines rough set and fuzzy set theory and puts forward an improved Web information retrieval model based on fuzzy rough sets. The author also proposes a key algorithm and a performance evaluation method based on the model. The model helps to raise the efficiency of information retrieval and is valuable in both theory and application.
An indexing frame for Web page information is constructed with reference to Dublin Core metadata. Characteristic information of Web pages is extracted, and automatic indexing of Web page information is realized using ADO technology. Experimental results indicate that the accuracy of mapping indexing information to Web pages reaches 100%. Finally, the classification and indexing technology is applied to the intelligent agent terminal of the complementary network architecture, and the effectiveness of the UCL indexing method is proved. Experimental results indicate that, through UCL-based automatic classification and indexing of Web page information, active information service is realized and users' individual demands are satisfied.
This paper designs and implements in Java an association rule mining algorithm named TidlistApriori, based on the transaction identifier lists (tidlists) of the database. Experimental results comparing TidlistApriori with hash-tree-based Apriori indicate that the algorithm improves the efficiency of finding frequent itemsets, and that TidlistApriori can serve as an efficient tool for mining topic associations.
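The tidlist idea above is that the support of a candidate itemset can be obtained by intersecting the transaction-id sets of its parts, avoiding repeated database scans. A minimal Python sketch follows (the paper's implementation is in Java; the candidate-generation details here are a simplified assumption, not the paper's exact TidlistApriori).

```python
from itertools import combinations

def tidlist_frequent_itemsets(transactions, min_support):
    """Find frequent itemsets level by level, using tidlist intersection for support."""
    # Build tidlists: 1-itemset -> set of transaction ids containing it.
    tidlists = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidlists.setdefault(frozenset([item]), set()).add(tid)
    current = {k: v for k, v in tidlists.items() if len(v) >= min_support}
    frequent = dict(current)
    while current:
        nxt = {}
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == len(a) + 1:  # join two k-itemsets into a (k+1)-candidate
                tids = current[a] & current[b]  # support via tidlist intersection
                if len(tids) >= min_support and union not in nxt:
                    nxt[union] = tids
        frequent.update(nxt)
        current = nxt
    return {tuple(sorted(k)): len(v) for k, v in frequent.items()}
```

Each level needs only set intersections on the previous level's tidlists, which is where the speed-up over rescanning the transactions for every candidate comes from.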
This paper presents a text mining system based on the co-occurrence of bibliographic items in literature databases. The system produces the principal bibliometric indicators of a given document set drawn from PubMed and Web of Science, and presents some of the results with visualization techniques. Furthermore, it provides cluster analysis and association analysis by investigating the co-occurrence data of high-frequency MeSH terms, highly productive authors, highly cited papers and highly cited authors. Using these approaches, users can mine potential association rules among MeSH terms and carry out scientometric investigations.
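The co-occurrence data underlying such cluster and association analysis can be sketched as a simple pair count over bibliographic records; the record structure below (one list of MeSH terms per paper) is an assumption for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(records):
    """Count how many records contain each pair of terms.

    records: iterable of term lists, one per bibliographic record.
    Returns a Counter keyed by alphabetically ordered term pairs.
    """
    pair_counts = Counter()
    for terms in records:
        # De-duplicate within a record, order the pair canonically.
        for a, b in combinations(sorted(set(terms)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts
```

The resulting pair counts form the co-occurrence matrix that clustering and association rule mining then operate on.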
Aiming at the deficiencies of existing Web analysis tools, this paper presents a reasonable Web application and development process, using Java to develop Webstat, which is then applied to website evaluation. Practice shows that the project is easy to operate, produces comprehensive and systematic results, and is practical.
This paper introduces the design and implementation of the remote access system of the National Science Library, Chinese Academy of Sciences. The system implements single sign-on based on SAML, authorization, access management and reverse proxy, and helps research users to access, anytime and anywhere, the digital resources purchased by their institutes.
A book cover service in Mashup mode has been designed and developed for the OPAC at Tsinghua University Library. When patrons search the OPAC, book covers are displayed seamlessly in the result pages, so patrons can use them intuitively. This article introduces the design and implementation of the book cover data source server, with emphasis on the design ideas of the external book cover data source, the method of building the data source with Servlet technology, and how to connect the server with the library management system.
After discussing metadata parallel harvesting frameworks, this paper presents an improved metadata parallel harvesting framework based on digital library grids, mobile agents and the OAI framework. It then describes the major components and the functions of the modules in this framework. Experimental results show that the framework overcomes shortcomings of previous parallel metadata harvesting frameworks, such as low performance and search inefficiency.
For the distributed, loosely coupled service system of a digital library consisting of many application services, this paper brings forward an application-layer solution for monitoring the digital library's network services, achieves the goal of managing all accessible services, and gives formulas for computing service performance and availability. Finally, it discusses the design and implementation of a service management system for the digital library.
An electronic resource access gateway system for solving the usage statistics problem of networked electronic resources has been developed at Xi'an Jiaotong University Library. This paper describes the system's design ideas and implementation in detail, including how to obtain valuable data, how to analyze the data to derive the needed information, the generation of statistical reports on electronic resources, and remaining problems.