This paper analyzes typical open-source Web crawling software, such as Nutch, Heritrix, WCT, and Web-Harvest. Based on this analysis, it proposes a targeted website harvesting system built on Nutch. Four key issues of the system are discussed in detail: selecting the initial seed websites, managing the harvest process, denoising web page content, and discovering new seed websites.
This article analyzes the levels and structure of knowledge organization systems (KOS) in digital libraries, emphasizing four components: KOS building and management, KOS interoperation, KOS storage and administration, and semantic metadata generation. Related open-source software is selected, and the application of each component in the process of digital library knowledge organization is introduced. Finally, a practical example of building a knowledge organization system in a digital library is presented.
This article summarizes several typical indexing strategies by analyzing Web archive projects that use Wayback as their access tool, and gives a preliminary analysis of the scope of application, merits, and shortcomings of each strategy. It is intended as a reference for organizations working in this area.
This paper introduces the system architecture, indexing and retrieval process, and language analyzers of Lucene. To address Lucene’s limitation that it supports only one-word and two-word segmentation, this paper develops a Chinese-English language analyzer, ZH_CNAnalyzer. Finally, an indexing and retrieval example using ZH_CNAnalyzer is given.
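To make the segmentation limitation concrete, the following is a minimal sketch of one-word and two-word segmentation, assuming these terms mean single-character and overlapping two-character tokens for Chinese text. The function names are hypothetical and this is not the ZH_CNAnalyzer implementation, which the abstract does not describe.

```python
def unigram_tokens(text):
    """Split text into single-character tokens, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def bigram_tokens(text):
    """Split text into overlapping two-character tokens."""
    chars = unigram_tokens(text)
    if len(chars) < 2:
        # Too short to form a pair; fall back to the single characters.
        return chars
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


sample = "中文检索"
print(unigram_tokens(sample))  # single characters
print(bigram_tokens(sample))   # overlapping pairs, including non-words such as "文检"
```

Neither scheme consults a dictionary, so the bigram output mixes real words with meaningless pairs, which is the kind of weakness a dedicated analyzer would aim to avoid.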
Mass data processing is a focal point of information technology. This paper introduces the architecture of Hadoop, an open-source parallel system, analyzes the Hadoop-based MapReduce programming framework, and proposes a method for generating the co-occurrence matrix of mass data through multiple MapReduce operations.
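The idea of chaining MapReduce operations to build a co-occurrence matrix can be sketched as follows. This is an in-memory illustration under stated assumptions, not the paper's method and not Hadoop code: `run_mapreduce` is a hypothetical stand-in for a real job, documents are whitespace-separated term strings, and two terms co-occur when they appear in the same document.

```python
from collections import defaultdict
from itertools import combinations


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)  # shuffle: group values by key
    return [reducer(key, values) for key, values in sorted(groups.items())]


def pair_mapper(doc):
    """Job 1 map: emit each unordered term pair in a document once."""
    terms = sorted(set(doc.split()))
    for a, b in combinations(terms, 2):
        yield (a, b), 1


def sum_reducer(pair, counts):
    """Job 1 reduce: total the co-occurrence count of a pair."""
    return pair, sum(counts)


def row_mapper(pair_count):
    """Job 2 map: emit one matrix cell for each direction of the pair."""
    (a, b), n = pair_count
    yield a, (b, n)
    yield b, (a, n)


def row_reducer(term, cells):
    """Job 2 reduce: assemble the cells into one sparse matrix row."""
    return term, dict(cells)


def cooccurrence_matrix(docs):
    pair_counts = run_mapreduce(docs, pair_mapper, sum_reducer)
    rows = run_mapreduce(pair_counts, row_mapper, row_reducer)
    return dict(rows)


docs = ["hadoop mapreduce", "hadoop mapreduce matrix", "matrix hadoop"]
matrix = cooccurrence_matrix(docs)
print(matrix["hadoop"])
```

The first job counts pairs and the second regroups the counts by row, so each job stays a simple key-value transformation that could in principle be distributed.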
This paper introduces grid service description techniques for a multi-attribute DL grid, namely, setting a uniform metadata standard for each feature and describing each feature according to its corresponding metadata standard. It discusses the levels of semantic service description in the DL grid and establishes an ontology-based semantic description model for DL grid services.
This paper introduces several standards for compound digital objects: METS, MPEG-21 DIDL, and OAI-ORE. It analyzes the basic data models, applications, and characteristics of these standards and compares how they process digital objects.
To identify the micro factors that affect user satisfaction with library websites, this paper puts forward a flexible self-evaluation system. With this system, library websites can choose a suitable micro evaluation plan according to their own needs and diagnose for themselves the underlying factors that affect user satisfaction. The system is highly customizable: library managers can create expert weight-survey questionnaires using either its indicator templates or self-defined indicators. Finally, the survey data are analyzed and displayed as 3D visualizations, and the micro factors that need improvement are identified.
This paper gives a systematic discussion of the Knowledge Communication Network (KCN) drawn from CSDN, in order to mine the characteristics of knowledge communication in virtual communities. First, the authors analyze the statistical properties of the network and show that it exhibits both the small-world effect and the scale-free property. They then identify two important motifs in knowledge communication by analyzing the triangles of the network.
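Triangle analysis of a network can be illustrated with a short sketch. It assumes an undirected graph given as an edge list, which may differ from the (possibly directed) KCN studied in the paper; the function names and the toy edges are hypothetical.

```python
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_triangles(adj):
    """Count distinct triangles; each is seen once from each of its 3 nodes."""
    seen = 0
    for u in adj:
        for v, w in combinations(adj[u], 2):
            if w in adj[v]:
                seen += 1
    return seen // 3


def clustering_coefficient(adj, node):
    """Fraction of a node's neighbor pairs that are themselves connected."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    links = sum(1 for v, w in combinations(neighbors, 2) if w in adj[v])
    return 2.0 * links / (k * (k - 1))


edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
graph = build_graph(edges)
print(count_triangles(graph))
print(clustering_coefficient(graph, "c"))
```

A high average clustering coefficient combined with short path lengths is the usual signature of the small-world effect, so triangle counts are a natural starting point for this kind of analysis.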
This paper analyzes the query logs of the Sogou search engine from March 2007. Part-of-speech (POS) tagging is used to characterize the high-frequency POS results. Web users rely primarily on nouns in their queries, with verbs as a complement, while other parts of speech seldom appear. Function words of natural language, such as “的”, rarely appear in the high-frequency POS results. Web search queries differ from natural language in syntax to a certain degree, yet the two also share some characteristics. Web users use nouns for concept-focused retrieval, and keywords remain the primary method of searching the Web. The high-frequency POS tagging results partially obey Zipf’s law.
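One common way to check whether frequencies follow Zipf's law is to fit a line to the rank-frequency data on log-log axes and see whether the slope is close to -1. The sketch below assumes this least-squares approach, which the abstract does not specify; the tag labels and counts are made-up illustrative data, not results from the Sogou logs.

```python
import math
from collections import Counter


def rank_frequency(items):
    """Return (rank, frequency) pairs, most frequent item first."""
    freqs = sorted(Counter(items).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_slope(rank_freq):
    """Least-squares slope of log(frequency) against log(rank)."""
    if len(rank_freq) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank, _ in rank_freq]
    ys = [math.log(freq) for _, freq in rank_freq]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Hypothetical tag counts chosen so that frequency is proportional to 1/rank.
tags = ["n"] * 120 + ["v"] * 60 + ["a"] * 40 + ["d"] * 30
print(zipf_slope(rank_frequency(tags)))
```

A slope near -1 over only part of the rank range would correspond to the "partially obey" observation, since real data often deviate at the head or tail.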
Knowledge interchange between agents is a complex process that needs a conversation policy to govern the agents’ activities. This paper proposes a method based on an extended KQML language that simulates the handshake mechanism of the TCP protocol. The method handles problems that arise during the interchange, such as establishing a conversation and assuring message delivery.
This paper puts forward a new method for constructing a user preference model based on a weighted XML data structure, in which each node carries a weight value representing the user’s personalized information. It also designs a new algorithm for comparing the similarity of weighted XML models. Finally, it discusses in detail the implementation of a personalized product recommendation system based on this user preference model.
Using the Visual Studio .NET development platform, C#, and XML, a network subject knowledge database system has been designed and developed. This paper studies key techniques such as metadata acquisition from HTML Web pages and XML file generation, knowledge point mining, and fast data transformation between XML files of network subject knowledge and a relational database.
Based on library practice, this article discusses solutions for migrating the CAIRIC local system from physical servers to virtual machines. Using backup and virtual machine techniques, the authors successfully migrated and upgraded the CAIRIC local systems. The work provides a useful example for building future library service platforms on virtual machine technology.
This paper proposes a new binarization algorithm based on multi-scale conditional random fields (mCRF). The algorithm treats binarization as a tagging process, using mCRF to label every pixel so as to binarize the full image. As a discriminative model, mCRF can accommodate arbitrary non-independent features, making full use of the information in the image. The results show that the algorithm performs better than the common threshold method.
Taking the tourism documents of the Aba Zang and Qiang Autonomous Region as information resources, the author analyzes the principles for selecting topics and topic types when organizing tourism documents with topic maps, defines the associations among topics in tourism documents, proposes a methodological approach to constructing topic maps for tourism documents, and demonstrates the resulting organization.
This paper proposes a model for extracting experience and evaluation articles, and conducts an experiment on extracting such articles from blogs. Instead of syntactic analysis, the model relies on the collocation degree and distance among the experience object, experience action, and experience evaluation. The experimental results show that a system based on this model achieves high extraction precision.
The system applies GIS to the management of university library information. To build the system, this paper uses the spatial query and spatial analysis functions of GIS, sets up a basic spatial geographic information system model, and associates the spatial data with the attribute data. Users with different permissions can manage, retrieve, query, analyze, and apply library resources in a virtual environment. Readers can easily look up a spatial position through a resource’s attribute data and can also obtain the attribute data of areas that interest them.