PatentRank Algorithm: A Study of Using Cited Time and Citation Network to Calculate U.S. Patents

Cite this article

Ku Liping. PatentRank Algorithm: A Study of Using Cited Time and Citation Network to Calculate U.S. Patents. 现代图书情报技术 (New Technology of Library and Information Service), 2011, 27(6): 14-19

Editorial Office of 现代图书情报技术

Ku Liping

Department of Library and Information Science, National Taiwan University, Taipei 10617, China

Abstract

Building on PageRank and ArticleRank, this paper uses the times-cited criterion and the citation network to compute rank scores, and thereby establishes the PatentRank algorithm. It then analyzes the digital library patents in the United States Patent and Trademark Office (USPTO) database. The results show that PatentRank can differentiate patents that have the same number of citations. The study is a novel application of the PageRank algorithm.

Keywords: Patent retrieval; Patent analysis; Patentmetrics; Patent evaluation; Patent performance; Patent quality

CLC number: G312


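1 Introduction

Scientometric research based on SCI and SSCI has many developing research fronts. Among them, algorithmic research and policy applications that use the PageRank algorithm to assess the quality of papers currently fall into two main strands:

(1) Evaluating the influence of authors, institutions, and countries, for example a citation-based evaluation model [1], a ranking of the most influential papers in the Journal of Documentation [2], and the assessment of the scientific impact of individual authors [3];

(2) Observing scholars' scientific collaboration networks, for example co-authorship relations within a field [4], the application to scientific papers of a PageRank variant modified with citation counts and co-authorship [5], a bibliometric study of important authors in library and information science [6], and a study of citation relations in the field of information retrieval [7].

These two strands are two sides of the same coin. Both apply the same PageRank principle to article ranking, which can serve as a method for evaluating the return on research funding, or as a way of interpreting the social networks of some scholars in certain fields.

This paper does not study PageRank, article ranking, or the uses of article ranking. It studies another algorithm, one suited to ranking patents. Because papers and patents are both purposeful forms of information exchange, a patent ranking algorithm can support applications in information retrieval systems, industrial competitive intelligence, input-output evaluation of science and technology resources, and national science and technology policy making. The aim of this study is to develop a patent ranking algorithm on the basis of existing PageRank and article ranking algorithms, in the hope that it can become a core technology of decision support systems (Decision Support System, DSS).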

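2 Requirements and Technical Approach

(1) Current applications in China and abroad

PageRank, also called the Page algorithm, is one of the ten most widely used data mining algorithms [8]. Its core idea is that a page has high rank if the sum of the ranks of its backlinks is high [9]. When the Google search engine was first designed, it drew on the concept of the citation network from bibliometrics in information science, and it improved on the first-generation Yahoo search engine, which organized information with the faceted classification of library science (the two subsequently borrowed from each other and converged) [10, 11]. Many later studies addressed the shortcomings of PageRank [12], for example by using user behavior to personalize PageRank [13, 14].

Although PageRank derives from the concept of the citation network [15] and is expressed as a matrix equation through Markov chains [16], it has in turn influenced citation network research [17]. This influence extends across bibliometrics, informetrics, scientometrics, and webometrics (for definitions see [18] and [19]). Examples include Link Analysis Rank, used to rank journals and authors [20], PaperRank [21], and personal web page ranking [22].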
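The core idea quoted above can be made concrete with a small iteration. The sketch below is illustrative only: it uses the per-page form PR(A) = (1 - d) + d × Σ PR(T)/C(T) from Brin and Page [10], a hypothetical three-page web (`TOY_WEB`), and Python rather than the Ruby used in this study.

```python
def pagerank(links, d=0.85, iterations=100):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    rank = {p: 1.0 for p in pages}
    for _ in range(iterations):
        new_rank = {p: 1.0 - d for p in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page with no outlinks passes nothing on in this sketch
            share = d * rank[source] / len(targets)
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: A links to B and C, B links to C, C links back to A.
TOY_WEB = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
RANKS = pagerank(TOY_WEB)
```

In this toy web, C collects rank from both A and B and so ends up ahead of A, which in turn stays ahead of B, even though A and B each have a single backlink.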

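(2) Application needs in practice

Patentometrics applies the research methods of bibliometrics, informetrics, and scientometrics, together with some mathematical operations, to the analysis and study of patent information [23, 24]. It differs from other metric methods mainly in that the format and conventions of patent data differ from those of ordinary papers and web pages. Three kinds of calculation are commonly used: basic statistics, citation analysis, and linkage indicators [24]. Applying the PageRank formula and the research orientation of article ranking can therefore enrich the analytical techniques and research content of patentometrics. This study can be applied to patent evaluation in industrial engineering, and it also provides a basic patent analysis tool for national science and technology policy decisions.

(3) Technical approach

The times-cited criterion and the citation network are used to compute the rank order of U.S. patents retrieved under the same keyword. The USPTO database of the United States Patent and Trademark Office was searched with ABST/"digital library", which returned 46 relevant patents. The data source (http://patft.uspto.gov/netahtml/PTO/search-bool.html) is an institutional site that supports Open Access. The data were collected on 10 March 2011 (the database was last updated on 2011-03-08), so later researchers can repeat and verify the retrieval.

(4) Technical implementation

A patent ranking algorithm is designed and compared with times cited in order to show its effect. The experiment takes times cited (Times Cited, TC) as the control group and PatentRank (PR) as the comparison group.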

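3 Solution

3.1 Technical Architecture and Implementation Plan

The experimental objects are analyzed step by step according to the formula. The steps are as follows:

(1) Data collection: download the USPTO patent data needed for the experiment.

(2) Data cleaning: use Ruby to normalize the format of the TXT files according to the USPTO patent metadata.

(3) Data analysis: write a Ruby program that implements the PatentRank formula.

(4) Data analysis: break the PatentRank formula into parts and compute and analyze the values by hand in Calc.

(5) Data comparison: compare the two sets of results, confirm that the calculations agree, and compile the tables.

(6) Ranking output: verify the formula against the table contents.

(7) Formula verification: discuss the results of the TC group and the PR group.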
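The data cleaning step can be pictured as turning text rows into typed records. The sketch below is an assumption-laden illustration in Python (the study itself used Ruby): it supposes the cleaned TXT file holds one tab-separated row per patent in the four-column layout of Table 1, and the names `PatentRecord` and `parse_rows` are hypothetical.

```python
from typing import NamedTuple


class PatentRecord(NamedTuple):
    number: str
    title: str
    references: int   # references the patent makes
    times_cited: int  # citations the patent receives


def parse_rows(text):
    """Parse tab-separated rows; skip headers, captions, and blank lines."""
    records = []
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) != 4 or not fields[0].isdigit():
            continue  # not a data row in the assumed four-column layout
        number, title, refs, cited = fields
        records.append(PatentRecord(number, title, int(refs), int(cited)))
    return records


# Two rows copied from Table 1, preceded by a header line.
SAMPLE = (
    "Patent No.\tTitle\tReferences\tTimes cited\n"
    "7613993\tPrerequisite checking in a system for creating compilations of content\t130\t3\n"
    "6353831\tDigital library system\t24\t41\n"
)
RECORDS = parse_rows(SAMPLE)
```

Keeping the patent number as a string avoids losing leading characters if other identifier formats are ever mixed in, while the two counts are converted to integers so they can be sorted and averaged.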

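3.2 Key Technical Issues and Their Solution

On the basis of PageRank and ArticleRank, the PatentRank algorithm is designed as follows:

PatentRank(P) = (1 - d) + d × [...] × [...]

where PatentRank(P) is the rank score of patent P; d is the random-jump probability (the damping factor), set to 0.85 following the empirical value used for PageRank; c is the number of citations within this patent group; c̄ is the mean of c; P_i is one of the n patents that cite P; and c(P_i) is the number of references in the citation network of P_i.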
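The two factors that follow d did not survive in the formula as reproduced here. One form that uses every symbol defined above, and that follows the ArticleRank pattern of adding the mean count to the denominator, is sketched below. It is an assumption offered for orientation only, not the author's confirmed formula. In particular, the placement of the prefactor c̄ is a guess based on there being two missing factors.

```latex
\mathrm{PatentRank}(P) = (1 - d) + d \times \bar{c} \times \sum_{i=1}^{n} \frac{\mathrm{PatentRank}(P_i)}{c(P_i) + \bar{c}}
```

Under this reading, a patent that nobody in the group cites has an empty sum and receives the floor value 1 - d = 0.15.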

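4 Application Results

4.1 Implementation Environment and Testing

Because PatentRank is modeled on PageRank, it has to be iterated in practice to obtain the final rank values. There are two ways to do this. One is to compute in an OpenOffice Calc spreadsheet (similar to Microsoft Excel), repeating the operation by hand many times. The other is to write a program which, once debugged and confirmed correct, can be reused. PatentRank was therefore implemented in Ruby, and the output was checked against the manual calculation.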
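The iteration described above might look like the following sketch. It is written in Python for illustration (the study used Ruby), and it assumes an ArticleRank-style update, PatentRank(P) = (1 - d) + d × c̄ × Σ PatentRank(P_i) / (c(P_i) + c̄), which is a guess at the formula and not a confirmed statement of it. The function name `patent_rank`, the toy network `TOY_GROUP`, and the default of 15 iterations (the count reported for the experiment) are illustrative choices.

```python
def patent_rank(citing, d=0.85, iterations=15):
    """Iterate an assumed ArticleRank-style update over a closed patent group.

    citing maps each patent to the list of in-group patents that cite it.
    """
    patents = set(citing) | {q for sources in citing.values() for q in sources}
    # c(P_i): number of in-group references made by each patent.
    refs = {p: 0 for p in patents}
    for sources in citing.values():
        for q in sources:
            refs[q] += 1
    # In a closed group, mean references made equals mean citations received.
    c_bar = sum(refs.values()) / len(patents)
    rank = {p: 1.0 for p in patents}
    for _ in range(iterations):
        rank = {
            p: (1 - d)
            + d * c_bar * sum(rank[q] / (refs[q] + c_bar) for q in citing.get(p, []))
            for p in patents
        }
    return rank


# Hypothetical group: X is cited once by P (1 reference in total);
# Y, Z1 and Z2 are each cited once by Q (3 references in total).
TOY_GROUP = {"X": ["P"], "Y": ["Q"], "Z1": ["Q"], "Z2": ["Q"]}
RANKS = patent_rank(TOY_GROUP)
```

X and Y are each cited exactly once, so times cited cannot separate them. Under the assumed update, X scores higher because its citing patent spreads its influence over fewer references, and the uncited patents P and Q settle at 1 - d.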

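In the data cleaning stage the USPTO patents were compiled for ease of calculation, as shown in Table 1. The table shows that, among the patents in the USPTO database whose abstracts contain "digital library", times cited ranges from high to low, and several patents share the same times-cited value.

Table 1 Digital Library patents in the USPTO database

Patent No.	Title	References	Times cited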
7895288	Personalized time-shifted programming	119	0
7895243	Method and system for moving content in a content object stored in a data repository	152	0
7716589	Non-computer interface to a database and digital library	14	0
7613993	Prerequisite checking in a system for creating compilations of content	130	3
7613704	Enterprise digital asset management system and method	8	0
7613336	Methods and apparatus for image recognition and dictation	35	0
7513424	Digital system and method for home entertainment	6	0
7441192	Programming, selecting, and playing multimedia files	89	0
7401097	System and method for creating compilations of content	118	3
7356766	Method and system for adding content to a content object stored in a data repository	122	3
7346844	Method and system for moving content in a content object stored in a data repository	61	9
7340481	Method and system for adding user-provided content to a content object stored in a data repository	122	14
7089239	Method and system for preventing mutually exclusive content entities stored in a data repository to be included in the same compilation of content	101	26
7076494	Providing a functional layer for facilitating creation and manipulation of compilations of content	96	17
7072398	System and method for motion vector generation and analysis of digital video clips	6	7
7043488	Method and system for storing hierarchical content objects in a data repository	108	33
7035842	Method, system, and program for defining asset queries in a digital library	12	4
6996257	Method for lighting-and view-angle-invariant face description with first-and second-order eigenfeatures	6	5
6986102	Method and configurable model for storing hierarchical data in a non-hierarchical data repository	99	13
6961734	Method, system, and program for defining asset classes in a digital library	11	6
6850944	System, method, and computer program product for managing access to and navigation through large-scale information spaces	16	22
6842604	Personal digital content system	3	20
6839701	Hitmask for querying hierarchically related content entities	91	31
6748382	Method for describing media assets for their management	11	11
6735329	Methods and apparatus for image recognition and dictation	30	2
6611840	Method and system for removing content entity object in a hierarchically structured content object stored in a database	27	117
6489979	Non-computer interface to a database and digital library	15	4
6449627	Volume management method and system for a compilation of content	32	72
6353831	Digital library system	24	41
6338044	Personal digital content system	16	63
6321374	Application-independent generator to generate a database transaction manager in heterogeneous information systems	55	32
6260040	Shared file system for digital content	21	69
6256636	Object server for a digital library system	23	9
6253237	Personalized time-shifted programming	11	27
6243853	Development of automated digital libraries for in-circuit testing of printed curcuit boards	4	0
6199072	Method for creating archives on removable mass storage media and archive server	4	2
6154748	Method for visually mapping data between different record formats	16	34
6092080	Digital library system	41	48
6035303	Object management system for digital libraries	8	60
6021410	Extensible digital library	25	11
6005969	Methods and systems for manipulation of images of floor coverings or other fabrics	15	14
5966454	Methods and systems for manipulation of images of floor coverings or other fabrics	15	13
5940594	Distributed storage management system having a cache server and method therefor	2	68
5896506	Distributed storage management system having a cache server and method therefor	18	29
5835667	Method and apparatus for creating a searchable digital video library and a system and method of using such a library	4	162
5832499	Digital library system	40	49


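4.2 Experimental Results and Analysis

The PatentRank formula was iterated 15 times in Ruby. Table 2 compares the resulting PatentRank ranking with the ranking by times cited (Times Cited):

Table 2 Comparison of rankings by Times Cited and by PatentRank

As Table 2 shows, because the Times Cited and PatentRank values differ, some patents occupy different positions in the times-cited ranking and the PR ranking.

(1) Patents 5835667 and 6611840 rank first and second under both TC and PR. Patents 6243853, 7441192, 7513424, 7613336, 7613704, 7716589, 7895243, and 7895288 all fall in the last positions, 39 to 46. At the extremes, therefore, TC and PR do not differ noticeably. Just as PageRank has been proved mathematically to converge under repeated iteration [25, 26, 27], PatentRank also converges effectively and sensibly, and it assigns a reasonable random-jump value to patents with zero citations.

(2) Patents 6449627, 6260040, 5940594, 6338044, and 6035303 fall in the same band (positions 3 to 7), but their order within it differs between TC and PR. The set of patents in the top quarter is thus the same under both measures, while the ordering that takes the citation network into account refines the ranking obtained from times cited alone.

(3) Patents 6005969 and 7340481, patents 5966454 and 6986102, patents 6021410 and 6748382, patents 6256636 and 7346844, and the group 7356766, 7401097, and 7613993 cannot be separated under TC. Under PR, patents with the same times cited can be ranked according to the citation network.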
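The TC side of these observations can be checked directly against Table 1. The sketch below, in Python for illustration, copies the times-cited column of Table 1 and lists the top positions and the groups of patents that TC alone cannot separate. The helper names `tie_groups` and `top_by_times_cited` are hypothetical. The PR side cannot be reproduced this way, because it needs the citation links among the 46 patents, which are not listed here.

```python
from collections import defaultdict

# Times-cited values copied from Table 1 (patent number -> times cited).
TIMES_CITED = {
    7895288: 0, 7895243: 0, 7716589: 0, 7613993: 3, 7613704: 0, 7613336: 0,
    7513424: 0, 7441192: 0, 7401097: 3, 7356766: 3, 7346844: 9, 7340481: 14,
    7089239: 26, 7076494: 17, 7072398: 7, 7043488: 33, 7035842: 4, 6996257: 5,
    6986102: 13, 6961734: 6, 6850944: 22, 6842604: 20, 6839701: 31, 6748382: 11,
    6735329: 2, 6611840: 117, 6489979: 4, 6449627: 72, 6353831: 41, 6338044: 63,
    6321374: 32, 6260040: 69, 6256636: 9, 6253237: 27, 6243853: 0, 6199072: 2,
    6154748: 34, 6092080: 48, 6035303: 60, 6021410: 11, 6005969: 14, 5966454: 13,
    5940594: 68, 5896506: 29, 5835667: 162, 5832499: 49,
}


def tie_groups(times_cited):
    """Map each times-cited value shared by two or more patents to those patents."""
    groups = defaultdict(list)
    for patent, count in times_cited.items():
        groups[count].append(patent)
    return {count: sorted(ps) for count, ps in groups.items() if len(ps) > 1}


def top_by_times_cited(times_cited, k):
    """Return the k patents with the highest times-cited values."""
    return sorted(times_cited, key=times_cited.get, reverse=True)[:k]


TIES = tie_groups(TIMES_CITED)
TOP7 = top_by_times_cited(TIMES_CITED, 7)
```

`TOP7` gives the TC order of the first seven positions discussed in items (1) and (2), and `TIES` contains the tied pairs and the tied triple named in item (3), along with the eight uncited patents.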

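5 Conclusion

(1) Research results

This study develops PatentRank on the basis of PageRank and ArticleRank. As an exploratory study it leaves much room for development, especially improvement based on the behavior of patent users. In theory, understanding user behavior is the key to developing the technique further. In practice, the results can be carried directly into industrial engineering applications, using open-source software to run large-scale computations on patent data in support of national science and technology policy decisions.

(2) Research contribution

In the ten-odd years since PageRank was published in 1998, Google has improved the algorithm, and a great deal of research has centered on it [28]. In the last three years some scholars have adapted PageRank into an ArticleRank algorithm to compute the citations and rankings of journal articles and to compare the scientific competitiveness of individuals, institutions, and countries. So far, however, few have adapted PageRank into a PatentRank algorithm for use in bibliometrics, informetrics, scientometrics, or patentometrics.

Applying the PageRank algorithm directly to the ranking of other kinds of documents is not appropriate. Because a text with few references passes a larger influence value to each text it cites, the modeling of ArticleRank experimented with square roots, exponents, and the difference between maximum and minimum values to handle the weighting problem [2]. As with ArticleRank, experiments show that PatentRank currently also needs the mean citation count as one parameter of its formula. This does not mean, however, that this parameter is the only reasonable normalization that can remove the scale fallacy.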
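A small piece of arithmetic shows why a mean-based parameter tempers the effect of short reference lists. The sketch below, in Python, compares the share a citing document passes to each reference under plain PageRank, 1/c(P_i), with an ArticleRank-style share, 1/(c(P_i) + c̄). The damped form is an assumption about how the mean enters the formula, and the numbers (2 and 20 references, a mean of 10) are hypothetical.

```python
from fractions import Fraction


def plain_share(refs):
    """Share passed to each cited document under plain PageRank: 1 / c(P_i)."""
    return Fraction(1, refs)


def damped_share(refs, mean_refs):
    """Assumed ArticleRank-style share: 1 / (c(P_i) + mean reference count)."""
    return Fraction(1, refs + mean_refs)


MEAN_REFS = 10  # hypothetical mean reference count for the group

# How much more a 2-reference document passes on than a 20-reference document.
RATIO_PLAIN = plain_share(2) / plain_share(20)
RATIO_DAMPED = damped_share(2, MEAN_REFS) / damped_share(20, MEAN_REFS)
```

With these numbers the advantage of the short reference list shrinks from a factor of 10 to a factor of 2.5. The ordering is preserved but the scale effect is much weaker, which is the kind of normalization the paragraph above describes.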

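(3) Future research

Beyond the differences in data structure among web pages, papers, and patents, "patent information behavior" needs to be studied in depth, approaching the problem from the uses of patent ranking [29]. For example, if a user interest hierarchy (UIH) is built from user profiles, personalized ranking methods can be developed [30]. Web search engines must take into account the growth in the number of sites and pages, page content, link structure, and users' search needs [31]. If user intention is better utilized, text snippet extraction can be generalized, for example by using statistical language models to capture the commonalities between a document and the user's intention [32]. InstanceRank, an instance-ranking algorithm similar to PageRank, can reduce the size of the instance set by selecting the most representative instances from a learning database [33]. As PageRank and article ranking techniques advance, breakthroughs in patent ranking can be expected. How to develop the various applications in information systems, university evaluation, or science and technology policy into a useful set of tools, and how to optimize the algorithm in a normalized way, still depend on user behavior.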

The authors have declared that no competing interests exist.


References

[1]	Corso G M D, Romani F, Binii D A. Versatile Weighting Strategies for a Citation-based Research Evaluation Model[EB/OL]. [2010-10-27]. http://www.dmi.unipg.it/lmc/galn/data/talk/delcorso.pdf.
[2]	Li J, Willett P. ArticleRank: A PageRank-based Alternative to Numbers of Citations for Analyzing Citation Networks[J]. Aslib Proceedings: New Information Perspectives, 2009, 61(6): 605-618.
[3]	Yan E, Ding Y. Discovering Author Impact: A PageRank Perspective[J]. Information Processing & Management, 2011, 47(1): 125-134.
[4]	Liu X, Bollen J, Nelson M L, et al. Co-authorship Networks in the Digital Library Research Community[J]. Information Processing & Management, 2005, 41(6): 1462-1480.
[5]	Fiala D, Rousselot F, Jezek K. PageRank for Bibliographic Networks[J]. Scientometrics, 2008, 76(1): 135-158.
[6]	Yan E, Ding Y. Applying Centrality Measures to Impact Analysis: A Co-authorship Network Analysis[J]. Journal of the American Society for Information Science and Technology, 2009, 60(10): 2107-2118.
[7]	Ding Y, Yan E, Frazho A, et al. PageRank for Ranking Authors in Co-citation Networks[J]. Journal of the American Society for Information Science and Technology, 2009, 60(11): 2229-2243.
[8]	Wu X, Kumar V, Quinlan J R, et al. Top 10 Algorithms in Data Mining[J]. Knowledge and Information Systems, 2008, 14(1): 1-37.
[9]	Page L, Brin S, Motwani R, et al. The PageRank Citation Ranking: Bringing Order to the Web[EB/OL]. [2010-10-27]. http://ilpubs.stanford.edu:8090/422/.
[10]	Brin S, Page L. The Anatomy of a Large-scale Hypertextual Web Search Engine[J]. Computer Networks and ISDN Systems, 1998, 30(1-7): 107-117.
[11]	Uddin M N, Janecek P. A Framework for Integrating Faceted Classification Within a Content Management System[EB/OL]. [2011-04-07]. http://kst.buu.ac.th/proceedings/JCSSE2005/pdf/a-702.pdf.
[12]	Berlt K, Moura E S, Carvalho A, et al. Modeling the Web as a Hypergraph to Compute Page Reputation[J]. Information Systems, 2010, 35(5): 530-543.
[13]	Eirinaki M, Vazirgiannis M. Web Site Personalization Based on Link Analysis and Navigational Patterns[J]. ACM Transactions on Internet Technology, 2007, 7(4): 1-27.
[14]	Witten I H. Searching … in a Web[J]. Journal of Universal Computer Science, 2008, 14(10): 1739-1762.
[15]	Pinski G, Narin F. Citation Influence for Journal Aggregates of Scientific Publications: Theory, with Application to the Literature of Physics[J]. Information Processing and Management, 1976, 12(5): 297-312.
[16]	Boldi P, Santini M, Vigna S. PageRank: Functional Dependencies[J/OL]. ACM Transactions on Information Systems. [2010-09-22]. http://vigna.dsi.unimi.it/ftp/papers/PageRankFunctional.pdf.
[17]	Ma N, Guan J, Zhao Y. Bringing PageRank to the Citation Analysis[J]. Information Processing and Management, 2008, 44(2): 800-810.
[18]	Tague-Sutcliffe J. An Introduction to Informetrics[J]. Information Processing and Management, 1992, 28(1): 1-3.
[19]	Björneborn L, Ingwersen P. Perspectives of Webometrics[J]. Scientometrics, 2001, 50(1): 78-79.
[20]	Sidiropoulos A, Manolopoulos Y. Generalized Comparison of Graph-based Ranking Algorithms for Publications and Authors[J]. Journal of Systems and Software, 2006, 79(12): 1679-1700.
[21]	Krapivin M, Marchese M, Casati F. Exploring and Understanding Citation-based Scientific Metrics[EB/OL]. [2010-09-22]. http://disi.unitn.it/~krapivin/acs-2009-metrics.pdf.
[22]	Yang W S, Jan Y S. Increasing the Authoritativeness of Web Recommendations Using PageRank-based Approaches[J]. Online Information Review, 2009, 33(2): 362-375.
[23]	罗思嘉. 专利计量分析与应用[J]. 国立成功大学图书馆馆刊, 2007(16): 43-54.
[24]	陈达仁, 黄慕萱. 专利资讯-检索、分析与策略[M]. 台北: 华泰文化, 2009.
[25]	Lin Y, Shi X, Wei Y. On Computing PageRank via Lumping the Google Matrix[J]. Journal of Computational and Applied Mathematics, 2009, 224(2): 702-708.
[26]	Gleich D F, Gray A P, Chen G, et al. An Inner-outer Iteration for Computing Pagerank[J]. Society for Industrial and Applied Mathematics, 2010, 32(1): 349-371.
[27]	Andersson F K, Silvestrov S D. The Mathematics of Internet Search Engines[J]. Acta Applicandae Mathematicae, 2008, 104(2): 211-242.
[28]	Wills R S, Ipsen I C F. Ordinal Ranking for Google's PageRank[J]. Matrix Annual, 2009, 30(4): 1677-1696.
[29]	Liu Y, Liu T Y, Gao B, et al. A Framework to Compute Page Importance Based on User Behaviors[J]. Information Retrieval, 2010, 13(1): 22-45.
[30]	Kim H, Chan P. Personalized Search Results with User Interest Hierarchies Learnt from Bookmarks[J]. Advances in Web Mining and Web Usage Analysis, 2006, 4198: 158-176.
[31]	Ke Y, Deng L, Ng W, et al. Web Dynamics and Their Ramifications for the Development of Web Search Engines[J]. Computer Networks, 2006, 50(10): 1430-1447.
[32]	Li Q, Chen Y P. Personalized Text Snippet Extraction Using Statistical Language Models[J]. Pattern Recognition, 2010, 43(1): 378-386.
[33]	Vallejo C G, Troyano J A, Ortega F J. InstanceRank: Bringing Order to Datasets[J]. Pattern Recognition Letters, 2010, 31(2): 133-142.
