Home > Articles > A Survey of Google's PageRank: The Yahoo Bonus

A Survey of Google's PageRank: The Yahoo Bonus

skip to navigation

This section of the site features articles published between 2002 and 2004. They remain here for reference purposes and may contain information that is out of date.

Article Index

The Yahoo Bonus and its Impact on Search Engine Optimization

Many experts in search engine optimization assume that certain websites obtain a special PageRank evaluation from the Google search engine which needs a manual intervention and does not derive from the PageRank algorithm directly. Mostly, the directories Yahoo and Open Directory Project (dmoz.org) are considered to get this special treatment. In the context of search engine optimization, this assumption would have the consequence that an entry into the above mentioned directories had a big impact on a site's PageRank.

Two boxes labelled A and B. A number one with an arrow to A. A number 10 with an arrow to B. An arrow points from A to B and B to A.

An approach that is often mentioned when talking about influencing the PageRank of certain websites is to assign higher starting values to them before the iterative computation of PageRank begins. Such proceeding shall be reviewed by looking at a simple example web consisting of two pages, whereby each of these pages solely links to the other. We assign an initial PageRank of 10 to one page and a PageRank of 1 to the other. The damping factor d is set to 0.1, because the lower d is, the faster the PageRank values converge during the iterations. So, we get the following equations for the computation of the pages' PageRank values:

PR(A) = 0.9 + 0.1 PR(B)
PR(B) = 0.9 + 0.1 PR(A)

During the iterations, we get the following PageRank values:

Iteration PR (A) PR (B)
0 1 10
1 1.9 1.09
2 1.009 1.0009
3 1.00009 1.000009

It is obvious that despite assigning different starting values to the two pages each of the PageRank values converges to 1, just as it would have happened if no initial values were assigned. Hence, starting values have no effect on PageRank if a sufficient number of iterations takes place. Indeed, if the computation is performed with only few iterations, the starting values would influence PageRank. But in this case, we have to consider that in our example the PageRank relation between the two pages reverses after the first iteration. However, it shall be noted that for our computation the actual PageRank values within one iteration have been used and not the ones from the previous iteration. If those values would have been used, the PageRank relation had alternated after each iteration.

Modification of the PageRank Algorithm

If assigning special starting values at the begin of the PageRank calculations has no effect on the results of the computation, this does not mean that it is not possible to influence the PageRank of websites or web pages by an intervention in the PageRank algorithm. Lawrence Page, for instance, describes a method for a special evaluation of web pages in his PageRank patent specifications (United States Patent 6,285,999). The starting point for his consideration is that the random surfer of the Random Surfer Model may get bored and stop following links with a constant probabilty, but when he restarts, he won't take a random jump to any page of the web but will rather jump to certain web pages with a higher probability than to others. This behaviour is closer to the behaviour of a real user, who would more likely use, for instance, directories like Yahoo or ODP as a starting point for surfing.

If a special evaluation of certain web pages shall take place, the original PageRank algorithm has to be modified. With another expected value implemented, the algorithm is given as follows:

PR(A) = E(A) (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Here, (1-d) is the probability for the random surfer no longer following links. E(A) is the probability for the random surfer going to page A after he has stopped following links, weighted by the number of web pages. So, E is another expected value whose average over all pages is 1. In this way, the average of the PageRank values of all pages of the web will continue to converge to 1. Thus, the PageRank values do not vaccilate because of the special evaluation of web pages and the impact of PageRank on the general ranking of web pages remains stable.

Two boxes labelled A and B. A number 0.2 with an arrow to A. A number 1.8 with an arrow to B. An arrow points from A to B and B to A.

In our example, we set the probability for the random surfer going to page A after he has stopped following links to 0.1. The probability for him going to page B is set to 0.9. Since our web consists of two pages E(A) equals 0.2 and E(B) equals 1.8. At a damping factor d of 0.5 we get the following equations for the calculation of the single pages' PageRank values:

PR(A) = 0.2 0.5 + 0.5 PR(B)
PR(B) = 1.8 0.5 + 0.5 PR(A)

If we solve these equations we get the following PageRank values:

PR(A) = 11/15
PR(B) = 19/15

The sum of the PageRank values remains 2. The higher probability for the random surfer jumping to page B is reflected by its higher PageRank. Indeed, the uniform interlinking between both pages prevents our example pages' PageRank values from a more significant impact of our intervention.

So, it is possible to implement the special evaluation of certain web pages into the PageRank algorithm without having to change it fundamentally. It is questionable, indeed, what criteria is used for the evaluation. Lawrence Page suggests explicitly the utilization of real usage data in his PageRank patent specifications. Google, meanwhile, collects usage date by means of the Google Toolbar. And Google would not even need as much data, as if the whole ranking was solely based on usage data. A limited sample would be sufficient to determine the 1,000 or 10,000 most important pages on the web. The PageRank algorithm can then fill the holes in usage data and is thereby able to deliver a more accurate picture of the web.

Of course, all statements regarding the influence of real usage data on PageRank are pure speculation. Even if there is a special evaluation of certain web pages at all will in the end, stay a secret of the people at Google.

Nonetheless, Assigning Starting Values?

Although, assigning special starting values to pages at the begin of PageRank calculations has no effect on PageRank values it can, nonetheless, be reasonable.

Three boxes A, B and C. An arrow points from A to B, B to C, C to A and A to C.

We take a look at our example web constisting of the pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. In this case, the damping factor d is set to 0.75. So, we get the following equations for the iterative computation of the single pages' PageRank values:

PR(A) = 0.25 + 0.75 PR(C)
PR(B) = 0.25 + 0.75 (PR(A) / 2)
PR(C) = 0.25 + 0.75 (PR(A) / 2 + PR(B))

Basically, it is not necessary to assign starting values to the single pages before the computation begins. They simply start with a value of 0 and we get the following PageRank values during the iterations:

Iteration PR(A) PR(B) PR(C)
0 0 0 0
1 0.25 0.34375 0.60156
2 0.70117 0.51294 0.89764
3 0.92323 0.59621 1.04337
4 1.03253 0.63720 1.11510
5 1.08632 0.65737 1.15040
6 1.11280 0.66730 1.16777
7 1.12583 0.67219 1.17633
8 1.13224 0.67459 1.18054
9 1.13540 0.67578 1.18261
10 1.13696 0.67636 1.18363
11 1.13772 0.67665 1.18413
12 1.13810 0.67679 1.18438
13 1.13828 0.67686 1.18450
14 1.13837 0.67689 1.18456
15 1.13842 0.67691 1.18459
16 1.13844 0.67692 1.18460
17 1.13845 0.67692 1.18461
18 1.13846 0.67692 1.18461
19 1.13846 0.67692 1.18461
20 1.13846 0.67692 1.18461
21 1.13846 0.67692 1.18461
22 1.13846 0.67692 1.18462

If we assign 1 to each page before the computation starts, we get the following PageRank values during the iterations:

Iteration PR(A) PR(B) PR(C)
0 1 1 1
1 1 0.625 1.09375
2 1.07031 0.65137 1.13989
3 1.10492 0.66434 1.16260
4 1.12195 0.67073 1.17378
5 1.13034 0.67388 1.17928
6 1.13446 0.67542 1.18199
7 1.13649 0.67618 1.18332
8 1.13749 0.67656 1.18398
9 1.13798 0.67674 1.18430
10 1.13823 0.67684 1.18446
11 1.13835 0.67688 1.18454
12 1.13840 0.67690 1.18458
13 1.13843 0.67691 1.18460
14 1.13845 0.67692 1.18461
15 1.13845 0.67692 1.18461
16 1.13846 0.67692 1.18461
17 1.13846 0.67692 1.18461
18 1.13846 0.67692 1.18461
19 1.13846 0.67692 1.18462

If we now assign a starting value to each page, which is closer to its effective PageRank (1.1 for page A, 0.7 for page B and 1.2 for page C), we get the following results:

Iteration PR(A) PR(B) PR(C)
0 1.1 0.7 1.2
1 1.15 0.68125 1.19219
2 1.14414 0.67905 1.18834
3 1.14126 0.67797 1.18645
4 1.13984 0.67744 1.18552
5 1.13914 0.67718 1.18506
6 1.13879 0.67705 1.18483
7 1.13863 0.67698 1.18472
8 1.13854 0.67695 1.18467
9 1.13850 0.67694 1.18464
10 1.13848 0.67693 1.18463
11 1.13847 0.67693 1.18462
12 1.13847 0.67692 1.18462
13 1.13846 0.67692 1.18462

So, the closer the assigned starting values are to the effective results we would get by solving the equations, the faster do the PageRank values converge in the iterative computation. Less iterations are needed, which can be useful for providing more up to date search results, especially regarding the growth rate of the web. Starting point for an accurate presumption of the actual PageRank distribution may be the PageRank values of a former PageRank calculation. All the pages which are new in the index could get an initial PageRank of 1, which will then be a lot closer to the effective PageRank value after the first few iterations.

> > Additional Factors Influencing PageRank