- The PageRank Algorithm
- The Implementation of PageRank
- The Effect of Inbound Links
- The Effect of Outbound Links
- The Effect of the Number of Pages
- The Distribution of PageRank
- The Yahoo Bonus
- Additional Factors Influencing PageRank
Additional Factors Influencing PageRank
It has been widely discussed if additional criteria beyond the link structure of the web have been implemented in the PageRank algorithm since the scientific work on PageRank has been published by Lawrence Page and Sergey Brin. Lawrence Page himself outlines the following potential influencing factors in his patent specifications for PageRank:
- Visibility of a link
- Position of a link within a document
- Distance between web pages
- Importance of a linking page
- Up-to-dateness of a linking page
First of all, the implementation of additional criteria in PageRank would result in a better approximation of human usage regarding the Random Surfer Model. Considering the visibility of a link and its position within a document implies that a user does not click on links completely at haphazard, but rather follows links which are highly and immediately visible regardless of their anchor text. The other criteria would give Google more flexibility in determing in how far an inbound link of a page should be considered important, than the methods which have been described so far.
Whether or not the above mentioned factors are actually implemented in PageRank can not be proved empirically and shall not be discussed here. It shall rather be illustrated in which way additional influencing factors can be implemented in the PageRank algorithm and which options the Google search engine thereby gets in terms of influencing PageRank values.
Modification of the PageRank Algorithm
To implement additional factors in PageRank, the original PageRank algorithm has again to be modified. Since we have to assume that PageRank calculations are still based on numerous iterations and for the purpose of short computation times, we have to consider to keep the number of database queries during the iterations as small as possible. Therefore, the following modification of the PageRank algorithm shall be assumed:
PR(A) = (1-d) + d (PR(T1)+L(T1,A) + ... + PR(Tn)+L(Tn,A))
Here, L(Ti,A) represents the evaluation of a link which points from page Ti to page A. L(Ti,A) withal replaces the PageRank weighting of page Ti by the number of outbound links on page Ti which was given by 1/C(Ti). L(Ti,A) may consist of several factors, each of them having to be determined only once and then being merged to one value before the iterative PageRank calculation begins. So, the number of database queries during the iterations stays the same, although, admittedly, a much larger database has to be queried at each step in comparison to the computation by use of the original algorithm, since now there is an evaluation of each link instead of an evaluation of pages (by the number of their outbound links).
Different Evaluation of Links within a Document
Two of the criteria for the evaluation of links mentioned by Lawrence Page in his PageRank patent specifications are the visibilty of a link and its position within a document. Regarding the Random Surfer Model, those criteria reflect the probability for the random surfer clicking on a link on a specific web page. In the original PageRank algorithm, this probability is given by the term (1/C(Ti)), whereby the probability is equal for each link on one page.
Assigning different probabilities to each link on a page can, for instance, be realized as follows:
We take a look at a web consisting of three pages A, B anc C, where each of these pages has outbound links to both of the other pages. Links are weighted by two evaluation criteria X and Y. X represents the visibility of a link. X equals 1 if a link is not particularly emphasized, and 2 if the link is, for instance, bold or italic. Y represents the position of a link within a document. Y equals 1 if the link is on the lower half of the page, and 3 if the link is on the upper half of the page. If we assume a multiplicative correlation between X and Y, the links in our example are evaluated as follows:
X(A,B) + Y(A,B) = 1 + 3 = 3
X(A,C) + Y(A,C) = 1 + 1 = 1
X(B,A) + Y(B,A) = 2 + 3 = 6
X(B,C) + Y(B,C) = 2 + 1 = 2
X(C,A) + Y(C,A) = 2 + 3 = 6
X(C,B) + Y(C,B) = 2 + 1 = 2
For the purpose of determinig the single factors L, the evaluated links must not simply be weighted by the number of outbound links on one page, but in fact by the total of evaluated links on the page. Thereby, we get the following weighting quotients Z(Ti) for the single pages Ti:
Z(A) = X(A,B) + Y(A,B) + X(A,C) + Y(A,C) = 4
Z(B) = X(B,A) + Y(B,A) + X(B,C) + Y(B,C) = 8
Z(C) = X(C,A) + Y(C,A) + X(C,B) + Y(C,B) = 8
The evaluation factors L(T1,T2) for a link which is pointing from page T1 to page T2 are hence given by:
L(T1,T2) = X(T1,T2) + Y(T1,T2) / Z(T1)
Their values regarding our example are as follows:
L(A,B) = 0.75
L(A,C) = 0.25
L(B,A) = 0.75
L(B,C) = 0.25
L(C,A) = 0.75
L(C,B) = 0.25
At a damping factor d of 0.5, we get the following equations for the calculation of PageRank values:
PR(A) = 0.5 + 0.5 (0.75 PR(B) + 0.75 PR(C))
PR(B) = 0.5 + 0.5 (0.75 PR(A) + 0.25 PR(C))
PR(C) = 0.5 + 0.5 (0.25 PR(A) + 0.25 PR(B))
Solving these equations gives us the follwing PageRank values for our example:
PR(A) = 819/693
PR(B) = 721/693
PR(C) = 539/693
First of all, we see that page A has the highest PageRank of all three pages. This is caused by page A receiving the relatively higher evaluated link from page B as well as from page C.
Furthermore, we see that even by the evaluation of single links, the sum of the PageRank values of all pages equals 3 (2079/693) and thereby the total number of pages. So, the PageRank values computed by our modified PageRank algorithm can be used for the general ranking of web pages by Google without any normalisation being needed.
Different Evaluation of Links by Page Specific Criteria
Besides the unequal evaluation of links within a document, Lawrence Page mentions the possibility of evaluating links according to criteria which are based upon the linking page. At first glance, this does not seem necessary since it is the main principle of PageRank to rank pages the higher, the more high ranking pages link to them. But, at the time of their scientific work on PageRank, Page and Brin have already recognized that their algorithm is vulnerable to artificial inflation of PageRank.
An artificial influence on PageRank might be exerted by webmasters who generate a multitude of web pages whose links distribute PageRank in a way that single pages within that system receive a special importance. Those pages can have a high PageRank without being linked to from other pages with high PageRank. So, not only the concept of PageRank is undermined, but also the search engine’s index is spammed with an innumerable amount of web pages which were solely created to influence PageRank.
In his patent specifications for PageRank, Lawrence Page presents the evaluation of links by the distance between pages as a means to avoid the artificial inflation of PageRank, because the bigger the distance between two pages, the less likely has one webmaster control over both. A criterium for the distance between two pages may be if they are on the same domain or not. In this way, internal links would be weighted less than external links. In the end, any general measure of the distance between links can be used to determine such a weighting. This comprehends if pages are on the same server or not and also the geographical distance between servers.
As another indicator for the importance of a document, Lawrence Page mentions the up-to-dateness of the documents which link to it. This argument considers that the information on a page is less likely outdated, the more pages which have been modified recently link to it. In contrast, the original PageRank concept, just like any method of measuring link popularity, favours older documents which gained their inbound links in the course of their existence and have at a higher probability been modified less recently than new documents. Basically, recently modified documents may be given a higher evaluation by weighting the factor (1-d). In this way, both those recently modified documents and the pages they link to receive a higher PageRank. But, if a page has been modified recently, is not necessarily an indicator for the importance of the information presented on it. So, as suggested by Lawrence Page, it is advisable not to favour recently modified pages but only their outbound links.
Finally, Page mentions the importance of the web location of a page as an indicator of the importance of its outbound links. As an example for an important web location he names the root page of a domain, but, in the end, Google could exert influence on PageRank absolutely arbitrarily.
To implement the evaluation of the linking page into PageRank, the evaluation factor of the modified algorithm must consist of several single factors. For a link that points from page Ti to page A, it can be given as follows:
L(Ti,A) = K(Ti,A) + K1(Ti) + ... + Km(Ti)
where K(Ti,A) is the above presented weighting of a single link within a page by its visibility or position. Additionally, an evaluation of page Ti by m criteria which are represented by the factors Kj(Ti) takes place.
To implement the evaluation of the linking pages, not only the algorithm but also the proceedings of PageRank calculation have to be modified. This shall be illustrated by an example.
We take a look at a web consisting of three pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. The outbound links of one page are evaluated equally, so there is no weighting by visibilty or position. But now, the pages are evaluated by one criterium. In this way, an inbound link from page C shall be considered four times as important as an inbound link from one of the other pages. After weighting by the number of pages, we get the following evaluation factors:
K(A) = 0.5
K(B) = 0.5
K(C) = 2
At a damping factor d of 0.5, the equations for the computation of the PageRank values are given by:
PR(A) = 0.5 + 0.5 + 2 PR(C)
PR(B) = 0.5 + 0.5 + 0.5 + 0.5 PR(A)
PR(C) = 0.5 + 0.5 (0.5 PR(B) + 0.5 + 0.5 PR(A))
Solving the equations gives us the follwing PageRank values:
PR(A) = 4/3
PR(B) = 2/3
PR(C) = 5/6
At the current modifications of the PageRank algorithm, the accumulated PageRank of all pages no longer equals the number of pages. The reason therefore is that the weighting of the page evaluation by the number of pages was not appropriate. To determine the proper weighting, the web’s linking structure would have to be anticipated, which is not possible in case of the actual WWW. Therefore, the PageRank calculated by an evaluation of linking pages has to be normalized if there shall not be any unfounded effects on the general ranking of pages by Google. Within the iterative calculation, a normalization would have to take place after each iteration to minimize unintentional distortions.
In the case of a small web, the evaluation of pages often causes severe distortions. In the case of the actual WWW, these distortions should normally equalise by the number of pages. Indeed, it is to be expected that the evaluation of the distance between pages will cause distortions on PageRank, since pages with many inbound links surely tend to be linked to from different geographical regions. But such effects can be anticipated by experience from previous calculation periods, so that a normalisation would only have to be marginal.
In either case, implementing additional factors in PageRank is possible. Indeed, the computation of PageRank values would take more time.
Copyright for this article belongs to pr.efactory.de.