How PageRank is calculated
On a simple level, we can tell quite a lot about how PageRank is calculated. This is because when Google was just a university research project, the creators of PageRank published a paper that detailed a formula for calculating it. This formula is now more than a handful of years old, and we suspect that it has changed somewhat since then. However, for detailing the over-riding principles of how PageRank works, it is as accurate today as the day it was written.
PR(A) = (1-d) + d (PR(T1) /C(T1) + ... + PR(Tn) /C(Tn) )
Where PR(A) is the PageRank of Page A ( the one we want to work out ) .
D is a dampening factor. Nominally this is set to 0.85
PR(T1) is the PageRank of a site point ing to Page A
C(T1) is the number of links off that page
PR(Tn) /C(Tn) means we do that for each page point ing to Page A
Couldn’t be simpler right? Depending on your mathematical competence, this is
either remarkably easy to understand, or remarkably complex. So here’s the deal
– this isn’t a one-time calculation. It looks pretty above, but you can’t just do one
simple calculation and get an answer with the PageRank of a page! If you look at
it, to calculate the PageRank of Page A, we need to know the PageRank of all
the pages pointing to it. To calculate the PageRank of those pages we need to
know the PageRank of all the pages pointing to them (one of which could, and is
quite likely to be Page A!). If this seems like it could go around in circles forever,
that’s because it nearly does. We have to do a whole lot of little calculations
many times over before we actually get the answer. What this formula tells us is
that however you slice it, and however the formula may have changed since it
was written:
The PageRank given to Page A by a Page B pointing to it, is decreased with
each link to anywhere that exists on Page B. This means that a page’s
PageRank is essentially a measure of its vote; it can split that vote between one
link or two or many more, but its overall voting power will always remain the
same.
This really is important. So to be totally clear, let’s put some numbers to it (these
numbers are made up for demonstration purposes only, and have no relevancy
to any particular page). Say Page B has a PageRank value of 5 and has a single
link on it pointing to Page A. Page A’s PageRank is improved by a proportion of
Page B’s value of 5 (Page B doesn’t lose anything, but Page A gains). If Page B
has two links, that PageRank improvement would be split, and Page A would
only gain half the PageRank that it did before.
Now put the formula out of your mind for a moment, as it's easier to understand
how it works using a diagram. Let’s say we have a hypothetical set of pages
imaginatively titled Page A, Page B, Page C and Page D. They link to each other .
To begin with, in our example at least, we don’t know what the page’s starting
PageRanks are. There’s nothing special about the number that we pick to start
with (in fact, if you read forward to the section on convergence – you can see that
we can start at any number we want). Since in the last version of this paper we
performed the calculations by setting these values to 1 – we’re going to set them
to zero this time around in order to prove that it doesn’t matter what the starting
values are.
Next, we perform the necessary calculation to obtain the PageRank for each
page. The rules are:
1. We take a 0.85 * a page’s PageRank, and divide it by the number of
links on the page.
2. We add that amount on to a new total for each page it’s being passed
to.
3. We add 0.15 to each of those totals.
The first calculation is easy. Because we’ve started at zero – 0 * 0.85 is always 0.
So each page gets just 0.15 + 0. Meaning each page now has a PageRank of
0.15. Clearly we’re not done – we want to show the importance of each page
based upon links, and they’re all the same; so we need to run the calculation
again.
Page A links to pages B, C and D. Page A’s PageRank is 0.15 so it will add 0.85
* 0.15 = 0.1275 to the new PageRank scores of the pages it links to. There are
three of them so they each get 0.0425.
Page B links to page C. Page B’s PageRank is 0.15 so it will add 0.85 * 0.15 =
0.1275 to the new PageRank score of the pages it links to. Since it only links to
page C, page C will get it all.
Page C links to Page A, all 0.1275 passes to page A.
Page D links to Page C. Again all 0.1275 passes to page C.
The new totals for each page them become:
Page A: 0.15 (base) + 0.1275 (from Page C) = 0.2775
Page B: 0.15 (base) + 0.0425 (from Page A) = 0.1925
Page C: 0.15 (base) + 0.0425 (from Page A) + 0.1275 (from Page B) + 0.1275
(from Page D) = 0.4475
Page D: 0.15 (base) + 0.0425 (from Page A) = 0.1925
So we’ve got:
15
Pretty neat huh? Already we’re begining to see that Page C is probably the most
important page in the system (but we can’t be sure yet – it could well change).
We carry on doing these calculations until the value for each page no longer
changes (this is called convergence – there’s more about this in the next
section). In practice, Google probably doesn't wait for this convergence, but
instead run a number of iterations of the calculation which is likely to give them
fairly accurate values (more on this later as well). If we carried out all the
calculations for the example given, it would take us 143 calculations [ Excel
Example 1] and we’d reach final values of:
As suspected, Page C is the most important. If we take a quick look at these raw
values we can see something about the number of links pointing out from a page.
Look at Page A, which has a link from a high PageRank page (Page C), which
has only one outbound link. Then look at Page B and D; both share links from a
high PageRank page (Page A), with three outbound links. The number of links s
ignificantly alters the way PageRank is distributed.



Comments
Post a Comment