The comedian Groucho Marx once said of PageRank: “This is so simple a four-year-old child could understand it…Quick! Somebody get me a four-year-old child!” OK, perhaps he wasn’t talking about PageRank, but he easily could have been, because PageRank is so simple that any four-year-old could figure it out. And, I hope that with a few simple diagrams and math no more complicated than the pre-school level, I can explain it to you.
Back in the late 1990′s, when Larry Page and Sergey Brin were developing the search engine that became Google, the librarian of the Web, they ran into a problem. When the Web has millions or billions of pages, any search on basically any topic will turn up hundreds or thousands of good results. The question they faced was: How can they sort these pages from most important to least important?
To answer that question, you have to stop thinking like a librarian and step outside to the playground with me for a game of Special Hopscotch, a simplified version of hopscotch for those kids who have trouble mastering the complex rules of regular hopscotch. How does this game work? It’s easy: we draw some circles on the ground, one of which surrounds you, and connect those circles with arrows. Because this is Special Hopscotch, there is only 1 rule to remember: at each circle, jump randomly to one of the other circles to which your current circle is connected.
In Figure 1, let’s consider one of the simplest games of Special Hopscotch: there are two circles, A and B, each of which points to the other. The question is: If you jump from one circle to the other at a regular interval, what fraction of the time do you spend at circle A? This game is a lot like playing with a light switch – just call circle A ‘OFF’ and circle B ‘ON’. And anyone who has ever let a four-year-old play with a light switch knows that the light will go on and off as quickly as possible, spending half of the time in each state. The same is true in our hopscotch game: when there are only two circles and each points to the other, a person who jumps between them will spend 50% of their time at each. So, in an Internet with only two pages, each of which points to the other, both pages have the same PageRank: 50%.
Now let’s play the game again but we’ll add another circle, C, like in Figure 2. In the new game, circles A and B still point to each other, but also circle A points to C and C points to B. Now, if you were to play Special Hopscotch on this board, still starting at circle A, which fraction of your time would you spend at each circle? This one is a little trickier. (Actually, this question is really easy, since I’ve written the answers on the figure: 40% for A and B, 20% for C.) But if I hadn’t written the answers on the diagram, you’d at least know that the answers could not be 33.3% at each circle, because B doesn’t point to C and C doesn’t point to A.
One way to answer this question is just to get up and do it. Get some chalk and a patch of asphalt (or just put three pieces of paper on the ground) and start jumping around. Go ahead and try it. A little recess time never hurt anyone. Page and Brin decided it would be easier to write some computer programs to do the equivalent. One program, Googlebot, just surfs the web, drawing the map of the circles on the ground and the connections between them. Their PageRank calculator does ‘The Google Dance’ by playing hopscotch on those circles. (That’s not even close in practice to how the PageRanks are actually calculated, but in concept it’s very similar.)
The game Google plays is more complicated than our 3-circle example, since Google has to deal with cases like when they reach a page that doesn’t connect to any other pages (that is, a ‘dangling’ page). For that case, they’ve added a simple rule: If you reach a page that doesn’t connect to any other page, just restart the game at any randomly selected page. Even if you don’t reach a dangling page, you still might get caught in a loop of two or three pages that only connect to each other if one (or more) of those pages have inbound links from the rest of Web, but don’t link outwards to any other page outside the group. (This is similar to a dangling page, just with a cluster of pages rather than a single one.) To handle dangling clusters, the Google guys added a third rule: Even if you don’t get to a dangling page, still restart the game every now and then. (It’s believed that Google uses a value of about 15% for how often this third rule is used.)
Those are the three basic rules that the ‘random surfer’ version of PageRank calculation follows:
1. When you’re at any page, jump along any random one of its outbound links.
2. If you reach a page with no outbound links, jump to any page on the entire Web at random.
3. Even if you don’t reach a page without outbound links, still jump to any page on the entire Web at random every now and then.
Some of the implications of this version of hopscotch are obvious. For example, the sum of all PageRanks across all pages on the Web must be 100%. This means that when a new link is made between pages, or a new page added to the Web, for one page’s PageRank to go up, someone else (or many other pages), must go down.
This ‘If I win, you lose’ implication is perhaps the reason why when some people talk about PageRank, they describe it as if every inbound link provides a dribble of ‘Googlejuice’ and every webmaster’s job is simply to get as many inbound links from high PageRank pages as possible.
But some implications of this game of hopscotch are not as obvious. For instance, if you actually play this game on a real segment of the Web, you’ll find that the PageRanks of some pages is many, many times larger than other pages. Sort of like the distribution of wealth on planet Earth: billions are very poor, a few hundred million are fairly well-off, and a handful are very rich. Google reports PageRank as a number from 0 to 10, but a PageRank of 5 is not 25% larger than 4, it’s more like 10 times larger. And 6 is 10 times larger than that. So, a highly ranked page can have a PageRank that’s literally millions times more than a low-ranked page.
Google itself in ‘Ten Things Google Has Found to be True‘ item #4 says that they use the PageRank algorithm because they believe “Democracy on the web works.” They say: “PageRank evaluates all of the sites linking to a web page and assigns them a value, based in part on the sites linking to them. By analyzing the full structure of the web, Google is able to determine which sites have been “voted” the best sources of information by those most interested in the information they offer. This technique actually improves as the web gets bigger, as each new site is another point of information and another vote to be counted.” And on their Technology Overview page they say: “PageRank also considers the importance of each page that casts a vote, as votes from some pages are considered to have greater value, thus giving the linked page greater value.”
Some democracy! What kind of democracy is it where some votes count millions of times more than others?
The important thing to realize is that PageRank is not a fluid and it’s not an election. PageRank is simply the answer to the question: If you follow the three ‘random surfer’ hopscotch rules listed above, what fraction of the time will you spend at any given page? That fraction is the PageRank of that page.
So, while the concept is very simple, the effects of changes in link structure on PageRank can be very non-intuitive. Some people are reluctant to give links, fearing that doing so might give away some of their PageRank. Others think that giving an outbound link will increase the PageRank of the receiving page, but doesn’t affect the PageRank of the giving page at all. Both are wrong. If you think about the calculation of PageRank as a game of hopscotch, then you realize that giving away an outbound link could decrease your PageRank, but it could actually increase it too, depending on the connectivity of all the other pages on the web. Consider a web of pages containing pages A, B and C linked as we discussed earlier, but also now joined by some other pages, D, E, F and G, connected as shown in Figure 3.
By playing hopscotch on this diagram (considering only the blue connections), we find that a ‘random surfer’ spends about 34.6% of their time on page B and 7.8% on page D. But, if we make a connection (the red line) from B to D, then the random surfer spends 34.8% on page B and 16.2% on page D. That is, by making a link from page B to page D, the PageRanks of both the giving page and the receiving page go up!
How can the PageRanks of both pages go up? Simple: because the PageRanks of all the other pages in the network go down. In this specific example, page A is hurt the most by the new link, going from a PageRank of 17.5% to 12.6%. (The caveat I’m making is, if all of these pages happen to be under your control, adding or deleting links or pages can affect the PageRanks of other pages in your domain, even ones that were not directly involved in the changes.)
All that having been said, the ‘random surfer’ version of the PageRank calculation was proposed by Brin and Page as part of their graduate studies and has almost certainly been modified since then. For example, one modification assumes that when restarting the game, the surfer does not jump to a random page on the web, but rather is biased toward pages that are similar to ones the surfer has visited in the past. (This is the basis for how Google’s personalized results feature works.) Other possible modifications assume that the surfer doesn’t follow links randomly, but rather is more ‘intentional’ or ‘intelligent’ – selecting prominent, interesting links more often than minor, uninteresting links. And on top of all that, Google admits that PageRank is only one factor in determining the ‘relevance’ of a page, which ultimately determine the order of search results, since things like the page’s loading time, the age of the page and the presence of malware or spam-like content on the site are also important.
But the point remains: The behavior of even the three-rule original version of PageRank is non-intuitive enough that it can’t be boiled down into simple dicta like: “Getting links is good; giving links is bad.” Admittedly, B’s increase for the specific example in Figure 3 was modest, but even so, giving a link was still beneficial, not harmful or neutral as many people like to believe. As Groucho said, “Who are you going to believe: me or your lying eyes?”