- The Search Agents - http://www.thesearchagents.com -

Analyzing Googlebot’s Crawl of a Large Site During a Massive URL Migration

Posted by George Gearhart on October 25, 2010 @ 5:09 pm in Featured, SEO

We recently made a massive URL change at Geni, where millions of profile URLs were moved from the /genealogy/people path to the /people path.

We had myriad issues to work through, including:

  • Antiquated code that wasn’t properly 301ing URLs if a cookie wasn’t set (which meant we were accidentally cloaking the /genealogy URLs as 200 OK to most bots, but showing a 301 to anything that could set a cookie)
  • Caching issues (memcache [1] requires a lot of resources when scaling over tens of millions of URLs, and yes, we live and die by the belief that speed is important [2])
  • Resources for reporting and testing (I’m in my fifth week; we’re working on it!)
  • Indexed pages were decreasing significantly in Google Webmaster Tools (GWT)
  • We hadn’t updated our directory [3] or our XML Sitemaps to reflect the URL change yet (we wanted Google to see as many of the redirects as possible to speed up the migration process)
  • Recent changes to robots.txt had significantly increased crawling on our family tree [4] pages, which aren’t really optimized (and kinda slow for users but not bots, hi Flash).
  • We made significant changes to how we tag URLs in Google Analytics (this made it very difficult to see what was happening because we didn’t have an easy way to compare data – I know, slow down George, right?)
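
The cookie issue in the first bullet is the kind of thing a small check can surface: request the same URL with and without a cookie and compare the status codes. Below is a minimal sketch of that comparison. The `fetch` callable, the `session` cookie name, and the example paths are all hypothetical stand-ins (the post does not say which cookie or which client was involved), and `fake_fetch` only imitates the buggy behavior described above so the logic can run without a network.

```python
from typing import Callable, Dict, Optional

# Hypothetical fetcher signature: (url, cookies or None) -> HTTP status code.
Fetcher = Callable[[str, Optional[Dict[str, str]]], int]


def check_cookie_cloaking(url: str, fetch: Fetcher) -> dict:
    """Fetch a URL with and without a cookie and report any status mismatch."""
    status_without = fetch(url, None)
    status_with = fetch(url, {"session": "test"})  # cookie name is an assumption
    return {
        "url": url,
        "without_cookie": status_without,
        "with_cookie": status_with,
        "mismatch": status_without != status_with,
    }


def fake_fetch(url: str, cookies: Optional[Dict[str, str]]) -> int:
    """Stand-in for a real HTTP client, imitating the bug described above."""
    if url.startswith("/genealogy/"):
        # Old URLs only redirect when the client carries a cookie.
        return 301 if cookies else 200
    return 200


if __name__ == "__main__":
    print(check_cookie_cloaking("/genealogy/people/example", fake_fetch))
    print(check_cookie_cloaking("/people/example", fake_fetch))
```

In real use, `fake_fetch` would be replaced by a client that does not follow redirects, so the 301 itself is visible rather than the final 200.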

Luckily, traffic wasn’t (and at the time of this writing, still isn’t) affected. We’re pretty flat week over week (WoW), even though we introduced a MASS of confusion to bots. Right now we’re about two weeks into the URL change, and we updated the directory/Sitemaps a week ago (before we understood the cookie/accidental cloaking issue).

When trying to make sense of what was happening, I began sifting through the mountains of data in our access logs. Quite some time before I joined Geni, the team began parsing the access logs into a database. This isn’t a fun database, and the queries are very expensive [5] if you want to look at data across any time frame larger than a couple of hours, so my exploration was very slow and deliberate. To see what was happening, this is the query I ended up using:


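select
  date(timestamp) as day,
  count(*) as total,
  -- strip the host so absolute and relative URIs match, then take the first path segment
  SPLIT_PART(REPLACE(uri, 'http://www.geni.com', ''), '/', 2) as subdir
from click_logs
where timestamp > '2010-XX-XX 24:00:00'
  and timestamp < '2010-XX-XX 24:00:00'
  -- only requests whose user agent identifies as Googlebot
  and user_agent like 'Mozilla/5.0 (compatible; Googlebot%'
group by day, subdir
order by total desc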

I used the SPLIT_PART function to segregate by subdirectory (and used the REPLACE function to canonicalize the URL requests).
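
For anyone who wants to try the same bucketing outside the database, here is a rough Python equivalent of the SPLIT_PART/REPLACE expression. It is a sketch, not the code Geni ran: `split_part` is a small helper written to mimic the PostgreSQL function (1-based field index, empty string when the field does not exist), and the example paths are made up.

```python
def split_part(text: str, delimiter: str, n: int) -> str:
    """Mimic PostgreSQL SPLIT_PART: 1-based field index, '' if out of range."""
    parts = text.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


def subdir(uri: str, host: str = "http://www.geni.com") -> str:
    """Strip the host (the REPLACE step), then return the first path segment."""
    return split_part(uri.replace(host, ""), "/", 2)


if __name__ == "__main__":
    # Hypothetical request URIs, for illustration only.
    for uri in ("/people/Example-Person/1",
                "http://www.geni.com/genealogy/people/Example-Person/1",
                "/"):
        print(repr(uri), "->", repr(subdir(uri)))
```

Field 2 is the right index because a path such as /people/... splits into an empty string before the leading slash, then "people", and so on.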

I had to run this query in chunks of four days at a time (at approximately 450 seconds per query), and I ended up analyzing about 35 days’ worth of data so that I could gauge how Googlebot was allocating its crawl resources across different sections of our site.
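
The chunking can be scripted rather than done by hand. The sketch below generates consecutive four-day windows over a date range; the start date is a placeholder (the post does not give the real one), and the windows are half-open, which assumes the query is adjusted to use `>=` on the lower bound so that no rows are dropped or counted twice at the boundaries.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date,
                 days_per_chunk: int = 4) -> Iterator[Tuple[date, date]]:
    """Yield (chunk_start, chunk_end) half-open windows covering [start, end)."""
    step = timedelta(days=days_per_chunk)
    cursor = start
    while cursor < end:
        chunk_end = min(cursor + step, end)  # last window may be shorter
        yield cursor, chunk_end
        cursor = chunk_end


if __name__ == "__main__":
    start = date(2010, 9, 1)  # placeholder start date
    for lo, hi in date_windows(start, start + timedelta(days=35)):
        print(f"timestamp >= '{lo}' and timestamp < '{hi}'")
```

Thirty-five days in four-day chunks comes to nine windows, so at roughly 450 seconds each that is a little over an hour of query time in total.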

Once I put the data into Excel and figured out how to pivot everything, I was able to generate this graph (limited to subdirs that receive a meaningful crawl):
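
The pivot step can also be done in a few lines of code instead of Excel. This is a minimal sketch assuming the query results arrive as (day, total, subdir) tuples; the `min_total` cutoff is a hypothetical way to express "subdirs that receive a meaningful crawl", and the sample numbers are invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, str]  # (day, total, subdir), matching the query's columns


def pivot_by_day(rows: Iterable[Row],
                 min_total: int = 0) -> Tuple[List[str], Dict[str, List[int]]]:
    """Return (columns, pivot) where pivot[day] lists totals per subdir column."""
    table: Dict[str, Dict[str, int]] = defaultdict(dict)
    column_totals: Dict[str, int] = defaultdict(int)
    for day, total, subdir in rows:
        table[day][subdir] = table[day].get(subdir, 0) + total
        column_totals[subdir] += total
    # Keep only subdirectories whose overall crawl volume meets the cutoff.
    columns = sorted(c for c, t in column_totals.items() if t >= min_total)
    pivot = {day: [table[day].get(c, 0) for c in columns]
             for day in sorted(table)}
    return columns, pivot


if __name__ == "__main__":
    sample = [  # invented numbers, for illustration only
        ("2010-10-01", 200, "genealogy"), ("2010-10-01", 50, "people"),
        ("2010-10-02", 120, "people"), ("2010-10-02", 3, "popular"),
    ]
    columns, pivot = pivot_by_day(sample, min_total=10)
    print(["day"] + columns)
    for day, counts in pivot.items():
        print([day] + counts)
```

Each row of the result is one day and each column one subdirectory, which is the shape a line chart needs.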


In this graph we can see a couple things that are relevant to the changes we had made in the previous 30 days:

  • Family tree pages (which I have yet to canonicalize) are getting crawled in spurts. This is taking resources away from the crawl of our migrated URLs, which are much more important to us.
  • The /genealogy/people URLs are still getting crawled in a massive way, while the /people URLs are not being crawled.

Based on this graph (and not yet realizing that we were accidentally cloaking 200 OK responses to bots), we updated Sitemaps and the directory to show the /people URLs instead of the /genealogy/people URLs.

After a couple more days we were still seeing drops in pages indexed and a significant amount of Googlebot resources crawling /genealogy/people instead of /people.  I would have expected it to shift sooner.

So I put in a request with one of our brilliant engineers to turn this into a report (saves me hours of my life on a recurring basis :) Thanks Scott!).

Some notes on this report:

  • We aren’t capturing anything in our click_logs table beyond 200 OK responses. We will be augmenting our data capture process in the days to come. This means that we aren’t seeing redirects from Googlebot in this report yet. (This type of reporting will also be very handy for identifying 404 errors and other oddities based on Googlebot requests.)
  • Yesterday evening we fixed the cookie/cloaking issue, and the change in crawl was both immediate and noticeable. Based on access log data that I reviewed during a similar migration about 2 years ago, this is a major improvement in Google’s crawl. There used to be a much longer lag between Googlebot seeing a major change and altering its crawl to explore the change.
  • We have a URL shortener that we use, and without condensing it down to a “category” instead of each individual URL, this report would be unreadable. For anyone implementing this type of reporting, you will definitely want to customize it so that you can use it to easily see issues.
  • I could probably write notes about this forever, but I’ll just show you the report:


The report immediately told us two very important things:

1. Googlebot immediately recognized the changes and shifted its crawl from /genealogy/people to /people URLs.
   a. We actually found the cookie/cloaking issue by analyzing this report yesterday.
   b. Again, this report doesn’t take anything besides 200 OK responses into account, so I’m unable to highlight requests for redirected pages yet.
2. We have a serious issue with caching in the popular family tree [4] section of our site (a completely separate issue that we weren’t aware of).
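
For readers who want a similar report without a click_logs table, the sketch below tallies Googlebot requests by category and status code straight from raw access log lines, so redirects and 404s show up alongside 200s. It rests on several assumptions that are not from the post: the logs are taken to be in the common Apache/Nginx "combined" format, the `category` rule (first path segment, with /genealogy/people kept as its own bucket) is only one way to condense individual URLs, and the sample lines are invented.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed "combined" log format; the real log layout is not given in the post.
LOG_LINE = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def category(path: str) -> str:
    """Collapse an individual request path into a report category."""
    parts = path.split("?")[0].split("/")
    first = parts[1] if len(parts) > 1 else ""
    second = parts[2] if len(parts) > 2 else ""
    if first == "genealogy" and second == "people":
        return "/genealogy/people"
    return "/" + first


def googlebot_report(lines: Iterable[str]) -> Counter:
    """Count Googlebot requests keyed by (category, HTTP status)."""
    counts: Counter = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "Googlebot" in match.group("agent"):
            key = (category(match.group("path")), int(match.group("status")))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    bot = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    sample = [  # invented log lines, for illustration only
        f'1.2.3.4 - - [25/Oct/2010:17:09:00 -0700] "GET /genealogy/people/Example/1 HTTP/1.1" 301 0 "-" "{bot}"',
        f'1.2.3.4 - - [25/Oct/2010:17:09:01 -0700] "GET /people/Example/1 HTTP/1.1" 200 5120 "-" "{bot}"',
    ]
    for key, count in sorted(googlebot_report(sample).items()):
        print(key, count)
```

Matching on the user agent string alone will also count anything that merely claims to be Googlebot, so a stricter report would verify the requesting IP as well.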

We should have put banners on this report; I would have been responsible for a few CPMs of revenue in the last 24 hours. :)

Hopefully this helps explain some of the things, both simple and complex, that we implemented to understand what was happening and to identify issues that we couldn’t see with our own eyes.


URL to article: http://www.thesearchagents.com/2010/10/analyzing-googlebot%e2%80%99s-crawl-of-a-large-site-during-a-massive-url-migration/

URLs in this post:

[1] memcache: http://memcached.org/

[2] speed is important: http://googlewebmastercentral.blogspot.com/2010/04/using-site-speed-in-web-search-ranking.html

[3] directory: http://www.geni.com/directory/people/a.html

[4] family tree: http://www.geni.com/popular

[5] very expensive: http://en.wikipedia.org/wiki/Query_optimizer#Cost_estimation

[6] Image: http://www.thesearchagents.com/wp-content/uploads/2010/10/graph.png

[7] Image: http://www.thesearchagents.com/wp-content/uploads/2010/10/googlebotlive.png

[8] How to Manage Your Search Strategy During Unplanned Server Downtime: http://www.thesearchagents.com/2012/09/how-to-manage-your-search-strategy-during-unplanned-server-downtime/

[9] URL Confusion and How to Fix it: http://www.thesearchagents.com/2009/12/url-confusion-and-how-to-fix-it/

[10] SEO Website Migration Checklist: http://www.thesearchagents.com/2012/03/seo-website-migration-checklist/
