- The Search Agents - http://www.thesearchagents.com -

Analyzing Googlebot’s Crawl of a Large Site During a Massive URL Migration

We recently made a massive URL change at Geni where millions of profile URLs were changed from this format:

/genealogy/people/…

to this format:

/people/…
We had myriad issues to work through, including directory/Sitemap updates and a cookie/accidental-cloaking problem (both covered below).

Luckily, traffic wasn’t (and at the time of this writing, still isn’t) affected.  We’re pretty flat WoW, even though we introduced a MASS of confusion to bots.  Right now we’re about two weeks into the URL change, and we updated the directory/Sitemaps a week ago (before we understood the cookie/accidental cloaking issue).

When trying to make sense of what was happening, I began sifting through the mountains of data in our access logs.  Quite some time before I joined Geni, they began parsing their access logs into a database.  This isn’t a fun database to work with, and the queries are very expensive if you want to look at data across any time frame larger than a couple of hours, so my exploration was slow and deliberate.  To see what was happening, this is the query I ended up using:


select
  date(timestamp) as day,
  count(*) as total,
  SPLIT_PART(REPLACE(uri, 'http://www.geni.com', ''), '/', 2) as subdir
from click_logs
where timestamp > '2010-XX-XX 24:00:00'
  and timestamp < '2010-XX-XX 24:00:00'
  and user_agent like 'Mozilla/5.0 (compatible; Googlebot%'
group by day, subdir
order by total desc

I used the SPLIT_PART function to segregate by subdirectory (and used the REPLACE function to canonicalize the URL requests).
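In Python terms, that extraction step looks roughly like this (a sketch; the example URIs are hypothetical, not taken from our logs):

```python
# Sketch of what REPLACE + SPLIT_PART(..., '/', 2) do to a logged URI:
# strip the host if present, then keep the top-level path component.

def extract_subdir(uri: str) -> str:
    """Canonicalize the URI, then return its first path segment."""
    path = uri.replace("http://www.geni.com", "")
    parts = path.split("/")
    # SPLIT_PART is 1-indexed: field 2 is the text right after the
    # leading slash, i.e. the top-level subdirectory.
    return parts[1] if len(parts) > 1 else ""

print(extract_subdir("http://www.geni.com/genealogy/people/some-profile"))  # genealogy
print(extract_subdir("/people/some-profile"))  # people
```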

I had to run this query in chunks of four days at a time (at approximately 450 seconds per query), and I ended up analyzing about 35 days’ worth of data so that I could gauge how Googlebot allocated its crawl resources across different sections of our site.
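Generating the four-day windows by hand gets tedious; a small helper (a sketch, with placeholder dates) can emit the timestamp pairs to plug into the query:

```python
from datetime import date, timedelta

def date_chunks(start: date, end: date, days: int = 4):
    """Yield (chunk_start, chunk_end) pairs covering [start, end) in
    fixed-size windows, for running an expensive query piecewise."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=days), end)
        yield cur, nxt
        cur = nxt

# Placeholder dates; the real run covered roughly 35 days.
for lo, hi in date_chunks(date(2010, 1, 1), date(2010, 2, 5)):
    print(lo, hi)
```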

Once I put the data into Excel and figured out how to pivot everything, I was able to generate this graph (limited to subdirs that receive a meaningful crawl):
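Excel’s pivot isn’t the only option; the same step (days as rows, subdirs as columns) can be sketched in plain Python, using made-up counts in place of the real query output:

```python
from collections import defaultdict

# Hypothetical query output: one row per (day, subdir) with a crawl count.
rows = [
    ("2010-01-01", "genealogy", 9000),
    ("2010-01-01", "people", 150),
    ("2010-01-02", "genealogy", 8700),
    ("2010-01-02", "people", 400),
]

# Pivot: nested dict keyed by day, then subdir.
pivot = defaultdict(dict)
for day, subdir, total in rows:
    pivot[day][subdir] = total

subdirs = sorted({s for _, s, _ in rows})
for day in sorted(pivot):
    # One output row per day, one column per subdir (0 if never crawled).
    print(day, [pivot[day].get(s, 0) for s in subdirs])
```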


In this graph we can see a couple of things that are relevant to the changes we had made in the previous 30 days:

– Family tree pages (which I have yet to canonicalize) are getting crawled in spurts.  This is taking resources away from the crawl of our migrated URLs, which are much more important to us.

– The /genealogy/people URLs are still getting crawled in a massive way, while the /people URLs are not being crawled.

Based on this graph (and before we realized we were accidentally serving cloaked 200 OK pages to bots), we updated our Sitemaps and the directory to list the /people URLs instead of the /genealogy/people URLs.

After a couple more days we were still seeing drops in pages indexed, and a significant amount of Googlebot’s resources were still going to /genealogy/people instead of /people.  I would have expected the shift to happen sooner.

So I put in a request with one of our brilliant engineers to turn this into a report (saves me hours of my life on a recurring basis :) Thanks Scott!).

Some notes on this report:

– We aren’t capturing anything in our click_logs table beyond 200 OK responses, but we will be augmenting our data capture process in the days to come.  This means that we aren’t seeing redirects served to Googlebot in this report yet.  (This type of reporting will also be very handy for identifying 404 errors and other oddities in Googlebot’s requests.)

– Yesterday evening we fixed the cookie/cloaking issue, and the change in the crawl was both immediate and noticeable.  Based on access log data I reviewed during a similar migration about two years ago, this is a major improvement in Google’s crawl: there used to be a much longer lag between Googlebot seeing a major change and altering its crawl to explore it.

– We have a URL shortener, and without condensing its URLs down to a single “category” instead of listing each one individually, this report would be unreadable.  Anyone implementing this type of reporting will definitely want to customize it to make issues easy to spot.

– I could probably write notes about this forever, but I’ll just show you the report:


The report immediately told us two very important things:

1. Googlebot immediately recognized the changes and shifted its crawl from the /genealogy/people URLs to the /people URLs.

a. (We actually found the cookie/cloaking issue while analyzing this report yesterday.)

b. Again, this report doesn’t take anything besides 200 OK responses into account, so I’m unable to highlight requests for redirected pages yet.

2. We have a serious issue with caching in our popular family tree section of the site (a completely separate issue that we weren’t aware of).
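Once non-200 responses are being captured, tallying Googlebot’s status codes per subdirectory could be sketched like this (the log lines and regex here are illustrative, assuming a combined-log-style format; field positions will differ in real logs):

```python
import re
from collections import Counter

# Hypothetical access-log lines, invented for illustration.
LOG_LINES = [
    '66.249.1.1 - - [01/Jan/2010:00:00:01 +0000] "GET /genealogy/people/x HTTP/1.1" 301 0 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.1.2 - - [01/Jan/2010:00:00:02 +0000] "GET /people/x HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '10.0.0.1 - - [01/Jan/2010:00:00:03 +0000] "GET /people/y HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (Windows)"',
]

# Pull the request path and the three-digit status out of each line.
LINE_RE = re.compile(r'"(?:GET|POST) (\S+) [^"]+" (\d{3})')

def googlebot_status_counts(lines):
    """Count (subdir, status) pairs for Googlebot requests only."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue  # ignore non-bot traffic
        m = LINE_RE.search(line)
        if not m:
            continue
        path, status = m.groups()
        subdir = path.split("/")[1] if "/" in path else path
        counts[(subdir, status)] += 1
    return counts

print(googlebot_status_counts(LOG_LINES))
```

A report built on tallies like these would surface the redirect and 404 patterns mentioned in the notes above.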

We should have put banners on this report; I would have been responsible for a few CPMs of revenue in the last 24 hours. :)
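For anyone building a similar report, the condense-to-a-category step mentioned in the notes above could be sketched as an ordered list of prefix rules, with the shortener’s paths collapsed into a single bucket (all prefixes here are hypothetical, not our actual rules):

```python
# Collapse individual URLs into report categories. The short-link rule is
# the important one: without it, every unique shortened URL would get its
# own row and drown the report. All prefixes below are hypothetical.
CATEGORY_RULES = [
    ("/people/", "profiles (new)"),
    ("/genealogy/people/", "profiles (old)"),
    ("/family-tree/", "family tree"),
]

def categorize(path: str, short_link_prefix: str = "/s/") -> str:
    """Map a request path to a coarse report category."""
    if path.startswith(short_link_prefix):
        return "short links"  # one bucket for every shortener URL
    for prefix, name in CATEGORY_RULES:
        if path.startswith(prefix):
            return name
    return "other"
```

First match wins, so more specific prefixes should be listed before any prefix they share a leading path with.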

Hopefully this is helpful in explaining some of the simple and complex things we implemented to understand what was happening and to identify issues that we weren’t able to see with our own eyes.

About George Gearhart

George is a former employee, longtime friend, and current in-house SEO for a genealogy start-up.