5 Differences: Moving from XML Sitemap module to Google's Sitemap Generators

Feb 15 2010

For a large site that I maintain, we recently disabled the XML Sitemap module (we're using the 1.x branch) and switched to the Google Sitemap Generators tool (the Python one). We have noticed a few unsurprising things, and a few very surprising things.

We identified five big differences (all positive) since moving to the Google Sitemap Generators Python tool.

The tool consists of a single executable script, sitemap_gen.py, that can be configured to use various sources of input to generate a site map. We use our server's access logs to generate the map.

I hear that a newer Google tool provides yet another way of capturing this information, but this requires Apache. We run Nginx.

Here are five substantial differences we have noticed since we started running the Python script a month ago.

1. Unsurprisingly, our performance improved

One of the main reasons we turned the module off was its known performance issues. There is no need to go into details here: we have tens of thousands of nodes, and that brought the generator to its knees on occasion.

In our new configuration, the Python script that generates the sitemap runs once a day at an off-peak time. While building the site map still takes time, we control exactly when it happens. And since it is a secondary process (not part of the web site), we have room for further tuning, such as nice-ing the process.
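As a rough illustration of the nice-ing idea, here is a minimal sketch of a wrapper that builds the command line for a low-priority run. The function name, the default interpreter name, and the config path are assumptions made for this example; the `--config=` option is how sitemap_gen.py is pointed at its configuration file.

```python
def build_sitemap_command(config_path, niceness=19,
                          python="python", script="sitemap_gen.py"):
    """Return an argv list that runs the generator at reduced CPU priority.

    Nothing is executed here; the list is meant to be handed to
    subprocess.run() or pasted into a cron entry.
    """
    return [
        "nice", "-n", str(niceness),   # lowest scheduling priority by default
        python, script,
        "--config=" + config_path,     # hypothetical path supplied by caller
    ]
```

Calling `subprocess.run(build_sitemap_command("/path/to/config.xml"), check=True)` from a scheduled job would then run the generator without competing with the web server for CPU.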

2. Surprisingly, our site map quickly got much better

We configured sitemap_gen.py to crawl several hundred megs of log files once a day and generate the sitemap based on that. We immediately noticed several things:

  • We got much deeper coverage of our site.
  • The rankings were much more appropriate.
  • Our sitemap reflected the actual usage pattern of our visitors (something we noticed was not happening with the XML Sitemap module)

For example, articles that get a lot of traffic on our site (like a basic page on Degenerative Disc Disease) scored very high. Some that got less traffic (like a random article on whiplash) were ranked lower, as we would have expected.

We think the Python script has been more successful at ranking simply because we can have it delve deeper into our data. We are currently mining about 875 MB of log file data per run. That is far more than before.
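To make the idea of traffic-based ranking concrete, here is a minimal sketch that counts successful requests per URL in an access log and scales the counts into sitemap priorities. It assumes the common/combined log format, and the scaling rule is an assumption made for illustration, not the formula sitemap_gen.py actually uses.

```python
import re
from collections import Counter

# Pulls the request path and status code out of a common/combined log line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')

def count_hits(log_lines):
    """Count requests per path, keeping only 200 responses."""
    hits = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "200":
            hits[match.group(1)] += 1
    return hits

def priorities(hits):
    """Scale hit counts to priorities relative to the busiest path.

    The busiest path gets 1.0; nothing drops below 0.1.
    """
    if not hits:
        return {}
    top = max(hits.values())
    return {path: max(0.1, round(count / top, 1))
            for path, count in hits.items()}
```

With more log data per run, the counts for rarely visited pages become less noisy, which is one plausible reason the deeper mining produced more sensible rankings.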

3. Surprisingly, Google is crawling more

From logs and webmaster tools, we have seen three obvious trends:

  • Google crawls us more frequently. We see googlebot almost daily, and it is making a large number of requests.
  • Google is hitting pages on our site it has never hit before. Activity has gone up by at least 15-20%.
  • New content is showing up in search results. Larger parts of our site, like the doctor directory, have been crawled very deeply. Pages several levels beneath the root are now showing up in searches.

4. Unsurprisingly, the sitemap got bigger

The specification says that a sitemap can have up to 50,000 entries as long as the file is no bigger than 10 MB. Our new site map is many times larger than our old map, and reflects better coverage of our site (though we still don't see everything show up in the site map).
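The two limits are easy to check in code. The following is a minimal sketch, with hypothetical helper names, of splitting a URL collection into sitemap-sized chunks and checking a generated file against the size cap. It illustrates the limits only and is not part of the Google tool.

```python
MAX_ENTRIES = 50000             # URL limit per sitemap file
MAX_BYTES = 10 * 1024 * 1024    # 10 MB size limit per sitemap file

def split_into_sitemaps(urls, max_entries=MAX_ENTRIES):
    """Split URLs into consecutive chunks that each fit in one sitemap."""
    urls = list(urls)
    return [urls[i:i + max_entries]
            for i in range(0, len(urls), max_entries)]

def within_size_limit(xml_text, max_bytes=MAX_BYTES):
    """Return True if the serialized sitemap fits under the byte limit."""
    return len(xml_text.encode("utf-8")) <= max_bytes
```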

5. Surprisingly, we have more control

We've been able to conduct some very fine-grained tuning by applying regular expressions to URL patterns. The Python tool allows you to build sophisticated lists of filters like this:

<filter action="drop" type="regexp" pattern="/\.[^/]*" />
<filter action="drop" type="wildcard" pattern="*icons*" />
<filter action="drop" type="wildcard" pattern="*logos*" />
<filter action="drop" type="wildcard" pattern="*todo*" />
<filter action="drop" type="wildcard" pattern="*Easter*" />
<filter action="drop" type="wildcard" pattern="*/help/help/*" />
<filter action="drop" type="wildcard" pattern="*/press/*.gif" />
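To show how drop filters like these behave, here is a minimal sketch that applies a few of them to URLs. It assumes wildcard patterns are matched case-sensitively against the whole URL and regexp patterns are searched anywhere in it; that is my understanding of the tool's behavior, but treat it as an approximation. The example.com URLs in the usage note are made up.

```python
import fnmatch
import re

# (type, pattern) pairs mirroring some of the drop filters above.
DROP_FILTERS = [
    ("regexp", r"/\.[^/]*"),        # any path segment starting with a dot
    ("wildcard", "*icons*"),
    ("wildcard", "*/press/*.gif"),
]

def is_dropped(url, filters=DROP_FILTERS):
    """Return True if any drop filter matches the URL."""
    for kind, pattern in filters:
        if kind == "wildcard" and fnmatch.fnmatchcase(url, pattern):
            return True
        if kind == "regexp" and re.search(pattern, url):
            return True
    return False
```

For instance, `is_dropped("http://www.example.com/press/logo.gif")` is true because of the last wildcard, while `is_dropped("http://www.example.com/conditions/arthritis")` is false.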

We've also been able to very easily configure our sitemap to reflect our information architecture by using a URL List file, which looks something like this:

http://www.spine-health.com/ changefreq=daily priority=1.0
http://www.spine-health.com/conditions changefreq=monthly priority=0.9

http://www.spine-health.com/conditions/arthritis changefreq=monthly priority=0.8
http://www.spine-health.com/conditions/back-pain changefreq=monthly priority=0.8
http://www.spine-health.com/conditions/chronic-pain changefreq=monthly priority=0.8
http://www.spine-health.com/conditions/degenerative-disc-disease changefreq=monthly priority=0.8

# Rest of the file is removed

This has made it simple for us to give the bot an idea of our IA.
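The URL List format shown above is simple enough to read with a few lines of code. Here is a minimal sketch of a parser, assuming each non-blank, non-comment line is a URL followed by whitespace-separated key=value attributes. The function name and the dictionary layout are choices made for this example.

```python
def parse_url_list(lines):
    """Parse URL List lines into dicts such as
    {"loc": url, "changefreq": "daily", "priority": "1.0"}.
    """
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        url, *attrs = line.split()
        entry = {"loc": url}
        for attr in attrs:
            key, _, value = attr.partition("=")
            entry[key] = value
        entries.append(entry)
    return entries
```

A script like this is handy for sanity-checking the file (for example, spotting a section page whose priority is lower than its children) before handing it to the generator.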


Overall, we have been happy with our move. Though we miss the ease of maintenance we had before, the trade-off clearly has been worth it.

We are looking forward to taking another look at the XML Sitemap module when the 2.0 version stabilizes, though we probably won't get around to this until we migrate to Drupal 7.