By Matt Butcher
5 Differences: Moving from XML Sitemap module to Google's Sitemap Generators
For a large site that I maintain, we recently disabled the XML Sitemap module (we're using the 1.x branch) and switched to the Google Sitemap Generators tool (the Python one). We have noticed a few unsurprising things, and a few very surprising things.
We identified five big differences (all positive) that we have seen since moving to the Google Sitemap Generators Python tool.
The Google Sitemap Generators Python script is composed of a single executable script, sitemap_gen.py, that can be configured to use various sources of input to generate a site map. We use our server's access logs to generate the map.
I hear that a newer Google tool provides yet another way of capturing this information, but this requires Apache. We run Nginx.
Here are five substantial differences we have noticed since we started running the Python script a month ago.
1. Unsurprisingly, our performance improved
One of the main reasons we turned off the XML Sitemap module was its known performance issues. There is no need to go into details: we have tens of thousands of nodes, and that brought the generator to its knees on occasion.
In our new configuration, the Python script that generates the sitemap runs once during the day at an off-peak time. While the building of the site map still takes time, we control exactly when it happens. And since it is a secondary process (not part of the web site) we have open possibilities for further tuning, like nice-ing the process.
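A scheduled, low-priority run along these lines could be sketched as follows. This is only an illustration: the config path, the 03:30 schedule, and the helper name `build_sitemap_command` are assumptions, not our actual setup. The only piece taken from the tool itself is the `--config=` argument that sitemap_gen.py expects.

```python
import shlex


def build_sitemap_command(config_path, niceness=19, script="sitemap_gen.py"):
    """Return an argv list that runs the generator at reduced CPU priority.

    The script location and interpreter name are assumptions; adjust them
    to wherever sitemap_gen.py lives on the server.
    """
    if not 0 <= niceness <= 19:
        raise ValueError("niceness must be between 0 and 19")
    return ["nice", "-n", str(niceness), "python", script,
            f"--config={config_path}"]


if __name__ == "__main__":
    # Hypothetical config path, used only for illustration.
    cmd = build_sitemap_command("/etc/sitemap/config.xml")
    # Print a crontab line; 03:30 is an arbitrary guess at an off-peak time.
    print("30 3 * * * " + shlex.join(cmd))
```

Because the generator runs outside the web stack, the niceness value can be changed without touching the site at all.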
2. Surprisingly, our site map quickly got much better
We configured sitemap_gen.py to crawl several hundred megs of log files once a day and generate the sitemap based on that. We immediately noticed several things:
- We got much deeper coverage of our site.
- The rankings were much more appropriate.
- Our sitemap reflected the actual usage pattern of our visitors (something we noticed was not happening with the XML Sitemap module)
For example, articles that get a lot of traffic for our site (like a basic page on Degenerative Disk Disease) scored very high. Some that got less traffic (like a random article on whiplash) were ranked lower, as we would have expected.
We think the Python script has been more successful at ranking simply because we can have it delve deeper into our data. We currently have it mining about 875 MB of log file data per run. That is far more than before.
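To illustrate why log data tends to produce usage-shaped rankings, here is a minimal sketch that counts successful requests per path and scales them into a 0.1 to 1.0 priority. It is not the algorithm sitemap_gen.py uses internally; the log format (combined), the function names, and the scaling rule are all assumptions made for the example.

```python
import re
from collections import Counter

# Pulls the requested path and status code out of a combined-format log line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+" (\d{3})')


def count_hits(lines):
    """Count requests per path, keeping only those answered with 200."""
    hits = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match and match.group(2) == "200":
            hits[match.group(1)] += 1
    return hits


def priorities(hits):
    """Scale hit counts so the busiest path gets 1.0 (floor of 0.1)."""
    if not hits:
        return {}
    top = max(hits.values())
    return {path: round(max(0.1, count / top), 1)
            for path, count in hits.items()}
```

Under a scheme like this, a heavily visited article naturally floats to the top, and the more log data each run covers, the less noisy the counts become.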
3. Surprisingly, Google is crawling more
From logs and webmaster tools, we have seen three obvious trends:
- Google crawls us more frequently. We see googlebot almost daily, and it is making a large number of requests.
- Google is hitting pages on our site it has never hit before. Activity has gone up about 15-20% at least.
- New content is showing up in search results. Larger parts of our site, like the doctor directory, have been crawled very deeply. Pages several levels beneath the root are now showing up in searches.
4. Unsurprisingly, the sitemap got bigger
The specification says that a sitemap can have up to 50,000 entries as long as the file is no bigger than 10 MB. Our new site map is many times larger than our old map, and reflects better coverage of our site (though we still don't see everything show up in the site map).
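The limits quoted above are easy to check mechanically. The sketch below uses the 50,000-entry and 10 MB figures as cited in this post; the function names are made up for the example, and splitting into multiple files is shown only as one way to stay under the cap.

```python
MAX_URLS = 50_000               # entries per sitemap file, as cited above
MAX_BYTES = 10 * 1024 * 1024    # 10 MB, as cited above


def within_limits(url_count, size_bytes):
    """True if a single sitemap file stays inside both limits."""
    return url_count <= MAX_URLS and size_bytes <= MAX_BYTES


def split_into_sitemaps(urls, max_urls=MAX_URLS):
    """Break a long URL list into chunks that each fit in one file."""
    return [urls[i:i + max_urls] for i in range(0, len(urls), max_urls)]
```

For instance, a list of 120,000 URLs would be split into three chunks of 50,000, 50,000, and 20,000 entries.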
5. Surprisingly, we have more control
We've been able to conduct some very fine-grained tuning by applying regular expressions to URL patterns. The Python tool allows you to build sophisticated lists of filters like this:
<filter action="drop" type="regexp" pattern="/\.[^/]*" />
<filter action="drop" type="wildcard" pattern="*icons*" />
<filter action="drop" type="wildcard" pattern="*logos*" />
<filter action="drop" type="wildcard" pattern="*todo*" />
<filter action="drop" type="wildcard" pattern="*Easter*" />
<filter action="drop" type="wildcard" pattern="*/help/help/*" />
<filter action="drop" type="wildcard" pattern="*/press/*.gif" />
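To make the effect of such filters concrete, here is a small Python sketch of how a drop list might be evaluated. It is an approximation, not the tool's own code: the first-match-wins rule, the use of `fnmatch` for wildcards, and the example.com URLs are assumptions. The first pattern is written with an escaped dot, on the assumption that it is meant to target hidden dot-files.

```python
import fnmatch
import re

# (action, type, pattern) triples mirroring a few of the filters above.
FILTERS = [
    ("drop", "regexp", r"/\.[^/]*"),
    ("drop", "wildcard", "*icons*"),
    ("drop", "wildcard", "*/press/*.gif"),
]


def matches(kind, pattern, url):
    """Test one URL against one wildcard or regexp pattern."""
    if kind == "wildcard":
        return fnmatch.fnmatchcase(url, pattern)
    if kind == "regexp":
        return re.search(pattern, url) is not None
    raise ValueError(f"unknown filter type: {kind}")


def keep_url(url, filters=FILTERS):
    """Apply filters in order; the first match decides, no match keeps."""
    for action, kind, pattern in filters:
        if matches(kind, pattern, url):
            return action == "pass"
    return True
```

With these three filters, `http://www.example.com/.htaccess` and `http://www.example.com/press/logo.gif` would be dropped, while an ordinary content URL would pass through.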
We've also been able to very easily configure our sitemap to reflect our information architecture by using a URL List file, which looks something like this:
# TOP NAV
http://www.spine-health.com/ changefreq=daily priority=1.0
http://www.spine-health.com/conditions changefreq=monthly priority=0.9
# SECONDARY NAV - CONDITION CENTERS
http://www.spine-health.com/conditions/arthritis changefreq=monthly priority=0.8
http://www.spine-health.com/conditions/back-pain changefreq=monthly priority=0.8
http://www.spine-health.com/conditions/chronic-pain changefreq=monthly priority=0.8
http://www.spine-health.com/conditions/degenerative-disc-disease changefreq=monthly priority=0.8
# Rest of the file is removed
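The format is simple enough to read by hand or by script: one URL per line, optional key=value attributes after it, and # for comments. As a rough sketch (the function name and the dictionary layout are assumptions for illustration, not part of the Google tool), a reader for this format could look like this:

```python
def parse_url_list(text):
    """Parse URL-list text into a list of dicts, one per URL line."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines such as "# TOP NAV".
        if not line or line.startswith("#"):
            continue
        url, *attrs = line.split()
        entry = {"loc": url}
        for attr in attrs:
            key, sep, value = attr.partition("=")
            if sep:  # ignore tokens that are not key=value pairs
                entry[key] = value
        entries.append(entry)
    return entries
```

A quick script like this is handy for sanity-checking that every section of the information architecture carries the priority we intended before handing the file to the generator.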
This has made it simple for us to give the bot an idea of our IA.
Conclusion
Overall, we have been happy with our move. Though we miss the ease of maintenance we had before, the trade-off clearly has been worth it.
We are looking forward to taking another look at the XML Sitemap module when the 2.0 version stabilizes, though we probably won't get around to this until we migrate to Drupal 7.

Great overview
This is the first review I've read of the Python Google Sitemap Generator. I have a non-Drupal site for which I have been meaning to develop a generator of my own. After playing with Google's generator early on, I got the impression that my own tool would be better, as it would cover all content from the database. Google's generator left me wondering about pages that didn't exist in the access log.
Given your glowing review and my lack of time to develop my own generator, I think I'll deploy it.
The xml sitemap 1.x module
The xml sitemap 1.x module sure does have its problems, and it's good to see your success with Google's sitemap generator. You talk about greater control, but that presupposes the person understands regular expressions. Because the xml sitemap module offers a UI, it is easier for beginners to use. Granted, beginners may not be administering a site with tens of thousands of nodes. It would be interesting to know the difficulty of, and the opportunities for, integrating a module UI with the Google Sitemap Generator.
Regular expressions for sure
Yes, you totally have to be comfortable with regular expressions and wildcards and the like. The sitemap generator from Google is clearly intended for sys admins and web developers. There is nothing "plug-and-play" about it.
Yes, it's too bad you couldn't
Yes, it's too bad you couldn't give the 2.x version a try at all, since there's been a lot of hard work going into it and it's been very stable for a while. :(
There are pros and cons to an in-application generator (like the module) vs. outside/log generators. But in the end you have to use what is best for your situation.
I am running 2.x here!
I am in fact running the 2.x version here, and I'm very happy with it. I even read most of the code just to check -- and it looks great.
But, alas... it wasn't to be for the Spine Health site.
Sounds good Matt. :) Looking
Sounds good Matt. :) Looking forward to when you can give it a try on the Spine Health site!
As another follow-up, you
As another follow-up, you should probably mention that with the generator you don't get what is probably the most important XML link value, the last modification date. It tells search engines what's recent and what's not, so they can index your new stuff first. Changefreq and priority are nice, but for the most part they're kind of ignored by search engines since they're basically 'hints'.
Also, you've got a lot of 'fun' stuff in your sitemap now like the following links:
http://www.spine-health.com/#1 (duplicate page of your homepage)
http://www.spine-health.com/blog/10582 (empty blog)
http://www.spine-health.com/blog/10582/feed (empty blog's feed)
http://www.spine-health.com/doctor/%2525252fsaint-john/neurosu (search pages)
http://www.spine-health.com/taxonomy/term/955/feed (should be feed URL, but it ends up as duplicate page of taxonomy/term/955)
http://www.spine-health.com/y_key_bb9b3c4b67580d19.html (your Yahoo verification file)
But that's the nature of a log file crawler. :)
Some drawbacks, for sure
Last modification date is an interesting one. If I didn't see Google, Bing, Yahoo Slurp, and so on crawling our site all the time, I would worry more. But as it is, changefreq seems to be doing the trick. I was also under the impression that Google ignores modification date anyway.
Yes, the downside of the log crawler is that we do have to do some more cleaning. And stupid user typos (like the #1 thing above) end up in the sitemap for a few days.
Search pages, incidentally, were intentionally left in the sitemap. They're not actually search result pages -- they're nodes with strange aliases.
But, hey... with 20,0000 entries in the site map, it still seems to me to be a good trade-off.
In the end, though, if a module can perform the same task as an additional (admittedly not-maintenance-free) program, I'd happily go back to the module. First, though, I'd want to know that it performs, and that it can generate a nice deep site map.
A good navigation menu will
A good navigation menu will help users access necessary information quickly and easily. With an eye-catching look, it will increase the performance...
what's the best way
Would you consider the conventional approach or the site map generator to be the best way to create a site map? I use the tool, and in some places I find it really useful, though I wanted to edit the result a bit and was not in a position to do it.
good sitemap generator
Works very well. After configuring it, just set it up to run on a regular basis.
First of all, Go Daddy hosts
First of all, Go Daddy hosts my site. So I used the Go Daddy sitemap generator and came to find it only included URLs that are PARENT PAGES. So I tried using different sitemap generators and had the same problem (no child pages were included).
the site was maintained using
the site was maintained using Microsoft FrontPage 2003®. When it was decided to completely revise the site in 2007, one of the prime motivators was to move to being fully standards-based (XHTML and CSS).