A QueryPath script for checking on a sitemap

Feb 15 2010

I've been tuning our sitemap during the last few months, and one thing I needed was a quick tool to check on the effectiveness of various sitemap generation strategies.

To do this, I wrote a quick QueryPath script (see a full-sized image of the output). The script is explained below.

The code is pretty straightforward. It retrieves the sitemap from a URL, pulls out each entry's location and priority, and sorts the entries by priority. Finally, it displays the top 100 entries. I've tested it on sitemaps with over 20,000 items. While it is a little slow on such a large document, it works fine.

#!/usr/bin/env php
<?php
require 'QueryPath/QueryPath.php';

define('MAX_ITEMS', 100);

$sitemap = 'http://example.com/sitemap.xml';

$urls = array();
print "Parsing sitemap...\n";
$qp = qp($sitemap, ':root>url>loc');
$size = $qp->size();
$max = min($size, MAX_ITEMS);
printf("Found %d entries; printing top %d\n\n", $size, $max);

try {
    foreach ($qp as $url) {
        $loc = $url->text();
        $score = $url->nextAll('priority')->text();
        $urls[$loc] = $score;
    }
} catch (Exception $e) {
    print $e->getMessage();
}

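// Sort by priority, highest first, keeping each URL attached to its score.
arsort($urls);

$filter = "%d: %0.5f  %s\n";

// Start the counter explicitly so PHP does not warn about an undefined variable.
$i = 0;
foreach ($urls as $uri => $score) {
    if ($i++ == $max) break;
    printf($filter, $i, $score, $uri);
}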
?>

In short, the script pulls every URL out of the sitemap and sorts the URLs by their priority score, highest first. Only the top MAX_ITEMS (100) are shown.
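One thing the script does not handle is an entry with no priority element: its score comes back as an empty string, so it sorts to the bottom and prints as 0.00000. The sitemap protocol gives such entries a default priority of 0.5. Here is a rough sketch of the sort-and-trim step with that default applied. The helper name top_entries() and the sample data are made up for illustration and are not part of the script above; the sketch needs no QueryPath.

```php
<?php
// Hypothetical helper, not part of the original script.
// Takes a map of URL => priority string (possibly empty) and returns
// the $max highest-priority entries, highest first.
function top_entries(array $urls, $max)
{
    foreach ($urls as $loc => $score) {
        // The sitemap protocol defines 0.5 as the default priority.
        $urls[$loc] = ($score === '' || $score === null) ? 0.5 : (float) $score;
    }
    arsort($urls);                            // sort by score, descending, keeping keys
    return array_slice($urls, 0, $max, true); // keep only the first $max entries
}

// Made-up sample data standing in for parsed sitemap entries.
$sample = array(
    'http://example.com/'        => '1.0',
    'http://example.com/about'   => '',
    'http://example.com/archive' => '0.2',
    'http://example.com/blog'    => '0.8',
);

$i = 0;
foreach (top_entries($sample, 3) as $uri => $score) {
    printf("%d: %0.5f  %s\n", ++$i, $score, $uri);
}
```

Using array_slice() instead of a counter and a break keeps the trimming separate from the printing, which makes it easier to reuse the list elsewhere.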