A QueryPath script for checking on a sitemap
Feb 15 2010
I've been tuning our sitemap during the last few months, and one thing I needed was a quick tool to check on the effectiveness of various sitemap generation strategies.
To do this, I wrote a quick QueryPath script (see a full-sized image of the output). The script is explained below.
The code is pretty straightforward. It retrieves the sitemap from a URL, parses its contents, and sorts the entries by priority. Finally, it displays the top 100 entries. I've tested it on sitemaps with over 20,000 items. While it is a little slow on such a large document, it works fine.
#!/usr/bin/env php
<?php
require 'QueryPath/QueryPath.php';

// Maximum number of entries to print.
define('MAX_ITEMS', 100);

$sitemap = 'http://example.com/sitemap.xml';
$urls = array();

print "Parsing sitemap...\n";
try {
  // Fetch and parse the sitemap, selecting every <loc> inside a <url>.
  $qp = qp($sitemap, ':root>url>loc');
  $size = $qp->size();
  $max = min($size, MAX_ITEMS);
  printf("Found %d entries; printing top %d\n\n", $size, $max);

  // Map each location to the priority listed after it (0.0 if none).
  foreach ($qp as $url) {
    $loc = $url->text();
    $urls[$loc] = (float) $url->nextAll('priority')->text();
  }
} catch (Exception $e) {
  print $e->getMessage() . "\n";
  exit(1);
}

// Sort by priority, highest first, keeping the URL keys.
arsort($urls);

$filter = "%d: %0.5f %s\n";
$i = 0;
foreach ($urls as $uri => $score) {
  if ($i++ == $max) break;
  printf($filter, $i, $score, $uri);
}
?>
In short, the script above fetches all of the URLs out of the sitemap and sorts them by their corresponding priority score. Only the top MAX_ITEMS (100) are shown.
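For readers without QueryPath installed, the same sort-and-truncate step can be sketched with PHP's built-in SimpleXML extension. This is only an illustration under stated assumptions, not the script above: the `top_sitemap_urls()` helper and the inline `$sample` sitemap are hypothetical, and a missing `<priority>` falls back to 0.5 (the default given by the sitemap protocol) rather than the 0 the QueryPath version ends up printing.

```php
<?php
// Hypothetical sample data standing in for a fetched sitemap.
$sample = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><priority>1.0</priority></url>
  <url><loc>http://example.com/about</loc><priority>0.3</priority></url>
  <url><loc>http://example.com/blog</loc><priority>0.8</priority></url>
</urlset>
XML;

// Hypothetical helper: returns up to $max URLs mapped to their priority,
// highest priority first.
function top_sitemap_urls($xml, $max) {
  $doc = simplexml_load_string($xml);
  if ($doc === false) {
    return array();
  }
  $urls = array();
  foreach ($doc->url as $url) {
    // Assumption: entries without <priority> get the protocol default of 0.5.
    $urls[(string) $url->loc] = isset($url->priority) ? (float) $url->priority : 0.5;
  }
  arsort($urls);                             // sort by score, keep URL keys
  return array_slice($urls, 0, $max, true);  // keep only the top $max
}

// Print the top two entries in the same format as the script above.
$rank = 0;
foreach (top_sitemap_urls($sample, 2) as $uri => $score) {
  printf("%d: %0.5f %s\n", ++$rank, $score, $uri);
}
```

Using `array_slice()` after `arsort()` replaces the manual counter in the print loop, which keeps the truncation separate from the output formatting.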