A QueryPath script for checking on a sitemap

15 Feb in php, programming, querypath, sitemap, xml

I've been tuning our sitemap over the last few months, and one thing I needed was a quick tool to check how effective various sitemap generation strategies are.

To do this, I wrote a quick QueryPath script (see a full-sized image of the output). The script is explained below.

The code is pretty straightforward. It retrieves the sitemap from a URL, parses its contents, and sorts the entries by priority. Finally, it displays the top 100 entries. I've tested it on sitemaps with over 20,000 items. While it is a little slow on such a large document, it works fine.

#!/usr/bin/env php
<?php
require 'QueryPath/QueryPath.php';
 
define('MAX_ITEMS', 100);
 
$sitemap = 'http://example.com/sitemap.xml';
 
$urls = array();
print "Parsing sitemap...\n";
$qp = qp($sitemap, ':root>url>loc');
$size = $qp->size();
$max = $size > MAX_ITEMS ? MAX_ITEMS : $size;
printf("Found %d entries; printing top %d\n\n", $size, $max);
 
try {
  // Map each <loc> URL to the <priority> value of the same <url> entry.
  foreach ($qp as $url) {
    $loc = $url->text();
    $score = $url->nextAll('priority')->text();
    $urls[$loc] = $score;
  }
} catch (Exception $e) {
  print $e->getMessage();
}
 
arsort($urls);
 
$filter = "%d: %0.5f  %s\n";
 
$i = 0; // Rank counter; also stops the loop after $max entries.
foreach ($urls as $uri => $score) {
  if ($i++ == $max) break;
  printf($filter, $i, $score, $uri);
}
?>

In short, the script above fetches all of the URLs out of the sitemap and sorts them in descending order by their priority score. Only the top MAX_ITEMS (100) are shown.
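
For comparison, here is a minimal sketch of the same fetch, sort, and truncate idea using PHP's built-in SimpleXML instead of QueryPath. This is a swapped-in technique, not the script above. The function name top_sitemap_entries() and the inline sample sitemap are hypothetical and exist only for illustration. It also assumes that an entry with no priority element should get 0.5, the default named by the sitemap protocol, which the QueryPath script does not do.

```php
<?php
// Hypothetical sample data: a tiny inline sitemap instead of a remote URL.
$xml = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/</loc><priority>1.0</priority></url>
  <url><loc>http://example.com/about</loc><priority>0.3</priority></url>
  <url><loc>http://example.com/blog</loc><priority>0.8</priority></url>
  <url><loc>http://example.com/contact</loc></url>
</urlset>
XML;

// Hypothetical helper: returns up to $max entries as loc => priority,
// highest priority first.
function top_sitemap_entries($xml, $max) {
  $doc = simplexml_load_string($xml);
  if ($doc === false) {
    return array();
  }

  $urls = array();
  foreach ($doc->url as $url) {
    // Assumption: fall back to 0.5 when <priority> is missing.
    $score = isset($url->priority) ? (float) $url->priority : 0.5;
    $urls[(string) $url->loc] = $score;
  }

  // Sort by score, descending, keeping the URL keys attached.
  arsort($urls);

  return array_slice($urls, 0, $max, true);
}

$i = 0;
foreach (top_sitemap_entries($xml, 100) as $uri => $score) {
  printf("%d: %0.5f  %s\n", ++$i, $score, $uri);
}
```

Returning the sorted array from a function, rather than printing inside the loop, makes it easier to compare the output of two sitemap generation strategies side by side.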