
querypath

A QueryPath script for checking on a sitemap

I've been tuning our sitemap during the last few months, and one thing I needed was a quick tool to check on the effectiveness of various sitemap generation strategies.

To do this, I wrote a quick QueryPath script (see a full-sized image of the output). The script is explained below.

The code is pretty straightforward. It simply retrieves a URL, parses the sitemap contents, and then sorts them. Finally, it displays the top 100 entries. I've tested it on sitemaps with over 20,000 items. While it is a little slow on such a large document, it works fine.

#!/usr/bin/env php
<?php
require 'QueryPath/QueryPath.php';

// Maximum number of entries to print.
define('MAX_ITEMS', 100);

$sitemap = 'http://example.com/sitemap.xml';

$urls = array();
print "Parsing sitemap...\n";

// Select every <loc> element inside a <url> child of the root element.
$qp = qp($sitemap, ':root>url>loc');
$size = $qp->size();
$max = min($size, MAX_ITEMS);
printf("Found %d entries; printing top %d\n\n", $size, $max);

try {
  foreach ($qp as $url) {
    $loc = $url->text();
    // Read the <priority> sibling that follows this <loc>.
    $score = $url->nextAll('priority')->text();
    $urls[$loc] = $score;
  }
} catch (Exception $e) {
  print $e->getMessage();
}

// Sort by score, highest first, keeping the URL keys.
arsort($urls);

$filter = "%d: %0.5f  %s\n";

// Initialize the counter explicitly to avoid an undefined-variable notice.
$i = 0;
foreach ($urls as $uri => $score) {
  if ($i++ == $max) break;
  printf($filter, $i, $score, $uri);
}
?>

In short, the script fetches every URL in the sitemap and sorts the URLs by their priority score. Only the top MAX_ITEMS (100) are shown.
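For anyone without QueryPath installed, the same parse, sort, and truncate idea can be sketched with PHP's bundled SimpleXML extension. This is not the script above, just a rough equivalent: the function name top_sitemap_entries() and the inline sample sitemap are made up for illustration, and falling back to 0.5 when <priority> is missing follows the default given in the sitemap protocol.

```php
<?php
// Hypothetical helper (not part of the original script): rank sitemap URLs
// by priority using SimpleXML instead of QueryPath.
function top_sitemap_entries($xml, $max)
{
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        throw new InvalidArgumentException('Could not parse sitemap XML');
    }

    $scores = array();
    foreach ($doc->url as $url) {
        // <priority> is optional; the sitemap protocol's default is 0.5.
        $priority = isset($url->priority) ? (float) (string) $url->priority : 0.5;
        $scores[(string) $url->loc] = $priority;
    }

    arsort($scores);                          // highest priority first, keys kept
    return array_slice($scores, 0, $max, true);
}

// Made-up sample data standing in for a real sitemap.
$xml = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/about</loc><priority>0.3</priority></url>
  <url><loc>http://example.com/</loc><priority>1.0</priority></url>
  <url><loc>http://example.com/blog</loc><priority>0.8</priority></url>
</urlset>
XML;

$top = top_sitemap_entries($xml, 2);

$rank = 0;
foreach ($top as $loc => $score) {
    printf("%d: %0.5f  %s\n", ++$rank, $score, $loc);
}
```

Working from a string rather than a URL keeps the sketch runnable offline; a real run would presumably load the remote sitemap first.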

QueryPath on WebMonkey

It just came to my attention that a WebMonkey article (Parsing HTML? There's an App for That) from a few months ago suggested using QueryPath as an alternative to attempting to parse HTML by hand.


Appropriately, last week I wrote a QueryPath script to analyze a site and extract all links so that I could feed them to Siege and simulate something like a real load against a server. It's nice to be able to easily extract data from HTML.
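The link-extraction idea can be sketched roughly as follows. This is not the script mentioned above: it uses PHP's bundled DOM extension as a stand-in for QueryPath, the function name extract_links() and the sample HTML are invented for illustration, and keeping only absolute http(s) links is an assumption about what a load test would want.

```php
<?php
// Hypothetical helper: collect unique absolute http(s) links from HTML.
// Uses PHP's DOM extension as a stand-in for QueryPath.
function extract_links($html)
{
    $doc = new DOMDocument();
    // Real-world HTML is rarely valid, so collect libxml warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = trim($anchor->getAttribute('href'));
        if (preg_match('#^https?://#i', $href)) {
            $links[$href] = true;   // array keys de-duplicate the list
        }
    }
    return array_keys($links);
}

// Made-up sample page.
$html = '<html><body>'
      . '<a href="http://example.com/">Home</a>'
      . '<a href="http://example.com/blog">Blog</a>'
      . '<a href="http://example.com/">Home again</a>'
      . '<a href="#top">Top</a>'
      . '</body></html>';

$links = extract_links($html);

// One URL per line, the layout Siege reads from a file given with -f.
print implode("\n", $links) . "\n";
```

Redirecting that output to a file gives a URL list that can be passed to Siege with its -f option.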

Acquia Webinar: "Playing Nicely with Others"

In our webinar Playing Nicely With Others: Integrating Drupal with Third-Party Data, Ken, George, Larry, and I talk about integrating various web services with Drupal. We talk about SOAP, content importing, digital asset management systems, and QueryPath (surprisingly, I'm not the one plugging QueryPath in this vid).

Thanks to Acquia for doing a fantastic job putting together their webinar series.

Streamlining Iterators in QueryPath 3.x

Work has officially begun on QueryPath 3.x. The upcoming release is focused on implementing and supporting many of the new features introduced in PHP 5.3, including enhanced SPL support, namespaces, closures, and phar archives.

In an earlier article, I examined the performance of various iteration strategies in QueryPath. After taking a hard look at the patterns I observed there, I revisited QueryPath's QueryPathIterator class to see if I could make a sizable performance improvement.

Iteration Techniques and Performance in QueryPath

QueryPath provides multiple methods of iterating. This article demonstrates the performance impact of various looping types. In this article, we are going to look at four different ways of iterating through the items wrapped by a QueryPath object:

  • Using QueryPath's iterator
  • Looping through DOMNode objects
  • Using each() and a callback
  • Using each() and an anonymous function

This last item is specific to PHP 5.3 and later, and offers intriguing possibilities when paired with closures.

Finally, at the end of the article, I will show some representative performance numbers.
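To make the four styles concrete without requiring QueryPath, here is a rough sketch built on a tiny stand-in class. TinySet and TextCollector are hypothetical names, not QueryPath classes, and the (index, node) callback signature is modeled on jQuery-style each() as an assumption. The point is only to show how the four loop shapes differ.

```php
<?php
// Hypothetical stand-in for a QueryPath object. NOT QueryPath's real class.
class TinySet implements IteratorAggregate
{
    private $nodes;

    public function __construct(array $nodes) { $this->nodes = $nodes; }

    // Raw access to the wrapped DOMNode objects.
    public function get() { return $this->nodes; }

    // Iterator access: every node is re-wrapped in its own TinySet.
    public function getIterator(): Traversable
    {
        $wrapped = array();
        foreach ($this->nodes as $node) {
            $wrapped[] = new TinySet(array($node));
        }
        return new ArrayIterator($wrapped);
    }

    // Callback access: invoke $callback(index, node) for every node.
    public function each($callback)
    {
        foreach ($this->nodes as $index => $node) {
            call_user_func($callback, $index, $node);
        }
        return $this;
    }

    public function text()
    {
        $text = '';
        foreach ($this->nodes as $node) { $text .= $node->textContent; }
        return $text;
    }
}

// Pre-closure callbacks need somewhere to keep state, such as an object.
class TextCollector
{
    public $items = array();
    public function add($index, $node) { $this->items[] = $node->textContent; }
}

$doc = new DOMDocument();
$doc->loadXML('<ul><li>one</li><li>two</li><li>three</li></ul>');
$set = new TinySet(iterator_to_array($doc->getElementsByTagName('li')));

// 1. The wrapper's iterator: each $item is itself a wrapper object.
$viaIterator = array();
foreach ($set as $item) { $viaIterator[] = $item->text(); }

// 2. Looping through the raw DOMNode objects.
$viaNodes = array();
foreach ($set->get() as $node) { $viaNodes[] = $node->textContent; }

// 3. each() with a conventional callback.
$collector = new TextCollector();
$set->each(array($collector, 'add'));
$viaCallback = $collector->items;

// 4. each() with an anonymous function (PHP 5.3+), capturing by reference.
$viaClosure = array();
$set->each(function ($index, $node) use (&$viaClosure) {
    $viaClosure[] = $node->textContent;
});
```

The closure version needs no helper class because `use (&$viaClosure)` carries the state, which is the kind of possibility closures open up.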

QueryPath Performance Optimizations on Reduncery

Continuing a trend on the non-evilness of optimization, this article discusses some methods of improving performance in QueryPath.

Early this week, a Twitter analysis tool called Reduncery was launched by a friend of mine. Reduncery calculates how much of a "redunce" a particular user is -- that is, what percentage of a user's tweets are retweets (RT). It can also calculate how ineffective it is for one person to retweet another. In this case, it calculates the overlap in the followers of the original tweeter with the followers of the retweeter. In what follows, we will look at the ways Reduncery optimizes QueryPath to keep page load times down.
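The two measurements described above can be sketched in a few lines. This is a guess at the arithmetic, not Reduncery's actual code: both function names are hypothetical, a retweet is assumed to be any tweet whose text starts with "RT ", and overlap is assumed to mean the share of the retweeter's followers who already follow the original tweeter.

```php
<?php
// Hypothetical helper: percentage of a user's tweets that are retweets.
// Assumption: a retweet is any tweet whose text begins with "RT ".
function retweet_percentage(array $tweets)
{
    if (count($tweets) === 0) {
        return 0.0;
    }
    $retweets = 0;
    foreach ($tweets as $text) {
        if (strncmp($text, 'RT ', 3) === 0) {
            $retweets++;
        }
    }
    return 100.0 * $retweets / count($tweets);
}

// Hypothetical helper: percentage of the retweeter's followers who already
// follow the original tweeter. Assumes each list holds unique follower IDs.
function follower_overlap(array $originalFollowers, array $retweeterFollowers)
{
    if (count($retweeterFollowers) === 0) {
        return 0.0;
    }
    // Flip IDs into keys so each membership check is a hash lookup
    // rather than a linear scan.
    $lookup = array_flip($originalFollowers);
    $shared = 0;
    foreach ($retweeterFollowers as $id) {
        if (isset($lookup[$id])) {
            $shared++;
        }
    }
    return 100.0 * $shared / count($retweeterFollowers);
}

// Made-up sample data.
$tweets = array('RT @someone: hello', 'my own thought', 'lunch', 'a link');
$redunce = retweet_percentage($tweets);
$overlap = follower_overlap(array(1, 2, 3, 4), array(3, 4, 5, 6));

printf("Retweets: %.1f%%, follower overlap: %.1f%%\n", $redunce, $overlap);
```

With follower lists in the hundreds of thousands, the keyed lookup matters far more than it does in this toy data.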

Reduncery: Calculating retweet idiocy

Ever get irritated by reading the same tweet multiple times, retweeted by the same old people? Ever wondered how effective re-tweeting is? Are new people really reading the tweet, or are the same people just being notified multiple times? You can now find out for sure with Reduncery.


Reduncery was built on QueryPath and Drupal. In a future blog post, I'll tell you about some of the performance optimizations Reduncery uses to speed up searches of 200k+ users.

QueryPath slides from DrupalCon Paris, 2009

I finally posted my slides for DrupalCon Paris on SlideShare.

Feel free to download a copy and use it in conjunction with the video from Paris. The slides, though, cover some information that I did not have time to cover in the video. Conversely, the video features Ken and David each talking about QueryPath projects they worked on.

SPL in PHP 5.3

Here is a great slideshow that explains what is so important (and so interesting) about the SPL libraries included in PHP 5.3.

And for additional reading, head over to the PHP manual: http://us3.php.net/manual/en/spl.datastructures.php

When I switched QueryPath from an array to an SplObjectStorage object, I noticed tremendous speed improvements. And the 5.3 random access extensions to SplObjectStorage will continue to speed QueryPath's engine.
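As a rough illustration of why the switch can help, here is a small comparison of an array and an SplObjectStorage used as a set of DOM nodes. This is not QueryPath's internal code, and the sample document is made up. It only shows the behavioral difference: the array needs a scan to test membership, while the storage object looks objects up by identity and ignores duplicates.

```php
<?php
$doc = new DOMDocument();
$doc->loadXML('<root><a/><b/><c/></root>');
$nodes = iterator_to_array($doc->documentElement->childNodes);

// Array-backed set: membership is a linear scan with strict comparison,
// and nothing prevents the same node from being added twice.
$list = array($nodes[0], $nodes[1], $nodes[1]);
$inList = in_array($nodes[1], $list, true);

// SplObjectStorage-backed set: objects are keyed by identity, so lookups
// avoid the scan and attaching the same node again changes nothing.
$set = new SplObjectStorage();
$set->attach($nodes[0]);
$set->attach($nodes[1]);
$set->attach($nodes[1]);    // duplicate, ignored

printf("array holds %d items, storage holds %d\n", count($list), count($set));
printf("storage contains <c>? %s\n", $set->contains($nodes[2]) ? 'yes' : 'no');
```

Automatic de-duplication is handy when merging the results of several selectors, since a node matched twice should still appear only once.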

QueryPath at DrupalCamp Atlanta?

I was happy to see that QueryPath made the hallway track at DrupalCamp Atlanta. I assume I have Ken to thank for that. Ken co-presented with me twice at DrupalCon Paris -- once on how we did Foreign Affairs, and once on QueryPath (video of session).
