Iteration Techniques and Performance in QueryPath

Nov 26 2009

QueryPath provides multiple methods of iterating. This article demonstrates the performance impact of various looping types. In this article, we are going to look at four different ways of iterating through the items wrapped by a QueryPath object:

  • Using QueryPath's iterator
  • Looping through DOMNode objects
  • Using each() and a callback
  • Using each() and an anonymous function

This last item is specific to PHP 5.3 and later, and offers intriguing possibilities when paired with closures.

Finally, at the end of the article, I will show some representative performance numbers.

Using QueryPath's Iterator

Perhaps the most common way of looping through a set of QueryPath results is to do something like this:

<?php
$qp = qp($xml, 'a');

foreach ($qp as $item) {
  // $item is a QueryPath object
}
?>

This method takes advantage of the fact that QueryPath is iterable. Whenever a QueryPath object is accessed inside of a foreach loop, the next DOMNode is retrieved from the QueryPath, and is then wrapped in a QueryPath object.

This method of looping is very powerful in cases where you want to perform operations on each element in a list. However, it is not the fastest method of iterating over a set because a new QueryPath object is created for each item.

Most of the time, this won't make a difference. But when you are dealing with objects tracking thousands of elements, the performance impact can be noticeable.

There are three other ways we are going to look at accessing each item in a QueryPath object.

Iterating over DOMNodes

A second method is to retrieve the list of DOMNodes that a QueryPath object wraps, and then iterating over that list. Since this particular method involves no new object creation and very little additional memory allocation, it is very fast (and less memory-intensive).

Here is the basic method of iterating in this way:

<?php
$qp = qp($xml, 'a');

foreach ($qp->get() as $item) {
  // $item is a DOMNode object, usually a DOMElement.
}
?>

This method tells QueryPath to retrieve an iterable list of nodes, and then iterate over that list. There is a minor (non-obvious) performance hit taken when using this method. And I mean minor. When get() is called this way, the list of nodes is copied into an array, and the array is returned. This was done for historical reasons, and may not be the default behavior in future releases of QueryPath.

However, you can avoid that hit with a little bit of black magic:

<?php
$qp = qp($xml, 'a');

foreach ($qp->get(NULL, TRUE) as $item) {
  // $item is a DOMNode object, usually a DOMElement.
}
?>

What's going on in this example? In short, we are calling get() with an extra argument. When QueryPath 2.x was released, one minor change in the API -- a change made primarily for QueryPath internals -- was the addition of an extra flag to get(). This second flag, if TRUE, will tell QueryPath to forgo the conversion of the SPLObjectStorage to an array.

Since this flag was considered internal, it was grafted on in a not-very-intuitive way. And in QueryPath 3, which is targeted to PHP 5.3, the return of an SPLObjectStorage object may be standard.

So what's the performance difference? Here are some representative numbers on PHP 5.3 (Mac OS 10.6):

  Test of foreach get()       took      0.001356 sec.
  Test of foreach get(NULL, FALSE) took 0.000692 sec.

In this particular run get(NULL, FALSE) is twice as fast. On average, though, it seems to tend toward 33% faster.

Using each() and a Callback

QueryPath 1.x introduces a method of executing a callback on each item in a QueryPath. It worked something like this:

<?php
function myFunc($index, $item) {
  // Do something...
}

qp($xml, 'a')->each('myFunc');
?>

This provided a tool for running an arbitrary function on each element in a QueryPath. The callback in an each() function is given both the index of the item being executed, and also the item itself (typically a DOMNode). As long as the function does not return FALSE, QueryPath executes the function once for each element.

This method has some clear advantages. Performance-wise, it falls somewhere between the two methods discussed above. But defining the function or method to be called often felt clumsy.

Using each() and an Anonymous Function

In PHP 5.3, functions can now be declared anonymously -- and assigned to variables. The follow expression is now legal:

<?php
$myFunc = function ($a, $b) {
  // Do whatever...
};

$myFunc('some val', 'another val');
?>

And QueryPath can use anonymous functions in each() functions:

<?php
qp($xml, 'a')->each(function ($index, $item) { /* do something. */ });
?>

This compact syntax makes it a little less obtrusive to add arbitrary functionality to be performed on each item in a QueryPath object.

What is more, with the addition of PHP's closure syntax, we can pass in additional outside variables that we want to make accessible inside of the anonymous function.

<?php
$outside = 'something';
qp($xml, 'a')->each(function ($index, $item) use ($outside) { /* do something. */ });
?>

In this case, we can now pass in additional variables from outside of the looping scope. This can be very useful when passing anonymous functions through a function wrapper.

Using anonymous functions and closures is slightly faster than using a callback. Of course, anonymous functions are not nearly as re-usable as the class methods and functions typically used in callbacks. Conversely, they are a powerful way of tackling a repetitive task in a way that doesn't clutter the code with single-use functions.

You may recall earlier that the QueryPath iterator can be used to iterate through each item in a QueryPath in such a way that each item is wrapped in its own QueryPath. We can actually replicate this behavior using each() and a closure:

<?php
$outside = 'something';
qp($xml, 'a')->each(function ($index, $item) use ($outside) { 
  $innerQP = qp($item);
  // Do whatever...
});
?>

The interesting thing about this method is that it gives us more explicit control over what external variables are accessible from within the loop. This may be advantageous in some areas, though perhaps not in all.

After running multiple performance tests comparing closures and iterators in PHP 5.3, I have not yet seen a substantial performance difference between the two. I know that during the PHP 5.2 days it was claimed that SPL iterators were slow (and they are if you override every method of the container object). They seem to be substantially faster in PHP 5.3.

Performance Testing

While riding the train, I composed a very simple performance test to compare the various iteration techniques listed above. Basically, the test creates an example document with 1,100 a elements. Why 1,100? Because 1,000 is considered to be the point at which arrays cease to perform as well as more specialized data structures.

Here is a representative set of numbers:

  Test of foreach get() took 0.001330 sec.
  Test of foreach get(NULL, FALSE) took 0.000935 sec.
  Test of iterator took 0.049583 sec.
  Test of callback took 0.006122 sec.
  Test of anonymous function took 0.005694 sec.
  Test of closure took 0.005820 sec.
  Test of anonymous function generating qp took 0.047485 sec.
  Test of closure generating qp took 0.048378 sec.

It is important to note that the numbers vary slightly from run to run. The first pair of comparisons, for example, seem to diverge by different margins from run to run -- probably depending on the particular hash distribution during a particular generation of an SPLObjectStorage object. But for the most part, the example above is representative.

Here is the code used to generate it:

<?php
require_once('QueryPath/QueryPath.php');

$xml = '<?xml version="1.0"?><r>';

$i = 0;
do {
  $xml .= '<a><b/></a>';
}
while (++$i < 1100);

$xml .= '</r>';

// Do nothing. Gets DOM loaded to even 
// out tests.
qp($xml, 'a')->each(function ($i, $e){});

$qp_it = qp($xml, 'a');
$qp_foreach = qp($xml, 'a');
$qp_foreach2 = qp($xml, 'a');
$qp_cb = qp($xml, 'a');
$qp_anon = qp($xml, 'a');
$qp_anon2 = qp($xml, 'a');


////////////////
// Test foreach
$end = NULL;
$start = microtime();
//print "Start: $start" . PHP_EOL;
//var_dump(microtime());

// Test looping speed.
foreach ($qp_foreach->get() as $item) {}

$end = microtime();
printf("Test of foreach get() took %f sec.\n", $end - $start);

//////////////////////////
// Test foreach over Splos
$end = NULL;
$start = microtime();
//print "Start: $start" . PHP_EOL;
//var_dump(microtime());

// Test looping speed.
foreach ($qp_foreach->get(NULL, TRUE) as $item) {}

$end = microtime();
printf("Test of foreach get(NULL, FALSE) took %f sec.\n", $end - $start);


/////////////////////
// IteratorAggregate
$start = microtime();

// Test looping speed.
foreach ($qp_it as $item) {}

$end = microtime();
printf("Test of iterator took %f sec.\n", $end - $start);

////////////////
// Test callback
function cb($i, $e) {
  return 1;
}

$end = NULL;
$start = microtime();

// Test looping spead;
$qp_cb->each('cb');

$end = microtime();
printf("Test of callback took %f sec.\n", $end - $start);

//////////////////
// Test anon func
$end = NULL;
$start = microtime();

$qp_anon->each(function ($i, $e) {});

$end = microtime();
printf("Test of anonymous function took %f sec.\n", $end - $start);

//////////////////
// Test closure
$end = NULL;
$start = microtime();

$outside = 9;
$qp_anon->each(function ($i, $e) use ($outside) {});

$end = microtime();
printf("Test of closure took %f sec.\n", $end - $start);


/////////////////////////
// Test anon func w/ qp()
$end = NULL;
$start = microtime();

$qp_anon2->each(function ($i, $e) {qp($e);});

$end = microtime();
printf("Test of anonymous function generating qp took %f sec.\n", $end - $start);

/////////////////////////
// Test closure w/ qp()
$end = NULL;
$start = microtime();

$outside = 5;
$qp_anon2->each(function ($i, $e) use ($outside) {qp($e);});

$end = microtime();
printf("Test of closure generating qp took %f sec.\n", $end - $start);
?>

(Note: All tests were done on a MacBook Pro 2.4 GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 running OS 10.6 -- PHP 5.3.0.)



comments powered by Disqus