Streamlining Iterators in QueryPath 3.x
Work has officially begun on QueryPath 3.x. The upcoming release is focused on implementing and supporting many of the new features introduced in PHP 5.3, including enhanced SPL support, namespaces, closures, and phar archives.
In an earlier article, I examined the performance of various iteration strategies in QueryPath. After taking a hard look at the patterns I observed there, I revisited QueryPath's QueryPathIterator
class to see if I could make a sizable performance improvement.
The Original Code
This is the iterator class for QueryPath 2.x. Anytime one uses the foreach($qp as $item)
construct, this is the iterator that is used:
<?php
class QueryPathIterator extends \IteratorIterator {
public $options = array();
public function current() {
return qp(parent::current(), NULL, $this->options);
}
}
?>
(This is from the original QueryPath 3.x branch, and uses the backslash to identify the IteratorIterator
's namespace. If that doesn't make sense to you, just ignore it.)
What makes the SPL iteration classes so compelling is the ease with which they can be selectively extended. By overriding only one function, current(), we can create a custom iterator that wraps every item in a QueryPath object. For that reason, we can execute a loop like this:
<?php
foreach ($qp as $item) {
// $item is a QueryPath because of current().
}
?>
Each time an item is retrieved from the QueryPath collection, it is wrapped in a QueryPath
object.
As nice as this is, it has a drawback, as my previous benchmarks showed. It is slow, and the primary reason for slowness is the creation of a new QueryPath object for each item.
This got me thinking: What if the same QueryPath object was re-used, instead of creating a new QueryPath for each item? With this in mind, I re-visited the code.
The New Iterator Code
The QueryPath
class already has a method for manually setting the contents of the QueryPath item list. While this method has been around for a long time, it has been marked as private
to prevent outside access. However, while working on these benchmarks I realized that re-marking this as public
would not only help me solve my problem, but open the doorway for others to create useful outer iterators on QueryPath
objects.
Thus, the first step in fixing the problem was to mark the setMatches()
method as public.
With that done, the QueryPathIterator
class can be re-written as follows:
<?php
class QueryPathIterator extends \IteratorIterator {
public $options = array();
private $qp = NULL;
public function current() {
// Re-using the QueryPath object cuts of 4/5 of the iteration time
// on large sets.
if (!isset($this->qp)) {
$this->qp = qp(parent::current(), NULL, $this->options);
}
else {
$splos = new \SplObjectStorage();
$splos->attach(parent::current());
$this->qp->setMatches($splos);
}
return $this->qp;
}
}
?>
The real change in the code above is that only one QueryPath object is created. It is then re-used for each item in the collection.
While this solution is considerably more complex (it requires much more knowledge of QueryPath's internals), it represents a tremendous performance gain.
Original numbers (before this change) looked something like this:
Test of iterator took 0.053398 sec.
(This is using the same benchmarking I mentioned in the previous article.)
With the re-architecting of the QueryPathIterator
the time goes down substantially. Here's a representative example:
Test of iterator took 0.018096 sec.
So the new version is just about three times as fast as the original. While this is still slower than working directly with the DOMNode
objects, it is still a substantial improvement, and should help when working with moderately sized lists of nodes. (Recall that the benchmark is based on a node list with 1100 elements.)
This new improvement will come along in QueryPath 3.x. Because the 2.x branch will likely remain active for a longer period of time as PHP 5.2 continues to be dominant, I may backport this change and release it as part of the next QueryPath 2.x release.