QueryPath Performance Optimizations on Reduncery

Nov 20 2009

Continuing the recent theme that optimization need not be evil, this article discusses some methods of improving performance in QueryPath.

Early this week, a Twitter analysis tool called Reduncery was launched by a friend of mine. Reduncery calculates how much of a "redunce" a particular user is -- that is, what percentage of a user's tweets are retweets (RT). It can also calculate how ineffective it is for one person to retweet another. In this case, it calculates the overlap between the followers of the original tweeter and the followers of the retweeter. In what follows, we will look at the ways Reduncery optimizes QueryPath to keep page load times down. If you haven't tried Reduncery yet, go try it. It's fun. And this article will still be here when you get back.

Reduncery was built with QueryPath, Drupal, and the QueryPath Drupal module. And since some of the calculations that Reduncery runs can be fairly intensive, the author of the tool invested some time in optimizing his QueryPath code. He shared his results with me.

The performance optimizations I am sharing here are specific to QueryPath 2.x, because they take advantage of knowledge of the QP internals. When QueryPath 3.x is released, the same sorts of performance benefits may or may not still apply.

Faster Looping

The big performance hit in Reduncery is the comparison of one user's list of followers to another's. Some Twitter users have tens or hundreds of thousands of followers. (A few have a million or more.) Since Twitter's API breaks those lists into batches of 5,000, processing follower lists can be an intensive multi-step process: fetch a record set, parse out the 5,000 entries, fetch another record set, and so on. When one client request necessitates quickly processing that many records, and when the HTTP hit for the XML API cannot be tuned away, speeding up QueryPath is the easiest way to gain performance.
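To make the shape of that work concrete, here is a minimal sketch of such a paging loop. The endpoint URL, screen name, and response layout are assumptions based on the 2009-era Twitter XML API; this is not Reduncery's actual code.

<?php
// A sketch of cursor-based paging through a follower list.
// Assumes QueryPath 2.x is on the include path.
require 'QueryPath/QueryPath.php';

$followers = array();
$cursor = '-1'; // Twitter's "start at the beginning" cursor value

do {
  $xml = file_get_contents('http://twitter.com/followers/ids.xml'
    . '?screen_name=example&cursor=' . $cursor);

  // Pull the (up to) 5,000 follower IDs out of this batch.
  foreach (qp($xml, 'id') as $user) {
    $followers[] = $user->text();
  }

  // The next_cursor element says where the next batch starts;
  // a value of 0 means there are no more batches.
  $cursor = trim(qp($xml, 'next_cursor')->text());
} while ($cursor !== '0');
?>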

One of the early loops that was identified as a performance bottleneck was this:

<?php
// Collect each user ID (the text content of each <id> element).
foreach ($qp->find('id') as $user) {
  $followers[] = $user->text();
}
?>

This snippet simply loops through all of the id elements in an XML document and retrieves each user's ID (username). It takes advantage of the fact that a QueryPath object can be used directly in a foreach loop; it implements PHP's built-in iterator interfaces.

But when you're dealing with many thousands of records, the loop above is slow. Why? Because every time through the loop, one new QueryPath object is created. $user is populated with a new QueryPath that points to only one element. This was done to simplify the (very common) pattern where developers would loop through a list of elements and wrap each one in a new QueryPath instance.

In the code above, however, the functionality of QueryPath isn't really needed. All that is needed is to retrieve the text inside the element. That is easy enough to do on a native DOMNode object. It doesn't require a QueryPath.

So is there a way to avoid creating that new QueryPath object on each run through the loop? Yes. Here's an alternative:

<?php
// get() returns the matched DOMNodes directly, so no new QueryPath
// objects are created inside the loop.
foreach ($qp->find('id')->get() as $node) {
  $followers[] = $node->textContent;
}
?>

In this case, instead of iterating over the QueryPath object, we use QueryPath's get() method to retrieve the array of matched DOMNode objects. When looping this way, QueryPath doesn't need to do anything fancy, and it doesn't need to allocate memory for wrapper objects: get() hands back the underlying DOMNodes themselves.

A DOMNode has a textContent property that returns all of the text contained in the node and its descendants, which is precisely what is needed for Reduncery.

What was the result? Simply eliminating the creation of a new object on each iteration sped up the loop by almost 30%.
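If you want to see the difference on your own machine, a rough benchmark like the following will do. This is only a sketch (again assuming QueryPath 2.x is on the include path); the ~30% figure above comes from Reduncery's own profiling, not from this script.

<?php
require 'QueryPath/QueryPath.php';

// Build a synthetic document with 50,000 <id> elements.
$xml = '<ids>' . str_repeat('<id>12345</id>', 50000) . '</ids>';

$a = array();
$start = microtime(TRUE);
foreach (qp($xml, 'id') as $user) {
  $a[] = $user->text(); // a new QueryPath object on every pass
}
printf("Iterator loop: %.3f seconds\n", microtime(TRUE) - $start);

$b = array();
$start = microtime(TRUE);
foreach (qp($xml, 'id')->get() as $node) {
  $b[] = $node->textContent; // raw DOMNodes, no new objects
}
printf("get() loop:    %.3f seconds\n", microtime(TRUE) - $start);
?>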

That was one speed optimization. Here's another.

Using the QueryPath Cache in Drupal

For the most part, the QueryPath Drupal module simply exposes QueryPath to Drupal and binds the database APIs. But it also includes an extra module that can come in handy for applications like Reduncery.

The QueryPath Cache (qpcache) module provides a database-level, XML-optimized caching layer. This caching layer is distinct from the Drupal cache in several ways: First, it is optimized for storing documents. Second, it is optimized for very fast key-based record retrieval. Third, control over cached records is left largely to the application. A Drupal cache clear will not clear the QP Cache. This makes it possible for records to live for longer periods of time, and for applications to have finer-grained control over which records are stored, and for how long.

As I mentioned before, Reduncery often retrieves record sets with as many as 5,000 records per set. And sometimes it needs to retrieve several sets per user request. Making the HTTP request from Reduncery to Twitter is time consuming, and there's very little that can be done to make that connection faster.

But the HTTP request can be avoided in some cases. Reduncery uses the QP Cache to store the retrieved Twitter XML documents for a fixed period of time. When the same user's record is retrieved multiple times, Reduncery doesn't have to re-connect to Twitter and fetch the data again. It uses the cached copy.

The QP Cache API is incredibly easy to use in cases like this. All the Reduncery folks had to do was replace a call to qp() with a call to qpcache_qp(), a simple function that transparently handles caching.
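In practice, the swap looks something like this. The URL is the same placeholder used earlier, and whether qpcache_qp() takes extra parameters (a cache lifetime, say) is something to check in the qpcache module itself:

<?php
// Before: every call makes an HTTP round trip to Twitter.
$qp = qp('http://twitter.com/followers/ids.xml?screen_name=example');

// After: served from the QP Cache whenever a fresh copy is on hand.
$qp = qpcache_qp('http://twitter.com/followers/ids.xml?screen_name=example');
?>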

The nice thing about this technique (in the context of web APIs) is that the performance gain is not a matter of milliseconds, but of seconds.

Here's another way to shave time when working with big documents.

Faster Searches

When a query is executed in QueryPath, it is run in the most general way. Consider a query like this: qp($doc, 'next_cursor'). This will cause QueryPath to search the entire document for all occurrences of an element named next_cursor.

Now consider a document like the ones Reduncery parses. With 5,000 records, each with several elements, executing a search such as the one above may involve searching 35,000+ DOM nodes.

Now consider the fact that the next_cursor element only appears once in the entire document, and it only appears in a known location: directly beneath the document root.

At first blush, the easiest way to find this next_cursor element is to do something like this:

<?php
qp($doc, 'next_cursor');
?>

That query, according to reports from the Reduncery developers, took 0.44 seconds to run. Why so long? Because the entire document had to be searched for any occurrence of that element.

If we could reduce the amount of searching -- restrict the scope of the search -- we could cut way down on that time. In point of fact, we can do that quite conveniently:

<?php
qp($doc, ':root>next_cursor');
?>

This version of the query tells QueryPath to look only for next_cursor elements that appear directly beneath the document root. This drastically reduces the number of elements that need to be searched (from 35,000+ to 6, in fact).

It should come as no surprise, then, that this second version of the query takes only 0.006 seconds to execute.

With a little bit of selector tweaking, Reduncery got a significant performance boost.

The Nature of Performance Improvements

I would be remiss if I did not mention the dark side of optimizations. The third trick works well because of the particular top-down selection model that QueryPath uses. In fact, QueryPath was designed to be able to make such optimizations.

But in the next version of QueryPath, we may switch to a bottom-up selection model, in which case QueryPath would search the entire DOM for this second version just as it did for the first. What is more, a (very minor) additional performance penalty would be incurred by requiring that the element be a direct child of the root element. Granted, in this case it would be so nominal as to be unnoticeable.

Why would we change things in QueryPath 3? Because it reduces the number of lookups for general cases of non-optimized selectors, and a gain in general efficiency may be desirable. (Again, no final decisions have been made.)

The point, though, is that optimizing at this level is a little tricky, and may mean additional work later. For that reason, it's a good idea to optimize only in cases (such as Reduncery) where that extra parcel of time makes a big overall difference.

Special thanks to heyrocker for sharing this information with me.