A Set of Objects in PHP: Arrays vs. SplObjectStorage

May 29 2009

One of my projects, QueryPath, performs many tasks that require maintaining a set of unique objects. In my quest to optimize QueryPath, I have been looking into various ways of efficiently storing sets of objects in a way that provides expedient containment checks. In other words, I want a data structure that keeps a list of unique objects, and can quickly tell me if some object is present in that list. The ability to loop through the contents of the list is also necessary.

UPDATE Jan 4, 2013: See the updated benchmarks.

Recently I narrowed the list of candidates down to two methods:

  1. Use good old fashioned arrays to emulate a hash set.
  2. Use the SPLObjectStorage system present in PHP 5.2 and up.

Before implementing anything directly in QueryPath, I first set out designing the two methods, and then ran some micro-benchmarks (with Crell's help) on the pair of methods. To say that the results were surprising is an understatement. The benchmarks will likely change the way I structure future code, both inside and outside of Drupal.

The Designs

Before presenting the benchmarks, I want to quickly explain the two designs that I settled on.

Arrays emulating a hash set

The first method I have been considering is using PHP's standard array() to emulate a set backed by a hash mapping (a "hash set"). A set is a data structure designed to keep a list of unique elements. In my case, I am interested in storing a unique set of DOM objects. A hash set is a set that is implemented using a hash table, where the key is a unique identifier for the stored value. While one would normally write a class to encapsulate this functionality, I decided to test the implementation as a bare array with no layers of indirection on top. In other words, what I am about to present are the internals of what would be a hash set implementation.

The Goal: Store a (unique) set of objects in a way that makes them (a) easy to iterate, and (b) cheap to check membership. The Strategy: Create an associative array where the key is a hash ID and the value is the object.

With a reasonably good hashing function, the strategy outlined above should work as desired.

"Reasonably good hashing function" -- that was the first gotcha. How do you generate a good hashing function for an object like DOMDocument? One (bad) way would be to serialize the object and then, perhaps, take its MD5 hash. That, however, will not work on many objects -- specifically any object that cannot serialze. The DOMDocument, for example, is actually backed by a resource and cannot be serialized.

One needed look far for a such a function, though. It turns out that there is an object hashing function in PHP 5. It's called splobjecthash(), and it can take any object (even one that is not native PHP) and generate a hashcode for it.

Using splobjecthash() we can build a simple data structure that functions like a hash set. This structure looks something like this:

array(
  $hashcode => $object
);

For example, we an generate an entry like so:

$object = new stdClass();
$hashcode = spl_object_hash($object);

$arr = array(
  $hashcode => $object
);

In the example above, then, the hashcode string is an array key, and the object itself is the array value. Note that since the hashcode will be the same each time an object is re-hashed, it serves not only as a comparison point ("if object a's hashkey == object b's hashkey, then a == b"), it also functions as a uniqueness constraint. Only one object with the specified hashcode can exist per array, so there is no possibility of two copies (actually, two references) to the same object being placed in the array.

With a data structure like this, we have a host of readily available functions for manipulating the structure, since we have at our disposal all of the PHP array functions. So to some degree this is an attractive choice out of the box.

The most import task, in our context at least, is that of determining whether an entry exists inside of the set. There are two possible candidates for this check, and both require supplying the hashcode:

  1. Check whether the key exists using arraykeyexists().
  2. Check whether the key is set using isset().

To cut to the chase, isset() is faster than arraykeyexists(), and offers the same features in our context, so we will use that. (The fact that they handle null values differently makes no difference to us. No null values will ever be inserted into the set.)

With this in mind, then, we would perform a containment check using something like this:

$object = new stdClass();
$hashkey = spl_object_hash($object);
// ...

// Check whether $arr has the $object.
if (isset($arr[$hashkey])) {
  // Do something...
}

Again, using an array that emulates a hash set allows us to use all of the existing array functions and also provides easy iterability. We can easily drop this into a foreach loop and iterate the contents. Before looking at how this performs, though, let's look at the other possible solution.

Using SplObjectStorage

The second method under consideration makes use of the new SplObjectStorage class from PHP 5.2+ (it might be in 5.1). This class, which is backed by a C implementation, provides a set-like storage mechanism for classes. It enforces uniqueness; only one of each object can be stored. It is also traversable, as it implements the Iterable interface. That means you can use it in loops such as foreach. (On the down side, the version in PHP 5.2 does not provide any method of random access or index-based access. The version in PHP 5.3 rectifies this shortcoming.)

The Goal: Store a (unique) set of objects in a way that makes them (a) easy to iterate, and (b) cheap to check membership. The Strategy: Instantiate an object of class SplObjectStorage and store references to objects inside of this.

Creating a new SplObjectStorage is simple:

$objectStore = new SplObjectStorage();

An SplObjectStorage instance not only retains uniqueness information about objects, but objects are also stored in predictable order. SplObjectStorage is a FIFO -- First In, First Out.

Adding objects is done with the attach() method:

$objectStore = new SplObjectStorage();
$object = new stdClass();

$objectStore->attach($object);

It should be noted that attach will only attach an object once. If the same object is passed to attach() twice, the second attempt will simply be ignored. For this reason it is unnecessary to perform a contains() call before an attach() call. Doing so is redundant and costly.

Checking for the existence of an object within an SplObjectStorage instance is also straightforward:

$objectStore = new SplObjectStorage();
$object = new stdClass();

$objectStore->attach($object);

// ...

if ($objectStore->contains($object)) {
  // Do something...
}

While SplObjectStorage has nowhere near the number of supporting methods that one has access to with arrays, it allows for iteration and somewhat limited access to the objects stored within. In many use cases (including the one I am investigating here), SplObjectStorage provides the requisite functionality.

Now that we have taken a look at the two candidate data structures, let's see how they perform.

The Comparisons

Anyone who has seen Larry (Crell) Garfield's micro-benchmarks for arrays and SPL ArrayAccess objects will likely come into this set of benchmarks with the same set of expectations Larry and I had. We expected PHP's arrays to blow the SplObjectStorage out of the water. After all, arrays are a primitive type in PHP, and have enjoyed years of optimizations. However, the documentation for the SplObjectStorage indicates that the search time for an SplObjectStorage object is O(1), which would certainly make it competitive if the base speed is similar to that of an array.

My testing environments are:

  • An iMac (current generation) with a 3.06 Ghz Intel Core 2 Duo and 2G of 800mhz DDR2 RAM. MAMP 1.72 (PHP 5.2.5) provides the AMP stack.
  • A MacBook Pro with a 2.4 Ghz Intel Core 2 Duo and 4G of 667mhz DDR2 RAM. MAMP 1.72 (PHP 5.2.5) provides the AMP stack.

In both cases, the performance tests averaged about the same. Benchmarks in this article come from the second system.

Our basic testing strategy was to build a simple test that captured information about three things:

  1. The amount of time it takes to load the data structure
  2. The amount of time it takes to seek the data structure
  3. The amount of memory the data structure uses

We did our best to minimize the influence of other factors on the test. Here is our testing script:

  <?php
  /**
   * Object hashing tests.
   */
  $sos = new SplObjectStorage();

  $docs = array();
  $iterations = 100000;

  for ($i = 0; $i < $iterations; ++$i) {
    $doc = new DOMDocument();
    //$doc = new stdClass();

    $docs[] = $doc;

  }

  $start = $finis = 0;

  $mem_empty = memory_get_usage();

  // Load the SplObjectStorage
  $start = microtime(TRUE);
  foreach ($docs as $d) {
    $sos->attach($d);
  }
  $finis = microtime(TRUE);

  $time_to_fill = $finis - $start;

  // Check membership on the object storage
  $start = microtime(FALSE);
  foreach ($docs as $d) {
    $sos->contains($d);
  }

  $finis = microtime(FALSE);

  $time_to_check = $finis - $start;

  $mem_spl = memory_get_usage();

  $mem_used = $mem_spl - $mem_empty;

  printf("SplObjectStorage:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used);

  unset($sos);
  $mem_empty = memory_get_usage();

  // Test arrays:
  $start = microtime(TRUE);
  $arr = array();

  // Load the array
  foreach ($docs as $d) {
    $arr[spl_object_hash($d)] = $d;
  }
  $finis = microtime(TRUE);

  $time_to_fill = $finis - $start;

  // Check membership on the array
  $start = microtime(FALSE);
  foreach ($docs as $d) {
    //$arr[spl_object_hash($d)];
    isset($arr[spl_object_hash($d)]);
  }

  $finis = microtime(FALSE);

  $time_to_check = $finis - $start;
  $mem_arr = memory_get_usage();

  $mem_used = $mem_arr - $mem_empty;


  printf("Arrays:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used);
  ?>

The test above is broken into four separate tests. The first two test how well the SplObjectStorage method handles loading and containment checking. The second two perform the same test on our improvised array data structure.

There are two things worth noting about the test above.

First, the object of choice for our test was a DOMDocument. There are a few reasons for this. The obvious reason is that this test was done with the intent of optimizing QueryPath, which works with elements from the DOM implementation. There are two other interesting reasons, though. One is that DOMDocuments are not lightweight. The other is that DOMDocuments are backed by a C implementation, making them one of the more difficult cases when storing objects. (They cannot, for example, be conveniently serialized.)

That said, after observing the outcome, we repeated the test with basic stdClass objects and found the performance results to be nearly identical, and the memory usage to be proportional.

The second thing worth mention is that we used 100,000 iterations to test. This was about the upper bound that my PHP configuration allowed before running out of memory. Other than that, though, the number was chosen arbitrarily. When I ran tests with lower iteration counts, the SplObjectStorage definitely scaled linearly. Array performance was less predictable (larger standard deviation) with smaller data sets, though it seemed to average around the same for lower sizes as it does (more predictably) for larger sized arrays.

The Results

So how did these two strategies fare in our micro-benchmarks? Here is a representative sample of the output generated when running the above:

SplObjectStorage:
Time to fill: 0.085041999817.
Time to check: 0.073099000000.
Memory: 6124624

Arrays:
Time to fill: 0.193022966385.
Time to check: 0.153498000000.
Memory: 8524352

Averaging this over multiple runs, SplObjectStorage executed both fill and check functions twice as fast as the array method presented above. We tried various permutations of the tests above. Here, for example, are results for the same test with a stdClass object:

SplObjectStorage:
Time to fill: 0.082209110260.
Time to check: 0.070617000000.
Memory: 6124624

Arrays:
Time to fill: 0.189271926880.
Time to check: 0.152644000000.
Memory: 8524360

Not much different. Even adding arbitrary data to the object we stored does not make a difference in the time it takes for the SplObjectStorage (though it does seem to raise the time ever so slightly for the array).

Our conclusion is that SplObjectStorage is indeed a better solution for storing lots of objects in a set. Over the last week, I've ported QueryPath to SplObjectStorage (see the Quark branch at GitHub -- the existing Drupal QueryPath module can use this experimental branch without alteration), and will likely continue benchmarking. But preliminary results seem to provide a clear indication as to the best approach.

As a result of these findings, I'm much less inclined to default to arrays as "the best choice" simply because they are basic data types. If the SPL library contains features that out-perform arrays, they should be used when appropriate. From QueryPath to my Drupal modules, I expect that my code will be impacted by these findings.

Thanks to Crell for his help, and for Eddie at Frameweld for sparking my examination of these two methods in the first place.



comments powered by Disqus