Come to the 2010 CMS Expo

A Set of Objects in PHP: Arrays vs. SplObjectStorage

One of my projects, QueryPath, performs many tasks that require maintaining a set of unique objects. In my quest to optimize QueryPath, I have been looking into various ways of efficiently storing sets of objects in a way that provides expedient containment checks. In other words, I want a data structure that keeps a list of unique objects, and can quickly tell me if some object is present in that list. The ability to loop through the contents of the list is also necessary.

Recently I narrowed the list of candidates down to two methods:

  1. Use good old fashioned arrays to emulate a hash set.
  2. Use the SPLObjectStorage system present in PHP 5.2 and up.

Before implementing anything directly in QueryPath, I first set out designing the two methods, and then ran some micro-benchmarks (with Crell's help) on the pair of methods. To say that the results were surprising is an understatement. The benchmarks will likely change the way I structure future code, both inside and outside of Drupal.

The Designs

Before presenting the benchmarks, I want to quickly explain the two designs that I settled on.

Arrays emulating a hash set

The first method I have been considering is using PHP's standard array() to emulate a set backed by a hash mapping (a "hash set"). A set is a data structure designed to keep a list of unique elements. In my case, I am interested in storing a unique set of DOM objects. A hash set is a set that is implemented using a hash table, where the key is a unique identifier for the stored value. While one would normally write a class to encapsulate this functionality, I decided to test the implementation as a bare array with no layers of indirection on top. In other words, what I am about to present are the internals of what would be a hash set implementation.

The Goal: Store a (unique) set of objects in a way that makes them (a) easy to iterate, and (b) cheap to check membership.
The Strategy: Create an associative array where the key is a hash ID and the value is the object.

With a reasonably good hashing function, the strategy outlined above should work as desired.

"Reasonably good hashing function" -- that was the first gotcha. How do you generate a good hashing function for an object like DOMDocument? One (bad) way would be to serialize the object and then, perhaps, take its MD5 hash. That, however, will not work on many objects -- specifically any object that cannot serialze. The DOMDocument, for example, is actually backed by a resource and cannot be serialized.

One needed look far for a such a function, though. It turns out that there is an object hashing function in PHP 5. It's called spl_object_hash(), and it can take any object (even one that is not native PHP) and generate a hashcode for it.

Using spl_object_hash() we can build a simple data structure that functions like a hash set. This structure looks something like this:

array(
  $hashcode => $object
);

For example, we an generate an entry like so:

$object = new stdClass();
$hashcode = spl_object_hash($object);
 
$arr = array(
  $hashcode => $object
);

In the example above, then, the hashcode string is an array key, and the object itself is the array value. Note that since the hashcode will be the same each time an object is re-hashed, it serves not only as a comparison point ("if object a's hashkey == object b's hashkey, then a == b"), it also functions as a uniqueness constraint. Only one object with the specified hashcode can exist per array, so there is no possibility of two copies (actually, two references) to the same object being placed in the array.

With a data structure like this, we have a host of readily available functions for manipulating the structure, since we have at our disposal all of the PHP array functions. So to some degree this is an attractive choice out of the box.

The most import task, in our context at least, is that of determining whether an entry exists inside of the set. There are two possible candidates for this check, and both require supplying the hashcode:

  1. Check whether the key exists using array_key_exists().
  2. Check whether the key is set using isset().

To cut to the chase, isset() is faster than array_key_exists(), and offers the same features in our context, so we will use that. (The fact that they handle null values differently makes no difference to us. No null values will ever be inserted into the set.)

With this in mind, then, we would perform a containment check using something like this:

$object = new stdClass();
$hashkey = spl_object_hash($object);
// ...
 
// Check whether $arr has the $object.
if (isset($arr[$hashkey])) {
  // Do something...
}

Again, using an array that emulates a hash set allows us to use all of the existing array functions and also provides easy iterability. We can easily drop this into a foreach loop and iterate the contents. Before looking at how this performs, though, let's look at the other possible solution.

Using SplObjectStorage

The second method under consideration makes use of the new SplObjectStorage class from PHP 5.2+ (it might be in 5.1). This class, which is backed by a C implementation, provides a set-like storage mechanism for classes. It enforces uniqueness; only one of each object can be stored. It is also traversable, as it implements the Iterable interface. That means you can use it in loops such as foreach. (On the down side, the version in PHP 5.2 does not provide any method of random access or index-based access. The version in PHP 5.3 rectifies this shortcoming.)

The Goal: Store a (unique) set of objects in a way that makes them (a) easy to iterate, and (b) cheap to check membership.
The Strategy: Instantiate an object of class SplObjectStorage and store references to objects inside of this.

Creating a new SplObjectStorage is simple:

$objectStore = new SplObjectStorage();

An SplObjectStorage instance not only retains uniqueness information about objects, but objects are also stored in predictable order. SplObjectStorage is a FIFO -- First In, First Out.

Adding objects is done with the attach() method:

$objectStore = new SplObjectStorage();
$object = new stdClass();
 
$objectStore->attach($object);

It should be noted that attach will only attach an object once. If the same object is passed to attach() twice, the second attempt will simply be ignored. For this reason it is unnecessary to perform a contains() call before an attach() call. Doing so is redundant and costly.

Checking for the existence of an object within an SplObjectStorage instance is also straightforward:

$objectStore = new SplObjectStorage();
$object = new stdClass();
 
$objectStore->attach($object);
 
// ...
 
if ($objectStore->contains($object)) {
  // Do something...
}

While SplObjectStorage has nowhere near the number of supporting methods that one has access to with arrays, it allows for iteration and somewhat limited access to the objects stored within. In many use cases (including the one I am investigating here), SplObjectStorage provides the requisite functionality.

Now that we have taken a look at the two candidate data structures, let's see how they perform.

The Comparisons

Anyone who has seen Larry (Crell) Garfield's micro-benchmarks for arrays and SPL ArrayAccess objects will likely come into this set of benchmarks with the same set of expectations Larry and I had. We expected PHP's arrays to blow the SplObjectStorage out of the water. After all, arrays are a primitive type in PHP, and have enjoyed years of optimizations. However, the documentation for the SplObjectStorage indicates that the search time for an SplObjectStorage object is O(1), which would certainly make it competitive if the base speed is similar to that of an array.

My testing environments are:

  • An iMac (current generation) with a 3.06 Ghz Intel Core 2 Duo and 2G of 800mhz DDR2 RAM. MAMP 1.72 (PHP 5.2.5) provides the AMP stack.
  • A MacBook Pro with a 2.4 Ghz Intel Core 2 Duo and 4G of 667mhz DDR2 RAM. MAMP 1.72 (PHP 5.2.5) provides the AMP stack.

In both cases, the performance tests averaged about the same. Benchmarks in this article come from the second system.

Our basic testing strategy was to build a simple test that captured information about three things:

  1. The amount of time it takes to load the data structure
  2. The amount of time it takes to seek the data structure
  3. The amount of memory the data structure uses

We did our best to minimize the influence of other factors on the test. Here is our testing script:

  <?php
  /**
   * Object hashing tests.
   */
  $sos = new SplObjectStorage();
 
  $docs = array();
  $iterations = 100000;
 
  for ($i = 0; $i < $iterations; ++$i) {
    $doc = new DOMDocument();
    //$doc = new stdClass();
 
    $docs[] = $doc;
 
  }
 
  $start = $finis = 0;
 
  $mem_empty = memory_get_usage();
 
  // Load the SplObjectStorage
  $start = microtime(TRUE);
  foreach ($docs as $d) {
    $sos->attach($d);
  }
  $finis = microtime(TRUE);
 
  $time_to_fill = $finis - $start;
 
  // Check membership on the object storage
  $start = microtime(FALSE);
  foreach ($docs as $d) {
    $sos->contains($d);
  }
 
  $finis = microtime(FALSE);
 
  $time_to_check = $finis - $start;
 
  $mem_spl = memory_get_usage();
 
  $mem_used = $mem_spl - $mem_empty;
 
  printf("SplObjectStorage:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used);
 
  unset($sos);
  $mem_empty = memory_get_usage();
 
  // Test arrays:
  $start = microtime(TRUE);
  $arr = array();
 
  // Load the array
  foreach ($docs as $d) {
    $arr[spl_object_hash($d)] = $d;
  }
  $finis = microtime(TRUE);
 
  $time_to_fill = $finis - $start;
 
  // Check membership on the array
  $start = microtime(FALSE);
  foreach ($docs as $d) {
    //$arr[spl_object_hash($d)];
    isset($arr[spl_object_hash($d)]);
  }
 
  $finis = microtime(FALSE);
 
  $time_to_check = $finis - $start;
  $mem_arr = memory_get_usage();
 
  $mem_used = $mem_arr - $mem_empty;
 
 
  printf("Arrays:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used);
  ?>

The test above is broken into four separate tests. The first two test how well the SplObjectStorage method handles loading and containment checking. The second two perform the same test on our improvised array data structure.

There are two things worth noting about the test above.

First, the object of choice for our test was a DOMDocument. There are a few reasons for this. The obvious reason is that this test was done with the intent of optimizing QueryPath, which works with elements from the DOM implementation. There are two other interesting reasons, though. One is that DOMDocuments are not lightweight. The other is that DOMDocuments are backed by a C implementation, making them one of the more difficult cases when storing objects. (They cannot, for example, be conveniently serialized.)

That said, after observing the outcome, we repeated the test with basic stdClass objects and found the performance results to be nearly identical, and the memory usage to be proportional.

The second thing worth mention is that we used 100,000 iterations to test. This was about the upper bound that my PHP configuration allowed before running out of memory. Other than that, though, the number was chosen arbitrarily. When I ran tests with lower iteration counts, the SplObjectStorage definitely scaled linearly. Array performance was less predictable (larger standard deviation) with smaller data sets, though it seemed to average around the same for lower sizes as it does (more predictably) for larger sized arrays.

The Results

So how did these two strategies fare in our micro-benchmarks? Here is a representative sample of the output generated when running the above:

SplObjectStorage:
Time to fill: 0.085041999817.
Time to check: 0.073099000000.
Memory: 6124624
 
Arrays:
Time to fill: 0.193022966385.
Time to check: 0.153498000000.
Memory: 8524352

Averaging this over multiple runs, SplObjectStorage executed both fill and check functions twice as fast as the array method presented above. We tried various permutations of the tests above. Here, for example, are results for the same test with a stdClass object:

SplObjectStorage:
Time to fill: 0.082209110260.
Time to check: 0.070617000000.
Memory: 6124624
 
Arrays:
Time to fill: 0.189271926880.
Time to check: 0.152644000000.
Memory: 8524360

Not much different. Even adding arbitrary data to the object we stored does not make a difference in the time it takes for the SplObjectStorage (though it does seem to raise the time ever so slightly for the array).

Our conclusion is that SplObjectStorage is indeed a better solution for storing lots of objects in a set. Over the last week, I've ported QueryPath to SplObjectStorage (see the Quark branch at GitHub -- the existing Drupal QueryPath module can use this experimental branch without alteration), and will likely continue benchmarking. But preliminary results seem to provide a clear indication as to the best approach.

As a result of these findings, I'm much less inclined to default to arrays as "the best choice" simply because they are basic data types. If the SPL library contains features that out-perform arrays, they should be used when appropriate. From QueryPath to my Drupal modules, I expect that my code will be impacted by these findings.

Thanks to Crell for his help, and for Eddie at Frameweld for sparking my examination of these two methods in the first place.

Hash size and non hash alternatives

You use a hash as the object identifier and the hash creation appears to be proportional to the data in the object. A comparison of 1000 objects, each containing 100KB, to 1000 objects containing 1 byte would show the effect of hashing.

I do have projects with 100000 objects in memory but most have defined identifiers at creation, making a simple $objects[] = $object; a good storage method, coupled to indexes of the identifiers to be used later in the program. That means I can go straight to the identifier or identifiers I need and use them to find the index of the object. There is never any hash creation to slow things down. A gigabyte array performs the same as an array that takes up only a few megabytes. The result is almost exactly the same as an indexed database table and is useful where you want to use the same data several times selected and sorted different ways.

The approach also gets around the problem of hashing properties where cloned objects produce the same hash despite being different objects. Does spl_object_hash() guarantee unique hashes for objects with identical properties? If it does then what value is added to the hash to make it unique?

SPLObjectStorage is a nice idea and, for my use, will need selection methods similar to in memory databases. A simpler approach might be to take one of the in memory databases and give it native storage of PHP objects.

This is the data structure I am migrating away from

An indexed array of objects is the form of object storage used in QueryPath 1.x. There were two huge gotchas that did not work for me, and have led me to abandon this.

  1. Checking for containment does not perform well. Simple methods like in_array() perform very poorly. QueryPath has to do this check frequently, so this makes a big difference.
  2. When adding things to the array, I had to check whether the object was already in the array, which meant every addition incurred a big run through the loop. Various attempts at resolving this resulted in worse performance or too much fuzziness.

Regarding cloned objects, I actually want a different entry for a cloned object. So SplObjectStorage and spl_object_hash() provide the desired property: a different entry/key for each object, regardless of properties.

So, yeah, in some cases, the array-based storage you are talking about will work very well. My use case didn't happen to be one of them because of my need to repeatedly check for membership and enforce uniqueness.

A really fascinating article

A really fascinating article on parts of Drupal I rarely have the bravery to delve into. Thanks for the overview, I learned a lot.

Efficiency: Is it the array or the key generation?

Wes Munsil pointed out to me that I didn't indicate what parts of the array model are causing the slowdown. The culprit, he suggests, may be spl_object_hash(). While I haven't tested this, I suspect that he may be correct. A simple hashkey generator for your own use case may in fact outperform spl_object_hash(), in which case you may get better results off of this style of benchmark when using arrays.

And the culprit is...

Again, following Wes's prompting, I went ahead and ran a set of tests for looping and generating hash keys to see if this sheds some light on the source of the array bottleneck.

In fact, it sheds a lot of light on it.

Here's the timing code, which I included along with the rest of the tests presented in this article. It does two things: (1) times the amount of time spent looping through an array, and (2) times the process of generating hash keys with spl_object_hash().

// Time to loop
$start = microtime(TRUE);
 
foreach ($docs as $d) {}
$finis = microtime(TRUE);
$time_to_loop = $finis - $start;
 
// Time hashing by itself
 
$start = microtime(TRUE);
 
foreach ($docs as $d) {
  spl_object_hash($d);
}
$finis = microtime(TRUE);
 
$time_to_hash = $finis - $start;
 
printf("SPL Object hash:\nTime to loop: %0.12f.\nTime to loop + hash: %0.12f.\n\n", $time_to_loop, $time_to_hash);

The results seem to pinpoint the bottleneck:

SPL Object hash:
Time to loop: 0.002760171890.
Time to loop + hash: 0.013108968735.

Clearly, the brunt of the time spent in the set-like array emulation is spent hashing object keys with spl_object_hash(). A better hashing strategy may save a lot of time.

Re-checking with pre-computed keys (where the check function didn't require doing and spl_object_hash() indicates that the isset[$array[$key]) is indeed very fast (I get numbers around 0.003163 for a 100,000 checks with pre-computed keys).

All of this suggests that for the array method to perform well, a different hash key generator must be used.

Finding the culprit

A more fair approach would be to add the spl_object_hash() calls to the loops that fill and check the SplObjectStorage(). I ran the benchmark on my laptop but the results differed too much to say anything about the influence of spl_object_hash() (both the array and SplObjectStorage approach ran in about the same time, maybe my PHP 5.3 setup has something to do with that).

Nice article though, especially interesting in situations where you do not have a predefined unique id for each object.

Not sure I understand...

I don't understand why I would be adding spl_object_hash() calls to the SplObjectStorage loops. Wouldn't that just add unnecessary overhead to the SplObjectStorage loop?

I do know that the SPL library in PHP 5.3 is much improved. Does that translate to a faster spl_object_hash() (or a slower SplObjectStorage)? I don't know.

Thanks!

I think it's the key generation

I've run your test on my machine, adding a further test only on spl_object_hash.
My results show that the problem is spl_object_hash, taking 0.18 over 0.25 seconds.
test_spl_object_hash.php is appended below.

Your test is interesting but imho the problem is generating an unique key given of object, that php doesn't provide you in an easy way.

Apart from this example, will you need to generate the object id given the object, or will you be able to have an unique key in another way? E.g. a page url or similar?

$ php test_spl_object_hash.php
 SplObjectStorage:
Time to fill: 0.116365909576.
Time to check: 0.097703000000.
Memory: 5724580
 
Arrays:
Time to fill: 0.249284029007.
Time to check: 0.217324000000.
Memory: 8124476
 
spl_object_hash:
Time to fill: 0.199151039124.
Memory: 188

test_spl_object_hash.php

  1. <?php
  2. /**
  3.   * Object hashing tests.
  4.   */
  5. $sos = new SplObjectStorage();
  6.  
  7. $docs = array();
  8. $iterations = 100000;
  9.  
  10. for ($i = 0; $i < $iterations; ++$i) {
  11. $doc = new DOMDocument();
  12. //$doc = new stdClass();
  13.  
  14. $docs[] = $doc;
  15.  
  16. }
  17.  
  18. $start = $finis = 0;
  19.  
  20. $mem_empty = memory_get_usage();
  21.  
  22. // Load the SplObjectStorage
  23. $start = microtime(TRUE);
  24. foreach ($docs as $d) {
  25. $sos->attach($d);
  26. }
  27. $finis = microtime(TRUE);
  28.  
  29. $time_to_fill = $finis - $start;
  30.  
  31. // Check membership on the object storage
  32. $start = microtime(FALSE);
  33. foreach ($docs as $d) {
  34. $sos->contains($d);
  35. }
  36.  
  37. $finis = microtime(FALSE);
  38.  
  39. $time_to_check = $finis - $start;
  40.  
  41. $mem_spl = memory_get_usage();
  42.  
  43. $mem_used = $mem_spl - $mem_empty;
  44.  
  45. printf("SplObjectStorage:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used);
  46.  
  47. unset($sos);
  48. $mem_empty = memory_get_usage();
  49.  
  50. // Test arrays:
  51. $start = microtime(TRUE);
  52. $arr = array();
  53.  
  54. // Load the array
  55. foreach ($docs as $d) {
  56. $arr[spl_object_hash($d)] = $d;
  57. }
  58. $finis = microtime(TRUE);
  59.  
  60. $time_to_fill = $finis - $start;
  61.  
  62. // Check membership on the array
  63. $start = microtime(FALSE);
  64. foreach ($docs as $d) {
  65. //$arr[spl_object_hash($d)];
  66. isset($arr[spl_object_hash($d)]);
  67. }
  68.  
  69. $finis = microtime(FALSE);
  70.  
  71. $time_to_check = $finis - $start;
  72. $mem_arr = memory_get_usage();
  73.  
  74. $mem_used = $mem_arr - $mem_empty;
  75.  
  76.  
  77. printf("Arrays:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used);
  78.  
  79. // Test spl_object_hash
  80. $start = microtime(TRUE);
  81. $arr = array();
  82.  
  83. // Load the array
  84. foreach ($docs as $d) {
  85. spl_object_hash($d);
  86. }
  87. $finis = microtime(TRUE);
  88.  
  89. $time_to_fill = $finis - $start;
  90. $mem_arr = memory_get_usage();
  91.  
  92. $mem_used = $mem_arr - $mem_empty;
  93.  
  94.  
  95. printf("spl_object_hash:\nTime to fill: %0.12f.\nMemory: %d\n\n", $time_to_fill, $mem_used);
  96. ?>

My apologies: your comment

My apologies: your comment "And the culprit is..." was already stating that, but I was mislead by the numbers. I guess you run the script with less iterations, then the numbers were far from the initial figures.

Thanks for confirming

Thanks for going through the trouble of confirming the results. I'm glad that you, too, found the same things to be true. And I think your comment is clearer than my original "and the culprit is" comment.

My apologies: your comment

My apologies: your comment "And the culprit is..." was already stating that, but I was mislead by the numbers. I guess you run the script with less iterations, then the numbers were far from the initial figures.

Nice coding skill Thanks for

Nice coding skill

Thanks for sharing

I dont understand the purpose

I dont understand the purpose of this portion of the code:

// Check membership on the array63. $start = microtime(FALSE);64. foreach ($docs as $d) {65. //$arr[spl_object_hash($d)];66. isset($arr[spl_object_hash($d)]);67. }68. 69. $finis = microtime(FALSE);70. 71. $time_to_check = $finis - $start;72. $mem_arr = memory_get_usage();73. 74. $mem_used = $mem_arr - $mem_empty;75. 76. 77. printf("Arrays:\nTime to fill: %0.12f.\nTime to check: %0.12f.\nMemory: %d\n\n", $time_to_fill, $time_to_check, $mem_used);

Checking for a key

The purpose of this code?

// Check membership on the array
 
foreach ($docs as $d) {
  isset($arr[spl_object_hash($d)]);
}

What we're timing here is array access time for determining whether $arr has a $d.

Basically, we're checking whether the hashkey exists for $d, and to do this, we have to run spl_object_hash().

Again, most of this is just an attempt to make use of PHP's fastest features. Unfortunately, PHP 5.2 wasn't very fast at generating object hashes (a flaw I hear is remedied in PHP 5.3.)

e-greeting cards design | low

Post new comment

The content of this field is kept private and will not be shown publicly.
  • You can use Markdown syntax to format and style the text. Also see Markdown Extra for tables, footnotes, and more.
  • Web page addresses and e-mail addresses turn into links automatically.
  • Allowed HTML tags: <a> <em> <strong> <cite> <code> <ul> <ol> <li> <dl> <dt> <dd> <h1> <h2> <h3> <h4>
  • You can enable syntax highlighting of source code with the following tags: <code>, <blockcode>. Beside the tag style "<foo>" it is also possible to use "[foo]".
  • Lines and paragraphs break automatically.
  • Images can be added to this post.

More information about formatting options

Recent comments