Misplaced Optimization: A story of PHP performance woes

Apr 25 2013

I recently began working on some PHP code for resolving HTML5 entities into their Unicode codepoints. According to the code itself, it had been optimized for performance. The code was moderately complex, and the authors appeared to have gone to great pains to build a specialized lookup algorithm. But when I took a closer look, I had my doubts. I decided to compare the "optimized" version with what I would call a naive version -- the simplest solution to the problem.

Here I show the two solutions, and then benchmark them for both memory and speed. Up front, I want to state that I am not poking fun at or deriding the original authors. Their solution has merits, and in a compiled language it might actually be a faster implementation. But to optimize for PHP requires both an understanding of the language and an appreciation for opcode-based interpretation.

UPDATE: Jeff Graham made a very astute observation about this on Twitter.

UPDATE: In the initial version of this article, I claimed that the tree was a b*tree. It's actually not. It's just a standard node tree. As such, there is no way to make it outperform a hash table. Based on this, the conclusion of the article is patently obvious. However, it is good to see how damaging mis-application of an algorithm can be to overall performance.

The Code: HTML5 Character Reference Resolving

HTML5 defines over 2,000 character references (expressed as entities). There is no way to compute an entity's Unicode code point from its string-based name; the mapping is essentially arbitrary. So to solve the problem, one must build a lookup mechanism that maps the character reference's string name to a Unicode character.

For example, the character reference name of ampersand (&) is amp. Likewise, Á is Aacute. A lookup tool should be able to take that string name (amp, Aacute) and return the right Unicode character. PHP has an older function that does this, but it only supports a subset of the characters in HTML5, so full standards support requires extra work.
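
For illustration, here is a quick sketch of that gap. It assumes the older built-in in question is html_entity_decode() with its pre-HTML5 default translation table; the entity names are just examples:

<?php

// Classic HTML 4.01 entities resolve fine...
echo html_entity_decode('&amp;', ENT_QUOTES, 'UTF-8'), "\n";    // &
echo html_entity_decode('&Aacute;', ENT_QUOTES, 'UTF-8'), "\n"; // Á

// ...but an HTML5-only name passes through untouched, because the
// default translation table predates HTML5.
echo html_entity_decode('&angmsd;', ENT_QUOTES, 'UTF-8'), "\n"; // &angmsd;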

The Optimized Code

The original code solved the problem as follows:

  • The string names were broken down into characters and then packed into a data structure designed to work like a node tree (roughly how such a tree might be built is sketched just after this list).
  • The node tree was serialized to a nice compact 178k file (see the end of the article for a sample).
  • At runtime, the serialized file is read into memory once.
  • To look up a character, a function walks the node tree. When it finds the full reference, it returns its codepoint.
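
Roughly, the build step for that file might look like the sketch below. This is my own reconstruction based on the serialized sample in Appendix 2, not the project's actual build script, and the handful of input entries are stand-ins:

<?php

// Reconstruction of the build step: pack each entity name, one
// character at a time, into nested arrays, then serialize the whole
// structure to disk so the parser can load it with unserialize().
$entities = array(
  // Stand-in input; the real list has over 2,000 entries.
  'AElig;' => 198,
  'AElig'  => 198,
  'AMP;'   => 38,
  'AMP'    => 38,
);

$tree = array();
foreach ($entities as $name => $codepoint) {
  $node = &$tree;
  foreach (str_split($name) as $char) {
    if (!isset($node[$char])) {
      $node[$char] = array();
    }
    $node = &$node[$char];
  }
  $node['codepoint'] = $codepoint;
  unset($node); // Break the reference before starting the next name.
}

file_put_contents('./named-character-references.ser', serialize($tree));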

In theory, this sounds very good. But does it perform? I took the code, optimized it as much as possible without changing the underlying algorithm, and wrote this simple test harness:

<?php

class Data {
  protected static $refs;
  public static function getRefs() {
    // A serialized node tree of entities is stored on disk.
    // It takes 178k of disk space and is only loaded once.
    if (empty(self::$refs)) {
      self::$refs = unserialize(file_get_contents('./named-character-references.ser'));
    }
    return self::$refs;
  }
}


function test($lookup)  {
  // Load the entities tree.
  $refs = Data::getRefs();

  // Split the string into an array.
  $stream = str_split($lookup);
  $chars = array_shift($stream);

  // Dive into the tree and look for a match.
  // The tree is structured like this:
  // array('a' => array('m' => array('p' => array('codepoint' => 38))))
  $codepoint = false;
  $char = $chars;
  while ($char !== false && isset($refs[$char])) {
    $refs = $refs[$char];
    if (isset($refs['codepoint'])) {
      $id = $chars;
      $codepoint = $refs['codepoint'];
    }
    $chars .= $char = array_shift($stream);
  }

  // Return the codepoint.
  return chr($codepoint);
};

$r = test('amp');

printf("Lookup %s >Current mem: %d Peak mem: %d\n", $r, memory_get_usage(), memory_get_peak_usage());

(Note that storing the data in a static class property seems to improve memory usage in PHP 5. It saves around 0.5MB of runtime memory compared to a global or local variable.)

Several things stood out to me, though:

  1. While a serialized file might seem logical, it's actually a burden on PHP. Unserializing is not fast. (A quick way to check this in isolation is sketched just after this list.)
  2. The node tree isn't really a node tree. It's actually an n-depth hash table. Tree traversal is not going to be very fast.
  3. Not only will tree traversal be slow, but the node tree-like table is going to require a lot of memory.
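
As a quick sanity check of point 1, one can time nothing but the deserialization step. This is a throwaway sketch of mine, not part of either benchmark script:

<?php

// Read the file first so that only unserialize() itself is timed.
$raw = file_get_contents('./named-character-references.ser');

$start = microtime(true);
$refs = unserialize($raw);

printf("unserialize: %.2f ms for %d bytes\n",
  (microtime(true) - $start) * 1000, strlen($raw));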

The Naive Code

Based on my observations, I decided to compare it to a naive implementation: I stored all of the entities in a single hashtable (in PHP parlance, this is an array). Rather than serialize, I just put the entire thing into a PHP file.

So my code looked like this:

<?php

class Entities {
  public static $byName = array (
    'Aacute' => 'Á',
    'Aacut' => 'Á',
    'aacute' => 'á',
    'aacut' => 'á',
    // Snip a few thousand lines
    'zwj' => '',
    'zwnj' => '',
  );
}

function test($lookup)  {
  if (!isset(Entities::$byName[$lookup])) return FALSE;

  return Entities::$byName[$lookup];
};

$r = test('amp');

printf("Lookup %s >Current mem: %d Peak mem: %d\n", $r, memory_get_usage(), memory_get_peak_usage());

Compared to the previous code, this should be self-explanatory. We take a string to look up, and we look it up. (We could actually remove the isset() check by prefixing the Entities::$byName access with @.) The total size of this file is 47k, so it's still smaller than the 178k serialized node tree.
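
For what it's worth, that shortcut would look something like this (a variant I did not benchmark); the trade-offs are that a missing name comes back as NULL rather than FALSE, and the @ operator has a small cost of its own:

<?php

// Variant without the isset() guard: @ silences the "undefined index"
// notice, and an unknown name comes back as NULL instead of FALSE.
function test_suppressed($lookup) {
  return @Entities::$byName[$lookup];
}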

My hypothesis coming out of this was that the lookup would be substantially faster, and the memory usage would be a little lower. The real results surprised me, though.

Comparing

Memory Usage

Both of the examples above are already instrumented to report memory usage. Let's compare:

$ php test-btree.php                                          
Lookup & >Current mem: 5289928 Peak mem: 5575336

To do a single entity lookup, this code took around 5MB of memory. How does that compare to the naive version?

$ php test-hash.php
Lookup  >Current mem: 1263200 Peak mem: 1276768

The naive version used less than a quarter of the memory that the node tree version used.

Speed Test

So the node tree version takes a little extra memory... that's not a deal breaker, is it? Probably not. In fact, if they perform about the same, then it's probably inconsequential.

To test performance, I did the following:

  • Put the test script on a local webserver
  • Warmed the server by requesting each script ten times
  • Ran ab -n 1000 http://localhost/Perf/test-SCRIPT.php

The output of ab (the Apache HTTP server benchmarking tool) is verbose, but here's the pertinent bit for the node tree version:

Concurrency Level:      1
Time taken for tests:   10.564 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      261000 bytes
HTML transferred:       49000 bytes
Requests per second:    94.66 [#/sec] (mean)
Time per request:       10.564 [ms] (mean)
Time per request:       10.564 [ms] (mean, across all concurrent requests)
Transfer rate:          24.13 [Kbytes/sec] received

The easiest number to zoom in on is Time taken for tests: 10.564 seconds, which conveniently averages to about 10.6 msec per request.

Let's compare with the hashtable version:

Concurrency Level:      1
Time taken for tests:   2.541 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      261000 bytes
HTML transferred:       49000 bytes
Requests per second:    393.61 [#/sec] (mean)
Time per request:       2.541 [ms] (mean)
Time per request:       2.541 [ms] (mean, across all concurrent requests)
Transfer rate:          100.33 [Kbytes/sec] received

Again, the important number: 2.541 seconds, or about 2.5 msec per request.

The naive version took only one quarter of the time that the optimized version took.

Now here's an interesting additional piece of data: This was without opcode caching. Would opcode caching make a difference?

Speed Test with Opcode Caching

To test opcode caching, I installed APC, restarted Apache, and then re-ran the battery of tests.

Here's how the optimized version fared:

Concurrency Level:      1
Time taken for tests:   10.636 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      261000 bytes
HTML transferred:       49000 bytes
Requests per second:    94.02 [#/sec] (mean)
Time per request:       10.636 [ms] (mean)
Time per request:       10.636 [ms] (mean, across all concurrent requests)
Transfer rate:          23.96 [Kbytes/sec] received

For all intents and purposes, there was no real change.

Compare that to the naive version:

Concurrency Level:      1
Time taken for tests:   2.025 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      261000 bytes
HTML transferred:       49000 bytes
Requests per second:    493.81 [#/sec] (mean)
Time per request:       2.025 [ms] (mean)
Time per request:       2.025 [ms] (mean, across all concurrent requests)
Transfer rate:          125.86 [Kbytes/sec] received

The opcode cache trimmed a little over half a millisecond off the request time. That makes the naive version more than 5x faster than the optimized version.

Why did it make a difference for the naive version, but not the node tree version? The reason is that the .ser file, introduced ostensibly to speed things up, cannot be opcode-cached, as it's not code. So on each request, it must be re-read from disk and unserialized.

Meanwhile, the array of 2,000-plus entities in the hashed version is part of the compiled script itself, so the opcode cache keeps it in memory. Assuming the server has sufficient cache space, that data will not need to be re-read and re-interpreted on each subsequent request.

One Additional Strength of the Optimized Code

While I opted to take the naive version of the code, there is one additional strength of the optimized code: Under certain conditions, this sort of algorithm can become more fault-tolerant. The optimized version can sometimes find a codepoint for a reference that was not well formed, because it traverses until it finds a match, and then it stops. The problem with this algorithm, though, is that given the input string 'foobar' and an entity map that contains 'foo' and 'foobar', the matched candidate will be 'foo'.

The naive version of the code does not correct for encoding errors. If the entity name isn't an exact match, it is not resolved.
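
If the naive version ever needed similar tolerance for malformed references, one option (a sketch of mine, not something I benchmarked) would be to fall back to progressively shorter prefixes of the name against the same hash table:

<?php

// Fallback for malformed references: try the full name first, then
// progressively shorter prefixes, returning the longest prefix that is
// a known entity name. Only worth calling when the exact lookup fails.
function lookupPrefix($name) {
  for ($len = strlen($name); $len > 0; $len--) {
    $prefix = substr($name, 0, $len);
    if (isset(Entities::$byName[$prefix])) {
      return Entities::$byName[$prefix];
    }
  }
  return FALSE;
}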

Appendix: XHProf Stack

Want to know where all of that time is spent? Here's an xhprof dump of the two call stacks.

Using xhprof_enable(XHPROF_FLAGS_CPU + XHPROF_FLAGS_MEMORY), I gathered the following stats:
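
The numbers were gathered with a wrapper along these lines (a minimal sketch of the setup). In the dumps, ct is the call count, wt is inclusive wall time in microseconds, cpu is CPU time in microseconds, and mu/pmu are the memory and peak-memory deltas in bytes:

<?php

// Minimal profiling wrapper: enable xhprof with CPU and memory
// counters, run the lookup once, then dump the raw per-edge stats.
xhprof_enable(XHPROF_FLAGS_CPU + XHPROF_FLAGS_MEMORY);
$r = test('amp');
print_r(xhprof_disable());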

Node Tree Version

Array
(
    [Data::getRefs==>file_get_contents] => Array
        (
            [ct] => 1
            [wt] => 219
            [cpu] => 0
            [mu] => 183888
            [pmu] => 192392
        )

    [Data::getRefs==>unserialize] => Array
        (
            [ct] => 1
            [wt] => 9083
            [cpu] => 8001
            [mu] => 4644728
            [pmu] => 4729336
        )

    [test==>Data::getRefs] => Array
        (
            [ct] => 1
            [wt] => 9346
            [cpu] => 8001
            [mu] => 4648312
            [pmu] => 4921728
        )

    [test==>str_split] => Array
        (
            [ct] => 1
            [wt] => 4
            [cpu] => 0
            [mu] => 2072
            [pmu] => 0
        )

    [test==>array_shift] => Array
        (
            [ct] => 4
            [wt] => 10
            [cpu] => 0
            [mu] => 480
            [pmu] => 0
        )

    [test==>chr] => Array
        (
            [ct] => 1
            [wt] => 3
            [cpu] => 0
            [mu] => 1128
            [pmu] => 0
        )

    [main()==>test] => Array
        (
            [ct] => 1
            [wt] => 9435
            [cpu] => 8001
            [mu] => 4654408
            [pmu] => 4921728
        )

    [main()==>xhprof_disable] => Array
        (
            [ct] => 1
            [wt] => 0
            [cpu] => 0
            [mu] => 1080
            [pmu] => 0
        )

    [main()] => Array
        (
            [ct] => 1
            [wt] => 9454
            [cpu] => 8001
            [mu] => 4657560
            [pmu] => 4921728
        )

)

Hash Version

Array
(
    [main()==>test] => Array
        (
            [ct] => 1
            [wt] => 139
            [cpu] => 0
            [mu] => 1072
            [pmu] => 0
        )

    [main()==>xhprof_disable] => Array
        (
            [ct] => 1
            [wt] => 1
            [cpu] => 0
            [mu] => 1080
            [pmu] => 0
        )

    [main()] => Array
        (
            [ct] => 1
            [wt] => 178
            [cpu] => 0
            [mu] => 4160
            [pmu] => 0
        )

)

Appendix 2: Sample SER data

Array
(
    [A] => Array
        (
            [E] => Array
                (
                    [l] => Array
                        (
                            [i] => Array
                                (
                                    [g] => Array
                                        (
                                            [;] => Array
                                                (
                                                    [codepoint] => 198
                                                )

                                            [codepoint] => 198
                                        )

                                )

                        )

                )

            [M] => Array
                (
                    [P] => Array
                        (
                            [;] => Array
                                (
                                    [codepoint] => 38
                                )

                            [codepoint] => 38
                        )

                )