Redux: Compressing PHP source code

Jun 21 2009

Earlier this month I posted a short example of a compressor I was working on for QueryPath. I received a couple of very helpful comments that pointed to a PHP built-in library that I didn't know about: tokenizer.

The tokenizer is just what I needed. It parses a string of PHP source code and returns an array of token information. Since it does the hard work of breaking up a file into meaningful chunks, building a compressor on top of it is trivially easy.

As you may recall, the goals of the compressor are:

Pack all of the PHP files in a package into one PHP file.
Remove all of the comments.
Get rid of extraneous whitespace.
Keep the output in pure PHP (i.e., don't require any additional processing steps to interpret the file).

With the tokenizer, we can accomplish these goals in about sixty lines of code.

Note that compressed code is unlikely to yield large execution speed improvements. When presented with this fact, several people have asked me why anyone would want to compress their code. I wrote the compressor because people asked for a tiny single-file QueryPath distribution. I suppose use cases for compact libraries range from embedded systems to faster downloads to simply smaller overall codebases.

Here's the revised code, which performs the same sort of compression as the code in the previous article. Note that I have also generalized it and turned it into a command-line script.

<?php
/**
 * Compact PHP code.
 *
 * Strip comments, combine entire library into one file.
 */

if ($argc < 3) {
  print "Strip unnecessary data from PHP source files.\n\n\tUsage: php compactor.php DESTINATION.php SOURCE.php\n";
  exit;
}


$source = $argv[2];
$target = $argv[1];
print "Compacting $source into $target.\n";

include $source;

$files = get_included_files();

$out = fopen($target, 'w');
fwrite($out, '<?php' . PHP_EOL);
fwrite($out, '// QueryPath. Copyright (c) 2009, Matt Butcher.' . PHP_EOL);
fwrite($out, '// This software is released under the LGPL, v. 2.1 or an MIT-style license.' . PHP_EOL);
fwrite($out, '// http://opensource.org/licenses/lgpl-2.1.php' . PHP_EOL);
fwrite($out, '// http://querypath.org.' . PHP_EOL);
foreach ($files as $f) {
  if ($f !== __FILE__) {
    $contents = file_get_contents($f);
    foreach (token_get_all($contents) as $token) {
      if (is_string($token)) {
        fwrite($out, $token);
      }
      else {
        switch ($token[0]) {
          case T_REQUIRE:
          case T_REQUIRE_ONCE:
          case T_INCLUDE_ONCE:
          // We leave T_INCLUDE since it is rarely used to include
          // libraries and often used to include HTML/template files.
          case T_COMMENT:
          case T_DOC_COMMENT:
          case T_OPEN_TAG:
          case T_CLOSE_TAG:
            break;
          case T_WHITESPACE:
            fwrite($out, ' ');
            break;
          default:
            fwrite($out, $token[1]);
        }

      }
    }
  }
}
fclose($out);
?>

The main idea of this script is to read a main file and all of its included libraries and compress them all into one output file, which will function as executable PHP code.
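The file list comes from including the main file and then asking PHP what it loaded. Here is a minimal sketch of that mechanism in isolation; the temporary directory and the main.php and helper.php files are hypothetical and exist only for this demonstration.

```php
<?php
// Sketch: build two throwaway files where main.php requires helper.php,
// then show that get_included_files() reports both after a single include.
$dir = sys_get_temp_dir() . '/compactor_demo_' . uniqid();
mkdir($dir);
file_put_contents("$dir/helper.php", "<?php function demo_helper() { return 'ok'; }\n");
file_put_contents("$dir/main.php", "<?php require_once __DIR__ . '/helper.php';\n");

// Including main.php runs it, which in turn pulls in helper.php.
include "$dir/main.php";

// The list also contains the script that was originally run, which is why
// the compactor skips __FILE__.
$names = array_map('basename', get_included_files());
print implode(PHP_EOL, $names) . PHP_EOL;

// Remove the throwaway files.
unlink("$dir/helper.php");
unlink("$dir/main.php");
rmdir($dir);
```

One consequence worth keeping in mind: because the source is included, any top-level code in it actually runs while the compactor is gathering the file list.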

Since most of this has been covered in the previous article, I just want to point out the section that makes use of the tokenizer.

The important part of the code above is the large foreach loop:

    $contents = file_get_contents($f);
    foreach (token_get_all($contents) as $token) {
      if (is_string($token)) {
        fwrite($out, $token);
      }
      else {
        switch ($token[0]) {
          case T_REQUIRE:
          case T_REQUIRE_ONCE:
          case T_INCLUDE_ONCE:
          // We leave T_INCLUDE since it is rarely used to include
          // libraries and often used to include HTML/template files.
          case T_COMMENT:
          case T_DOC_COMMENT:
          case T_OPEN_TAG:
          case T_CLOSE_TAG:
            break;
          case T_WHITESPACE:
            fwrite($out, ' ');
            break;
          default:
            fwrite($out, $token[1]);
        }

      }
    }

The main function for the PHP tokenizer is token_get_all(). It takes a string of PHP source code and returns an array of tokens resulting from the parsing of that string. A token can be either a bare character (a plain string) or an array with a token identifier ($token[0]), the text of the token ($token[1]), and the line number on which the token appears ($token[2]).
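To make the two token shapes concrete, here is a small sketch that tokenizes a one-line snippet and prints each token. The snippet and the "(char)" label are made up for illustration; token_name() is the tokenizer's own function for turning an identifier into a readable constant name.

```php
<?php
// Sketch: list the tokens of a tiny snippet, handling both shapes that
// token_get_all() can return.
$code = '<?php $x = 1; // set x' . "\n";

$rows = array();
foreach (token_get_all($code) as $token) {
  if (is_string($token)) {
    // Bare characters such as '=' or ';' carry no identifier or line number.
    $rows[] = array('(char)', $token);
  }
  else {
    // Array tokens: $token[0] is the identifier, $token[1] is the text.
    $rows[] = array(token_name($token[0]), $token[1]);
  }
}

foreach ($rows as $row) {
  // var_export() makes whitespace and newlines inside a token visible.
  printf("%-14s %s\n", $row[0], var_export($row[1], true));
}
```

The exact token boundaries can differ slightly between PHP versions (for example, whether a one-line comment token includes its trailing newline), so the listing is best read as a guide to the general shape.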

What we do above is loop through the tokens in a file, and then apply this set of rules:

If the token is just a character, print it to the destination file.
Skip any tokens that are:
- require, require_once, or include_once keywords
- comments of some sort
- an opening (<?php) or closing (?>) PHP tag
Collapse each run of whitespace into a single space. We could do some fancy additional compression, but there is really no need for present purposes.
Everything else is printed verbatim to the output file.
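The rules above can be tried on a string without touching any files. The sketch below wraps the same switch in a hypothetical compact_source() helper (that name and the sample input are not part of the original script) and returns the result instead of writing it to a file handle.

```php
<?php
// Hypothetical helper: applies the compactor's token rules to a string of
// PHP source and returns the compacted text.
function compact_source($source) {
  $out = '';
  foreach (token_get_all($source) as $token) {
    if (is_string($token)) {
      // Bare characters pass straight through.
      $out .= $token;
      continue;
    }
    switch ($token[0]) {
      case T_REQUIRE:
      case T_REQUIRE_ONCE:
      case T_INCLUDE_ONCE:
      case T_COMMENT:
      case T_DOC_COMMENT:
      case T_OPEN_TAG:
      case T_CLOSE_TAG:
        // Skipped entirely.
        break;
      case T_WHITESPACE:
        // Any run of whitespace collapses to one space.
        $out .= ' ';
        break;
      default:
        $out .= $token[1];
    }
  }
  return $out;
}

$sample = "<?php\n/** Doc. */\nrequire_once 'lib.php';\n\n"
  . "function add(\$a, \$b) {\n  // Sum.\n  return \$a + \$b;\n}\n";
print compact_source($sample) . PHP_EOL;
```

Running this shows one detail of the approach: only the require_once keyword token is dropped, so the path expression that followed it ('lib.php'; in this sample) stays behind as a statement that does nothing.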

At the end of this process, we are left with one compacted file. Here's how the above script is executed against the QueryPath library:

php ./compactor.php QueryPath.compact.php ../src/QueryPath/QueryPath.php

This reads in four source files (QueryPath.php and the three files required) and processes them all into a single 65k output file named QueryPath.compact.php.

The source code is now hosted at GitHub. Feel free to fork: http://github.com/technosophos/PHPCompactor/tree/master
