Redux: Compressing PHP source code
Earlier this month I posted a short example of a compressor I was working on for QueryPath. I received a couple of very helpful comments that pointed to a PHP built-in library that I didn't know about: tokenizer.
The tokenizer is just what I needed. It parses a string of PHP code and returns an array of token information. Since it does the hard work of breaking up a file into meaningful chunks, building a compressor on top of it is trivially easy.
As you may recall, the goals of the compressor are:
- Pack all of the PHP files in a package into one PHP file.
- Remove all of the comments.
- Get rid of extraneous whitespace.
- Keep the output in pure PHP (i.e. don't require any additional processing steps to interpret the file)
With the tokenizer, we can accomplish these goals in about sixty lines of code.
Note that compressed code is not likely to result in huge execution speed improvements. When presented with this fact, several people have asked me why anyone would want to compress their code. I have done this as a result of requests to make a tiny single-file QueryPath distribution. I suppose use-cases for compact libraries range from embedded systems to faster downloads to simply smaller overall codebases.
Here's the revised code, which performs the same sort of compression as the code in the previous article. Note that I have also generalized it and made it into a command-line script.
<?php
/**
 * Compact PHP code.
 *
 * Strip comments, combine entire library into one file.
 */
if ($argc < 3) {
  print "Strip unnecessary data from PHP source files.\n\n\tUsage: php compactor.php DESTINATION.php SOURCE.php\n";
  exit;
}
$source = $argv[2];
$target = $argv[1];
print "Compacting $source into $target.\n";
// Including the source file makes PHP load it and everything it requires.
include $source;
$files = get_included_files();
$out = fopen($target, 'w');
fwrite($out, '<?php' . PHP_EOL);
fwrite($out, '// QueryPath. Copyright (c) 2009, Matt Butcher.' . PHP_EOL);
fwrite($out, '// This software is released under the LGPL, v. 2.1 or an MIT-style license.' . PHP_EOL);
fwrite($out, '// http://opensource.org/licenses/lgpl-2.1.php' . PHP_EOL);
fwrite($out, '// http://querypath.org.' . PHP_EOL);
foreach ($files as $f) {
  // Skip the compactor script itself.
  if ($f !== __FILE__) {
    $contents = file_get_contents($f);
    foreach (token_get_all($contents) as $token) {
      if (is_string($token)) {
        fwrite($out, $token);
      }
      else {
        switch ($token[0]) {
          case T_REQUIRE:
          case T_REQUIRE_ONCE:
          case T_INCLUDE_ONCE:
          // We leave T_INCLUDE since it is rarely used to include
          // libraries and often used to include HTML/template files.
          case T_COMMENT:
          case T_DOC_COMMENT:
          case T_OPEN_TAG:
          case T_CLOSE_TAG:
            break;
          case T_WHITESPACE:
            fwrite($out, ' ');
            break;
          default:
            fwrite($out, $token[1]);
        }
      }
    }
  }
}
fclose($out);
?>
The main idea of this script is to read a main file and all of its included libraries and compress them all into one output file, which will function as executable PHP code.
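The file list comes from PHP's own bookkeeping: once a file has been included, get_included_files() reports it along with everything it pulled in. The following is a minimal sketch of that behavior, not part of the compactor itself. The temporary directory and the main.php and lib.php file names are made up for the demonstration.

```php
<?php
// Hypothetical demo files: a "main" file that requires one "library" file.
$dir = sys_get_temp_dir() . DIRECTORY_SEPARATOR . 'compactor_demo_' . uniqid();
mkdir($dir);
$lib = $dir . DIRECTORY_SEPARATOR . 'lib.php';
$main = $dir . DIRECTORY_SEPARATOR . 'main.php';
file_put_contents($lib, '<?php $demo_lib_loaded = TRUE;');
file_put_contents($main, '<?php require_once __DIR__ . "/lib.php";');

// Including the main file also loads the library it requires.
include $main;

// The list holds the running script plus every file included so far.
$files = get_included_files();
foreach ($files as $f) {
  if ($f !== __FILE__) {
    print basename($f) . PHP_EOL;
  }
}

// Remove the throwaway files again.
unlink($main);
unlink($lib);
rmdir($dir);
```

One consequence worth keeping in mind: because the source is loaded with include, any top-level code in it actually runs while the file list is being gathered.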
Since most of this has been covered in the previous article, I just want to point out the section that makes use of the tokenizer.
The important part of the code above is the large foreach loop:
$contents = file_get_contents($f);
foreach (token_get_all($contents) as $token) {
  if (is_string($token)) {
    fwrite($out, $token);
  }
  else {
    switch ($token[0]) {
      case T_REQUIRE:
      case T_REQUIRE_ONCE:
      case T_INCLUDE_ONCE:
      // We leave T_INCLUDE since it is rarely used to include
      // libraries and often used to include HTML/template files.
      case T_COMMENT:
      case T_DOC_COMMENT:
      case T_OPEN_TAG:
      case T_CLOSE_TAG:
        break;
      case T_WHITESPACE:
        fwrite($out, ' ');
        break;
      default:
        fwrite($out, $token[1]);
    }
  }
}
The main function of the PHP tokenizer is token_get_all(). It takes a string of PHP source code and returns an array of the tokens found by parsing that string. A token is either a bare single-character string or an array holding a token identifier ($token[0]), the text of the token ($token[1]), and the line number on which the token appears ($token[2]).
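To see what these tokens look like in practice, here is a small sketch that prints a description of each one. The describe_tokens() helper is a name invented for this illustration. token_name() is the tokenizer's own function for turning a numeric identifier into its constant name.

```php
<?php
// Hypothetical helper (not part of the compactor): describe each token
// returned by token_get_all() as a short human-readable string.
function describe_tokens($code) {
  $described = array();
  foreach (token_get_all($code) as $token) {
    if (is_string($token)) {
      // Bare single-character tokens such as ';', '=', or '('.
      $described[] = 'CHAR ' . $token;
    }
    else {
      // Array tokens: [0] is the identifier, [1] the text, [2] the line number.
      $described[] = token_name($token[0]) . ' ' . json_encode($token[1]) . ' line ' . $token[2];
    }
  }
  return $described;
}

foreach (describe_tokens('<?php $a = 1; // set a') as $line) {
  print $line . PHP_EOL;
}
```

Running this shows the two shapes side by side: punctuation such as = and ; arrives as plain strings, while the open tag, the variable, the whitespace, and the comment each arrive as arrays.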
What we do above is loop through the tokens in a file, and then apply this set of rules:
- If the token is just a character, print it to the destination file.
- Skip any tokens that are:
  - require, require_once, or include_once directives
  - comments of some sort
  - an open (<?php) or close (?>) PHP directive
- Collapse each run of whitespace into a single space. We could do some fancy additional compression, but there is really no need for present purposes.
- Everything else is printed verbatim to the output file.
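The rules above can also be sketched as a function that works on a string rather than on files, which makes them easy to try out in isolation. This is an illustrative variant, not the script itself: compact_source() and the sample input are names and data made up for the example.

```php
<?php
// Hypothetical helper: apply the compactor's token rules to a string of
// PHP source and return the compacted code instead of writing to a file.
function compact_source($code) {
  // Token types that are dropped entirely.
  $skip = array(
    T_REQUIRE, T_REQUIRE_ONCE, T_INCLUDE_ONCE,
    T_COMMENT, T_DOC_COMMENT,
    T_OPEN_TAG, T_CLOSE_TAG,
  );
  $out = '';
  foreach (token_get_all($code) as $token) {
    if (is_string($token)) {
      // Bare characters are copied as-is.
      $out .= $token;
    }
    elseif (in_array($token[0], $skip, TRUE)) {
      continue;
    }
    elseif ($token[0] === T_WHITESPACE) {
      // Any run of whitespace becomes one space.
      $out .= ' ';
    }
    else {
      $out .= $token[1];
    }
  }
  return $out;
}

$sample = "<?php\n/** Doc block. */\nrequire_once 'lib.php';\nfunction  add(\$a, \$b) {\n  // Sum.\n  return \$a + \$b;\n}\n";
print compact_source($sample) . PHP_EOL;
```

Note that only the require_once keyword is dropped. Its argument stays behind as a bare string statement ('lib.php';), which PHP evaluates and discards, so the compacted code still parses.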
At the end of this process, we are left with one compacted file. Here's how the above script is executed against the QueryPath library:
php ./compactor.php QueryPath.compact.php ../src/QueryPath/QueryPath.php
This reads in four source files (QueryPath.php and the three files required) and processes them all into a single 65k output file named QueryPath.compact.php.
The source code is now hosted at GitHub. Feel free to fork: http://github.com/technosophos/PHPCompactor/tree/master