PHP Stream Filters: Compress, transform, and transcode on the fly.

Feb 28 2012

Your task: In PHP code, open a file compressed with BZ2, convert its contents from one character set to another, convert the entire contents to uppercase, run ROT-13 over it, and then write the output to another file. And do it as efficiently as possible.

Oh, and do it without any loops. Just for fun.

Actually, this task is exceptionally easy to do. Just make use of an often overlooked feature of PHP's stream API: stream filters. Here's how.

Stream Filters In Theory

PHP uses the concept of "streams" as an abstraction layer for IO. Reading from and writing to files can be done with streams. Sockets, too, can be read from and written to as streams. FTP and HTTP servers have native stream support. That's why you can write this:

<?php
$contents = file_get_contents('http://example.com');
?>

The above gets the entire contents of the webpage at the destination URL, and reads it as if it were a file on the local filesystem.

You can also open compressed or archive files like Phar, BZip2, and Gzip files, having the content decompressed on the fly.
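As a small sketch of on-the-fly decompression, the snippet below round-trips a string through the compress.zlib:// wrapper. It assumes the zlib extension is loaded; the scratch file from tempnam() is only a stand-in for a real .gz file.

```php
<?php
// Scratch file for this sketch only; any writable path would do.
$path = tempnam(sys_get_temp_dir(), 'gz_');

// Writing through compress.zlib:// gzip-compresses the data on the way out.
file_put_contents('compress.zlib://' . $path, 'Hello World.');

// Reading through the same wrapper decompresses on the way back in.
$roundTrip = file_get_contents('compress.zlib://' . $path);

// Read without the wrapper, the file holds the compressed bytes.
$rawBytes = file_get_contents($path);

unlink($path);

echo $roundTrip, PHP_EOL;
```

The calling code never decompresses anything itself; the wrapper named in the path does the work.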

Stream filters provide one more layer on top of this: They allow you to open a stream, and then have one or more tasks (filters) run on the stream as the data is read from or written to the stream.

For example, you can open a stream to a gzipped file at a remote URL and have the file uncompressed as it is read.
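To make the idea concrete, here is a minimal sketch that attaches a single filter to a stream and reads through it. The php://memory stream is just a convenient stand-in for a real file or socket.

```php
<?php
// An in-memory stream standing in for a file.
$stream = fopen('php://memory', 'w+b');
fwrite($stream, 'Hello World.');
rewind($stream);

// From here on, everything read from $stream passes through string.toupper.
stream_filter_append($stream, 'string.toupper', STREAM_FILTER_READ);

$result = stream_get_contents($stream);
fclose($stream);

echo $result, PHP_EOL;
```

The reading code is an ordinary stream_get_contents() call; the transformation is a property of the stream.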

Stream Filters In Practice

By way of reminder, here's our task:

We want to read a file compressed with BZip2 and transform it into a file where the data is capitalized and run through the ROT-13 obfuscator (which "rotates" each letter 13 places in the alphabet).

As we go, we will also re-encode the file from ISO-8859-1 to UTF-8.

Broken down into a sequence, we will do the following:

  • Open a stream for reading
  • Open a (plain) stream for writing
  • Uncompress the input stream
  • Transcode from ISO-8859-1 to UTF-8
  • Convert the contents of the stream to uppercase
  • Rotate the characters by ROT-13
  • Write the file out to a plain text file
  • Clean up

With stream filters, this is accomplished by creating a pair of streams, and then assigning filters to each stream. When we copy the data from one stream to another, the filters will be run internally. Other than assigning the filters, we do not have to intervene.

We will begin with the file test.txt.bz2, which is a bzip2-compressed text file whose contents are "Hello World." From it we will generate a file called test-uppercase.txt.

Here's how we do it:

<?php
/**
 * Example of stream filtering.
 */

// Open two file handles.
$in = fopen('test.txt.bz2', 'rb');
$out = fopen('test-uppercase.txt', 'wb');

// Add a decode filter to the first.
stream_filter_prepend($in, 'bzip2.decompress', STREAM_FILTER_READ);

// Change the charset from ISO-8859-1 to UTF-8
stream_filter_append($out, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_WRITE);

// Uppercase the entire string.
stream_filter_append($out, 'string.toupper', STREAM_FILTER_WRITE);

// Run ROT-13 on the output.
stream_filter_append($out, 'string.rot13', STREAM_FILTER_WRITE);

// Now copy. All of the filters are applied here.
stream_copy_to_stream($in, $out);

// Clean up.
fclose($in);
fclose($out);
?>

Now if we take a look at test-uppercase.txt, we will see that its contents look like this:

URYYB JBEYQ.

What the code does

The code above basically does the following:

  • Open an input and an output file.
  • On the input file...
    • Assign the bzip2.decompress filter in READ mode, which will decompress the input stream as it is read.
  • On the output file...
    • Use iconv to transcode from ISO-8859-1 to UTF-8
    • Use the string.toupper filter to transform the data to uppercase where applicable.
    • Use the string.rot13 filter to obfuscate the contents.
  • Then copy the input stream ($in) to the output stream ($out) in one step.
  • Finally, close the files.

It is important to note that none of the filters are actually applied until the streams are processed. So it is only when stream_copy_to_stream() is executed that all four filters are applied.
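If you want to try this without a BZip2 file on hand, the sketch below reproduces the uppercase and ROT-13 steps with in-memory streams. The php://memory streams are stand-ins for the input and output files, and the decompression and iconv steps are left out.

```php
<?php
// Stand-in for the input file: an in-memory stream holding plain text.
$in = fopen('php://memory', 'w+b');
fwrite($in, 'Hello World.');
rewind($in);

// Stand-in for the output file.
$out = fopen('php://memory', 'w+b');

// Attaching the filters transforms nothing yet.
stream_filter_append($out, 'string.toupper', STREAM_FILTER_WRITE);
stream_filter_append($out, 'string.rot13', STREAM_FILTER_WRITE);

// The filters run here, as the data is written to $out.
stream_copy_to_stream($in, $out);

// The filters are write-only, so reading back shows what was stored.
rewind($out);
$stored = stream_get_contents($out);

fclose($in);
fclose($out);

echo $stored, PHP_EOL;
```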

This method is more efficient than performing the same operations in a loop because the copying and filtering happen inside the PHP engine, so the data does not have to be passed into and out of userland PHP code. So in addition to being easier to code, it is also faster and less memory intensive.
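For contrast, here is roughly what the hand-written loop would look like next to the filtered copy. This is a sketch, not a benchmark: make_input() is a hypothetical helper, and the in-memory streams stand in for files. It only shows that both routes produce the same bytes.

```php
<?php
// Hypothetical helper: an in-memory stream preloaded with $data.
function make_input($data)
{
    $stream = fopen('php://memory', 'w+b');
    fwrite($stream, $data);
    rewind($stream);
    return $stream;
}

$text = str_repeat("Hello World.\n", 1000);

// Manual version: every chunk passes through PHP code.
$in = make_input($text);
$manualOut = fopen('php://memory', 'w+b');
while (!feof($in)) {
    $chunk = fread($in, 8192);
    if ($chunk === false || $chunk === '') {
        break;
    }
    fwrite($manualOut, str_rot13(strtoupper($chunk)));
}
rewind($manualOut);
$manual = stream_get_contents($manualOut);
fclose($in);
fclose($manualOut);

// Filter version: one copy call, no loop in our code.
$in = make_input($text);
$filteredOut = fopen('php://memory', 'w+b');
stream_filter_append($filteredOut, 'string.toupper', STREAM_FILTER_WRITE);
stream_filter_append($filteredOut, 'string.rot13', STREAM_FILTER_WRITE);
stream_copy_to_stream($in, $filteredOut);
rewind($filteredOut);
$filtered = stream_get_contents($filteredOut);
fclose($in);
fclose($filteredOut);

var_dump($manual === $filtered);
```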

Some Important Details

Why doesn't stream filtering get used more often? One reason is that the documentation is sparse. To figure out how to use it, in fact, I had to read part of the C source code for PHP. (The unit tests helped a lot, too.)

Here are some useful tips, though:

  • To find out (roughly) what filters are supported, you can use stream_get_filters().
  • The order of filters can be managed using stream_filter_append(), stream_filter_prepend(), and stream_filter_remove().
  • You can even write your own filters, should you so desire.
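Here is a minimal sketch of the last two tips together: a user-defined filter, registered under a name, attached to a stream, and then removed again. The class underscore_filter and the filter name demo.underscore are made up for this example; php_user_filter and the stream_* functions are PHP's own.

```php
<?php
// Hypothetical filter for illustration: turns spaces into underscores.
class underscore_filter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing): int
    {
        // Take each bucket of incoming data, rewrite it, and pass it along.
        while ($bucket = stream_bucket_make_writeable($in)) {
            $bucket->data = str_replace(' ', '_', $bucket->data);
            $consumed += $bucket->datalen;
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

// Register the class under a made-up filter name.
stream_filter_register('demo.underscore', 'underscore_filter');

// Registered user filters show up alongside the built-in ones.
$isListed = in_array('demo.underscore', stream_get_filters(), true);

$stream = fopen('php://memory', 'w+b');

// Keep the handle so the filter can be removed later.
$filter = stream_filter_append($stream, 'demo.underscore', STREAM_FILTER_WRITE);
fwrite($stream, 'with filter ');

// After removal, writes go through untouched.
stream_filter_remove($filter);
fwrite($stream, 'without filter');

rewind($stream);
$result = stream_get_contents($stream);
fclose($stream);

echo $result, PHP_EOL;
```

A per-byte replacement like this one is safe no matter how the data is split into buckets. A filter that matches multi-byte patterns would need to handle patterns that straddle two buckets.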

But one of the most frustrating aspects of the filtering library was figuring out which particular filters are supported. Running stream_get_filters() returns data like this:

Array
(
    [0] => zlib.*
    [1] => bzip2.*
    [2] => convert.iconv.*
    [3] => string.rot13
    [4] => string.toupper
    [5] => string.tolower
    [6] => string.strip_tags
    [7] => convert.*
    [8] => consumed
    [9] => dechunk
)

But what do we do with zlib.*? Here's what I found:

ZLib Filters

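These support compressing and decompressing with DEFLATE, the algorithm behind GZip. Note that zlib.deflate compresses and zlib.inflate decompresses, and that by default they work on raw DEFLATE data rather than on files with GZip headers.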

  • zlib.inflate
  • zlib.deflate
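A quick round trip shows the pair in action. This sketch assumes the zlib extension is loaded and uses a scratch file from tempnam() so that closing the output stream flushes the compressor.

```php
<?php
$original = str_repeat('Hello World. ', 100);

// Scratch file for this sketch only.
$path = tempnam(sys_get_temp_dir(), 'zf_');

// Compress while writing.
$out = fopen($path, 'wb');
stream_filter_append($out, 'zlib.deflate', STREAM_FILTER_WRITE);
fwrite($out, $original);
fclose($out); // closing the stream flushes the filter

$compressedSize = filesize($path);

// Decompress while reading.
$in = fopen($path, 'rb');
stream_filter_append($in, 'zlib.inflate', STREAM_FILTER_READ);
$restored = stream_get_contents($in);
fclose($in);

unlink($path);

printf("%d bytes compressed to %d\n", strlen($original), $compressedSize);
```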

BZip2 Filters

These support reading from and writing to a BZip2-compressed stream.

  • bzip2.decompress
  • bzip2.compress

Convert Filters

Base-64 and Quoted Printable seem to be the two formats supported by the convert filters:

  • convert.base64-encode
  • convert.base64-decode
  • convert.quoted-printable-encode
  • convert.quoted-printable-decode
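As a sketch, here is Base-64 in both directions on one in-memory stream: encoding on write, then decoding on read. Removing the write filter before reading back is my own precaution so that any bytes the encoder is still holding get flushed.

```php
<?php
$stream = fopen('php://memory', 'w+b');

// Encode on the way in.
$encoder = stream_filter_append($stream, 'convert.base64-encode', STREAM_FILTER_WRITE);
fwrite($stream, 'Hello World.');

// Removing the filter flushes anything it has buffered.
stream_filter_remove($encoder);

rewind($stream);
$encoded = stream_get_contents($stream);

// Decode on the way out.
rewind($stream);
stream_filter_append($stream, 'convert.base64-decode', STREAM_FILTER_READ);
$decoded = stream_get_contents($stream);

fclose($stream);

echo $encoded, PHP_EOL;
echo $decoded, PHP_EOL;
```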

Convert.Iconv Filters

The filter name format for these is different from the others. It is something like this:

'convert.iconv.FROM_ENCODING/TO_ENCODING'

Thus, convert.iconv.ISO-8859-13/ISO-8859-15 would convert from ISO-8859-13 into ISO-8859-15.

Presumably, any character sets recognized by iconv are supported by the filter.
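A small sketch of the naming scheme at work, assuming the iconv extension is loaded. The input is the word "café" written as ISO-8859-1 bytes, where é is the single byte 0xE9.

```php
<?php
// "café" in ISO-8859-1: four bytes.
$latin1 = "caf\xE9";

$stream = fopen('php://memory', 'w+b');
fwrite($stream, $latin1);
rewind($stream);

// Transcode while reading: FROM_ENCODING/TO_ENCODING.
stream_filter_append($stream, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);
$utf8 = stream_get_contents($stream);
fclose($stream);

// In UTF-8, é takes two bytes (0xC3 0xA9), so the string grows by one.
echo bin2hex($latin1), ' -> ', bin2hex($utf8), PHP_EOL;
```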

String Filters

These perform simple string manipulations:

  • string.toupper
  • string.tolower
  • string.rot13
  • string.strip_tags (removes HTML-like tags; deprecated as of PHP 7.3)

Dechunk

This reads data passed in using the chunked transfer encoding.

  • dechunk
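Here is a sketch of what that looks like with a hand-written chunked body. Each chunk is a hexadecimal length, a CRLF, the data, and another CRLF, and a zero-length chunk ends the body. The in-memory stream stands in for a raw HTTP response body.

```php
<?php
// Two 6-byte chunks followed by the terminating zero-length chunk.
$chunked = "6\r\nHello \r\n6\r\nWorld.\r\n0\r\n\r\n";

$stream = fopen('php://memory', 'w+b');
fwrite($stream, $chunked);
rewind($stream);

// Strip the chunk framing while reading.
stream_filter_append($stream, 'dechunk', STREAM_FILTER_READ);
$body = stream_get_contents($stream);
fclose($stream);

echo $body, PHP_EOL;
```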

Consumed

I am not sure what this filter is for. The C code looks like it counts the number of bytes consumed during a particular filter run, but I'm not sure what this is used for. Testing it returns nothing.

  • consumed

If you know what this one is for, let me know in the comments.


