PHP Stream Filters: Compress, transform, and transcode on the fly.
Your task: In PHP code, open a file compressed with BZ2, convert its contents from one character set to another, convert the entire contents to uppercase, run ROT-13 over it, and then write the output to another file. And do it as efficiently as possible.
Oh, and do it without any loops. Just for fun.
Actually, this task is exceptionally easy. Just make use of an often-overlooked feature of PHP's stream API: stream filters. Here's how.
Stream Filters In Theory
PHP uses the concept of "streams" as an abstraction layer for IO. Reading from and writing to files can be done with streams. Sockets, too, can be read from and written to as streams. FTP and HTTP servers have native stream support. That's why you can write this:
<?php
$contents = file_get_contents('http://example.com');
?>
The above gets the entire contents of the webpage at the destination URL, and reads it as if it were a file on the local filesystem.
You can also open compressed or archive files like Phar, BZip2, and Gzip files, having the content decompressed on the fly.
Stream filters provide one more layer on top of this: They allow you to open a stream, and then have one or more tasks (filters) run on the stream as the data is read from or written to the stream.
For example, you can open a stream to a gzipped file at a remote URL, and have the file uncompressed as it is read.
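As a minimal sketch of that layering, the snippet below attaches the built-in string.toupper filter to the read side of a stream. The php://memory stream and the sample text are stand-ins chosen for illustration; any readable stream should behave the same way.

```php
<?php
// Create an in-memory stream and put some sample data in it.
$stream = fopen('php://memory', 'w+b');
fwrite($stream, "hello streams\n");
rewind($stream);

// Attach a filter to the read side. Nothing is transformed yet.
stream_filter_append($stream, 'string.toupper', STREAM_FILTER_READ);

// The filter runs as the data is read.
$result = stream_get_contents($stream);
fclose($stream);

echo $result; // HELLO STREAMS
```

The data stored in the stream is untouched; only what comes out of the read call has been transformed.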
Stream Filters In Practice
By way of reminder, here's our task:
We want to read a file compressed with BZip2 and transform it into a file where the data is capitalized, and run through the ROT-13 obfuscator (which "rotates" each character 13 places in the alphabet).
As we go, we will also re-encode the file from ISO-8859-1 to UTF-8.
Broken down into a sequence, we will do the following:
- Open a stream for reading
- Open a (plain) stream for writing
- Uncompress the input stream
- Transcode from ISO-8859-1 to UTF-8
- Convert the contents of the stream to uppercase
- Rotate the characters by ROT-13
- Write the file out to a plain text file
- Clean up
With stream filters, this is accomplished by creating a pair of streams, and then assigning filters to each stream. When we copy the data from one stream to another, the filters will be run internally. Other than assigning the filters, we do not have to intervene.
We will begin with the file test.txt.bz2, which is a bzip2-compressed text file whose contents are "Hello World." We will generate a file called test-uppercase.txt.
Here's how we do it:
<?php
/**
* Example of stream filtering.
*/
// Open two file handles.
$in = fopen('test.txt.bz2', 'rb');
$out = fopen('test-uppercase.txt', 'wb');
// Add a decode filter to the first.
stream_filter_prepend($in, 'bzip2.decompress', STREAM_FILTER_READ);
// Change the charset from ISO-8859-1 to UTF-8
stream_filter_append($out, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_WRITE);
// Uppercase the entire string.
stream_filter_append($out, 'string.toupper', STREAM_FILTER_WRITE);
// Run ROT-13 on the output.
stream_filter_append($out, 'string.rot13', STREAM_FILTER_WRITE);
// Now copy. All of the filters are applied here.
stream_copy_to_stream($in, $out);
// Clean up.
fclose($in);
fclose($out);
?>
Now if we take a look at test-uppercase.txt, we will see that its contents look like this:
URYYB JBEYQ.
What the code does
The code above does the following:
- Open an input and an output file.
- On the input file...
  - Assign the bzip2.decompress filter in READ mode, which will decompress the input stream as it is read.
- On the output file...
  - Use iconv to transcode from ISO-8859-1 to UTF-8.
  - Use the string.toupper filter to transform the data to uppercase where applicable.
  - Use the string.rot13 filter to obfuscate the contents.
- Then copy the input stream ($in) to the output stream ($out) in one step.
- Finally, close the files.
It is important to note that none of the filters are actually applied until the streams are processed. So it is only when stream_copy_to_stream() is executed that all four filters are applied.
This method is more efficient than performing the same operations in a hand-written loop because the copying and filtering are done at a lower level, inside the PHP engine, where the data does not have to be passed into and out of PHP variables. So in addition to being easier to code, it should also be faster and less memory-intensive.
Some Important Details
Why doesn't stream filtering get used more often? One reason is that the documentation is sparse. To figure out how to use it, in fact, I had to read part of the C source code for PHP. (The unit tests helped a lot, too.)
Here are some useful tips, though:
- To find out (roughly) what filters are supported, you can use stream_get_filters().
- The order of filters can be managed using stream_filter_append(), stream_filter_prepend(), and stream_filter_remove().
- You can even write your own filters, should you so desire.
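To illustrate the last tip, here is a rough sketch of a user-defined filter built on PHP's php_user_filter base class and registered with stream_filter_register(). The class name StripCrFilter and the filter name example.stripcr are made up for this example; the filter simply drops carriage returns so that CRLF line endings become LF.

```php
<?php
// Hypothetical custom filter: removes carriage returns from the data.
class StripCrFilter extends php_user_filter
{
    public function filter($in, $out, &$consumed, $closing): int
    {
        // Data arrives in "buckets"; process each one and pass it along.
        while ($bucket = stream_bucket_make_writeable($in)) {
            // Report how many input bytes were handled.
            $consumed += $bucket->datalen;
            $bucket->data = str_replace("\r", '', $bucket->data);
            stream_bucket_append($out, $bucket);
        }
        return PSFS_PASS_ON;
    }
}

// Register the class under a filter name of our choosing.
stream_filter_register('example.stripcr', StripCrFilter::class);

$stream = fopen('php://memory', 'w+b');
fwrite($stream, "one\r\ntwo\r\n");
rewind($stream);

// Once registered, it is attached like any built-in filter.
stream_filter_append($stream, 'example.stripcr', STREAM_FILTER_READ);
$filtered = stream_get_contents($stream);
fclose($stream);

echo $filtered; // "one" and "two", each followed by a bare "\n"
```

After registration, the new name also shows up in the list returned by stream_get_filters().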
But one of the most frustrating aspects of the filtering library was figuring out which particular filters are supported. Running stream_get_filters() returns data like this:
<?php
Array
(
[0] => zlib.*
[1] => bzip2.*
[2] => convert.iconv.*
[3] => string.rot13
[4] => string.toupper
[5] => string.tolower
[6] => string.strip_tags
[7] => convert.*
[8] => consumed
[9] => dechunk
)
?>
But what do we do with zlib.*? Here's what I found:
ZLib Filters
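These support compressing and decompressing. By default they work with raw DEFLATE data (RFC 1951), the compression method that gzip is built on, rather than with complete gzip files.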
- zlib.inflate
- zlib.deflate
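A small round trip shows the two zlib filters working together. This is only a sketch: it assumes the zlib extension is loaded, and the scratch file and sample text are invented for the example. The gzinflate() call is there to show that, with default settings, the filter's output is raw DEFLATE data.

```php
<?php
$original = str_repeat('stream filters ', 20);

// Hypothetical scratch file for the compressed data.
$path = tempnam(sys_get_temp_dir(), 'zlib');

// Compress on write.
$out = fopen($path, 'wb');
stream_filter_append($out, 'zlib.deflate', STREAM_FILTER_WRITE);
fwrite($out, $original);
fclose($out); // Closing the stream flushes the compressor.

$compressedSize = filesize($path);

// Decompress on read.
$in = fopen($path, 'rb');
stream_filter_append($in, 'zlib.inflate', STREAM_FILTER_READ);
$roundTrip = stream_get_contents($in);
fclose($in);

// The same file can be decoded with the ordinary gzinflate() function.
$viaFunction = gzinflate(file_get_contents($path));
unlink($path);

var_dump($roundTrip === $original);
```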
BZip2 Filters
These support reading from and writing to a BZip2-compressed stream.
- bzip2.decompress
- bzip2.compress
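The compress filter is also a handy way to produce a sample input like the one used earlier. The sketch below assumes the bz2 extension is loaded and uses a made-up scratch file: it writes "Hello World." through bzip2.compress, then reads it back through bzip2.decompress.

```php
<?php
// Hypothetical scratch file; the article's own sample is test.txt.bz2.
$path = tempnam(sys_get_temp_dir(), 'bz2');

// Compress on write.
$out = fopen($path, 'wb');
stream_filter_append($out, 'bzip2.compress', STREAM_FILTER_WRITE);
fwrite($out, 'Hello World.');
fclose($out); // Closing the stream flushes the compressor.

// Decompress on read.
$in = fopen($path, 'rb');
stream_filter_append($in, 'bzip2.decompress', STREAM_FILTER_READ);
$contents = stream_get_contents($in);
fclose($in);
unlink($path);

echo $contents; // Hello World.
```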
Convert Filters
Base-64 and Quoted Printable seem to be the two formats supported by the convert filters:
- convert.base64-encode
- convert.base64-decode
- convert.quoted-printable-encode
- convert.quoted-printable-decode
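As a quick illustration, the following sketch Base-64 encodes a string as it is read from a stream. The in-memory stream and the sample text are placeholders.

```php
<?php
$stream = fopen('php://memory', 'w+b');
fwrite($stream, 'Hello World.');
rewind($stream);

// Encode the data as it is read.
stream_filter_append($stream, 'convert.base64-encode', STREAM_FILTER_READ);
$encoded = stream_get_contents($stream);
fclose($stream);

echo $encoded; // SGVsbG8gV29ybGQu
```

The result matches what base64_encode() returns for the same string, so the two approaches can be mixed freely.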
Convert.Iconv Filters
The filter format for these is different than the others. It is something like this:
'convert.iconv.FROM_ENCODING/TO_ENCODING'
Thus, convert.iconv.ISO-8859-13/ISO-8859-15 would convert from ISO-8859-13 into ISO-8859-15.
Presumably, any character sets recognized by iconv are supported by the filter.
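Here is a minimal sketch of the iconv filter in isolation, assuming the iconv extension is available. The sample word is an arbitrary choice: "café" in ISO-8859-1 stores the é as the single byte 0xE9, while UTF-8 needs the two bytes 0xC3 0xA9.

```php
<?php
// "café" encoded as ISO-8859-1.
$latin1 = "caf\xE9";

$stream = fopen('php://memory', 'w+b');
fwrite($stream, $latin1);
rewind($stream);

// Transcode as the data is read.
stream_filter_append($stream, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);
$utf8 = stream_get_contents($stream);
fclose($stream);

echo bin2hex($utf8); // 636166c3a9
```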
String Filters
These perform simple string manipulations:
- string.toupper
- string.tolower
- string.rot13
- string.strip_tags (removes HTML-like tags; deprecated as of PHP 7.3)
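Filters can also be named directly in a stream URL by using PHP's php://filter wrapper, which is not otherwise covered here. The sketch below chains two of the string filters that way; the scratch file is invented for the example.

```php
<?php
// Hypothetical scratch file holding the sample text.
$path = tempnam(sys_get_temp_dir(), 'str');
file_put_contents($path, 'Hello World.');

// Filters are separated by "|" and applied from left to right.
$url = 'php://filter/read=string.toupper|string.rot13/resource=' . $path;
$result = file_get_contents($url);
unlink($path);

echo $result; // URYYB JBEYQ.
```

This form is convenient when a function only accepts a filename, since no stream handle is needed to attach the filters.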
Dechunk
This decodes data that was sent using HTTP chunked transfer encoding.
- dechunk
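As a rough sketch, the snippet below feeds a hand-built chunked body through the filter. Each chunk is a hexadecimal length, a CRLF, the data, and another CRLF, with a zero-length chunk marking the end. The sample body is invented for the example.

```php
<?php
// Two chunks ("Hello" and " World.") followed by the terminating chunk.
$chunked = "5\r\nHello\r\n7\r\n World.\r\n0\r\n\r\n";

$stream = fopen('php://memory', 'w+b');
fwrite($stream, $chunked);
rewind($stream);

// Strip the chunk framing as the data is read.
stream_filter_append($stream, 'dechunk', STREAM_FILTER_READ);
$body = stream_get_contents($stream);
fclose($stream);

echo $body; // Hello World.
```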
Consumed
I am not sure what this filter is for. The C code looks like it counts the number of bytes consumed during a particular filter run, but I'm not sure what this is used for. Testing it returns nothing.
- consumed
If you know what this one is for, let me know in the comments.