XML, character sets, and setting the right encoding

May 5 2009

Working with incorrectly encoded XML documents is painful.

Today I encountered an XML document that did not state its encoding in the XML declaration. If a document has no explicit encoding in the declaration (and no byte order mark or transport-level information saying otherwise), it must be treated as UTF-8. But in this case, the document was actually encoded in some variant of ISO-8859-1 (it appeared to have snippets of MS Word-generated HTML copied and pasted into it). When it encountered bytes above the ASCII range (0x80 and up) that did not form valid UTF-8 sequences, the parser (rightly) choked.

Here's what the declaration looked like:

<?xml version="1.0"?>

Because it is encoded as an ISO-8859-1 document, it should have looked like this:

<?xml version="1.0" encoding="iso-8859-1"?>

So what do you do in a case like this? PHP's (actually, libxml's) parser does not allow you to explicitly override the XML declaration. Consider what might appear to be a working option:

$doc = new DOMDocument('1.0', 'ISO-8859-1');

This will not help: loading the document still fails (as will attempts to load it with SimpleXML). The document's own (implicit) UTF-8 declaration overrides the settings on the DOMDocument object. Setting $doc->encoding is ineffective for the same reason.

Running an automated replacement of the declaration is dangerous, though. Just because one document was ISO-8859-1, I cannot assume that all will be. Thus, building a solution could involve using iconv or similar tools. Indubitably, the correct route is to have the XML producer set the encoding correctly (or correctly convert the contents to UTF-8). Barring that, though, iconv is your best bet.
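
Since not every incoming document can be assumed to be ISO-8859-1, one cautious approach is to check whether a file is already valid UTF-8 and convert only when it is not. Here is a minimal sketch of that idea. The function name normalize_to_utf8 is my own invention, and the fallback to ISO-8859-1 is an assumption you would need to adjust for your own data:

```shell
#!/bin/sh
# Sketch: normalize one XML file to UTF-8 without blindly converting it.
# normalize_to_utf8 is a hypothetical helper name, not part of iconv.
normalize_to_utf8() {
    src=$1
    dst=$2
    # iconv exits non-zero when the input contains byte sequences that are
    # invalid in the source encoding, so a UTF-8 to UTF-8 pass doubles as a
    # validity check.
    if iconv -f UTF-8 -t UTF-8 "$src" > /dev/null 2>&1; then
        cp "$src" "$dst"    # already valid UTF-8 (or plain ASCII): keep as-is
    else
        # Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        iconv -f ISO-8859-1 -t UTF-8 "$src" > "$dst"
    fi
}
```

Calling normalize_to_utf8 incoming.xml clean.xml (both file names are placeholders) leaves well-formed UTF-8 input untouched and converts everything else. Keep in mind that every byte sequence is "valid" ISO-8859-1, so the fallback branch never reports an error even when the real encoding is something else entirely.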

Here's how to correct encoding errors from the command line. This should work on most UNIX-like systems, including Mac OS X:

$ iconv -f 'iso-8859-1' -t 'utf-8' bad-iso8859-1.xml > utf-8.xml

In this case, iconv will read the original file, convert it from ISO-8859-1 to UTF-8, and then write the results to utf-8.xml.
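
To see what that conversion does at the byte level, here is a small sketch using a sample character of my own choosing (é, which is not taken from the original document). In ISO-8859-1 it is the single byte 0xE9, which on its own is not a valid UTF-8 sequence. iconv rewrites it as the two-byte UTF-8 sequence 0xC3 0xA9:

```shell
#!/bin/sh
# 'é' in ISO-8859-1 is the single byte 0xE9 (octal 351).
# Convert it to UTF-8 and dump the resulting bytes as hex.
# latin1_e_acute_as_utf8_hex is a hypothetical helper name for illustration.
latin1_e_acute_as_utf8_hex() {
    printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n'
}

latin1_e_acute_as_utf8_hex   # prints c3a9
```

This is why the parser accepts the converted file: the declaration without an encoding attribute is now telling the truth.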