XML, character sets, and setting the right encoding

Working with incorrectly encoded XML documents is painful.

Today I encountered an XML document that did not declare (in its XML declaration) what encoding it used. If a document has no explicit encoding in the XML declaration (and no byte order mark), it must be treated as UTF-8. But in this case, the document was actually encoded in some variant of ISO-8859-1 (it appeared to have snippets of MS Word generated HTML copied and pasted into it). When it encountered bytes above 0x7F (the so-called high ASCII characters) that did not form valid UTF-8 sequences, the parser (rightly) choked.

Here's what the declaration looked like:

<?xml version="1.0"?>

Because it is encoded as an ISO-8859-1 document, it should have looked like this:

<?xml version="1.0" encoding="iso-8859-1"?>

So what do you do in a case like this? PHP's (actually, libxml's) parser does not allow you to explicitly override the XML declaration. Consider what might appear to be a working option:

$doc = new DOMDocument('1.0', 'ISO-8859-1');
$doc->load('my/broken/doc.xml');

This will fail (as will attempts to load the document with SimpleXML). The document's own (implicit) UTF-8 declaration will override the settings for the DOMDocument object. Setting $doc->encoding is ineffective for the same reason.

Running an automated replacement of the declaration is dangerous, though. Just because one document was ISO-8859-1, I cannot assume that all will be. Thus, building a solution could involve using iconv or other similar tools. Indubitably, the correct route is to have the XML producer correctly set the encoding (or correctly convert the contents to UTF-8). Barring that, though, iconv is your best bet.

Here's how to correct encoding errors from the command line. This should work on most UNIX-like systems, including Mac OS X:

$ iconv -f 'iso-8859-1' -t 'utf-8' bad-iso8859-1.xml > utf-8.xml

In this case, iconv will read the original file, convert it from ISO-8859-1 to UTF-8, and then write the results to utf-8.xml.
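Since not every incoming document can be assumed to be ISO-8859-1, one possible safeguard is to check whether a file is already valid UTF-8 before converting it. The following is only a rough sketch: ensure_utf8 is a hypothetical helper name, and the ISO-8859-1 fallback is an assumption carried over from the document described above, not something iconv can detect on its own.

```shell
#!/bin/sh
# ensure_utf8 is a hypothetical helper name, not part of iconv.
# Usage: ensure_utf8 INPUT OUTPUT [FALLBACK_ENCODING]
ensure_utf8() {
    in=$1
    out=$2
    # Assumption: files that are not valid UTF-8 are ISO-8859-1
    # unless the caller names another encoding.
    fallback=${3:-iso-8859-1}

    # Converting UTF-8 to UTF-8 changes nothing, but iconv exits with a
    # non-zero status when the input holds bytes that are not valid UTF-8.
    if iconv -f utf-8 -t utf-8 "$in" > /dev/null 2>&1; then
        # Already valid UTF-8: copy it through untouched.
        cp "$in" "$out"
    else
        # Not valid UTF-8: convert from the assumed source encoding.
        iconv -f "$fallback" -t utf-8 "$in" > "$out"
    fi
}
```

With the function defined in the current shell, ensure_utf8 bad-iso8859-1.xml utf-8.xml would behave like the command above, while a file that is already valid UTF-8 is copied as-is rather than being converted a second time. The check only proves that a file is not UTF-8. It cannot tell which encoding it really is, so the fallback remains a guess (for text pasted from MS Word, passing windows-1252 as the third argument may be the closer match).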
