QueryPath and Character Sets: Converting content with mb_convert_encoding()

May 3 2010

QueryPath can be used to crawl the web, parsing web pages and gleaning information. But the HTML of remote websites is not always as pristine and standards-compliant as we would like, and one thing that can be particularly frustrating is determining the encoding of a document. (This gets substantially more complicated when HTTP headers list one encoding and HTML meta tags list another -- a common configuration error.)

QueryPath is primarily a library for working with XML and HTML, but it assumes that you know from the outset what character set your document uses. This is not always a good assumption to make. Here is one way to circumvent the problem: rather than writing your own code to find out a document's character set, use PHP's built-in functions (assuming you have the mbstring extension compiled in) to do this for you.

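require 'QueryPath/QueryPath.php';

$url = 'http://mopy.fr/';
// Fetch the page, then convert it from the detected encoding to ISO-8859-1.
$contents = mb_convert_encoding(file_get_contents($url), 'iso-8859-1', 'auto');
// Ask QueryPath not to report HTML parser warnings.
$opts = array('ignore_parser_warnings' => TRUE);

// Print the text of the <title> element; @ suppresses any remaining error output.
print @qp($contents, 'title', $opts)->text() . PHP_EOL;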

In the code above, I access a French-language website (pointed out to me by a posting on the QueryPath support list), and then prepare its contents for loading into QueryPath. By default, the HTML DOM uses ISO-8859-1 for its character set, so that is the encoding I convert to.

The really important line above is this:

$contents = mb_convert_encoding(file_get_contents($url), 'iso-8859-1', 'auto');

This does two things:

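  • It retrieves the document at the URL from the remote site with file_get_contents().
  • It automatically determines the encoding of the document, and converts it to ISO-8859-1. This is done by mb_convert_encoding(); the 'auto' argument tells it to detect the source encoding rather than take one from you.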

So $contents is going to be in a known and supported character set before it is passed into QueryPath.
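
If you would rather not lean on 'auto' (which, as far as I know, expands to a fixed candidate list that depends on your mbstring language setting), the same two steps can be made explicit. The sketch below is a minimal illustration, not QueryPath code: to_latin1() is a hypothetical helper, and it assumes that any input which is not valid UTF-8 is already in ISO-8859-1. It uses hard-coded sample bytes instead of a remote fetch so that it runs on its own.

```php
<?php
// Hypothetical helper, not part of QueryPath or mbstring.
function to_latin1(string $bytes): string
{
    // mb_check_encoding() is deterministic: the bytes are valid UTF-8 or they are not.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return mb_convert_encoding($bytes, 'ISO-8859-1', 'UTF-8');
    }
    // Assumption: input that is not valid UTF-8 is already ISO-8859-1.
    return $bytes;
}

$utf8   = "Caf\xC3\xA9"; // "Café" encoded as UTF-8
$latin1 = "Caf\xE9";     // "Café" encoded as ISO-8859-1

// Both inputs should end up as the same ISO-8859-1 byte string.
var_dump(to_latin1($utf8) === $latin1);
var_dump(to_latin1($latin1) === $latin1);
```

In the crawler above, you would call to_latin1(file_get_contents($url)) in place of the mb_convert_encoding() line. Whether the "not UTF-8 means ISO-8859-1" assumption holds for your sources is something to check before relying on it.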

Note that in the call to QueryPath, we pass in the ignore_parser_warnings flag and we suppress error messages (with @). While this has nothing to do directly with the encoding issue, it is one way of preventing the ickiness of HTML markup from causing warning and error messages in your output.
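
The @ operator silences everything that expression might raise, which can hide problems you would want to see. One alternative, sketched here with PHP's DOM extension directly rather than with QueryPath, is to have libxml collect its parser messages internally so you can inspect or discard them. The sample markup is made up for illustration, and whether this composes with qp() the same way is an assumption I have not verified.

```php
<?php
// Deliberately sloppy markup: an unclosed <p> and an unknown tag.
$html = '<html><head><title>Test</title></head><body><p>Hi<bogus></body></html>';

// Collect libxml parser errors internally instead of emitting PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Grab whatever the parser complained about, then reset libxml's state.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable despite the bad markup.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
print $title . PHP_EOL;
print count($errors) . ' parser message(s) collected' . PHP_EOL;
```

The number of messages collected varies with the libxml version, so treat $errors as diagnostic output rather than something to branch on.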

(For another way of converting, see this earlier article on iconv, a strategy that works better if you are bulk importing lots of content from the local file system.)
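
For comparison, here is roughly what an iconv-based conversion looks like when you already know the source encoding, as you often do with local files. This is a minimal sketch and not the code from the earlier article: utf8_to_latin1() is a hypothetical helper, and it assumes every input is UTF-8.

```php
<?php
// Hypothetical helper; assumes the input is known to be UTF-8.
function utf8_to_latin1(string $bytes): string
{
    // iconv() returns false if the input contains bytes it cannot convert.
    $out = @iconv('UTF-8', 'ISO-8859-1', $bytes);
    if ($out === false) {
        throw new RuntimeException('Input could not be converted to ISO-8859-1');
    }
    return $out;
}

// "Café" in UTF-8 becomes four bytes in ISO-8859-1; prints 436166e9.
print bin2hex(utf8_to_latin1("Caf\xC3\xA9")) . PHP_EOL;
```

Unlike the mb_convert_encoding() call with 'auto', nothing here is detected: you name the source encoding yourself, which suits a bulk import where you control the files.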