Parsing Old Remote HTML Docs with QueryPath

Jun 15 2009

An issue in the QueryPath issue queue made me realize that parsing crufty old HTML documents is not exactly intuitive. Here's a quick tutorial for parsing them.

The most difficult part of handling documents is that it is not always clear what kind of content a document contains. While some extensions (like .html and .xml) make this easy, others (like .qti, an XML format) are not so readily discernible. And files fetched from a URL may not have an extension at all!

So sometimes when QueryPath parses files, it basically guesses at the content type.

Here are the guessing rules that QueryPath follows:

  1. If passed a string of markup, check whether it begins with an XML declaration. QueryPath expects an XML document to begin with something like this:
<?xml version="1.0"?>
<sometag/>

If the declaration does not appear on the first line, the document is not considered XML, and QueryPath hands it to the HTML parser.
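
The check described in this first rule can be approximated in a few lines of plain PHP. This is only a sketch of the idea, not QueryPath's actual code: the function name looks_like_xml() is hypothetical, and ignoring leading whitespace is an assumption on my part.

```php
<?php
// Hypothetical helper, not part of QueryPath: reports whether a string of
// markup starts with an XML declaration.
function looks_like_xml($markup) {
  // Assumption: leading whitespace is skipped before testing the prefix.
  return strncmp(ltrim($markup), '<?xml', 5) === 0;
}

$xml  = "<?xml version=\"1.0\"?>\n<sometag/>";
$html = '<html><head><title>Old page</title></head><body></body></html>';

var_dump(looks_like_xml($xml));  // bool(true)
var_dump(looks_like_xml($html)); // bool(false)
```

A string that fails this kind of test is the one that would end up in the HTML parser.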

  2. If passed a file name and a context, inspect the contents. QueryPath uses the PHP stream system to handle files. One benefit of this is that you can pass in a context that tells QueryPath how to retrieve the document. Typically, this is used to modify HTTP parameters.

Whenever a QueryPath object is created with a context, QueryPath automatically inspects the contents of a file. Example:

<?php
require '../Code/QueryPath/bin/QueryPath.compact.php';
$url = 'http://example.com/old_crufty_html.foo';

$cxt = array('context' => stream_context_create());

// This will parse it as HTML:
print qp($url, 'title', $cxt)->text();
?>

Assuming that the URL above points to an old and crufty HTML document, this code will parse the document as HTML. And if the URL above pointed to an XML document (or an XHTML document), the XML parser would be used instead. This happens because when a context is passed into QueryPath, it checks the content to see what data type it is dealing with.
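
Since a context is typically used to modify HTTP parameters, here is a sketch of building one with a couple of HTTP options set. The timeout and user_agent keys are standard options of PHP's http stream wrapper; the values, and the user agent string in particular, are made up for illustration.

```php
<?php
// Illustrative values only; tune them for the server you are fetching from.
$http_options = array(
  'http' => array(
    'timeout' => 10,                      // Give up after 10 seconds.
    'user_agent' => 'ExampleFetcher/1.0', // Hypothetical user agent string.
  ),
);

// Wrap the stream context in the same options array used in the example above.
$cxt = array('context' => stream_context_create($http_options));

// Passing it to qp() would then look like the earlier call:
// print qp($url, 'title', $cxt)->text();
```

Because a context is present, QueryPath should still inspect the content before choosing a parser, just as in the example above.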

  3. If passed a file name and no context, inspect the file extension. This is a last-ditch attempt: if the file name ends with .html, QueryPath assumes the content is HTML. Otherwise, it assumes the file is XML.

Obviously this will work for simple cases. When retrieving URLs, though, it may have unexpected results. So why do things this way? One word: Performance. Large files work much better when we use this method. The underlying system can optimize reading of the file.
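
To see why URLs can produce unexpected results, it helps to spell the extension rule out. The following is a rough sketch of that rule, not QueryPath's implementation: guess_parser_by_extension() is a hypothetical name, and the case-insensitive comparison is my assumption.

```php
<?php
// Hypothetical helper, not QueryPath's own code: guesses a parser from the
// file name alone, without reading the file.
function guess_parser_by_extension($filename) {
  // Assumption: case is ignored, and only a trailing ".html" counts as HTML.
  $is_html = strtolower(substr($filename, -5)) === '.html';
  return $is_html ? 'html' : 'xml';
}

print guess_parser_by_extension('index.html') . "\n";
print guess_parser_by_extension('http://example.com/old_crufty_html.foo') . "\n";
```

The first call prints html and the second prints xml, so under this rule the crufty HTML document from the earlier example would be sent to the XML parser unless a context is supplied.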

QueryPath 2.0 may change this behavior. The next version may inspect the contents of every file, as QueryPath 1 already does when a context is passed. For QueryPath 1, though, this is how files are interpreted.

When parsing moderately sized old HTML files, you will do best to pass a context into qp(). This will give you the greatest chances of successfully parsing the document.


