Parsing Old Remote HTML Docs with QueryPath

An issue in the QueryPath issue queue made me realize that parsing crufty old HTML documents is not exactly intuitive. Here's a quick tutorial for parsing HTML documents.

The most difficult part of handling documents is that it is not always clear what kind of content a document holds. While some extensions (like .html and .xml) make this easy, others (like .qti, an XML format) are not so readily discernible. And files fetched from a URL may not have an extension at all!

So sometimes when QueryPath parses files, it basically guesses at the content type.

Here are the guessing rules that QueryPath follows:

  1. If passed a string of markup (e.g., <foo><bar/></foo>), check whether it begins with an XML declaration. QueryPath expects an XML document to begin with something like this:
<?xml version="1.0"?>
<sometag/>

If the declaration does not appear on the first line, QueryPath does not consider the document XML and hands it to the HTML parser instead.

  2. If passed a file name and a context, inspect the contents. QueryPath uses the PHP stream system to handle files. One of the benefits of this is that you can pass in a context which tells QueryPath how to retrieve the document. Typically, this is used to modify HTTP parameters.

Whenever a QueryPath object is created with a context, QueryPath automatically inspects the contents of a file. Example:

<?php
require '../Code/QueryPath/bin/QueryPath.compact.php';
$url = 'http://example.com/old_crufty_html.foo';
 
$cxt = array('context' => stream_context_create());
 
// This will parse it as HTML:
print qp($url, 'title', $cxt)->text();
?>

Assuming that the URL above points to an old and crufty HTML document, this code will parse the document as HTML. And if the URL above pointed to an XML document (or an XHTML document), the XML parser would be used instead. This happens because when a context is passed into QueryPath, it checks the content to see what data type it is dealing with.

  3. If passed a file name and no context, inspect the file extension. This is a last-ditch attempt: if the file name ends with .html, QueryPath assumes the content is HTML. Otherwise, it assumes the file is XML.

Obviously this will work for simple cases. When retrieving URLs, though, it may have unexpected results. So why do things this way? One word: Performance. Large files work much better when we use this method. The underlying system can optimize reading of the file.
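To make the guessing rules concrete, here is a minimal sketch of the two checks. The helper names guess_from_content() and guess_from_name() are hypothetical and are not QueryPath's actual code. The sketch also assumes that content inspection looks for the same XML declaration as the string rule, and that the extension match ignores case.

```php
<?php
// Hypothetical helpers that mirror the rules described above.
// They are NOT QueryPath's actual implementation.

// String rule (and, by assumption, content inspection with a context):
// the content counts as XML only if it opens with an XML declaration.
function guess_from_content($markup)
{
    return strncmp($markup, '<?xml', 5) === 0 ? 'xml' : 'html';
}

// File name without a context: only the extension is consulted.
// Case-insensitive matching is an assumption made for this sketch.
function guess_from_name($name)
{
    return strtolower(substr($name, -5)) === '.html' ? 'html' : 'xml';
}

print guess_from_content("<?xml version=\"1.0\"?>\n<sometag/>") . "\n";  // xml
print guess_from_content('<html><title>Old</title></html>') . "\n";    // html
print guess_from_name('archive/old_page.html') . "\n";                  // html
print guess_from_name('http://example.com/old_crufty_html.foo') . "\n"; // xml
```

The last call shows the unexpected result mentioned above: a URL that serves HTML but does not end in .html would be treated as XML when no context is passed.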

QueryPath 2.0 may change this behavior. The next version of QueryPath may inspect the contents of every file, as it already does when a context is passed. For QueryPath 1, though, this is how files are interpreted.

When parsing moderately sized old HTML files, you will do best to pass a context into qp(). This will give you the greatest chances of successfully parsing the document.
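Since a context is typically used to modify HTTP parameters, a context with a few HTTP options might look like the sketch below. The option values (timeout, user agent string) are illustrative assumptions. The qp() call is guarded so the snippet also runs when QueryPath has not been loaded.

```php
<?php
// HTTP options understood by PHP's http:// stream wrapper.
// The specific values here are only examples.
$http = array(
    'http' => array(
        'method'     => 'GET',
        'timeout'    => 10,                    // seconds
        'user_agent' => 'ExampleFetcher/1.0',  // hypothetical agent string
    ),
);

// Same options array shape as in the example above: the stream context
// goes under the 'context' key.
$options = array('context' => stream_context_create($http));

// Only call qp() if QueryPath has been loaded (see the require line in
// the earlier example).
if (function_exists('qp')) {
    print qp('http://example.com/old_crufty_html.foo', 'title', $options)->text();
}
```

Because a context is present, QueryPath inspects the fetched content rather than relying on the .foo extension.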

Change since QP 2.0

What changed in QP2.0 regarding the above crufty old HTML handling? QP2 doesn't seem to like my HTML files anymore, grumbling about

Fatal error: Uncaught exception 'QueryPathParseException' with message 'DOMDocument::load() [function.DOMDocument-load]: Opening and ending tag mismatch: link line 8 and head...

My clean URLs don't end in .html, but they serve HTML content.

I get the same error

Fatal error: Uncaught exception 'QueryPathParseException' with message 'DOMDocument::loadHTML() [domdocument.loadhtml]: htmlParseEntityRef: no name in Entity, line: 295 (D:\wamp\www\QueryPath\examples\src\QueryPath\QueryPath.php: 2792)' in D:\wamp\www\QueryPath\examples\src\QueryPath\QueryPath.php:3324 Stack trace: #0 [internal function]: QueryPathParseException::initializeFromError(2, 'DOMDocument::lo...', 'D:\wamp\www\Que...', 2792, Array) #1 D:\wamp\www\QueryPath\examples\src\QueryPath\QueryPath.php(2792): DOMDocument->loadHTML('<!DOCTYPE html ...') #2 D:\wamp\www\QueryPath\examples\src\QueryPath\QueryPath.php(351): QueryPath->parseXMLString('<!DOCTYPE html ...') #3 D:\wamp\www\QueryPath\examples\src\QueryPath\QueryPath.php(164): QueryPath->__construct('<!DOCTYPE html ...', 'title', Array) #4 D:\wamp\www\QueryPath\examples\simple_example.php(30): qp('<!DOCTYPE html ...', 'title') #5 {main} thrown in D:\wamp\www\QueryPath\examples\src\QueryPath\QueryPath.php on line 3324

This is a great question to pose to the list!

An error like that could occur for more than one reason. If you post in one of the mailing lists (querypath-support@groups.google.com is probably best), you'll likely get a good answer.
