html

05 May

QueryPath in Practice: Migrating ICANN.org to Drupal

in drupal, html, import, migrate, php, programming, querypath

The Four Kitchens blog is running a story on how they used QueryPath and the Migrate module to migrate over 10,000 pages of content, in many different languages, into Drupal. I love to hear stories about the creative ways developers use QueryPath to accomplish complex tasks. A huge thanks to Mark Theunissen for the detailed write-up.

In related news, the new QueryPath 3 engine is just about done, and will make monster imports like this much faster.

22 Aug

In-depth resource on how browsers work

in css, html, javascript, programming

How Browsers Work is an in-depth resource -- about 50 printed pages long -- explaining how a browser begins with HTML, CSS, and JavaScript files, and ends up with a display.

The explanation is clear and, drawing on Firefox and Chrome/WebKit, covers many of the nuances that frequently cause edge-case bugs in CSS rendering or JavaScript code. It also provides a good introduction to parsing techniques (browsers use various kinds of parsers and parser generators), along with other lower-level implementation details.

03 Sep

jQuery Checkboxes: Checking and unchecking the right way

in html, javascript, jquery

When working with checkboxes in jQuery and JavaScript, sometimes all you really want to do is toggle the checked state of the checkbox. There are many examples of strange ways of accomplishing this, many of which are wrong or will only work on some browsers. Here is a correct (XHTML-correct) and compact way of doing this.

Check to see if a checkbox is checked

// Returns a boolean, true if checked, false otherwise
jQuery('#my-checkbox').is(':checked');

Check a checkbox

26 Aug

Data URLs and QueryPath: How to embed images into XML or HTML

in dataurl, html, php, programming, querypath

QueryPath 2.1 is adding support for writing files directly into URLs using Data URLs. What this means is that you can encode and embed images or other documents straight into your HTML or XML.

Here's a simple example from the QueryPath 2.1 unit tests:

<?php
$xml = '<?xml version="1.0"?><root><item/></root>';
qp($xml, 'item')->dataURL('secret', 'Hi!', 'text/plain');
?>

The above will generate an XML fragment that looks like this:

<?xml version="1.0"?>
<root>
  <item secret="data:text/plain;base64,SGkh"/>
</root>

The important part there is the attribute secret="data:text/plain;base64,SGkh". This attribute includes an embedded text document with the contents Hi!. What we've done is encode the data and inject it as a document inside of the XML.
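To make the encoding step concrete, here is a minimal sketch of how such a data URL string can be assembled by hand. The helper name build_data_url() is made up for this illustration and is not part of QueryPath; it only shows the "data:" prefix, the MIME type, and the base64 payload being joined together.

```php
<?php
// Hypothetical helper for illustration only (not a QueryPath function).
// Builds a base64 data URL from raw data and a MIME type.
function build_data_url(string $data, string $mime = 'application/octet-stream'): string {
    return 'data:' . $mime . ';base64,' . base64_encode($data);
}

// The same payload as in the unit test example above.
print build_data_url('Hi!', 'text/plain') . "\n";
// prints "data:text/plain;base64,SGkh"
```

The same helper would work for binary data such as the contents of an image file, given a matching MIME type.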

Sure, that's novel... but what would we want to use that for? How about adding images directly into a document?

09 Jul

QueryPath: "What's with the &#13; at the end of every line?"

in html, php, querypath

Old problems never die, it seems. A few people have mentioned an interesting QueryPath problem they have experienced. Roughly summarized, the question is "What's with the &#13 at the end of every line?"

note: The trailing ; of the entity name has been removed to avoid stripping by an overzealous content filter.

I hadn't experienced this problem for myself until recently, as I worked on an importer that was parsing thousands of ancient HTML files. These files came from all over the place, and dozens of them had the XML entity &#13 appended to the end of every line. While this may look odd at first glance, a second glance reveals that this is a problem we've likely all seen in the past.

What is &#13?

One of the things QueryPath does automatically (unless you tell it not to) is re-code entities. This is important for XML, since it does not (out of the box) support the array of named entities that are part of HTML. Instead of using named entities like &nbsp;, XML uses numeric representations of the character.

What is &#13? It is the decimal notation for Carriage Return (often encoded as \r). Yup, the old CR-LF problem rears its head again. When a document with Windows CR-LFs is serialized by QueryPath (which, in turn, just uses the PHP DOM library), any CRs are converted to entities.

How do we solve the problem?

As far as I can tell, this behavior accords with the XML standard. For that reason, I'm not inclined to change it.

However, you can avoid the problem altogether by removing CR characters from a document. This is as easy as doing something like this:

<?php
$doc = str_replace(chr(13), '', file_get_contents($file));
qp($doc);
?>

The above will remove all of the carriage returns using str_replace(), and then pass the file contents on to QueryPath.
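To see the effect in isolation, the sketch below uses PHP's DOM extension directly rather than QueryPath. The helper name serialize_text() is made up for this illustration. On typical libxml2 builds the unstripped string is expected to come out with the carriage return written as an entity, but only the stripped result is relied on here.

```php
<?php
// Hypothetical helper for illustration: wrap some text in a <p> element
// and serialize it as XML using the DOM extension.
function serialize_text(string $text): string {
    $doc = new DOMDocument('1.0');
    $p = $doc->createElement('p');
    $p->appendChild($doc->createTextNode($text));
    $doc->appendChild($p);
    return $doc->saveXML($p);
}

$windows = "line one\r\nline two";

// Serialized as-is, with the carriage return still in the text.
$raw = serialize_text($windows);

// Serialized after removing carriage returns, as suggested above.
$clean = serialize_text(str_replace("\r", '', $windows));

print $raw . "\n";
print $clean . "\n";
```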

15 Jun

Parsing Old Remote HTML Docs with QueryPath

in html, querypath

An issue in the QueryPath issue queue made me realize that parsing crufty old HTML documents is not exactly intuitive. Here's a quick tutorial for parsing HTML documents.

The most difficult part of handling documents is that it is not always clear what kind of content a document is. While some extensions (like .html and .xml) make this easy, others (like .qti, an XML format) are not so readily discernible. And files fetched from a URL may not have an extension at all!

So sometimes when QueryPath parses files, it basically guesses at the content type.

Here are the guessing rules that QueryPath follows:

  1. If passed a string of markup (e.g., <foo><bar/></foo>), check whether it begins with an XML declaration. By definition, every XML file must begin with something like this:
<?xml version="1.0"?>
<sometag/>

If the declaration does not appear on the first line, the document is not to be considered XML. If this is the case, QueryPath throws it into the HTML parser.

  2. If passed a file name and a context, inspect the contents. QueryPath uses the PHP stream system to handle files. One of the benefits of this is that you can pass in a context which tells QueryPath how to retrieve the document. Typically, this is used to modify HTTP parameters.

Whenever a QueryPath object is created with a context, QueryPath automatically inspects the contents of a file. Example:

<?php
require '../Code/QueryPath/bin/QueryPath.compact.php';
$url = 'http://example.com/old_crufty_html.foo';
 
$cxt = array('context' => stream_context_create());
 
// This will parse it as HTML:
print qp($url, 'title', $cxt)->text();
?>

Assuming that the URL above points to an old and crufty HTML document, this code will parse the document as HTML. And if the URL above pointed to an XML document (or an XHTML document), the XML parser would be used instead. This happens because when a context is passed into QueryPath, it checks the content to see what data type it is dealing with.

  3. If passed a file name and no context, inspect the file extension. This is sort of a last-ditch attempt, and what it will do is assume that if the file ends with .html the code is HTML. Otherwise, it will assume that the file is XML.

Obviously this will work for simple cases. When retrieving URLs, though, it may have unexpected results. So why do things this way? One word: Performance. Large files work much better when we use this method. The underlying system can optimize reading of the file.

QueryPath 2.0 may change this behavior. The next version of QueryPath may use the method outlined above for all files. For QueryPath 1, though, this is how files are interpreted.
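The three guessing rules above can be sketched roughly as follows. This is not QueryPath's actual source; the function names looks_like_xml() and guess_parser() are invented for this illustration, and the "context" case is modeled simply as already having the file contents in hand.

```php
<?php
// Hypothetical sketch of the guessing rules; not QueryPath's real code.

// True if the markup starts with an XML declaration.
function looks_like_xml(string $markup): bool {
    return strncmp($markup, '<?xml', 5) === 0;
}

// Returns 'xml' or 'html'. $contents stands in for "a context was given,
// so the document contents were fetched and can be inspected".
function guess_parser(string $source, ?string $contents = null): string {
    // Rule 1: a string of markup. Check for an XML declaration.
    if (strpos(ltrim($source), '<') === 0) {
        return looks_like_xml($source) ? 'xml' : 'html';
    }
    // Rule 2: a file name plus a context. Inspect the contents.
    if ($contents !== null) {
        return looks_like_xml($contents) ? 'xml' : 'html';
    }
    // Rule 3: a file name and no context. Fall back to the extension.
    $extension = strtolower(pathinfo($source, PATHINFO_EXTENSION));
    return $extension === 'html' ? 'html' : 'xml';
}

print guess_parser('<foo><bar/></foo>') . "\n";
print guess_parser('http://example.com/old_crufty_html.foo') . "\n";
```

Note how the URL from the earlier example is guessed as XML when only its extension is considered, which is the unexpected result the text warns about.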

When parsing moderately sized old HTML files, you will do best to pass a context into qp(). This will give you the greatest chances of successfully parsing the document.
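As a rough sketch of what such a context might look like when HTTP parameters need adjusting, the snippet below builds one with a few standard PHP HTTP stream options. The option values (timeout, user agent string) are assumptions chosen for illustration, and the qp() call is left as a comment because it needs QueryPath to be loaded.

```php
<?php
// Standard PHP HTTP stream context options; the values are illustrative.
$options = array(
    'http' => array(
        'method' => 'GET',
        'timeout' => 10,                        // Seconds to wait for the server.
        'user_agent' => 'ExampleImporter/1.0',  // Made-up user agent string.
    ),
);

// QueryPath expects the context inside an options array, as shown earlier.
$cxt = array('context' => stream_context_create($options));

// With QueryPath loaded, the array is passed as the third argument:
// print qp($url, 'title', $cxt)->text();

// Show what the context holds.
print_r(stream_context_get_options($cxt['context']));
```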

01 Jun

Escaping JavaScript in QueryPath

in html, javascript, php, querypath, xml

Sometimes the HTML you parse with QueryPath will contain JavaScript or other embedded scripting languages. And sometimes such scripts will contain characters that the XML parser might misinterpret as XML or HTML structures.

There are two ways to escape such content -- both of which are standard, and are often done regardless of whether or not you are using QueryPath.

The first method, which is preferred when working with HTML, is to enclose any scripts inside of HTML comments:

<html>
<head>
< script>
<!--
// Script goes here
-->
< /script>
</head>
<body></body>
</html>

(Extra spacing has been added in the example above to keep the tags from being stripped by this blog's formatter. Those spaces should not be present in your code.)

The comment enclosure will prevent the HTML parser from parsing the contents of the script.

In other cases, XML CDATA sections may be a better fit for your needs:

<html>
<head>
< script>
<![CDATA[
// Script goes here
]]>
< /script>
</head>
<body></body>
</html>

CDATA sections will be readily available in the parsed DOM, but the contents of a CDATA section will not be parsed and interpreted. It is therefore safe to embed JavaScript as well as XML/HTML-like tags.
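Here is a small sketch of that behavior using PHP's DOM extension directly (not QueryPath). The script text is a made-up example containing characters that would otherwise look like markup. It is wrapped in a CDATA section, serialized, and then parsed back to confirm the contents survive untouched.

```php
<?php
// A script body with characters that XML would normally treat as markup.
$script = 'if (a < b && b > 0) { document.write("<p>hi</p>"); }';

$doc = new DOMDocument('1.0');
$element = $doc->createElement('script');
// Wrap the script in a CDATA section so it is not parsed as XML.
$element->appendChild($doc->createCDATASection($script));
$doc->appendChild($element);

$xml = $doc->saveXML($element);
print $xml . "\n";

// Parse the result again: the CDATA contents come back as plain text.
$reparsed = new DOMDocument();
$reparsed->loadXML($xml);
print $reparsed->documentElement->textContent . "\n";
```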

With these two strategies, you should have the tools necessary to prevent embedded scripts from causing QueryPath parse errors.

12 May

Microformats and RDFa are used by Google

in google, html, microformats, rdfa, seo

I've seen a couple of unimpressive RDFa demonstrations. They tend to involve either a beta search server from Yahoo! or a custom tool with ugly regular expressions. In spite of the quality of the presentations, though, I was sold on the value of using RDFa to embed metadata into HTML. But what good is metadata-rich markup when use case #1 (better SEO) is still absent? The tide is changing -- or, perhaps, has already changed.

Google now says that it supports both RDFa and microformats:

At Google, we believe in openness, so we are using two open standards to allow you to annotate structured data on your site: microformats and RDFa. Both standards allow markup of information on your pages. To ensure that Google understands your markup, we encourage you to follow the format of our examples. You don't need any prior knowledge of microformats or RDFa to use these standards, just a basic knowledge of XHTML.

(http://google.com/support/webmasters/bin/answer.py?hl=en&answer=99170)

Google lists at least a few microformats that it supports, and offers a brief primer on RDFa and (apparently) how Google looks for RDFa information.

As I write this, I'm seeing Twitter messages pointing to O'Reilly's article discussing the same. It is clear that there is a new way for SEO....

Who says the Semantic Web is irrelevant?

06 May

QueryPath and HTML: The Basics

in html, php, querypath

QueryPath can be used to work with XML or HTML. Here, I will introduce the typical tasks one performs when working with HTML documents.

We will look at the following:

  • Loading HTML documents
  • Modifying documents
  • Sending a document to the web browser
  • Creating new documents from scratch

Loading Documents

The first common task we will look at is loading an existing document. In most cases, we will be loading documents straight from the file system. Sometimes we may load them from a string of existing HTML, too. Here, we will look at each.

Loading from a file

To load a document from a file, all we need is the path to the file. Let's say we have an HTML document located in /var/www/html/index.html. Here's how we can load that file using QueryPath:

<?php
require 'QueryPath/QueryPath.php';
 
qp('/var/www/html/index.html');
?>

The code above will load the file from the file system. The last line of code will create a new QueryPath object that wraps the content of index.html. A little later, we will build on this example.

Loading from a string of HTML

Often, HTML is generated on the fly and then sent to the web browser. QueryPath can be used as a filter for altering such dynamically generated HTML content.

For example, consider a case where we have a string, $html, that contains some HTML. Here's how we would load that string:

<?php
require 'QueryPath/QueryPath.php';
 
$html = '<html>
<head>
  <title>Existing HTML</title>
</head>
<body>
  <h1>The title</h1>
</body>
</html>';
 
qp($html);
?>

In the example above, $html has an entire HTML document. Note that the document is not technically correct -- it is missing a document type declaration. While QueryPath is strict about XML formatting, it is much more forgiving of HTML markup.

Again, the last line of the code above creates a new QueryPath object, this time wrapping the contents of the string.

It is important to note that in the above two examples, both used the qp() function to build the document. In fact, both used the function with the same signature: qp($string). QueryPath is "smart enough" to determine whether a string is an HTML document, an XML document, or a path to a file in the filesystem.

Tip: In some cases, you can get HTML from the output buffer (see the 'ob' functions in PHP) and then pass the markup on to QueryPath. In this way, you can do post-processing on data that has been output from the application already.
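A minimal sketch of that output-buffer approach might look like the following. The render_page() function is a made-up stand-in for an application that prints its page directly, and the QueryPath call is left as a comment because it needs QueryPath to be loaded.

```php
<?php
// Hypothetical stand-in for an application that prints its page directly.
function render_page(): void {
    print '<html><head><title>Existing HTML</title></head>';
    print '<body><h1>The title</h1></body></html>';
}

ob_start();              // Start capturing output instead of sending it.
render_page();
$html = ob_get_clean();  // Fetch the captured markup and stop buffering.

// With QueryPath loaded, $html could now be post-processed, for example:
// qp($html, 'title')->text('A new title')->writeHTML();

print strlen($html) . " bytes captured\n";
```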

Modifying documents

Let's build on our last example to see how QueryPath can be used to process existing HTML.

In our new example, we will change the title (in the document head) and add a new paragraph beneath the h1.

<?php
require 'QueryPath/QueryPath.php';
 
$html = '<html>
<head>
  <title>Existing HTML</title>
</head>
<body>
  <h1>The title</h1>
</body>
</html>';
 
$qp = qp($html, 'title') // Load doc and find <title>
  ->text('A new title') // Set the new title
  ->top()  // Go back to the top of the document
  ->find('h1') // Find the <h1>
  ->after('<p>This is the new paragraph</p>'); // Add a new paragraph after.
?>

The example above is considerably more dense than our previous examples. Again, it begins with a string of HTML. But this time, when qp() is called, a query is passed in as the second argument. title will instruct QueryPath to search for any title elements. It will, as we can see, find one: The title inside of the document head.

The second line of our QueryPath chain will set the text (text()) of the title to A new title.

The third line will navigate back to the top of the document. This is necessary because we are not going to do any more manipulation of the head. We want to start looking for new content from the top of the document.

Next, we need to find our H1 tag. This is done with find('h1'). At this point, the QueryPath object is pointing to the h1 tag. We want to add some content after this tag.

The final step of the QueryPath chain adds a new paragraph after the h1 tag: after('<p>This is the new paragraph</p>'). QueryPath's after() method is one of the dozen or so tools for inserting or updating content in a document. Check out the article at IBM DeveloperWorks for an overview of the other methods.

Here, we've seen two methods, text() and after(), that can be used to modify the document. Next, let's see how to get the results of our modification.
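For comparison, here is a rough sketch of the same two changes written against PHP's DOM extension instead of QueryPath. It is not a description of how text() or after() work internally; it only illustrates how much bookkeeping the chain above saves.

```php
<?php
$html = '<html>
<head>
  <title>Existing HTML</title>
</head>
<body>
  <h1>The title</h1>
</body>
</html>';

$doc = new DOMDocument();
$doc->loadHTML($html);

// Set the new title.
$title = $doc->getElementsByTagName('title')->item(0);
$title->nodeValue = 'A new title';

// Add a new paragraph directly after the <h1>.
$h1 = $doc->getElementsByTagName('h1')->item(0);
$p = $doc->createElement('p', 'This is the new paragraph');
$h1->parentNode->insertBefore($p, $h1->nextSibling);

$result = $doc->saveHTML();
print $result;
```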

Sending the results to a browser

Again, let's just continue on from our previous example.

At any point, we can get the current state of the HTML using the html() method. For example, we can do something like this:

<?php
require 'QueryPath/QueryPath.php';
 
$html = '<html>
<head>
  <title>Existing HTML</title>
</head>
<body>
  <h1>The title</h1>
</body>
</html>';
 
$content = qp($html)->html();
 
?>

Now, $content will be a string that should look basically the same as $html (except that it will be cleaned up by QueryPath).

The html() method always works from the local context, though. So if we wanted to get just the body of the above HTML, we could do this:

<?php
require 'QueryPath/QueryPath.php';
 
$html = '<html>
<head>
  <title>Existing HTML</title>
</head>
<body>
  <h1>The title</h1>
</body>
</html>';
 
print qp($html, 'body')->html();
 
?>

This would output the following:

  <body>
    <h1>The title</h1>
  </body>

See how the html() method is only grabbing the contents that are currently selected? Since we queried for body, only the body is shown.

Most of the time, though, we are more interested in printing the entire document. The clumsy way of doing this is to do something like this:

<?php
require 'QueryPath/QueryPath.php';
 
$html = '<html>
<head>
  <title>Existing HTML</title>
</head>
<body>
  <h1>The title</h1>
</body>
</html>';
 
print qp($html, 'title')->text('New title')->top()->html();
 
?>

There are three steps here (all on the one line) involved in getting the entire HTML document to print.

First, there is an explicit print statement. Second, there is a call to the top() method to get us back to the top of the document. Third, there is the call to html() to get the HTML string.

The above can be further condensed using a different method, writeHTML(). This basically bundles the three steps above into just one step. Thus, we could rewrite the above like this:

<?php
require 'QueryPath/QueryPath.php';
 
$html = '<html>
<head>
  <title>Existing HTML</title>
</head>
<body>
  <h1>The title</h1>
</body>
</html>';
 
qp($html, 'title')->text('New title')->writeHTML();
 
?>

As a result of running this code, the following document would be shipped to the browser:

  <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
  <html>
  <head><title>New title</title></head>
  <body><h1>The title</h1></body>
  </html>

(Notice that the doctype has been automatically added.)

Creating new documents

The last thing we will look at is creating a new HTML document with QueryPath. Really, there are two ways. You can either create the entire document from scratch, or you can use a built-in HTML stub document.

Building from scratch

The first way of creating a new HTML document is to build the document from scratch. We've actually seen this method already in our string handling examples above. There, we created a document as a string and then passed it into QueryPath:

<?php
require 'QueryPath/QueryPath.php';
 
$html = '<html>
<head>
  <title>Existing HTML</title>
</head>
<body>
  <h1>The title</h1>
</body>
</html>';
 
qp($html);
 
?>

Building documents that way is always acceptable. Should you so choose, you can even build it up in an even more piecemeal fashion:

<?php
require 'QueryPath/QueryPath.php';
 
qp()->append('<html>')->children()->append('<head/><body/>'); // etc.
 
?>

This, however, is not a terribly efficient method of document building, and is generally only useful in rare cases.

Building from a stub

The easiest method is to use the HTML stub document included with QueryPath. This stub provides a skeleton XHTML document that you can then build using the methods we talked about above.

Here is a quick example:

<?php
require 'QueryPath/QueryPath.php';
 
qp(QueryPath::HTML_STUB, 'title')->text('New title')->writeHTML();
?>

This example creates a new stub document, sets the title, and writes the output. Here is what the output looks like:

  <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
  <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
       <meta http-equiv="Content-Type" content="text/html; charset=utf-8"></meta>
     <title>New title</title>
    </head>
    <body></body>
    </html>

You may notice that here even more work is done for you. The content type is set, as is the doctype. This last method is the quickest way for you to author HTML documents.

Conclusion

We have quickly covered the basics of using QueryPath to work with HTML. We have looked at loading existing documents, changing documents, writing documents to the web browser, and creating documents from scratch.

Of course, QueryPath can be used for many other things. For more information, head over to http://querypath.org. And check back here for more articles like this one.

QueryPath

in dom, html, php, programming, querypath, xml

QueryPath is a tool for manipulating HTML and XML documents in PHP using a chainable interface. It is similar to jQuery in that respect.

A typical QueryPath script looks something like this:

<?php
require_once 'QueryPath/QueryPath.php';