The Best Tool for Web Page Speed Evaluation

August 31, 2010

It seems that, for me, this is the year of website performance optimization. From working with nginx and a crazy memcached setup to recently deploying a handful of Varnish servers, I have been deeply entrenched in the world of website page speed optimizations.

At this point, I've used dozens of...

Building Homebrew Packages

August 27, 2010

Recently I have been using Homebrew to install Open Source tools on my Mac. I find it to be a little more to my taste than MacPorts. But when I wanted to install TTYtter, a small Twitter command-line client, I discovered that Homebrew doesn't yet have a package for TTYtter.

After a quick chat with...

Reflections on Google Summer of Code

August 26, 2010

This was the second year that I have been involved as a mentor for Google's Summer of Code program. In both cases, I've worked as a mentor for Drupal. Last year, I worked with sivaji on a project involving the Quiz module. This year, I worked with eabrand on QueryPath and the QueryPath module.

Compiling varnishstat, varnishtop, and varnishhist

August 26, 2010

I noticed recently that on one of my Debian systems, my installation of varnish did not have any of its monitoring utilities installed. /usr/local/bin was missing varnishstat, varnishtop, varnishhist, and varnishsizes.

I re-ran configure and make a couple of times. I couldn't find any errors, yet...

Building PHP from Source on Ubuntu

August 18, 2010

This article describes how to build PHP from source on Ubuntu. I am doing this because I need to build PHP with the embedded SAPI -- an option not available by default. For my purposes, I don't want this PHP to replace the existing PHP. Instead, I want it to be available alongside my normal PHP. So...

Slides for my Dojo presentation: "QueryPath: It's like PHP jQuery in Drupal!"

August 18, 2010

I posted the slides from yesterday's Drupal Dojo presentation. These should be much more readable than the video feed.

Drupal Dojo: "QueryPath: It's like PHP jQuery in Drupal!"

August 16, 2010

On August 17th at 12pm EDT (9am PDT), I will be doing the Drupal Dojo session, "QueryPath: It's like PHP jQuery in Drupal!". To sign up, head over to the webinar signup.

I'm particularly excited about this for three reasons:

  1. Emily will be joining me to talk about her GSoC project.
  2. We will be discussing...

Never Use GET Again, at PHP Architect

August 8, 2010

I've been fairly silent for the last few months. A vacation, a business trip, and a huge job change have been the primary factors. There's also a big make-over going on with QueryPath right now. With a Summer of Code project promising to bring new features to QueryPath, and a new logo to give it a...

A PHP jQuery Library: QueryPath Overview

August 8, 2010

jQuery is a JavaScript library for efficiently working with HTML and CSS. Its chainable and compact API has made it a popular choice for web developers seeking to quickly build rich web applications. But did you know there is a PHP jQuery library? QueryPath is a PHP implementation of jQuery's interface...

QueryPath and Character Sets: Converting content with mb_convert_encoding()

May 3, 2010

QueryPath can be used to crawl the web, parsing web pages and gleaning information. But the HTML of remote websites is not always as pristine and standards-compliant as we would like, and one thing that can be particularly frustrating is determining the encoding of a document. (This gets substantially more complicated when HTTP headers list one encoding and HTML meta tags list another -- a common configuration error.)

QueryPath is primarily a library for working with XML and HTML, but it assumes that you know from the outset what character set your document uses. This is not always a good assumption to make. Here is one way to circumvent the problem: rather than write code to find out a document's character set, use PHP's built-in multibyte string functions (assuming you have the mbstring extension compiled in) to do this for you.

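<?php
require 'QueryPath/QueryPath.php';

$url = 'http://mopy.fr/';
// Fetch the page, detect its encoding, and convert it to ISO-8859-1.
$contents = mb_convert_encoding(file_get_contents($url), 'iso-8859-1', 'auto');
// Ask QueryPath to ignore parser warnings caused by sloppy markup.
$opts = array('ignore_parser_warnings' => TRUE);

// Print the text of the page's <title> element.
print @qp($contents, 'title', $opts)->text() . PHP_EOL;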

In the code above, I access a French language website (pointed out to me by a posting on the QueryPath support list), and then prepare it for loading. By default, the HTML DOM uses ISO-8859-1 for its character set.

The really important line above is this:

<?php
$contents = mb_convert_encoding(file_get_contents($url), 'iso-8859-1', 'auto');
?>

This does two things:

  • It retrieves the contents of the URL from the remote site with file_get_contents.
  • It automatically determines the encoding of the document, and converts it to ISO-8859-1. This is done by mb_convert_encoding.

So $contents is going to be in a known and supported character set before it is passed into QueryPath.
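If relying on 'auto' feels too implicit, the detection step can be spelled out by hand. Here is a minimal sketch of that idea. It assumes the input can only be UTF-8 or ISO-8859-1, and to_latin1() is a hypothetical helper name of my own, not part of QueryPath or the mbstring extension.

```php
<?php
// Hypothetical helper, not part of QueryPath or mbstring.
// Assumption: the input is either UTF-8 or already ISO-8859-1.
function to_latin1($raw)
{
    // Valid UTF-8 is treated as UTF-8; anything else is assumed to be Latin-1.
    $from = mb_check_encoding($raw, 'UTF-8') ? 'UTF-8' : 'ISO-8859-1';
    return mb_convert_encoding($raw, 'ISO-8859-1', $from);
}

$utf8   = "caf\xC3\xA9"; // "café" as UTF-8 bytes
$latin1 = "caf\xE9";     // "café" as ISO-8859-1 bytes

var_dump(to_latin1($utf8) === $latin1);   // UTF-8 input is converted
var_dump(to_latin1($latin1) === $latin1); // Latin-1 input passes through unchanged
```

The result of to_latin1() could then be handed to qp() in place of $contents. The trade-off is that this only distinguishes two encodings, whereas 'auto' consults PHP's configured detection order.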

Note that in the call to QueryPath, we pass in the ignore_parser_warnings flag and we suppress error messages (with @). While this has nothing to do directly with the encoding issue, it is one way of preventing the ickiness of HTML markup from causing warning and error messages in your output.
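For comparison, here is a rough sketch of a different technique: instead of suppressing messages with @, PHP's libxml functions can buffer parser warnings so they can be inspected or discarded. This swaps in the plain DOM extension rather than QueryPath, and the sample markup is made up for illustration.

```php
<?php
// Assumption: plain DOM extension only; QueryPath is not involved here.
// The markup below is invented sample input with deliberately sloppy HTML.
$html = '<html><head><title>Test</title></head>'
      . '<body><p>Unclosed <b>tag</p></body></html>';

// Buffer libxml warnings instead of emitting them as PHP warnings.
$previous = libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);

// Collect whatever the parser complained about, then reset the buffer.
$warnings = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

$title = $doc->getElementsByTagName('title')->item(0)->textContent;
print $title . PHP_EOL;
print count($warnings) . ' parser warning(s) collected' . PHP_EOL;
```

The number of warnings collected depends on the libxml version, so treat the count as informational only.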

(For another way of converting, see this earlier article on iconv, a strategy that works better if you are bulk importing lots of content from the local file system.)