Executing a SPARQL Query from QueryPath

May 28 2009

The Semantic Web. It is a concept that has sparked heated debate for years. While the debate may continue to rage for some time, there are already a host of technologies that can be used to build advanced applications based on XML technology. In this article, we will see how the SPARQL query language can be used to retrieve XML information from remote semantic databases (usually called SPARQL endpoints).

QueryPath already contains all of the tools necessary for running a SPARQL query and handling the results. This is not because QueryPath has been specially fitted to the task, but because SPARQL uses technologies that are widely supported: XML and HTTP. Since QueryPath can be used to make HTTP requests and then digest the XML results, we can use it to execute SPARQL queries and handle the results.

In this article, we will look at a basic SPARQL query, and see how we can use QueryPath to execute it and parse the returned results.

While SPARQL will be introduced here, it is far too robust a language to be explained in a short article. One starting point is the SPARQL Working Group home page.

The queries presented in this article will be run against DBPedia, a semantic version of Wikipedia. It makes structured information extracted from Wikipedia available as semantic data.

The SPARQL Query: A Brief Anatomy

Let's begin by looking at the SPARQL query that we will be running:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?name ?label
WHERE {
  ?uri foaf:name ?name .
  ?uri rdfs:label ?label
  FILTER (?name = "The Beatles")
  FILTER (lang(?label) = "en")
}

The query above begins by defining two prefixes:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

A prefix is a convenient way to represent a namespace URI with a short string. Above, we create one for the Friend of a Friend namespace (foaf:) and one for the RDF Schema namespace (rdfs:). Now, whenever we need to refer to entities from those two schemata, we can use the short prefix instead of the full URI.

The next part of the code above is the actual query:

SELECT ?uri ?name ?label
WHERE {
  ?uri foaf:name ?name .
  ?uri rdfs:label ?label
  FILTER (?name = "The Beatles")
  FILTER (lang(?label) = "en")
}

We are going to use the URI a lot, and it is easy to get hung up on the URI as a URL expressing a location. However, you are better off thinking of the URI as a unique identifier for an object, one that just happens to also be "dereferenceable". In this case, we can in fact use the URI to access information over the network.

If you have worked with SQL before, the query above should look vaguely familiar. It functions similarly to a SQL SELECT operation. Here's what the query does, phrased in plain English:

  1. Select the uri, name, and label
  2. where...
  3. the uri has the name ?name (or, where the uri's name is stored in ?name)
  4. the uri has a label ?label
  5. the name is "The Beatles"
  6. the language of the label is English

There are a few things to note about the structure of the query.

First, remember that the URI (?uri) is just a unique identifier. It functions somewhat like a primary key for each object we query.

Second, the items that begin with question marks (?) are variables. Their values are assigned when the query is executed.

Third, the items in the WHERE clause are not simply restrictive, as they are in SQL. In fact, the purpose of lines 3 and 4 is not so much to limit the items returned as to express a relationship between items. The general pattern of lines 3 and 4 is:

?subject ?relationship ?object

So ?uri foaf:name ?name can be understood to mean "some object ID (subject) is named (relationship) some name (object)". As you may have guessed, foaf:name expresses the relationship "is named". Likewise, rdfs:label expresses the relationship "is labeled".
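To make the subject/relationship/object pattern more concrete, here is a minimal PHP sketch of how a single triple pattern binds variables. It uses neither QueryPath nor a real SPARQL engine. The $triples data and the match_pattern() helper are hypothetical and exist only for illustration, and the sketch ignores details a real engine handles (such as the same variable appearing twice in one pattern).

```php
<?php
// Hypothetical in-memory data: each entry is [subject, relationship, object].
$triples = array(
  array('http://example.org/beatles', 'foaf:name', 'The Beatles'),
  array('http://example.org/beatles', 'rdfs:label', 'The Beatles'),
  array('http://example.org/kinks', 'foaf:name', 'The Kinks'),
);

// Match one triple pattern against the data. Terms beginning with "?" are
// variables and bind to whatever value occupies that position. Any other
// term is a constant and must equal the value exactly.
function match_pattern(array $pattern, array $triples) {
  $solutions = array();
  foreach ($triples as $triple) {
    $bindings = array();
    foreach ($pattern as $i => $term) {
      if (strlen($term) > 0 && $term[0] === '?') {
        $bindings[substr($term, 1)] = $triple[$i];
      }
      elseif ($term !== $triple[$i]) {
        continue 2; // A constant did not match, so try the next triple.
      }
    }
    $solutions[] = $bindings;
  }
  return $solutions;
}

// Equivalent in spirit to the pattern: ?uri foaf:name ?name
$named = match_pattern(array('?uri', 'foaf:name', '?name'), $triples);
foreach ($named as $row) {
  print $row['uri'] . ' is named ' . $row['name'] . "\n";
}
```

Only the two foaf:name entries produce a solution. The rdfs:label entry is skipped because the constant in the relationship position does not match.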

Assuming that we did not have the two FILTER functions, the query would simply return all objects (together with their names and labels) that had a name and a label.

The FILTER function is used to limit what content is returned. Above, we used two filters:

  FILTER (?name = "The Beatles")
  FILTER (lang(?label) = "en")

The first filter says that the value of ?name must match (exactly) the string "The Beatles". Keep in mind that a given item may have multiple foaf:name values. The filter need only match one of them.

The second filter requires that the label's language be English. RDFS labels in the DBPedia database tend to carry a language tag indicating the language of the label. We are only interested in the English language content. If we omit this filter from the query above, we will see results in Chinese, German, and Spanish, as well as other languages.

Putting this all together, then, our query will return the URI, the name, and the label for any URIs in the database that...

  • Have a name
  • Have a label
  • Have a name that is "The Beatles"
  • Have a label that is in English.
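Since the name is the only part of this query likely to change from one search to the next, the query text can be generated by a function. The sketch below is one possible approach. build_name_query() is a hypothetical helper (it is not part of QueryPath), and its escaping covers only backslashes, double quotes, and line breaks inside the string literal.

```php
<?php
// Hypothetical helper: build the article's query for an arbitrary name.
function build_name_query($name, $lang = 'en') {
  // Escape characters that would otherwise end or corrupt a double-quoted
  // SPARQL string literal.
  $escape = array("\\" => "\\\\", '"' => '\\"', "\n" => '\\n', "\r" => '\\r');
  $name = strtr($name, $escape);
  $lang = strtr($lang, $escape);

  return '
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?uri ?name ?label
  WHERE {
    ?uri foaf:name ?name .
    ?uri rdfs:label ?label
    FILTER (?name = "' . $name . '")
    FILTER (lang(?label) = "' . $lang . '")
  }
';
}

print build_name_query('The Beatles');
```

Calling build_name_query('The Beatles') produces the same query text used throughout this article.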

Next, we're ready to see how this query can be run against a remote, publicly available SPARQL endpoint (server) from QueryPath.

Running the Query

The query is, by far, the most complex aspect of our sample code. Here's what the entire code looks like:

<?php
require '../src/QueryPath/QueryPath.php';

// We are using the dbpedia database to execute a SPARQL query.

// URL to DB Pedia's SPARQL endpoint.
$url = 'http://dbpedia.org/sparql';

// The SPARQL query to run.
$sparql = '
  PREFIX foaf: <http://xmlns.com/foaf/0.1/>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?uri ?name ?label
  WHERE {
    ?uri foaf:name ?name .
    ?uri rdfs:label ?label
    FILTER (?name = "The Beatles")
    FILTER (lang(?label) = "en")
  }
';

// We first set up the parameters that will be sent.
$params = array(
  'query' => $sparql,
  'format' => 'application/sparql-results+xml',
);

// DB Pedia wants a GET query, so we create one.
$data = http_build_query($params);
$url .= '?' . $data;

// Next, we simply retrieve, parse, and output the contents.
$qp = qp($url, 'head');

// Get the headers from the resulting XML.
$headers = array();
foreach ($qp->children('variable') as $col) {
  $headers[] = $col->attr('name');
}

// Get rows of data from result.
$rows = array();
$col_count = count($headers);
foreach ($qp->top()->find('results>result') as $row) {
  $cols = array();
  $row->children();
  for ($i = 0; $i < $col_count; ++$i) {
    $cols[$i] = $row->branch()->eq($i)->text();
  }
  $rows[] = $cols;
}

// Turn data into table.
$table = '<table><tr><th>' . implode('</th><th>', $headers) . '</th></tr>';
foreach ($rows as $row) {
  $table .= '<tr><td>';
  $table .= implode('</td><td>', $row);
  $table .= '</td></tr>';
}
$table .= '</table>';

// Add table to HTML document.
qp(QueryPath::HTML_STUB, 'body')->append($table)->writeHTML();
?>

While the code may look complex at first blush, it is actually a straightforward tool.

We will begin by taking a quick glance at the first part of the script:

  // URL to DB Pedia's SPARQL endpoint.
  $url = 'http://dbpedia.org/sparql';

  // The SPARQL query to run.
  $sparql = '
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?uri ?name ?label
    WHERE {
      ?uri foaf:name ?name .
      ?uri rdfs:label ?label
      FILTER (?name = "The Beatles")
      FILTER (lang(?label) = "en")
    }
  ';

  // We first set up the parameters that will be sent.
  $params = array(
    'query' => $sparql,
    'format' => 'application/sparql-results+xml',
  );

  // DB Pedia wants a GET query, so we create one.
  $data = http_build_query($params);
  $url .= '?' . $data;

The snippet above shows all of the preparation we must make to run the query.

We begin with a base $url, which points to the DBPedia SPARQL endpoint. Next we write our SPARQL query. The query above is the same as the one we saw earlier in this article.

With the query and the base URL, we need to build a full URL to access the remote server. This is done with the $params array. There we create the name/value pairs that will be condensed into a GET string by http_build_query(). Note that we set the MIME type as the value of the $params['format'] entry. This tells the remote server what kind of data we expect to have returned.
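To see what http_build_query() actually does to the parameters, the following sketch builds a URL from a deliberately short query and then decodes it again. The short query is an assumption made to keep the output readable; the encoding rules are the same for the full query. Nothing is sent over the network.

```php
<?php
// A shortened query keeps the output readable.
$params = array(
  'query' => 'SELECT ?s WHERE { ?s ?p ?o } LIMIT 1',
  'format' => 'application/sparql-results+xml',
);

// Spaces, braces, question marks, slashes, and plus signs are all encoded.
$url = 'http://dbpedia.org/sparql' . '?' . http_build_query($params);
print $url . "\n";

// parse_str() reverses the encoding, which confirms nothing was lost.
parse_str(parse_url($url, PHP_URL_QUERY), $decoded);
print $decoded['query'] . "\n";
```

In the printed URL, the MIME type appears as application%2Fsparql-results%2Bxml. The slash and the plus sign are percent-encoded so that the server does not mistake them for URL syntax.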

A SPARQL query need not return information encoded as XML. Other data formats are equally capable of representing SPARQL query results. XML is probably the most widely used, though, and is the easiest for us to parse.
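For comparison, here is a sketch of how the same kind of result could be handled in the SPARQL JSON results format using PHP's json_decode(). The response body below is illustrative, with made-up values rather than a captured DBPedia response. Whether a given endpoint will return JSON for a particular format value is something to check against that endpoint's documentation.

```php
<?php
// Illustrative response body in the SPARQL JSON results format. The values
// are made up for this sketch, not copied from a live response.
$json = '{
  "head": { "vars": ["uri", "name", "label"] },
  "results": { "bindings": [
    {
      "uri":   { "type": "uri", "value": "http://example.org/The_Beatles" },
      "name":  { "type": "literal", "value": "The Beatles" },
      "label": { "type": "literal", "xml:lang": "en", "value": "The Beatles" }
    }
  ] }
}';

$data = json_decode($json, true);
$headers = $data['head']['vars'];

$rows = array();
foreach ($data['results']['bindings'] as $binding) {
  $cols = array();
  foreach ($headers as $name) {
    // A variable may be unbound in a given result, so guard the lookup.
    $cols[] = isset($binding[$name]) ? $binding[$name]['value'] : '';
  }
  $rows[] = $cols;
}

print_r($rows);
```

The result is the same headers-and-rows structure that the XML version of the script builds, so the table-rendering step would not need to change.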

In the last line of the snippet above, we assemble our base URL and query params into a complete URL.

Next, we need to execute the query and handle the results.

<?php
  // Next, we simply retrieve, parse, and output the contents.
  $qp = qp($url, 'head');

  // Get the headers from the resulting XML.
  $headers = array();
  foreach ($qp->children('variable') as $col) {
    $headers[] = $col->attr('name');
  }

  // Get rows of data from result.
  $rows = array();
  $col_count = count($headers);
  foreach ($qp->top()->find('results>result') as $row) {
    $cols = array();
    $row->children();
    for ($i = 0; $i < $col_count; ++$i) {
      $cols[$i] = $row->branch()->eq($i)->text();
    }
    $rows[] = $cols;
  }
?>

We begin by creating a new QueryPath object, stored in $qp. Based on the CSS selector, we can see that it will point to the head element in the returned results. This element contains the name of each returned variable.

From there, we build an array of $headers, getting the name of each returned variable. These we will use to generate the headers in our table. The headers come back in variable elements, and each variable has a name attribute. To fetch them, then, we select the variables and loop through them, retrieving the name attribute of each.

Next comes the fancy part. We need to loop through each result and fetch each variable out of each result. Or, to use the table metaphor we SQL developers are familiar with, we loop through each row and fetch each column of data. This is all accomplished in this foreach loop:

foreach ($qp->top()->find('results>result') as $row) {
  $cols = array();
  $row->children();
  for ($i = 0; $i < $col_count; ++$i) {
    $cols[$i] = $row->branch()->eq($i)->text();
  }
  $rows[] = $cols;
}  

When this loop is finished, there will be an array of rows, each of which holds an array of columns. The index of each column should match the index of the corresponding entry in the headers array. That is how we correlate headers to columns. You may also notice that we use QueryPath's branch() method in combination with eq() so that we can (relatively cheaply) get the text for each column. branch() gives us a separate QueryPath object with the same matches, so narrowing it with eq() does not disturb $row.
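Matching columns to headers by position works as long as every result contains one binding per variable, in header order. The SPARQL XML results format leaves unbound variables out of a result, so a more defensive approach is to match on each binding's name attribute. The sketch below shows that idea using PHP's built-in DOM extension rather than QueryPath, so that it runs without a network connection. The XML is an illustrative response with made-up values.

```php
<?php
// Illustrative response in the SPARQL XML results format; values are made up.
// Note that the bindings are not in the same order as the variables.
$xml = '<sparql xmlns="http://www.w3.org/2005/sparql-results#">
  <head>
    <variable name="uri"/>
    <variable name="name"/>
    <variable name="label"/>
  </head>
  <results>
    <result>
      <binding name="uri"><uri>http://example.org/The_Beatles</uri></binding>
      <binding name="label"><literal xml:lang="en">The Beatles (band)</literal></binding>
      <binding name="name"><literal>The Beatles</literal></binding>
    </result>
  </results>
</sparql>';

$doc = new DOMDocument();
$doc->loadXML($xml);

// Collect the variable names from the head element.
$headers = array();
foreach ($doc->getElementsByTagName('variable') as $variable) {
  $headers[] = $variable->getAttribute('name');
}

$rows = array();
foreach ($doc->getElementsByTagName('result') as $result) {
  // Start with an empty cell per header so that an unbound variable leaves
  // a gap instead of shifting later columns to the left.
  $cols = array_fill_keys($headers, '');
  foreach ($result->getElementsByTagName('binding') as $binding) {
    $cols[$binding->getAttribute('name')] = trim($binding->textContent);
  }
  $rows[] = array_values($cols);
}

print_r($rows);
```

Even though the label binding comes before the name binding in the sample, each value lands in the column that matches its header.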

With this complete, the next thing to do is format the table output:

<?php
// Turn data into table.
$table = '<table><tr><th>' . implode('</th><th>', $headers) . '</th></tr>';
foreach ($rows as $row) {
  $table .= '<tr><td>';
  $table .= implode('</td><td>', $row);
  $table .= '</td></tr>';
}
$table .= '</table>';

// Add table to HTML document.
qp(QueryPath::HTML_STUB, 'body')->append($table)->writeHTML();
?>

The code above is straightforward. We are taking the data returned from the SPARQL query and formatting it into an HTML table, looping through each row of data.
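One thing to keep in mind is that the values come from a remote server and are placed into markup as-is. If a name or label contains characters such as & or <, the generated table may not be well formed. The sketch below is a variation that escapes each cell with htmlspecialchars(). The table_row() helper and the sample data are hypothetical, standing in for the $headers and $rows arrays built earlier.

```php
<?php
// Sample data standing in for the $headers and $rows built earlier.
$headers = array('uri', 'name', 'label');
$rows = array(
  array('http://example.org/a?x=1&y=2', 'Simon & Garfunkel', '<b>Simon & Garfunkel</b>'),
);

// Hypothetical helper: escape every cell, then wrap the cells in the given
// tag ("th" or "td") inside a single table row.
function table_row(array $cells, $tag) {
  $escaped = array();
  foreach ($cells as $cell) {
    $escaped[] = htmlspecialchars($cell, ENT_QUOTES, 'UTF-8');
  }
  return "<tr><$tag>" . implode("</$tag><$tag>", $escaped) . "</$tag></tr>";
}

$table = '<table>' . table_row($headers, 'th');
foreach ($rows as $row) {
  $table .= table_row($row, 'td');
}
$table .= '</table>';

print $table . "\n";
```

The resulting $table string can be appended to the stub document in the same way as the unescaped version.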

On the final line, we create a new QueryPath object using the HTML_STUB document. We add our new table to it and then write the HTML document to the web browser.

Conclusion

This article illustrates how QueryPath can be used to execute SPARQL queries against remote semantic databases, and how QueryPath can then use the results. SPARQL is a complex language, and the introduction here has been brief. However, with such a robust query language at your disposal, and with QueryPath's HTTP, XML, and HTML capabilities, you can make use of the semantic web from your web applications.