Reading ODT Files with QueryPath
One of the most popular word processing document formats is the ODT (Open Document Text) format, supported natively by OpenOffice.org, and supported as an export format for other major word processors, including Microsoft Office.
An ODT document is actually a ZIP archive composed of several files, including metadata, the document text, and embedded items. Most of these files are XML documents. And that means we can easily access their contents using QueryPath.
In this short article, we'll see how to access the contents of an ODT file.
The code
The text content of an ODT file is stored in the content
file inside of the ODT ZIP archive. To skip through it, you can do something like this:
$ unzip openoffice.odt
$ cat content.xml
The command above will display the text contents (in XML format) of the document named openoffice.odt
. This is the file we are going to parse.
For our simple example, what we are going to do is print out a plain-text outline of the document. We will build the outline by reading the section headers from the ODT file.
Here's the code:
<?php
require_once 'QueryPath/QueryPath.php';
$file = 'zip://openoffice.odt#content.xml';
$doc = qp($file);
foreach ($doc->find('text|h') as $header) {
$style = $header->attr('text:style-name');
$attr_parts = explode('_', $style);
$level = array_pop($attr_parts);
$out = str_repeat(' ', $level) . '- ' . $header->text();
print $out . PHP_EOL;
}
?>
After requiring the QueryPath library, we get to work parsing the file. The file is a ZIP archive. Rather than unzip it ourselves, though, we want to use PHP's ZIP stream handler to uncompress it (as needed) internally. To cause PHP to invoke the ZIP stream handler, we use a special URL to refer to the file:
$file = 'zip://openoffice.odt#content.xml';
The above tells PHP to unzip the openoffice.odt
file and access the content.xml
file in that archive.
We can pass that URL straight into QueryPath, which will then unzip and access the desired data.
The foreach
loop contains the brunt of our application code. It iterates over all of the headers in the document, and then formats some output based on the header.
Notice the CSS selector passed into find()
. It is a little out of the ordinary: text|h
. The pipe operator is rarely used in CSS 3 when you are working with HTML documents. It provides XML namespace support for CSS. So the above will seek for elements that look like this:
<text:h>Header text</text:h>
Effectively, the pipe (|
) in the selector replaces the colon in the tagname. In ODT, headers are stored as h
tags inside of the namespace urn:oasis:names:tc:opendocument:xmlns:text:1.0
, which in turn is usually aliased to text
. Thus, text|h
searches for all headers in the document.
So the iterator is now looping through all of the headers in the file. With each header, five things are done:
$style = $header->attr('text:style-name');
$attr_parts = explode('_', $style);
$level = array_pop($attr_parts);
$out = str_repeat(' ', $level) . '- ' . $header->text();
print $out . PHP_EOL;
First, we extract the style of the header. This will enable us to determine what level of heading this is (e.g. level 1, 2, 3, and so on).
Style is stored in the namespaced attribute text:style-name
. Since we are retrieving the attribute, we use the entire XML name: text:style-name
. (We do not replace the ':' with a '|' because we are not executing a CSS 3 Selector.)
A style name will look like this: text:style-name="Heading202"
. The last digit, 2
indicates the level of the heading, and that is the number in which we are interested.
We retrieve the heading number by exploding the attribute name and then retrieving the last item from the attribute array:
$attr_parts = explode('_', $style);
$level = array_pop($attr_parts);
Now $level
contains a digit indicating the header number. From there, we simply want to display the header formatted to indicate its depth in the outline:
$out = str_repeat(' ', $level) . '- ' . $header->text();
print $out . PHP_EOL;
The str_repeat
function pads the beginning of the string with one space for every heading level. A first level heading will be indented one space. A third level heading will be indented three spaces. After the spaces, we add a dash (for formatting) and then the title of the section.
Finally, each line is printed to standard output.
So what does the output of this command look like? Let's take a quick look at our sample document as rendered by OpenOffice.org:
Notice the multiple levels of headings. Let's extract those now with the tool we just built:
- Section One
- Subsection A
- Subsection B
- Section 2
- Subsection 2A
- Item AA
- Item BB
- Subsection 2B
- Conclusion
Our simple tool parsed OpenOffice.org's XML and displayed an outline based on the headings from the document.
The code presented here is based on documents output from OpenOffice.org 3.x. Because of the flexibility of XML, it is possible for other ODT documents to be generated which will not conform to the same namespace convention we have used. What does this mean? It means you may have to tweak this little example to make it work on certain files (YMMV). But the principles will remain the same.
Word DocX documents also contain an XML payload. In the future, perhaps we will examine parsing and reading such files. <!--break-->