QueryPath: "What's with the &#13 at the end of every line?"

Old problems never die, it seems. A few people have mentioned an interesting QueryPath problem they have experienced. Roughly summarized, the question is "What's with the &#13 at the end of every line?"

note: The trailing ; of the entity name has been removed to avoid stripping by an overzealous content filter.

I hadn't experienced this problem for myself until recently, as I worked on an importer that was parsing thousands of ancient HTML files. These files came from all over the place, and dozens of them had the XML entity &#13 appended to the end of every line. While this may look odd at first glance, a second glance reveals that this is a problem we've likely all seen in the past.

What is &#13?

One of the things QueryPath does automatically (unless you tell it not to) is re-code entities. This is important for XML, since it does not (out of the box) support the array of named entities that are part of HTML. Instead of using named entities like &nbsp;, XML uses numeric representations of the character.

What is &#13? It is the decimal notation for Carriage Return (often encoded as \r). Yup, the old CR-LF problem rears its head again. When a document with Windows CR-LFs is serialized by QueryPath (which, in turn, just uses the PHP DOM library), any CRs are converted to entities.
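If you want to check whether a file is affected before importing it, looking for byte 13 is enough. The snippet below is only a rough sketch: count_carriage_returns() is a hypothetical helper name I made up for illustration, not something QueryPath provides.

```php
<?php
// Hypothetical helper (not part of QueryPath): count the carriage
// returns in a string so affected files can be spotted before parsing.
function count_carriage_returns($text) {
  // "\r" in double quotes is the same single byte as chr(13).
  return substr_count($text, "\r");
}

$sample = "<p>one</p>\r\n<p>two</p>\r\n";
echo count_carriage_returns($sample), "\n"; // prints 2
```

A count of zero means the file uses plain LF line endings and should serialize without the extra entities.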

How do we solve the problem?

As far as I can tell, this behavior accords with the XML standard. For that reason, I'm not inclined to change it.

However, you can avoid the problem altogether by removing CR characters from a document. This is as easy as doing something like this:

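<?php
// Strip every carriage return (chr(13), i.e. "\r") before parsing.
$doc = str_replace(chr(13), '', file_get_contents($file));
// Hand the cleaned markup to QueryPath.
qp($doc);
?>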

The above will remove all of the carriage returns using str_replace(), and then pass the file contents on to QueryPath.
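One caveat: if a file happens to use a bare CR as its line ending (old Mac-style text, which is an assumption on my part and not something I have confirmed in these files), deleting every CR would glue its lines together. A slightly more careful sketch converts CR-LF and lone CR to LF instead. Again, normalize_newlines() is a hypothetical helper name, not a QueryPath function.

```php
<?php
// Hypothetical helper (not part of QueryPath): normalize line endings
// to LF so that no carriage returns reach the serializer.
function normalize_newlines($text) {
  // Replace CR-LF pairs first, then any CR that is left on its own.
  return str_replace(array("\r\n", "\r"), "\n", $text);
}

$raw = "first\r\nsecond\rthird\n";
$clean = normalize_newlines($raw);
// $clean is now "first\nsecond\nthird\n" and can be passed to qp().
```

The order of the search array matters: str_replace() works through it left to right, so CR-LF pairs are collapsed before the lone-CR rule runs and no line break is doubled.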

This helped a lot

I had a similar problem on this project. Your solution did the trick, thanks!

Solved!

Great! Via the search engine I came across this solution, so happy I found the trick. With my project I had the same problem. I will keep on following this blog, great resource!
