Old problems never die, it seems. A few people have mentioned an interesting QueryPath problem they have run into. Roughly summarized, the question is "What's with the &#13 at the end of every line?"
Note: The trailing ; of the entity name has been removed to avoid stripping by an overzealous content filter.
I hadn't experienced this problem myself until recently, when I worked on an importer that parsed thousands of ancient HTML files. These files came from all over the place, and dozens of them had the XML entity &#13 appended to the end of every line. While this may look odd at first glance, a second glance reveals that this is a problem we've likely all seen in the past.
One of the things QueryPath does automatically (unless you tell it not to) is re-code entities. This is important for XML, since it does not (out of the box) support the array of named entities that are part of HTML. Instead of HTML's named entities, XML uses numeric representations of the character.
So what is &#13? It is the decimal notation for Carriage Return (often encoded as \r). Yup, the old CR-LF problem rears its head again. When a document with Windows CR-LFs is serialized by QueryPath (which, in turn, just uses the PHP DOM library), any CRs are converted to entities.
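To make the numbers concrete, here is a small sketch showing that character code 13 is the carriage return, and how to check whether a string carries Windows-style line endings. The sample string and variable names are illustrative only.

```php
<?php
// Illustrative sample: two lines with Windows-style CR-LF endings.
$sample = "first line\r\nsecond line\r\n";

// The carriage return is character code 13, hence the decimal entity.
$crCode = ord("\r");

// Count how many carriage returns the string contains.
$crCount = substr_count($sample, "\r");

echo "CR code: {$crCode}, CRs found: {$crCount}\n";
```

A non-zero count is a good hint that the serialized output will show the entity at the end of each line.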
How do we solve the problem?
As far as I can tell, this behavior accords with the XML standard. For that reason, I'm not inclined to change it.
However, you can avoid the problem altogether by removing CR characters from a document. This is as easy as doing something like this:
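A minimal sketch of that approach, assuming the markup is already in a string (in practice it would probably come from something like file_get_contents()). The helper name strip_carriage_returns and the sample markup are made up for illustration and are not part of QueryPath. The call to qp() is guarded so the snippet also runs where QueryPath is not loaded.

```php
<?php
// Hypothetical helper: drop every carriage return, leaving plain LF line endings.
function strip_carriage_returns(string $markup): string
{
    return str_replace("\r", '', $markup);
}

// Illustrative stand-in for the contents of an old HTML file with CR-LF endings.
$contents = "<html>\r\n<body>\r\n<p>Hello</p>\r\n</body>\r\n</html>\r\n";

$clean = strip_carriage_returns($contents);

// Hand the cleaned markup to QueryPath, if it has been loaded.
if (function_exists('qp')) {
    $qp = qp($clean);
}
```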
The above will remove all of the carriage returns using str_replace, and then pass the file contents on to QueryPath.