By Matt Butcher
QueryPath and Character Sets: Converting content with mb_convert_encoding()
QueryPath can be used to crawl the web, parsing web pages and gleaning information. But the HTML of remote websites is not always as pristine and standards-compliant as we would like, and one thing that can be particularly frustrating is determining the encoding of a document. (This gets substantially more complicated when HTTP headers list one encoding and HTML meta tags list another, a common configuration error.)
QueryPath is primarily a library for working with XML and HTML, but it assumes that you know from the outset what character set your document uses. This is not always a good assumption to make. Here is one way to circumvent the problem: rather than writing code to find out a document's character set, use PHP's built-in functions (assuming you have the mbstring extension compiled in) to do this for you.
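<?php
require 'QueryPath/QueryPath.php';

$url = 'http://mopy.fr/';

// Fetch the page, let mbstring detect its encoding, and convert it to ISO-8859-1.
$contents = mb_convert_encoding(file_get_contents($url), 'iso-8859-1', 'auto');

// Ask QueryPath to stay quiet about malformed markup.
$opts = array('ignore_parser_warnings' => TRUE);

// Print the text of the page's <title> element.
print @qp($contents, 'title', $opts)->text() . PHP_EOL;
?>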
In the code above, I fetch a French-language website (pointed out to me by a posting on the QueryPath support list) and then prepare its content for loading. By default, the HTML DOM uses ISO-8859-1 for its character set, so that is the encoding I convert to.
The really important line above is this:
<?php $contents = mb_convert_encoding(file_get_contents($url), 'iso-8859-1', 'auto'); ?>
This does two things:
- It retrieves the content at the URL from the remote site with file_get_contents.
- It automatically determines the encoding of the document, and converts it to ISO-8859-1. This is done by mb_convert_encoding.
So $contents is going to be in a known and supported character set before it is passed into QueryPath.
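If you would rather control which encodings are considered instead of relying on 'auto', the same two steps can be spelled out with mb_detect_encoding. This is only a minimal sketch: to_latin1() is a hypothetical helper of my own (it is not part of QueryPath or PHP), and the candidate list of UTF-8 and ISO-8859-1 is an assumption you would adjust for the sites you crawl.

```php
<?php
// Hypothetical helper for illustration; not part of QueryPath or PHP.
// Detects the encoding of $raw from a short candidate list, then converts
// the string to ISO-8859-1.
function to_latin1($raw, $candidates = array('UTF-8', 'ISO-8859-1'))
{
    // Strict mode (third argument TRUE) rejects a candidate if the bytes
    // are not valid in that encoding.
    $from = mb_detect_encoding($raw, $candidates, TRUE);

    if ($from === FALSE) {
        // No candidate matched; hand the bytes back untouched.
        return $raw;
    }

    return mb_convert_encoding($raw, 'ISO-8859-1', $from);
}
?>
```

The result of to_latin1(file_get_contents($url)) could then be passed to qp() just as $contents is in the example above.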
Note that in the call to QueryPath, we pass in the ignore_parser_warnings flag and we suppress error messages (with @). While this has nothing to do directly with the encoding issue, it is one way of preventing the ickiness of HTML markup from causing warning and error messages in your output.
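If the @ operator feels too blunt, PHP's libxml functions can collect parser complaints quietly so that you can inspect or discard them. The sketch below uses PHP's DOMDocument directly rather than QueryPath, on the assumption that the warnings you are seeing come from the underlying libxml HTML parser. The sample markup is made up for illustration.

```php
<?php
// Collect libxml parse errors internally instead of emitting PHP warnings.
// The previous setting is remembered so it can be restored afterwards.
$previous = libxml_use_internal_errors(TRUE);

$doc = new DOMDocument();
// Deliberately sloppy markup: an unclosed <p> and an unknown tag.
$doc->loadHTML('<html><head><title>Test</title></head><body><p>Hi<bogus></body></html>');

// Grab whatever the parser complained about (an array of LibXMLError
// objects), then empty the error buffer and restore the old setting.
$errors = libxml_get_errors();
libxml_clear_errors();
libxml_use_internal_errors($previous);

// The document is still usable despite the messy input.
$title = $doc->getElementsByTagName('title')->item(0)->textContent;
print $title . PHP_EOL;
?>
```

Unlike @, this approach only silences the parser, so unrelated errors in the same expression still surface.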
(For another way of converting, see this earlier article on iconv, a strategy that works better if you are bulk importing lots of content from the local file system.)







