Message 146002 - Python tracker


Author	ezio.melotti
Recipients	ajaksu2, christian.heimes, eric.araujo, ezio.melotti, mastrodomenico, orsenthil
Date	2011-10-20.02:58:31
Message-id	<1319079512.28.0.741957623214.issue4733@psf.upfronthosting.co.za>

Content
> Christian Heimes wrote:
>   There is no generic and simple way to detect the encoding of a
>   remote site. Sometimes the encoding is mentioned in the HTTP header,
>   sometimes it's embedded in the <head> section of the HTML document.

FWIW for HTML pages the encoding can be specified in at least 3 places:
* the HTTP headers: e.g. "content-type: text/html; charset=utf-8";
* the XML declaration: e.g. "<?xml version="1.0" encoding="utf-8" ?>";
* the <meta> tag: e.g. "<meta http-equiv="Content-Type" content="text/html; charset=utf-8">".

Browsers usually follow this order when looking for the encoding, so the HTTP headers have the highest priority.  The XML declaration is sometimes (mis)used in (X)HTML pages.

Anyway, since urlopen() is a generic function that can download anything, it shouldn't look at XML declarations and meta tags -- that's something parsers should take care of.
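
To illustrate what the parser side could look like, here is a rough sketch that checks the three places in the order listed above.  The names detect_html_encoding and header_charset, the regular expressions, and the 1024-byte window are all assumptions made for this example; real HTML parsers handle many more edge cases (BOMs, comments, unusual quoting).

```python
import re
from email.message import Message

# Simplified, hypothetical patterns; they only cover the common spellings.
_XML_DECL = re.compile(rb'<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']')
_META = re.compile(rb'<meta[^>]*?charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.IGNORECASE)


def header_charset(content_type):
    """Return the charset parameter of a Content-Type header value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def detect_html_encoding(content_type, body, default=None):
    """Check the HTTP header first, then the XML declaration, then <meta>."""
    charset = header_charset(content_type) if content_type else None
    if charset:
        return charset
    # Assumption: in-document declarations appear near the start of the body.
    head = body[:1024]
    for pattern in (_XML_DECL, _META):
        match = pattern.search(head)
        if match:
            return match.group(1).decode("ascii").lower()
    return default
```

Only header_charset() is generic enough to apply to any download; the rest depends on the body being (X)HTML, which is why it fits better in a parser than in urlopen().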

Regarding the implementation, wouldn't it be better to have a new method on the file-like object returned by urlopen()?
Maybe something like:
>>> page = urlopen(some_url)
>>> page.encoding  # get the encoding from the HTTP headers
'utf-8'
>>> page.decode()  # same as page.read().decode(page.encoding)
'...'

The advantage of having these as a new attribute and method is that you can pass the 'page' object around and other functions can get the decoded content if/when they need it.  OTOH other file-like objects don't have similar methods, so it might get a bit confusing.
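
The idea can be approximated today with a small wrapper, without touching urlopen() itself.  This is only a sketch: DecodablePage and its fallback parameter are hypothetical names made up here, and it assumes the wrapped object exposes a 'headers' attribute that behaves like an email.message.Message (as Python 3 HTTP responses do) plus a read() method.

```python
from email.message import Message  # the type that response headers derive from


class DecodablePage:
    """Hypothetical wrapper illustrating the proposed 'encoding' and 'decode()'."""

    def __init__(self, response, fallback="utf-8"):
        self._response = response
        # Assumption: use this encoding when the header names no charset.
        self._fallback = fallback

    @property
    def encoding(self):
        # Only the HTTP headers are consulted, never the document body.
        return self._response.headers.get_content_charset() or self._fallback

    def read(self, *args):
        return self._response.read(*args)

    def decode(self, errors="strict"):
        # Same as page.read().decode(page.encoding)
        return self.read().decode(self.encoding, errors)
```

Usage would be along the lines of page = DecodablePage(urlopen(some_url)), after which page.encoding and page.decode() behave as in the session above.  Keeping this in a wrapper also sidesteps the concern that other file-like objects lack these methods.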

History
Date	User	Action	Args
2011-10-20 02:58:32	ezio.melotti	set	recipients: + ezio.melotti, orsenthil, christian.heimes, ajaksu2, eric.araujo, mastrodomenico
2011-10-20 02:58:32	ezio.melotti	set	messageid: <1319079512.28.0.741957623214.issue4733@psf.upfronthosting.co.za>
2011-10-20 02:58:31	ezio.melotti	link	issue4733 messages
2011-10-20 02:58:31	ezio.melotti	create