Message 171356 - Python tracker


Author	mastrodomenico
Recipients	ajaksu2, christian.heimes, eric.araujo, ezio.melotti, mastrodomenico, orsenthil, serhiy.storchaka
Date	2012-09-26.19:25:05
Message-id	<1348687505.76.0.248777441283.issue4733@psf.upfronthosting.co.za>

Content
FYI, the exact algorithm for determining the encoding of HTML documents is specified at http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

There are lots of different algorithms documented all over the intertubes for determining HTML encoding; the one above is the one used by browsers.

But that should only be used as part of a full HTML parsing library (e.g. https://code.google.com/p/html5lib/); urlopen should not attempt to sniff the encoding from the data transferred.
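
To illustrate where that line could be drawn, here is a minimal sketch (not a proposed patch) that decodes a body only when the transport layer declares a charset, and otherwise leaves the bytes for an HTML parser to sniff. The helper names `declared_charset` and `decode_body` are hypothetical, made up for this example; the only library call relied on is `email.message.Message.get_content_charset()`.

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any.

    Hypothetical helper: it reads only the header and never inspects the body.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the charset lowercased, or None if absent.
    return msg.get_content_charset()


def decode_body(body: bytes, content_type: str) -> Optional[str]:
    """Decode body using only the charset declared in the header.

    Returns None when no charset is declared, so the caller can hand the
    raw bytes to a full HTML parser instead of guessing here.
    """
    charset = declared_charset(content_type)
    if charset is None:
        return None
    return body.decode(charset)
```

With a real response, the object returned by urlopen exposes its headers as an `email.message.Message` subclass, so `response.headers.get_content_charset()` should give the same answer without the helper.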
