Author mastrodomenico
Recipients ajaksu2, christian.heimes, eric.araujo, ezio.melotti, mastrodomenico, orsenthil, serhiy.storchaka
Date 2012-09-26.19:25:05
Message-id <1348687505.76.0.248777441283.issue4733@psf.upfronthosting.co.za>
Content
FYI, the exact algorithm for determining the encoding of HTML documents is specified at http://www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining-the-character-encoding

There are lots of different algorithms documented all over the intertubes for determining HTML encoding; the one above is the one used by browsers.

But that algorithm should only be used as part of a full HTML parsing library (e.g. https://code.google.com/p/html5lib/); urlopen itself should not attempt to sniff the encoding from the data transferred.
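
To illustrate the distinction, here is a minimal sketch (assuming Python 3) of decoding a response body using only the charset declared in the Content-Type header, with no sniffing of the body. The helper names `charset_from_content_type` and `decode_body`, and the UTF-8 fallback, are assumptions made for this example and are not part of urllib.

```python
import codecs
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or default."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the declared charset, if any.
    return msg.get_content_charset(failobj=default)


def decode_body(body, content_type, fallback="utf-8"):
    """Decode bytes using only the declared charset; never inspect the body.

    The fallback is an assumption for this sketch; a real HTML parser would
    apply the full WHATWG algorithm at this point instead.
    """
    charset = charset_from_content_type(content_type, default=fallback)
    try:
        codecs.lookup(charset)
    except LookupError:
        # The declared charset is unknown to Python; use the fallback.
        charset = fallback
    return body.decode(charset, errors="replace")


if __name__ == "__main__":
    print(decode_body(b"caf\xe9", "text/html; charset=ISO-8859-1"))
```

With a real response from urllib.request.urlopen, the same header lookup should be available as response.headers.get_content_charset(). Anything beyond that (BOM checks, meta prescan, and so on) is left to the HTML parser.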
History
Date User Action Args
2012-09-26 19:25:05 mastrodomenico set recipients: + mastrodomenico, orsenthil, christian.heimes, ajaksu2, ezio.melotti, eric.araujo, serhiy.storchaka
2012-09-26 19:25:05 mastrodomenico set messageid: <1348687505.76.0.248777441283.issue4733@psf.upfronthosting.co.za>
2012-09-26 19:25:05 mastrodomenico link issue4733 messages
2012-09-26 19:25:05 mastrodomenico create