Author hodgestar
Recipients hodgestar, yanne
Date 2008-10-03.11:09:13
Message-id <1223032155.96.0.398070025919.issue3932@psf.upfronthosting.co.za>
Content
I've tracked down the cause to the .unescape(...) method in HTMLParser.
The replaceEntities function passed to re.sub() always returns a unicode
character, even when the matched string s is a byte string. Changing line
383 to:

  return self.entitydefs[s].encode("utf-8")

makes the test pass. Unfortunately, this is not a viable solution in
the general case: there is no way to know which character set to encode
to without both knowing the HTTP headers (which are not available to
HTMLParser) and looking at the XML and HTML headers.
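
To illustrate why hard-coding "utf-8" is not safe, here is a minimal Python 3 sketch showing that the same entity becomes different bytes depending on the document's character set. The helper name encode_entity is hypothetical and is not part of HTMLParser; it only assumes the standard html.entities.name2codepoint table.

```python
from html.entities import name2codepoint


def encode_entity(name, charset):
    """Return the bytes for a named entity in the given charset.

    Hypothetical helper for illustration only; HTMLParser has no such API.
    """
    return chr(name2codepoint[name]).encode(charset)


# The same entity yields different byte sequences per charset, so the
# parser cannot pick the right one without knowing the document encoding.
utf8_bytes = encode_entity("eacute", "utf-8")
latin1_bytes = encode_entity("eacute", "latin-1")
```

If the surrounding byte string is Latin-1 and the entity is encoded as UTF-8 (or the other way round), the resulting attribute value mixes two encodings.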

Python 3.0 implicitly rejects non-unicode strings right at the start of
html.parser.HTMLParser.feed(...) by adding '' to the data passed in.
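
A small sketch of that behaviour, assuming a current Python 3 interpreter, where feed() still concatenates the input onto a str buffer. The function name feed_rejects_bytes is made up for this example.

```python
from html.parser import HTMLParser


def feed_rejects_bytes(data=b"<a title='&amp;'>x</a>"):
    """Return True if HTMLParser.feed() refuses a byte string."""
    parser = HTMLParser()
    try:
        # str + bytes concatenation inside feed() raises TypeError.
        parser.feed(data)
    except TypeError:
        return True
    return False
```

Decoding the bytes first (for example data.decode("utf-8"), if the encoding is known) avoids the error, which puts the charset decision in the caller's hands.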

Given Python 3.0's behaviour, the docs should perhaps be updated to say
HTMLParser does not support non-unicode strings? If it should support
byte strings, we'll have to figure out how to handle encoded entity issues.

It's a bit weird that character and entity references outside
tags/attributes result in calls to .handle_entityref(...) and
.handle_charref(...), while those inside attribute values get unescaped
automatically. I don't really see what can be done about that though.
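
The asymmetry can be reproduced with a short Python 3 sketch. Two assumptions that are not in the original message: the class RefLogger and the function collect_events are made up for this example, and convert_charrefs=False (a parameter added in later Python 3 releases) is passed so that references in text are still reported through the handle_entityref and handle_charref callbacks rather than being folded into the text data.

```python
from html.parser import HTMLParser


class RefLogger(HTMLParser):
    """Record parser callbacks so the two code paths can be compared."""

    def __init__(self):
        # Assumption: convert_charrefs=False restores the callback
        # behaviour for references that appear in text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive here already unescaped.
        self.events.append(("starttag", tag, attrs))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


def collect_events(markup):
    parser = RefLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


events = collect_events('<a title="&amp; &#65;">&amp; &#65;</a>')
```

Here the title attribute reaches handle_starttag as the already decoded string "& A", while the same references in the element text are reported by name ("amp", "65") and left for the subclass to resolve.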
History
Date                 User       Action  Args
2008-10-03 11:09:16  hodgestar  set     recipients: + hodgestar, yanne
2008-10-03 11:09:15  hodgestar  set     messageid: <1223032155.96.0.398070025919.issue3932@psf.upfronthosting.co.za>
2008-10-03 11:09:15  hodgestar  link    issue3932 messages
2008-10-03 11:09:13  hodgestar  create