Message 74239 - Python tracker


Author	hodgestar
Recipients	hodgestar, yanne
Date	2008-10-03.11:09:13
Message-id	<1223032155.96.0.398070025919.issue3932@psf.upfronthosting.co.za>

Content

I've tracked down the cause to the .unescape(...) method in HTMLParser.
The replaceEntities function passed to re.sub() always returns a unicode
character, even when the matched string s is a byte string. Changing line
383 to:

  return self.entitydefs[s].encode("utf-8")

makes the test pass. Unfortunately, this is not a viable solution in the
general case: there is no way to know which character set to encode to
without knowing both the HTTP headers (which are not available to
HTMLParser) and the encoding declarations in the XML and HTML headers.
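
To illustrate why a hard-coded charset cannot be right in general, here
is a minimal sketch. It is written for current Python 3 rather than the
2.x code under discussion, and encode_entity is a hypothetical helper,
not part of HTMLParser. It shows the same entity turning into different
bytes depending on the charset chosen:

```python
from html.entities import name2codepoint


def encode_entity(name, charset):
    """Hypothetical helper: bytes for a named entity in a given charset."""
    return chr(name2codepoint[name]).encode(charset)


# The same entity, &eacute;, encoded for two different document charsets.
utf8_bytes = encode_entity("eacute", "utf-8")
latin1_bytes = encode_entity("eacute", "latin-1")

print(utf8_bytes, latin1_bytes)
```

Splicing the UTF-8 bytes into a document that is actually Latin-1 (or
the other way round) would corrupt it, which is why the parser would
need to be told the encoding from outside.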

Python 3.0 implicitly rejects non-unicode strings right at the start of
html.parser.HTMLParser.feed(...) by concatenating the data passed in
with '', which raises a TypeError for bytes.
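
A quick way to see the Python 3 behaviour is the sketch below.
feed_accepts is a hypothetical helper, and the UTF-8 decode is an
assumption made by the caller, which is exactly the knowledge the parser
itself does not have:

```python
from html.parser import HTMLParser


def feed_accepts(data):
    """Hypothetical helper: True if feed() takes data, False on TypeError."""
    parser = HTMLParser()
    try:
        parser.feed(data)
    except TypeError:
        return False
    return True


raw = b"<p>caf\xc3\xa9 &amp; bar</p>"

bytes_ok = feed_accepts(raw)                 # byte string passed directly
text_ok = feed_accepts(raw.decode("utf-8"))  # caller decodes first

print(bytes_ok, text_ok)
```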

Given Python 3.0's behaviour, perhaps the docs should be updated to say
that HTMLParser does not support non-unicode strings? If it should
support byte strings, we'll have to figure out how to encode the
replaced entities.

It's a bit odd that entity and character references outside
tags/attributes result in calls to .handle_entityref(...) and
.handle_charref(...), while those inside attribute values get unescape
called on them automatically. I don't really see what can be done about
that, though.
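
The asymmetry can be observed with a small sketch against current
Python 3. EventRecorder is a hypothetical subclass, and
convert_charrefs=False is passed because newer versions otherwise
convert references in text automatically as well:

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper that records which handler saw each reference."""

    def __init__(self):
        # convert_charrefs=False keeps the per-reference callbacks for text.
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # Attribute values arrive here with references already unescaped.
        self.events.append(("starttag", tag, attrs))

    def handle_entityref(self, name):
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        self.events.append(("charref", name))


recorder = EventRecorder()
recorder.feed('<a title="x &amp; y">a &amp; b &#38; c</a>')
recorder.close()

for event in recorder.events:
    print(event)
```

The reference in the title attribute is delivered already unescaped
inside attrs, while the two references in the text each arrive through
their own callback and are left for the subclass to resolve.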

History
Date	User	Action	Args
2008-10-03 11:09:16	hodgestar	set	recipients: + hodgestar, yanne
2008-10-03 11:09:15	hodgestar	set	messageid: <1223032155.96.0.398070025919.issue3932@psf.upfronthosting.co.za>
2008-10-03 11:09:15	hodgestar	link	issue3932 messages
2008-10-03 11:09:13	hodgestar	create