Message31177
Found the problem. In sgmllib.py for Python 2.5, in convert_charref, the code for handling character escapes assumes that ASCII characters have values up to 255. The correct limit is 127.
If a Unicode string is run through SGMLParser, and an attribute in that string contains a character escape with a value between 128 and 255, which is valid in Unicode, the value is passed through as a character with "chr", creating a one-character string that is not valid ASCII.
Then, when that bad string is later converted to Unicode as the output is assembled, a UnicodeDecodeError exception is raised.
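The failure described above can be sketched outside sgmllib. This is only an emulation written for Python 3, where the Python 2 byte-string behavior has to be imitated with bytes; py2_chr and py2_concat are hypothetical helper names, not part of any library.

```python
# Sketch only: Python 3 emulation of the Python 2.5 failure described above.
# In Python 2, chr(n) returns a one-byte str; bytes([n]) plays that role here.

def py2_chr(n):
    """Hypothetical stand-in for Python 2's chr(): one byte for 0 <= n <= 255."""
    return bytes([n])


def py2_concat(text, piece):
    """Hypothetical stand-in for Python 2's `unicode + str`, which decodes
    the byte string with the default ASCII codec before joining."""
    return text + piece.decode("ascii")


# A value inside the ASCII range joins cleanly.
ok = py2_concat(u"caf", py2_chr(101))

# A value between 128 and 255 yields a byte that ASCII cannot decode.
try:
    py2_concat(u"caf", py2_chr(233))
    failed = False
except UnicodeDecodeError:
    failed = True
```

The second call fails at the decode step, not at the chr step, which matches the report: the bad string is created silently and the exception only appears later when the output is assembled.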
So the fix is to change 255 to 127 in convert_charref in sgmllib.py,
as shown below. With this change, escapes with values above 127 are no
longer converted and stay expressed as escape sequences. Please patch accordingly. Thanks.
    def convert_charref(self, name):
        """Convert character reference, may be overridden."""
        try:
            n = int(name)
        except ValueError:
            return
        if not 0 <= n <= 127:  # ASCII ends at 127, not 255
            return
        return self.convert_codepoint(n)
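To check the boundary behavior of the patched method in isolation, here is a minimal standalone sketch. CharrefSketch is a hypothetical class made up for this example, and its convert_codepoint simply calls chr, which I am assuming mirrors what sgmllib does; it is not the real SGMLParser.

```python
class CharrefSketch(object):
    """Hypothetical stand-in holding only the two methods under discussion."""

    def convert_charref(self, name):
        """Convert character reference, with the proposed 127 limit."""
        try:
            n = int(name)
        except ValueError:
            return None
        if not 0 <= n <= 127:  # ASCII ends at 127, not 255
            return None
        return self.convert_codepoint(n)

    def convert_codepoint(self, codepoint):
        # Assumption: the parser turns the code point into a character with chr.
        return chr(codepoint)


parser = CharrefSketch()
inside = parser.convert_charref("65")      # within ASCII, converted
upper = parser.convert_charref("127")      # last ASCII value, converted
outside = parser.convert_charref("128")    # first value past ASCII, rejected
not_a_number = parser.convert_charref("x41")
```

Returning None for 128 and above means the caller sees the reference as unconverted, so it can leave the escape sequence in place instead of producing a non-ASCII byte string.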
Date | User | Action | Args
2007-08-23 14:51:44 | admin | link | issue1651995 messages
2007-08-23 14:51:44 | admin | create |