Author nagle
Date 2007-02-07.07:57:19
Content
Found the problem. In sgmllib.py for Python 2.5, the code in convert_charref that handles character escapes treats every value up to 255 as an ASCII character.
But ASCII ends at 127.

If a Unicode string is run through SGMLParser, and an attribute in that string contains a character escape with a value between 128 and 255 (which is valid in Unicode), the
value is passed through as a character with "chr", creating a
one-character string that is not valid ASCII.

Then, when that bad string is later converted to Unicode as the output is assembled, a UnicodeDecodeError is raised.
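
The failure can be reproduced in isolation. The following is only a sketch, written for Python 3, where the ASCII decode that Python 2 performs implicitly when a byte string is combined with a Unicode string has to be spelled out. Both helper names are hypothetical; old_convert_codepoint merely emulates what "chr" returned in Python 2.

```python
def old_convert_codepoint(codepoint):
    """Emulate Python 2's chr(): return a one-byte string (hypothetical helper)."""
    return bytes([codepoint])


def join_with_unicode(text, raw):
    """Emulate Python 2's implicit ASCII decode when bytes meet Unicode text."""
    return text + raw.decode("ascii")


# 101 is "e", plain ASCII, so the decode succeeds.
ok = join_with_unicode("caf", old_convert_codepoint(101))

# 233 is outside ASCII, so the decode step raises UnicodeDecodeError.
try:
    join_with_unicode("caf", old_convert_codepoint(233))
    failed = False
except UnicodeDecodeError:
    failed = True
```

Any value from 128 to 255 takes the failing path, which matches the range described above.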

So the fix is to change 255 to 127 in convert_charref in sgmllib.py,
as shown below. Character escapes above 127 are then no longer converted
and stay expressed as escape sequences. Please patch accordingly. Thanks.

def convert_charref(self, name):
    """Convert character reference, may be overridden."""
    try:
        n = int(name)
    except ValueError:
        return
    if not 0 <= n <= 127:  # ASCII ends at 127, not 255
        return
    return self.convert_codepoint(n)
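
The effect of the new bound can be checked without sgmllib. This is a sketch only, assuming convert_codepoint simply wraps "chr" as described above; the class name CharRefDemo is hypothetical and stands in for the parser, modelling nothing but the two conversion methods.

```python
class CharRefDemo:
    """Hypothetical stand-in for the parser; not the sgmllib class itself."""

    def convert_codepoint(self, codepoint):
        # Assumption: the real method just calls chr(), as the report describes.
        return chr(codepoint)

    def convert_charref(self, name):
        """Convert a numeric character escape, or return None to leave it alone."""
        try:
            n = int(name)
        except ValueError:
            return None
        if not 0 <= n <= 127:  # ASCII ends at 127
            return None
        return self.convert_codepoint(n)


demo = CharRefDemo()
names = ("65", "127", "128", "233", "x41")
results = {name: demo.convert_charref(name) for name in names}
```

With the bound at 127, only "65" and "127" are converted here. The values 128 and 233 are rejected by the range check, and "x41" is rejected because int() cannot parse it, so all three come back as None.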
History
Date User Action Args
2007-08-23 14:51:44 admin link issue1651995 messages
2007-08-23 14:51:44 admin create