classification
Title: sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5
Type: behavior
Stage: test needed
Components: Library (Lib)
Versions: Python 3.0, Python 2.6
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To:
Nosy List: ajaksu2, bind, cyhawk, georg.brandl, hawking, nagle, odormond, wrstlprmpft
Priority: high
Keywords: patch

Created on 2007-02-04 22:34 by nagle, last changed 2009-03-31 22:12 by georg.brandl. This issue is now closed.

Files
File name Uploaded Description
issue1651995.patch cyhawk, 2009-03-31 21:00 patch and unittest
Messages (10)
msg31175 - Author: John Nagle (nagle) Date: 2007-02-04 22:34
   I'm running a website page through BeautifulSoup.  It parses OK with Python 2.4, but Python 2.5 fails with an exception:

Traceback (most recent call last):
  File "./sitetruth/InfoSitePage.py", line 268, in httpfetch
    self.pagetree = BeautifulSoup.BeautifulSoup(sitetext) # parse into tree form
  File "./sitetruth/BeautifulSoup.py", line 1326, in __init__
    BeautifulStoneSoup.__init__(self, *args, **kwargs)
  File "./sitetruth/BeautifulSoup.py", line 973, in __init__
    self._feed()
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 291, in parse_starttag
    self.finish_starttag(tag, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 340, in finish_starttag
    self.handle_starttag(tag, method, attrs)
  File "/usr/lib/python2.5/sgmllib.py", line 376, in handle_starttag
    method(attrs)
  File "./sitetruth/BeautifulSoup.py", line 1416, in start_meta
    self._feed(self.declaredHTMLEncoding)
  File "./sitetruth/BeautifulSoup.py", line 998, in _feed
    SGMLParser.feed(self, markup or "")
  File "/usr/lib/python2.5/sgmllib.py", line 99, in feed
    self.goahead(0)
  File "/usr/lib/python2.5/sgmllib.py", line 133, in goahead
    k = self.parse_starttag(i)
  File "/usr/lib/python2.5/sgmllib.py", line 285, in parse_starttag
    self._convert_ref, attrvalue)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 0: ordinal not in range(128)

    The code that's failing is in "_convert_ref", which is new in Python 2.5. That function wasn't present in 2.4.  I think the code is trying to handle single quotes inside of double quotes in HTML attributes, or something like that.

    To replicate, fetch the page at

	http://www.bankofamerica.com
or
	http://www.gm.com

and run it through BeautifulSoup.

Something about this code doesn't like big companies. Web sites of smaller companies are going through OK.
msg31176 - Author: wrstl prmpft (wrstlprmpft) Date: 2007-02-05 07:16
I had a similar problem recently and did not have time to file a bug-report. Thanks for doing that.

The problem is in the code that handles entity and character references in SGMLParser.parse_starttag. It does not seem to be careful about unicode/str issues.
(But maybe BeautifulSoup needs to tell it to be?)

My quick'n'dirty workaround was to remove the offending character entity from the page text before feeding it to BeautifulSoup::

  text = text.replace('®', '') # remove rights reserved sign entity

cheers,
stefan
msg31177 - Author: John Nagle (nagle) Date: 2007-02-07 07:57
Found the problem. In sgmllib.py for Python 2.5, convert_charref accepts character references with values up to 255, as if ASCII extended that far. The correct ASCII limit is 127, of course.

If a Unicode string is run through SGMLParser, and an attribute in that string contains a character reference with a value between 128 and 255 (which is valid in Unicode), the value is converted with "chr", creating a one-character byte string that is not valid ASCII.

Then, when that byte string is later converted to Unicode as the output is assembled, the UnicodeDecodeError exception is raised.

So the fix is to change 255 to 127 in convert_charref in sgmllib.py, as shown below. Character references above 127 are then left as escape sequences instead of being decoded. Please patch accordingly. Thanks.

def convert_charref(self, name):
    """Convert character reference, may be overridden."""
    try:
        n = int(name)
    except ValueError:
        return
    if not 0 <= n <= 127:  # ASCII ends at 127, not 255
        return
    return self.convert_codepoint(n)
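
The mechanism described above can be reproduced in miniature. This is only a sketch, written for Python 3, where `bytes` stands in for the Python 2 `str` type and the explicit `.decode("ascii")` stands in for the implicit ASCII coercion that Python 2 performs when a byte string is combined with a `unicode` string. The names `old_convert_codepoint` and `join_with_unicode` are hypothetical and are not part of sgmllib.

```python
def old_convert_codepoint(codepoint):
    # Python 2's chr(n) returned a one-byte str for 0 <= n <= 255;
    # bytes([n]) is the closest Python 3 equivalent.
    return bytes([codepoint])


def join_with_unicode(prefix, raw):
    # Python 2 decoded the byte string as ASCII implicitly when str and
    # unicode were mixed; here the decode is written out explicitly.
    return prefix + raw.decode("ascii")


# A reference inside the ASCII range survives the round trip.
ok = join_with_unicode("x", old_convert_codepoint(33))

# A reference in the 128..255 range (here 0xA7, as in the traceback) does not.
try:
    join_with_unicode("x", old_convert_codepoint(0xA7))
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)
```

With the 127 limit in place, the second call would never happen, because the reference would be left in the text undecoded.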
msg31178 - Author: John Nagle (nagle) Date: 2007-04-27 21:41
We've been running this fix for several months now, and it seems to work.  Would someone please check it and put it into the trunk?  Thanks.
msg31179 - Author: Olivier Dormond (odormond) Date: 2007-06-06 16:38
Hello,

 I've been able to fix this entity conversion bug with the following patch.

Cheers,

Odie

--- /usr/lib/python2.5/sgmllib.py       2007-05-27 17:55:15.000000000 +0200
+++ modules/sgmllib.py  2007-06-06 18:29:13.000000000 +0200
@@ -396,7 +396,7 @@
         return self.convert_codepoint(n)
 
     def convert_codepoint(self, codepoint):
-        return chr(codepoint)
+        return unichr(codepoint)
 
     def handle_charref(self, name):
         """Handle character reference, no need to override."""
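
The two fixes proposed in this thread treat references above 127 differently. The sketch below contrasts them under some assumptions: it is written for Python 3, where `chr` plays the role of Python 2's `unichr`, and both function names are hypothetical stand-ins that fold convert_charref and convert_codepoint into one function for brevity.

```python
def convert_charref_ascii_only(name):
    """Range-limit approach: decode only references in the ASCII range."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= 127:
        return None  # the caller leaves the "&#NNN;" text untouched
    return chr(n)


def convert_charref_unicode(name):
    """unichr approach: keep the 0..255 check, return a Unicode character."""
    try:
        n = int(name)
    except ValueError:
        return None
    if not 0 <= n <= 255:
        return None
    return chr(n)  # Python 3 chr() corresponds to Python 2 unichr()


ascii_result = convert_charref_ascii_only("167")   # left undecoded
unicode_result = convert_charref_unicode("167")    # decoded to U+00A7
```

One difference worth keeping in mind: in Python 2, unichr always returns a unicode object, so the second approach changes the result type even when the parser was fed a plain byte string, while the first approach keeps the result a str.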
msg57014 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-11-01 17:15
Restore bug title.
msg57022 - Author: Simon (bind) Date: 2007-11-01 17:55
The 255 -> 127 change works for me. Let me know if I can help with unit
tests or whatever to get this patched.
msg84648 - Author: Daniel Diniz (ajaksu2) Date: 2009-03-30 21:06
A patch against SVN trunk including a unittest would be great.
msg84899 - Author: Daniel Darabos (cyhawk) Date: 2009-03-31 21:00
Attached is a patch against SVN trunk, including a unittest. The test is not great: it practically only checks that the patch was applied, not the real-life situation in which the exception occurs. I'm not too handy with sgmllib (I only encountered this problem through BeautifulSoup).
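
For illustration only, a test closer to the real-life situation might check that a decoded attribute value stays pure ASCII. The sketch below is not the attached test and does not use sgmllib at all: `decode_ascii_charrefs` and the `CHARREF` pattern are hypothetical stand-ins for the parser's attribute handling, written for Python 3.

```python
import re
import unittest

# Hypothetical stand-in for the parser's attribute handling: decimal
# character references are decoded only when they fall in the ASCII range.
CHARREF = re.compile(r"&#([0-9]+);")


def decode_ascii_charrefs(value):
    def replace(match):
        n = int(match.group(1))
        # Leave non-ASCII references as literal text instead of decoding.
        return chr(n) if n <= 127 else match.group(0)
    return CHARREF.sub(replace, value)


class OnlyDecodeAsciiTest(unittest.TestCase):
    def test_ascii_reference_is_decoded(self):
        self.assertEqual(decode_ascii_charrefs("&#33;"), "!")

    def test_non_ascii_references_are_kept(self):
        self.assertEqual(decode_ascii_charrefs("&#167;"), "&#167;")
        self.assertEqual(decode_ascii_charrefs("&#8216;"), "&#8216;")

    def test_result_is_pure_ascii(self):
        # A pure-ASCII result can be combined with other text without
        # triggering a decode error.
        value = decode_ascii_charrefs("&#167; 2007")
        self.assertEqual(value.encode("ascii"), b"&#167; 2007")
```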
msg84934 - Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2009-03-31 22:12
Committed in r70906.
History
Date User Action Args
2009-03-31 22:12:05 georg.brandl set status: open -> closed; resolution: accepted; messages: + msg84934
2009-03-31 21:00:36 cyhawk set files: + issue1651995.patch; messages: + msg84899
2009-03-30 21:06:19 ajaksu2 set nosy: + ajaksu2; messages: + msg84648; keywords: + patch; type: behavior; stage: test needed
2008-10-02 21:47:21 cyhawk set nosy: + cyhawk
2008-09-12 12:18:29 barry set versions: + Python 3.0
2008-09-12 12:18:19 barry set priority: normal -> high; versions: + Python 2.6
2007-11-01 17:55:31 bind set messages: + msg57022
2007-11-01 17:37:45 hawking set nosy: + hawking
2007-11-01 17:15:59 georg.brandl set nosy: + georg.brandl; messages: + msg57014; title: sgmllib _convert_ref UnicodeDecodeError exception, new in 2. -> sgmllib _convert_ref UnicodeDecodeError exception, new in 2.5
2007-10-31 12:50:16 bind set nosy: + bind
2007-02-04 22:34:48 nagle create