Message 73571 - Python tracker


Author	yanne
Recipients	yanne
Date	2008-09-22.12:32:08
Message-id	<1222086790.7.0.800001604957.issue3932@psf.upfronthosting.co.za>

Content

It seems that HTMLParser.feed raises an exception whenever an attribute
value contains both an ampersand ('&') and non-ASCII characters.

Running the attached test file with Python 2.5 succeeds, but with Python
2.6, the result is:

C:\Python26>python.exe test.py
Without & in attribute
OK
With & in attribute
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    HP().feed(s)
  File "C:\Python26\lib\HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "C:\Python26\lib\HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "C:\Python26\lib\HTMLParser.py", line 249, in parse_starttag
    attrvalue = self.unescape(attrvalue)
  File "C:\Python26\lib\HTMLParser.py", line 386, in unescape
    return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)
  File "C:\Python26\lib\re.py", line 150, in sub
    return _compile(pattern, 0).sub(repl, string, count)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I am running:

Python 2.6rc2 (r26rc2:66507, Sep 18 2008, 14:27:33) [MSC v.1500 32 bit (Intel)] on win32
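
The attached test.py is not reproduced in this message, so the following is only a rough sketch of the situation, not the original test. It is written for Python 3 (html.parser), not for 2.6, and the class name AttrCollector and the sample attribute value are made up for illustration. It assumes, based on the traceback, that the failure comes from an implicit ASCII decode of a byte string that starts with 0xc3; it shows that decode failing at position 0, and that decoding the input explicitly before calling feed avoids it.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records the attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, attrs))


# Hypothetical attribute value: UTF-8 bytes for a non-ASCII character
# (0xc3 0xa4) followed by an '&amp;' entity.
value = b"\xc3\xa4 &amp; b"
raw = b'<a title="' + value + b'">'

# Assumption: Python 2 falls back to an ASCII decode when a byte string
# is combined with unicode text. That decode fails on the first byte.
try:
    value.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# Decoding with the actual encoding before feeding avoids the problem.
parser = AttrCollector()
parser.feed(raw.decode("utf-8"))
parser.close()

print(ascii_error)
print(parser.seen)
```

On Python 2.6 the corresponding workaround would presumably be to pass a unicode object to feed instead of a byte string, but I have not verified that against the attached file.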
