Title: HTMLParser fails to handle charref in attribute value
Python 3.2, Python 3.3, Python 2.7
ezio.melotti, fdrake, jhylton
Created on 2005-05-12 02:30 by jhylton, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Author: Jeremy Hylton (jhylton) Date: 2005-05-12 02:30
The HTML spec describes two ways to encode an attribute
value that contains a URI with an ampersand: the entity
reference &amp; and the numeric character reference &#38;.

>>> from HTMLParser import *
>>> class P(HTMLParser):
...   def handle_starttag(self, tag, attrs):
...     print attrs
>>> P().feed("<tag attr=\"&amp;\">")
[('attr', '&')]
>>> P().feed("<tag attr=\"&#38;\">")
[('attr', '&#38;')]

It seems that each string should produce the same
parsed value.  I would hazard a guess that the easiest
way to make this happen is to extend the current
unescape() to unescape character references.  Is there
any reason not to do that?  I'll provide a fix if that
sounds like a reasonable answer.
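
As a rough illustration of the proposal, the sketch below resolves both entity references and numeric character references in one pass. It is written for Python 3 rather than the Python 2 module used above, and `unescape_attr` is a hypothetical helper name, not the library's own function or the fix that was eventually committed.

```python
import re
from html.entities import name2codepoint

# Matches &name;, decimal &#38; and hexadecimal &#x26; style references.
_REF = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def unescape_attr(value):
    """Hypothetical helper: resolve entity and character references."""

    def replace(match):
        ref = match.group(1)
        if ref[0] == "#":
            # Numeric reference: hexadecimal if prefixed with x/X, else decimal.
            try:
                if ref[1] in "xX":
                    return chr(int(ref[2:], 16))
                return chr(int(ref[1:]))
            except (ValueError, OverflowError):
                # Out-of-range code points are left as written.
                return match.group(0)
        if ref in name2codepoint:
            return chr(name2codepoint[ref])
        # Unknown entity names are left untouched.
        return match.group(0)

    return _REF.sub(replace, value)
```

With a helper along these lines, `unescape_attr("&amp;")` and `unescape_attr("&#38;")` both return a single ampersand, which is the behaviour asked for here.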
Author: Daniel Diniz (ajaksu2) Date: 2009-02-16 01:01
Maybe the charrefs were lost in the SF -> Roundup transition?
Author: Ezio Melotti (ezio.melotti) Date: 2011-11-08 01:11
unescape() already converts named, decimal and hexadecimal entities, so this can be closed.
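
For readers on Python 3.4 or later, the same three forms can be checked with the module-level `html.unescape()` function. This is an illustration using that function, not the `HTMLParser.unescape()` method discussed in this issue, and the sample query string is made up for the example.

```python
from html import unescape

# The same ampersand written as a named, decimal and hexadecimal reference.
named = unescape("a=1&amp;b=2")
decimal = unescape("a=1&#38;b=2")
hexadecimal = unescape("a=1&#x26;b=2")

# All three forms should decode to the same string.
all_equal = named == decimal == hexadecimal
```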
Author: Roundup Robot (python-dev) Date: 2011-11-14 16:57
New changeset 3c3009f63700 by Ezio Melotti in branch '2.7':
#1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser.

New changeset 16ed15ff0d7c by Ezio Melotti in branch '3.2':
#1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser.

New changeset 426f7a2b1826 by Ezio Melotti in branch 'default':
#1745761, #755670, #13357, #12629, #1200313: merge with 3.2.
Author: Ezio Melotti (ezio.melotti) Date: 2011-11-14 17:16
There was actually a bug with entities in unquoted attribute values. I fixed it and added tests for all the cases (quoted and unquoted).
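
A minimal way to exercise the quoted and unquoted cases on Python 3 is sketched below. `AttrCollector` and `parse_attrs` are hypothetical names introduced for this example; they are not part of the committed tests.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records the attribute list of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append(attrs)


def parse_attrs(markup):
    """Feed markup to a fresh parser and return the collected attributes."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.seen


# Double-quoted, single-quoted and unquoted values, each with a reference.
double_quoted = parse_attrs('<a href="&#38;">')
single_quoted = parse_attrs("<a href='&#x26;'>")
unquoted = parse_attrs("<a href=&amp;>")
```

On an interpreter that includes the fix, each of the three results should hold a single `('href', '&')` pair.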
