
classification
Title: HTMLParser handle_starttag replaces entity references in attribute value even without semicolon
Type: behavior
Stage: test needed
Components: Library (Lib)
Versions: Python 3.6, Python 3.4, Python 3.5, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: ezio.melotti, frogcoder
Priority: normal
Keywords:

Created on 2015-09-26 16:46 by frogcoder, last changed 2022-04-11 14:58 by admin.

Files
File name Uploaded Description
parserentity.py frogcoder, 2015-09-26 16:46 an example of the example described
Messages (2)
msg251654 - Author: Sean Liu (frogcoder) Date: 2015-09-26 16:46
The documentation for HTMLParser.handle_starttag states: "All entity references from html.entities are replaced in the attribute values." However, the parser also replaces any text that matches an ampersand followed by an entity name, even when the terminating semicolon is missing.

For example, <a href="go?t=buy&currency=usd">foo</a> produces "go?t=buy¤cy=usd" as the value of the href attribute, because "curren" is the entity name for the currency sign (¤) and "&curren" is replaced even though no semicolon follows it.
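
A minimal way to reproduce the report is sketched below. The AttrCollector class and its seen attribute are illustrative names, not part of html.parser. The printed value depends on the Python version: the versions listed on this issue replace "&curren", while a release that implements the attribute rule would presumably leave the value unchanged.

```python
import html
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record the attributes of every start tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, already unescaped.
        self.seen.append((tag, dict(attrs)))


parser = AttrCollector()
parser.feed('<a href="go?t=buy&currency=usd">foo</a>')
parser.close()
href = parser.seen[0][1]["href"]

# Affected versions give 'go?t=buy\xa4cy=usd'; a version that leaves
# unterminated references alone gives the original string back.
print(ascii(href))

# html.unescape shows the same prefix matching on a plain string.
print(ascii(html.unescape("go?t=buy&currency=usd")))
```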
msg251657 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2015-09-26 17:06
This does indeed seem to be a bug. The relevant passage appears to be at http://www.w3.org/TR/html5/syntax.html#consume-a-character-reference :

"""
If the character reference is being consumed as part of an attribute, and the last character matched is not a ";" (U+003B) character, and the next character is either a "=" (U+003D) character or an alphanumeric ASCII character, then, for historical reasons, all the characters that were matched after the U+0026 AMPERSAND character (&) must be unconsumed, and nothing is returned. However, if this next character is in fact a "=" (U+003D) character, then this is a parse error, because some legacy user agents will misinterpret the markup in those cases.
"""

Off the top of my head, this paragraph is not implemented in HTMLParser (and it should be).
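
The quoted rule can be sketched roughly as follows. This is only an illustration under stated assumptions, not HTMLParser's implementation: unescape_attr_value and _NAMED_REF are hypothetical names, only named references are handled (numeric ones such as &#164; are left alone), and the entity table is html.entities.html5, which lists the legacy names that are accepted without a semicolon.

```python
import re
from html.entities import html5

# "&", an alphanumeric name, and an optional ";" (hypothetical helper regex).
_NAMED_REF = re.compile(r"&([A-Za-z][A-Za-z0-9]*)(;?)")


def unescape_attr_value(value):
    """Replace named references in an attribute value, leaving an
    unterminated reference alone when "=" or an ASCII alphanumeric follows."""

    def replace(match):
        name, semicolon = match.group(1), match.group(2)
        if semicolon and name + ";" in html5:
            return html5[name + ";"]  # properly terminated reference
        # Look for the longest legacy name that is a prefix of the run.
        for end in range(len(name), 0, -1):
            prefix = name[:end]
            if prefix not in html5:
                continue
            rest = name[end:] + semicolon
            next_char = rest[:1] or value[match.end():match.end() + 1]
            if next_char == "=" or (next_char.isascii() and next_char.isalnum()):
                return match.group(0)  # "unconsume": keep the text as written
            return html5[prefix] + rest
        return match.group(0)  # not a known entity name

    return _NAMED_REF.sub(replace, value)


print(ascii(unescape_attr_value("go?t=buy&currency=usd")))
print(ascii(unescape_attr_value("go?t=buy&amp;currency=usd")))
```

With this sketch both calls print 'go?t=buy&currency=usd': the first because "&curren" is followed by a letter, the second because "&amp;" is terminated and decodes to "&".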
Also note that <a href="go?t=buy&currency=usd">foo</a> is not valid HTML; the & should have been escaped as &amp;.
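
As a brief illustration of the escaping point, a page author generating such a link could escape the attribute value with html.escape before writing it into the markup. The variable names here are illustrative.

```python
import html

raw_url = "go?t=buy&currency=usd"

# quote=True also escapes quotation marks, which matters inside attributes.
escaped = html.escape(raw_url, quote=True)
link = '<a href="{}">foo</a>'.format(escaped)

print(link)
# Decoding the escaped value gives the intended URL back.
print(html.unescape(escaped) == raw_url)
```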
History
Date User Action Args
2022-04-11 14:58:21 admin set github: 69426
2015-09-26 17:06:12 ezio.melotti set assignee: ezio.melotti; stage: test needed; messages: + msg251657; versions: + Python 2.7, Python 3.5, Python 3.6
2015-09-26 16:59:49 serhiy.storchaka set nosy: + ezio.melotti
2015-09-26 16:46:39 frogcoder create