classification
Title: HTMLParser fails to handle some characters in the starttag
Type: behavior
Stage: resolved
Components:
Versions: Python 3.4, Python 3.3, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: ezio.melotti, python-dev, r.david.murray
Priority: normal
Keywords: patch

Created on 2013-11-02 16:57 by ezio.melotti, last changed 2013-11-07 16:36 by ezio.melotti. This issue is now closed.

Files
File name        Uploaded                        Description
starttag.diff    ezio.melotti, 2013-11-02 16:57  Patch against 3.3.
starttag27.diff  ezio.melotti, 2013-11-02 17:11  Patch against 2.7.
Messages (7)
msg201980 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-11-02 16:57
HTMLParser fails to handle some characters in the starttag. For example, for <a$b> it should report 'a$b' as the tag name, but it currently stops at the $.  The attached patch fixes the issue.
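A minimal sketch of how the reported behavior can be checked, assuming Python 3 (where the module is html.parser) and an interpreter that already includes the fix. TagCollector and collect_start_tags are hypothetical names used only for this illustration; they are not part of the patch.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper that records every start-tag name it is given."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser calls this once per start tag, passing the parsed name.
        self.tags.append(tag)


def collect_start_tags(markup):
    """Feed markup to a fresh parser and return the start-tag names seen."""
    parser = TagCollector()
    parser.feed(markup)
    parser.close()
    return parser.tags


if __name__ == "__main__":
    # With the fix applied, the whole name 'a$b' should be reported
    # instead of the parser stopping at the '$'.
    print(collect_start_tags("<a$b>"))
```

On an interpreter without the fix, the reported name would be expected to differ, which is the behavior this issue describes.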
msg201983 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-11-02 17:11
Attached a patch for 2.7.  The patch removes a "public" name/regex (tagfind_tolerant).  The name is not documented and is supposed to be private, like all the other top-level names in HTMLParser. Even creating an alias with tagfind_tolerant = tagfind won't work, because the groups in the regex changed.
msg201986 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2013-11-02 20:54
In the maintenance releases you should leave tagfind_tolerant defined with its old value, with a comment that it is internal, no longer used, and has been removed in 3.4.
msg202119 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-11-04 11:07
For 2.7 that sounds like a reasonable option. For 3.3/3.4, however, I'm keeping the name but changing the regex groups, so it might break if someone is using it with groups.  In theory I could add a third name and leave that unchanged, but I'm not sure it's worth it.
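The concern about changed groups can be illustrated with a small sketch. The two patterns below are simplified stand-ins chosen for this example, not the actual values of tagfind or tagfind_tolerant in any release, and the helper names are hypothetical. The point is only that a caller relying on group() or group(1) sees different results once a capturing group is added and the pattern consumes more text.

```python
import re

# Illustrative patterns only; these are NOT the regexes shipped in HTMLParser.
# OLD_STYLE has no capturing group and matches just the tag name.
OLD_STYLE = re.compile(r'[a-zA-Z][^\t\n\r\f />]*')
# NEW_STYLE captures the name in group 1 and also consumes trailing whitespace.
NEW_STYLE = re.compile(r'([a-zA-Z][^\t\n\r\f />]*)\s*')


def whole_match(pattern, text):
    """Return the full match, as code written against OLD_STYLE might do."""
    m = pattern.match(text)
    return m.group() if m else None


def first_group(pattern, text):
    """Return group 1, as code written against NEW_STYLE would do."""
    m = pattern.match(text)
    return m.group(1) if m else None


if __name__ == "__main__":
    text = 'a$b  href="x"'
    print(repr(whole_match(OLD_STYLE, text)))  # the bare name
    print(repr(whole_match(NEW_STYLE, text)))  # the name plus trailing spaces
    print(repr(first_group(NEW_STYLE, text)))  # the bare name again
    try:
        first_group(OLD_STYLE, text)
    except IndexError as exc:
        # A pattern without groups cannot serve callers that ask for group 1.
        print("OLD_STYLE has no group 1:", exc)
```

Under these assumptions, a plain alias between two such patterns would silently change what existing callers get back, which is why keeping the old value under the old name was suggested for the maintenance branch.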
msg202121 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2013-11-04 11:38
Then for 3.3 you are bug fixing the regex, so anyone using it ought to want the change, right? :)

If the same is true for 2.7, then creating an alias would probably be fine.  I haven't looked at the details, so I'll leave it to your judgment.  I just don't want the name going away in 2.7, since that might gratuitously break someone's code.
msg202361 - Author: Roundup Robot (python-dev) Date: 2013-11-07 16:35
New changeset 695f988824bb by Ezio Melotti in branch '2.7':
#19480: HTMLParser now accepts all valid start-tag names as defined by the HTML5 standard.
http://hg.python.org/cpython/rev/695f988824bb

New changeset 9b9d188ed549 by Ezio Melotti in branch '3.3':
#19480: HTMLParser now accepts all valid start-tag names as defined by the HTML5 standard.
http://hg.python.org/cpython/rev/9b9d188ed549

New changeset 7d8a37020db9 by Ezio Melotti in branch 'default':
#19480: merge with 3.3.
http://hg.python.org/cpython/rev/7d8a37020db9
msg202362 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2013-11-07 16:36
Fixed, thanks for the feedback!
History
Date                 User            Action  Args
2013-11-07 16:36:51  ezio.melotti    set     status: open -> closed; resolution: fixed; messages: + msg202362; stage: commit review -> resolved
2013-11-07 16:35:57  python-dev      set     nosy: + python-dev; messages: + msg202361
2013-11-04 11:38:13  r.david.murray  set     messages: + msg202121
2013-11-04 11:07:13  ezio.melotti    set     messages: + msg202119
2013-11-02 20:54:38  r.david.murray  set     messages: + msg201986
2013-11-02 17:11:24  ezio.melotti    set     files: + starttag27.diff; nosy: + r.david.murray; messages: + msg201983; stage: patch review -> commit review
2013-11-02 16:57:52  ezio.melotti    create