Classification
Title: Handling of broken comments in HTMLParser
Type: behavior
Stage: committed/rejected
Components: Library (Lib)
Versions: Python 3.3, Python 3.2, Python 2.7
Process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: eric.araujo, ezio.melotti, python-dev
Priority: normal
Keywords: patch

Created on 2012-02-07 11:56 by ezio.melotti, last changed 2012-02-13 14:14 by ezio.melotti. This issue is now closed.

Files
File name  Uploaded  Description
issue13960.diff  ezio.melotti, 2012-02-07 11:56  Patch against 3.2
Messages (8)
msg152806 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2012-02-07 11:56
html.parser fails to handle the following invalid comments:
<! foo >
<! bar -->
<! -- baz -->
The attached patch follows the HTML5 specs [0], and parses them as "bogus comments".  Currently the patch fixes the problem only when strict=False, but it might be better to make this the default behavior and apply it to 2.7 too.

[0]: http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
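To make the report concrete, here is a minimal sketch of how the three inputs above can be fed to the parser. It assumes a current Python 3, where the fix is in place and the `strict` argument no longer exists (it was removed in 3.5). `CommentCollector` and `collect_comments` are hypothetical names used only for illustration; they are not part of the attached patch.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: records the text of every comment it sees."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called for regular comments and for "bogus comments" alike.
        self.comments.append(data)


def collect_comments(markup):
    """Parse markup and return the list of comment strings reported."""
    parser = CommentCollector()
    parser.feed(markup)
    parser.close()
    return parser.comments


# The three broken comments from the report, concatenated.
broken = "<! foo ><! bar --><! -- baz -->"
for text in collect_comments(broken):
    print(repr(text))
```

With the bogus-comment handling in place, each input should be reported through `handle_comment` with everything between `<!` and the next `>` as its text, so stray `--` sequences stay part of the comment body.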
msg152861 - Author: Éric Araujo (eric.araujo) (Python committer) Date: 2012-02-08 14:28
LGTM.  What did our last discussion about following HTML5 rules for Python 2.7 lead to?  I don’t remember if we agreed that “3.3 is soon enough” or “let’s fix the bugs with HTML5 as reference”.
msg152869 - Author: Éric Araujo (eric.araujo) (Python committer) Date: 2012-02-08 15:30
After reading some emails again, I’m +1 on porting the fixes to 2.7.

1) We agree that HTML5 is the reference specification.

2) I don’t think any sane code would break if a previously unparsable page became parsable. HTML parsers could be an exception, but the obvious example, BeautifulSoup, does not use HTMLParser. HTMLParser is not a validating parser and never made any guarantee about the validity of the pages it handles.

3) Most people should be happy to have more pages handled by HTMLParser.

4) 2.7 is a special case: it is the last 2.x release and has long-term support.
msg153032 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2012-02-10 08:10
I'll fix this for 3.x non-strict and then see if it can be backported to 2.7 (there are still other fixes that should be backported to 2.7 before this can be applied).
msg153035 - Author: Roundup Robot (python-dev) Date: 2012-02-10 08:51
New changeset 242b697449d8 by Ezio Melotti in branch '3.2':
#13960: HTMLParser is now able to handle broken comments when strict=False.
http://hg.python.org/cpython/rev/242b697449d8

New changeset 44366541dd86 by Ezio Melotti in branch 'default':
#13960: merge with 3.2.
http://hg.python.org/cpython/rev/44366541dd86
msg153036 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2012-02-10 08:52
This is now fixed in 3.2/3.3; I'll wait for the 2.7 backport before closing it.
On a side note, the empty <!> comment doesn't seem to be valid in HTML5.
HTMLParser just ignores it, and doesn't report it as an empty comment (so this should be fine).
msg153271 - Author: Roundup Robot (python-dev) Date: 2012-02-13 14:10
New changeset 333e3acf2008 by Ezio Melotti in branch '2.7':
#13960: HTMLParser is now able to handle broken comments.
http://hg.python.org/cpython/rev/333e3acf2008
msg153272 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2012-02-13 14:14
I have now backported this to 2.7, together with some improvements in the handling of declarations that I committed on 3.2 (4c4ff9fd19b6) and 3.3 (06a6fed0da56).
Apparently <!> is not a valid comment in HTML5, but it is considered a bogus comment and should still emit a "comment" with no content.  This is now fixed too.
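The `<!>` case can be checked with a short sketch like the one below. It assumes a current Python 3 with the default parser settings; `EventLogger` and `parse_events` are hypothetical names chosen for this illustration and do not come from the committed changesets.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: logs comment and text events in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_comment(self, data):
        self.events.append(("comment", data))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Parse markup and return the ordered list of logged events."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


# "<!>" sits between two pieces of text so its position is visible.
events = parse_events("a<!>b")
print(events)
```

If `<!>` is treated as a bogus comment, the log should contain a `("comment", "")` entry between the two text events, rather than the construct being dropped silently.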
History
Date  User  Action  Args
2012-02-13 14:14:59  ezio.melotti  set  status: open -> closed; resolution: fixed; messages: + msg153272; stage: patch review -> committed/rejected
2012-02-13 14:10:58  python-dev  set  messages: + msg153271
2012-02-10 08:52:03  ezio.melotti  set  messages: + msg153036
2012-02-10 08:51:06  python-dev  set  nosy: + python-dev; messages: + msg153035
2012-02-10 08:10:04  ezio.melotti  set  messages: + msg153032
2012-02-08 15:30:34  eric.araujo  set  messages: + msg152869
2012-02-08 14:28:56  eric.araujo  set  messages: + msg152861
2012-02-08 12:19:53  ezio.melotti  set  assignee: ezio.melotti
2012-02-07 11:56:35  ezio.melotti  create