classification
Title: Handling of broken condcoms in HTMLParser
Type: enhancement Stage: resolved
Components: Library (Lib) Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: eric.araujo, ezio.melotti, python-dev
Priority: normal Keywords: patch

Created on 2011-12-11 01:58 by ezio.melotti, last changed 2011-12-19 05:46 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description
issue13576.diff ezio.melotti, 2011-12-11 01:58 Tests against 3.2.
Messages (2)
msg149204 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-12-11 01:58
The attached patch adds a few tests about the handling of broken conditional comments (condcoms).
A valid condcom looks like <!--[if ie 6]>...<![endif]-->.
An invalid one looks like <![if ie 6]>...<![endif]>.
This seems to be a common mistake, and it is found even on popular sites such as Adobe, LinkedIn, and deviantART.

Currently, HTMLParser calls unknown_decl() passing e.g. 'if ie 6', and if strict=True an error is raised.  With strict=False no error is raised and the unknown declaration is ignored.
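
To see which callback a given Python version actually uses, a minimal sketch along these lines may help. The class name `CondcomRecorder` and the helper `events_for` are made up for illustration and are not part of the patch. It uses the Python 3 module name (on 2.7 the module is `HTMLParser`) and does not pass `strict`, since that argument was removed in later 3.x releases. Because the callback for a broken condcom may differ between releases, the sketch records both unknown_decl() and handle_comment() calls instead of assuming one.

```python
from html.parser import HTMLParser


class CondcomRecorder(HTMLParser):
    """Hypothetical helper: record how declarations, comments and text are reported."""

    def __init__(self):
        super().__init__()
        self.events = []

    def unknown_decl(self, data):
        # Called for '<![...]>' constructs on the versions discussed in this issue.
        self.events.append(('decl', data))

    def handle_comment(self, data):
        # Called for '<!--...-->' comments (and for bogus comments, if the
        # parser follows the HTML5 tokenizer for '<![...]>').
        self.events.append(('comment', data))

    def handle_data(self, data):
        self.events.append(('data', data))


def events_for(html):
    """Feed `html` to a fresh recorder and return the list of recorded events."""
    parser = CondcomRecorder()
    parser.feed(html)
    parser.close()
    return parser.events


if __name__ == '__main__':
    print(events_for('<!--[if ie 6]>...<![endif]-->'))  # valid condcom
    print(events_for('<![if ie 6]>...<![endif]>'))      # broken condcom
```

In both cases the '...' between the two markers of the broken condcom is parsed normally; only the callback used for the markers themselves is in question.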

The HTML5 specs say:
"""
[After '<!',] If the next two characters are both U+002D HYPHEN-MINUS characters (-), consume those two characters, [...]
Otherwise, this is a parse error. Switch to the bogus comment state.[0]

[Once in the bogus comment state,] Consume every character up to and including the first U+003E GREATER-THAN SIGN character (>) or the end of the file (EOF), whichever comes first. Emit a comment token whose data is the concatenation of all the characters starting from and including the character that caused the state machine to switch into the bogus comment state, up to and including the character immediately before the last consumed character (i.e. up to the character just before the U+003E or EOF character), but with any U+0000 NULL characters replaced by U+FFFD REPLACEMENT CHARACTER characters. (If the comment was started by the end of the file (EOF), the token is empty.)[1]
"""

So, IIUC, '<![if ie 6]>...<![endif]>' should emit a '[if ie 6]' comment, parse the '...' normally, and emit a '[endif]' comment.
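
As a rough illustration of the bogus comment state quoted above (a standalone sketch of my reading of the spec text, not code from HTMLParser; the function name `bogus_comment` is hypothetical):

```python
def bogus_comment(text, start):
    """Consume a bogus comment beginning at text[start].

    `start` is the index of the character that caused the switch to the
    bogus comment state, i.e. the first character after '<!'.
    Returns (comment_data, index_of_first_unconsumed_character).
    """
    end = text.find('>', start)
    if end == -1:
        # No '>' before EOF: the comment runs to the end of the input.
        data, next_index = text[start:], len(text)
    else:
        # The '>' is consumed but is not part of the comment data.
        data, next_index = text[start:end], end + 1
    # U+0000 NULL characters are replaced by U+FFFD REPLACEMENT CHARACTER.
    return data.replace('\x00', '\ufffd'), next_index


if __name__ == '__main__':
    source = '<![if ie 6]>...<![endif]>'
    data, pos = bogus_comment(source, 2)  # index 2 is the '[' after '<!'
    print(repr(data), pos)
```

Under this reading, the opening marker yields the comment data '[if ie 6]' and tokenization resumes at the '...'.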

However I think it's fine to leave the current behavior for the following reasons:
  1) backward compatibility;
  2) handling broken condcoms in unknown_decl is easier than doing it in handle_comment, where all the other comments are sent;
  3) probably no one cares about them anyway.

[0]: http://www.w3.org/TR/html5/tokenization.html#markup-declaration-open-state
[1]: http://www.w3.org/TR/html5/tokenization.html#bogus-comment-state
msg149819 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-12-19 05:36
New changeset 9c60fd12664f by Ezio Melotti in branch '2.7':
#13576: add tests about the handling of (possibly broken) condcoms.
http://hg.python.org/cpython/rev/9c60fd12664f

New changeset 4ddbb756b602 by Ezio Melotti in branch '3.2':
#13576: add tests about the handling of (possibly broken) condcoms.
http://hg.python.org/cpython/rev/4ddbb756b602

New changeset 6452edbc5f12 by Ezio Melotti in branch 'default':
#13576: merge with 3.2.
http://hg.python.org/cpython/rev/6452edbc5f12
History
Date User Action Args
2011-12-19 05:46:34 ezio.melotti set status: open -> closed
type: behavior -> enhancement
resolution: fixed
stage: commit review -> resolved
2011-12-19 05:36:13 python-dev set nosy: + python-dev
messages: + msg149819
2011-12-11 01:58:33 ezio.melotti create