classification
Title: Handling of broken markup in HTMLParser on 2.7
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: benjamin.peterson, eli.bendersky, eric.araujo, ezio.melotti, python-dev, r.david.murray
Priority: normal Keywords: patch

Created on 2012-02-10 13:45 by ezio.melotti, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
issue13987.diff ezio.melotti, 2012-02-10 13:45 First patch against 2.7.
Messages (5)
msg153043 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-02-10 13:45
The attached patch fixes a few problems with HTMLParser on 2.7.
Instead of raising an error when invalid markup is detected, the parser now consumes the invalid input and proceeds.  This patch is a partial backport of #1486713.

After this, two more patches will follow.
The first will get rid of errors raised while parsing declarations and should also solve #13576:
     def unknown_decl(self, data):
-        self.error("unknown declaration: %r" % (data,))
+        pass

The second will take care of "bogus comments" (see #13960).

Once this is done, HTMLParser should be able to parse (almost) everything.  I'm planning to commit this before the release of 2.7.3.
msg153100 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2012-02-11 05:28
LGTM, http://shipitsquirrel.github.com/
msg153398 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-02-15 10:44
New changeset 11a31eb5da93 by Ezio Melotti in branch '2.7':
#13987: HTMLParser is now able to handle EOFs in the middle of a construct.
http://hg.python.org/cpython/rev/11a31eb5da93
msg153399 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2012-02-15 11:19
New changeset 3d7904e3f4b9 by Ezio Melotti in branch '2.7':
#13987: HTMLParser is now able to handle malformed start tags.
http://hg.python.org/cpython/rev/3d7904e3f4b9
msg153400 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-02-15 11:27
This should be fixed now.
The first two chunks of the attached patch have been committed in the two changesets linked in the previous messages.  The third chunk about the end tag has been fixed as part of #13933.  The error previously raised by unknown_decl has been removed in 4743a3a1e669.  More fixes have been backported as part of #13960.
2.7 should now behave like 3.2 in non-strict mode.
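
For illustration only, here is a minimal sketch of the lenient behavior described in this issue: markup that ends in the middle of a construct is consumed instead of raising an error. It is not part of the attached patch. It assumes Python 3, where the module is html.parser and parsing is non-strict; on 2.7 the module is named HTMLParser. The EventCollector class and the parse helper are hypothetical names made up for this example, and how the unfinished tail is reported may differ between Python versions.

```python
from html.parser import HTMLParser  # on Python 2.7: from HTMLParser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper that records the events the parser emits."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()  # forces processing of any buffered, unfinished input
    return parser.events


if __name__ == "__main__":
    # The input ends in the middle of a start tag (EOF inside a construct).
    print(parse('<p>ok</p><a href="x'))
```

The well-formed prefix is reported as usual, and the truncated tail does not abort parsing.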
History
Date User Action Args
2022-04-11 14:57:26 admin set github: 58195
2012-02-15 11:27:15 ezio.melotti set status: open -> closed; resolution: fixed; messages: + msg153400; stage: patch review -> resolved
2012-02-15 11:19:30 python-dev set messages: + msg153399
2012-02-15 10:44:35 python-dev set nosy: + python-dev; messages: + msg153398
2012-02-11 05:28:41 eric.araujo set messages: + msg153100
2012-02-10 13:46:39 eli.bendersky set nosy: + eli.bendersky
2012-02-10 13:45:58 ezio.melotti create