
classification
Title: HTMLParser can't handle erroneous end tags with additional info in them
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2, Python 3.3
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, r.david.murray, ritave
Priority: normal Keywords:

Created on 2012-04-05 11:51 by ritave, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
minimal.py ritave, 2012-04-05 11:51 Minimal example that shows the non-ideal behavior
Messages (5)
msg157570 - Author: Olaf Tomalka (ritave) Date: 2012-04-05 11:51
Although this is malformed HTML, I found an example like it on a real website. All browsers handle the bad tag gracefully, but the Python HTML parser raises an exception with "bad end tag". I think the additional info in the end tag should be ignored, no exception should be raised, and the rest of the page should be parsed.
I'm attaching a minimal example.
msg157582 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2012-04-05 13:02
Which version of Python did you test with? There have been several improvements to HTML parsing recently.
msg157583 - Author: Olaf Tomalka (ritave) Date: 2012-04-05 13:04
Python 3.2.2, which is the latest version on Arch Linux.
msg157585 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2012-04-05 13:08
I just tested your script on 3.2.3a2+, and it raises an error. Ezio made the other parsing changes, so I'll leave it to him to evaluate what, if anything, should be done here.
msg157601 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2012-04-05 16:11
This is already fixed, but only in non-strict mode (and in 3.2.3, IIRC).
You should always use HTMLParser(strict=False). The non-strict mode will probably become the default, and strict=True will be deprecated.
Thanks anyway for the report, and please report any failures you find with strict=False.
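For readers on a current Python: the strict argument was deprecated in 3.3 and removed in 3.5, so HTMLParser() with no arguments now gives the lenient behavior recommended above, and passing strict=False raises a TypeError. The following is a minimal sketch of how that behavior can be checked. The EventCollector class, the collect_events helper, and the SAMPLE markup are hypothetical names and input made up for illustration; they are not the contents of the attached minimal.py.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record parser callbacks so the handling of a bad end tag is visible."""

    def __init__(self):
        # No strict argument here: it was removed in Python 3.5, where the
        # lenient (formerly "non-strict") behavior is the only mode.
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def collect_events(markup):
    """Parse markup and return the list of recorded (kind, value) events."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# Hypothetical malformed input: the end tag </p ...> carries extra info.
SAMPLE = '<p>first</p extra="info"><b>second</b>'

if __name__ == "__main__":
    for event in collect_events(SAMPLE):
        print(event)
```

If the parser behaves as described in this issue, the extra text inside the end tag is dropped, the tag is reported as a plain end tag, and the markup after it is still parsed.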
History
Date User Action Args
2022-04-11 14:57:28 admin set github: 58711
2012-04-05 16:11:49 ezio.melotti set status: open -> closed; resolution: not a bug; messages: + msg157601; stage: resolved
2012-04-05 13:08:32 r.david.murray set messages: + msg157585; versions: + Python 3.3
2012-04-05 13:04:59 ritave set messages: + msg157583
2012-04-05 13:02:04 r.david.murray set nosy: + ezio.melotti, r.david.murray; messages: + msg157582
2012-04-05 12:28:18 ritave set title: HTMLParser can't handle erronous end tags with additional tags in it -> HTMLParser can't handle erronous end tags with additional info in them
2012-04-05 11:51:41 ritave create