
classification
Title: HTMLParser improperly handling open tags when strict is False
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2, Python 3.3
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: Christopher.Allen-Poole, ezio.melotti, python-dev
Priority: normal Keywords: patch

Created on 2011-10-27 07:56 by Christopher.Allen-Poole, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
issue13273.diff ezio.melotti, 2011-10-27 14:49
Messages (5)
msg146479 - Author: Christopher Allen-Poole (Christopher.Allen-Poole) Date: 2011-10-27 07:56
This is encountered when extending html.parser.HTMLParser and running with strict set to False.

Expected behavior:
When '''<div style=""    ><b>The <a href="some_url">rain</a> <br /> in <span>Spain</span></b></div>''' is passed to the feed method, div, b, a, br, and span should all be passed to the handle_starttag method.

Actual behavior:
The handle_data method receives the values <div style=""    >,<b>,<a href="some_url">,<br />,<span> in addition to the regular text.
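
A minimal sketch of a reproduction for the expected behavior described above. The TagCollector class is a hypothetical helper, not part of the report. It assumes a current Python 3 release, where the strict argument no longer exists and HTMLParser always parses tolerantly, so it shows the fixed behavior rather than the bug.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Hypothetical helper: records start tags and text data separately."""

    def __init__(self):
        # No strict argument here: it was removed from later Python 3 releases.
        super().__init__()
        self.start_tags = []
        self.data = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_data(self, data):
        self.data.append(data)


parser = TagCollector()
parser.feed('<div style=""    ><b>The <a href="some_url">rain</a> '
            '<br /> in <span>Spain</span></b></div>')
parser.close()

print(parser.start_tags)
print(parser.data)
```

With the fix in place, div, b, a, br, and span should all show up in start_tags, and no markup should leak into data.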

This can be fixed by changing this (inside the parse_starttag method):

m = hparse.attrfind_tolerant.search(rawdata, k)

to

m = hparse.attrfind_tolerant.match(rawdata, k)
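
To illustrate why the choice between search and match matters, here is a rough sketch using a simplified stand-in pattern. The regex below is an assumption made for illustration and is not the attrfind_tolerant pattern from html.parser. Both methods accept a start position, but search keeps scanning forward until it finds a match, while match only succeeds if the pattern matches exactly at that position.

```python
import re

# Simplified stand-in for an attribute pattern (an assumption for illustration,
# NOT the real attrfind_tolerant regex used by html.parser).
attr = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)'          # attribute name
    r'(\s*=\s*("[^"]*"|\'[^\']*\'|[^\s>]+))?'  # optional value
)

rawdata = '<b>The <a href="some_url">rain</a>'
k = 2  # position just after the tag name in "<b", i.e. at the closing ">"

# search() scans ahead from k and picks up text outside the tag.
found = attr.search(rawdata, k)

# match() is anchored at k, so it fails when no attribute starts there.
anchored = attr.match(rawdata, k)

print(found.group(1) if found else None)
print(anchored)
```

In this sketch search treats the word "The" after the tag as if it were an attribute name, while match finds nothing at position k, which is the behavior wanted for a tag without attributes.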
msg146481 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-27 08:31
Incidentally I was just investigating this very same issue, and your suggestion seems to work for me too.
I'll see if the change has any downside and come up with a patch + test.
Thanks for the report!
msg146490 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-27 14:49
The attached patch replaces search with match as you suggested and tweaks a regex to make the old tests pass.
msg146550 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-10-28 10:24
New changeset 41d41776aa6d by Ezio Melotti in branch '3.2':
#13273: fix a bug that prevented HTMLParser to properly detect some tags when strict=False.
http://hg.python.org/cpython/rev/41d41776aa6d

New changeset b194117f176c by Ezio Melotti in branch 'default':
#13273: merge with 3.2.
http://hg.python.org/cpython/rev/b194117f176c
msg146552 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-10-28 10:27
Fixed, thanks a lot for the report!
History
Date User Action Args
2022-04-11 14:57:23 admin set github: 57482
2011-10-28 10:27:48 ezio.melotti set status: open -> closed
  versions: - Python 2.7
  messages: + msg146552
  resolution: fixed
  stage: commit review -> resolved
2011-10-28 10:24:13 python-dev set nosy: + python-dev
  messages: + msg146550
2011-10-27 14:49:41 ezio.melotti set files: + issue13273.diff
  versions: + Python 2.7, Python 3.3
  messages: + msg146490
  keywords: + patch
  stage: test needed -> commit review
2011-10-27 08:31:15 ezio.melotti set assignee: ezio.melotti
  messages: + msg146481
2011-10-27 08:15:13 ezio.melotti set nosy: + ezio.melotti
  stage: test needed
2011-10-27 07:56:01 Christopher.Allen-Poole create