improve HTMLParser attribute processing regexps #38664

smroid · 2003-06-17T03:09:17Z

BPO	755670
Nosy	@ezio-melotti
Dependencies	bpo-683938: HTMLParser attribute parsing bug
Files	diff.txt issue755670.diff: Failing test

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/ezio-melotti'
closed_at = <Date 2011-11-14.17:07:52.150>
created_at = <Date 2003-06-17.03:09:17.000>
labels = ['type-bug', 'library']
title = 'improve HTMLParser attribute processing regexps'
updated_at = <Date 2011-11-14.17:07:52.148>
user = 'https://bugs.python.org/smroid'

bugs.python.org fields:

activity = <Date 2011-11-14.17:07:52.148>
actor = 'ezio.melotti'
assignee = 'ezio.melotti'
closed = True
closed_date = <Date 2011-11-14.17:07:52.150>
closer = 'ezio.melotti'
components = ['Library (Lib)']
creation = <Date 2003-06-17.03:09:17.000>
creator = 'smroid'
dependencies = ['683938']
files = ['5384', '23582']
hgrepos = []
issue_num = 755670
keywords = ['patch']
message_count = 11.0
messages = ['44032', '44033', '44034', '44035', '44036', '44037', '55984', '114245', '146786', '147611', '147619']
nosy_count = 5.0
nosy_names = ['titus', 'smroid', 'timtoo', 'ezio.melotti', 'python-dev']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue755670'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

smroid · 2003-06-17T03:09:17Z

HTML examples seen in the wild that cause parse errors
in HTMLParser include:

<a width="100%"cellspacing=0>
-- note lack of space between val and next attr name

<a foo=>
-- trailing attribute has no value after =

<a href=javascript:popup('/popup/html.html')>
-- javascript fragment with embedded quotes

My patch contains improvements to the 'attrfind' and
'locatestarttagend' regexps that allow these examples
to parse.

The existing test_htmlparser.py unit test continues to
pass, except for the one test case where it considers
<a foo=> to be an error.

I commented out that case and added new test cases to
cover the examples above.

smroid · 2003-06-17T03:10:57Z

Logged In: YES
user_id=159908

Base version for HTMLParser.py is 1.11.2.1; base version for
test_htmlparser.py is 1.8.8.1

smroid · 2003-06-18T03:22:20Z

Logged In: YES
user_id=159908

This also fixes bugs 683938 and 699079.

titus · 2004-12-19T00:42:48Z

Logged In: YES
user_id=23486

This patch allows developers to override the behavior of HTMLParser
when parsing malformed HTML. Normally HTMLParser calls the function
self.error(), which raises an exception. This patch adds appropriate
return values for situations where self.error has been redefined in
subclasses to *not* raise an exception.

It does not change the default behavior of HTMLParser and so presents
no backwards compatibility issues.

The patch itself consists of an added comment and two added lines of
code that call 'return' with appropriate values after a self.error call.
Nothing wrong with 'em. I can't verify that the "junk characters" error
call will leave the parser in a good state, though, if execution returns
from error().

The library documentation could be updated to reflect the ability to
override
error() behavior; I've written a short patch, available at

http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch

More problems exist with markupbase.py, upon which HTMLParser is
based.
markupbase calls error() as well, and has some stickier situations. See
comments in bug 917188 as well.

Comments in 683938 and 699079 suggest that raising an exception is the
correct response to the parse errors. I recommend application of the
patch anyway, because it (a) doesn't change any behavior by default
and (b) may solve some problems for people.

An alternative would be to distinguish between unrecoverable errors
and recoverable errors by having two different functions, e.g. error()
(for
recoverable errors) and _fail() (for unrecoverable errors). By default
error() would call _fail() and internal code could be changed to call
_fail() where recovery is impossible. This might alter behavior in
situations where subclasses override error() but then again that's not
legitimate to do anyway, at least not at the moment -- error() isn't
in the docs ;).

If nothing done, at least close patch 755660 and bug 736428 with a
comment saying that this behavior will not be addressed ;).

titus · 2004-12-19T00:45:12Z

Logged In: YES
user_id=23486

whoops, attached to wrong patch! dangitall. sorry...

titus · 2004-12-19T06:58:59Z

Logged In: YES
user_id=23486

I don't think HTMLParser should parse clearly invalid HTML without
complaining. Perhaps such behavior should be overrideable (see patch
755660 & comments therein), but this patch changes behavior to parse
invalid HTML w/o complaint.

Patch 699079, to "fix" similar behavior, was closed by adsr and bcannon
because such behavior is correct. Patch 683938 is also similar but is
being kept open for some reason.

Recommend closing patch w/o applying.

timtoo · 2007-09-17T21:20:47Z

I for one thank smroid for the patch. I also have hit *all* of these
cases in the wild. This patch makes real-life a lot less frustrating.
This patch is surely a lot more preferable than HTMLParser's tendency
to just throw up its hands and quietly give up. And one might even
make a case that in the horrific world of HTML "standards" case #2 and
#3 can be considered actually valid (as much as it hurts to say so).

BreamoreBoy · 2010-08-18T16:32:11Z

There are messages both for and against the patch which contains a unit test. Can we have a statement from a knowledgeable HTML person as to whether the patch should be accepted or rejected.

ezio-melotti · 2011-11-01T16:11:55Z

Attached patch includes the tests in diff.txt. On Python 3, with strict=False, the first test (adjacent attributes) passes, but the other two still fail.
See also bpo-12629.

python-dev · 2011-11-14T16:57:14Z

New changeset 3c3009f63700 by Ezio Melotti in branch '2.7':
bpo-1745761, bpo-755670, bpo-13357, bpo-12629, bpo-1200313: improve attribute handling in HTMLParser.
http://hg.python.org/cpython/rev/3c3009f63700

New changeset 16ed15ff0d7c by Ezio Melotti in branch '3.2':
bpo-1745761, bpo-755670, bpo-13357, bpo-12629, bpo-1200313: improve attribute handling in HTMLParser.
http://hg.python.org/cpython/rev/16ed15ff0d7c

New changeset 426f7a2b1826 by Ezio Melotti in branch 'default':
bpo-1745761, bpo-755670, bpo-13357, bpo-12629, bpo-1200313: merge with 3.2.
http://hg.python.org/cpython/rev/426f7a2b1826

ezio-melotti · 2011-11-14T17:07:52Z

Fixed, thanks for the report!

smroid mannequin added stdlib Python modules in the Lib dir labels Jun 17, 2003

devdanzin mannequin added type-feature A feature request or enhancement labels Feb 12, 2009

ezio-melotti added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Nov 1, 2011

ezio-melotti self-assigned this Nov 14, 2011

ezio-melotti closed this as completed Nov 14, 2011

ezio-melotti transferred this issue from another repository Apr 9, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve HTMLParser attribute processing regexps #38664

improve HTMLParser attribute processing regexps #38664

smroid mannequin commented Jun 17, 2003

smroid mannequin commented Jun 17, 2003

smroid mannequin commented Jun 17, 2003

smroid mannequin commented Jun 18, 2003

titus mannequin commented Dec 19, 2004

titus mannequin commented Dec 19, 2004

titus mannequin commented Dec 19, 2004

timtoo mannequin commented Sep 17, 2007

BreamoreBoy mannequin commented Aug 18, 2010

ezio-melotti commented Nov 1, 2011

python-dev mannequin commented Nov 14, 2011

ezio-melotti commented Nov 14, 2011