Issue 755670: improve HTMLParser attribute processing regexps


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer's Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/38664

classification

Title:	improve HTMLParser attribute processing regexps
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.2, Python 3.3, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:	683938	Superseder:
Assigned To:	ezio.melotti	Nosy List:	ezio.melotti, python-dev, smroid, timtoo, titus
Priority:	normal	Keywords:	patch

Created on 2003-06-17 03:09 by smroid, last changed 2022-04-10 16:09 by admin. This issue is now closed.

Files
File name	Uploaded	Description
diff.txt	smroid, 2003-06-17 03:09
issue755670.diff	ezio.melotti, 2011-11-01 16:11	Failing test

Messages (11)
msg44032 - (view)	Author: Steven Rosenthal (smroid)	Date: 2003-06-17 03:09
HTML examples seen in the wild that cause parse errors in HTMLParser include:

<a width="100%"cellspacing=0> -- note lack of space between val and next attr name

<a foo=> -- trailing attribute has no value after =

<a href=javascript:popup('/popup/html.html')> -- javascript fragment with embedded quotes

My patch contains improvements to the 'attrfind' and 'locatestarttagend' regexps that allow these examples to parse. The existing test_htmlparser.py unit test continues to pass, except for the one test case where it considers <a foo=> to be an error. I commented out that case and added new test cases to cover the examples above.
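For readers who want to see how a current interpreter treats the three examples above, here is a minimal sketch using the Python 3 standard library module html.parser. The names StartTagCollector and parse_start_tags are illustrative helpers and are not part of the patch or of the library; the 2003 report itself concerned the Python 2 HTMLParser module, so behavior on old versions will differ.

```python
from html.parser import HTMLParser


class StartTagCollector(HTMLParser):
    """Record every start tag and its attribute list as it is parsed.

    Hypothetical helper for illustration; not part of the patch.
    """

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by HTMLParser.
        self.start_tags.append((tag, attrs))


def parse_start_tags(markup):
    """Return the (tag, attrs) pairs that HTMLParser reports for markup."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.start_tags


# The three examples from the report.
samples = [
    '<a width="100%"cellspacing=0>',
    "<a foo=>",
    "<a href=javascript:popup('/popup/html.html')>",
]

for sample in samples:
    print(sample, "->", parse_start_tags(sample))
```

On an interpreter that includes the 2011 fix, each sample should be reported as a single start tag rather than raising a parse error.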
msg44033 - (view)	Author: Steven Rosenthal (smroid)	Date: 2003-06-17 03:10
Base version for HTMLParser.py is 1.11.2.1; base version for test_htmlparser.py is 1.8.8.1
msg44034 - (view)	Author: Steven Rosenthal (smroid)	Date: 2003-06-18 03:22
This also fixes bugs 683938 and 699079.
msg44035 - (view)	Author: Titus Brown (titus)	Date: 2004-12-19 00:42
This patch allows developers to override the behavior of HTMLParser when parsing malformed HTML. Normally HTMLParser calls the function self.error(), which raises an exception. This patch adds appropriate return values for situations where self.error has been redefined in subclasses to not raise an exception. It does not change the default behavior of HTMLParser and so presents no backwards compatibility issues.

The patch itself consists of an added comment and two added lines of code that call 'return' with appropriate values after a self.error call. Nothing wrong with 'em. I can't verify that the "junk characters" error call will leave the parser in a good state, though, if execution returns from error().

The library documentation could be updated to reflect the ability to override error() behavior; I've written a short patch, available at http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch

More problems exist with markupbase.py, upon which HTMLParser is based. markupbase calls error() as well, and has some stickier situations. See comments in bug 917188 as well.

Comments in 683938 and 699079 suggest that raising an exception is the correct response to the parse errors. I recommend application of the patch anyway, because it (a) doesn't change any behavior by default and (b) may solve some problems for people.

An alternative would be to distinguish between unrecoverable errors and recoverable errors by having two different functions, e.g. error() (for recoverable errors) and _fail() (for unrecoverable errors). By default error() would call _fail() and internal code could be changed to call _fail() where recovery is impossible. This might alter behavior in situations where subclasses override error() but then again that's not legitimate to do anyway, at least not at the moment -- error() isn't in the docs ;).

If nothing is done, at least close patch 755660 and bug 736428 with a comment saying that this behavior will not be addressed ;).
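The error()/_fail() split suggested in msg44035 can be sketched roughly as follows. This is a standalone toy, not HTMLParser or markupbase code: SketchParser, LenientParser, ParseFailure, and parse_attribute are all hypothetical names invented here to show the control flow, under the assumption that a recoverable error simply falls through to a fallback return value when a subclass makes error() return.

```python
class ParseFailure(Exception):
    """Hypothetical exception for errors the parser cannot recover from."""


class SketchParser:
    """Toy parser showing the proposed error()/_fail() split."""

    def error(self, message):
        # Recoverable problem. By default it is still fatal, so the
        # behavior is unchanged unless a subclass overrides this method.
        self._fail(message)

    def _fail(self, message):
        # Unrecoverable problem: always raises.
        raise ParseFailure(message)

    def parse_attribute(self, text):
        """Split 'name=value' into a (name, value) pair."""
        name, sep, value = text.partition("=")
        if not name:
            # Nothing sensible can be returned without a name.
            self._fail("attribute has no name: %r" % text)
        if sep and not value:
            self.error("attribute %r has no value" % name)
            # Reached only when an overridden error() returns normally.
            return name, ""
        return name, (value if sep else None)


class LenientParser(SketchParser):
    """Subclass that records recoverable errors instead of raising."""

    def __init__(self):
        self.problems = []

    def error(self, message):
        self.problems.append(message)


lenient = LenientParser()
print(lenient.parse_attribute("foo="))
print(lenient.problems)
```

With this arrangement the default class still raises on a missing value, while the lenient subclass keeps going and only the _fail() path remains fatal for both.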
msg44036 - (view)	Author: Titus Brown (titus)	Date: 2004-12-19 00:45
whoops, attached to wrong patch! dangitall. sorry...
msg44037 - (view)	Author: Titus Brown (titus)	Date: 2004-12-19 06:58
I don't think HTMLParser should parse clearly invalid HTML without complaining. Perhaps such behavior should be overridable (see patch 755660 & comments therein), but this patch changes behavior to parse invalid HTML w/o complaint. Patch 699079, to "fix" similar behavior, was closed by adsr and bcannon because such behavior is correct. Patch 683938 is also similar but is being kept open for some reason. Recommend closing patch w/o applying.
msg55984 - (view)	Author: T. Middleton (timtoo)	Date: 2007-09-17 21:20
I for one thank smroid for the patch. I also have hit all of these cases in the wild. This patch makes real life a lot less frustrating. This patch is surely far preferable to HTMLParser's tendency to just throw up its hands and quietly give up. And one might even make a case that in the horrific world of HTML "standards" cases #2 and #3 can be considered actually valid (as much as it hurts to say so).
msg114245 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-18 16:32
There are messages both for and against the patch, which contains a unit test. Can we have a statement from a knowledgeable HTML person as to whether the patch should be accepted or rejected?
msg146786 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-11-01 16:11
Attached patch includes the tests in diff.txt. On Python 3, with strict=False, the first test (adjacent attributes) passes, but the other two still fail. See also #12629.
msg147611 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-11-14 16:57
New changeset 3c3009f63700 by Ezio Melotti in branch '2.7': #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. http://hg.python.org/cpython/rev/3c3009f63700

New changeset 16ed15ff0d7c by Ezio Melotti in branch '3.2': #1745761, #755670, #13357, #12629, #1200313: improve attribute handling in HTMLParser. http://hg.python.org/cpython/rev/16ed15ff0d7c

New changeset 426f7a2b1826 by Ezio Melotti in branch 'default': #1745761, #755670, #13357, #12629, #1200313: merge with 3.2. http://hg.python.org/cpython/rev/426f7a2b1826
msg147619 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-11-14 17:07
Fixed, thanks for the report!

History
Date	User	Action	Args
2022-04-10 16:09:16	admin	set	github: 38664
2011-11-14 17:07:52	ezio.melotti	set	status: open -> closed; versions: + Python 2.7; messages: + msg147619; dependencies: - allow HTMLParser to continue after a parse error; resolution: fixed; stage: patch review -> resolved
2011-11-14 16:57:13	python-dev	set	nosy: + python-dev; messages: + msg147611
2011-11-14 12:45:57	ezio.melotti	set	assignee: ezio.melotti
2011-11-01 16:11:55	ezio.melotti	set	files: + issue755670.diff; versions: + Python 3.3; nosy: + ezio.melotti, - BreamoreBoy; messages: + msg146786; type: enhancement -> behavior
2010-08-18 16:32:10	BreamoreBoy	set	versions: + Python 3.2, - Python 2.7; nosy: + BreamoreBoy; messages: + msg114245; stage: test needed -> patch review
2009-02-12 03:01:19	ajaksu2	set	dependencies: + HTMLParser attribute parsing bug, allow HTMLParser to continue after a parse error; type: enhancement; stage: test needed; versions: + Python 2.7, - Python 2.3
2007-09-17 21:20:47	timtoo	set	nosy: + timtoo; messages: + msg55984
2003-06-17 03:09:17	smroid	create