classification
Title: allow HTMLParser to continue after a parse error
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.3

process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: BreamoreBoy, ajaksu2, ezio.melotti, frafra, smroid, titus
Priority: normal
Keywords: easy, patch

Created on 2003-06-17 02:27 by smroid, last changed 2022-04-10 16:09 by admin. This issue is now closed.

Files
File name | Uploaded | Description
patch.txt | smroid, 2003-06-17 02:28 |
htmlparser_error.diff | ajaksu2, 2009-02-12 00:14 | Steven's patch updated to trunk
parser.diff | BreamoreBoy, 2010-08-18 13:52 |
Messages (8)
msg44029 - Author: Steven Rosenthal (smroid) Date: 2003-06-17 02:27
The HTMLParser.error method raises HTMLParseError,
terminating the parse upon detection of a parse error.

This patch allows HTMLParser to continue parsing
if the error() method is overridden so that it does
not raise an exception.

Doc impact is on the error() method. The existing
test_htmlparser.py unit test is unaffected by the patch.

The base file is HTMLParser.py, revision 1.11.2.1
msg44030 - Author: Steven Rosenthal (smroid) Date: 2003-06-18 03:13
This fixes bug #736428 (submitted by me earlier).
msg44031 - Author: Titus Brown (titus) Date: 2004-12-19 00:45
This patch allows developers to override the behavior of HTMLParser
when parsing malformed HTML.  Normally HTMLParser calls the function
self.error(), which raises an exception.  This patch adds appropriate
return values for situations where self.error has been redefined in
subclasses to *not* raise an exception.

It does not change the default behavior of HTMLParser and so presents
no backwards compatibility issues.

The patch itself consists of an added comment and two added lines of
code that call 'return' with appropriate values after a self.error call.
Nothing wrong with 'em.  I can't verify that the "junk characters" error
call will leave the parser in a good state, though, if execution returns
from error().
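
As a rough illustration of the pattern described above (not the actual patch and not the real HTMLParser code), the sketch below uses a hypothetical ToyParser whose error() raises by default. An explicit return follows the self.error() call, so a subclass that overrides error() can carry on. Every class, method, and message here is an assumption made for the example.

```python
class ToyParseError(Exception):
    """Hypothetical stand-in for HTMLParseError."""


class ToyParser:
    """Hypothetical parser, not the real HTMLParser."""

    def error(self, message):
        # Default behaviour: abort the parse by raising.
        raise ToyParseError(message)

    def parse_tag(self, text, i):
        """Return the index just past the tag starting at text[i], or -1."""
        end = text.find(">", i)
        if end < 0:
            self.error("unterminated tag at offset %d" % i)
            # Only reached when a subclass overrides error() not to raise.
            return -1
        return end + 1


class LenientParser(ToyParser):
    def __init__(self):
        self.errors = []

    def error(self, message):
        # Record the problem instead of raising.
        self.errors.append(message)


lenient = LenientParser()
result = lenient.parse_tag("<a href='x'", 0)
```

With the default error(), the same call raises ToyParseError; with the override, the caller gets a sentinel value back and can decide how to proceed.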

The library documentation could be updated to reflect the ability to
override error() behavior; I've written a short patch, available at

http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch

More problems exist with markupbase.py, upon which HTMLParser is based.
markupbase calls error() as well, and has some stickier situations.  See
comments in bug 917188 as well.

Comments in 683938 and 699079 suggest that raising an exception is the
correct response to the parse errors.  I recommend application of the
patch anyway, because it (a) doesn't change any behavior by default
and (b) may solve some problems for people.

An alternative would be to distinguish between unrecoverable errors
and recoverable errors by having two different functions, e.g. error()
(for recoverable errors) and _fail() (for unrecoverable errors).  By default
error() would call _fail() and internal code could be changed to call
_fail() where recovery is impossible.  This might alter behavior in
situations where subclasses override error() but then again that's not
legitimate to do anyway, at least not at the moment -- error() isn't
in the docs ;).
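
A minimal sketch of that two-function split, assuming a hypothetical TwoLevelParser (none of these names come from HTMLParser or markupbase): error() handles recoverable problems and defers to _fail() by default, while code paths that cannot recover call _fail() directly.

```python
class ToyParseError(Exception):
    """Hypothetical exception type for this sketch."""


class TwoLevelParser:
    """Hypothetical parser separating recoverable from fatal errors."""

    def _fail(self, message):
        # Unrecoverable: always raises and is not meant to be overridden.
        raise ToyParseError(message)

    def error(self, message):
        # Recoverable: defers to _fail() by default, so the default
        # behaviour is still to raise.
        self._fail(message)

    def parse_attr(self, text):
        """Split 'name=value' into a (name, value) pair."""
        if "=" not in text:
            # Recoverable: a subclass may choose to carry on.
            self.error("attribute without value: %r" % text)
            return (text, None)
        name, _, value = text.partition("=")
        if not name:
            # No sensible way to continue, so bypass error().
            self._fail("missing attribute name in %r" % text)
        return (name, value)


class Forgiving(TwoLevelParser):
    def error(self, message):
        pass  # Ignore recoverable errors.


forgiving = Forgiving()
recovered = forgiving.parse_attr("checked")
```

Overriding error() in this sketch only affects the recoverable path; the _fail() path still raises, even in the Forgiving subclass.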

If nothing is done, at least close patch 755660 and bug 736428 with a
comment saying that this behavior will not be addressed ;).
msg81693 - Author: Daniel Diniz (ajaksu2) * (Python triager) Date: 2009-02-12 00:14
Tests still pass with updated patch, but new tests (and docs!) for this
feature are needed if Titus' positive review stands.
msg95107 - Author: Francesco Frassinelli (frafra) Date: 2009-11-10 12:16
I'm using Python 3.1.1 and the patch (patch.txt, provided by smroid)
works very well. It's useful, and I really need it, thanks :)
Without this patch, I can't parse http://ftp.vim.org/pub/vim/ (due to a
fake tag, like "<user@mail.com>") or many other websites.

I hope this patch will be merged in Python 3.2 :)
msg95109 - Author: Francesco Frassinelli (frafra) Date: 2009-11-10 12:47
Site: http://ftp.vim.org/pub/vim/unstable/patches/

Output without a customized error function:
[...]
  File "./takeit.py", line 54, in inspect
    parser.feed(data.read().decode())
  File "/home/frafra/Scrivania/takeit/html/parser.py", line 107, in feed
    self.goahead(0)
  File "/home/frafra/Scrivania/takeit/html/parser.py", line 163, in goahead
    k = self.parse_declaration(i)
  File "/usr/local/lib/python3.1/_markupbase.py", line 97, in parse_declaration
    decltype, j = self._scan_name(j, i)
  File "/usr/local/lib/python3.1/_markupbase.py", line 387, in _scan_name
    % rawdata[declstartpos:declstartpos+20])
  File "/home/frafra/Scrivania/takeit/html/parser.py", line 122, in error
    raise HTMLParseError(message, self.getpos())
html.parser.HTMLParseError: expected name token at '<! gives an error me', at line 153, column 48

Output with a customized error function:
[...]
  File "./takeit.py", line 55, in inspect
    parser.feed(data.read().decode())
  File "/home/frafra/Scrivania/takeit/html/parser.py", line 107, in feed
    self.goahead(0)
  File "/home/frafra/Scrivania/takeit/html/parser.py", line 163, in goahead
    k = self.parse_declaration(i)
  File "/usr/local/lib/python3.1/_markupbase.py", line 97, in parse_declaration
    decltype, j = self._scan_name(j, i)
TypeError: 'NoneType' object is not iterable
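
The second traceback looks consistent with a helper that reports the problem through error() and then falls off the end of the function, so its caller tries to unpack None. The sketch below reproduces that failure mode with a hypothetical scan_name() function; it is an assumption about the mechanism, not the real _markupbase code.

```python
def scan_name(text, i, on_error):
    """Hypothetical helper: return (name, end_index) for the name at text[i]."""
    j = i
    while j < len(text) and text[j].isalnum():
        j += 1
    if j == i:
        on_error("expected name token at %r" % text[i:i + 20])
        # No explicit return here: if on_error() does not raise,
        # the function falls through and returns None.
    else:
        return text[i:j], j


def ignore(message):
    """Error handler that swallows the message instead of raising."""


try:
    # Tuple unpacking fails because scan_name() returned None.
    name, end = scan_name(" gives an error", 0, ignore)
    outcome = "parsed"
except TypeError:
    outcome = "TypeError"
```

This is why a non-raising error() needs every call site to return a usable value afterwards, which is the concern raised earlier about markupbase.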
msg114219 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-18 13:52
Attached a patch for py3k where the file name has changed.  Doc changes could be based on the comment added to the error method in the patch.  I don't think a unit test is needed but could easily be persuaded otherwise.
msg158787 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-04-20 00:33
HTMLParser should now be able to parse invalid HTML too, so this patch is not necessary anymore.
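
A quick way to check this on a recent Python 3 is sketched below. It feeds input resembling the two constructs reported earlier in this thread (a fake tag and a malformed "<!" declaration) to a small subclass and collects the text nodes. The TextCollector name and the sample markup are made up for the example, and how the parser classifies the malformed pieces may vary between versions.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects text nodes; the class name is made up for this example."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


collector = TextCollector()
# Sample markup modelled on the reports above, not taken from the real site.
collector.feed("<p>hello <user@mail.com> world</p><! gives an error me>")
collector.close()
text = "".join(collector.chunks)
```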
History
Date | User | Action | Args
2022-04-10 16:09:15 | admin | set | github: 38663
2012-04-20 00:33:01 | ezio.melotti | set | status: open -> closed; assignee: ezio.melotti; versions: + Python 3.3, - Python 3.2; nosy: + ezio.melotti; messages: + msg158787; resolution: out of date; stage: patch review -> resolved
2011-11-14 17:07:52 | ezio.melotti | unlink | issue755670 dependencies
2010-08-18 13:52:52 | BreamoreBoy | set | files: + parser.diff; versions: + Python 3.2, - Python 2.7; keywords: + patch; nosy: + BreamoreBoy; messages: + msg114219; stage: test needed -> patch review
2009-11-10 12:47:08 | frafra | set | messages: + msg95109
2009-11-10 12:17:00 | frafra | set | nosy: + frafra; messages: + msg95107
2009-04-22 18:49:51 | ajaksu2 | set | keywords: + easy, - patch
2009-04-05 18:45:17 | georg.brandl | link | issue736428 superseder
2009-02-12 03:01:19 | ajaksu2 | link | issue755670 dependencies
2009-02-12 00:15:27 | ajaksu2 | set | type: enhancement
2009-02-12 00:15:02 | ajaksu2 | set | files: + htmlparser_error.diff; nosy: + ajaksu2; stage: test needed; messages: + msg81693; versions: + Python 2.7, - Python 2.3
2003-06-17 02:27:45 | smroid | create |