classification
Title: markupbase declaration errors aren't recoverable
Type: enhancement Stage: committed/rejected
Components: Library (Lib) Versions: Python 3.3
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: ezio.melotti Nosy List: eric.araujo, ezio.melotti, mnot, r.david.murray, terry.reedy
Priority: normal Keywords:

Created on 2010-06-03 11:14 by mnot, last changed 2012-06-20 19:27 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description
testcase_8885.py mnot, 2010-06-11 01:48 test case
Messages (12)
msg106938 - (view) Author: Mark Nottingham (mnot) Date: 2010-06-03 11:14
In markupbase.py's ParserBase.parse_declaration, an unexpected character is caught like this:

            else:
                self.error(
                    "unexpected %r char in declaration" % rawdata[j])

However, the position (j) isn't updated, which means that error() will be called again once it returns.

For example, this declaration:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" http://www.w3.org/TR/html4/loose.dtd>

(which I think is generated by MS Office) will trigger this behaviour.

Two possible resolutions:

1) increment j and try the next character in this case

2) document that error() is not recoverable; i.e., it MUST raise an exception.

My preference is strongly for #1 (as HTML parsing should be forgiving, and HTMLParser is based upon markupbase).
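The two resolutions can be illustrated with a toy scanner. This is only a sketch: ToyDeclarationScanner, skip_bad_chars and max_steps are hypothetical names, and the loop is a simplified stand-in for ParserBase.parse_declaration, not the real code. The max_steps guard exists solely so the demonstration terminates.

```python
class ToyDeclarationScanner:
    """Hypothetical, simplified stand-in for the declaration loop."""

    def __init__(self, skip_bad_chars, max_steps=1000):
        self.skip_bad_chars = skip_bad_chars
        self.max_steps = max_steps  # demo-only guard against looping forever
        self.errors = []

    def error(self, message):
        # A handler that returns instead of raising, as in the report.
        self.errors.append(message)

    def scan(self, rawdata):
        j = 0
        steps = 0
        while j < len(rawdata):
            steps += 1
            if steps > self.max_steps:
                return None  # gave up: the position stopped advancing
            c = rawdata[j]
            if c == ">":
                return j + 1  # end of the declaration
            if c in "\"'":
                end = rawdata.find(c, j + 1)
                if end == -1:
                    return -1  # incomplete quoted string
                j = end + 1
            elif c.isalnum() or c.isspace():
                j += 1
            else:
                self.error("unexpected %r char in declaration" % c)
                if self.skip_bad_chars:
                    j += 1  # resolution 1: move past the bad character
        return -1  # incomplete declaration


decl = ('DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
        'http://www.w3.org/TR/html4/loose.dtd>')

stuck = ToyDeclarationScanner(skip_bad_chars=False)
stuck_result = stuck.scan(decl)          # stops only because of max_steps

forgiving = ToyDeclarationScanner(skip_bad_chars=True)
forgiving_result = forgiving.scan(decl)  # reaches the closing '>'
```

Without the increment, the same character is reported until the guard trips. With it, each bad character is reported once and scanning continues.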
msg106996 - (view) Author: Mark Nottingham (mnot) Date: 2010-06-03 22:39
Just to be clear -- if error() returns, it will cause an infinite loop.
msg107109 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-06-04 23:11
Neither markerbase nor markupbase are in the list of 2.6 stdlib modules at
http://docs.python.org/modindex.html
even with all packages [+] listings expanded to [-].
So I have to guess this is a third party module. If so, please close and report to *its* authors, not here.
msg107114 - (view) Author: Mark Nottingham (mnot) Date: 2010-06-05 00:18
http://svn.python.org/view/python/trunk/Lib/markupbase.py?view=log
msg107457 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-06-10 12:02
"This module is used as a foundation for the HTMLParser and sgmllib
modules (indirectly, for htmllib as well).  It has no documented
public API and should not be used directly."

So, #2 is not relevant unless you are talking about a docstring update or comment in ParserBase.

Do you have a test case using one of the consumer modules that demonstrates a bug?  markupbase has no test suite of its own (which probably should be fixed someday :)
msg107518 - (view) Author: Mark Nottingham (mnot) Date: 2010-06-11 01:45
I'm using it from HTMLParser. Try parsing a document that contains the DOCTYPE given above, with error() overridden by something like:

    def error(self, msg):
        self.errors += 1

and it will loop.
msg107519 - (view) Author: Mark Nottingham (mnot) Date: 2010-06-11 01:48
Attaching test case.
msg124525 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-12-23 00:59
I verified the looping behavior of the testcase in both 2.7.1 and, with minor mods, 3.1.3 and 3.2b1, so this is a valid issue.

The HTMLParser docs (2.7, 3.2) do not mention the .error method. The default is
    def error(self, message):
        raise HTMLParseError(message, self.getpos())

If this is *not* intended to be part of the API and overridden, the name should be changed to ._error and .error deprecated. If it is, it should be documented.

I think the self.error call should be followed either by j+=1 so parsing continues with the next char or by a raise statement so it is definitely stopped.
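The rename-and-deprecate option could look roughly like the sketch below. SketchParser and ParseError are hypothetical names used for illustration only. This is not what the standard library did, just one way to keep the public name working while steering callers to a private hook that always raises.

```python
import warnings


class ParseError(Exception):
    """Hypothetical exception standing in for HTMLParseError."""


class SketchParser:
    """Hypothetical parser showing a private hook plus a deprecated alias."""

    def _error(self, message):
        # Private hook: always raises, so parsing is definitely stopped.
        raise ParseError(message)

    def error(self, message):
        # Public name kept for compatibility, but flagged as deprecated.
        warnings.warn("error() is deprecated; it is not part of the API",
                      DeprecationWarning, stacklevel=2)
        self._error(message)
```

Internal call sites would then use self._error(...), so a subclass that overrides error() with a non-raising version can no longer stall the parser.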
msg158786 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-04-20 00:30
HTMLParser shouldn't raise errors anymore, so the "error" method (and probably the HTMLParseError exception too) should be deprecated along with the non-strict mode on 3.3.
msg158789 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-04-20 00:43
s/non-strict/strict/
msg158836 - (view) Author: Mark Nottingham (mnot) Date: 2012-04-20 15:17
Why remove 2.7? It'd be an easy bug fix if j is incremented.
msg158853 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2012-04-20 17:22
Because even on 2.7 the parser is now able to handle broken markup, so "error" won't be called anymore.
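For readers who want to check this on a current interpreter, here is a minimal sketch. It assumes a recent Python 3, where html.parser is lenient; DeclCollector is a hypothetical helper name. It feeds the DOCTYPE from the original report and records what handle_decl receives.

```python
from html.parser import HTMLParser


class DeclCollector(HTMLParser):
    """Hypothetical helper that records every declaration it sees."""

    def __init__(self):
        super().__init__()
        self.decls = []

    def handle_decl(self, decl):
        self.decls.append(decl)


DOCTYPE = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" '
           'http://www.w3.org/TR/html4/loose.dtd>')

parser = DeclCollector()
parser.feed(DOCTYPE + "<p>hello</p>")
parser.close()
```

If the parser behaves as described in the message above, feed() returns normally and parser.decls holds a single entry for the malformed DOCTYPE.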
History
Date User Action Args
2012-06-20 19:27:39 ezio.melotti set status: open -> closed
resolution: out of date
stage: needs patch -> committed/rejected
2012-04-20 17:22:19 ezio.melotti set messages: + msg158853
2012-04-20 15:17:51 mnot set messages: + msg158836
2012-04-20 00:43:19 ezio.melotti set messages: + msg158789
2012-04-20 00:30:15 ezio.melotti set versions: + Python 3.3, - Python 3.1, Python 2.7, Python 3.2
nosy: + ezio.melotti
messages: + msg158786
assignee: ezio.melotti
type: behavior -> enhancement
2010-12-23 00:59:27 terry.reedy set nosy: terry.reedy, mnot, eric.araujo, r.david.murray
messages: + msg124525
versions: + Python 3.1, Python 3.2
2010-12-22 08:54:56 eric.araujo set nosy: + eric.araujo
title: markerbase declaration errors aren't recoverable -> markupbase declaration errors aren't recoverable
versions: + Python 2.7, - Python 2.6
resolution: invalid -> (no value)
stage: needs patch
2010-06-11 01:48:58 mnot set files: + testcase_8885.py
messages: + msg107519
2010-06-11 01:45:45 mnot set messages: + msg107518
2010-06-10 12:02:41 r.david.murray set nosy: + r.david.murray
messages: + msg107457
2010-06-05 00:18:15 mnot set status: pending -> open
messages: + msg107114
2010-06-04 23:11:27 terry.reedy set status: open -> pending
nosy: + terry.reedy
messages: + msg107109
resolution: invalid
2010-06-03 22:39:45 mnot set messages: + msg106996
2010-06-03 11:14:09 mnot create