HTMLParser : A auto-tolerant parsing mode #43346

kxroberto · 2006-05-11T17:19:36Z

BPO	1486713
Nosy	@freddrake, @terryjreedy, @orsenthil, @ezio-melotti, @merwok, @bitdancer
Files	HTMLParser_tolerant.patch HTMLParser_tolerant_py24.patch HTMLParser_tolerant_py26.patch test_htmlparser_tolerant.patch: test case

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2010-12-03.04:10:56.507>
created_at = <Date 2006-05-11.17:19:36.000>
labels = ['type-feature', 'library']
title = 'HTMLParser : A auto-tolerant parsing mode'
updated_at = <Date 2011-11-16.13:17:02.792>
user = 'https://bugs.python.org/kxroberto'

bugs.python.org fields:

activity = <Date 2011-11-16.13:17:02.792>
actor = 'ezio.melotti'
assignee = 'none'
closed = True
closed_date = <Date 2010-12-03.04:10:56.507>
closer = 'r.david.murray'
components = ['Library (Lib)']
creation = <Date 2006-05-11.17:19:36.000>
creator = 'kxroberto'
dependencies = []
files = ['7243', '7244', '18623', '18624']
hgrepos = []
issue_num = 1486713
keywords = ['patch']
message_count = 24.0
messages = ['50232', '50233', '50234', '50235', '113366', '114659', '114682', '114773', '114786', '114796', '115031', '115115', '115624', '121674', '123174', '123247', '147692', '147698', '147755', '147756', '147763', '147765', '147766', '147767']
nosy_count = 9.0
nosy_names = ['fdrake', 'terry.reedy', 'jjlee', 'orsenthil', 'kxroberto', 'ezio.melotti', 'Neil Muller', 'eric.araujo', 'r.david.murray']
pr_nums = []
priority = 'normal'
resolution = 'accepted'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue1486713'
versions = ['Python 3.2']

kxroberto · 2006-05-11T17:19:36Z

Changes:

Now allows missing spaces between attributes as its
often seen on the web like this :

That like broke the whole parsing before.

A fully auto-tolerant mode (HTMLParser.tolerant=1)
was added. It should hopefully NEVER break HTML parsing
on the level of HTMLParser, but recover and continue
the parsing smartly. The mode was tested extensively
with complex pages. The tolerant mode is guaranted to
finish all HTML stuff only during HTMLParser.close() /
goahead(end=True) - yet that was the same (stucking)
policy before.
Maybe steep: I have switched ON the tolerant mode by
default, as this is, what in 99.9% of cases one wants
to have.
(I've maybe 20 applications for HTMLParser - None like
the unrecoverable breaks with Exceptions)
During tolerant mode the virtual .warning(message,i,k)
is called instead of error - by default this just
counts .warning_count up. This framework should even
enable to write po HTML checkers
The patch was generated against py2.3 (still the
"good/base" Python for me) and also fixes a regexp-bug
(which already was fixed in py2.4.2). Yet the patch
works also against py2.4/2.5 - 2 locations where py24
trivially changed to %r/repr may grumble.

-robert

kxroberto · 2006-05-23T15:11:30Z

Logged In: YES
user_id=972995

Python 2.4 version of the patch added.

kxroberto · 2006-05-23T15:15:52Z

Logged In: YES
user_id=972995

(and works also for Python2.5)

jjlee · 2007-01-30T02:32:03Z

This badly needs unit tests.

terryjreedy · 2010-08-09T03:32:42Z

This needs to be checked for applicability to 3.x.
Do beautifulsoup and other programs cover this ground (tolerant parsing of junk html)?

BreamoreBoy · 2010-08-22T10:14:19Z

I think this should be closed as have other similar requests in the last few days.

bitdancer · 2010-08-22T16:37:51Z

I disagree (and might disagree with those other closings but I haven't noticed them I guess). BeautifulSoup does *not* cover this ground, it is broken in 3.x because of the lack of a tolerant HTML parser in the stdlib (it used to use sgmlib, which is now gone). BeautifulSoup would probably very much like to have this tolerant mode.

It probably shouldn't be the default, though, for backward compatibility reasons :(

kxroberto · 2010-08-24T09:38:09Z

for me a parser which cannot be feed with HTML from outside (which I cannot edit myself) has not much use at all.
attached my current patch (vs. py26) - many changes meanwhile.
and a test case.

I've put the default to strict mode, but ...

bitdancer · 2010-08-24T13:13:39Z

2.6 is now in security-fix-only mode. Since this is a new feature, it can only go into 3.2.

Can you provide a patch against py3k trunk?

I've only glanced at the patch briefly, but one thing that concerns me is 'warning file'. I suppose that either the logging module or perhaps the warnings module should be used instead. We should look at how other stdlib modules handle this kind of thing. Or perhaps warnings shouldn't be generated at all, since the default will be strict and therefore the programmer has consciously selected tolerant mode.

One stdlib model we could follow is the model of the email module: have a 'defects' attribute that collects the errors. email6, by the way, is going to have both 'tolerant' and 'strict' modes, and in that case the default is tolerant (and always has been) in respect for Postel's law, which is enshrined in the email RFCs. If the HTTP standards have a similar recommendation to accept "dirty" input when possible, we could make an argument for changing HTMLParser's default to tolerant.

terryjreedy · 2010-08-24T16:26:03Z

I agree that a tolerant mode would be good (and often requested). String encoding and decoding also have strict and forgiving modes, so this seems close to a policy.

Unit tests with example snippets that properly fail strict mode and pass the new tolerant mode are needed both to completely review and apply a patch.

kxroberto · 2010-08-26T21:44:56Z

I'm not working with Py3. don't how much that module is different in 3.
unless its going into a py2 version, I'll leave the FR so far to the py3 community

bitdancer · 2010-08-27T19:07:36Z

For anyone who does want to work on this (and I do, but it will be quite a while before I can) see also bpo-6191.

bitdancer · 2010-09-05T02:46:22Z

See also bpo-1058305, which may be a duplicate.

NeilMuller · 2010-11-20T16:28:24Z

bpo-975556 and bpo-1046092 look like they should also be superseded by this.

bitdancer · 2010-12-03T04:10:56Z

I have committed a version of this patch, without the warnings, using the keyword 'strict=True' as the default, and with a couple added heuristics from other similar issues, in r86952.

kxroberto, if you want to supply your full name, I'll add you to Misc/ACKS. Thanks for the original patch.

bitdancer · 2010-12-03T13:59:14Z

A note for the curious: I changed the keyword name from 'tolerant' to 'strict' because the stdlib has other examples of 'strict' as a keyword, but the word 'tolerant' appears nowhere in the documentation and certainly not as a keyword. So it seemed better to remain consistent with existing practice. This would be even better if the default value of 'strict' was False, but unfortunately we can't do that.

kxroberto · 2011-11-15T18:04:27Z

I looked at the new patch http://hg.python.org/lookup/r86952 for Py3 (regarding the extended tolerance and local backporting to Python2.7):

What I miss are the calls of a kind of self.warning(msg,i,k) function in non-strict/tolerant mode (where self.error is called in strict mode). Such function could be empty or could be a silent simple counter (like in the old patch) - and could be easily sub-classed for advanced use.
I often want at least the possibilty of a HTML error log - so the HTML author (sometimes its me myself) can be noticed to get it more strict on the long run ;-) ...

ezio-melotti · 2011-11-15T19:23:32Z

The HTMLParser is not suitable for validation, even the strict mode allows some non valid markup (and it might be removed soon).
Also I don't think it's easy to call a self.warnings() without trying the strict mode first. The tolerant parsing just allow more things, without making any distinction between valid and not.

kxroberto · 2011-11-16T08:18:34Z

Well in many browsers for example there is a internal warning and error log (window). Which yet does not (need to) claim to be a official W3C checker. It has positive effect on web stabilization.
For example just looking now I see the many HTML and CSS warnings and errors about the sourceforge site and this bug tracker in the Browsers log - not believing that the log covers the bugs 100% ;-)

The events of warnings are easily available here, and calling self.warning, as it was, costs quite nothing. I don't see a problem for non-users of this feature. And most code using HTMLParser also emits warnings on the next higher syntax level, so to not have a black box...

As I used a tolerant version of HTMLParser for about a decade, I can say the warnings are of the same value in many apps and use case, as to be able to have look into a Browsers syntax log.
The style of stretching a argument to black<->white is not reasonable here in the world of human edited HTML ;-)

ezio-melotti · 2011-11-16T08:28:56Z

The strict/tolerant mode mainly works by using either a strict or a tolerant regex. If the markup is invalid, the strict regex doesn't match and it gives an error. The tolerant regex will match both valid and invalid markup at the same time, without distinctions, and that's why there's no way to emit a warning for these cases. I think there are a couple of places where a warning could be emitted, but that would just cover a small percentage of the errors. Even if we find a way to emit a warning for everything allowed by the tolerant mode that fails on strict, it won't still cover all the possible errors, that's why I think tools like validators and conformance checkers (or even the warning/error logs) should be used instead.

kxroberto · 2011-11-16T10:16:51Z

The old patch warned already the majority of real cases - except the missing white space between attributes.

"The tolerant regex will match both":
locatestarttagend_tolerant: The main and frequent issue on the web here is the missing white space between attributes (with enclosed values). And there is the new tolerant comma between attributes, which however I have not seen so far anywhere (the old warning machanism and attrfind.match would have already raised it at "junk chars ..." event.
Both issues can be easily warned (also/already) at quite no cost by the slightly extended regex below (when the 2 new non-pseudo regex groups are check against <>None in check_for_whole_start_tag).
Or missing whitespace could be warned (multiple times) at attrfind time.

attrfind_tolerant : I see no point in the old/"strict" attrfind. (and the difference is guessed 0.000% of real cases). attrfind_tolerant could become the only attrfind.

--

locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:(?:\s+|(\s*))                   # optional whitespace before attribute name
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
         (?:\s*(,))*                   # possibly followed by a comma
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

#s='<abc a="b,+"c="d"e=f>text'
#s='<abc a="b,+" c="d"e=f>text'
s='<abc a="b,+",c="d" e=f>text'

m = locatestarttagend_tolerant.search(s)
print m.group()
print m.groups()
#if m.group(1) is not None: self.warning('space missing ...
#if m.group(2) is not None: self.warning('comma between attr...

m = attrfind_tolerant.search(s, 5)
print m.group()
print m.groups()

ezio-melotti · 2011-11-16T10:50:06Z

Note that the regex and the way the parser considers the commas changed in 16ed15ff0d7c (it now considers them as the name of a value-less attribute), so adding a group for the comma is no longer doable.

In theory, the approach you suggest might work, but if we want some warning mechanism it should be generic enough to work with all kind of invalid markup. In addition this adds complexity to already complex regular expressions, so there should be a valid use case for this.
Also keep in mind that HTMLParser won't do any check about the validity of the elements' names or attributes' names/values, or even if they are nested/closed correctly, so even with a comprehensive set of warnings, you won't still be able to use HTMLParser to validate your pages.

kxroberto · 2011-11-16T12:16:25Z

16ed15ff0d7c was not in current stable py3.2 so I missed it..

When the comma is now raised as attribute name, then the problem is anyway moved to the higher level anyway - and is/can be handled easily there by usual methods.
(still I guess locatestarttagend_tolerant matches a free standing comma extra after an attribute)

"should be generic enough to work with all kind of invalid markup": I think we would be rather complete then (->missing space issue)- at least regarding %age of real cases. And it could be improved with few touches over time if something missing. 100% is not the point unless it shall drive the official W3C checker. The call of self.warning, as in old patch, doesn't cost otherwise and I see no real increase of complexity/cpu-time.

"HTMLParser won't do any check about the validity of the elements' names or attributes' names/values": yes thats of course up to the next level handler (BTDT)- thus the possibilty of error handling is not killed. Its about what HTMLParser _hides_ irrecoverably.

"there should be a valid use case for this": Almost any app which parses HTML (self authored or remote) can have (should have?) a no-fuzz/collateral warn log option. (->no need to make a expensive W3C checker session). I mostly have this in use as said, as it was anyway there.

Well, as for me, I use anyway a private backport to Python2 of this. I try to avoid Python3 as far as possible. (No real plus, too much problems) So for me its just about joining Python4 in the future perhaps - which can do true CPython multithreading, stackless, psyco/static typing ... and print statement again without typing so many extra braces ;-)
I considered extra libs like the HTML tidy binding, but this is all too much fuzz for most cases. And HTMLParser has already quite everything, with the few calls inserted ..

ezio-melotti · 2011-11-16T13:17:02Z

16ed15ff0d7c was not in current stable py3.2 so I missed it..

It's also in 3.2 and 2.7 (but it's quite recent, so if you didn't pull recently you might have missed it).

When the comma is now raised as attribute name, then the problem is
anyway moved to the higher level anyway - and is/can be handled easily
there by usual methods.

The next level could/should validate the name of the attribute and determine that ',' is not a valid attribute name, so in this case there's no warning to raise here (actually you could detect that it's not a-zA-Z (or whatever the specs say) and raise a more general warning even at this level, but no information is lost here about this).

100% is not the point unless it shall drive the official W3C checker.

I'm still not sure that having 70-80% is useful (unless we can achieve 100% on this level and leave the rest to an upper layer). If you think this is doable you could try to first identify what errors should be detected by this layer, see if they are all detectable and then propose a patch.

The call of self.warning, as in old patch, doesn't cost otherwise and
I see no real increase of complexity/cpu-time.

The extra complexity is mainly in the already complex regular expressions, and also in the list of 'if' that will have to check the content of the groups to report the warnings. These changes are indeed not too invasive, but they still make the code more complicated.

Almost any app which parses HTML (self authored or remote) can have
(should have?) a no-fuzz/collateral warn log option. (->no need to
make a expensive W3C checker session).

I think the original goal of HTMLParser was parsing mostly-valid HTML. People started reporting issues with less-valid HTML, and these issues got fixed to make it able to parse non-valid HTML. AFAIK it never followed strictly any HTML standard, and it just provided a best-effort way to get data out of an HTML page. So, I would consider doing validation or even being a building block for a conforming parser out of the scope of the module.

I mostly have this in use as said, as it was anyway there.

If 'this' refers to some kind of warning system, what do you do with these warnings? Do you fix them, avoid using the w3c validator (or any other conforming validator) and consider a mostly-valid page good enough? Or do you fix them, and then you also check with the w3c validator?

kxroberto mannequin added stdlib Python modules in the Lib dir labels May 11, 2006

devdanzin mannequin added type-feature A feature request or enhancement labels Mar 21, 2009

bitdancer closed this as completed Dec 3, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTMLParser : A auto-tolerant parsing mode #43346

HTMLParser : A auto-tolerant parsing mode #43346

kxroberto mannequin commented May 11, 2006

kxroberto mannequin commented May 11, 2006

kxroberto mannequin commented May 23, 2006

kxroberto mannequin commented May 23, 2006

jjlee mannequin commented Jan 30, 2007

terryjreedy commented Aug 9, 2010

BreamoreBoy mannequin commented Aug 22, 2010

bitdancer commented Aug 22, 2010

kxroberto mannequin commented Aug 24, 2010

bitdancer commented Aug 24, 2010

terryjreedy commented Aug 24, 2010

kxroberto mannequin commented Aug 26, 2010

bitdancer commented Aug 27, 2010

bitdancer commented Sep 5, 2010

NeilMuller mannequin commented Nov 20, 2010

bitdancer commented Dec 3, 2010

bitdancer commented Dec 3, 2010

kxroberto mannequin commented Nov 15, 2011

ezio-melotti commented Nov 15, 2011

kxroberto mannequin commented Nov 16, 2011

ezio-melotti commented Nov 16, 2011

kxroberto mannequin commented Nov 16, 2011

ezio-melotti commented Nov 16, 2011

kxroberto mannequin commented Nov 16, 2011

ezio-melotti commented Nov 16, 2011

HTMLParser : A auto-tolerant parsing mode #43346

HTMLParser : A auto-tolerant parsing mode #43346

Comments

kxroberto mannequin commented May 11, 2006

kxroberto mannequin commented May 11, 2006

kxroberto mannequin commented May 23, 2006

kxroberto mannequin commented May 23, 2006

jjlee mannequin commented Jan 30, 2007

terryjreedy commented Aug 9, 2010

BreamoreBoy mannequin commented Aug 22, 2010

bitdancer commented Aug 22, 2010

kxroberto mannequin commented Aug 24, 2010

bitdancer commented Aug 24, 2010

terryjreedy commented Aug 24, 2010

kxroberto mannequin commented Aug 26, 2010

bitdancer commented Aug 27, 2010

bitdancer commented Sep 5, 2010

NeilMuller mannequin commented Nov 20, 2010

bitdancer commented Dec 3, 2010

bitdancer commented Dec 3, 2010

kxroberto mannequin commented Nov 15, 2011

ezio-melotti commented Nov 15, 2011

kxroberto mannequin commented Nov 16, 2011

ezio-melotti commented Nov 16, 2011

kxroberto mannequin commented Nov 16, 2011

ezio-melotti commented Nov 16, 2011

kxroberto mannequin commented Nov 16, 2011

ezio-melotti commented Nov 16, 2011