classification
Title: HTMLParser : A auto-tolerant parsing mode
Type: enhancement
Stage: resolved
Components: Library (Lib)
Versions: Python 3.2
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To:
Nosy List: Neil Muller, eric.araujo, ezio.melotti, fdrake, jjlee, kxroberto, orsenthil, r.david.murray, terry.reedy
Priority: normal
Keywords: patch

Created on 2006-05-11 17:19 by kxroberto, last changed 2011-11-16 13:17 by ezio.melotti. This issue is now closed.

Files
File name Uploaded Description
HTMLParser_tolerant.patch kxroberto, 2006-05-11 17:19
HTMLParser_tolerant_py24.patch kxroberto, 2006-05-23 15:11
HTMLParser_tolerant_py26.patch kxroberto, 2010-08-24 09:38
test_htmlparser_tolerant.patch kxroberto, 2010-08-24 10:04 test case
Messages (24)
msg50232 - Author: kxroberto (kxroberto) Date: 2006-05-11 17:19
Changes:

* Now allows missing spaces between attributes, as is
often seen on the web, like this:

<script type="text/javascript"language="JavaScript1.1">

Previously that broke the whole parse.


* A fully auto-tolerant mode (HTMLParser.tolerant=1)
was added. It should hopefully NEVER break HTML parsing
on the level of HTMLParser, but recover and continue
the parsing smartly. The mode was tested extensively
with complex pages. The tolerant mode is guaranteed to
finish all HTML stuff only during HTMLParser.close() /
goahead(end=True) - yet that was the same (getting stuck)
policy before.
Maybe steep: I have switched ON the tolerant mode by
default, as this is what one wants in 99.9% of cases.
(I have maybe 20 applications for HTMLParser - none of them
likes the unrecoverable breaks with exceptions.)
In tolerant mode the virtual .warning(message,i,k)
is called instead of error - by default this just
counts .warning_count up. This framework should even
make it possible to write HTML checkers.

* The patch was generated against py2.3 (still the
"good/base" Python for me) and also fixes a regexp bug
(which was already fixed in py2.4.2). Yet the patch
also works against py2.4/2.5 - 2 locations where py24
trivially changed to %r/repr may grumble.


-robert
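For illustration, the missing-space case described above can be tried against the html.parser module of a current Python 3. This is an assumption: the patch itself targeted py2.3's HTMLParser, and later Python 3 releases dropped the strict mode entirely. The AttrCollector class name is made up for this sketch.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Hypothetical helper: records every start tag with its attributes."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.seen.append((tag, attrs))


collector = AttrCollector()
# No whitespace between the closing quote of "type" and the "language" name.
collector.feed('<script type="text/javascript"language="JavaScript1.1"></script>')
collector.close()
print(collector.seen)
```

If the parser recovers as intended, both attributes show up in the recorded list instead of an exception being raised.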
msg50233 - Author: kxroberto (kxroberto) Date: 2006-05-23 15:11
Python 2.4 version of the patch added.
msg50234 - Author: kxroberto (kxroberto) Date: 2006-05-23 15:15
(and works also for Python2.5)
msg50235 - Author: John J Lee (jjlee) Date: 2007-01-30 02:32
This badly needs unit tests.
msg113366 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-08-09 03:32
This needs to be checked for applicability to 3.x.
Do beautifulsoup and other programs cover this ground (tolerant parsing of junk html)?
msg114659 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-22 10:14
I think this should be closed as have other similar requests in the last few days.
msg114682 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-08-22 16:37
I disagree (and might disagree with those other closings, but I haven't noticed them I guess).  BeautifulSoup does *not* cover this ground; it is broken in 3.x because of the lack of a tolerant HTML parser in the stdlib (it used to use sgmllib, which is now gone).  BeautifulSoup would probably very much like to have this tolerant mode.

It probably shouldn't be the default, though, for backward compatibility reasons :(
msg114773 - Author: kxroberto (kxroberto) Date: 2010-08-24 09:38
For me a parser which cannot be fed with HTML from outside (which I cannot edit myself) is not of much use at all.
Attached is my current patch (vs. py26) - many changes in the meantime -
and a test case.

I've put the default to strict mode, but ...
msg114786 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-08-24 13:13
2.6 is now in security-fix-only mode.  Since this is a new feature, it can only go into 3.2.

Can you provide a patch against py3k trunk?

I've only glanced at the patch briefly, but one thing that concerns me is 'warning file'.  I suppose that either the logging module or perhaps the warnings module should be used instead.  We should look at how other stdlib modules handle this kind of thing.  Or perhaps warnings shouldn't be generated at all, since the default will be strict and therefore the programmer has consciously selected tolerant mode.

One stdlib model we could follow is the model of the email module: have a 'defects' attribute that collects the errors.  email6, by the way, is going to have both 'tolerant' and 'strict' modes, and in that case the default is tolerant (and always has been) in respect for Postel's law, which is enshrined in the email RFCs.  If the HTTP standards have a similar recommendation to accept "dirty" input when possible, we could make an argument for changing HTMLParser's default to tolerant.
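As a small illustration of the 'defects' model mentioned above, the email package parses a malformed message without raising and records what it found on the message object. The sample message text below is invented for this sketch; only email.message_from_string and the defects attribute are taken from the stdlib.

```python
import email

# A message that declares a multipart boundary but never uses it.
raw = (
    "Content-Type: multipart/mixed; boundary=XYZ\n"
    "\n"
    "This body never uses the declared boundary.\n"
)

# Parsing is tolerant by default: no exception, problems are collected.
msg = email.message_from_string(raw)
defect_names = [type(defect).__name__ for defect in msg.defects]
print(defect_names)
```

An HTMLParser equivalent would presumably append to a similar list wherever the tolerant code path accepts something the strict path rejects, but that is a design guess, not something the patch does.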
msg114796 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-08-24 16:26
I agree that a tolerant mode would be good (and often requested). String encoding and decoding also have strict and forgiving modes, so this seems close to a policy.

Unit tests with example snippets that properly fail in strict mode and pass in the new tolerant mode are needed, both to review a patch completely and to apply it.
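The strict and forgiving codec modes referred to above look roughly like this in Python 3; the sample bytes are chosen only for the demonstration.

```python
raw = b"caf\xe9"  # Latin-1 encoded text, not valid UTF-8

# Strict mode (the default, errors="strict"): invalid input raises.
try:
    raw.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Forgiving modes: the same input is accepted and repaired or trimmed.
replaced = raw.decode("utf-8", errors="replace")  # bad byte becomes U+FFFD
ignored = raw.decode("utf-8", errors="ignore")    # bad byte is dropped

print(strict_failed, repr(replaced), repr(ignored))
```

The parallel with the parser is that the caller picks the policy explicitly, and the strict behaviour stays the default.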
msg115031 - Author: kxroberto (kxroberto) Date: 2010-08-26 21:44
I'm not working with Py3 and don't know how much that module differs in 3.
Unless it's going into a py2 version, I'll leave the FR to the py3 community for now.
msg115115 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-08-27 19:07
For anyone who does want to work on this (and I do, but it will be quite a while before I can) see also issue 6191.
msg115624 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-09-05 02:46
See also issue 1058305, which may be a duplicate.
msg121674 - Author: Neil Muller (Neil Muller) Date: 2010-11-20 16:28
#975556 and #1046092 look like they should also be superseded by this.
msg123174 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-03 04:10
I have committed a version of this patch, without the warnings, using the keyword 'strict=True' as the default, and with a couple added heuristics from other similar issues, in r86952.

kxroberto, if you want to supply your full name, I'll add you to Misc/ACKS.  Thanks for the original patch.
msg123247 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-12-03 13:59
A note for the curious: I changed the keyword name from 'tolerant' to 'strict' because the stdlib has other examples of 'strict' as a keyword, but the word 'tolerant' appears nowhere in the documentation and certainly not as a keyword.  So it seemed better to remain consistent with existing practice.  This would be even better if the default value of 'strict' was False, but unfortunately we can't do that.
msg147692 - Author: kxroberto (kxroberto) Date: 2011-11-15 18:04
I looked at the new patch http://hg.python.org/lookup/r86952 for Py3 (regarding the extended tolerance and local backporting to Python2.7):

What I miss are calls to some kind of self.warning(msg,i,k) function in non-strict/tolerant mode (where self.error is called in strict mode). Such a function could be empty or could be a simple silent counter (like in the old patch) - and could easily be overridden in a subclass for advanced use.
I often want at least the possibility of an HTML error log - so the HTML author (sometimes it's me myself) can be notified and make it stricter in the long run ;-) ...
msg147698 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-15 19:23
The HTMLParser is not suitable for validation; even the strict mode allows some invalid markup (and it might be removed soon).
Also I don't think it's easy to call a self.warning() without trying the strict mode first.  The tolerant parsing just allows more things, without making any distinction between valid and invalid markup.
msg147755 - Author: kxroberto (kxroberto) Date: 2011-11-16 08:18
Well, in many browsers, for example, there is an internal warning and error log (window), which does not (need to) claim to be an official W3C checker. It has a positive effect on web stabilization.
For example, just looking now, I see many HTML and CSS warnings and errors about the sourceforge site and this bug tracker in the browser's log - not believing that the log covers the bugs 100% ;-)

The warning events are easily available here, and calling self.warning, as before, costs next to nothing. I don't see a problem for non-users of this feature. And most code using HTMLParser also emits warnings on the next higher syntax level, so as not to be a black box...

As I have used a tolerant version of HTMLParser for about a decade, I can say the warnings are of the same value in many apps and use cases as being able to look into a browser's syntax log.
Stretching an argument to black<->white is not reasonable here in the world of human-edited HTML ;-)
msg147756 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-16 08:28
The strict/tolerant mode mainly works by using either a strict or a tolerant regex.  If the markup is invalid, the strict regex doesn't match and it gives an error.  The tolerant regex will match both valid and invalid markup at the same time, without distinctions, and that's why there's no way to emit a warning for these cases.  I think there are a couple of places where a warning could be emitted, but that would just cover a small percentage of the errors.  Even if we find a way to emit a warning for everything allowed by the tolerant mode that fails on strict, it still won't cover all the possible errors; that's why I think tools like validators and conformance checkers (or even the warning/error logs) should be used instead.
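A toy version of the point above, with deliberately simplified patterns that are NOT the ones used in html.parser (ATTR, strict_tag and tolerant_tag are names invented for this sketch): the strict pattern rejects the sloppy tag, while the tolerant one accepts both inputs and its match result alone does not say which of the two it saw.

```python
import re

# Simplified, hypothetical patterns for illustration only.
ATTR = r'[a-zA-Z_][-.:a-zA-Z0-9_]*="[^"]*"'
strict_tag = re.compile(r'<[a-zA-Z]+(?:\s+' + ATTR + r')*\s*>')    # whitespace required
tolerant_tag = re.compile(r'<[a-zA-Z]+(?:\s*' + ATTR + r')*\s*>')  # whitespace optional

valid = '<a x="1" y="2">'
sloppy = '<a x="1"y="2">'  # missing space between attributes

for pattern_name, pattern in (("strict", strict_tag), ("tolerant", tolerant_tag)):
    for text in (valid, sloppy):
        matched = pattern.fullmatch(text) is not None
        print(pattern_name, text, matched)
```

Telling the two inputs apart in tolerant mode would need either a second pass with the strict pattern or extra capture groups, which is the trade-off discussed in the following messages.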
msg147763 - Author: kxroberto (kxroberto) Date: 2011-11-16 10:16
The old patch already warned about the majority of real cases - except the missing white space between attributes.

"The tolerant regex will match both": 
locatestarttagend_tolerant: The main and frequent issue on the web here is the missing white space between attributes (with enclosed values). And there is the new tolerated comma between attributes, which however I have not seen anywhere so far (the old warning mechanism and attrfind.match would already have raised it at the "junk chars ..." event).
Both issues can easily be warned about (also/already) at almost no cost with the slightly extended regex below (when the 2 new capturing (non-pseudo) regex groups are checked for not being None in check_for_whole_start_tag).
Or missing whitespace could be warned about (multiple times) at attrfind time.

attrfind_tolerant: I see no point in the old/"strict" attrfind (and the difference affects, I'd guess, 0.000% of real cases). attrfind_tolerant could become the only attrfind.


--

import re

locatestarttagend_tolerant = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  (?:(?:\s+|(\s*))                   # whitespace before attribute name; group 1 is set when it is missing
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
         )
         (?:\s*(,))*                   # possibly followed by a comma (group 2)
       )?
     )
   )*
  \s*                                # trailing whitespace
""", re.VERBOSE)
attrfind_tolerant = re.compile(
    r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')


#s='<abc a="b,+"c="d"e=f>text'
#s='<abc a="b,+" c="d"e=f>text'
s='<abc a="b,+",c="d" e=f>text'

m = locatestarttagend_tolerant.search(s)
print(m.group())
print(m.groups())
#if m.group(1) is not None: self.warning('space missing ...
#if m.group(2) is not None: self.warning('comma between attr...

# start at offset 5, i.e. just after '<abc '
m = attrfind_tolerant.search(s, 5)
print(m.group())
print(m.groups())
msg147765 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-16 10:50
Note that the regex and the way the parser considers the commas changed in 16ed15ff0d7c (it now considers them as the name of a value-less attribute), so adding a group for the comma is no longer doable.

In theory, the approach you suggest might work, but if we want some warning mechanism it should be generic enough to work with all kind of invalid markup.  In addition this adds complexity to already complex regular expressions, so there should be a valid use case for this.
Also keep in mind that HTMLParser won't do any check about the validity of the elements' names or attributes' names/values, or even if they are nested/closed correctly, so even with a comprehensive set of warnings, you still won't be able to use HTMLParser to validate your pages.
msg147766 - Author: kxroberto (kxroberto) Date: 2011-11-16 12:16
16ed15ff0d7c was not in current stable py3.2 so I missed it..

When the comma is now raised as attribute name, then the problem is anyway moved to the higher level anyway - and is/can be handled easily there by usual methods.
(still, I guess locatestarttagend_tolerant matches a free-standing extra comma after an attribute)

"should be generic enough to work with all kind of invalid markup": I think we would be rather complete then (-> missing space issue) - at least regarding the %age of real cases. And it could be improved with a few touches over time if something is missing. 100% is not the point unless it shall drive the official W3C checker. The call of self.warning, as in old patch, doesn't cost otherwise and I see no real increase of complexity/cpu-time.

"HTMLParser won't do any check about the validity of the elements' names or attributes' names/values": yes, that's of course up to the next-level handler (BTDT) - thus the possibility of error handling is not killed. It's about what HTMLParser _hides_ irrecoverably.

"there should be a valid use case for this": Almost any app which parses HTML (self authored or remote) can have (should have?) a no-fuzz/collateral warn log option. (->no need to make a expensive W3C checker session). I mostly have this in use as said, as it was anyway there.

Well, as for me, I use a private backport of this to Python2 anyway. I try to avoid Python3 as far as possible. (No real plus, too many problems.) So for me it's just about joining Python4 in the future perhaps - which can do true CPython multithreading, stackless, psyco/static typing ... and print statement again without typing so many extra braces ;-)
I considered extra libs like the HTML tidy binding, but this is all too much fuzz for most cases. And HTMLParser already has nearly everything, with the few calls inserted ..
msg147767 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-16 13:17
> 16ed15ff0d7c was not in current stable py3.2 so I missed it..

It's also in 3.2 and 2.7 (but it's quite recent, so if you didn't pull recently you might have missed it).

> When the comma is now raised as attribute name, then the problem is 
> anyway moved to the higher level anyway - and is/can be handled easily 
> there by usual methods.

The next level could/should validate the name of the attribute and determine that ',' is not a valid attribute name, so in this case there's no warning to raise here (actually you could detect that it's not a-zA-Z (or whatever the specs say) and raise a more general warning even at this level, but no information is lost here about this).
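A rough sketch of how a layer above the parser might flag such a name, assuming the html.parser module of a current Python 3. The warning() method, the WarningParser class and the VALID_ATTR_NAME pattern are hypothetical names for this illustration; HTMLParser itself provides no warning hook.

```python
import re
from html.parser import HTMLParser

# Assumed notion of a "reasonable" attribute name; not taken from any spec.
VALID_ATTR_NAME = re.compile(r"[a-zA-Z_][-.:a-zA-Z0-9_]*\Z")


class WarningParser(HTMLParser):
    """Hypothetical subclass that logs suspicious attribute names."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def warning(self, message):
        # Hypothetical hook: record the current (line, column) and the message.
        self.warnings.append((self.getpos(), message))

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            if not VALID_ATTR_NAME.match(name):
                self.warning("suspicious attribute name %r in <%s>" % (name, tag))


checker = WarningParser()
checker.feed('<abc a="b" , c="d">text</abc>')
checker.close()
for position, message in checker.warnings:
    print(position, message)
```

This only catches what survives into the attribute list; markup the tolerant regexes silently absorb, such as a missing space between attributes, never reaches handle_starttag in a distinguishable form.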

> 100% is not the point unless it shall drive the official W3C checker.

I'm still not sure that having 70-80% is useful (unless we can achieve 100% on this level and leave the rest to an upper layer).  If you think this is doable you could try to first identify what errors should be detected by this layer, see if they are all detectable and then propose a patch.

> The call of self.warning, as in old patch, doesn't cost otherwise and
> I see no real increase of complexity/cpu-time.

The extra complexity is mainly in the already complex regular expressions, and also in the list of 'if' that will have to check the content of the groups to report the warnings.  These changes are indeed not too invasive, but they still make the code more complicated.

> Almost any app which parses HTML (self authored or remote) can have 
> (should have?) a no-fuzz/collateral warn log option. (->no need to 
> make a expensive W3C checker session).

I think the original goal of HTMLParser was parsing mostly-valid HTML.  People started reporting issues with less-valid HTML, and these issues got fixed to make it able to parse non-valid HTML.  AFAIK it never followed strictly any HTML standard, and it just provided a best-effort way to get data out of an HTML page.  So, I would consider doing validation or even being a building block for a conforming parser out of the scope of the module.

> I mostly have this in use as said, as it was anyway there.

If 'this' refers to some kind of warning system, what do you do with these warnings?   Do you fix them, avoid using the w3c validator (or any other conforming validator) and consider a mostly-valid page good enough?  Or do you fix them, and then you also check with the w3c validator?
History
Date User Action Args
2011-11-16 13:17:02 ezio.melotti set messages: + msg147767
2011-11-16 12:16:25 kxroberto set messages: + msg147766
2011-11-16 10:50:06 ezio.melotti set nosy: + eric.araujo; messages: + msg147765
2011-11-16 10:16:51 kxroberto set messages: + msg147763
2011-11-16 08:28:56 ezio.melotti set messages: + msg147756
2011-11-16 08:18:34 kxroberto set messages: + msg147755
2011-11-15 19:23:32 ezio.melotti set nosy: + ezio.melotti; messages: + msg147698
2011-11-15 18:04:27 kxroberto set messages: + msg147692
2010-12-03 13:59:14 r.david.murray set messages: + msg123247
2010-12-03 04:17:17 r.david.murray link issue1046092 superseder
2010-12-03 04:14:13 r.david.murray link issue975556 superseder
2010-12-03 04:10:56 r.david.murray set status: open -> closed; nosy: - BreamoreBoy; messages: + msg123174; resolution: accepted; stage: test needed -> resolved
2010-12-03 03:00:05 r.david.murray link issue1058305 superseder
2010-11-20 16:28:23 Neil Muller set nosy: + Neil Muller; messages: + msg121674
2010-09-05 02:46:22 r.david.murray set messages: + msg115624
2010-08-27 19:07:35 r.david.murray set messages: + msg115115
2010-08-26 21:44:56 kxroberto set messages: + msg115031
2010-08-24 16:26:03 terry.reedy set messages: + msg114796
2010-08-24 13:13:39 r.david.murray set messages: + msg114786; versions: - Python 2.6, Python 2.7
2010-08-24 10:04:28 kxroberto set files: + test_htmlparser_tolerant.patch; versions: + Python 2.6, Python 2.7
2010-08-24 09:38:11 kxroberto set files: + HTMLParser_tolerant_py26.patch; messages: + msg114773
2010-08-22 16:37:51 r.david.murray set nosy: + r.david.murray, orsenthil; messages: + msg114682
2010-08-22 10:14:19 BreamoreBoy set nosy: + fdrake, BreamoreBoy; messages: + msg114659
2010-08-09 03:32:41 terry.reedy set nosy: + terry.reedy; messages: + msg113366; versions: + Python 3.2, - Python 3.1, Python 2.7
2009-03-21 02:45:03 ajaksu2 set stage: test needed; type: enhancement; versions: + Python 3.1, Python 2.7, - Python 2.4
2006-05-11 17:19:36 kxroberto create