HTMLParser: An auto-tolerant parsing mode #43346
Changes:
<script type="text/javascript"language="JavaScript1.1"> That line broke the whole parsing before.
-robert |
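(For reference, the tolerant behavior eventually became the only mode of html.parser in modern Python 3, where exactly this markup parses cleanly. A quick check, assuming a current Python 3 interpreter:)

```python
from html.parser import HTMLParser

# Quick check on modern Python 3, where tolerant parsing is the only
# mode: the missing space between the two attributes no longer breaks
# parsing, and both attributes are reported.
class StartTagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))

parser = StartTagCollector()
parser.feed('<script type="text/javascript"language="JavaScript1.1"></script>')
tag, attrs = parser.start_tags[0]
names = [name for name, _ in attrs]
print(tag, names)
```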
Python 2.4 version of the patch added. |
(and it also works for Python 2.5) |
This badly needs unit tests. |
This needs to be checked for applicability to 3.x. |
I think this should be closed, as there have been other similar requests in the last few days. |
I disagree (and might disagree with those other closings, but I guess I haven't noticed them). BeautifulSoup does *not* cover this ground; it is broken in 3.x because of the lack of a tolerant HTML parser in the stdlib (it used to use sgmllib, which is now gone). BeautifulSoup would probably very much like to have this tolerant mode. It probably shouldn't be the default, though, for backward compatibility reasons :( |
For me, a parser which cannot be fed HTML from outside (which I cannot edit myself) is of little use at all. I've set the default to strict mode, but ... |
2.6 is now in security-fix-only mode. Since this is a new feature, it can only go into 3.2. Can you provide a patch against py3k trunk? I've only glanced at the patch briefly, but one thing that concerns me is 'warning file'. I suppose that either the logging module or perhaps the warnings module should be used instead. We should look at how other stdlib modules handle this kind of thing. Or perhaps warnings shouldn't be generated at all, since the default will be strict and therefore the programmer has consciously selected tolerant mode. One stdlib model we could follow is the model of the email module: have a 'defects' attribute that collects the errors. email6, by the way, is going to have both 'tolerant' and 'strict' modes, and in that case the default is tolerant (and always has been) in respect for Postel's law, which is enshrined in the email RFCs. If the HTTP standards have a similar recommendation to accept "dirty" input when possible, we could make an argument for changing HTMLParser's default to tolerant. |
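(The email-module "defects" model mentioned above can be seen in a few lines; this is just an illustration of that pattern, not part of the proposed patch:)

```python
import email

# Illustration of the email module's defects-list model: a message that
# claims to be multipart but never contains its declared boundary line
# still parses, and the problem is recorded on msg.defects instead of
# being raised as an error.
raw = (
    "From: a@example.com\n"
    "Content-Type: multipart/mixed; boundary=SPLIT\n"
    "\n"
    "body with no boundary lines\n"
)
msg = email.message_from_string(raw)
print([type(d).__name__ for d in msg.defects])
```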
I agree that a tolerant mode would be good (and often requested). String encoding and decoding also have strict and forgiving modes, so this is close to established policy. Unit tests with example snippets that properly fail strict mode and pass the new tolerant mode are needed before a patch can be fully reviewed and applied. |
I'm not working with Py3; I don't know how much that module differs in 3. |
For anyone who does want to work on this (and I do, but it will be quite a while before I can) see also bpo-6191. |
See also bpo-1058305, which may be a duplicate. |
bpo-975556 and bpo-1046092 look like they should also be superseded by this. |
I have committed a version of this patch, without the warnings, using the keyword 'strict=True' as the default, and with a couple added heuristics from other similar issues, in r86952. kxroberto, if you want to supply your full name, I'll add you to Misc/ACKS. Thanks for the original patch. |
A note for the curious: I changed the keyword name from 'tolerant' to 'strict' because the stdlib has other examples of 'strict' as a keyword, but the word 'tolerant' appears nowhere in the documentation and certainly not as a keyword. So it seemed better to remain consistent with existing practice. This would be even better if the default value of 'strict' was False, but unfortunately we can't do that. |
I looked at the new patch http://hg.python.org/lookup/r86952 for Py3 (regarding the extended tolerance and local backporting to Python 2.7). What I miss are calls to a kind of self.warning(msg, i, k) function in non-strict/tolerant mode (where self.error is called in strict mode). Such a function could be empty, or could be a silent simple counter (as in the old patch), and could easily be subclassed for advanced use. |
The HTMLParser is not suitable for validation; even the strict mode allows some invalid markup (and it might be removed soon). |
Well, many browsers, for example, have an internal warning and error log (window), which doesn't (need to) claim to be an official W3C checker. It has a positive effect on web stabilization. The warning events are easily available there, and calling self.warning, as it was, costs next to nothing. I don't see a problem for non-users of this feature. And most code using HTMLParser also emits warnings at the next higher syntax level, so as not to be a black box... Having used a tolerant version of HTMLParser for about a decade, I can say that in many apps and use cases the warnings are of the same value as being able to look into a browser's syntax log. |
The strict/tolerant mode mainly works by using either a strict or a tolerant regex. If the markup is invalid, the strict regex doesn't match and it gives an error. The tolerant regex will match both valid and invalid markup the same way, without distinction, and that's why there's no way to emit a warning for these cases. I think there are a couple of places where a warning could be emitted, but that would cover only a small percentage of the errors. Even if we find a way to emit a warning for everything allowed by the tolerant mode that fails on strict, it still won't cover all the possible errors; that's why I think tools like validators and conformance checkers (or even the warning/error logs) should be used instead. |
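(The strict-vs-tolerant distinction can be illustrated with simplified stand-in regexes; these are deliberately minimal and are not the actual HTMLParser patterns:)

```python
import re

# Simplified stand-ins for the idea above (not the real HTMLParser
# regexes): the strict pattern requires whitespace before each attribute,
# the tolerant one does not.  Once only the tolerant pattern is used,
# the "missing space" error is matched away silently, leaving nothing
# to hang a warning on.
strict_attr = re.compile(r'\s+([a-zA-Z_][-.:\w]*)\s*=\s*"([^"]*)"')
tolerant_attr = re.compile(r'\s*([a-zA-Z_][-.:\w]*)\s*=\s*"([^"]*)"')

markup = ' type="text/javascript"language="JavaScript1.1"'

print(strict_attr.findall(markup))    # finds the first attribute, misses the second
print(tolerant_attr.findall(markup))  # matches both, with no error signal
```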
The old patch already warned about the majority of real cases, except for the missing whitespace between attributes. "The tolerant regex will match both": regarding attrfind_tolerant, I see no point in the old/"strict" attrfind (the difference covers a guessed 0.000% of real cases); attrfind_tolerant could become the only attrfind.

    locatestarttagend_tolerant = re.compile(r"""
        <[a-zA-Z][-.a-zA-Z0-9:_]*        # tag name
        (?:(?:\s+|(\s*))                 # optional whitespace before attribute name
          (?:[a-zA-Z_][-.:a-zA-Z0-9_]*   # attribute name
            (?:\s*=\s*                   # value indicator
              (?:'[^']*'                 # LITA-enclosed value
                |\"[^\"]*\"              # LIT-enclosed value
                |[^'\">\s]+              # bare value
              )
              (?:\s*(,))*                # possibly followed by a comma
            )?
          )
        )*
        \s*                              # trailing whitespace
        """, re.VERBOSE)

    attrfind_tolerant = re.compile(
        r'\s*([a-zA-Z_][-.:a-zA-Z_0-9]*)(\s*=\s*'
        r'(\'[^\']*\'|"[^"]*"|[^>\s]*))?')

    s = '<abc a="b,+"c="d"e=f>text'
    m = locatestarttagend_tolerant.search(s)
    print m.group()
    print m.groups()
    # if m.group(1) is not None: self.warning('space missing ...
    # if m.group(2) is not None: self.warning('comma between attr...
    m = attrfind_tolerant.search(s, 5)
    print m.group()
    print m.groups() |
Note that the regex and the way the parser considers the commas changed in 16ed15ff0d7c (it now considers them as the name of a value-less attribute), so adding a group for the comma is no longer doable. In theory, the approach you suggest might work, but if we want some warning mechanism it should be generic enough to work with all kind of invalid markup. In addition this adds complexity to already complex regular expressions, so there should be a valid use case for this. |
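(The comma-as-a-value-less-attribute behavior introduced in 16ed15ff0d7c, and the kind of "next level" check discussed in this thread, can be sketched as follows. WarningParser and its warning() hook are hypothetical illustrations, not part of any committed patch, and the exact tokenization of the comma may vary across Python versions:)

```python
import re
from html.parser import HTMLParser

# Hypothetical sketch of a "next level" check: since the tolerant parser
# surfaces a stray comma as a value-less attribute, a handler can flag
# implausible attribute names itself.  The warning() hook mirrors the
# subclassable design proposed in the old patch; none of this is in the
# stdlib.
VALID_NAME = re.compile(r'[a-zA-Z_][-.:a-zA-Z0-9_]*$')

class WarningParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.warnings = []

    def warning(self, msg):
        # Default behavior could be a silent counter; here we collect.
        self.warnings.append(msg)

    def handle_starttag(self, tag, attrs):
        for name, _value in attrs:
            if not VALID_NAME.match(name):
                self.warning('suspicious attribute name %r in <%s>' % (name, tag))

p = WarningParser()
p.feed('<a b="c", d="e">')
print(p.warnings)
```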
16ed15ff0d7c was not in the current stable py3.2, so I missed it. When the comma is now surfaced as an attribute name, the problem is simply moved to the higher level, and is/can be handled easily there by the usual methods. "should be generic enough to work with all kind of invalid markup": I think we would then be rather complete (with the missing-space issue covered), at least regarding the percentage of real cases. And it could be improved with a few touches over time if something is missing. 100% is not the point, unless it is meant to drive the official W3C checker. The call of self.warning, as in the old patch, costs nothing otherwise, and I see no real increase in complexity/CPU time. "HTMLParser won't do any check about the validity of the elements' names or attributes' names/values": yes, that's of course up to the next-level handler (BTDT); thus the possibility of error handling is not killed. It's about what HTMLParser _hides_ irrecoverably. "there should be a valid use case for this": almost any app which parses HTML (self-authored or remote) can have (should have?) a no-fuss/collateral warn-log option (no need for an expensive W3C checker session). I mostly have this in use, as said, since it was there anyway. Well, as for me, I use a private backport of this to Python 2 anyway; I try to avoid Python 3 as far as possible (no real plus, too many problems). So for me it's just about joining Python 4 in the future perhaps, which can do true CPython multithreading, stackless, psyco/static typing ... and the print statement again without typing so many extra braces ;-) |
It's also in 3.2 and 2.7 (but it's quite recent, so if you didn't pull recently you might have missed it).
The next level could/should validate the name of the attribute and determine that ',' is not a valid attribute name, so in this case there's no warning to raise here (actually you could detect that it's not a-zA-Z (or whatever the specs say) and raise a more general warning even at this level, but no information is lost here about this).
I'm still not sure that having 70-80% is useful (unless we can achieve 100% on this level and leave the rest to an upper layer). If you think this is doable you could try to first identify what errors should be detected by this layer, see if they are all detectable and then propose a patch.
The extra complexity is mainly in the already complex regular expressions, and also in the list of 'if' that will have to check the content of the groups to report the warnings. These changes are indeed not too invasive, but they still make the code more complicated.
I think the original goal of HTMLParser was parsing mostly-valid HTML. People started reporting issues with less-valid HTML, and these issues got fixed to make it able to parse non-valid HTML. AFAIK it never followed strictly any HTML standard, and it just provided a best-effort way to get data out of an HTML page. So, I would consider doing validation or even being a building block for a conforming parser out of the scope of the module.
If 'this' refers to some kind of warning system, what do you do with these warnings? Do you fix them, avoid using the w3c validator (or any other conforming validator) and consider a mostly-valid page good enough? Or do you fix them, and then you also check with the w3c validator? |