
Author ezio.melotti
Recipients Neil Muller, eric.araujo, ezio.melotti, fdrake, jjlee, kxroberto, orsenthil, r.david.murray, terry.reedy
Date 2011-11-16.13:17:02
> 16ed15ff0d7c was not in current stable py3.2 so I missed it..

It's also in 3.2 and 2.7 (but it's quite recent, so if you didn't pull recently you might have missed it).

> When the comma is now raised as attribute name, then the problem is 
> anyway moved to the higher level anyway - and is/can be handled easily 
> there by usual methods.

The next level could/should validate the attribute name and determine that ',' is not a valid attribute name, so in this case there's no warning to raise here (you could also detect that the name contains characters outside a-zA-Z (or whatever the spec allows) and raise a more general warning even at this level, but either way no information about the problem is lost).
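As a minimal sketch of what that "next level" could look like, here is a subclass of the stdlib parser that checks attribute names after the fact. The `VALID_ATTR_NAME` pattern is an assumption for illustration, not what any HTML spec actually mandates, and `ValidatingParser` is a hypothetical name:

```python
import re
from html.parser import HTMLParser

# Simplified, illustrative pattern for "reasonable" attribute names;
# the real rules depend on which HTML spec you follow.
VALID_ATTR_NAME = re.compile(r'^[a-zA-Z][a-zA-Z0-9_:.-]*$')

class ValidatingParser(HTMLParser):
    """Collects warnings for attribute names the tolerant parser accepted."""

    def __init__(self):
        super().__init__()
        self.warnings = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if not VALID_ATTR_NAME.match(name):
                self.warnings.append(
                    "invalid attribute name %r in <%s>" % (name, tag))
```

The point is that the validation lives entirely in user code, on top of whatever attribute names the parser reports, without touching the module's regular expressions.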

> 100% is not the point unless it shall drive the official W3C checker.

I'm still not sure that 70-80% coverage is useful (unless we can achieve 100% at this level and leave the rest to an upper layer).  If you think this is doable, you could first identify which errors should be detected by this layer, check that they are all detectable, and then propose a patch.

> The call of self.warning, as in old patch, doesn't cost otherwise and
> I see no real increase of complexity/cpu-time.

The extra complexity is mainly in the already complex regular expressions, and also in the series of 'if' checks that will have to inspect the content of the groups to report the warnings.  These changes are indeed not too invasive, but they still make the code more complicated.

> Almost any app which parses HTML (self authored or remote) can have 
> (should have?) a no-fuzz/collateral warn log option. (->no need to 
> make a expensive W3C checker session).

I think the original goal of HTMLParser was to parse mostly-valid HTML.  People started reporting issues with less-valid HTML, and those issues got fixed to make it able to parse non-valid HTML too.  AFAIK it never strictly followed any HTML standard; it just provided a best-effort way to get data out of an HTML page.  So I would consider doing validation, or even being a building block for a conforming parser, outside the scope of the module.

> I mostly have this in use as said, as it was anyway there.

If 'this' refers to some kind of warning system, what do you do with these warnings?  Do you fix them, skip the w3c validator (or any other conforming validator), and consider a mostly-valid page good enough?  Or do you fix them and then also check with the w3c validator?
Linked to issue 1486713.