Message 141260 - Python tracker


Author	Matt.Basta
Recipients	Hunanyan, Matt.Basta, cpalmer, eric.araujo, ezio.melotti, fantoozler, fdrake, friday, georg.brandl, gsf, momat, orsenthil, r.david.murray, yotam
Date	2011-07-27.18:37:44
Message-id	<1311791865.72.0.675600341619.issue670664@psf.upfronthosting.co.za>
In-reply-to

Content
> Yes, but we don't claim to support HTML5 yet.

There's no claim in the docs or the source that HTMLParser specifically adheres to HTML4, either.

Ideally, the parser should strive for parity with the major web browsers, as they are the de facto standard for HTML parser behavior. All of the browsers on my machine, for instance, parse even the following snippet, which declares an HTML 4.01 Strict doctype, with the behavior described in the HTML5 spec:


<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
   "http://www.w3.org/TR/html4/strict.dtd">
<script><span></span>This should not be visible.</script>


Even in pre-HTML5 browsers, this is the way that HTML gets parsed. For the heck of it, I downloaded an old copy of Firefox 2.0 and ran the above snippet. The behavior is consistent.
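
For anyone who wants to check what their own interpreter does with the snippet, here is a minimal sketch. It assumes Python 3's `html.parser` module rather than the Python 2 `HTMLParser` module this issue was filed against, and the `TagRecorder` class and its attribute names are made up for illustration. If the parser treats the contents of <script> as raw text, the way browsers do, the only start tag it reports is `script` and the `<span></span>` markup shows up as script data.

```python
from html.parser import HTMLParser

# The snippet from this message, unchanged.
SNIPPET = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"\n'
    '   "http://www.w3.org/TR/html4/strict.dtd">\n'
    '<script><span></span>This should not be visible.</script>\n'
)


class TagRecorder(HTMLParser):
    """Hypothetical helper: records start tags and the text inside <script>."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.script_chunks = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Script content may arrive in several chunks, so collect and join.
        if self._in_script:
            self.script_chunks.append(data)


recorder = TagRecorder()
recorder.feed(SNIPPET)
recorder.close()
script_text = "".join(recorder.script_chunks)

print("start tags:", recorder.start_tags)
print("script text:", repr(script_text))
```

A parser that followed the HTML4 reading would instead report `span` as a start tag, which is the case the patch is meant to eliminate.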

While I would otherwise agree that keeping to the HTML4 spec is the right thing to do, this is a quirk of the spec that is ignored by browsers (as can be seen in Firefox 2), has been changed in a later version of the spec, and is causing problems for a good number of developers.

It could be argued that the patch is a far more elegant solution for Beautiful Soup developers than the workaround in msg88864.

History
Date	User	Action	Args
2011-07-27 18:37:45	Matt.Basta	set	recipients: + Matt.Basta, fdrake, georg.brandl, yotam, orsenthil, fantoozler, gsf, cpalmer, ezio.melotti, eric.araujo, r.david.murray, momat, Hunanyan, friday
2011-07-27 18:37:45	Matt.Basta	set	messageid: <1311791865.72.0.675600341619.issue670664@psf.upfronthosting.co.za>
2011-07-27 18:37:45	Matt.Basta	link	issue670664 messages
2011-07-27 18:37:44	Matt.Basta	create