Author: r.david.murray
Recipients: Jim.Jewett, Michel.Leunen, ezio.melotti, georg.brandl, r.david.murray, serhiy.storchaka
Date: 2012-04-12.15:26:45
SpamBayes Score: -1.0
Marked as misclassified: Yes
Message-id: <1334244405.96.0.271121992173.issue14538@psf.upfronthosting.co.za>
In-reply-to:
Content
Yes, after considerable discussion, those of us working on this stuff decided that the goal should be that the parser be able to complete parsing, without error, anything the typical browsers can parse (which means pretty much anything, though that says nothing about whether the result of the parse is useful in any way).  In other words, we've been treating it as a bug when the parser throws an error, since one generally uses the library to parse web pages from the internet, and having the parse fail leaves you SOL for doing anything useful with the bad pages one gets therefrom.  (Note that if the parser were strictly adhering to the older RFCs our decision would have been different...but it is not.  It has always accepted *some* badly formed documents and rejected others.)
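
To illustrate the tolerant behavior we're aiming for, here is a minimal sketch against the stdlib html.parser API (TagCollector is a hypothetical helper name; this assumes the non-strict mode, which later became the only behavior):

    from html.parser import HTMLParser

    class TagCollector(HTMLParser):
        """Collect start tags; the tolerant parser does not raise on bad markup."""
        def __init__(self):
            super().__init__()
            self.tags = []

        def handle_starttag(self, tag, attrs):
            self.tags.append(tag)

    # Deliberately broken markup: unquoted attribute, stray '<', unclosed tags.
    broken = '<html><body><p class=foo>text < 5 <div><span>oops</body>'
    parser = TagCollector()
    parser.feed(broken)    # should complete without raising
    parser.close()
    print(parser.tags)     # e.g. ['html', 'body', 'p', 'div', 'span']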

Also note that BeautifulSoup in Python 2 used the sgml parser (sgmllib), which didn't throw errors, but that module is gone in Python 3.  In Python 3, BeautifulSoup uses the html parser...which is what started us down this road to begin with.
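
For reference, a minimal usage sketch of Python 3 BeautifulSoup driving that same stdlib parser (bs4 is the third-party BeautifulSoup 4 package, not part of the stdlib):

    from bs4 import BeautifulSoup

    # The "html.parser" backend here is the stdlib parser discussed above.
    soup = BeautifulSoup('<p>unclosed <b>markup', 'html.parser')
    print(soup.p.get_text())   # prints: unclosed markup
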
History
Date                 User            Action  Args
2012-04-12 15:26:46  r.david.murray  set     recipients: + r.david.murray, georg.brandl, ezio.melotti, Jim.Jewett, serhiy.storchaka, Michel.Leunen
2012-04-12 15:26:45  r.david.murray  set     messageid: <1334244405.96.0.271121992173.issue14538@psf.upfronthosting.co.za>
2012-04-12 15:26:45  r.david.murray  link    issue14538 messages
2012-04-12 15:26:45  r.david.murray  create