Message 88910 - Python tracker

Author	r.david.murray
Recipients	georg.brandl, momat, r.david.murray
Date	2009-06-04.21:50:24
Message-id	<1244152227.38.0.361488389478.issue6191@psf.upfronthosting.co.za>

Content
In doing web scraping I started using BeautifulSoup precisely because it
was very lenient in what HTML it accepted (I haven't written such an app
for a while, so I'm not sure what BeautifulSoup currently does...I
thought I heard it was now using HTMLParser...).

There are a lot of messed-up web pages out there.

I don't have time right now to evaluate your particular cases, but my
rule of thumb would be that if the major web browsers do something
"reasonable" with these cases, then a Python tool designed to read web
pages should do so as well, where possible.  ("Be liberal in what you
accept, and strict in what you generate.")

That said, I'm not sure what HTMLParser's design goals are, so this may
not be an appropriate goal for the module.
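
As a rough illustration of the "liberal in what you accept" idea, here is a
minimal sketch that feeds a deliberately malformed fragment to a parser and
records the tags it reports.  It assumes Python 3's html.parser module, which
may not behave like the HTMLParser module under discussion at the time; the
TagCollector class and the sample markup are hypothetical and are not taken
from the cases in this issue.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Record start and end tags in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


# Hypothetical malformed fragment: <b> and <i> are never closed.
messy = "<p>unclosed <b>bold <i>nested</p>"

collector = TagCollector()
collector.feed(messy)
collector.close()

print(collector.start_tags)
print(collector.end_tags)
```

The parser reports the tags it sees without checking that they are balanced,
so deciding what to do about the unclosed elements is left to the caller.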
