Author kxroberto
Date 2006-05-11.17:19:36
Content
Changes:

* Now allows missing spaces between attributes, as is
often seen on the web, like this:

<script type="text/javascript"language="JavaScript1.1">

Input like that broke the whole parsing before.


* A fully auto-tolerant mode (HTMLParser.tolerant=1)
was added. It should hopefully NEVER break HTML parsing
at the level of HTMLParser, but recover and continue
parsing smartly. The mode was tested extensively
with complex pages. The tolerant mode is guaranteed to
finish all pending HTML only during HTMLParser.close() /
goahead(end=True), yet that was the same (stalling)
policy before.
Maybe a steep change: I have switched the tolerant mode ON by
default, as this is what one wants in 99.9% of cases.
(I have maybe 20 applications for HTMLParser, and none of them
likes unrecoverable breaks with exceptions.)
In tolerant mode the virtual method .warning(message,i,k)
is called instead of error; by default it just
increments .warning_count. This framework should even
make it possible to write po HTML checkers.

* The patch was generated against py2.3 (still the
"good/base" Python for me) and also fixes a regexp bug
(which was already fixed in py2.4.2). The patch
also works against py2.4/2.5, though 2 locations where py2.4
trivially changed to %r/repr may grumble.
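
To illustrate the first change (missing spaces between attributes), here is a minimal sketch of the idea. It is not the patch's code: the pattern ATTR and the helper parse_attrs are hypothetical names, and only double-quoted values are handled. The point is just that the whitespace before each attribute name is optional (\s*) rather than required.

```python
import re

# Hypothetical pattern, NOT taken from the patch: name="value" pairs where
# the whitespace in front of each attribute name is optional (\s*).
ATTR = re.compile(r'\s*([a-zA-Z_][-.:a-zA-Z0-9_]*)\s*=\s*"([^"]*)"')


def parse_attrs(tag_body):
    """Return (name, value) pairs found in the attribute part of a start tag."""
    return ATTR.findall(tag_body)


# The attribute part of the <script> example, with no space between attributes.
attrs = parse_attrs('type="text/javascript"language="JavaScript1.1"')
print(attrs)
# [('type', 'text/javascript'), ('language', 'JavaScript1.1')]
```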

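The tolerant-mode hook can be sketched roughly like this. This is a toy scanner, not HTMLParser and not the patch itself: TagScanner, ParseError and the TAG pattern are made up for illustration, and since the meaning of i and k is not spelled out in this message, they are assumed here to be the start and end offsets of the skipped text. Only the names tolerant, warning(message,i,k), error and warning_count follow the description above.

```python
import re


class ParseError(Exception):
    """Raised by the toy scanner in strict mode (hypothetical name)."""


class TagScanner:
    """Toy start-tag scanner illustrating a warning() hook; not the patch."""

    tolerant = 1  # ON by default, as in the description above
    TAG = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)[^<>]*>')  # assumed pattern

    def __init__(self):
        self.warning_count = 0
        self.tags = []

    def warning(self, message, i, k):
        # Default behaviour: just count. A checker could override this
        # to report the message with the offsets i and k (assumed meaning).
        self.warning_count += 1

    def error(self, message):
        raise ParseError(message)

    def feed(self, data):
        i = 0
        while True:
            i = data.find('<', i)
            if i < 0:
                break
            m = self.TAG.match(data, i)
            if m:
                self.tags.append(m.group(1).lower())
                i = m.end()
                continue
            k = i + 1  # skip only the stray '<' and carry on
            if self.tolerant:
                self.warning('malformed start tag', i, k)
                i = k
            else:
                self.error('malformed start tag at offset %d' % i)


scanner = TagScanner()
scanner.feed('<p>1 < 2 <br> <3 <hr>')
print(scanner.tags)           # ['p', 'br', 'hr']
print(scanner.warning_count)  # 2
```

Setting tolerant = 0 on a subclass or instance makes the same input raise ParseError at the first stray '<', which is the old break-on-error behaviour.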

-robert
History
Date User Action Args
2007-08-23 15:48:49 admin link issue1486713 messages
2007-08-23 15:48:49 admin create