Message 50232 - Python tracker


Author	kxroberto
Date	2006-05-11.17:19:36

Content
Changes:

* Now allows missing spaces between attributes, as is
often seen on the web, like this:

<script type="text/javascript"language="JavaScript1.1">

Markup like that broke the whole parse before.


* A fully auto-tolerant mode (HTMLParser.tolerant=1)
was added. It should hopefully NEVER break HTML parsing
at the level of HTMLParser, but recover and continue
parsing smartly. The mode was tested extensively
with complex pages. The tolerant mode is guaranteed to
finish processing all HTML only during HTMLParser.close() /
goahead(end=True) - yet that was the same (stalling)
policy before.
Maybe steep: I have switched the tolerant mode ON by
default, as this is what one wants in 99.9% of cases.
(I have maybe 20 applications using HTMLParser - none of
them wants the unrecoverable breaks with exceptions.)
In tolerant mode the virtual .warning(message,i,k)
is called instead of error - by default this just
increments .warning_count. This framework should even
make it possible to write HTML checkers.

* The patch was generated against py2.3 (still the
"good/base" Python for me) and also fixes a regexp bug
(which was already fixed in py2.4.2). The patch also
works against py2.4/2.5, except that 2 locations where
py24 trivially changed to %r/repr may grumble.
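
For anyone who wants to try the ideas above without applying the patch, here is a minimal sketch. It is not the patch code: it uses today's html.parser.HTMLParser from Python 3 (which already tolerates the missing space) rather than the py2.3 module, and the warning() hook and warning_count counter are hypothetical stand-ins that the subclass calls itself. The names CheckingParser and parse, and the duplicate-attribute check, are made up purely to show how a counting hook could support a simple HTML checker.

```python
from html.parser import HTMLParser


class CheckingParser(HTMLParser):
    """Collects start tags and counts recoverable problems instead of raising."""

    def __init__(self):
        super().__init__()
        self.start_tags = []     # (tag, attrs) pairs in document order
        self.warning_count = 0   # mirrors the counter described in the post

    def warning(self, message):
        # Hypothetical hook: the stdlib parser never calls this on its own.
        # Like the default described above, it only counts; a checker
        # subclass could log or collect the message instead.
        self.warning_count += 1

    def handle_starttag(self, tag, attrs):
        self.start_tags.append((tag, attrs))
        names = [name for name, _ in attrs]
        if len(names) != len(set(names)):
            # Illustrative check only: report instead of aborting the parse.
            self.warning("duplicate attribute in <%s>" % tag)


def parse(html):
    """Feed the whole document and flush buffered input via close()."""
    parser = CheckingParser()
    parser.feed(html)
    parser.close()
    return parser


page = parse('<script type="text/javascript"language="JavaScript1.1"></script>')
print(page.start_tags, page.warning_count)
```

Subclasses that want stricter behaviour could override warning() to raise, which keeps the choice between "count and continue" and "stop at the first problem" in one place.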


-robert

History
Date	User	Action	Args
2007-08-23 15:48:49	admin	link	issue1486713 messages
2007-08-23 15:48:49	admin	create