Message 257472 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	jason_s
Recipients	jason_s
Date	2016-01-04.17:35:38
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1451928938.48.0.712719588041.issue26009@psf.upfronthosting.co.za>
In-reply-to

Content
The HTMLParser class (https://docs.python.org/2/library/htmlparser.html) is lacking a few features to reconstruct input exactly. For the most part it can do this, but I found two items where it falls short (there may be others): - There is a get_starttag_text() method but no get_endtag_text() method, which is necessary if the end tag is not in canonical form, e.g. instead of </p> it is </P> or </ P > - The effect of the parse_bogus_comment() internal method is to call handle_comment(), so content like <! I AM BOGUS > cannot be distinguished by subclasses of HTMLParser from actual comments <!-- I AM BOGUS --> Suggested changes: - Add a get_endtag_text() method to return the exact endtag text - change parse_bogus_comment to call self.handle_bogus_comment(), and define self.handle_bogus_comment() to call self.handle_comment(). This way it is backwards-compatible with existing behavior, but subclasses can redefine self.handle_bogus_comment() to do what they want.

The HTMLParser class (https://docs.python.org/2/library/htmlparser.html) is lacking a few features to reconstruct input exactly. For the most part it can do this, but I found two items where it falls short (there may be others):

- There is a get_starttag_text() method but no get_endtag_text() method, which is necessary if the end tag is not in canonical form, e.g. instead of </p> it is </P> or </   P >

- The effect of the parse_bogus_comment() internal method is to call handle_comment(), so content like <! I AM BOGUS > cannot be distinguished by subclasses of HTMLParser from actual comments <!-- I AM BOGUS -->

Suggested changes:

- Add a get_endtag_text() method to return the exact endtag text
- change parse_bogus_comment to call self.handle_bogus_comment(), and define self.handle_bogus_comment() to call self.handle_comment(). This way it is backwards-compatible with existing behavior, but subclasses can redefine self.handle_bogus_comment() to do what they want.

History
Date	User	Action	Args
2016-01-04 17:35:38	jason_s	set	recipients: + jason_s
2016-01-04 17:35:38	jason_s	set	messageid: <1451928938.48.0.712719588041.issue26009@psf.upfronthosting.co.za>
2016-01-04 17:35:38	jason_s	link	issue26009 messages
2016-01-04 17:35:38	jason_s	create