Title: HTMLParser lacking a few features to reconstruct input exactly
Author: Jason Sachs (jason_s) * Date: 2016-01-04 17:35
The HTMLParser class is lacking a few features needed to reconstruct its input exactly. For the most part it can do this, but I found two items where it falls short (there may be others):

- There is a get_starttag_text() method but no get_endtag_text() method. The latter is needed when the end tag is not in canonical form, e.g. instead of </p> it is </P> or </   P >

- The effect of the parse_bogus_comment() internal method is to call handle_comment(), so content like <! I AM BOGUS > cannot be distinguished by subclasses of HTMLParser from actual comments <!-- I AM BOGUS -->
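Both limitations can be observed with the public API alone. The following is a minimal sketch; EventLogger and events_for are hypothetical names introduced only for this illustration, not part of the attached sample file.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records every handler call as a tuple."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.events.append(("start", tag, self.get_starttag_text()))

    def handle_endtag(self, tag):
        # Only the lowercased tag name is available here; there is no
        # public counterpart to get_starttag_text() for end tags.
        self.events.append(("end", tag))

    def handle_comment(self, data):
        # Called for real comments and for bogus ones alike.
        self.events.append(("comment", data))


def events_for(markup):
    """Parse markup and return the list of recorded handler calls."""
    parser = EventLogger()
    parser.feed(markup)
    parser.close()
    return parser.events


print(events_for("<P Class=x>"))   # original start tag text is preserved
print(events_for("</P>"))          # original end tag text is lost
print(events_for("<! I AM BOGUS >") == events_for("<!-- I AM BOGUS -->"))
```

The last comparison is the point of the second item: a subclass sees the same handle_comment() call for both inputs, so it cannot tell which form appeared in the source.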

Suggested changes:

- Add a get_endtag_text() method to return the exact endtag text
- Change parse_bogus_comment() to call self.handle_bogus_comment(), and define self.handle_bogus_comment() to call self.handle_comment(). This way it is backwards-compatible with existing behavior, but subclasses can redefine self.handle_bogus_comment() to do what they want.
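
The two suggested changes can be approximated today in a subclass. This is only a rough sketch under stated assumptions, not a proposed patch: it overrides parse_endtag() and parse_bogus_comment() and reads self.rawdata, all of which are undocumented internals whose behavior may differ between Python versions. VerbatimSketch, Reconstructor and reconstruct are hypothetical names used for illustration.

```python
from html.parser import HTMLParser


class VerbatimSketch(HTMLParser):
    """Hypothetical subclass illustrating the two suggested hooks."""

    def __init__(self):
        super().__init__()
        self._endtag_text = None

    # Suggestion 1: remember the exact end tag text.
    def parse_endtag(self, i):
        # Assumption: the end tag runs from position i to the next '>'.
        gt = self.rawdata.find(">", i + 2)
        self._endtag_text = self.rawdata[i:gt + 1] if gt != -1 else None
        # The base class calls handle_endtag() from inside parse_endtag(),
        # so the text has to be stored before delegating.
        return super().parse_endtag(i)

    def get_endtag_text(self):
        return self._endtag_text

    # Suggestion 2: route bogus comments through their own handler.
    def parse_bogus_comment(self, i, report=1):
        rawdata = self.rawdata
        pos = rawdata.find(">", i + 2)
        if pos == -1:
            return -1  # incomplete: wait for more data
        if report:
            self.handle_bogus_comment(rawdata[i + 2:pos])
        return pos + 1

    def handle_bogus_comment(self, data):
        # Default keeps the existing behavior.
        self.handle_comment(data)


class Reconstructor(VerbatimSketch):
    """Hypothetical consumer that rebuilds the input text."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_endtag(self, tag):
        self.out.append(self.get_endtag_text())

    def handle_comment(self, data):
        self.out.append("<!--" + data + "-->")

    def handle_bogus_comment(self, data):
        # Assumes the bogus comment started with '<!'; the internal method
        # can also be reached for input starting with '</', which this
        # sketch does not distinguish.
        self.out.append("<!" + data + ">")


def reconstruct(markup):
    parser = Reconstructor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


print(reconstruct("<! I AM BOGUS ><!-- real --></P>"))
```

Because handle_bogus_comment() falls back to handle_comment() by default, a subclass that does not override it behaves as before.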
Author: Jason Sachs (jason_s) * Date: 2016-01-04 17:36
sample file attached containing VerbatimParser
Author: Jason Sachs (jason_s) * Date: 2016-01-04 17:44
sample file test1.html attached.

When running VerbatimParser on it, the output is identical to the input except for two things:

test1.html  contains <!DAMMIT HTML PUBLIC CRAP>
test1b.html contains <!--DAMMIT HTML PUBLIC CRAP-->

test1.html contains end tags that are capitalized, e.g. </P>, or that have spaces, e.g. </  goober   >
test1b.html contains end tags that are canonicalized to lowercase and without spaces, e.g. </p> and </goober>
Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2016-01-08 17:46
What is your use case?
Also note that new features can only go on 3.6.
