Issue 26009: HTMLParser lacking a few features to reconstruct input exactly

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70197

classification

Title:	HTMLParser lacking a few features to reconstruct input exactly
Type:	enhancement
Stage:	test needed
Components:
Versions:	Python 3.6

process

Status:	open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List:	ezio.melotti, jason_s
Priority:	normal
Keywords:

Created on 2016-01-04 17:35 by jason_s, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description
test2.py	jason_s, 2016-01-04 17:36
test1.html	jason_s, 2016-01-04 17:44

Messages (4)
msg257472	Author: Jason Sachs (jason_s) *	Date: 2016-01-04 17:35
The HTMLParser class (https://docs.python.org/2/library/htmlparser.html) is lacking a few features to reconstruct input exactly. For the most part it can do this, but I found two items where it falls short (there may be others):

- There is a get_starttag_text() method but no get_endtag_text() method, which is necessary if the end tag is not in canonical form, e.g. instead of </p> it is </P> or </ P >
- The effect of the parse_bogus_comment() internal method is to call handle_comment(), so content like <! I AM BOGUS > cannot be distinguished by subclasses of HTMLParser from actual comments <!-- I AM BOGUS -->

Suggested changes:

- Add a get_endtag_text() method to return the exact endtag text.
- Change parse_bogus_comment() to call self.handle_bogus_comment(), and define self.handle_bogus_comment() to call self.handle_comment(). This way it is backwards-compatible with existing behavior, but subclasses can redefine self.handle_bogus_comment() to do what they want.
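
For illustration only, the two suggested changes could be prototyped in a subclass roughly as follows. This is a sketch, not a patch: it overrides the internal parse_endtag() and parse_bogus_comment() methods of Python 3's html.parser.HTMLParser, which may change between versions, and the names HookedParser, Verbatim and roundtrip are made up for this example. The end tag text is approximated as everything up to the next '>'.

```python
from html.parser import HTMLParser


class HookedParser(HTMLParser):
    """Sketch of the proposal; relies on HTMLParser internals."""

    def __init__(self):
        super().__init__()
        self._endtag_text = None

    def parse_endtag(self, i):
        # Remember the raw text up to the next '>' before the base class
        # calls handle_endtag(). An approximation for simple end tags.
        j = self.rawdata.find(">", i)
        self._endtag_text = self.rawdata[i:j + 1] if j != -1 else None
        return super().parse_endtag(i)

    def get_endtag_text(self):
        # Hypothetical counterpart of get_starttag_text().
        return self._endtag_text

    def parse_bogus_comment(self, i, report=1):
        # Same scan as the stock method, but dispatch to a separate hook.
        pos = self.rawdata.find(">", i + 2)
        if pos == -1:
            return -1
        if report:
            self.handle_bogus_comment(self.rawdata[i + 2:pos])
        return pos + 1

    def handle_bogus_comment(self, data):
        # Default keeps the existing behavior.
        self.handle_comment(data)


class Verbatim(HookedParser):
    """Example subclass that re-emits its input using the new hooks."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        self.out.append(self.get_endtag_text())

    def handle_data(self, data):
        self.out.append(data)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_bogus_comment(self, data):
        # Assumes the bogus comment was introduced by '<!'.
        self.out.append("<!%s>" % data)


def roundtrip(html):
    parser = Verbatim()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

With these hooks, roundtrip("<P>x</P><! I AM BOGUS ><!-- real -->") should return its input unchanged, while a subclass that does not override handle_bogus_comment() keeps seeing bogus comments through handle_comment() as before.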
msg257473	Author: Jason Sachs (jason_s) *	Date: 2016-01-04 17:36
Sample file test2.py attached, containing VerbatimParser.
msg257475	Author: Jason Sachs (jason_s) *	Date: 2016-01-04 17:44
Sample file test1.html attached. When running test2.py on it, the output is identical except for two things:

test1.html contains <!DAMMIT HTML PUBLIC CRAP>
test1b.html contains <!--DAMMIT HTML PUBLIC CRAP-->

test1.html contains end tags that are capitalized e.g. </P> or have spaces </ goober >
test1b.html contains end tags that are canonicalized to lowercase and without spaces e.g. </p> and </goober>
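
The two differences can be reproduced in a few lines with the stock parser. The sketch below is not the attached test2.py; EchoParser and rebuild are hypothetical names, and it uses Python 3's html.parser module. It only exercises the capitalized end tag and the bogus comment, since handling of end tags with embedded spaces may vary between Python versions.

```python
from html.parser import HTMLParser


class EchoParser(HTMLParser):
    """Hypothetical stand-in for VerbatimParser: re-emits what it is told."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # The exact source text of the start tag is available.
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Only the lowercased name is passed in, so the original
        # spelling of the end tag cannot be recovered here.
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_comment(self, data):
        # Real and bogus comments both arrive here with the same payload.
        self.out.append("<!--%s-->" % data)


def rebuild(html):
    parser = EchoParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


print(rebuild("<P>hi</P><! I AM BOGUS >"))
# <P>hi</p><!-- I AM BOGUS -->
```

The start tag survives as written because get_starttag_text() exists, while the end tag and the bogus comment come back in canonical form.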
msg257770	Author: Ezio Melotti (ezio.melotti) *	Date: 2016-01-08 17:46
What is your use case? Also note that new features can only go on 3.6.

History
Date	User	Action	Args
2022-04-11 14:58:25	admin	set	github: 70197
2016-01-08 18:26:03	terry.reedy	set	type: behavior -> enhancement; stage: test needed
2016-01-08 17:46:11	ezio.melotti	set	nosy: + ezio.melotti; messages: + msg257770; versions: + Python 3.6, - Python 2.7
2016-01-04 17:44:59	jason_s	set	files: + test1.html; messages: + msg257475
2016-01-04 17:36:45	jason_s	set	files: + test2.py; messages: + msg257473
2016-01-04 17:35:38	jason_s	create