Issue 6662: HTMLParser.HTMLParser doesn't handle malformed charrefs


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50911

classification

Title:	HTMLParser.HTMLParser doesn't handle malformed charrefs
Type:	behavior	Stage:
Components:	Library (Lib)	Versions:	Python 2.4, Python 2.7, Python 2.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	dayveday, eric.araujo, ezio.melotti, fredrik.haard, vstinner
Priority:	high	Keywords:	patch

Created on 2009-08-07 01:25 by dayveday, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
Issue6662.patch	fredrik.haard, 2010-04-28 13:19

Messages (5)
msg91392 - (view)	Author: Dave Day (dayveday)	Date: 2009-08-07 01:25
When HTMLParser.HTMLParser encounters a malformed charref (for example &#bad;), it no longer parses the following HTML correctly. For example, given

    <p>&#bad;</p>

it recognises the starttag "p" but considers the rest to be data.

To reproduce:

    import HTMLParser

    class MyParser(HTMLParser.HTMLParser):
        def handle_starttag(self, tag, attrs):
            print 'Start "%s"' % tag

        def handle_endtag(self, tag):
            print 'End "%s"' % tag

        def handle_charref(self, ref):
            print 'Charref "%s"' % ref

        def handle_data(self, data):
            print 'Data "%s"' % data

    parser = MyParser()
    parser.feed('<p>&#bad;</p>')
    parser.close()

Expected output:

    Start "p"
    Data "&#bad;"
    End "p"

Actual output:

    Start "p"
    Data "&#bad;</p>"
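For anyone checking this on Python 3, where the module is named html.parser, the reproducer can be sketched roughly as below. This is a port, not part of the original report. The names EventCollector and parse_events are hypothetical helpers, and callbacks are recorded in a list instead of printed so the result can be inspected. How the text is split across handle_data calls may vary with the parser's settings, so the sketch does not assume a particular chunking.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parser callbacks instead of printing them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_charref(self, name):
        self.events.append(("charref", name))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse_events(markup):
    """Feed markup to a fresh parser and return the recorded callbacks."""
    parser = EventCollector()
    parser.feed(markup)
    parser.close()
    return parser.events


# The input from the report: a malformed charref inside a <p> element.
events = parse_events("<p>&#bad;</p>")
for kind, value in events:
    print(kind, repr(value))
```

On a fixed parser the closing tag should appear as its own "end" event and not be swallowed into the data.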
msg91950 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-08-25 11:37
Confirmed on Python 3.1 too.
msg104367 - (view)	Author: Fredrik Håård (fredrik.haard)	Date: 2010-04-27 21:38
Is there a reason for HTMLParser to treat anything that does not match the regex '&#\d+;' as a charref?
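To make the question concrete, here is a minimal sketch of a strict check for numeric charrefs. It is an illustration only, not the pattern HTMLParser uses and not the attached patch. It assumes the hexadecimal form (&#x41;) should be accepted alongside the decimal form that '&#\d+;' covers, and the names NUMERIC_CHARREF and is_numeric_charref are hypothetical.

```python
import re

# Assumption: a well-formed numeric charref is "&#" + decimal digits + ";"
# or "&#x" / "&#X" + hexadecimal digits + ";".
NUMERIC_CHARREF = re.compile(r"&#(?:[0-9]+|[xX][0-9a-fA-F]+);")


def is_numeric_charref(text):
    """Return True only if the whole string is a well-formed numeric charref."""
    return NUMERIC_CHARREF.fullmatch(text) is not None


for candidate in ("&#65;", "&#x41;", "&#bad;", "&#;"):
    print(candidate, is_numeric_charref(candidate))
```

Under this check "&#bad;" is rejected even though b, a and d are hexadecimal digits, because the x prefix is missing. A parser using such a test could pass the text through as plain data.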
msg104428 - (view)	Author: Fredrik Håård (fredrik.haard)	Date: 2010-04-28 13:19
Confirmed on trunk. Attached what I think is a minimal patch to fix it, together with a tweak to the existing unit test case to verify it.
msg106401 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-24 21:48
Committed: 2.7 (r81500, r81501), 2.6 (r81503), 3.2 (r81504), 3.1 (r81505).

History
Date	User	Action	Args
2022-04-11 14:56:51	admin	set	github: 50911
2010-05-24 21:48:50	vstinner	set	status: open -> closed nosy: + vstinner messages: + msg106401 resolution: fixed
2010-05-24 21:21:11	eric.araujo	set	nosy: + eric.araujo
2010-04-28 13:19:09	fredrik.haard	set	files: + Issue6662.patch keywords: + patch messages: + msg104428 versions: + Python 2.7
2010-04-27 21:38:24	fredrik.haard	set	messages: + msg104367
2010-04-27 21:31:39	fredrik.haard	set	nosy: + fredrik.haard
2009-08-25 11:37:36	ezio.melotti	set	priority: high nosy: + ezio.melotti messages: + msg91950
2009-08-07 01:25:08	dayveday	create