Issue 683938: HTMLParser attribute parsing bug


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/37954

classification

Title:	HTMLParser attribute parsing bug
Type:	enhancement	Stage:	test needed
Components:	Library (Lib)	Versions:	Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	fdrake	Nosy List:	BreamoreBoy, calvin, fdrake, r.david.murray, smroid, titus
Priority:	normal	Keywords:	easy

Created on 2003-02-10 14:57 by fdrake, last changed 2022-04-10 16:06 by admin. This issue is now closed.

Messages (6)
msg60305 - (view)	Author: Fred Drake (fdrake)	Date: 2003-02-10 14:57
HTMLParser (reportedly) fails to parse this construct: <a href="http://ss"title="pe">P</a> (Note that a required space between the two attributes of the "a" tag has been omitted). The W3C validator apparently treats this differently, so there's no point in arguing the letter of the law. Assigned to me.
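For illustration only (not part of the original report), here is a minimal sketch of how the construct can be fed to Python 3's `html.parser.HTMLParser`, the successor of the Python 2 `HTMLParser` module discussed here. `AttrCollector` and `collect_start_tags` are hypothetical helper names, and the behavior of the Python 2.x releases mentioned in this thread may differ from what a current interpreter does.

```python
from html.parser import HTMLParser  # Python 3 location of the parser class


class AttrCollector(HTMLParser):
    """Hypothetical helper: records (tag, attrs) for every start tag seen."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs supplied by the parser
        self.start_tags.append((tag, attrs))


def collect_start_tags(markup):
    """Feed markup to a fresh parser and return the start tags it reported."""
    parser = AttrCollector()
    parser.feed(markup)
    parser.close()
    return parser.start_tags


if __name__ == "__main__":
    # The construct from the report: no space between the two attributes.
    print(collect_start_tags('<a href="http://ss"title="pe">P</a>'))
```

If the returned list is empty or lacks the `title` attribute, the parser in use has not recovered from the missing space.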
msg60306 - (view)	Author: Bastian Kleineidam (calvin)	Date: 2003-03-31 10:44
HTMLParser (and lots of other parsers I tried) definitely has limits when it comes to error recovery. I don't know if it's worth putting further development effort into HTMLParser, as it will IMHO never be able to cope with all the crappy HTML out there. If you really want an HTML parser in Python, I suggest you look at my htmlsax module packaged with linkchecker (linkchecker.sf.net) and webcleaner (webcleaner.sf.net); the parser is tested with lots of real-world examples. The parser packaged with linkchecker has line counting; the one with webcleaner does not. Cheers, Bastian
msg60307 - (view)	Author: Steven Rosenthal (smroid)	Date: 2003-05-14 05:12
Two troublesome input examples:

    <table border=0 width="100%"cellspacing=0 cellpadding=0>
    <option selected value=>

Here's a fix I came up with in HTMLParser.py: replace the definition of locatestarttagend with:

    locatestarttagend = re.compile(r"""
      <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
      \s*                                # whitespace after tag name
      (?:
        (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
          (?:\s*=\s*                     # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |\"[^\"]*\"                # LIT-enclosed value
              |[^'\">\s]+                # bare value
            )?
          )?
        )
        \s*                              # whitespace between attrs
      )*
      \s*                                # trailing whitespace
    """, re.VERBOSE)
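As a rough check of the proposed pattern, the sketch below compiles it standalone and tests whether it scans each troublesome tag up to the closing `>`. It assumes the quantifiers read `\s*=\s*`, `'[^']*'` and `\"[^\"]*\"`. `scanned_up_to_close` and `SAMPLES` are hypothetical names used only here, and the check exercises the regular expression in isolation, not the rest of the parser.

```python
import re

# Pattern from the comment above (quantifiers assumed as stated in the lead-in).
locatestarttagend = re.compile(r"""
  <[a-zA-Z][-.a-zA-Z0-9:_]*          # tag name
  \s*                                # whitespace after tag name
  (?:
    (?:[a-zA-Z_][-.:a-zA-Z0-9_]*     # attribute name
      (?:\s*=\s*                     # value indicator
        (?:'[^']*'                   # LITA-enclosed value
          |\"[^\"]*\"                # LIT-enclosed value
          |[^'\">\s]+                # bare value
        )?
      )?
    )
    \s*                              # whitespace between attrs
  )*
  \s*                                # trailing whitespace
""", re.VERBOSE)

# Start tags quoted in this thread.
SAMPLES = [
    '<a href="http://ss"title="pe">',
    '<table border=0 width="100%"cellspacing=0 cellpadding=0>',
    '<option selected value=>',
]


def scanned_up_to_close(tag_text):
    """Hypothetical check: True if the pattern consumes all but the final '>'."""
    match = locatestarttagend.match(tag_text)
    return match is not None and tag_text[match.end():] == ">"


if __name__ == "__main__":
    for sample in SAMPLES:
        print(sample, scanned_up_to_close(sample))
```

Because the value after `=` is optional and whitespace between attributes is `\s*` rather than `\s+`, both the missing-space case and the empty `value=` case can be scanned through to the end of the tag.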
msg60308 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2004-03-11 04:53
I'm using Python 2.3.3. I note that bug 699079, which addresses this same issue, was closed as "not a bug". As far as I can tell, the current behavior of HTMLParser, unlike what was reported in that bug report, is to silently stop parsing. This is a problem: it took me quite a while to track down why my application wasn't working, whereas if an exception had been generated I'd have figured it out right quick. If it's going to stop parsing when the error occurs, then I'd much rather it generate an exception. I can always trap the exception if I want to keep going. Since it apparently used to work that way, I'm hoping a quick poke through CVS by someone knowledgeable with the code can restore the exception behavior, pending a more satisfactory resolution to the problem.
msg60309 - (view)	Author: Titus Brown (titus)	Date: 2004-12-19 00:34
In response to rdmurray's comment: in Python 2.4, at least, an exception is raised. Not sure why this bug is being kept open... but see bug 736428 and patch 755660 for related issues.
msg114217 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-18 13:15
Closed as fixed in r23322.

History
Date	User	Action	Args
2022-04-10 16:06:42	admin	set	github: 37954
2010-08-18 13:15:34	BreamoreBoy	set	status: open -> closed nosy: + BreamoreBoy messages: + msg114217 resolution: fixed
2009-04-22 18:49:33	ajaksu2	set	keywords: + easy versions: + Python 2.7
2009-02-12 03:25:02	ajaksu2	set	type: enhancement stage: test needed
2009-02-12 03:01:19	ajaksu2	link	issue755670 dependencies
2003-02-10 14:57:40	fdrake	create