Issue 1055864: HTMLParser not compliant to XHTML spec



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/41093

classification

Title:	HTMLParser not compliant to XHTML spec
Type:	enhancement	Stage:	test needed
Components:	Library (Lib)	Versions:	Python 2.7

process

Status:	closed	Resolution:	wont fix
Dependencies:	1051840	Superseder:
Assigned To:	fdrake	Nosy List:	BreamoreBoy, fdrake, loewis, neptune235
Priority:	normal	Keywords:	easy

Created on 2004-10-28 04:59 by neptune235, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
myHTMLParser.py	neptune235, 2004-10-28 08:33	HTMLParser with last_tag attribute.

Messages (6)
msg22920	Author: Luke Bradley (neptune235)	Date: 2004-10-28 04:59
HTMLParser has a problem: it doesn't seem to comply with the XHTML spec. What I am referring to can be read about here: http://www.w3.org/TR/xhtml1/#h-4.8 In a nutshell, HTMLParser doesn't treat data inside 'script' or 'style' elements as #PCDATA, but rather behaves like an HTML 4 parser even for XHTML documents, parsing only end tags. As a result, entity references in JavaScript are not converted as they should be. XHTML authors writing to spec can expect entities in script sections of XHTML documents to be converted if the script is not explicitly escaped as a CDATA section. That brings up problem two: sections explicitly escaped as CDATA are also parsed as HTML 4 'script' and 'style' sections, so end tags are parsed. My understanding is that this is bad as well: http://www.w3.org/TR/2004/REC-xml-20040204/#dt-cdsection because CDEnd is the only thing that's supposed to be parsed in a CDATA section for all XML documents?
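The behaviour described in this message can be observed with a small script. This is only a sketch, and it assumes Python 3's `html.parser` module (the successor of the Python 2 `HTMLParser` module the report is about), so details may differ from the 2004 code. The `Collector` class and the `parse` and `text_of` helpers are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class Collector(HTMLParser):
    """Record every start tag, end tag and data event the parser emits."""

    def __init__(self):
        # convert_charrefs defaults to True in Python 3.5+, so entity
        # references in ordinary text are converted before handle_data.
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


def parse(markup):
    """Feed one complete document and return the recorded events."""
    parser = Collector()
    parser.feed(markup)
    parser.close()
    return parser.events


def text_of(events):
    """Join all data events; the parser may deliver data in several chunks."""
    return "".join(value for kind, value in events if kind == "data")


# Entity reference inside a script element: passed through unconverted,
# HTML 4 style, rather than being treated as #PCDATA.
script_events = parse("<script>if (a &lt; b) go();</script>")

# The same entity reference inside an ordinary element: converted.
para_events = parse("<p>a &lt; b</p>")

# An XHTML-style CDATA section inside a script element: the markers are
# not interpreted, they simply become part of the script data.
cdata_events = parse('<script><![CDATA[ x = "</b>"; ]]></script>')

if __name__ == "__main__":
    print(repr(text_of(script_events)))
    print(repr(text_of(para_events)))
    print(repr(text_of(cdata_events)))
```

Comparing the first two results shows the difference the report complains about: the parser follows HTML rules for `script` content regardless of whether the document is XHTML.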
msg22921	Author: Luke Bradley (neptune235)	Date: 2004-10-28 08:31
I also reported bug 1051840. I discovered this when I was looking for a universal way to handle all the weird things people do with their script sections on HTML/XHTML pages on the net. I've ended up modifying HTMLParser.py so that the HTMLParser class has an extra attribute called last_match, which holds the exact string of HTML that triggered whatever handler event is being called. Putting sys.stdout.write(self.last_match) or sys.stdout.write(self.get_last_match()) in every handler (except handle_data, whose argument can be output directly) will then output the page exactly as it was input. This allows me to handle all oddities in people's code at the level of my handler, without changing HTMLParser in any other way. Here's the code, attached. Not that you care, but on the off chance that you guys might want to think about doing something like this... :)
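The attached file is not reproduced on this page, so the following is not the reporter's code. It is a rough sketch of the same pass-through idea using only documented `html.parser` calls in Python 3 (an assumption, since the report concerns the Python 2 module). `get_starttag_text()` is a real method that returns the source text of the most recently opened start tag; `EchoParser` and `echo` are hypothetical names.

```python
import io
from html.parser import HTMLParser


class EchoParser(HTMLParser):
    """Write each parsed construct back out as closely as the API allows."""

    def __init__(self, out):
        # convert_charrefs=False keeps entity and character references as
        # separate events so they can be written back unchanged.
        super().__init__(convert_charrefs=False)
        self.out = out

    def handle_starttag(self, tag, attrs):
        # Exact source text of the start tag, including original case.
        self.out.write(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are echoed once, verbatim.
        self.out.write(self.get_starttag_text())

    def handle_endtag(self, tag):
        # No verbatim text is available here; the tag is rebuilt from its
        # lower-cased name, so this is an approximation.
        self.out.write("</%s>" % tag)

    def handle_data(self, data):
        self.out.write(data)

    def handle_entityref(self, name):
        self.out.write("&%s;" % name)

    def handle_charref(self, name):
        self.out.write("&#%s;" % name)

    def handle_comment(self, data):
        self.out.write("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.write("<!%s>" % decl)


def echo(markup):
    """Parse markup and return what the handlers wrote back out."""
    out = io.StringIO()
    parser = EchoParser(out)
    parser.feed(markup)
    parser.close()
    return out.getvalue()


page = '<p CLASS="x">a &amp; b<br/><script>if (a &lt; b) go();</script></p>'

if __name__ == "__main__":
    print(echo(page) == page)
```

Unlike the last_match attribute described above, this sketch cannot reproduce end tags character for character, because the parser hands `handle_endtag` only the lower-cased tag name. A handler that needs the exact source text for every event would still require a change along the lines the reporter proposes.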
msg22922	Author: Martin v. Löwis (loewis) *	Date: 2004-10-28 19:41
Can you give an example demonstrating this problem, please? A Python script with a small embedded HTML file, and a PASS/FAIL condition would be best.
msg22923	Author: Luke Bradley (neptune235)	Date: 2004-10-28 22:23
Sure. I'll attach it as a file: tidytest2.py. By the way, I'm no guru, so tell me if I'm misinterpreting the W3C. I'm just trying to use HTMLParser in such a way that it won't mangle anybody's script sections, and I want to have all my bases covered.
msg114388	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-08-19 18:04
I think this should be closed as it's similar to #1051840, agreed?
msg114403	Author: Fred Drake (fdrake)	Date: 2010-08-19 18:51
Indeed it is. Closing, won't fix. HTMLParser tries to deal with XHTML constructs only so much as HTML ends up with that stuff, not because it's trying to handle everything. (The claimed example appears not to have been attached, anyway.)

History
Date	User	Action	Args
2022-04-11 14:56:07	admin	set	github: 41093
2010-08-19 18:51:11	fdrake	set	status: open -> closed resolution: wont fix messages: + msg114403
2010-08-19 18:04:38	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg114388
2009-04-22 16:04:33	ajaksu2	set	keywords: + easy
2009-02-14 18:18:58	ajaksu2	set	dependencies: + HTMLParser doesn't treat endtags in <script> tags as CDATA type: enhancement stage: test needed versions: + Python 2.7
2004-10-28 04:59:36	neptune235	create