classification
Title: HTMLParser not compliant to XHTML spec
Type: enhancement
Stage: test needed
Components: Library (Lib)
Versions: Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies: 1051840
Superseder:
Assigned To: fdrake
Nosy List: BreamoreBoy, fdrake, loewis, neptune235
Priority: normal
Keywords: easy

Created on 2004-10-28 04:59 by neptune235, last changed 2010-08-19 18:51 by fdrake. This issue is now closed.

Files
File name Uploaded Description
myHTMLParser.py neptune235, 2004-10-28 08:33 HTMLParser with last_tag attribute.
Messages (6)
msg22920 - Author: Luke Bradley (neptune235) Date: 2004-10-28 04:59
HTMLParser has a problem related to the fact that it
doesn't seem to comply with the spec for XHTML. What I am
referring to can be read about here:
http://www.w3.org/TR/xhtml1/#h-4.8
In a nutshell, HTMLParser doesn't treat data inside
'script' or 'style' elements as #PCDATA, but rather
behaves like an HTML 4 parser even for XHTML documents,
parsing only end tags. As a result, entity references
in javascript are not converted as they should be.
XHTML authors writing to spec can expect entities in
script sections of XHTML documents to be converted if
the script is not explicitly escaped as a CDATA
section. Which brings up problem two: sections
explicitly escaped as CDATA are also parsed as HTML 4
'script' and 'style' sections, so end tags are parsed.
My understanding is that this is bad as well:
http://www.w3.org/TR/2004/REC-xml-20040204/#dt-cdsection
because CDEnd is the only thing that's supposed to be
recognized in a CDATA section for all XML documents?
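
The first problem can be illustrated with a small PASS/FAIL script. This is only a sketch and not the reporter's test case: it assumes Python 3's html.parser (the successor of the Python 2 HTMLParser module discussed here), and the DataCollector class and collect() helper are hypothetical names introduced for the illustration. The PASS condition encodes the XHTML reading of section 4.8, under which the entity inside 'script' would be decoded like any other #PCDATA.

```python
from html.parser import HTMLParser


class DataCollector(HTMLParser):
    """Hypothetical helper: accumulates text per enclosing element name."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # names of currently open elements
        self.text = {}    # element name -> concatenated character data

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Data may arrive in several chunks, so concatenate them.
        if self.stack:
            key = self.stack[-1]
            self.text[key] = self.text.get(key, "") + data


def collect(html):
    parser = DataCollector()
    parser.feed(html)
    parser.close()
    return parser.text


XHTML = "<p>a &lt; b</p><script>if (a &lt; b) x();</script>"
result = collect(XHTML)

# Under the XHTML reading, the script text would be decoded like the
# paragraph text. html.parser documents that references inside
# script/style are left unconverted, so this is expected to report FAIL.
print("PASS" if result["script"] == "if (a < b) x();" else "FAIL")
```

The paragraph text comes back decoded while the script text keeps its literal "&lt;", which is the HTML 4 style CDATA treatment the report describes.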

msg22921 - Author: Luke Bradley (neptune235) Date: 2004-10-28 08:31
I also reported bug 1051840. I discovered this when I was
looking for a universal way to handle all the weird things
people do with their script sections on HTML/XHTML pages on
the net. I've ended up modifying HTMLParser.py so that the
HTMLParser class has an extra attribute called last_match,
which holds the exact string of HTML that triggered whatever
handler event is being called. Putting:
sys.stdout.write(self.last_match)
or
sys.stdout.write(self.get_last_match())
in every handler (except handle_data, whose argument can be
output directly) will output the page exactly as it was
input. This allows me to handle all oddities in people's
code at the level of my handler, without changing HTMLParser
in any other way...
Here's the code, attached. Not that you care, but on the off
chance that you guys might want to think about doing
something like this....:)
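
The attached file is not reproduced on this page, so the following is not the reporter's patch. It is a rough sketch of the same echo-the-input idea using only the public API of Python 3's html.parser (an assumption; the report concerns the Python 2 module). EchoParser and echo() are hypothetical names. Start tags are echoed verbatim via get_starttag_text(); everything else is rebuilt from the handler arguments, so it is only an approximation of a true last_match attribute.

```python
from html.parser import HTMLParser


class EchoParser(HTMLParser):
    """Hypothetical sketch: re-emits the markup seen by each handler."""

    def __init__(self):
        # Keep character references as separate events so they can be
        # written back unchanged instead of being decoded.
        super().__init__(convert_charrefs=False)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # get_starttag_text() returns the start tag exactly as written.
        self.out.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        # Rebuilt, not verbatim: case and inner whitespace are lost.
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)

    def handle_comment(self, data):
        self.out.append("<!--%s-->" % data)

    def handle_decl(self, decl):
        self.out.append("<!%s>" % decl)

    def handle_pi(self, data):
        self.out.append("<?%s>" % data)


def echo(html):
    parser = EchoParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


SRC = '<p class="x">a &lt; b<br/></p><script>if (a &lt; b) x();</script>'
print(echo(SRC) == SRC)
```

Because end tags and references are reconstructed rather than copied, input such as an upper-case end tag or a reference without its trailing semicolon would probably not round-trip exactly; a verbatim attribute like the proposed last_match would avoid that.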
msg22922 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2004-10-28 19:41
Can you give an example demonstrating this problem, please?
A Python script with a small embedded HTML file, and a
PASS/FAIL condition would be best.
msg22923 - Author: Luke Bradley (neptune235) Date: 2004-10-28 22:23
Sure. I'll attach it as a file: tidytest2.py

btw: I'm no guru so tell me if I'm misinterpreting the W3C.
I'm just trying to use HTMLParser in such a way that it
won't mangle anybody's script sections, and I want to have
all my bases covered.
msg114388 - Author: Mark Lawrence (BreamoreBoy) * Date: 2010-08-19 18:04
I think this should be closed as it's similar to #1051840, agreed?
msg114403 - Author: Fred Drake (fdrake) (Python committer) Date: 2010-08-19 18:51
Indeed it is.  Closing, won't fix.

HTMLParser deals with XHTML constructs only insofar as they turn up in HTML, not because it is trying to handle everything.

(The claimed example appears not to have been attached, anyway.)
History
Date User Action Args
2010-08-19 18:51:11 fdrake set status: open -> closed
resolution: wont fix
messages: + msg114403
2010-08-19 18:04:38 BreamoreBoy set nosy: + BreamoreBoy
messages: + msg114388
2009-04-22 16:04:33 ajaksu2 set keywords: + easy
2009-02-14 18:18:58 ajaksu2 set dependencies: + HTMLParser doesn't treat endtags in <script> tags as CDATA
type: enhancement
stage: test needed
versions: + Python 2.7
2004-10-28 04:59:36 neptune235 create