Author ezio.melotti
Recipients Hunanyan, Matt.Basta, cpalmer, eric.araujo, ezio.melotti, fantoozler, fdrake, friday, georg.brandl, gsf, momat, orsenthil, r.david.murray, yotam
Date 2011-07-27.06:52:14
Message-id <1311749535.43.0.630839370937.issue670664@psf.upfronthosting.co.za>
In-reply-to
Content
I left a review of your patch on Rietveld, including a description of what I think is going on there (the patch lacks some context, and it's not easy to figure out how everything works).
I also did some tests with and without the patch:

>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
...   def handle_data(self, data): print 'data: %r' % data
... 
>>> myhp = MyHP()

# without the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # this looks ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo'  # where's the </p>?
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247


# with the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # all the content is there, but why 2 chunks?
data: '</p>'
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # same as previous
data: '</p>'
data: '<span>bar'
data: '</span>'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")  
data: '<p>foo' # same
data: '</p>'
data: " '"
data: "</scr'+'ipt>"
data: "' <span>bar"
data: '</span>'

So my question is: is it normal that the data is passed to handle_data in chunks?
AFAIU, HTML parsers should treat CDATA as a single chunk of bytes they don't care about, so the fact that further parsing happens on the content of script/style seems wrong to me.
If I'm reading the code correctly, that's because the "interesting" regex is set to look for the start of any closing tag ('</'), maybe on the assumption that the CDATA section doesn't contain any other tags (usually true for <style>, often false for <script>).
Changing the regex to explicitly look for the matching closing tag (e.g. </script>) might be better, but it would still fail for e.g. <script> document.write('<script>alert("foo")</script>')</script> (though some browsers will fail with this too).
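
To make the suggestion concrete, here is a rough standalone sketch of the difference between stopping at any '</' and stopping only at the explicit closing tag. It is not the actual HTMLParser code or the patch: closing_tag_re and split_cdata are hypothetical names used only for illustration, and the sketch assumes the whole script body is already available as one string.

```python
import re

# Generic pattern: stops at the first '</', whatever tag follows it.
generic_end = re.compile(r'</')


def closing_tag_re(tag):
    """Hypothetical helper: match only the end tag of the given element."""
    return re.compile(r'</%s\s*>' % re.escape(tag), re.IGNORECASE)


def split_cdata(rawdata, tag):
    """Split rawdata at the first explicit closing tag for `tag`.

    Returns (content, rest). If no closing tag is found, everything is
    treated as content and rest is empty.
    """
    m = closing_tag_re(tag).search(rawdata)
    if m is None:
        return rawdata, ''
    return rawdata[:m.start()], rawdata[m.end():]


# Body of the last test case above, i.e. what follows the opening <script>.
body = "<p>foo</p> '</scr'+'ipt>' <span>bar</span></script>"

# The generic pattern stops right after '<p>foo', hence the first chunk.
first_generic_stop = generic_end.search(body).start()

# The explicit pattern skips '</p>' and "</scr'+'ipt>" and yields one chunk.
content, rest = split_cdata(body, 'script')

# The known failure case: a literal '</script>' inside a JS string still
# ends the element early.
nested = """ document.write('<script>alert("foo")</script>')</script>"""
nested_content, nested_rest = split_cdata(nested, 'script')
```

With this approach the handler would get the whole "<p>foo</p> '</scr'+'ipt>' <span>bar</span>" as a single piece of data, while the nested document.write case is still cut at the inner </script>, leaving "')</script>" behind.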