Message141210
I left a review of your patch on Rietveld, including a description of what I think is going on there (the patch lacks some context, and it's not easy to figure out how everything works).
I also did some tests with and without the patch:
>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
...     def handle_data(self, data): print 'data: %r' % data
...
>>> myhp = MyHP()
# without the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar' # this looks ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # where's the </p>?
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247
# with the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar' # ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # all the content is there, but why 2 chunks?
data: '</p>'
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # same as previous
data: '</p>'
data: '<span>bar'
data: '</span>'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo' # same
data: '</p>'
data: " '"
data: "</scr'+'ipt>"
data: "' <span>bar"
data: '</span>'
So my question is: is it normal that the data is passed to handle_data in chunks?
AFAIU, HTML parsers should treat CDATA as a single chunk of bytes they don't care about, so the fact that the content of script/style gets parsed further seems wrong to me.
If I'm reading the code correctly, that's because the "interesting" regex is set to look for a closing tag ('</') -- maybe assuming that the CDATA section doesn't contain any other tag (usually true in the case of <style>, often false for <script>).
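To make the chunking concrete, here is a rough model of it, not the actual HTMLParser code. It assumes the parser stops at every '</' inside the script content and hands the end-tag-like text on as a chunk of its own; `model_chunks` and `END_TAG_LIKE` are hypothetical names used only for this illustration.

```python
import re

# Assumption: anything shaped like '</...>' interrupts the data, the way a
# pattern that only looks for '</' would.
END_TAG_LIKE = re.compile(r"(</[^>]*>)")

def model_chunks(script_content):
    """Split script content around every end-tag-like piece of text."""
    # The capturing group makes re.split keep the matched pieces; empty
    # strings left at the edges are dropped.
    return [piece for piece in END_TAG_LIKE.split(script_content) if piece]

print(model_chunks("<p>foo</p>"))
print(model_chunks("<p>foo</p> '</scr'+'ipt>' <span>bar</span>"))
```

Under that assumption the second call yields the same six pieces as the last example in the patched session, which is at least consistent with the '</' pattern being what splits the data.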
Changing the regex to explicitly look for the closing tag might be better, but it would still fail for e.g. <script> document.write('<script>alert("foo")</script>')</script> (though some browsers will fail with this too).

Date | User | Action | Args
2011-07-27 06:52:15 | ezio.melotti | set | recipients: + ezio.melotti, fdrake, georg.brandl, yotam, orsenthil, fantoozler, gsf, cpalmer, eric.araujo, r.david.murray, momat, Hunanyan, friday, Matt.Basta
2011-07-27 06:52:15 | ezio.melotti | set | messageid: <1311749535.43.0.630839370937.issue670664@psf.upfronthosting.co.za>
2011-07-27 06:52:14 | ezio.melotti | link | issue670664 messages
2011-07-27 06:52:14 | ezio.melotti | create |