Message 141210 - Python tracker


Author	ezio.melotti
Recipients	Hunanyan, Matt.Basta, cpalmer, eric.araujo, ezio.melotti, fantoozler, fdrake, friday, georg.brandl, gsf, momat, orsenthil, r.david.murray, yotam
Date	2011-07-27.06:52:14
Message-id	<1311749535.43.0.630839370937.issue670664@psf.upfronthosting.co.za>
In-reply-to

Content

I left a review of your patch on Rietveld, including a description of what I think is going on there (the patch lacks some context, and it's not easy to figure out how everything works).
I also did some tests with and without the patch:

>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
...   def handle_data(self, data): print 'data: %r' % data
... 
>>> myhp = MyHP()

# without the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # this looks ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo'  # where's the </p>?
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247


# with the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # all the content is there, but why 2 chunks?
data: '</p>'
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # same as previous
data: '</p>'
data: '<span>bar'
data: '</span>'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")  
data: '<p>foo' # same
data: '</p>'
data: " '"
data: "</scr'+'ipt>"
data: "' <span>bar"
data: '</span>'
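
For anyone who wants to re-run this comparison, here is a rough sketch of a handler that records each chunk instead of printing it. Note the assumptions: it targets Python 3's html.parser rather than the Python 2 HTMLParser module used in the session above, so the chunking it reports may differ from the output shown here, and ChunkRecorder and script_chunks are hypothetical names made up for this example.

```python
from html.parser import HTMLParser


class ChunkRecorder(HTMLParser):
    """Hypothetical helper: record every string passed to handle_data."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # One list entry per handle_data call, so chunking stays visible.
        self.chunks.append(data)


def script_chunks(markup):
    """Feed `markup` to a fresh parser and return the recorded chunks."""
    parser = ChunkRecorder()
    parser.feed(markup)
    parser.close()  # flush anything the parser is still buffering
    return parser.chunks


chunks = script_chunks('<script><p>foo</p><span>bar</span></script>')
print(len(chunks), 'chunk(s):', chunks)
# Joining the chunks hides the chunking, but it cannot bring back
# tags that the parser dropped before calling handle_data.
print(''.join(chunks))
```

Keeping the chunks in a list makes it easy to tell "content delivered in several pieces" apart from "content missing", which are the two different problems visible in the runs above.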

So my question is: is it normal that the data is passed to handle_data in chunks?
AFAIU, an HTML parser should treat CDATA as a single chunk of bytes it doesn't care about, so the fact that further parsing happens on the content of script/style seems wrong to me.
If I'm reading the code correctly, that's because the "interesting" regex is set to look for any closing tag ('</') -- maybe on the assumption that the CDATA section doesn't contain any other tags (usually true for <style>, often false for <script>).
Changing the regex to look explicitly for the matching closing tag might be better (but it would still fail for e.g. <script> document.write('<script>alert("foo")</script>')</script> -- though some browsers will fail on this too).
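
To illustrate the difference between the two regexes, here is a minimal sketch using only the re module. It is not the patch and not the HTMLParser code: GENERIC_END is my assumption of what the CDATA-mode "interesting" regex looks like, and closing_tag_regex and cdata_end are hypothetical helpers. Each input string is what follows the opening <script> tag.

```python
import re

# Assumed shape of the generic CDATA-mode pattern: stop at any '</'
# (or at a '<' that is the last character of the buffer).
GENERIC_END = re.compile(r'<(/|\Z)')


def closing_tag_regex(tag):
    """Hypothetical helper: match only the closing tag of `tag`."""
    return re.compile(r'</%s\s*>' % re.escape(tag), re.IGNORECASE)


def cdata_end(pattern, rawdata):
    """Return the offset where `pattern` says the CDATA content ends."""
    match = pattern.search(rawdata)
    # No terminator found: treat the whole buffer as content.
    return match.start() if match else len(rawdata)


SCRIPT_END = closing_tag_regex('script')

nested = "<p>foo</p><span>bar</span></script>"
split_tag = "<p>foo</p> '</scr'+'ipt>' <span>bar</span></script>"
tricky = "document.write('<script>alert(\"foo\")</script>')</script>"

for name, raw in [('nested', nested), ('split_tag', split_tag),
                  ('tricky', tricky)]:
    generic = raw[:cdata_end(GENERIC_END, raw)]
    explicit = raw[:cdata_end(SCRIPT_END, raw)]
    print('%s:\n  generic:  %r\n  explicit: %r' % (name, generic, explicit))
```

With the explicit pattern the first two inputs come out as one piece up to the real </script>, while the document.write case is still cut short at the inner </script>, which is the remaining failure mentioned above.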

History
Date	User	Action	Args
2011-07-27 06:52:15	ezio.melotti	set	recipients: + ezio.melotti, fdrake, georg.brandl, yotam, orsenthil, fantoozler, gsf, cpalmer, eric.araujo, r.david.murray, momat, Hunanyan, friday, Matt.Basta
2011-07-27 06:52:15	ezio.melotti	set	messageid: <1311749535.43.0.630839370937.issue670664@psf.upfronthosting.co.za>
2011-07-27 06:52:14	ezio.melotti	link	issue670664 messages
2011-07-27 06:52:14	ezio.melotti	create