HTMLParser.py - more robust SCRIPT tag parsing #37799

fantoozler · 2003-01-19T14:07:07Z

BPO	670664
Nosy	@birkenfeld, @yotam.medini, @orsenthil, @pitrou, @ezio-melotti, @merwok, @bitdancer
Files	patch-02.txt patch-test-cdata.txt: test suite update minify.py lt-in-script-example.tgz: Example of HTMLParser failure HTMLParser.diff: Patch with suggested fix endtag-space.html: Example of HTMLParser failure dollar-extra.html: Example of HTMLParser failure ltscr-endtag-dollarext.diff: Patch with suggested fix cdata_patch.diff cdata_patch.diff hp_fix.diff: Patch containing updated version of Alexander's patch and tests issue670664.diff

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = 'https://github.com/ezio-melotti'
closed_at = <Date 2011-11-01.12:18:10.762>
created_at = <Date 2003-01-19.14:07:07.000>
labels = ['type-bug', 'library']
title = 'HTMLParser.py - more robust SCRIPT tag parsing'
updated_at = <Date 2011-11-01.12:18:10.761>
user = 'https://bugs.python.org/fantoozler'

bugs.python.org fields:

activity = <Date 2011-11-01.12:18:10.761>
actor = 'ezio.melotti'
assignee = 'ezio.melotti'
closed = True
closed_date = <Date 2011-11-01.12:18:10.762>
closer = 'ezio.melotti'
components = ['Library (Lib)']
creation = <Date 2003-01-19.14:07:07.000>
creator = 'fantoozler'
dependencies = []
files = ['4937', '10952', '12183', '19072', '19073', '20231', '20232', '20233', '21045', '21046', '22767', '23553']
hgrepos = []
issue_num = 670664
keywords = ['patch', 'needs review']
message_count = 39.0
messages = ['42474', '42475', '42476', '42477', '42478', '70083', '76723', '83254', '88864', '117762', '117763', '120265', '125096', '125154', '130319', '130326', '130702', '141204', '141210', '141232', '141237', '141242', '141248', '141260', '141266', '141273', '141291', '141297', '141345', '141372', '141428', '141531', '141532', '141533', '146632', '146709', '146717', '146770', '146771']
nosy_count = 14.0
nosy_names = ['georg.brandl', 'yotam', 'orsenthil', 'pitrou', 'fantoozler', 'gsf', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'momat', 'Hunanyan', 'friday', 'python-dev', 'Matt.Basta']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue670664'
versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

fantoozler · 2003-01-19T14:07:07Z

http://www.ebay.com contains a script element of the form

<SCRIPT>
...
vbscript += "</SCR"+"IPT> \n";
...
</SCRIPT>

which is not enclosed in "" comments. The parser
choked on that line, indicating it was a mal-formed end tag.

The changes are:

interesting_cdata is now a dict mapping start tag to
an re matching the end tag, a "<--" or \Z

HTMLParser.set_cdata_mode takes an extra argument,
the start tag

fantoozler · 2003-01-25T03:58:04Z

Logged In: YES
user_id=690612

Found regression test, used it, found error, fixed it.

freddrake · 2003-01-28T22:24:10Z

Logged In: YES
user_id=3066

From python-dev:

John Paulson wrote:

[...]  A side-effect of this is that
any "\<!--" .. "--\>" within a script/style will
be parsed as a comment.  If that behavior is
incorrect, the regex can be modified.

Jerry Williams wrote:
Does this mean that the following won't work:

That could be a problem, since this is commonly used
to support browsers that don't understand <SCRIPT>.

See:
http://mail.python.org/pipermail/python-dev/2003-January/032482.html

fantoozler · 2003-01-28T22:35:00Z

Logged In: YES
user_id=690612

You will get a sequence of:
handle_starttag("script")
handle_comment("some-javascript-code")
handle_endtag("script")

whereas before the sequence was:
handle_starttag("script")
handle_data("")
handle_endtag("script")

freddrake · 2004-07-14T19:28:21Z

Logged In: YES
user_id=3066

Removed older version of the patch.

birkenfeld · 2008-07-20T11:32:05Z

Adding test suite patch from bpo-674449; closed that as a duplicate.

cpalmer · 2008-12-02T02:06:00Z

Here is an additional test case. I have a super simple HTML "minifier"
that burps when given this test file:

========

$ cat test.html 
'foo <sc'+'ript>'

========

The explosion is:

========

$ ./minify.py test.html 
Warning: malformed start tag
'foo Traceback (most recent call last):
  File "./minify.py", line 84, in <module>
    m.feed(f.read())
  File "/usr/local/lib/python2.5/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/local/lib/python2.5/HTMLParser.py", line 148, in goahead
    k = self.parse_starttag(i)
  File "/usr/local/lib/python2.5/HTMLParser.py", line 226, in parse_starttag
    endpos = self.check_for_whole_start_tag(i)
  File "/usr/local/lib/python2.5/HTMLParser.py", line 302, in
check_for_whole_start_tag
    raise AssertionError("we should not get here!")
AssertionError: we should not get here!

========

gsf · 2009-03-06T20:31:14Z

Now that BeautifulSoup uses HTMLParser, more people are seeing these
errors. See
http://groups.google.com/group/beautifulsoup/msg/d5a7540620538d14 and
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=516824

momat · 2009-06-04T06:33:54Z

A simple workaround for the BeautifulSoup is the following wrapper. It
sanitize the javascript code before passing it to the parser by joining
the disjoint strings, so that "</scr"+"ipt>" becomes "</script>".

def bs(input):
	pattern = re.compile('\"\+\"')
	match = lambda x: ""
	massage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
	massage.extend([(pattern, match)])
	return BeautifulSoup(input, markupMassage=massage)

yotammedini · 2010-09-30T21:50:04Z

The HTMLParser.py fails when inside
<script> ... </script>
it can fooled by JavaScript with less-than '<' conditional expressions.
In the attached example:

 $ tar tvzf lt-in-script-example.tgz | cut -c24-
     796 2010-09-30 16:52 h2t.py
   23678 2010-09-30 16:39 t.html

here's what happens:

 $ python h2t.py t.html /tmp/t.txt
 HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
 Traceback (most recent call last):
   File "h2t.py", line 31, in <module>
     text = html2text(f_html.read())
   File "h2t.py", line 23, in html2text
     te = TextExtractor(html)
   File "h2t.py", line 15, in __init__
     self.feed(html)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
     self.goahead(0)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
     k = self.parse_starttag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
     endpos = self.check_for_whole_start_tag(i)
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
     self.error("malformed start tag")
   File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
     raise HTMLParseError(message, self.getpos())
 HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332

I have a suggested patch
HTMLParser.diff
fixing this problem, soon to be attached.

-- yotam

yotammedini · 2010-09-30T21:52:24Z

The attached suggested patch fixes the problems shown in msg117762.

merwok · 2010-11-02T22:10:14Z

Would it be reasonable to add knowledge to html.parser to make it recognize script elements as CDATA and handle it correctly (that is let “<” pass)?

yotammedini · 2011-01-02T20:50:12Z

Suggested fix for the attached cases:
lt-in-script-example.tgz
endtag-space.html
dollar-extra.html

orsenthil · 2011-01-03T04:05:17Z

If you provide some tests augumenting the currently existing tests
test_htmlparser.py and also ensure that no existing test breaks, it
would be help better to review the patch. I do see some changes made
to the regex and parsing. So tests would definitely help.

friday · 2011-03-08T11:00:15Z

This is small patch for related bug bpo-9577 which actually is not related to this bug.

friday · 2011-03-08T11:28:24Z

And this patch fix the both bugs in more elegant way

ezio-melotti · 2011-03-12T22:47:00Z

Thanks for the patch, however it would be better if you could get a clone of the CPython repo and make a patch against it.
The patch should also include tests.

You can check http://docs.python.org/devguide/ for more information.

MattBasta · 2011-07-27T03:24:51Z

The number of problems produced by this bug can be greatly reduced by adding a relatively small check to the parser. Currently, <script> and <style> tags call set_cdata_mode(), which sets self.interesting to HTMLParser.interesting_cdata. This is bad because it searches for ANY closing tag, rather than a closing tag which matches the opening tag.

Alexander's fix solved about half the problem, but it didn't handle ending tags as text. I've fixed this and added some tests.

This is my first patch, so if there's a better way that I could be submitting this, input would be appreciated.

ezio-melotti · 2011-07-27T06:52:15Z

I left a review about your patch on rietveld, including a description of what I think it's going on there (the patch lacks some context and it's not easy to figure out how everything works there).
I also did some tests with and without the patch:

>>> from HTMLParser import HTMLParser as HP
>>> class MyHP(HP):
...   def handle_data(self, data): print 'data: %r' % data
... 
>>> myhp = MyHP()

# without the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # this looks ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo'  # where's the </p>?
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # some tags missing, 2 chunks received
data: 'bar'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
data: '<p>foo'
data: " '"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
    self.goahead(0)
  File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
    k = self.parse_endtag(i)
  File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
    self.error("bad end tag: %r" % (rawdata[i:j],))
  File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
    raise HTMLParseError(message, self.getpos())
HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247


# with the patch:
>>> myhp.feed('<script>foobar</script>')
data: 'foobar'  # ok
>>> myhp.feed('<script><p>foo</p></script>')
data: '<p>foo' # all the content is there, but why 2 chunks?
data: '</p>'
>>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
data: '<p>foo' # same as previous
data: '</p>'
data: '<span>bar'
data: '</span>'
>>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")  
data: '<p>foo' # same
data: '</p>'
data: " '"
data: "</scr'+'ipt>"
data: "' <span>bar"
data: '</span>'

So my question is: is it normal that the data is passed to handle_data in chunks?
AFAIU HTML parser should see CDATA as a single chunk of bytes they don't care about, so the fact that further parsing happens on the content of script/style seems wrong to me.
If I'm reading the code correctly that's because the "interesting" regex is set to look for a closing tag ('</') -- maybe assuming that the CDATA section doesn't contain any other tag (usually true in case of <style>, often false for <script>).
Changing the regex to explicitly look for the closing tag might be better (but still fail for e.g. <script> document.write('<script>alert("foo")</script>')</script> -- but some browsers will fail with this too).

merwok · 2011-07-27T15:12:07Z

Ezio wrote:
  >>> myhp.feed('<script><p>foo</p></script>')
  data: '<p>foo'  # where's the </p>?

http://www.w3.org/TR/html4/types#type-cdata says:
Although the STYLE and SCRIPT elements use CDATA for their data
model, for these elements, CDATA must be handled differently by user
agents. Markup and entities must be treated as raw text and passed to
the application as is. The first occurrence of the character sequence
"</" (end-tag open delimiter) is treated as terminating the end of
the element's content. In valid documents, this would be the end tag
for the element.

So I think the example is invalid (should escape the <), and that HTMLParser is not buggy.

bitdancer · 2011-07-27T15:44:50Z

It's not buggy, but it is also not helpful. This kind of thing is what we introduced the 'strict' parameter for. And indeed I believe we've fixed some of these cases thereby. So any additional fixes should go into non-strict mode in Python3.

MattBasta · 2011-07-27T16:53:52Z

So I think the example is invalid (should escape the <), and that HTMLParser is not buggy.

On the other hand, the HTML5 spec clearly dictates otherwise:

http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions
The text in raw text and RCDATA elements must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element followed by one of U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or U+002F SOLIDUS (/).

Additionally, no browsers (perhaps unless they are in quirks mode) currently obey the HTML4 variant of the rule. This is due largely in part to the need to include strings such as "</scr" + "ipt>" within a script tag itself. This behavior can be observed firsthand by loading this snippet in a browser:

bitdancer · 2011-07-27T17:28:50Z

Yes, but we don't claim to support HTML5 yet.

The best way to support HTML5 is probably a topic for python-dev.

MattBasta · 2011-07-27T18:37:45Z

Yes, but we don't claim to support HTML5 yet.

There's also no claim in the docs or the source that HTMLParser specifically adheres to HTML4, either.

Ideally, the parser should strive for parity with the functionality of major web browsers, as they are the de-facto standard for HTML parser behavior. All of the browsers on my machine, for instance, will even parse the following snippet with the behavior described in the HTML5 spec:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
"http://www.w3.org/TR/html4/strict.dtd"\>
<script><span></span>This should not be visible.</script>

Even in pre-HTML5 browsers, this is the way that HTML gets parsed. For the heck of it, I downloaded an old copy of Firefox 2.0 and ran the above snippet. The behavior is consistent.

While I would otherwise agree that keeping to the HTML4 spec is the right thing to do, this is a quirk of the spec that is not only ignored by browsers (as can be seen in FX2) and changed in a future version of the spec, but is causing problems for a good number of developers.

It could be argued that the patch is a far more elegant solution for Beautiful Soup developers than the workaround in msg88864.

bitdancer · 2011-07-27T19:13:09Z

I thought HTLM4 conformance was documented somewhere, but I could be wrong.

HTML5, from what I understand (I haven't read the spec), is explicitly or implicitly following "what browsers really do" exactly because nobody conformed to HTML4, so arguing that "a later spec changed the rules" isn't really relevant in this case :)

We made the change the way we did (strict option) out of backward compatibility concerns, so I still think this topic needs to be discussed on python-dev. I think the argument that python should handle what most browsers handle is a strong one (I myself would have been in favor of just making this stuff work, absent backward compatibility concerns). The question in my mind is what's the best way to get there from here?

ezio-melotti · 2011-07-27T20:03:23Z

IIRC we have been following what browsers do in other cases already.
There were also some discussions about supporting HTML5 (see e.g. bpo-7311 and bpo-11113) and the strict vs non-strict mode introduced in Python3.

Note that changing the way things are parsed is generally not backward-compatible, but you might argue that new behavior is useful enough to break some hackish code that was trying to workaround the limitations of HTMLParser.

merwok · 2011-07-28T13:43:24Z

HTML5 being a spec that builds on HTML 4.01 and real-world ways to deal with non-compliant input, I don’t object to fixes that follow the HTML5 spec. Regarding backward compatibility, we can break it if we decide that the behavior we’re changing was a bug. I think it’s far more useful to support BeautifoulSoup than to retain a non-useful behavior forever.

bitdancer · 2011-07-28T14:16:12Z

Unless someone else has picked it up, BeautifulSoup is a no longer an issue since its author has abandoned it. That doesn't change the fact that IMO it would be nice for our library to handle input generously.

pitrou · 2011-07-29T12:35:07Z

I also think this is a bug that should be fixed. Not being able to parse real-world HTML is a nuisance.

I agree with Ezio's review comments about the custom regex.

bitdancer · 2011-07-29T13:33:20Z

It sounds like the early consensus on python-dev is that html5 support is a good thing. I'm happy with that. I presume that means the 'strict' keyword in 3.x becomes strict-per-html5, and possibly useless :)

ezio-melotti · 2011-07-30T07:09:15Z

As I said somewhere else, the only use case I can think of where the 'strict' flag is useful is validation, but AFAIK even in "strict mode" it's possible to parse non-valid documents, so I agree it's pretty useless.

Moving to HTML5 and offering something able to parse real-world HTML seems the way to go IMHO.

friday · 2011-08-01T20:07:51Z

It sounds like the early consensus on python-dev is that html5 support is a good thing.

Yeah... But wait another 8 years untill these guys decides that there is enough tests and other cool stuff.

pitrou · 2011-08-01T20:20:05Z

Yeah... But wait another 8 years untill these guys decides that
there is enough tests and other cool stuff.

Which guys are you talking about?
Granted, this issue has been around for a loooong time... but now that we have a patch that seems ok (and has tests), we should be able to finally push this and include it in the next feature release, hopefully.

MattBasta · 2011-08-01T20:50:30Z

Seeing as everyone seems pretty satisfied with the 2.7 version, I'd be happy to put together a patch for 3 as well.

To confirm, though, this fix is NOT going behind the strict parameter, correct?

ezio-melotti · 2011-10-30T11:40:26Z

Attached a new patch with a few more tests and minor refactoring.

merwok · 2011-10-31T16:35:06Z

def set_cdata_mode(self):
+ def set_cdata_mode(self, elem):
Looks like an incompatible behavior change. Is it only an internal method that will never affect users’ code (even subclasses)?

ezio-melotti · 2011-10-31T17:13:21Z

I think it's internal. While it's not explicitly mentioned in the source, the method is not documented and I don't think people subclassed it. All that it does is changing the regex used to parse the data, and if someone needs to change this, it's probably easier to just change the regex.
OTOH this method is called just in one place, so in theory it's possible to set self.cdata_elem there before the set_cdata_mode() call, but that might make the code more fragile.

python-dev · 2011-11-01T12:14:36Z

New changeset 0a5eb57d5876 by Ezio Melotti in branch '2.7':
bpo-670664: Fix HTMLParser to correctly handle the content of <script>...</script> and <style>...</style>.
http://hg.python.org/cpython/rev/0a5eb57d5876

New changeset a6f2244b251f by Ezio Melotti in branch '3.2':
bpo-670664: Fix HTMLParser to correctly handle the content of <script>...</script> and <style>...</style>.
http://hg.python.org/cpython/rev/a6f2244b251f

New changeset b40752e227fa by Ezio Melotti in branch 'default':
bpo-670664: merge with 3.2.
http://hg.python.org/cpython/rev/b40752e227fa

ezio-melotti · 2011-11-01T12:18:11Z

Fixed, thanks to everyone who contributed to this over the years!

fantoozler mannequin assigned freddrake Jan 19, 2003

fantoozler mannequin added the stdlib Python modules in the Lib dir label Jan 19, 2003

fantoozler mannequin assigned freddrake Jan 19, 2003

fantoozler mannequin added the stdlib Python modules in the Lib dir label Jan 19, 2003

cpalmer mannequin added type-bug An unexpected behavior, bug, or error labels Dec 2, 2008

pitrou unassigned freddrake Jul 29, 2011

ezio-melotti assigned ezio-melotti Oct 29, 2011

ezio-melotti closed this as completed Nov 1, 2011

ezio-melotti transferred this issue from another repository Apr 9, 2022

HTMLParser.py - more robust SCRIPT tag parsing #37799

HTMLParser.py - more robust SCRIPT tag parsing #37799

Comments

fantoozler mannequin commented Jan 19, 2003

fantoozler mannequin commented Jan 19, 2003

fantoozler mannequin commented Jan 25, 2003

freddrake commented Jan 28, 2003

fantoozler mannequin commented Jan 28, 2003

freddrake commented Jul 14, 2004

birkenfeld commented Jul 20, 2008

cpalmer mannequin commented Dec 2, 2008

gsf mannequin commented Mar 6, 2009

momat mannequin commented Jun 4, 2009

yotammedini mannequin commented Sep 30, 2010

yotammedini mannequin commented Sep 30, 2010

merwok commented Nov 2, 2010

yotammedini mannequin commented Jan 2, 2011

orsenthil commented Jan 3, 2011

friday mannequin commented Mar 8, 2011

friday mannequin commented Mar 8, 2011

ezio-melotti commented Mar 12, 2011

MattBasta mannequin commented Jul 27, 2011

ezio-melotti commented Jul 27, 2011

merwok commented Jul 27, 2011

bitdancer commented Jul 27, 2011

MattBasta mannequin commented Jul 27, 2011

bitdancer commented Jul 27, 2011

MattBasta mannequin commented Jul 27, 2011

bitdancer commented Jul 27, 2011

ezio-melotti commented Jul 27, 2011

merwok commented Jul 28, 2011

bitdancer commented Jul 28, 2011

pitrou commented Jul 29, 2011

bitdancer commented Jul 29, 2011

ezio-melotti commented Jul 30, 2011

friday mannequin commented Aug 1, 2011

pitrou commented Aug 1, 2011

MattBasta mannequin commented Aug 1, 2011

ezio-melotti commented Oct 30, 2011

merwok commented Oct 31, 2011

ezio-melotti commented Oct 31, 2011

python-dev mannequin commented Nov 1, 2011

ezio-melotti commented Nov 1, 2011