Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTMLParser.py - more robust SCRIPT tag parsing #37799

Closed
fantoozler mannequin opened this issue Jan 19, 2003 · 39 comments
Closed

HTMLParser.py - more robust SCRIPT tag parsing #37799

fantoozler mannequin opened this issue Jan 19, 2003 · 39 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@fantoozler
Copy link
Mannequin

fantoozler mannequin commented Jan 19, 2003

BPO 670664
Nosy @birkenfeld, @yotam.medini, @orsenthil, @pitrou, @ezio-melotti, @merwok, @bitdancer
Files
  • patch-02.txt
  • patch-test-cdata.txt: test suite update
  • minify.py
  • lt-in-script-example.tgz: Example of HTMLParser failure
  • HTMLParser.diff: Patch with suggested fix
  • endtag-space.html: Example of HTMLParser failure
  • dollar-extra.html: Example of HTMLParser failure
  • ltscr-endtag-dollarext.diff: Patch with suggested fix
  • cdata_patch.diff
  • cdata_patch.diff
  • hp_fix.diff: Patch containing updated version of Alexander's patch and tests
  • issue670664.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2011-11-01.12:18:10.762>
    created_at = <Date 2003-01-19.14:07:07.000>
    labels = ['type-bug', 'library']
    title = 'HTMLParser.py - more robust SCRIPT tag parsing'
    updated_at = <Date 2011-11-01.12:18:10.761>
    user = 'https://bugs.python.org/fantoozler'

    bugs.python.org fields:

    activity = <Date 2011-11-01.12:18:10.761>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2011-11-01.12:18:10.762>
    closer = 'ezio.melotti'
    components = ['Library (Lib)']
    creation = <Date 2003-01-19.14:07:07.000>
    creator = 'fantoozler'
    dependencies = []
    files = ['4937', '10952', '12183', '19072', '19073', '20231', '20232', '20233', '21045', '21046', '22767', '23553']
    hgrepos = []
    issue_num = 670664
    keywords = ['patch', 'needs review']
    message_count = 39.0
    messages = ['42474', '42475', '42476', '42477', '42478', '70083', '76723', '83254', '88864', '117762', '117763', '120265', '125096', '125154', '130319', '130326', '130702', '141204', '141210', '141232', '141237', '141242', '141248', '141260', '141266', '141273', '141291', '141297', '141345', '141372', '141428', '141531', '141532', '141533', '146632', '146709', '146717', '146770', '146771']
    nosy_count = 14.0
    nosy_names = ['georg.brandl', 'yotam', 'orsenthil', 'pitrou', 'fantoozler', 'gsf', 'ezio.melotti', 'eric.araujo', 'r.david.murray', 'momat', 'Hunanyan', 'friday', 'python-dev', 'Matt.Basta']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue670664'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @fantoozler
    Copy link
    Mannequin Author

    fantoozler mannequin commented Jan 19, 2003

    http://www.ebay.com contains a script element of the form

    <SCRIPT>
    ...
    vbscript += "</SCR"+"IPT> \n";
    ...
    </SCRIPT>

    which is not enclosed in "<!-- ... -->" comments. The parser
    choked on that line, indicating it was a mal-formed end tag.

    The changes are:

    interesting_cdata is now a dict mapping start tag to
    an re matching the end tag, a "<--" or \Z

    HTMLParser.set_cdata_mode takes an extra argument,
    the start tag

    @fantoozler fantoozler mannequin assigned freddrake Jan 19, 2003
    @fantoozler fantoozler mannequin added the stdlib Python modules in the Lib dir label Jan 19, 2003
    @fantoozler fantoozler mannequin assigned freddrake Jan 19, 2003
    @fantoozler fantoozler mannequin added the stdlib Python modules in the Lib dir label Jan 19, 2003
    @fantoozler
    Copy link
    Mannequin Author

    fantoozler mannequin commented Jan 25, 2003

    Logged In: YES
    user_id=690612

    Found regression test, used it, found error, fixed it.

    @freddrake
    Copy link
    Member

    Logged In: YES
    user_id=3066

    From python-dev:

    John Paulson wrote:

    [...]  A side-effect of this is that
    any "\<!--" .. "--\>" within a script/style will
    be parsed as a comment.  If that behavior is
    incorrect, the regex can be modified.
    

    Jerry Williams wrote:
    Does this mean that the following won't work:

    <SCRIPT language="JavaScript">
    <!-- //
    some-javascript-code
    // -->
    </SCRIPT>

    That could be a problem, since this is commonly used
    to support browsers that don't understand <SCRIPT>.

    See:
    http://mail.python.org/pipermail/python-dev/2003-January/032482.html

    @fantoozler
    Copy link
    Mannequin Author

    fantoozler mannequin commented Jan 28, 2003

    Logged In: YES
    user_id=690612

    You will get a sequence of:
    handle_starttag("script")
    handle_comment("some-javascript-code")
    handle_endtag("script")

    whereas before the sequence was:
    handle_starttag("script")
    handle_data("<!-- ... some-javascript-code ... //-->")
    handle_endtag("script")

    @freddrake
    Copy link
    Member

    Logged In: YES
    user_id=3066

    Removed older version of the patch.

    @birkenfeld
    Copy link
    Member

    Adding test suite patch from bpo-674449; closed that as a duplicate.

    @cpalmer
    Copy link
    Mannequin

    cpalmer mannequin commented Dec 2, 2008

    Here is an additional test case. I have a super simple HTML "minifier"
    that burps when given this test file:

    ========

    $ cat test.html 
    'foo <sc'+'ript>'

    ========

    The explosion is:

    ========

    $ ./minify.py test.html 
    Warning: malformed start tag
    'foo Traceback (most recent call last):
      File "./minify.py", line 84, in <module>
        m.feed(f.read())
      File "/usr/local/lib/python2.5/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/local/lib/python2.5/HTMLParser.py", line 148, in goahead
        k = self.parse_starttag(i)
      File "/usr/local/lib/python2.5/HTMLParser.py", line 226, in parse_starttag
        endpos = self.check_for_whole_start_tag(i)
      File "/usr/local/lib/python2.5/HTMLParser.py", line 302, in
    check_for_whole_start_tag
        raise AssertionError("we should not get here!")
    AssertionError: we should not get here!

    ========

    @cpalmer cpalmer mannequin added type-bug An unexpected behavior, bug, or error labels Dec 2, 2008
    @gsf
    Copy link
    Mannequin

    gsf mannequin commented Mar 6, 2009

    Now that BeautifulSoup uses HTMLParser, more people are seeing these
    errors. See
    http://groups.google.com/group/beautifulsoup/msg/d5a7540620538d14 and
    http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=516824

    @momat
    Copy link
    Mannequin

    momat mannequin commented Jun 4, 2009

    A simple workaround for the BeautifulSoup is the following wrapper. It
    sanitize the javascript code before passing it to the parser by joining
    the disjoint strings, so that "</scr"+"ipt>" becomes "</script>".

    def bs(input):
    	pattern = re.compile('\"\+\"')
    	match = lambda x: ""
    	massage = copy.copy(BeautifulSoup.MARKUP_MASSAGE)
    	massage.extend([(pattern, match)])
    	return BeautifulSoup(input, markupMassage=massage)

    @yotammedini
    Copy link
    Mannequin

    yotammedini mannequin commented Sep 30, 2010

    The HTMLParser.py fails when inside
    <script> ... </script>
    it can fooled by JavaScript with less-than '<' conditional expressions.
    In the attached example:

     $ tar tvzf lt-in-script-example.tgz | cut -c24-
         796 2010-09-30 16:52 h2t.py
       23678 2010-09-30 16:39 t.html

    here's what happens:

     $ python h2t.py t.html /tmp/t.txt
     HTMLParser: /home/yotam/src/wog/HTMLParser.bug/HTMLParser.py
     Traceback (most recent call last):
       File "h2t.py", line 31, in <module>
         text = html2text(f_html.read())
       File "h2t.py", line 23, in html2text
         te = TextExtractor(html)
       File "h2t.py", line 15, in __init__
         self.feed(html)
       File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 108, in feed
         self.goahead(0)
       File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 148, in goahead
         k = self.parse_starttag(i)
       File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 229, in parse_starttag
         endpos = self.check_for_whole_start_tag(i)
       File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 304, in check_for_whole_start_tag
         self.error("malformed start tag")
       File "/home/yotam/src/wog/HTMLParser.bug/HTMLParser.py", line 115, in error
         raise HTMLParseError(message, self.getpos())
     HTMLParser.HTMLParseError: malformed start tag, at line 396, column 332

    I have a suggested patch
    HTMLParser.diff
    fixing this problem, soon to be attached.

    -- yotam

    @yotammedini
    Copy link
    Mannequin

    yotammedini mannequin commented Sep 30, 2010

    The attached suggested patch fixes the problems shown in msg117762.

    @merwok
    Copy link
    Member

    merwok commented Nov 2, 2010

    Would it be reasonable to add knowledge to html.parser to make it recognize script elements as CDATA and handle it correctly (that is let “<” pass)?

    @yotammedini
    Copy link
    Mannequin

    yotammedini mannequin commented Jan 2, 2011

    Suggested fix for the attached cases:
    lt-in-script-example.tgz
    endtag-space.html
    dollar-extra.html

    @orsenthil
    Copy link
    Member

    If you provide some tests augumenting the currently existing tests
    test_htmlparser.py and also ensure that no existing test breaks, it
    would be help better to review the patch. I do see some changes made
    to the regex and parsing. So tests would definitely help.

    @friday
    Copy link
    Mannequin

    friday mannequin commented Mar 8, 2011

    This is small patch for related bug bpo-9577 which actually is not related to this bug.

    @friday
    Copy link
    Mannequin

    friday mannequin commented Mar 8, 2011

    And this patch fix the both bugs in more elegant way

    @ezio-melotti
    Copy link
    Member

    Thanks for the patch, however it would be better if you could get a clone of the CPython repo and make a patch against it.
    The patch should also include tests.

    You can check http://docs.python.org/devguide/ for more information.

    @MattBasta
    Copy link
    Mannequin

    MattBasta mannequin commented Jul 27, 2011

    The number of problems produced by this bug can be greatly reduced by adding a relatively small check to the parser. Currently, <script> and <style> tags call set_cdata_mode(), which sets self.interesting to HTMLParser.interesting_cdata. This is bad because it searches for ANY closing tag, rather than a closing tag which matches the opening tag.

    Alexander's fix solved about half the problem, but it didn't handle ending tags as text. I've fixed this and added some tests.

    This is my first patch, so if there's a better way that I could be submitting this, input would be appreciated.

    @ezio-melotti
    Copy link
    Member

    I left a review about your patch on rietveld, including a description of what I think it's going on there (the patch lacks some context and it's not easy to figure out how everything works there).
    I also did some tests with and without the patch:

    >>> from HTMLParser import HTMLParser as HP
    >>> class MyHP(HP):
    ...   def handle_data(self, data): print 'data: %r' % data
    ... 
    >>> myhp = MyHP()
    
    # without the patch:
    >>> myhp.feed('<script>foobar</script>')
    data: 'foobar'  # this looks ok
    >>> myhp.feed('<script><p>foo</p></script>')
    data: '<p>foo'  # where's the </p>?
    >>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
    data: '<p>foo' # some tags missing, 2 chunks received
    data: 'bar'
    >>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")
    data: '<p>foo'
    data: " '"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/usr/lib/python2.7/HTMLParser.py", line 108, in feed
        self.goahead(0)
      File "/usr/lib/python2.7/HTMLParser.py", line 150, in goahead
        k = self.parse_endtag(i)
      File "/usr/lib/python2.7/HTMLParser.py", line 317, in parse_endtag
        self.error("bad end tag: %r" % (rawdata[i:j],))
      File "/usr/lib/python2.7/HTMLParser.py", line 115, in error
        raise HTMLParseError(message, self.getpos())
    HTMLParser.HTMLParseError: bad end tag: "</scr'+'ipt>", at line 1, column 247
    
    
    # with the patch:
    >>> myhp.feed('<script>foobar</script>')
    data: 'foobar'  # ok
    >>> myhp.feed('<script><p>foo</p></script>')
    data: '<p>foo' # all the content is there, but why 2 chunks?
    data: '</p>'
    >>> myhp.feed('<script><p>foo</p><span>bar</span></script>')
    data: '<p>foo' # same as previous
    data: '</p>'
    data: '<span>bar'
    data: '</span>'
    >>> myhp.feed("<script><p>foo</p> '</scr'+'ipt>' <span>bar</span></script>")  
    data: '<p>foo' # same
    data: '</p>'
    data: " '"
    data: "</scr'+'ipt>"
    data: "' <span>bar"
    data: '</span>'

    So my question is: is it normal that the data is passed to handle_data in chunks?
    AFAIU HTML parser should see CDATA as a single chunk of bytes they don't care about, so the fact that further parsing happens on the content of script/style seems wrong to me.
    If I'm reading the code correctly that's because the "interesting" regex is set to look for a closing tag ('</') -- maybe assuming that the CDATA section doesn't contain any other tag (usually true in case of <style>, often false for <script>).
    Changing the regex to explicitly look for the closing tag might be better (but still fail for e.g. <script> document.write('<script>alert("foo")</script>')</script> -- but some browsers will fail with this too).

    @merwok
    Copy link
    Member

    merwok commented Jul 27, 2011

    Ezio wrote:
      >>> myhp.feed('<script><p>foo</p></script>')
      data: '<p>foo'  # where's the </p>?

    http://www.w3.org/TR/html4/types#type-cdata says:
    Although the STYLE and SCRIPT elements use CDATA for their data
    model, for these elements, CDATA must be handled differently by user
    agents. Markup and entities must be treated as raw text and passed to
    the application as is. The first occurrence of the character sequence
    "</" (end-tag open delimiter) is treated as terminating the end of
    the element's content. In valid documents, this would be the end tag
    for the element.

    So I think the example is invalid (should escape the <), and that HTMLParser is not buggy.

    @bitdancer
    Copy link
    Member

    It's not buggy, but it is also not helpful. This kind of thing is what we introduced the 'strict' parameter for. And indeed I believe we've fixed some of these cases thereby. So any additional fixes should go into non-strict mode in Python3.

    @MattBasta
    Copy link
    Mannequin

    MattBasta mannequin commented Jul 27, 2011

    So I think the example is invalid (should escape the <), and that HTMLParser is not buggy.

    On the other hand, the HTML5 spec clearly dictates otherwise:

    http://www.w3.org/TR/html5/syntax.html#cdata-rcdata-restrictions
    The text in raw text and RCDATA elements must not contain any occurrences of the string "</" (U+003C LESS-THAN SIGN, U+002F SOLIDUS) followed by characters that case-insensitively match the tag name of the element followed by one of U+0009 CHARACTER TABULATION (tab), U+000A LINE FEED (LF), U+000C FORM FEED (FF), U+000D CARRIAGE RETURN (CR), U+0020 SPACE, U+003E GREATER-THAN SIGN (>), or U+002F SOLIDUS (/).

    Additionally, no browsers (perhaps unless they are in quirks mode) currently obey the HTML4 variant of the rule. This is due largely in part to the need to include strings such as "</scr" + "ipt>" within a script tag itself. This behavior can be observed firsthand by loading this snippet in a browser:

    <script><span></span>This should not be visible.</script>

    @bitdancer
    Copy link
    Member

    Yes, but we don't claim to support HTML5 yet.

    The best way to support HTML5 is probably a topic for python-dev.

    @MattBasta
    Copy link
    Mannequin

    MattBasta mannequin commented Jul 27, 2011

    Yes, but we don't claim to support HTML5 yet.

    There's also no claim in the docs or the source that HTMLParser specifically adheres to HTML4, either.

    Ideally, the parser should strive for parity with the functionality of major web browsers, as they are the de-facto standard for HTML parser behavior. All of the browsers on my machine, for instance, will even parse the following snippet with the behavior described in the HTML5 spec:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
    "http://www.w3.org/TR/html4/strict.dtd"\>
    <script><span></span>This should not be visible.</script>

    Even in pre-HTML5 browsers, this is the way that HTML gets parsed. For the heck of it, I downloaded an old copy of Firefox 2.0 and ran the above snippet. The behavior is consistent.

    While I would otherwise agree that keeping to the HTML4 spec is the right thing to do, this is a quirk of the spec that is not only ignored by browsers (as can be seen in FX2) and changed in a future version of the spec, but is causing problems for a good number of developers.

    It could be argued that the patch is a far more elegant solution for Beautiful Soup developers than the workaround in msg88864.

    @bitdancer
    Copy link
    Member

    I thought HTLM4 conformance was documented somewhere, but I could be wrong.

    HTML5, from what I understand (I haven't read the spec), is explicitly or implicitly following "what browsers really do" exactly because nobody conformed to HTML4, so arguing that "a later spec changed the rules" isn't really relevant in this case :)

    We made the change the way we did (strict option) out of backward compatibility concerns, so I still think this topic needs to be discussed on python-dev. I think the argument that python should handle what most browsers handle is a strong one (I myself would have been in favor of just making this stuff work, absent backward compatibility concerns). The question in my mind is what's the best way to get there from here?

    @ezio-melotti
    Copy link
    Member

    IIRC we have been following what browsers do in other cases already.
    There were also some discussions about supporting HTML5 (see e.g. bpo-7311 and bpo-11113) and the strict vs non-strict mode introduced in Python3.

    Note that changing the way things are parsed is generally not backward-compatible, but you might argue that new behavior is useful enough to break some hackish code that was trying to workaround the limitations of HTMLParser.

    @merwok
    Copy link
    Member

    merwok commented Jul 28, 2011

    HTML5 being a spec that builds on HTML 4.01 and real-world ways to deal with non-compliant input, I don’t object to fixes that follow the HTML5 spec. Regarding backward compatibility, we can break it if we decide that the behavior we’re changing was a bug. I think it’s far more useful to support BeautifoulSoup than to retain a non-useful behavior forever.

    @bitdancer
    Copy link
    Member

    Unless someone else has picked it up, BeautifulSoup is a no longer an issue since its author has abandoned it. That doesn't change the fact that IMO it would be nice for our library to handle input generously.

    @pitrou
    Copy link
    Member

    pitrou commented Jul 29, 2011

    I also think this is a bug that should be fixed. Not being able to parse real-world HTML is a nuisance.

    I agree with Ezio's review comments about the custom regex.

    @bitdancer
    Copy link
    Member

    It sounds like the early consensus on python-dev is that html5 support is a good thing. I'm happy with that. I presume that means the 'strict' keyword in 3.x becomes strict-per-html5, and possibly useless :)

    @ezio-melotti
    Copy link
    Member

    As I said somewhere else, the only use case I can think of where the 'strict' flag is useful is validation, but AFAIK even in "strict mode" it's possible to parse non-valid documents, so I agree it's pretty useless.

    Moving to HTML5 and offering something able to parse real-world HTML seems the way to go IMHO.

    @friday
    Copy link
    Mannequin

    friday mannequin commented Aug 1, 2011

    It sounds like the early consensus on python-dev is that html5 support is a good thing.

    Yeah... But wait another 8 years untill these guys decides that there is enough tests and other cool stuff.

    @pitrou
    Copy link
    Member

    pitrou commented Aug 1, 2011

    Yeah... But wait another 8 years untill these guys decides that
    there is enough tests and other cool stuff.

    Which guys are you talking about?
    Granted, this issue has been around for a loooong time... but now that we have a patch that seems ok (and has tests), we should be able to finally push this and include it in the next feature release, hopefully.

    @MattBasta
    Copy link
    Mannequin

    MattBasta mannequin commented Aug 1, 2011

    Seeing as everyone seems pretty satisfied with the 2.7 version, I'd be happy to put together a patch for 3 as well.

    To confirm, though, this fix is NOT going behind the strict parameter, correct?

    @ezio-melotti
    Copy link
    Member

    Attached a new patch with a few more tests and minor refactoring.

    @merwok
    Copy link
    Member

    merwok commented Oct 31, 2011

    • def set_cdata_mode(self):
      + def set_cdata_mode(self, elem):
      Looks like an incompatible behavior change. Is it only an internal method that will never affect users’ code (even subclasses)?

    @ezio-melotti
    Copy link
    Member

    I think it's internal. While it's not explicitly mentioned in the source, the method is not documented and I don't think people subclassed it. All that it does is changing the regex used to parse the data, and if someone needs to change this, it's probably easier to just change the regex.
    OTOH this method is called just in one place, so in theory it's possible to set self.cdata_elem there before the set_cdata_mode() call, but that might make the code more fragile.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 1, 2011

    New changeset 0a5eb57d5876 by Ezio Melotti in branch '2.7':
    bpo-670664: Fix HTMLParser to correctly handle the content of <script>...</script> and <style>...</style>.
    http://hg.python.org/cpython/rev/0a5eb57d5876

    New changeset a6f2244b251f by Ezio Melotti in branch '3.2':
    bpo-670664: Fix HTMLParser to correctly handle the content of <script>...</script> and <style>...</style>.
    http://hg.python.org/cpython/rev/a6f2244b251f

    New changeset b40752e227fa by Ezio Melotti in branch 'default':
    bpo-670664: merge with 3.2.
    http://hg.python.org/cpython/rev/b40752e227fa

    @ezio-melotti
    Copy link
    Member

    Fixed, thanks to everyone who contributed to this over the years!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants