Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTMLParser: comma in attribute values with/without space #85914

Closed
nowaskyjr mannequin opened this issue Sep 8, 2020 · 15 comments
Closed

HTMLParser: comma in attribute values with/without space #85914

nowaskyjr mannequin opened this issue Sep 8, 2020 · 15 comments
Assignees
Labels
3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@nowaskyjr
Copy link
Mannequin

nowaskyjr mannequin commented Sep 8, 2020

BPO 41748
Nosy @vstinner, @ezio-melotti, @karlcow, @miss-islington
PRs
  • bpo-41748: Handles unquoted attributes with commas #24072
  • [3.9] bpo-41748: Handles unquoted attributes with commas (GH-24072) #24415
  • [3.8] bpo-41748: Handles unquoted attributes with commas (GH-24072) #24416
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2021-02-01.21:27:04.840>
    created_at = <Date 2020-09-08.21:59:30.005>
    labels = ['3.8', 'type-bug', 'library']
    title = 'HTMLParser: comma in attribute values with/without space'
    updated_at = <Date 2021-02-01.21:27:04.839>
    user = 'https://bugs.python.org/nowaskyjr'

    bugs.python.org fields:

    activity = <Date 2021-02-01.21:27:04.839>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2021-02-01.21:27:04.840>
    closer = 'ezio.melotti'
    components = ['Library (Lib)']
    creation = <Date 2020-09-08.21:59:30.005>
    creator = 'nowasky.jr'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 41748
    keywords = ['patch']
    message_count = 15.0
    messages = ['376607', '376635', '376638', '376639', '376641', '376642', '384256', '384258', '384259', '384763', '386104', '386109', '386110', '386111', '386115']
    nosy_count = 5.0
    nosy_names = ['vstinner', 'ezio.melotti', 'karlcow', 'miss-islington', 'nowasky.jr']
    pr_nums = ['24072', '24415', '24416']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue41748'
    versions = ['Python 3.8']

    @nowaskyjr
    Copy link
    Mannequin Author

    nowaskyjr mannequin commented Sep 8, 2020

    HTML tags that have a attribute name starting with a comma character aren't parsed and break future calls to feed().

    The problem occurs when such attribute is the second one or later in the HTML tag. Doesn't seems to affect when it's the first attribute.

    #POC:

    from html.parser import HTMLParser
    
    class MyHTMLParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            print("Encountered a start tag:", tag)
    
    parser = MyHTMLParser()
    
    #This is ok
    parser.feed('<yyy id="poc" a,="">')
    
    #This breaks
    parser.feed('<zzz id="poc" ,a="">')
    
    #Future calls to feed() will not work
    parser.feed('<img id="poc" src=x>')

    @nowaskyjr nowaskyjr mannequin added type-crash A hard crash of the interpreter, possibly with a core dump 3.8 only security fixes stdlib Python modules in the Lib dir type-security A security issue and removed type-crash A hard crash of the interpreter, possibly with a core dump type-security A security issue labels Sep 8, 2020
    @vstinner
    Copy link
    Member

    vstinner commented Sep 9, 2020

    HTML 5.2 specification says
    https://www.w3.org/TR/html52/syntax.html#elements-attributes

    "Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, U+0022 QUOTATION MARK ("), U+0027 APOSTROPHE ('), U+003E GREATER-THAN SIGN (>), U+002F SOLIDUS (/), and U+003D EQUALS SIGN (=) characters, the control characters, and any characters that are not defined by Unicode."

    It is confirmed in the "12.2 Parsing HTML documents" section of the "HTML Living Standard":
    """
    12.2.5.33 Attribute name state

    Consume the next input character:

    U+0009 CHARACTER TABULATION (tab)
    U+000A LINE FEED (LF)
    U+000C FORM FEED (FF)
    U+0020 SPACE
    U+002F SOLIDUS (/)
    U+003E GREATER-THAN SIGN (>)
    EOF
    Reconsume in the after attribute name state.
    U+003D EQUALS SIGN (=)
    Switch to the before attribute value state.
    ASCII upper alpha
    Append the lowercase version of the current input character (add 0x0020 to the character's code point) to the current attribute's name.
    U+0000 NULL
    This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the current attribute's name.
    U+0022 QUOTATION MARK (")
    U+0027 APOSTROPHE (')
    U+003C LESS-THAN SIGN (<)
    This is an unexpected-character-in-attribute-name parse error. Treat it as per the "anything else" entry below.
    Anything else
    Append the current input character to the current attribute's name.
    """
    https://html.spec.whatwg.org/multipage/parsing.html#attribute-name-state

    I understand that "," *is* a legit character in a HTML attribute name. So "a," and ",a" *are* valid HTML attribute names. Do I understand correctly?

    @nowaskyjr
    Copy link
    Mannequin Author

    nowaskyjr mannequin commented Sep 9, 2020

    Yes, I understand that in the same way. Both are valid attr names.

    Maybe it's worth noting that Javascript has no problem handling this.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 9, 2020

    HTMLParser.check_for_whole_start_tag() uses locatestarttagend_tolerant regular expression to find the end of the start tag. This regex cuts the string at the first comma (","), but not if the comma is the first character of an attribute name

    • '<div id="test" , color="blue">' => '<div id="test" , color="blue"': OK!
    • '<div id="test" ,color="blue">' => '<div id="test" ,' => BUG

    The regex is quite complex:

    locatestarttagend_tolerant = re.compile(r"""
      <[a-zA-Z][^\t\n\r\f />\x00]*       # tag name
      (?:[\s/]*                          # optional whitespace before attribute name
        (?:(?<=['"\s/])[^\s/>][^\s/=>]*  # attribute name
          (?:\s*=+\s*                    # value indicator
            (?:'[^']*'                   # LITA-enclosed value
              |"[^"]*"                   # LIT-enclosed value
              |(?!['"])[^>\s]*           # bare value
             )
             (?:\s*,)*                   # possibly followed by a comma
           )?(?:\s|/(?!>))*
         )*
       )?
      \s*                                # trailing whitespace
    """, re.VERBOSE)
    endendtag = re.compile('>')

    The problem is that this part of the regex:

    #(?:\s*,)* # possibly followed by a comma

    The comma is not seen as part of the attribute name.

    @vstinner
    Copy link
    Member

    vstinner commented Sep 9, 2020

    Also, there is no warning about security in the html.parser documentation? Is this module mature and maintained enough to be considered as reliable? Or should we warn users about possible issues on corner cases, and point to BeautilfulSoup for a more mature HTML parser?

    https://docs.python.org/dev/library/html.parser.html

    @ezio-melotti
    Copy link
    Member

    The html.parser follows the HTML 5 specs as closely as possible. There are a few corner cases where it behaves slightly differently but it's only while dealing with invalid markup, and the differences should be trivial and generally not worth the extra complexity to deal with them.

    In this case, if I recall correctly, the way the comma is handled is just a left-over from the previous version of the parser, that predates the HTML 5 specs. In tags like <tag foo=bar,baz=asd> there was an ambiguous situation and parsing it <tag foo="bar" baz="asd"> was deemed a reasonable interpretation, so the comma was treated as an attribute separator (and there should be test cases for this).

    This likely caused the issue reported by the OP, and I think it should be fixed, even if technically it's a change in behavior and will break some of the tests.

    If I'm reading the specs[0] correctly:

    • <tag foo=bar,baz=asd> should be parsed as <tag foo="bar,baz=asd">, and
    • <tag foo=bar ,baz=asd> should be parsed as <tag foo="bar" ,baz="asd">, where ',baz' is the attribute name

    Also, there is no warning about security in the html.parser documentation?

    I'm not aware of any specific security issues, since html.parser just implements the parser described by the HTML 5 specs. If there are any security issues caused by divergences from the specs, they should be fixed. I'm not sure why a warning would be needed.

    Is this module mature and maintained enough to be considered as reliable?

    Even though it hasn't been updated to the latest version of the specs (5.2 at the time of writing), it has been updated to implement the parsing rules described by the HTML 5 specs. I don't know if the parsing rules changed between 5.0 and 5.2.

    Or should we warn users about possible issues on corner cases, and point to BeautilfulSoup for a more mature HTML parser?

    BeautifulSoup is built on top of html.parser (and can also use other parses, like lxml). BS uses the underlying parsers to parse the HTML, then builds the tree and provides, among other things, functions to search and edit it.
    When I upgraded html.parser to HTML 5 I worked with the author of BeautifulSoup (Leonard Richardson), to make sure that my changes were compatible with BS. We also discussed about some corner cases he found and other feature requests and issues he had with the old version of the parser. That said, a link to BS would be a nice addition, since it's a great library.

    [0] starting from https://www.w3.org/TR/html52/syntax.html#tokenizer-before-attribute-name-state

    @karlcow
    Copy link
    Mannequin

    karlcow mannequin commented Jan 3, 2021

    Ezio,

    TL,DR: Testing in browsers and adding two tests for this issue.
    Should I create a PR just for the tests?

    https://github.com/python/cpython/blame/63298930fb531ba2bb4f23bc3b915dbf1e17e9e1/Lib/test/test_htmlparser.py#L479-L485

    A: comma without spaces
    -----------------------

    Tests for browsers:
    data:text/html,<!doctype html><div class=bar,baz=asd>text</div>

    Serializations:

    • Firefox, Gecko (86.0a1 (2020-12-28) (64-bit))
    • Edge, Blink (Version 89.0.752.0 (Version officielle) Canary (64 bits))
    • Safari, WebKit (Release 117 (Safari 14.1, WebKit 16611.1.7.2))

    Same serialization in these 3 rendering engines
    <div class="bar,baz=asd">text</div>

    Adding:

        def test_comma_between_unquoted_attributes(self):
            # bpo 41748
            self._run_check('<div class=bar,baz=asd>',
                            [('starttag', 'div', [('class', 'bar,baz=asd')])])

    ❯ ./python.exe -m test -v test_htmlparser


    test_comma_between_unquoted_attributes (test.test_htmlparser.HTMLParserTestCase) ... ok

    Ran 47 tests in 0.168s

    OK

    == Tests result: SUCCESS ==

    1 test OK.

    Total duration: 369 ms
    Tests result: SUCCESS

    So this is working as expected for the first test.

    B: comma with spaces
    --------------------

    Tests for browsers:
    data:text/html,<!doctype html><div class=bar, baz=asd>text</div>

    Serializations:

    • Firefox, Gecko (86.0a1 (2020-12-28) (64-bit))
    • Edge, Blink (Version 89.0.752.0 (Version officielle) Canary (64 bits))
    • Safari, WebKit (Release 117 (Safari 14.1, WebKit 16611.1.7.2))

    Same serialization in these 3 rendering engines
    <div class="bar" ,baz="asd">text</div>

    Adding
    def test_comma_with_space_between_unquoted_attributes(self):
    # bpo-41748
    self._run_check('<div class=bar ,baz=asd>',
    [('starttag', 'div', [
    ('class', 'bar'),
    (',baz', 'asd')])])

    ❯ ./python.exe -m test -v test_htmlparser

    This is failing.

    ======================================================================
    FAIL: test_comma_with_space_between_unquoted_attributes (test.test_htmlparser.HTMLParserTestCase)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/Users/karl/code/cpython/Lib/test/test_htmlparser.py", line 493, in test_comma_with_space_between_unquoted_attributes
        self._run_check('<div class=bar ,baz=asd>',
      File "/Users/karl/code/cpython/Lib/test/test_htmlparser.py", line 95, in _run_check
        self.fail("received events did not match expected events" +
    AssertionError: received events did not match expected events
    Source:
    '<div class=bar ,baz=asd>'
    Expected:
    [('starttag', 'div', [('class', 'bar'), (',baz', 'asd')])]
    Received:
    [('data', '<div class=bar ,baz=asd>')]

    I started to look into the code of parser.py which I'm not familiar (yet) with.

    (?:[\s/]* # optional whitespace before attribute name
    (?:(?<=['"\s/])[^\s/>][^\s/=>]* # attribute name
    (?:\s*=+\s* # value indicator
    (?:'[^']*' # LITA-enclosed value
    |"[^"]*" # LIT-enclosed value
    |(?!['"])[^>\s]* # bare value
    )
    (?:\s*,)* # possibly followed by a comma
    )?(?:\s|/(?!>))*
    )*
    )?

    Do you have a suggestion to fix it?

    @karlcow karlcow mannequin changed the title HTMLParser: parsing error HTMLParser: comma in attribute values with/without space Jan 3, 2021
    @karlcow karlcow mannequin changed the title HTMLParser: parsing error HTMLParser: comma in attribute values with/without space Jan 3, 2021
    @karlcow
    Copy link
    Mannequin

    karlcow mannequin commented Jan 3, 2021

    Ah!

    This is fixing it

    diff --git a/Lib/html/parser.py b/Lib/html/parser.py
    index 6083077981..ffff790666 100644
    --- a/Lib/html/parser.py
    +++ b/Lib/html/parser.py
    @@ -44,7 +44,7 @@
           (?:\s*=+\s*                    # value indicator
             (?:'[^']*'                   # LITA-enclosed value
               |"[^"]*"                   # LIT-enclosed value
    -          |(?!['"])[^>\s]*           # bare value
    +          |(?!['"])[^>]*           # bare value
              )
              (?:\s*,)*                   # possibly followed by a comma
            )?(?:\s|/(?!>))*

    Ran 48 tests in 0.175s

    OK

    == Tests result: SUCCESS ==

    @ezio-melotti
    Copy link
    Member

    Writing tests that verify the expected behavior is a great first step. The expected output in the tests should match the behavior described by the HTML 5 specs (which should also correspond to the browsers' behavior), and should initially fail. You can start creating a PR with only the tests, clarifying that it's a work in progress, or wait until you have the fix too.

    The next step would be tweaking the regex and the code until both the new tests and all the other ones work (excluding the one with the commas you are fixing). You can then commit the fix in the same branch and push it -- GitHub will automatically update the PR.

    Do you have a suggestion to fix it?

    If you are familiar enough with regexes, you could try to figure out whether it matches the invalid attributes or not, and if not why (I took a quick look and I didn't see anything immediately wrong in the regexes).

    Since the output of the failing test is [('data', '<div class=bar ,baz=asd>')], it's likely that the parser doesn't know how to handle it and passes it to one of the handle_data() in the goahead() method. You can figure out which one is being called and see which are the if-conditions that are leading the interpreter down this path rather than the usual path where the attributes are parsed correctly.

    If you have other questions let me know :)

    @ezio-melotti ezio-melotti added type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Jan 3, 2021
    @karlcow
    Copy link
    Mannequin

    karlcow mannequin commented Jan 10, 2021

    Status: The PR should be ready and completed
    #24072
    and eventually be merged at a point.
    Thanks to ezio.melotti for the wonderful guidance.

    @ezio-melotti
    Copy link
    Member

    New changeset 9eb11a1 by Karl Dubost in branch 'master':
    bpo-41748: Handles unquoted attributes with commas (bpo-24072)
    9eb11a1

    @vstinner
    Copy link
    Member

    vstinner commented Feb 1, 2021

    Oh, thank you Karl Dubost for the fix!

    @miss-islington
    Copy link
    Contributor

    New changeset 0869a71 by Miss Islington (bot) in branch '3.8':
    bpo-41748: Handles unquoted attributes with commas (GH-24072)
    0869a71

    @miss-islington
    Copy link
    Contributor

    New changeset 0874491 by Miss Islington (bot) in branch '3.9':
    bpo-41748: Handles unquoted attributes with commas (GH-24072)
    0874491

    @ezio-melotti
    Copy link
    Member

    Merged! Thanks for the report and the PR!

    @ezio-melotti ezio-melotti self-assigned this Feb 1, 2021
    @ezio-melotti ezio-melotti self-assigned this Feb 1, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants