HTMLParser silently stops parsing with malformed attributes #56838

Closed
teoryn mannequin opened this issue Jul 24, 2011 · 8 comments
Assignees: ezio-melotti
Labels: stdlib (Python modules in the Lib dir), type-bug (An unexpected behavior, bug, or error)

Comments

@teoryn
Mannequin

teoryn mannequin commented Jul 24, 2011

BPO 12629
Nosy @ezio-melotti, @merwok, @bitdancer
Files
  • test.py: Example of the broken behavior
  • issue12629.diff: Failing test
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2011-11-14.17:12:07.679>
    created_at = <Date 2011-07-24.18:35:07.279>
    labels = ['type-bug', 'library']
    title = 'HTMLParser silently stops parsing with malformed attributes'
    updated_at = <Date 2011-11-14.17:12:07.677>
    user = 'https://bugs.python.org/teoryn'

    bugs.python.org fields:

    activity = <Date 2011-11-14.17:12:07.677>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2011-11-14.17:12:07.679>
    closer = 'ezio.melotti'
    components = ['Library (Lib)']
    creation = <Date 2011-07-24.18:35:07.279>
    creator = 'teoryn'
    dependencies = []
    files = ['22745', '23579']
    hgrepos = []
    issue_num = 12629
    keywords = ['patch']
    message_count = 8.0
    messages = ['141051', '141174', '146774', '146848', '146852', '147192', '147612', '147620']
    nosy_count = 5.0
    nosy_names = ['ezio.melotti', 'eric.araujo', 'r.david.murray', 'python-dev', 'teoryn']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue12629'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @teoryn
    Mannequin Author

    teoryn mannequin commented Jul 24, 2011

    Given the input '<x><y z=""o"" /></x>', HTMLParser only detects the opening x tag, and then stops parsing. Ideally this should behave like the case '<x><y z="""" /></x>' which raises an error and then can continue parsing the close x tag.
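
A minimal reproduction sketch of the two inputs, assuming Python 3's module name `html.parser` (on 2.7 the module is `HTMLParser`). The `EventCollector` class and the `collect` helper are illustrative names and are not taken from the attached test.py. On an interpreter that predates the fix, the first input is expected to stop after the opening x tag and the second may raise instead of returning; on a fixed interpreter both should reach the closing x tag.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Record every tag event so the two inputs can be compared."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.events.append(("startendtag", tag, attrs))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))


def collect(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = EventCollector()
    parser.feed(markup)  # deliberately no close(), as in the original report
    return parser.events


for markup in ('<x><y z=""o"" /></x>', '<x><y z="""" /></x>'):
    print(markup, "->", collect(markup))
```

Comparing the two printed event lists makes the difference described above visible without relying on exceptions.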

    teoryn mannequin added the stdlib and type-bug labels Jul 24, 2011
    @teoryn
    Mannequin Author

    teoryn mannequin commented Jul 26, 2011

    A workaround is to call close() after feed(), which I suppose I should have done anyway. However, this does not resolve the issue that the two cases behave so differently.
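
The workaround can be sketched as follows, assuming Python 3's `html.parser` module. `TagLogger` is a hypothetical subclass used only to make the events visible; the point is the `close()` call, which asks the parser to process whatever input it is still buffering.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Log start and end tags in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append("start " + tag)

    def handle_endtag(self, tag):
        self.seen.append("end " + tag)


parser = TagLogger()
parser.feed('<x><y z=""o"" /></x>')
parser.close()  # force processing of any input still held in the buffer
print(parser.seen)
```

Calling `close()` once all input has been fed is the documented way to finish a parse, so it is worth doing even where the bug is already fixed.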

    The code that causes the difference is lines 351-355 of parser.py, which also has a misleading comment stating it detects the / in a /> ending (which is actually done at 334).

    @ezio-melotti
    Member

    I think <x><y z=""o"" /></x> should be parsed as <x><y z="" /></x>, and the o"" should be ignored.
    <x><y z="""" /></x> should be parsed as <x><y z="" /></x>, and the last two "" should be ignored. This is what Firefox seems to do.

    Currently the parser doesn't seem to handle extraneous data in the start tag too well, because the locatestarttagend_tolerant regex looks for (more or less) well-formed attributes.
    Attached a patch for test_htmlparser with the two examples provided by Kevin.

    @merwok
    Member

    merwok commented Nov 2, 2011

    > This is what Firefox seems to do.
    I think more confidence would be good. Doesn’t the HTML5 spec define that? Have you found their test suite? Do you have more than one browser known to be compliant (trick: not sure there is even one)?

    @ezio-melotti
    Member

    I haven't found anything in the HTML5 spec but I haven't looked closely.
    I'll do some more research when I start working on an actual patch.

    @ezio-melotti
    Member

    @ezio-melotti ezio-melotti self-assigned this Nov 14, 2011
    @python-dev
    Mannequin

    python-dev mannequin commented Nov 14, 2011

    New changeset 3c3009f63700 by Ezio Melotti in branch '2.7':
    bpo-1745761, bpo-755670, bpo-13357, bpo-12629, bpo-1200313: improve attribute handling in HTMLParser.
    http://hg.python.org/cpython/rev/3c3009f63700

    New changeset 16ed15ff0d7c by Ezio Melotti in branch '3.2':
    bpo-1745761, bpo-755670, bpo-13357, bpo-12629, bpo-1200313: improve attribute handling in HTMLParser.
    http://hg.python.org/cpython/rev/16ed15ff0d7c

    New changeset 426f7a2b1826 by Ezio Melotti in branch 'default':
    bpo-1745761, bpo-755670, bpo-13357, bpo-12629, bpo-1200313: merge with 3.2.
    http://hg.python.org/cpython/rev/426f7a2b1826

    @ezio-melotti
    Member

    Fixed, thanks for the report!
    Apparently the correct way to parse <y z=""o"" /> is:
    starttag y
    attribute z with value ""
    attribute o"" with no value
    So this is what HTMLParser does now.
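
The parse described above can be checked with a small test, sketched here under the assumption of Python 3's `html.parser` and an interpreter that includes the fix. `AttrCollector`, `AdjacentAttributeTest` and `run_check` are illustrative names; this is not the test that was committed with the changesets.

```python
import unittest
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect (tag, attrs) pairs for every start tag."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append((tag, attrs))


class AdjacentAttributeTest(unittest.TestCase):
    def test_o_becomes_valueless_attribute(self):
        parser = AttrCollector()
        parser.feed('<y z=""o"" />')
        parser.close()
        # z has an empty value; o"" is an attribute name with no value.
        self.assertEqual(parser.tags, [("y", [("z", ""), ('o""', None)])])


def run_check():
    """Run the test case and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(AdjacentAttributeTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


result = run_check()
```

A valueless attribute is reported with `None` as its value, which is how `o""` can be told apart from `z`, whose value is the empty string.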

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022