Handling of broken comments in HTMLParser #58168

Closed
ezio-melotti opened this issue Feb 7, 2012 · 8 comments
@ezio-melotti
Member

BPO 13960
Nosy @ezio-melotti, @merwok
Files
  • issue13960.diff: Patch against 3.2
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2012-02-13.14:14:59.669>
    created_at = <Date 2012-02-07.11:56:35.738>
    labels = ['type-bug', 'library']
    title = 'Handling of broken comments in HTMLParser'
    updated_at = <Date 2012-02-13.14:14:59.668>
    user = 'https://github.com/ezio-melotti'

    bugs.python.org fields:

    activity = <Date 2012-02-13.14:14:59.668>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2012-02-13.14:14:59.669>
    closer = 'ezio.melotti'
    components = ['Library (Lib)']
    creation = <Date 2012-02-07.11:56:35.738>
    creator = 'ezio.melotti'
    dependencies = []
    files = ['24443']
    hgrepos = []
    issue_num = 13960
    keywords = ['patch']
    message_count = 8.0
    messages = ['152806', '152861', '152869', '153032', '153035', '153036', '153271', '153272']
    nosy_count = 3.0
    nosy_names = ['ezio.melotti', 'eric.araujo', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue13960'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @ezio-melotti
    Member Author

    html.parser fails to handle the following invalid comments:
    <! foo >
    <! bar -->
    <! -- baz -->
    The attached patch follows the HTML5 specs and parses them as "bogus comments". Currently the patch fixes the problem only when strict=False, but it might be better to make this the default behavior and apply it to 2.7 too.
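
For readers who want to see the behavior described above, here is a minimal sketch, assuming Python 3.5 or later (where the `strict` argument no longer exists and the lenient behavior is the only one). `CommentCollector` and `collect_comments` are hypothetical names used only for illustration; they are not part of the patch.

```python
from html.parser import HTMLParser


class CommentCollector(HTMLParser):
    """Hypothetical helper: records every comment the parser reports."""

    def __init__(self):
        super().__init__()
        self.comments = []

    def handle_comment(self, data):
        # Called for regular comments and for "bogus comments" alike.
        self.comments.append(data)


def collect_comments(markup):
    """Parse markup and return the list of comment bodies that were reported."""
    parser = CommentCollector()
    parser.feed(markup)
    parser.close()
    return parser.comments


# The three broken comments from the report.
for sample in ("<! foo >", "<! bar -->", "<! -- baz -->"):
    print(repr(sample), "->", collect_comments(sample))
```

Each input is reported as a single comment whose body is everything between `<!` and the first `>`, so the trailing `--` in the last two samples stays part of the comment text.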

    @ezio-melotti added the stdlib and type-bug labels Feb 7, 2012
    @ezio-melotti self-assigned this Feb 8, 2012
    @merwok
    Member

    merwok commented Feb 8, 2012

    LGTM. What did our last discussion about following HTML5 rules for Python 2.7 lead to? I don’t remember if we agreed that “3.3 is soon enough” or “let’s fix the bugs with HTML5 as reference”.

    @merwok
    Member

    merwok commented Feb 8, 2012

    After reading some emails again, I’m +1 on porting the fixes to 2.7.

    1. We agree that HTML5 is the reference specification.

    2. I don’t think any sane code would break if a previously unparsable page became parsable (other HTML parsers could be an exception, but the obvious example, BeautifulSoup, does not use HTMLParser); HTMLParser is not a validating parser and never made any guarantee about the validity of the pages it handles.

    3. Most people should be happy to have more pages handled by HTMLParser.

    4. 2.7 is unique: it is the last 2.x release and has long-term support.

    @ezio-melotti
    Member Author

    I'll fix this for 3.x non-strict and then see if it can be backported to 2.7 (there are still other fixes that should be backported to 2.7 before this can be applied).

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 10, 2012

    New changeset 242b697449d8 by Ezio Melotti in branch '3.2':
    bpo-13960: HTMLParser is now able to handle broken comments when strict=False.
    http://hg.python.org/cpython/rev/242b697449d8

    New changeset 44366541dd86 by Ezio Melotti in branch 'default':
    bpo-13960: merge with 3.2.
    http://hg.python.org/cpython/rev/44366541dd86

    @ezio-melotti
    Member Author

    This is now fixed in 3.2/3.3, I'll wait for 2.7 before closing it.
    On a side note, the empty <!> comment doesn't seem to be valid in HTML5.
    HTMLParser just ignores it, and doesn't report it as an empty comment (so this should be fine).

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 13, 2012

    New changeset 333e3acf2008 by Ezio Melotti in branch '2.7':
    bpo-13960: HTMLParser is now able to handle broken comments.
    http://hg.python.org/cpython/rev/333e3acf2008

    @ezio-melotti
    Member Author

    I have now backported this to 2.7, together with some improvements in the handling of declarations that I committed on 3.2 (4c4ff9fd19b6) and 3.3 (06a6fed0da56).
    Apparently <!> is not a valid comment in HTML5, but it is considered a bogus comment and should still emit a "comment" with no content. This is now fixed too.
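
The `<!>` case can be checked with a short sketch like the one below, again assuming Python 3.5 or later. `EventRecorder` is a hypothetical name for illustration; the sketch records both text and comments so the empty comment is visible between the surrounding data.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records text and comment events in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        self.events.append(("comment", data))


recorder = EventRecorder()
recorder.feed("a<!>b")
recorder.close()

# Keep only the comment events; "<!>" should show up as one empty comment.
comment_events = [event for event in recorder.events if event[0] == "comment"]
text = "".join(value for kind, value in recorder.events if kind == "data")
print(comment_events, repr(text))
```

The empty string in the comment event is what distinguishes "reported as an empty comment" from "silently ignored", which is the difference between the two behaviors discussed in this thread.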

    @ezio-melotti transferred this issue from another repository Apr 10, 2022