Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve HTMLParser attribute processing regexps #38664

Closed
smroid mannequin opened this issue Jun 17, 2003 · 11 comments
Closed

improve HTMLParser attribute processing regexps #38664

smroid mannequin opened this issue Jun 17, 2003 · 11 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@smroid
Copy link
Mannequin

smroid mannequin commented Jun 17, 2003

BPO 755670
Nosy @ezio-melotti
Dependencies
  • bpo-683938: HTMLParser attribute parsing bug
  • Files
  • diff.txt
  • issue755670.diff: Failing test
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2011-11-14.17:07:52.150>
    created_at = <Date 2003-06-17.03:09:17.000>
    labels = ['type-bug', 'library']
    title = 'improve HTMLParser attribute processing regexps'
    updated_at = <Date 2011-11-14.17:07:52.148>
    user = 'https://bugs.python.org/smroid'

    bugs.python.org fields:

    activity = <Date 2011-11-14.17:07:52.148>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2011-11-14.17:07:52.150>
    closer = 'ezio.melotti'
    components = ['Library (Lib)']
    creation = <Date 2003-06-17.03:09:17.000>
    creator = 'smroid'
    dependencies = ['683938']
    files = ['5384', '23582']
    hgrepos = []
    issue_num = 755670
    keywords = ['patch']
    message_count = 11.0
    messages = ['44032', '44033', '44034', '44035', '44036', '44037', '55984', '114245', '146786', '147611', '147619']
    nosy_count = 5.0
    nosy_names = ['titus', 'smroid', 'timtoo', 'ezio.melotti', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue755670'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @smroid
    Copy link
    Mannequin Author

    smroid mannequin commented Jun 17, 2003

    HTML examples seen in the wild that cause parse errors
    in HTMLParser include:

    <a width="100%"cellspacing=0>
    -- note lack of space between val and next attr name

    <a foo=>
    -- trailing attribute has no value after =

    <a href=javascript:popup('/popup/html.html')>
    -- javascript fragment with embedded quotes

    My patch contains improvements to the 'attrfind' and
    'locatestarttagend' regexps that allow these examples
    to parse.

    The existing test_htmlparser.py unit test continues to
    pass, except for the one test case where it considers
    <a foo=> to be an error.

    I commented out that case and added new test cases to
    cover the examples above.

    @smroid smroid mannequin added stdlib Python modules in the Lib dir labels Jun 17, 2003
    @smroid
    Copy link
    Mannequin Author

    smroid mannequin commented Jun 17, 2003

    Logged In: YES
    user_id=159908

    Base version for HTMLParser.py is 1.11.2.1; base version for
    test_htmlparser.py is 1.8.8.1

    @smroid
    Copy link
    Mannequin Author

    smroid mannequin commented Jun 18, 2003

    Logged In: YES
    user_id=159908

    This also fixes bugs 683938 and 699079.

    @titus
    Copy link
    Mannequin

    titus mannequin commented Dec 19, 2004

    Logged In: YES
    user_id=23486

    This patch allows developers to override the behavior of HTMLParser
    when parsing malformed HTML. Normally HTMLParser calls the function
    self.error(), which raises an exception. This patch adds appropriate
    return values for situations where self.error has been redefined in
    subclasses to *not* raise an exception.

    It does not change the default behavior of HTMLParser and so presents
    no backwards compatibility issues.

    The patch itself consists of an added comment and two added lines of
    code that call 'return' with appropriate values after a self.error call.
    Nothing wrong with 'em. I can't verify that the "junk characters" error
    call will leave the parser in a good state, though, if execution returns
    from error().

    The library documentation could be updated to reflect the ability to
    override
    error() behavior; I've written a short patch, available at

    http://issola.caltech.edu/~t/transfer/HTMLParser-doc-error.patch

    More problems exist with markupbase.py, upon which HTMLParser is
    based.
    markupbase calls error() as well, and has some stickier situations. See
    comments in bug 917188 as well.

    Comments in 683938 and 699079 suggest that raising an exception is the
    correct response to the parse errors. I recommend application of the
    patch anyway, because it (a) doesn't change any behavior by default
    and (b) may solve some problems for people.

    An alternative would be to distinguish between unrecoverable errors
    and recoverable errors by having two different functions, e.g. error()
    (for
    recoverable errors) and _fail() (for unrecoverable errors). By default
    error() would call _fail() and internal code could be changed to call
    _fail() where recovery is impossible. This might alter behavior in
    situations where subclasses override error() but then again that's not
    legitimate to do anyway, at least not at the moment -- error() isn't
    in the docs ;).

    If nothing done, at least close patch 755660 and bug 736428 with a
    comment saying that this behavior will not be addressed ;).

    @titus
    Copy link
    Mannequin

    titus mannequin commented Dec 19, 2004

    Logged In: YES
    user_id=23486

    whoops, attached to wrong patch! dangitall. sorry...

    @titus
    Copy link
    Mannequin

    titus mannequin commented Dec 19, 2004

    Logged In: YES
    user_id=23486

    I don't think HTMLParser should parse clearly invalid HTML without
    complaining. Perhaps such behavior should be overrideable (see patch
    755660 & comments therein), but this patch changes behavior to parse
    invalid HTML w/o complaint.

    Patch 699079, to "fix" similar behavior, was closed by adsr and bcannon
    because such behavior is correct. Patch 683938 is also similar but is
    being kept open for some reason.

    Recommend closing patch w/o applying.

    @timtoo
    Copy link
    Mannequin

    timtoo mannequin commented Sep 17, 2007

    I for one thank smroid for the patch. I also have hit *all* of these
    cases in the wild. This patch makes real-life a lot less frustrating.
    This patch is surely a lot more preferable than HTMLParser's tendency
    to just throw up its hands and quietly give up. And one might even
    make a case that in the horrific world of HTML "standards" case #2 and
    #3 can be considered actually valid (as much as it hurts to say so).

    @devdanzin devdanzin mannequin added type-feature A feature request or enhancement labels Feb 12, 2009
    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Aug 18, 2010

    There are messages both for and against the patch which contains a unit test. Can we have a statement from a knowledgeable HTML person as to whether the patch should be accepted or rejected.

    @ezio-melotti
    Copy link
    Member

    Attached patch includes the tests in diff.txt. On Python 3, with strict=False, the first test (adjacent attributes) passes, but the other two still fail.
    See also bpo-12629.

    @ezio-melotti ezio-melotti added type-bug An unexpected behavior, bug, or error and removed type-feature A feature request or enhancement labels Nov 1, 2011
    @ezio-melotti ezio-melotti self-assigned this Nov 14, 2011
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 14, 2011

    New changeset 3c3009f63700 by Ezio Melotti in branch '2.7':
    bpo-1745761, bpo-755670, bpo-13357, bpo-12629, bpo-1200313: improve attribute handling in HTMLParser.
    http://hg.python.org/cpython/rev/3c3009f63700

    New changeset 16ed15ff0d7c by Ezio Melotti in branch '3.2':
    bpo-1745761, bpo-755670, bpo-13357, bpo-12629, bpo-1200313: improve attribute handling in HTMLParser.
    http://hg.python.org/cpython/rev/16ed15ff0d7c

    New changeset 426f7a2b1826 by Ezio Melotti in branch 'default':
    bpo-1745761, bpo-755670, bpo-13357, bpo-12629, bpo-1200313: merge with 3.2.
    http://hg.python.org/cpython/rev/426f7a2b1826

    @ezio-melotti
    Copy link
    Member

    Fixed, thanks for the report!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 9, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    1 participant