HTMLParse handing of non-numeric charrefs broken #64487

Closed
iko mannequin opened this issue Jan 17, 2014 · 5 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@iko
Mannequin

iko mannequin commented Jan 17, 2014

BPO 20288
Nosy @ezio-melotti, @bitdancer
Files
  • issue20288.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2014-02-14.05:31:06.723>
    created_at = <Date 2014-01-17.14:06:13.410>
    labels = ['type-bug', 'library']
    title = 'HTMLParse handing of non-numeric charrefs broken'
    updated_at = <Date 2014-02-14.05:31:06.721>
    user = 'https://bugs.python.org/iko'

    bugs.python.org fields:

    activity = <Date 2014-02-14.05:31:06.721>
    actor = 'ezio.melotti'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2014-02-14.05:31:06.723>
    closer = 'ezio.melotti'
    components = ['Library (Lib)']
    creation = <Date 2014-01-17.14:06:13.410>
    creator = 'iko'
    dependencies = []
    files = ['33845']
    hgrepos = []
    issue_num = 20288
    keywords = ['patch']
    message_count = 5.0
    messages = ['208336', '208350', '209911', '209914', '211202']
    nosy_count = 4.0
    nosy_names = ['iko', 'ezio.melotti', 'r.david.murray', 'python-dev']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue20288'
    versions = ['Python 2.7', 'Python 3.3', 'Python 3.4']

    @iko
    Mannequin Author

    iko mannequin commented Jan 17, 2014

    Python 2.7 HTMLParser.py, lines 185-199 (similar lines still exist in Python 3.4):

        match = charref.match(rawdata, i)
        if match:
            ...
        else:
            if ";" in rawdata[i:]:  # bail by consuming &#
                self.handle_data(rawdata[0:2])
                i = self.updatepos(i, 2)
            break

    If you feed the parser a broken (non-numeric) charref, it passes whatever happens to be in the first two characters of rawdata to handle_data(). E.g.:

    import sys
    from HTMLParser import HTMLParser  # Python 2.7 module name

    p = HTMLParser()
    p.handle_data = lambda x: sys.stdout.write(x)
    p.feed('<p>&#foo;</p>')

    This prints '<p', which is clearly wrong. I think the intention of the code is to pass '&#', which seems saner.
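
The slicing problem described above can be isolated from the parser. The following is a minimal sketch, not the stdlib code: `bail_buggy` and `bail_fixed` are hypothetical names, and the corrected slice `rawdata[i:i + 2]` is an assumption based on the stated intention of passing '&#'. Each function returns the text that would go to handle_data() together with the next value of `i`.

```python
def bail_buggy(rawdata, i):
    """Model of the reported branch: slices from the start of the buffer."""
    data = rawdata[0:2]   # ignores i, so it picks up unrelated text
    new_i = 2             # mirrors updatepos(i, 2), which returns 2
    return data, new_i


def bail_fixed(rawdata, i):
    """Assumed correction: slice and advance relative to the current position."""
    data = rawdata[i:i + 2]
    new_i = i + 2
    return data, new_i


rawdata = '<p>&#foo;</p>'
i = rawdata.index('&#')  # position of the broken charref (3 here)

print(bail_buggy(rawdata, i))  # ('<p', 2)
print(bail_fixed(rawdata, i))  # ('&#', 5)
```

Note that in this model the buggy variant also moves the position backwards, because the end index 2 is absolute rather than relative to `i`.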

    @iko iko mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 17, 2014
    @ezio-melotti ezio-melotti self-assigned this Jan 17, 2014
    @ezio-melotti
    Member

    Thanks for the report; this is indeed a bug.
    This behavior was covered by a test (see Lib/test/test_htmlparser.py:164), but _run_check feeds the chars one by one to the parser, and in that case it works correctly. When feeding the parser a whole chunk, however, I was able to reproduce the bug. This should be fixed, and the behavior of _run_check should probably be changed too -- maybe it could test both the char-by-char and the regular feeding.
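
The idea of testing both feeding modes could look roughly like the sketch below. It targets Python 3's `html.parser` and is not the real `_run_check` from Lib/test/test_htmlparser.py; `EventCollector`, `collect_events`, and `check_both_modes` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Hypothetical helper: records parser callbacks as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_endtag(self, tag):
        self.events.append(("endtag", tag))

    def handle_data(self, data):
        # Merge adjacent data events so that chunk boundaries alone
        # do not make the two feeding modes look different.
        if self.events and self.events[-1][0] == "data":
            self.events[-1] = ("data", self.events[-1][1] + data)
        else:
            self.events.append(("data", data))


def collect_events(source, char_by_char):
    """Feed source either one character at a time or as a single chunk."""
    parser = EventCollector()
    if char_by_char:
        for ch in source:
            parser.feed(ch)
    else:
        parser.feed(source)
    parser.close()
    return parser.events


def check_both_modes(source):
    """Return the events, failing if the two feeding modes disagree."""
    whole = collect_events(source, char_by_char=False)
    stepwise = collect_events(source, char_by_char=True)
    assert whole == stepwise, (whole, stepwise)
    return whole
```

A check written this way would flag any input for which whole-chunk feeding and char-by-char feeding produce different events, which is the kind of discrepancy that hid this bug.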

    @ezio-melotti
    Member

    Here's a patch against 2.7.

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 1, 2014

    New changeset 0d50b5851f38 by Ezio Melotti in branch '2.7':
    bpo-20288: fix handling of invalid numeric charrefs in HTMLParser.
    http://hg.python.org/cpython/rev/0d50b5851f38

    New changeset 32097f193892 by Ezio Melotti in branch '3.3':
    bpo-20288: fix handling of invalid numeric charrefs in HTMLParser.
    http://hg.python.org/cpython/rev/32097f193892

    New changeset 92b3928bfde1 by Ezio Melotti in branch 'default':
    bpo-20288: merge with 3.3.
    http://hg.python.org/cpython/rev/92b3928bfde1

    @ezio-melotti
    Member

    This is now fixed, thanks for the report!

    > This should be fixed, and the behavior of _run_check should probably be
    > changed too -- maybe it could test both the char-by-char and the
    > regular feeding.

    I created bpo-20623 to track this.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022