Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

html.parser.HTMLParser: setting 'convert_charrefs = True' leads to dropped text #67333

Closed
xkjq mannequin opened this issue Jan 1, 2015 · 14 comments
Closed

html.parser.HTMLParser: setting 'convert_charrefs = True' leads to dropped text #67333

xkjq mannequin opened this issue Jan 1, 2015 · 14 comments
Assignees
Labels
docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@xkjq
Copy link
Mannequin

xkjq mannequin commented Jan 1, 2015

BPO 23144
Nosy @larryhastings, @rbtcollins, @ezio-melotti, @bitdancer, @vadmium, @serhiy-storchaka
Files
  • issue23144.diff: Patch against 3.5/default
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2015-09-06.18:56:30.456>
    created_at = <Date 2015-01-01.18:47:08.610>
    labels = ['type-bug', 'library', 'docs']
    title = "html.parser.HTMLParser: setting 'convert_charrefs = True' leads to dropped text"
    updated_at = <Date 2015-09-09.13:51:04.674>
    user = 'https://bugs.python.org/xkjq'

    bugs.python.org fields:

    activity = <Date 2015-09-09.13:51:04.674>
    actor = 'larry'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2015-09-06.18:56:30.456>
    closer = 'ezio.melotti'
    components = ['Documentation', 'Library (Lib)']
    creation = <Date 2015-01-01.18:47:08.610>
    creator = 'xkjq'
    dependencies = []
    files = ['38380']
    hgrepos = []
    issue_num = 23144
    keywords = ['patch']
    message_count = 14.0
    messages = ['233291', '233292', '233295', '237457', '237480', '237483', '237503', '237529', '247617', '249165', '249745', '250006', '250007', '250311']
    nosy_count = 9.0
    nosy_names = ['larry', 'rbcollins', 'ezio.melotti', 'r.david.murray', 'docs@python', 'python-dev', 'martin.panter', 'serhiy.storchaka', 'xkjq']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue23144'
    versions = ['Python 3.4', 'Python 3.5', 'Python 3.6']

    @xkjq
    Copy link
    Mannequin Author

    xkjq mannequin commented Jan 1, 2015

    If convert_charrefs is set to true the final data section is not return by feed(). It is held until the next tag is encountered.

    ---

    from html.parser import HTMLParser
    
    class MyHTMLParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self, convert_charrefs=True)
            self.fed = []
        def handle_starttag(self, tag, attrs):
            print("Encountered a start tag:", tag)
        def handle_endtag(self, tag):
            print("Encountered an end tag :", tag)
        def handle_data(self, data):
            print("Encountered some data  :", data)
    
    parser = MyHTMLParser()
    
    parser.feed("foo <a>link</a> bar")
    print("")
    parser.feed("spam <a>link</a> eggs")

    gives

    Encountered some data : foo
    Encountered a start tag: a
    Encountered some data : link
    Encountered an end tag : a

    Encountered some data : barspam
    Encountered a start tag: a
    Encountered some data : link
    Encountered an end tag : a

    With 'convert_charrefs = False' it works as expected.

    @xkjq xkjq mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 1, 2015
    @vadmium
    Copy link
    Member

    vadmium commented Jan 1, 2015

    You “forgot” to call close():

    >>> parser.close()
    Encountered some data  :  eggs

    Perhaps this is a documentation bug, since there is a lot of example code given, but none of the examples call close().

    @vadmium vadmium added the docs Documentation in the Doc dir label Jan 1, 2015
    @xkjq
    Copy link
    Mannequin Author

    xkjq mannequin commented Jan 1, 2015

    That would make sense.

    Might also be worth mentioning the difference in behaviour with convert_charrefs = True/False as that was what led me to think this was a bug.

    @ezio-melotti
    Copy link
    Member

    Here is a patch that fixes the problem.
    Even though calling .close() is the correct solution, I preferred to restore the previous behavior and call handle_data as soon as possible.
    There is a corner case in which a charref might be cut in half while feeding chunks to the parser -- in that case the parser will wait and it might still be necessary to call .close() if an incomplete charref is at the end of the string.
    Adding context manager support to HTMLParser might also help solving the problem, but that's a separate issue.
    (Also thanks to Serhiy for the feedback he provided me on IRC.)

    @vadmium
    Copy link
    Member

    vadmium commented Mar 7, 2015

    I still think it would be worthwhile adding close() calls to the examples in the documentation (Doc/library/html.parser.rst).

    BTW I haven’t tested this, and maybe it is not a concern, but even with this patch it looks like the parser will buffer unlimited data and output nothing until close() if each string it is fed ends with an ampersand (and otherwise contains only plain text, no tags etc).

    @ezio-melotti
    Copy link
    Member

    I still think it would be worthwhile adding close() calls to
    the examples in the documentation (Doc/library/html.parser.rst).

    If I add context manager support to HTMLParser I can update the examples to use it, but otherwise I don't think it's worth changing them now.

    BTW I haven’t tested this, and maybe it is not a concern, but even with
    this patch it looks like the parser will buffer unlimited data and
    output nothing until close() if each string it is fed ends with an
    ampersand (and otherwise contains only plain text, no tags etc).

    This is true, but I don't think it's a realistic case.
    For this to be a problem you would need:

    1. Someone feeding the parser with arbitrary chunks. Text files are usually fed to the parser whole, or line by line -- arbitrary chunks are uncommon.
    2. A file that contains lot of entities. In most documents charrefs are not very common, and so the chances that a chunk will split one in the middle is low. Chances that several consecutive charrefs are split in the middle is even lower.
    3. A file that is very big. Even if all the file is buffered until a call to close(), it shouldn't be a concern, since most files have relatively small size. It is true that this has a quadratic complexity, but I would expect the parsing to complete in a reasonable time for average sizes.

    @vadmium
    Copy link
    Member

    vadmium commented Mar 8, 2015

    A context manager here would seem a bit strange. Is there any precedent for using context managers with feed parsers? The two others that come to mind are ElementTree.XMLParser and email.parser.FeedParser. These two build an object while parsing, and close() returns that object, so a context manager would be unhelpful.

    If an exception is raised inside the context manager, should close() be called (like for file objects), or not?

    @ezio-melotti
    Copy link
    Member

    A context manager here would seem a bit strange.

    I still haven't thought this through, but I can't see any problem with it right now. This would be similar to:

      from contextlib import closing
      with closing(MyHTMLParser()) as parser:
          parser.feed(html)

    and this already seems to work fine, including with OP's case.

    If an exception is raised inside the context manager,
    should close() be called (like for file objects), or not?

    The parser is guaranteed to never raise parsing-related errors during parsing, so this shouldn't be an issue. I will open a new issue after fixing this so we can keep discussing there.

    @rbtcollins
    Copy link
    Member

    @ezio I think you should commit what you have so far. LGTM.

    @rbtcollins
    Copy link
    Member

    @ezio - you seem busy, so I'll commit this next week if its still pending.

    @ezio-melotti
    Copy link
    Member

    I'll try to take care of this during the weekend.
    Feel free to ping me if I don't.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Sep 6, 2015

    New changeset ef82131d0c93 by Ezio Melotti in branch '3.4':
    bpo-23144: Make sure that HTMLParser.feed() returns all the data, even when convert_charrefs is True.
    https://hg.python.org/cpython/rev/ef82131d0c93

    New changeset 1f6155ffcaf6 by Ezio Melotti in branch '3.5':
    bpo-23144: merge with 3.4.
    https://hg.python.org/cpython/rev/1f6155ffcaf6

    New changeset 48ae9d66c720 by Ezio Melotti in branch 'default':
    bpo-23144: merge with 3.5.
    https://hg.python.org/cpython/rev/48ae9d66c720

    @ezio-melotti
    Copy link
    Member

    Fixed, thanks for the report!

    @larryhastings
    Copy link
    Contributor

    The Misc/NEWS entry for this was added under Python 3.5.0rc3. But, since no pull request has been made for this change, this change hasn't been merged into 3.5.0. It will ship as part of Python 3.5.1. I've moved the Misc/NEWS entry accordingly.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    docs Documentation in the Doc dir stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants