Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Memory leak in xml.etree.ElementTree.iterparse #79683

Closed
grokcode mannequin opened this issue Dec 14, 2018 · 10 comments
Closed

Memory leak in xml.etree.ElementTree.iterparse #79683

grokcode mannequin opened this issue Dec 14, 2018 · 10 comments
Assignees
Labels
3.7 (EOL) end of life 3.8 only security fixes performance Performance or resource usage topic-XML

Comments

@grokcode
Copy link
Mannequin

grokcode mannequin commented Dec 14, 2018

BPO 35502
Nosy @scoder, @vstinner, @serhiy-storchaka, @miss-islington, @grokcode
PRs
  • bpo-35502: Fix reference leaks in ElementTree.TreeBuilder. #11170
  • [WIP] bpo-35502: Fix memory leak in xml.etree iterparse() #11169
  • [3.7] bpo-35502: Fix reference leaks in ElementTree.TreeBuilder. (GH-11170) #11217
  • Files
  • run2.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2019-04-27.18:02:59.122>
    created_at = <Date 2018-12-14.19:59:00.832>
    labels = ['expert-XML', '3.7', '3.8', 'performance']
    title = 'Memory leak in xml.etree.ElementTree.iterparse'
    updated_at = <Date 2019-04-27.18:02:59.121>
    user = 'https://github.com/grokcode'

    bugs.python.org fields:

    activity = <Date 2019-04-27.18:02:59.121>
    actor = 'scoder'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2019-04-27.18:02:59.122>
    closer = 'scoder'
    components = ['XML']
    creation = <Date 2018-12-14.19:59:00.832>
    creator = 'jess.j'
    dependencies = []
    files = ['47999']
    hgrepos = []
    issue_num = 35502
    keywords = ['patch']
    message_count = 10.0
    messages = ['331861', '331863', '331864', '331875', '332051', '332052', '332078', '332082', '340991', '341005']
    nosy_count = 6.0
    nosy_names = ['scoder', 'vstinner', 'eli.bendersky', 'serhiy.storchaka', 'miss-islington', 'jess.j']
    pr_nums = ['11170', '11169', '11217']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'resource usage'
    url = 'https://bugs.python.org/issue35502'
    versions = ['Python 3.7', 'Python 3.8']

    @grokcode
    Copy link
    Mannequin Author

    grokcode mannequin commented Dec 14, 2018

    When given xml that that would raise a ParseError, but parsing is stopped before the ParseError is raised, xml.etree.ElementTree.iterparse leaks memory.

    Example:

    import gc
    from io import StringIO
    import xml.etree.ElementTree as etree
    
    import objgraph
    
    
    def parse_xml():
        xml = """
          <LEVEL1>
          </LEVEL1>
        </ROOT>
        """
        parser = etree.iterparse(StringIO(initial_value=xml))
        for _, elem in parser:
            if elem.tag == 'LEVEL1':
                break
    
    
    def run():
        parse_xml()
    
        gc.collect()
        uncollected_elems = objgraph.by_type('Element')
        print(uncollected_elems)
        objgraph.show_backrefs(uncollected_elems, max_depth=15)
    
    
    if __name__ == "__main__":
        run()

    Output:
    [<Element 'LEVEL1' at 0x10df712c8>]

    Also see this gist which has an image showing the objects that are retained in memory: https://gist.github.com/grokcode/f89d5c5f1831c6bc373be6494f843de3

    @grokcode grokcode mannequin added 3.7 (EOL) end of life topic-XML performance Performance or resource usage labels Dec 14, 2018
    @vstinner
    Copy link
    Member

    I wrote attached run.py which confirms a leak using tracemalloc:

    $ python3 run.py 
    1 calls: 15.3B / call (total: 15.3 kB)
    100 calls: 15.3B / call (total: 1527.7 kB)
    1000 calls: 15.3B / call (total: 15265.0 kB)

    @vstinner
    Copy link
    Member

    Oops, there was a typo, you should read kB:

    1 calls: 15.3 kB / call (total: 15.3 kB)
    100 calls: 15.3 kB / call (total: 1527.7 kB)
    1000 calls: 15.3 kB / call (total: 15265.0 kB)

    @serhiy-storchaka serhiy-storchaka self-assigned this Dec 14, 2018
    @serhiy-storchaka
    Copy link
    Member

    The problem was with detecting a reference cycle containing a TreeBuilder.

    @serhiy-storchaka serhiy-storchaka added the 3.8 only security fixes label Dec 14, 2018
    @vstinner
    Copy link
    Member

    Oops, my PR 11169 used the wrong issue number: bpo-35257 instead of bpo-35502. Anyway, I closed it, the change is too complex.

    --

    IMHO the root issue is the handling of the SyntaxError exception in XMLPullParser.feed(). I wrote a fix, but I don't have the bandwidth to write an unit test checking that the reference cycle is broken.

    commit 9f3354d36a89d7898bdb631e5119cc37e9a74840 (fix_etree_leak)
    Author: Victor Stinner <vstinner@redhat.com>
    Date: Fri Dec 14 22:03:16 2018 +0100

    bpo-35257: Fix memory leak in XMLPullParser.feed()
    
    Fix memory leak in XMLPullParser.feed() of xml.etree: on syntax
    error, clear the traceback to break a reference cycle.
    
    diff --git a/Lib/xml/etree/ElementTree.py b/Lib/xml/etree/ElementTree.py
    index c1cf483cf5..f17c52541b 100644
    --- a/Lib/xml/etree/ElementTree.py
    +++ b/Lib/xml/etree/ElementTree.py
    @@ -1266,6 +1266,8 @@ class XMLPullParser:
                 try:
                     self._parser.feed(data)
                 except SyntaxError as exc:
    +                # bpo-35502: Break reference cycle
    +                #exc.__traceback__ = None
                     self._events_queue.append(exc)
     
         def _close_and_return_root(self):

    I don't see any behavior difference in XMLPullParser.read_events() which raise again the exception:

            events = self._events_queue
            while events:
                event = events.popleft()
                if isinstance(event, Exception):
                    raise event
                else:
                    yield event

    --

    PR 11170 is also a nice enhancement (fix treebuilder_gc_traverse()), but maybe we should also prevent creating reference cycles in the first place?

    @serhiy-storchaka
    Copy link
    Member

    It is not easy to avoid reference cycles if use a generator function. And generator function is much faster than an implementation as a class with the __next__ method. We need to access the iterator object from the code of the generator function, and this creates a cycle.

    @serhiy-storchaka
    Copy link
    Member

    New changeset d2a75c6 by Serhiy Storchaka in branch 'master':
    bpo-35502: Fix reference leaks in ElementTree.TreeBuilder. (GH-11170)
    d2a75c6

    @miss-islington
    Copy link
    Contributor

    New changeset 60c919b by Miss Islington (bot) in branch '3.7':
    bpo-35502: Fix reference leaks in ElementTree.TreeBuilder. (GH-11170)
    60c919b

    @scoder
    Copy link
    Contributor

    scoder commented Apr 27, 2019

    This ticket looks like it's done for 3.7/8. Can it be closed?
    I guess 3.6 isn't relevant anymore, right?

    @vstinner
    Copy link
    Member

    The 3.6 branch no longer accept bugfixes.

    @scoder scoder closed this as completed Apr 27, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes performance Performance or resource usage topic-XML
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants