classification
Title: Memory leak in xml.etree.ElementTree.iterparse
Type: resource usage Stage: resolved
Components: XML Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: eli.bendersky, jess.j, miss-islington, scoder, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2018-12-14 19:59 by jess.j, last changed 2019-04-27 18:02 by scoder. This issue is now closed.

Files
File name Uploaded Description Edit
run2.py vstinner, 2018-12-14 20:11
Pull Requests
URL Status Linked Edit
PR 11170 merged serhiy.storchaka, 2018-12-14 21:59
PR 11169 closed vstinner, 2018-12-14 22:10
PR 11217 merged miss-islington, 2018-12-18 20:29
Messages (10)
msg331861 - (view) Author: Jess Johnson (jess.j) Date: 2018-12-14 19:59
When given xml that that would raise a ParseError, but parsing is stopped before the ParseError is raised, xml.etree.ElementTree.iterparse leaks memory.

Example:


import gc
from io import StringIO
import xml.etree.ElementTree as etree

import objgraph


def parse_xml():
    xml = """
      <LEVEL1>
      </LEVEL1>
    </ROOT>
    """
    parser = etree.iterparse(StringIO(initial_value=xml))
    for _, elem in parser:
        if elem.tag == 'LEVEL1':
            break


def run():
    parse_xml()

    gc.collect()
    uncollected_elems = objgraph.by_type('Element')
    print(uncollected_elems)
    objgraph.show_backrefs(uncollected_elems, max_depth=15)


if __name__ == "__main__":
    run()


Output:
[<Element 'LEVEL1' at 0x10df712c8>]

Also see this gist which has an image showing the objects that are retained in memory: https://gist.github.com/grokcode/f89d5c5f1831c6bc373be6494f843de3
msg331863 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 20:11
I wrote attached run.py which confirms a leak using tracemalloc:

$ python3 run.py 
1 calls: 15.3B / call (total: 15.3 kB)
100 calls: 15.3B / call (total: 1527.7 kB)
1000 calls: 15.3B / call (total: 15265.0 kB)
msg331864 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-14 20:11
Oops, there was a typo, you should read kB:

1 calls: 15.3 kB / call (total: 15.3 kB)
100 calls: 15.3 kB / call (total: 1527.7 kB)
1000 calls: 15.3 kB / call (total: 15265.0 kB)
msg331875 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-12-14 22:06
The problem was with detecting a reference cycle containing a TreeBuilder.
msg332051 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-12-18 11:20
Oops, my PR 11169 used the wrong issue number: bpo-35257 instead of bpo-35502. Anyway, I closed it, the change is too complex.

--

IMHO the root issue is the handling of the SyntaxError exception in XMLPullParser.feed(). I wrote a fix, but I don't have the bandwidth to write an unit test checking that the reference cycle is broken.

commit 9f3354d36a89d7898bdb631e5119cc37e9a74840 (fix_etree_leak)
Author: Victor Stinner <vstinner@redhat.com>
Date:   Fri Dec 14 22:03:16 2018 +0100

    bpo-35257: Fix memory leak in XMLPullParser.feed()
    
    Fix memory leak in XMLPullParser.feed() of xml.etree: on syntax
    error, clear the traceback to break a reference cycle.

diff --git a/Lib/xml/etree/ElementTree.py b/Lib/xml/etree/ElementTree.py
index c1cf483cf5..f17c52541b 100644
--- a/Lib/xml/etree/ElementTree.py
+++ b/Lib/xml/etree/ElementTree.py
@@ -1266,6 +1266,8 @@ class XMLPullParser:
             try:
                 self._parser.feed(data)
             except SyntaxError as exc:
+                # bpo-35502: Break reference cycle
+                #exc.__traceback__ = None
                 self._events_queue.append(exc)
 
     def _close_and_return_root(self):


I don't see any behavior difference in XMLPullParser.read_events() which raise again the exception:

        events = self._events_queue
        while events:
            event = events.popleft()
            if isinstance(event, Exception):
                raise event
            else:
                yield event

--

PR 11170 is also a nice enhancement (fix treebuilder_gc_traverse()), but maybe we should also prevent creating reference cycles in the first place?
msg332052 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-12-18 11:53
It is not easy to avoid reference cycles if use a generator function. And generator function is much faster than an implementation as a class with the __next__ method. We need to access the iterator object from the code of the generator function, and this creates a cycle.
msg332078 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-12-18 20:29
New changeset d2a75c67830d7c9f59e4e9b60f36974234c829ef by Serhiy Storchaka in branch 'master':
bpo-35502: Fix reference leaks in ElementTree.TreeBuilder. (GH-11170)
https://github.com/python/cpython/commit/d2a75c67830d7c9f59e4e9b60f36974234c829ef
msg332082 - (view) Author: miss-islington (miss-islington) Date: 2018-12-18 21:40
New changeset 60c919b58bd3cf8730947a00ddc6a527d6922ff1 by Miss Islington (bot) in branch '3.7':
bpo-35502: Fix reference leaks in ElementTree.TreeBuilder. (GH-11170)
https://github.com/python/cpython/commit/60c919b58bd3cf8730947a00ddc6a527d6922ff1
msg340991 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-04-27 15:39
This ticket looks like it's done for 3.7/8. Can it be closed?
I guess 3.6 isn't relevant anymore, right?
msg341005 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019-04-27 18:00
The 3.6 branch no longer accept bugfixes.
History
Date User Action Args
2019-04-27 18:02:59scodersetstatus: open -> closed
stage: patch review -> resolved
resolution: fixed
versions: - Python 3.6
2019-04-27 18:00:05vstinnersetmessages: + msg341005
2019-04-27 15:39:05scodersetmessages: + msg340991
2018-12-18 21:40:27miss-islingtonsetnosy: + miss-islington
messages: + msg332082
2018-12-18 20:29:31miss-islingtonsetpull_requests: + pull_request10453
2018-12-18 20:29:16serhiy.storchakasetmessages: + msg332078
2018-12-18 11:53:29serhiy.storchakasetmessages: + msg332052
2018-12-18 11:20:42vstinnersetmessages: + msg332051
2018-12-15 00:15:44jess.jsetversions: + Python 3.6
2018-12-14 22:10:01vstinnersetpull_requests: + pull_request10408
2018-12-14 22:06:17serhiy.storchakasetnosy: + scoder, eli.bendersky

messages: + msg331875
versions: + Python 3.8
2018-12-14 21:59:01serhiy.storchakasetkeywords: + patch
stage: patch review
pull_requests: + pull_request10407
2018-12-14 21:28:39serhiy.storchakasetassignee: serhiy.storchaka

nosy: + serhiy.storchaka
2018-12-14 20:11:56vstinnersetfiles: + run2.py

messages: + msg331864
2018-12-14 20:11:24vstinnersetfiles: - run.py
2018-12-14 20:11:04vstinnersetfiles: + run.py
nosy: + vstinner
messages: + msg331863

2018-12-14 19:59:00jess.jcreate