Title: xml.etree.ElementTree in Python 3.6 is incompatible with defusedxml
Type: behavior Stage: resolved
Components: XML Versions: Python 3.6
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: adamwill, ammar2, christian.heimes, serhiy.storchaka
Priority: normal Keywords:

Created on 2016-12-22 17:48 by adamwill, last changed 2017-01-07 18:33 by serhiy.storchaka. This issue is now closed.

Messages (10)
msg283854 - (view) Author: Adam Williamson (adamwill) Date: 2016-12-22 17:48
The changes made to xml.etree.ElementTree in this commit:

break defusedxml , Christian Heimes' library of modified parsers that's intended to be safe for parsing untrusted input. As of now, it's not possible to have defusedxml working properly with Python 3.6; its ElementTree parsers cannot work properly.

Of course, defusedxml is an external library that does 'inappropriate' things (like fiddling around with internals of the xml library). So usually this should be considered just a problem for defusedxml to deal with somehow, and indeed I've reported it there: . That report has more details on the precise problem.

I thought it was worthwhile reporting to Python itself as well, however, for a specific reason. The Python docs for the xml library explicitly cover and endorse the use of defusedxml:

"defusedxml is a pure Python package with modified subclasses of all stdlib XML parsers that prevent any potentially malicious operation. Use of this package is recommended for any server code that parses untrusted XML data." -

so as things stand, the Python 3.6 docs will explicitly recommend people use a module which does not work with Python 3.6. Is this considered a serious problem?

It also looks to me (though I'm hardly an expert) as if it might be quite difficult and ugly to fix this on the defusedxml side, and the 'nicest' fix might actually be to tweak Python's xml module back a bit more to how it was in < 3.6 (but without losing the optimization from the commit in question) so it's easier for defusedxml to get at the internals it needs...but I could well be wrong about that.

msg283855 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-22 18:16
Could you please provide an example of code that doesn't work with Python 3.6?
msg283856 - (view) Author: Ammar Askar (ammar2) * (Python committer) Date: 2016-12-22 18:25
From what I can tell from the failed travis run in defused, the problem is here:

It's caused by it using the removed _IterParseIterator file (or rather a field that represents it that's set to None)

The full traceback can be found here:
msg283857 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-22 19:01
I think this can be fixed from defusedxml side:

    def iterparse(source, events=None, parser=None, forbid_dtd=False,
                  forbid_entities=True, forbid_external=True):
        if not parser:
            parser = DefusedXMLParser(target=_TreeBuilder(),
        return xml.etree.ElementTree.iterparse(source, events, parser)
msg283858 - (view) Author: Adam Williamson (adamwill) Date: 2016-12-22 19:28
Ammar: yep, that's correct. There's code in defused's - _ get_py3_cls() - which passes different values to _generate_etree_functions based on the Python 3 version.

For Python 3.2+, defused 0.4.1 expects to use the _IterParseIterator class from xml ElementTree , but that got removed in 3.6, so if you just use defused 0.4.1 with Python 3.6, it asserts as soon as you try to import defusedxml.ElementTree at all:

>>> import defusedxml.ElementTree
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/defusedxml-0.4.1/defusedxml/", line 62, in <module>
    _XMLParser, _iterparse, _IterParseIterator, ParseError = _get_py3_cls()
  File "/tmp/defusedxml-0.4.1/defusedxml/", line 56, in _get_py3_cls
    _IterParseIterator = pure_pymod._IterParseIterator
AttributeError: module 'xml.etree.ElementTree' has no attribute '_IterParseIterator'

Christian made a change to make _get_py3_cls() pass None to _generate_etree_functions() so you can at least import defusedxml.ElementTree, but he didn't change _generate_etree_functions() at all so it just doesn't have a code path that handles this at all; for Python 3.2+ it's expecting to get a real iterator, not None, and it just breaks completely trying to use None as an iterator:

<mock-chroot> sh-4.3# echo "<xml></xml>" > test.xml
<mock-chroot> sh-4.3# python3
>>> import defusedxml.ElementTree
>>> parser = defusedxml.ElementTree.iterparse('test.xml')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/tmp/defusedxml-0.4.1/defusedxml/", line 141, in iterparse
    return _IterParseIterator(source, events, parser, close_source)
TypeError: 'NoneType' object is not callable

Serhiy, thanks for the suggestion! We'll try that out.
msg283860 - (view) Author: Adam Williamson (adamwill) Date: 2016-12-22 20:03
serhiy: so, the funny thing is this: your fix is ultimately a reversion. Though we have to dig way back into the bowels of defusedxml to see this. Specifically, to this commit!

that's the point at which Christian introduced a divergence between Python 2 and Python 3 here, and essentially the same divergence remains between the `elif PY3:` and `else: # Python 2.7` blocks now. The Python 2.7 block in current defusedxml is in fact the same as your block, because `_iterparse` is just the parent `iterparse` function, as discovered by `_get_py3_cls()`.

So before applying your change, I kinda want to understand why Christian introduced this divergence in the first place. The commit message claims it's because Python 3.3 hid some pure python, but I don't quite understand that: looks like iterparse() was still perfectly available and usable for this purpose in 3.3, just as it was in 3.2 and still appears to be in 3.6.
msg283861 - (view) Author: Adam Williamson (adamwill) Date: 2016-12-22 20:13
Aha, so thanks to my colleague Patrick Uiterwijk, we see the problem. Since Python 3.3, Python doesn't actually use that pure-Python iterparse() function if it can instead replace it with a C version:

"# Overwrite 'ElementTree.parse' and 'iterparse' to use the C XMLParser"

so the reason defusedxml wants to use _IterParseIterator on Python 3 is because if it just uses xml.etree.ElementTree.iterparse() it's getting the 'accelerated' C implementation, not the pure-Python implementation it wants.
msg283862 - (view) Author: Adam Williamson (adamwill) Date: 2016-12-22 20:24
Digging some more, it looks like *only* Python 3.3 went so far out of its way to hide the pure-Python iterparse() - the code was changed again in 3.4 and it doesn't do that any more. So I think a way forward here is to make the code that uses _IterParseIterator specific to Python 3.3, and use the Python 2.7 code (i.e. just use the iterparse() function) for 3.2 and 3.4+.
msg283863 - (view) Author: Adam Williamson (adamwill) Date: 2016-12-22 20:42 is my cut at a fix for this, gonna test it out now.
msg283864 - (view) Author: Adam Williamson (adamwill) Date: 2016-12-22 20:53 should fix this, I hope.
Date User Action Args
2017-01-07 18:33:47serhiy.storchakasetstatus: open -> closed
resolution: not a bug
stage: resolved
2016-12-22 20:53:29adamwillsetmessages: + msg283864
2016-12-22 20:42:34adamwillsetmessages: + msg283863
2016-12-22 20:24:57adamwillsetmessages: + msg283862
2016-12-22 20:13:35adamwillsetmessages: + msg283861
2016-12-22 20:03:44adamwillsetmessages: + msg283860
2016-12-22 19:28:22adamwillsetmessages: + msg283858
2016-12-22 19:01:15serhiy.storchakasetmessages: + msg283857
2016-12-22 18:25:03ammar2setnosy: + ammar2
messages: + msg283856
2016-12-22 18:16:46serhiy.storchakasetmessages: + msg283855
2016-12-22 18:12:44serhiy.storchakasetnosy: + christian.heimes, serhiy.storchaka
2016-12-22 17:48:46adamwillcreate