classification
Title: xml.etree.ElementInclude does not include nested xincludes
Type: enhancement Stage: patch review
Components: Library (Lib), XML Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: James.Bailey, eli.bendersky, math_foo, r.david.murray, rhettinger, ruffsl, scoder, urule99
Priority: normal Keywords: patch

Created on 2014-03-14 20:40 by James.Bailey, last changed 2018-03-30 16:41 by scoder.

Files
File name Uploaded Description Edit
issue20928.patch math_foo, 2014-04-16 20:14 Adds and tests recursive xinclude checks review
issue20928_fixed.patch urule99, 2014-06-22 18:08 review
issue20928_fixed2.patch urule99, 2014-06-26 05:09 review
Pull Requests
URL Status Linked Edit
PR 5723 open scoder, 2018-02-17 18:37
Messages (19)
msg213589 - (view) Author: James Bailey (James.Bailey) Date: 2014-03-14 20:40
After xml.etree.ElementInclude.include inserts an Xinclude'd href it does not walk the just-inserted subtree to see if it contains any Xincludes itself.

I think the behaviour should be modified to walk the included subtree and perform any Xincludes contained.
msg213648 - (view) Author: Stefan Behnel (scoder) * Date: 2014-03-15 09:47
Agreed that this should be done. The XInclude spec suggests that the processing is transitive. Also, lxml does it that way, in both the libxml2 based implementation and the ElementInclude compatibility module.

https://github.com/lxml/lxml/blob/master/src/lxml/ElementInclude.py

(note that this was adapted for lxml, so it's not a drop-in replacement for ET anymore)

Changing target version to Py3.5 as this is essentially a new (and somewhat backwards incompatible) feature.
msg216578 - (view) Author: Caelyn McAulay (math_foo) Date: 2014-04-16 20:14
xml.etree.ElementInclude.include now checks the inserted subtree to see if contains any Xincludes itself. And adds them.

Also added a unit test to check for the new functionality (which will fail without the change to ElementInclude.py).
msg220686 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-06-16 01:11
Thanks Caelyn.  This patch also needs a doc patch and a whatsnew entry in order to be complete.  It's not obvious to me where the relevant documentation is, though, so perhaps we instead have missing documentation that should be addressed in a separate issue.  The whatsnew entry would still be needed, though.

However, I think there is something important missing here in the patch itself.  It is specified (section 4.2.7 Inclusion Loops) that it is a fatal error for an already included document to be specified in a recursively processed xinclude directive, and indeed failing to detect such a loop has DOS implications.  So, the patch needs to address that before it can be committed.
msg220689 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2014-06-16 02:29
FWIW, we often do whatsnew entries later in the development cycle.  I think this can go forward without whatsnew.

The doc entry could be as simple as a ..versionchanged 2.5 include() now supports recursive Xincludes or somesuch.
msg220717 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-06-16 13:06
Since there is no guarantee anyone is going to tackle the job of reviewing all the changes for whatsnew entries, I prefer that we get in the habit of creating the entries as we go along.

As for the versionchanged...if you can find where include is documented, that would work :)

The recursion still needs to be addressed (it is addressed in the lxml version Stefan linked to).
msg221290 - (view) Author: Blake Hartstein (urule99) Date: 2014-06-22 18:08
This patch incorporates nested includes and a testcase that shows how it handles infinite include by throwing an exception.
msg221555 - (view) Author: Stefan Behnel (scoder) * Date: 2014-06-25 16:24
Your code adds a lot of new code. Why is that necessary? Can't the new feature be integrated into the existing code?
msg221564 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2014-06-25 17:51
Yeah, include should now be a very simple wrapper around a call to _include, include shouldn't have the original include code (which is moved to _include) in it.
msg221591 - (view) Author: Blake Hartstein (urule99) Date: 2014-06-26 05:09
That would make sense. Please see if the updated patch file works.
msg312092 - (view) Author: (ruffsl) * Date: 2018-02-13 01:34
I see it's been a few years, but I would still like to see xml.etree.ElementInclude support nested xincludes, as lxml does. Additionally, I'd like to have the default_loader attempt to open the href with respect to the original xml file's dirname, as again with lxml, as opposed to just assuming its relative to the working directory for the session.

Let me know if there are any changes you'd like to see in the latest patch here, and I could incorporate those into a new PR?
msg312286 - (view) Author: Stefan Behnel (scoder) * Date: 2018-02-17 18:38
The latest "fixed2" patch looks good to me, but the author didn't sign a contributors agreement. However, I did, and I already wrote the same thing for lxml, so I put together an initial PR.
msg312325 - (view) Author: (ruffsl) * Date: 2018-02-18 19:47
Hello @scoder,

Thanks for looking into this.

In addition to what I mentioned with lxml's use of the parent xml file's respect dirname when attempting to open the include element's href, do you think it would be possible to (optionally or by default) have the include processor provide the necessary xml:base attributes for compliance with the specification [1][2][3].
I know lxml partially does this, but not for recursive includes beyond a depth of one. Without this, it makes it impossible to track down the subsequent source for deep include paths, i.e. when you'd like to trace from what sequence of includes contributed the element in question.

From looking at the "fixed2", although the added lines 137-143 in ElementInclude.py appear to prevent infinite recursive includes, but I think it also prevents valid merging of include trajectories when expanding the tree of include paths. E.g. what happens if I state the inclusion of the same source multiple times elsewhere in the same file, or elsewhere in the include tree that is no way a sub-child of the current parent file.
Isn't `already_included` unnecessarily global in scope, when it just need to be relative to the trajectory in the sequence of current branches; only checking a node is not simultaneously a parent and child along a single include trajectory?


[1] https://stackoverflow.com/a/22791471/2577586
[2] https://www.w3.org/TR/xmlbase/
[3] http://xerces.apache.org/xerces2-j/faq-xinclude.html#faq-3
msg312326 - (view) Author: James Bailey (James.Bailey) Date: 2018-02-18 20:52
Agreed with ruffsl's concerns about the overly aggressive detection of infinite recursion.

I also wonder if the hrefs should be normalized or canonized for the check? The check may miss infinite recursions if the hrefs happen to be written in non-matching but equivalent forms. Ex: relative versus absolute paths.
msg312333 - (view) Author: Stefan Behnel (scoder) * Date: 2018-02-19 00:13
> The check may miss infinite recursions if the hrefs happen to be written in non-matching but equivalent forms. Ex: relative versus absolute paths.

I thought about that, too, but it's not a real problem. There are only a few different ways to spell the same file path, and once they are through, the recursion would still be detected and never become infinite.

Admittedly, the current implementation might lower the overhead for attacks a little, but then, if an attacker can control the input anyway, then there is not really much to win by including the same file multiple times rather than including different files.

Maybe we should add a "max_depth" parameter to limit the maximum recursion depth, defaulting to e.g. 5, that users would have to pass in order to say "I know what I'm doing".

I agree with the comment about the overly restrictive global set, though. Included file paths should be collected only along an inclusion path and not across independent subtrees.
msg312612 - (view) Author: (ruffsl) * Date: 2018-02-23 02:10
> Included file paths should be collected only along an inclusion path and not across independent subtrees.

Yes, well put.

> Maybe we should add a "max_depth" parameter to limit the maximum recursion depth, defaulting to e.g. 5, that users would have to pass in order to say "I know what I'm doing".

Could that be set to false by the user, just in case we don't know beforehand how deep the rabbit hole goes, but we're feeling overly committed to see it through?

Not to detract from the ticket, but I'd just like to share to a question related to this topic about the expected behavior of Xinclude [1]. You could also see it as a use case example for the recursive import feature we are currently deciding, of which would help avoid one more non system library to workaround [2].

[1] https://stackoverflow.com/q/48857647/2577586
[2] https://github.com/ComArmor/comarmor/blob/856dd339b090c28e86206d4d6af0fac050618e74/comarmor/__init__.py#L116
msg312651 - (view) Author: Stefan Behnel (scoder) * Date: 2018-02-23 17:43
I've added a 'max_depth' argument to limit the maximum recursive include depth to 6 by default (which is a small, arbitrarily chosen value). Users can set it to None to disable the limit.

From my side, I don't see any more features that I would add.
@ruffsl: feel free to open a new ticket if you feel like anything is missing that is not covered by the scope of this one.
msg314693 - (view) Author: Stefan Behnel (scoder) * Date: 2018-03-30 16:36
There seems to be no documentation currently for ElementInclude. Would be nice if someone who's interested in this feature could take the time to write something up. I created a documentation ticket as issue #33187.
msg314694 - (view) Author: Stefan Behnel (scoder) * Date: 2018-03-30 16:41
PR is complete now, ready for merging.
History
Date User Action Args
2018-03-30 16:41:23scodersetmessages: + msg314694
2018-03-30 16:36:35scodersetmessages: + msg314693
2018-02-23 17:43:39scodersetmessages: + msg312651
2018-02-23 02:10:53ruffslsetmessages: + msg312612
2018-02-19 00:13:58scodersetmessages: + msg312333
2018-02-18 20:52:21James.Baileysetmessages: + msg312326
2018-02-18 19:47:50ruffslsetmessages: + msg312325
2018-02-17 18:38:27scodersettype: behavior -> enhancement
messages: + msg312286
versions: + Python 3.8, - Python 3.5
2018-02-17 18:37:47scodersetstage: patch review
pull_requests: + pull_request5508
2018-02-13 01:34:55ruffslsetnosy: + ruffsl
messages: + msg312092
2014-06-26 05:09:55urule99setfiles: + issue20928_fixed2.patch

messages: + msg221591
2014-06-25 17:51:01r.david.murraysetmessages: + msg221564
2014-06-25 16:24:27scodersetmessages: + msg221555
2014-06-22 18:08:59urule99setfiles: + issue20928_fixed.patch
nosy: + urule99
messages: + msg221290

2014-06-16 13:06:40r.david.murraysetmessages: + msg220717
2014-06-16 02:29:10rhettingersetnosy: + rhettinger
messages: + msg220689
2014-06-16 01:11:24r.david.murraysetnosy: + r.david.murray
messages: + msg220686
2014-04-16 20:14:01math_foosetfiles: + issue20928.patch

nosy: + math_foo
messages: + msg216578

keywords: + patch
2014-03-15 09:47:05scodersetmessages: + msg213648
versions: + Python 3.5, - Python 2.7
2014-03-14 22:28:48ned.deilysetnosy: + scoder, eli.bendersky
2014-03-14 20:40:06James.Baileycreate