classification
Title: xml.etree.ElementTree accepts control chars.
Type: enhancement Stage: resolved
Components: Library (Lib), XML Versions: Python 3.4
process
Status: closed Resolution: duplicate
Dependencies: Superseder: ElementTree and minidom don't prevent creation of not well-formed XML
View: 5166
Assigned To: Nosy List: christian.heimes, eli.bendersky, maker, mmokrejs, r.david.murray, scoder, serhiy.storchaka
Priority: normal Keywords:

Created on 2013-08-27 09:50 by maker, last changed 2013-09-02 21:19 by eli.bendersky. This issue is now closed.

Files
File name Uploaded Description Edit
test.py maker, 2013-08-27 09:50
inject.py maker, 2013-08-27 13:25
Messages (30)
msg196271 - (view) Author: Michele Orrù (maker) * Date: 2013-08-27 09:50
Got from irc; 
<JerryKwan> python bug in xml.etree.ElementTree, from version 2.7 to 3.2 http://www.reddit.com/r/Python/comments/1l6cta/python_bug_in_xmletreeelementtree/

I think we should keep consistency with lxml and forbid control chars in advance. 
<https://en.wikipedia.org/wiki/Valid_characters_in_XML#XML_1.1>

Attaching the test file proving the issue, slightly modified to work on 2.7 and 3.x
msg196277 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-27 12:16
This is a bit tricky in ET because it generally allows you to stick anything into the Element properties (and that's a feature). So catching this at tree building time (as lxml.etree does) isn't really possible.

However, at least catching it in the serialiser should be possible and would help. Regexes for well-formed tag names and text could do the job.
msg196278 - (view) Author: Michele Orrù (maker) * Date: 2013-08-27 12:20
you mind if I try by myself to provide patch and unittest in the next few days?
msg196280 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-27 12:29
Go for it. That's usually the fastest way to get things done.
msg196284 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-27 12:56
Unless it is a security issue, this seems like the kind of fix that shouldn't be applied to maintenance releases.
msg196286 - (view) Author: Michele Orrù (maker) * Date: 2013-08-27 13:25
I suppose it is, David, if in 2 minutes flat I can change your terminal name.
msg196288 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2013-08-27 13:43
In that case, the fix needs to be applied to 3.2 and 2.6 as well.  Or at least considered for application.  It could be that this will break working (though dangerous) programs.  I'll leave it to folks more knowledgeable in this particular area than I to decide where the severity/backward compatibility line lies in this case.
msg196294 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-08-27 14:20
See also issue7727. Almost any other XML generation code (xml.sax.sautils.XMLGenerator, xml.dom.minidom.Element.writexml(), etc, but not plistlib.PlistWriter) has the same problem.

The problem with filtering control characters is that it will significantly slowdown XML generation.
msg196296 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-27 14:26
Michele, could you elaborate how you would exploit this issue as a security risk?

I mean, I can easily create a (non-)XML-document with control characters manually, and the parser would reject it.

What part of the create-to-serialise process exactly is a problem here?
msg196298 - (view) Author: Michele Orrù (maker) * Date: 2013-08-27 15:04
> Michele, could you elaborate how you would exploit this issue as a security risk?
Sure. What I meant in my message is: assume you have a script that simply stores each message it receives (from stdin, from a tcp stream, whatever) inside an xml tree like 
'<text>{message1}</text><text>{message2}<text>',
and prints the tree on SIGINT.

What I would expect is the xml document not to allow control chars, as "restricted and discouraged", and consistent with lxml. 
What instead happens is that the control chars are not handled, and thus anybody can send control chars in my terminal. Changing the terminal title is a trivial example of those. 

For sure an echo server may have the same issue, but the premises are different, because I expect to print just a byte stream.
Mentioning this fact in the documentation may be a possible solution, but I believe more that keeping consistency with lxml is the right way.

> I mean, I can easily create a (non-)XML-document with control characters manually, and the parser would reject it.
False? The parser is *not* rejecting control chars.
 
> What part of the create-to-serialise process exactly is a problem here?
ElementTree.tostring().
msg196313 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-27 19:48
> The parser is *not* rejecting control chars.

The parser *is* rejecting control characters. It's an XML parser. See the example in the link you posted.


> assume you have a script that simply stores each message it receives (from stdin, from a tcp stream, whatever) inside an xml tree like 
> '<text>{message1}</text><text>{message2}<text>',
> and prints the tree on SIGINT.

That's not an XML specific issue. You are printing a byte string here, so repr() would be the right thing to use (and is actually being used automatically in Py3), instead of plain printing. The fact that you are wrapping the content in XML doesn't matter.


>> What part of the create-to-serialise process exactly is a problem here?
> ElementTree.tostring().

What I meant was: at what step of the process from creating an XML tree in memory to serialisation is it a problem that the tree contains control characters? Because once the data is serialised, it will just be rejected on input by any XML parser, and handling bytes data is a thing on its own (e.g. you could serialise to UTF16 and the result would contain null bytes - too bad).

It may just be a bad example that you chose here, but I really can't see this being a security problem. You are mishandling arbitrary untrusted binary data, that's all. Control characters are most likely not the only problem that you should guard against.

Unless there is a more dangerous way to exploit this that is actually due to XML being used, I'd suggest changing the type from "security" back to "behaviour".
msg196315 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-27 19:55
Or maybe even to "enhancement". The behaviour that it writes out what you give it isn't exactly wrong, it's just inconvenient that you have to take care yourself that you pass it well-formed XML content.
msg196328 - (view) Author: Martin Mokrejs (mmokrejs) Date: 2013-08-27 23:19
Incidentally I read today http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html mentioning ^A being used. Maybe that would stop working?
msg196360 - (view) Author: Michele Orrù (maker) * Date: 2013-08-28 09:21
> The parser *is* rejecting control characters. It's an XML parser. See the example in the link you posted.
Ehrm, my apologies.

> That's not an XML specific issue. You are printing a byte string here, so repr() would be the right thing to use (and is actually being used automatically in 
> Py3), instead of plain printing. The fact that you are wrapping the content in XML doesn't matter.
[citation needed] 
After a quick scan in the documentation I did not see anything mentioning this. Instead, I see many cases in which escape chars and binary-to-text encodings are mentioned.

> What I meant was: at what step of the process from creating an XML tree in memory to serialisation is it a problem that the tree contains control characters? 
> Because once the data is serialised, it will just be rejected on input by any XML parser, and handling bytes data is a thing on its own (e.g. you could serialise 
> to UTF16 and the result would contain null bytes - too bad).
m, I think the problem lies in the expectation of having fromstring(tostring(tree)) = tree

> Unless there is a more dangerous way to exploit this that is actually due to XML being used, I'd suggest changing the type from "security" back to "behaviour".
> Or maybe even to "enhancement". The behaviour that it writes out what you give it isn't exactly wrong, it's just inconvenient that you have to take care yourself 
> that you pass it well-formed XML content.
I think the point here is clarifying whether xml expect text or just a byte string. In case that's a stream of byte, I agree with you, is more a "behaviour" problem.
msg196361 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-28 09:32
> I think the point here is clarifying whether xml expect text or just a byte string. In case that's a stream of byte, I agree with you, is more a "behaviour" problem.

XML is *defined* as a stream of bytes.

Regarding the API side in ElementTree, Py2 accepts byte strings and Py3 requires Unicode strings. Py2 will not change in that regard, and I can't see this being a serious enough issue to change the ET-API there, so IMHO we can ignore Py2.x completely for this issue. (changing ticket accordingly)

However, Py3 will happily write out control characters if they appear in the Unicode text string, so the issue is the same there. A fix for Py3 would be to add an input validation step, preferably at serialisation time.
msg196362 - (view) Author: Michele Orrù (maker) * Date: 2013-08-28 09:37
> Incidentally I read today http://blastedbio.blogspot.co.uk/2012/05/blast-tabular-missing-descriptions.html mentioning ^A being used. 
> Maybe that would stop working?
I don't see any problem in any xml output. Indeed:
"You can't put a nasty non-printing ASCII character in XML, so they were switched to something else. That is my working theory for this anyway."
msg196364 - (view) Author: Michele Orrù (maker) * Date: 2013-08-28 09:45
> XML is *defined* as a stream of bytes.
Can you *paste* the *source* proving what you are arguing, please?  

> Regarding the API side in ElementTree, Py2 accepts byte strings and Py3 requires Unicode strings.
"accepts"? python3 works with ElementTree(bytes(unicode))
msg196365 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-28 09:48
>> XML is *defined* as a stream of bytes.
> Can you *paste* the *source* proving what you are arguing, please?

http://www.w3.org/TR/REC-xml/


> python3 works with ElementTree(bytes(unicode))

What does this sentence mean?
msg196366 - (view) Author: Michele Orrù (maker) * Date: 2013-08-28 10:01
>>> XML is *defined* as a stream of bytes.
>> Can you *paste* the *source* proving what you are arguing, please?
> http://www.w3.org/TR/REC-xml/
"""
The first two suggestions are directly derived from the rules given for identifiers in Standard Annex #31 (UAX #31) of the Unicode Standard, version 5.0 [Unicode], and exclude all control characters, enclosing nonspacing marks, non-decimal numbers, private-use characters, punctuation characters (with the noted exceptions), symbol characters, unassigned codepoints, and white space characters. The other suggestions are mostly derived from Appendix B in previous editions of this specification.
"""
msg196367 - (view) Author: Michele Orrù (maker) * Date: 2013-08-28 10:05
"""
Document authors are encouraged to avoid "compatibility characters", as defined in section 2.3 of [Unicode]. The characters defined in the following ranges are also discouraged. They are either control characters or permanently undefined Unicode characters:
"""
msg196368 - (view) Author: Michele Orrù (maker) * Date: 2013-08-28 10:08
Does not seem to me just a "byte string" where you can put binary data. 
Hence, I expect the xml tree to escape/reject those. 
Hence, Is not an enhancement, but a bug. Unless we just want to document this.

(not going to change the metadata, otherwise we'll end up changing on each message).
msg196369 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-28 10:26
We are talking about two different things here.

I said that (serialised) XML is defined as a sequence of bytes. Read the spec on that.

What you are talking about is the Infoset, or the parsed/generated in-memory XML tree. That's obviously not bytes, it's defined based on Unicode. Parsing and serialising does the mapping here.

The "attack" that you presented is based on serialised XML, thus on a sequence of bytes. What I am saying is that this "attack" can be done by any kind of binary data, so it's not XML specific, thus not a problem with ElementTree.

The fact that ElementTree allows you to generate non well-formed 'XML' containing control characters when you tell it to do so is unfortunate, but it's neither a security risk (you already had the non well-formed content in your hands *before* you passed it into ElementTree), nor clearly a bug, because the user specifically requested the serialisation of an in-memory tree that contained these control characters.

But, once again, it would be nice if ElementTree rejected this input in one way or another, and that's a feature request.
msg196373 - (view) Author: Michele Orrù (maker) * Date: 2013-08-28 10:54
>I said that (serialised) XML is defined as a sequence of bytes. 
> Read the spec on that.
And I'm saying that's inexact. I have expectations that control chars are escaped in the serialized xml, because the spec I'm reading says so, and because the documentation does not warn me about this.
msg196379 - (view) Author: Michele Orrù (maker) * Date: 2013-08-28 12:56
Just pointed by a friend - <http://www.w3.org/TR/REC-xml/#sec-cdata-sect>
I suppose this is insanely used to put binary blobs inside xml until "only the CDEnd string is recognized as markup".
That's what I needed. 

Amen.
msg196441 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-29 03:25
Is that you actual use case? That you *want* to store binary data in XML, instead of getting it properly rejected as non well-formed content?

Then I suggest going the canonical route of passing it through base64 first, or any of the other binary-to-characters encoding.
msg196488 - (view) Author: Michele Orrù (maker) * Date: 2013-08-29 20:31
> Is that you actual use case? That you *want* to store binary data in XML, instead of getting it properly rejected as non well-formed content?
No, Stefan.

What I was saying in my last message was  just "you're right, the user shall always use repr() when printing an xml tree" (msg196313) because "xml does *not* guarantee to have only printable chars by itself" (msg196368, msg196379).

As an advice I hope you do not take as insult, saying
"in section {section} the spec says {argument}" 
is much more constructive than 
"read the spec on that", "{extremely_obvious_link}",
at least to people not familiar with the spec and asking for the source of your arguments (msg196360). Can shorten threads, too.
msg196510 - (view) Author: Stefan Behnel (scoder) * Date: 2013-08-30 04:27
> As an advice I hope you do not take as insult, saying
> "in section {section} the spec says {argument}" 
> is much more constructive than 
> "read the spec on that", "{extremely_obvious_link}",
> at least to people not familiar with the spec and asking for the source > of your arguments (msg196360). Can shorten threads, too.

No harm done. The reason why I just posted the spec URL is that it's actually the entire spec that backs the argument. XML is (essentially) specified as a mapping from a sequence of bytes to a hierarchical structure (and back again). That's why there is an XML declaration header that names the encoding, for example. It wouldn't be needed if XML was defined as Unicode data.
msg196756 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-09-01 23:50
Can this be transformed into a new issue that succinctly summarizes what the new requested feature is, and why it's useful?
msg196797 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-09-02 17:54
Isn't this a duplicate of issue5166?
msg196808 - (view) Author: Eli Bendersky (eli.bendersky) * (Python committer) Date: 2013-09-02 21:19
As Serhiy points out, this is a duplicate of #5166
History
Date User Action Args
2013-09-02 21:19:38eli.benderskysetstatus: open -> closed
resolution: duplicate
stage: needs patch -> resolved
2013-09-02 21:19:25eli.benderskysetsuperseder: ElementTree and minidom don't prevent creation of not well-formed XML
messages: + msg196808
2013-09-02 17:54:40serhiy.storchakasetmessages: + msg196797
2013-09-01 23:50:25eli.benderskysetmessages: + msg196756
2013-08-30 04:27:16scodersetmessages: + msg196510
2013-08-29 20:31:03makersetmessages: + msg196488
2013-08-29 03:25:01scodersetmessages: + msg196441
2013-08-28 21:24:42pitrousetnosy: + christian.heimes
2013-08-28 12:56:08makersetmessages: + msg196379
2013-08-28 10:54:18makersetmessages: + msg196373
2013-08-28 10:26:03scodersetmessages: + msg196369
2013-08-28 10:08:25makersetmessages: + msg196368
2013-08-28 10:05:46makersetmessages: + msg196367
2013-08-28 10:01:36makersetmessages: + msg196366
2013-08-28 09:48:10scodersetmessages: + msg196365
2013-08-28 09:45:06makersetmessages: + msg196364
2013-08-28 09:37:27makersetmessages: + msg196362
2013-08-28 09:32:39scodersettype: security -> enhancement
messages: + msg196361
versions: - Python 2.6, Python 2.7, Python 3.2, Python 3.3
2013-08-28 09:21:57makersetmessages: + msg196360
2013-08-27 23:19:59mmokrejssetnosy: + mmokrejs
messages: + msg196328
2013-08-27 19:55:39scodersetmessages: + msg196315
2013-08-27 19:48:19scodersetmessages: + msg196313
2013-08-27 15:04:22makersetmessages: + msg196298
2013-08-27 14:26:31scodersetmessages: + msg196296
2013-08-27 14:20:04serhiy.storchakasetmessages: + msg196294
2013-08-27 13:43:09r.david.murraysettype: behavior -> security
stage: needs patch
messages: + msg196288
versions: + Python 2.6, Python 3.2
2013-08-27 13:25:23makersetfiles: + inject.py

messages: + msg196286
2013-08-27 12:56:12r.david.murraysetnosy: + r.david.murray
messages: + msg196284
2013-08-27 12:29:41scodersetmessages: + msg196280
2013-08-27 12:20:14makersetmessages: + msg196278
2013-08-27 12:16:01scodersetmessages: + msg196277
2013-08-27 12:01:55serhiy.storchakasetnosy: + scoder, eli.bendersky, serhiy.storchaka

versions: + Python 2.7, Python 3.3, Python 3.4
2013-08-27 09:52:02makersettype: behavior
components: + Library (Lib), XML
2013-08-27 09:50:51makercreate