classification
Title: ElementTree Comment text isn't escaped
Type: behavior Stage: resolved
Components: XML Versions: Python 3.8
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: Jeffrey.Kintscher, eli.bendersky, johnburnett, scoder
Priority: normal Keywords: patch

Created on 2018-04-18 00:33 by johnburnett, last changed 2019-05-12 07:38 by scoder. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 13229 closed Jeffrey.Kintscher, 2019-05-09 22:34
Messages (6)
msg315428 - (view) Author: John Burnett (johnburnett) Date: 2018-04-18 00:33
The _serialize_xml function in ElementTree.py doesn't escape Comment.text values when writing output.  This means the following code:

    import sys
    import xml.etree.ElementTree
    elem = xml.etree.ElementTree.Comment()
    elem.text = 'hi --> bye'
    tree = xml.etree.ElementTree.ElementTree(elem)
    tree.write(sys.stdout)

...will output the following invalid xml:

    <!--hi --> bye-->

In Python 3.7, changing the _serialize_xml function on line 903/904 from this:

    if tag is Comment:
        write("<!--%s-->" % text)

...to this:

    if tag is Comment:
        write("<!--%s-->" % _escape_cdata(text))

...writes something more expected:

    <!--hi --&gt; bye-->
msg340990 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-04-27 15:33
Yes, comment text should be escaped internally like all other text, not by the user. The same applies to processing instructions.

This suggests that it's probably also untested currently. Could you provide a PR for that changes both and adds tests?
msg342145 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-05-11 01:44
Which characters would you escape in a processing instruction? The XML spec says not to escape '<' and '&', but doesn't say what should be escaped. It seems to me that escaping '?' when followed by '>' should be sufficient since the character sequence "?>" is the closing delimiter.
msg342168 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-05-11 06:45
Hmm, sorry, I was wrong here. I looked it up and also checked the behaviour of other libraries: the data content of PIs is application specific and must not be escaped, at all. It's not XML character data.

Sorry for the confusion and the extra work on your side.
msg342243 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-05-12 07:05
I backed-out the Processing Instruction changes.
msg342244 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-05-12 07:38
I'm really sorry again, but I only consulted the XML spec on this now (and also the way libxml2 does it), and I found that XML comment text actually does not get escaped. It's not character data, and, in fact, "--" is not even allowed at all inside of comments. (Funny enough, the HTML serialiser does escaping for both comments and PIs, but, well, that's HTML, I guess…)

https://www.w3.org/TR/REC-xml/#sec-comments

Sorry, Jeffrey, I should have looked that up in the spec much earlier, before you invested so much time into this.

There are two disallowed cases: "--" in the text content, and "-" at the end of the text (which would lead to an "--->"). Now, the thing is, such validation is currently unprecedented in ElementTree, so I don't know if we should start raising exceptions from the serialiser for this case, and if yes, which. Since comments are rare, it won't hurt performance to do that, but once we get started on this, users would probably also want their text and attribute content and their tag and attribute names to be validated, and that would hurt then.

So, I will have to reject the PR and this ticket.
History
Date User Action Args
2019-05-12 07:38:02scodersetstatus: open -> closed
resolution: not a bug
messages: + msg342244

stage: patch review -> resolved
2019-05-12 07:05:26Jeffrey.Kintschersetmessages: + msg342243
2019-05-11 06:45:56scodersetmessages: + msg342168
2019-05-11 01:44:51Jeffrey.Kintschersetmessages: + msg342145
2019-05-09 22:34:30Jeffrey.Kintschersetkeywords: + patch
stage: needs patch -> patch review
pull_requests: + pull_request13137
2019-05-09 22:19:05Jeffrey.Kintschersetnosy: + Jeffrey.Kintscher
2019-04-27 15:33:07scodersetstage: needs patch
messages: + msg340990
versions: + Python 3.8, - Python 2.7, Python 3.6, Python 3.7
2018-04-21 03:27:48terry.reedysetversions: - Python 3.4, Python 3.5
2018-04-18 00:33:12johnburnettcreate