
classification
Title: ElementTree and minidom don't prevent creation of not well-formed XML
Type: enhancement
Stage: resolved
Components: Library (Lib), XML
Versions: Python 3.8, Python 3.7, Python 3.6
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: Ben Spiller, benspiller, effbot, eli.bendersky, flox, jwilk, martin.panter, nvetoshkin, ods, santoso.wijaya, scoder, strangefeatures
Priority: normal
Keywords:

Created on 2009-02-06 11:13 by ods, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (11)
msg81259 - (view) Author: Denis S. Otkidach (ods) * Date: 2009-02-06 11:13
ElementTree and minidom allow the creation of XML that is not
well-formed and can't be parsed:

>>> from xml.etree import ElementTree
>>> element = ElementTree.Element('element')
>>> element.text = u'\0'
>>> xml = ElementTree.tostring(element, encoding='utf-8')
>>> ElementTree.fromstring(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1,
column 9

>>> from xml.dom import minidom
>>> doc = minidom.getDOMImplementation().createDocument(None, None, None)
>>> element = doc.createElement('element')
>>> element.appendChild(doc.createTextNode(u'\0'))
<DOM Text node "">
>>> doc.appendChild(element)
<DOM Element: element at 0xb7ca688c>
>>> xml = doc.toxml(encoding='utf-8')
>>> minidom.parseString(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, colum

I believe they should raise some exception when characters not
allowed in XML (http://www.w3.org/TR/REC-xml/#NT-Char) are used in
attribute values, text nodes and CDATA sections.
msg89685 - (view) Author: Fredrik Lundh (effbot) * (Python committer) Date: 2009-06-24 21:53
For ET, that's very much on purpose.  Validating data provided by every 
single application would kill performance for all of them, even if only a 
small minority would ever try to serialize data that cannot be represented 
in XML.
msg89699 - (view) Author: Denis S. Otkidach (ods) * Date: 2009-06-25 07:33
Every blog engine I've seen so far passes comments from untrusted
users through to RSS/Atom feeds without proper validation, causing
broken XML in feeds. Sure, this is a bug in the web applications, but
DOM manipulation packages should prevent the creation of broken XML to
help detect errors earlier.
msg95684 - (view) Author: Andy (strangefeatures) Date: 2009-11-24 16:09
I'm also of the opinion that this would be a valuable feature to have. I
think it's a reasonable expectation that an XML library produces valid
XML. It's particularly strange that ET would output XML that it can't
itself read. Surely the job of making the input valid falls on the XML
creator - that's the point of using libraries in the first place, to
abstract away from details like not being able to use characters in the
0-32 range, in the same way that ampersands etc are auto-escaped.
Granted, it's not as clear-cut here since the low-range ASCII characters
are likely to be less frequent and the strategy to handle them is less
clear. I think the sanest behaviour would be to raise an exception by
default, although a user-configurable option to replace or omit the
characters would also make sense. If impacting performance is a concern,
maybe it would make sense to be off by default, but I would have thought
that the single regex that could perform the check would have relatively
minimal impact - and it seems to be an acceptable overhead on the
parsing side, so why not on generation?
msg95689 - (view) Author: Denis S. Otkidach (ods) * Date: 2009-11-24 17:26
Here is a regexp I use to clean up text (note, that I don't touch 
"compatibility characters" that are also not recommended in XML; some 
other developers remove them too):

import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and FFFF)
# Python 2 code: on narrow builds sys.maxunicode is 0xFFFF, so the
# supplementary-plane range is only added on wide builds.
_char_tail = u''
if sys.maxunicode > 0x10000:
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
_nontext_sub = re.compile(
    ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % _char_tail,
    re.U).sub

def replace_nontext(text, replacement=u'\uFFFD'):
    # Replace every character outside the Char production.
    return _nontext_sub(replacement, text)
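
The snippet above is written for Python 2 (unichr, ur'' literals). For readers on Python 3, here is a minimal sketch of the same character class, combined with the raise-by-default and replace/omit options suggested earlier in the thread. The name xml_safe_text and its errors parameter are hypothetical; nothing like this exists in the standard library.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def xml_safe_text(text, errors="strict", replacement="\uFFFD"):
    """Hypothetical helper: check or clean text before putting it in a tree."""
    if errors == "strict":
        match = _ILLEGAL.search(text)
        if match is not None:
            raise ValueError(
                "character %r at index %d is not allowed in XML"
                % (match.group(), match.start())
            )
        return text
    if errors == "replace":
        return _ILLEGAL.sub(replacement, text)
    if errors == "ignore":
        return _ILLEGAL.sub("", text)
    raise ValueError("unknown errors mode: %r" % (errors,))
```

The mode names mirror the errors argument of str.encode only by analogy; whether replacing or dropping characters is acceptable depends on the application, since both change the data.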
msg101158 - (view) Author: Vetoshkin Nikita (nvetoshkin) Date: 2010-03-16 08:10
What about this example?
>>> from xml.dom import minidom
>>> doc = minidom.Document()
>>> el = doc.createElement("Test")
>>> el.setAttribute("with space", "False")
>>> doc.appendChild(el)
<DOM Element: Test at 0xba1440>
>>>
>>> #nahhh
... minidom.parseString(doc.toxml())
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\xml\dom\minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 33

>>>

Is it worth making another bug report?
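
A problem like the one in this example can be caught before the document leaves the program by parsing the serialised output back. This is only a sketch for Python 3; is_well_formed is a hypothetical helper name, and the round trip costs a full parse of the document.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(doc):
    """Return True if the serialised document can be parsed again."""
    try:
        minidom.parseString(doc.toxml())
    except ExpatError:
        return False
    return True


doc = minidom.Document()
el = doc.createElement("Test")
el.setAttribute("with space", "False")  # minidom stores the name unchecked
doc.appendChild(el)
print(is_well_formed(doc))
```

This detects the invalid attribute name as well as illegal characters in text, because it relies on the parser rather than on a list of known mistakes.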
msg111603 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-26 12:01
In msg89685 it's stated that this behaviour is deliberate for ET.  Could somebody please comment on the minidom aspects.
msg258343 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-16 00:44
Issue 12129 is open about this sort of problem with xml.dom (which would also apply to minidom I think).

If someone wants to suggest a clarification for the Element Tree documentation, that might work. But I tend to agree about not bogging down the implementation.
msg324922 - (view) Author: Ben Spiller (benspiller) * Date: 2018-09-10 12:36
Hi it's been a few years now since this was reported and it's still a problem, any chance of a fix for this? The API gives the impression that if you pass python strings to the XML API then the library will generate valid XML. It takes care of the charset/encoding and entity escaping aspects of XML generation so would be logical for it to in some way take care of control characters too - especially as silently generating unparseable XML is a somewhat dangerous failure mode. 

I think there's a strong case for some built-in functionality to replace/ignore the control characters (perhaps as a configurable option, in case of performance worries) rather than just throwing an exception, since it's very common to have an arbitrary string generated by some other program or user input that needs to be written into an XML file (and a lot less common to be 100% sure in all cases what characters your string might contain). For those common use cases, the current situation where every Python developer needs to implement their own workaround to sanitize strings isn't ideal, especially as it's not trivial to get it right, and likely a lot of the community who end up 'rolling their own' are getting it wrong in some way. 

[On the other hand if you guys decide this really isn't going to be fixed, then at the very least I'd suggest that the API documentation should prominently state that it is up to the users of these libraries to implement their own sanitization of control characters, since I'm sure none of us want people using python to end up with buggy applications]
msg328040 - (view) Author: Ben Spiller (benspiller) * Date: 2018-10-19 11:28
To help anyone else struggling with this bug, based on https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/ the best workaround I've currently found is to define this:

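import re

def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
    # Replaces control characters other than tab, LF and CR, plus
    # surrogates, U+FFFE and U+FFFF.
    return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', replaceWith, unicodeString)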

and then copy+paste the following pattern into every bit of code that generates XML:

myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8'))

It's obviously pretty grim (and unsafe) to expect every Python developer to copy+paste this kind of thing into their own project to avoid buggy XML generation, so it would be better to have the escape_xml_illegal_chars function in the Python standard library (maybe alongside xml.sax.saxutils.escape, which notably does _not_ escape all the Unicode characters that aren't valid XML), and built-in support for this as part of document.toxml. 
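
For anyone on Python 3, the workaround can be written a little more simply, since toxml() returns a str when no encoding is passed. This is a sketch of the same approach under that assumption, not a recommended API; the parameter names are adapted and the sample document is made up for illustration.

```python
import re
from xml.dom import minidom


def escape_xml_illegal_chars(text, replace_with="?"):
    # Same character class as the workaround above.
    return re.sub(
        "[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]",
        replace_with,
        text,
    )


doc = minidom.getDOMImplementation().createDocument(None, "element", None)
doc.documentElement.appendChild(doc.createTextNode("a\x00b"))

xml_text = doc.toxml()                       # str, still contains the NUL
data = escape_xml_illegal_chars(xml_text).encode("utf-8")
reparsed = minidom.parseString(data)         # parses once the NUL is replaced
```

Note that this sanitises the whole serialised document, so it cannot tell the caller which node held the bad character.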

I guess we'd want it to be user-configurable for any users who are prepared to tolerate the possibility that unparseable XML documents will be generated in return for improved performance in the common case where these characters are not present, but not having the capability at all just means most Python applications that generate XML without special-casing this have a bug. I suggest we definitely need some clear warnings about this in the docs.
msg340981 - (view) Author: Stefan Behnel (scoder) * (Python committer) Date: 2019-04-27 11:39
This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.

I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.

Note that simply replacing invalid characters by the replacement character is not a good solution, at least not in the general case, since it silently corrupts data. It's probably a better solution for users to make their code scream out loudly when it has to deal with data that it cannot serialise in the end, and to do that early on input (where it's easy to debug) rather than late on serialisation where it might be difficult to understand how the data became what it is. Trying to serialise a null-character seems only a symptom of a more important problem somewhere else in the processing pipeline.

In the end, users who *really* care about correct output should run some kind of schema validation over it *after* serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)

So, if someone finds a way to augment the text escaping procedure with a bit of character validation without making it slower (especially for the extremely common very short strings), then I think we can reconsider this as an enhancement. Until then, and seeing that no-one has come up with a patch in the last 10 years, I'll close this as "won't fix".
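
Anyone who wants to explore the performance question could start from a rough micro-benchmark like the one below. It is only a sketch: escape_plain is a simplified stand-in for text escaping, not ElementTree's actual internal code, and escape_checked is one hypothetical way to add validation. Results will vary by machine and Python version.

```python
import re
import timeit

# Anything outside the XML 1.0 Char production.
_ILLEGAL = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def escape_plain(text):
    # Simplified stand-in for escaping element text.
    return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")


def escape_checked(text):
    # Same escaping, preceded by one regex scan for illegal characters.
    if _ILLEGAL.search(text) is not None:
        raise ValueError("text contains a character not allowed in XML")
    return escape_plain(text)


if __name__ == "__main__":
    # Short strings first, since those are the common case mentioned above.
    for sample in ("id", "a < b & c", "x" * 1000):
        plain = timeit.timeit(lambda: escape_plain(sample), number=10_000)
        checked = timeit.timeit(lambda: escape_checked(sample), number=10_000)
        print(f"len={len(sample):5d}  plain={plain:.4f}s  checked={checked:.4f}s")
```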
History
Date | User | Action | Args
2022-04-11 14:56:45 | admin | set | github: 49416
2019-04-27 11:39:42 | scoder | set | status: open -> closed
    dependencies: - Document Object Model API - validation
    versions: + Python 3.8, - Python 3.4, Python 3.5
    nosy: + scoder
    messages: + msg340981
    resolution: wont fix
    stage: resolved
2018-11-07 17:21:29 | Ben Spiller | set | nosy: + Ben Spiller
2018-10-19 11:28:13 | benspiller | set | messages: + msg328040
2018-09-10 12:36:41 | benspiller | set | nosy: + benspiller
    messages: + msg324922
    versions: + Python 3.5, Python 3.6, Python 3.7
2016-01-16 00:44:53 | martin.panter | set | dependencies: + Document Object Model API - validation
    messages: + msg258343
2015-03-12 19:48:58 | ned.deily | link | issue23650 superseder
2014-12-13 01:58:30 | martin.panter | set | nosy: + martin.panter
2014-02-03 17:01:35 | BreamoreBoy | set | nosy: - BreamoreBoy
2013-09-02 21:19:53 | eli.bendersky | set | nosy: + eli.bendersky
2013-09-02 21:19:25 | eli.bendersky | link | issue18850 superseder
2012-07-21 13:43:06 | flox | set | assignee: effbot ->
    components: + XML
    versions: + Python 3.4, - Python 2.7, Python 3.2
2011-04-08 18:10:16 | santoso.wijaya | set | nosy: + santoso.wijaya
2010-07-26 12:01:14 | BreamoreBoy | set | nosy: + BreamoreBoy
    messages: + msg111603
2010-03-16 08:10:25 | nvetoshkin | set | nosy: + nvetoshkin
    messages: + msg101158
2010-02-16 14:44:14 | jwilk | set | nosy: + jwilk
2010-02-16 14:02:11 | flox | set | priority: normal
    nosy: + flox
    type: behavior -> enhancement
    versions: + Python 2.7, Python 3.2, - Python 2.6, Python 2.5, Python 3.0
2010-02-16 13:50:26 | flox | link | issue7599 superseder
2009-11-24 17:26:33 | ods | set | messages: + msg95689
2009-11-24 16:09:13 | strangefeatures | set | nosy: + strangefeatures
    messages: + msg95684
2009-06-25 07:33:33 | ods | set | messages: + msg89699
2009-06-24 21:53:38 | effbot | set | messages: + msg89685
2009-02-06 21:51:10 | georg.brandl | set | assignee: effbot
    nosy: + effbot
2009-02-06 11:13:43 | ods | create |