ElementTree and minidom don't prevent creation of not well-formed XML #49416

ods · 2009-02-06T11:13:44Z

BPO	5166
Nosy	@ods, @scoder, @jwilk, @florentx, @vadmium, @ben-spiller

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2019-04-27.11:39:42.662>
created_at = <Date 2009-02-06.11:13:43.638>
labels = ['expert-XML', '3.8', 'type-feature', 'library', '3.7']
title = "ElementTree and minidom don't prevent creation of not well-formed XML"
updated_at = <Date 2019-04-27.11:39:42.643>
user = 'https://github.com/ods'

bugs.python.org fields:

activity = <Date 2019-04-27.11:39:42.643>
actor = 'scoder'
assignee = 'none'
closed = True
closed_date = <Date 2019-04-27.11:39:42.662>
closer = 'scoder'
components = ['Library (Lib)', 'XML']
creation = <Date 2009-02-06.11:13:43.638>
creator = 'ods'
dependencies = []
files = []
hgrepos = []
issue_num = 5166
keywords = []
message_count = 11.0
messages = ['81259', '89685', '89699', '95684', '95689', '101158', '111603', '258343', '324922', '328040', '340981']
nosy_count = 12.0
nosy_names = ['effbot', 'ods', 'scoder', 'strangefeatures', 'jwilk', 'eli.bendersky', 'flox', 'nvetoshkin', 'santoso.wijaya', 'martin.panter', 'benspiller', 'Ben Spiller']
pr_nums = []
priority = 'normal'
resolution = 'wont fix'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue5166'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8']

ods · 2009-02-06T11:13:43Z

ElementTree and minidom allow the creation of XML that is not well-formed
and can't be parsed:

>>> from xml.etree import ElementTree
>>> element = ElementTree.Element('element')
>>> element.text = u'\0'
>>> xml = ElementTree.tostring(element, encoding='utf-8')
>>> ElementTree.fromstring(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1,
column 9

>>> from xml.dom import minidom
>>> doc = minidom.getDOMImplementation().createDocument(None, None, None)
>>> element = doc.createElement('element')
>>> element.appendChild(doc.createTextNode(u'\0'))
<DOM Text node "">
>>> doc.appendChild(element)
<DOM Element: element at 0xb7ca688c>
>>> xml = doc.toxml(encoding='utf-8')
>>> minidom.parseString(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, colum

I believe they should raise an exception when characters that are not
allowed in XML (http://www.w3.org/TR/REC-xml/#NT-Char) are used in
attribute values, text nodes and CDATA sections.

effbot · 2009-06-24T21:53:38Z

For ET, that's very much on purpose. Validating data provided by every
single application would kill performance for all of them, even if only a
small minority would ever try to serialize data that cannot be represented
in XML.

ods · 2009-06-25T07:33:32Z

Every blog engine I've ever seen so far passes comments from untrusted
users through to RSS/Atom feeds without proper validation, causing
broken XML in feeds. Sure, this is a bug in the web applications, but DOM
manipulation packages should prevent the creation of broken XML to help
detect errors earlier.

strangefeatures · 2009-11-24T16:09:12Z

I'm also of the opinion that this would be a valuable feature to have. I
think it's a reasonable expectation that an XML library produces valid
XML. It's particularly strange that ET would output XML that it can't
itself read. Surely the job of making the input valid falls on the XML
creator - that's the point of using libraries in the first place, to
abstract away from details like not being able to use most control
characters in the 0-31 range, in the same way that ampersands etc. are
auto-escaped.

Granted, it's not as clear-cut here since the low-range ASCII characters
are likely to be less frequent and the strategy to handle them is less
clear. I think the sanest behaviour would be to raise an exception by
default, although a user-configurable option to replace or omit the
characters would also make sense. If impacting performance is a concern,
maybe it would make sense for the check to be off by default, but I would
have thought that the single regex that could perform the check would have
relatively minimal impact - and it seems to be an acceptable overhead on
the parsing side, so why not on generation?
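
As a rough illustration of the three behaviours proposed above (raise, replace, omit), here is a minimal Python 3 sketch. The helper name `check_xml_text` and its `errors` parameter are hypothetical and are not part of ElementTree or minidom; the character class follows the XML 1.0 Char production.

```python
import re

# Matches any character outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0a\x0d\x20-\ud7ff\ue000-\ufffd\U00010000-\U0010ffff]"
)


def check_xml_text(text, errors="strict", replacement="\ufffd"):
    """Hypothetical helper: validate or clean text before giving it to an XML API."""
    if errors == "strict":
        # Fail loudly on the first character that XML 1.0 cannot represent.
        match = _ILLEGAL_XML_CHARS.search(text)
        if match is not None:
            raise ValueError(
                "character %r at index %d is not allowed in XML 1.0"
                % (match.group(), match.start())
            )
        return text
    if errors == "replace":
        return _ILLEGAL_XML_CHARS.sub(replacement, text)
    if errors == "ignore":
        return _ILLEGAL_XML_CHARS.sub("", text)
    raise ValueError("unknown error mode: %r" % (errors,))
```

A caller would apply it at the point where untrusted text enters the tree, for example `element.text = check_xml_text(user_input)`.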

ods · 2009-11-24T17:26:33Z

Here is a regexp I use to clean up text (note that I don't touch
"compatibility characters" that are also not recommended in XML; some
other developers remove them too):

# Python 2 code (uses unichr and ur'' literals).
import re
import sys

# http://www.w3.org/TR/REC-xml/#NT-Char
# Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
#          [#x10000-#x10FFFF]
# (any Unicode character, excluding the surrogate blocks, FFFE, and FFFF)
_char_tail = u''
if sys.maxunicode > 0x10000:
    # Wide build: also allow the supplementary planes.
    _char_tail = u'%s-%s' % (unichr(0x10000),
                             unichr(min(sys.maxunicode, 0x10FFFF)))
_nontext_sub = re.compile(
    ur'[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD%s]' % _char_tail,
    re.U).sub

def replace_nontext(text, replacement=u'\uFFFD'):
    # Replace every character that is not a legal XML 1.0 Char.
    return _nontext_sub(replacement, text)

nvetoshkin · 2010-03-16T08:10:25Z

What about this example?
>>> from xml.dom import minidom
>>> doc = minidom.Document()
>>> el = doc.createElement("Test")
>>> el.setAttribute("with space", "False")
>>> doc.appendChild(el)
<DOM Element: Test at 0xba1440>
>>>
>>> #nahhh
... minidom.parseString(doc.toxml())
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\Python26\lib\xml\dom\minidom.py", line 1928, in parseString
    return expatbuilder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 940, in parseString
    return builder.parseString(string)
  File "C:\Python26\lib\xml\dom\expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 33


Is it worth making another bug report?
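
For the attribute-name case above, a caller-side check could look like the following sketch. The wrapper `set_attribute_checked` is hypothetical, and the pattern is a deliberately conservative ASCII-only subset of the XML Name production (an assumption made for brevity; real XML names allow many more characters).

```python
import re
from xml.dom import minidom

# Conservative ASCII-only approximation of an XML name (assumption for
# this sketch): a letter or underscore, then letters, digits, '_', '.', '-'.
_SIMPLE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def set_attribute_checked(element, name, value):
    """Hypothetical wrapper: refuse attribute names that cannot be serialised."""
    if _SIMPLE_NAME.fullmatch(name) is None:
        raise ValueError("not a usable XML attribute name: %r" % (name,))
    element.setAttribute(name, value)


doc = minidom.Document()
el = doc.createElement("Test")
doc.appendChild(el)
set_attribute_checked(el, "with_underscore", "False")  # accepted
try:
    set_attribute_checked(el, "with space", "False")  # rejected before it reaches the DOM
except ValueError as exc:
    print(exc)
```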

BreamoreBoy · 2010-07-26T12:01:14Z

In msg89685 it's stated that this behaviour is deliberate for ET. Could somebody please comment on the minidom aspects?

vadmium · 2016-01-16T00:44:54Z

bpo-12129 is open about this sort of problem with xml.dom (which would also apply to minidom I think).

If someone wants to suggest a clarification for the Element Tree documentation, that might work. But I tend to agree about not bogging down the implementation.

ben-spiller · 2018-09-10T12:36:41Z

Hi, it's been a few years since this was reported and it's still a problem. Any chance of a fix? The API gives the impression that if you pass Python strings to the XML API then the library will generate valid XML. It takes care of the charset/encoding and entity escaping aspects of XML generation, so it would be logical for it to take care of control characters in some way too, especially as silently generating unparseable XML is a somewhat dangerous failure mode.

I think there's a strong case for some built-in functionality to replace or ignore the control characters (perhaps as a configurable option, in case of performance worries) rather than just throwing an exception, since it's very common to have an arbitrary string generated by some other program or user input that needs to be written into an XML file (and a lot less common to be 100% sure in all cases what characters your string might contain). For those common use cases, the current situation where every Python developer needs to implement their own workaround to sanitize strings isn't ideal, especially as it's not trivial to get it right, and a lot of the people who end up 'rolling their own' are likely getting it wrong in some way.

[On the other hand, if you guys decide this really isn't going to be fixed, then at the very least I'd suggest that the API documentation should prominently state that it is up to the users of these libraries to implement their own sanitization of control characters, since I'm sure none of us want people using Python to end up with buggy applications]

ben-spiller · 2018-10-19T11:28:13Z

To help anyone else struggling with this bug, based on https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/ the best workaround I've currently found is to define this:

import re

def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
    # Replace control characters, surrogates, U+FFFE and U+FFFF.
    return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', replaceWith, unicodeString)

and then copy+paste the following pattern into every bit of code that generates XML:

myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8'))

It's obviously pretty grim (and unsafe) to expect every Python developer to copy+paste this kind of thing into their own project to avoid buggy XML generation, so it would be better to have the escape_xml_illegal_chars function in the Python standard library (maybe alongside xml.sax.saxutils.escape, which notably does _not_ escape all the Unicode characters that aren't valid XML), and built-in support for this as part of document.toxml.

I guess we'd want it to be user-configurable, for any users who are prepared to tolerate the possibility that unparseable XML documents will be generated in return for improved performance in the common case where these characters are not present. But not having the capability at all just means that most Python applications that generate XML without special-casing this have a bug. I suggest we definitely need some clear warnings about this in the docs.
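
The remark above about the standard-library escaping function (importable as `xml.sax.saxutils.escape`) can be checked with a short example: markup characters are turned into entity references, while a NUL character passes through untouched. The sample string is made up for illustration.

```python
from xml.sax.saxutils import escape

text = "a < b \x00 & c"
escaped = escape(text)

# '<' and '&' are replaced by entity references...
print(repr(escaped))
# ...but the NUL character is still present in the result.
print("\x00" in escaped)
```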

scoder · 2019-04-27T11:39:43Z

This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.

I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.

Note that simply replacing invalid characters by the replacement character is not a good solution, at least not in the general case, since it silently corrupts data. It's probably a better solution for users to make their code scream out loudly when it has to deal with data that it cannot serialise in the end, and to do that early on input (where it's easy to debug) rather than late on serialisation, where it might be difficult to understand how the data became what it is. Trying to serialise a null-character seems only a symptom of a more important problem somewhere else in the processing pipeline.

In the end, users who *really* care about correct output should run some kind of schema validation over it *after* serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)

So, if someone finds a way to augment the text escaping procedure with a bit of character validation without making it slower (especially for the extremely common very short strings), then I think we can reconsider this as an enhancement. Until then, and seeing that no-one has come up with a patch in the last 10 years, I'll close this as "won't fix".
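
A much weaker stand-in for the post-serialisation check suggested above is to parse the serialised bytes back and fail if they are not well-formed. This is only a sketch: the helper name `serialise_checked` is hypothetical, and a round trip confirms well-formedness only, not conformance to any schema.

```python
from xml.etree import ElementTree


def serialise_checked(element, encoding="utf-8"):
    """Hypothetical helper: serialise, then re-parse to confirm well-formedness."""
    data = ElementTree.tostring(element, encoding=encoding)
    try:
        ElementTree.fromstring(data)
    except ElementTree.ParseError as exc:
        raise ValueError("serialised XML is not well-formed: %s" % exc) from exc
    return data


good = ElementTree.Element("element")
good.text = "plain text"
serialise_checked(good)  # returns the serialised bytes

bad = ElementTree.Element("element")
bad.text = "\x00"
try:
    serialise_checked(bad)
except ValueError as exc:
    print(exc)
```

The extra parse costs time on every call, which is the performance trade-off discussed throughout this thread, so it fits better in tests or at trust boundaries than in a hot path.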

We need to remove illegal XML characters because ElementTree doesn't. The characters are replaced by the replacement character ("�"). This will mean CircleCI will be able to parse test output XML files that contain ANSI control codes for whatever reason. Link: python/cpython#49416

ods mannequin added the type-bug and stdlib labels Feb 6, 2009

birkenfeld assigned effbot Feb 6, 2009

florentx mannequin added the type-feature label and removed the type-bug label Feb 16, 2010

florentx mannequin added the topic-XML label Jul 21, 2012

florentx mannequin unassigned effbot Jul 21, 2012

ben-spiller mannequin added the 3.7 (EOL) label Sep 10, 2018

scoder added the 3.8 label Apr 27, 2019

scoder closed this as completed Apr 27, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

malloryavvir mentioned this issue Aug 7, 2023

Sanitize strings before passing into ElementTree Avvir/pyne#28

Merged
