ElementTree and minidom don't prevent creation of not well-formed XML #49416
Comments
ElementTree and minidom allow creation of not well-formed XML, that
>>> from xml.etree import ElementTree
>>> element = ElementTree.Element('element')
>>> element.text = u'\0'
>>> xml = ElementTree.tostring(element, encoding='utf-8')
>>> ElementTree.fromstring(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1,
column 9
>>> from xml.dom import minidom
>>> doc = minidom.getDOMImplementation().createDocument(None, None, None)
>>> element = doc.createElement('element')
>>> element.appendChild(doc.createTextNode(u'\0'))
<DOM Text node "">
>>> doc.appendChild(element)
<DOM Element: element at 0xb7ca688c>
>>> xml = doc.toxml(encoding='utf-8')
>>> minidom.parseString(xml)
[...]
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, colum
I believe they should raise some exception when there are characters |
For ET, that's very much on purpose. Validating data provided by every |
Every blog engine I've ever seen so far pass through comments from |
I'm also of the opinion that this would be a valuable feature to have. I |
Here is a regexp I use to clean up text (note, that I don't touch # http://www.w3.org/TR/REC-xml/#NT-Char |
What about this example?
>>> from xml.dom import minidom
>>> doc = minidom.Document()
>>> el = doc.createElement("Test")
>>> el.setAttribute("with space", "False")
>>> doc.appendChild(el)
<DOM Element: Test at 0xba1440>
>>>
>>> #nahhh
... minidom.parseString(doc.toxml())
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "C:\Python26\lib\xml\dom\minidom.py", line 1928, in parseString
return expatbuilder.parseString(string)
File "C:\Python26\lib\xml\dom\expatbuilder.py", line 940, in parseString
return builder.parseString(string)
File "C:\Python26\lib\xml\dom\expatbuilder.py", line 223, in parseString
parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 1, column 33
Is it worth making another bug report? |
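One way to catch both this case and the NUL-character case from the original report is to re-parse the serialised output before using it. The following is only a minimal sketch for Python 3; `is_well_formed` is a hypothetical helper name, not something the standard library provides.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(xml_text):
    """Return True if expat can parse xml_text, False otherwise."""
    try:
        minidom.parseString(xml_text)
    except ExpatError:
        return False
    return True


# minidom accepts an attribute name containing a space without complaint.
doc = minidom.Document()
el = doc.createElement("Test")
el.setAttribute("with space", "False")
doc.appendChild(el)

# The serialised document is not well-formed, so re-parsing it fails.
print(is_well_formed(doc.toxml()))
# A well-formed document parses normally.
print(is_well_formed("<Test ok='1'/>"))
```

The check costs a full extra parse, so it is probably better suited to tests or to output that is written rarely than to hot paths.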
In msg89685 it's stated that this behaviour is deliberate for ET. Could somebody please comment on the minidom aspects. |
bpo-12129 is open about this sort of problem with xml.dom (which would also apply to minidom I think). If someone wants to suggest a clarification for the ElementTree documentation, that might work. But I tend to agree about not bogging down the implementation. |
Hi, it's been a few years now since this was reported and it's still a problem. Any chance of a fix for this? The API gives the impression that if you pass Python strings to the XML API then the library will generate valid XML. It takes care of the charset/encoding and entity escaping aspects of XML generation, so it would be logical for it to take care of control characters in some way too, especially as silently generating unparseable XML is a somewhat dangerous failure mode.
I think there's a strong case for some built-in functionality to replace/ignore the control characters (perhaps as a configurable option, in case of performance worries) rather than just throwing an exception, since it's very common to have an arbitrary string generated by some other program or user input that needs to be written into an XML file (and a lot less common to be 100% sure in all cases what characters your string might contain). For those common use cases, the current situation where every Python developer needs to implement their own workaround to sanitize strings isn't ideal, especially as it's not trivial to get it right and likely a lot of the community who end up 'rolling their own' are getting it wrong in some way.
[On the other hand, if you guys decide this really isn't going to be fixed, then at the very least I'd suggest that the API documentation should prominently state that it is up to the users of these libraries to implement their own sanitization of control characters, since I'm sure none of us want people using Python to end up with buggy applications.] |
To help anyone else struggling with this bug, based on https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/ the best workaround I've currently found is to define this:
    import re
    def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
        return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', replaceWith, unicodeString)
and then copy+paste the following pattern into every bit of code that generates XML:
    myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8'))
It's obviously pretty grim (and unsafe) to expect every Python developer to copy+paste this kind of thing into their own project to avoid buggy XML generation, so it would be better to have the escape_xml_illegal_chars function in the Python standard library (maybe alongside xml.sax.saxutils.escape, which notably does _not_ escape all the Unicode characters that aren't valid XML), and built-in support for this as part of document.toxml. I guess we'd want it to be user-configurable for any users who are prepared to tolerate the possibility that unparseable XML documents will be generated in return for improved performance in the common case where these characters are not present, but not having the capability at all just means most Python applications that generate XML without special-casing this have a bug. I suggest we definitely need some clear warnings about this in the docs. |
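For readers on Python 3, the same idea can be sketched as below. This is an illustration under assumptions, not the commenter's exact code: `replace_xml_illegal_chars` is a hypothetical name, and the text is cleaned before it enters the tree rather than after serialisation, so markup produced by the serialiser is never touched.

```python
import re
from xml.etree import ElementTree

# Code points in the Basic Multilingual Plane that the XML 1.0 Char
# production excludes: C0 controls other than tab, LF and CR,
# surrogates, and U+FFFE / U+FFFF.
_ILLEGAL_XML_CHARS = re.compile(
    "[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def replace_xml_illegal_chars(text, replacement="?"):
    """Return text with every XML-illegal character replaced."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)


element = ElementTree.Element("element")
# The input contains an ANSI escape (\x1b) and a NUL character.
element.text = replace_xml_illegal_chars("red \x1b[31mtext\x00")

xml = ElementTree.tostring(element, encoding="utf-8")
parsed = ElementTree.fromstring(xml)  # parses, because the text was cleaned
print(parsed.text)
```

Replacing characters loses information, so this only suits cases where a lossy but parseable document is acceptable.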
This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.
I think that's a fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null-characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases by themselves where they can occur.
Note that simply replacing invalid characters by the replacement character is not a good solution, at least not in the general case, since it silently corrupts data. It's probably a better solution for users to make their code scream out loudly when it has to deal with data that it cannot serialise in the end, and to do that early on input (where it's easy to debug) rather than late on serialisation where it might be difficult to understand how the data became what it is. Trying to serialise a null-character seems only a symptom of a more important problem somewhere else in the processing pipeline.
In the end, users who *really* care about correct output should run some kind of schema validation over it *after* serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)
So, if someone finds a way to augment the text escaping procedure with a bit of character validation without making it slower (especially for the extremely common very short strings), then I think we can reconsider this as an enhancement.
Until then, and seeing that no-one has come up with a patch in the last 10 years, I'll close this as "won't fix". |
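The "scream out loudly, early on input" approach described above could look roughly like this in user code. It is a sketch only: `require_xml_chars` is a hypothetical helper, and the character class is the XML 1.0 exclusion set for the Basic Multilingual Plane used elsewhere in this thread.

```python
import re
from xml.etree import ElementTree

_ILLEGAL_XML_CHARS = re.compile(
    "[\x00-\x08\x0b\x0c\x0e-\x1f\ud800-\udfff\ufffe\uffff]"
)


def require_xml_chars(text):
    """Return text unchanged, or raise ValueError at the first illegal character."""
    match = _ILLEGAL_XML_CHARS.search(text)
    if match is not None:
        raise ValueError(
            "character %r at index %d is not allowed in XML 1.0"
            % (match.group(), match.start())
        )
    return text


element = ElementTree.Element("element")
element.text = require_xml_chars("plain text")  # passes through unchanged

try:
    # Fails here, at the point of input, instead of at parse time later.
    element.text = require_xml_chars("bad\x00data")
except ValueError as exc:
    print(exc)
```

Because the check runs where the data enters the tree, the error points at the offending input rather than at a serialised document whose origin may be hard to trace.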
We need to remove illegal XML characters because ElementTree doesn't. The characters are replaced by the replacement character ("�"). This will mean CircleCI will be able to parse test output XML files that contain ANSI control codes for whatever reason. Link: python/cpython#49416