Author benspiller
Recipients benspiller, effbot, eli.bendersky, flox, jwilk, martin.panter, nvetoshkin, ods, santoso.wijaya, strangefeatures
Date 2018-10-19.11:28:13
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <>
To help anyone else struggling with this bug, based on the best workaround I've currently found is to define this:

def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
	return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', replaceWith, unicodeString)

and then copy+paste the following pattern into every bit of code that generates XML:


It's obviously pretty grim (and unsafe) to expect every python developer to copy+paste this kind of thing into their own project to avoid buggy XML generation, so would be better to have the escape_xml_illegal_chars function in the python standard library (maybe alongside xml.sax.utils.escape - which notably does _not_ escape all the unicode characters that aren't valid XML), and built-in support for this as part of document.toxml. 

I guess we'd want it to be user-configurable for any users who are prepared to tolerate the possibility unparseable XML documents will be generated in return for improved performance for the common case where these characters are not present, not not having the capability at all just means most python applications that do XML generate with special-casing this have a bug. I suggest we definitely need some clear warnings about this in the doc.
Date User Action Args
2018-10-19 11:28:13benspillersetrecipients: + benspiller, effbot, ods, strangefeatures, jwilk, eli.bendersky, flox, nvetoshkin, santoso.wijaya, martin.panter
2018-10-19 11:28:13benspillersetmessageid: <>
2018-10-19 11:28:13benspillerlinkissue5166 messages
2018-10-19 11:28:13benspillercreate