Message 328040 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	benspiller
Recipients	benspiller, effbot, eli.bendersky, flox, jwilk, martin.panter, nvetoshkin, ods, santoso.wijaya, strangefeatures
Date	2018-10-19.11:28:13
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1539948493.31.0.788709270274.issue5166@psf.upfronthosting.co.za>
In-reply-to

Content
To help anyone else struggling with this bug, based on https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/ the best workaround I've currently found is to define this: def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'): return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', replaceWith, unicodeString) and then copy+paste the following pattern into every bit of code that generates XML: myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8')) It's obviously pretty grim (and unsafe) to expect every python developer to copy+paste this kind of thing into their own project to avoid buggy XML generation, so would be better to have the escape_xml_illegal_chars function in the python standard library (maybe alongside xml.sax.utils.escape - which notably does _not_ escape all the unicode characters that aren't valid XML), and built-in support for this as part of document.toxml. I guess we'd want it to be user-configurable for any users who are prepared to tolerate the possibility unparseable XML documents will be generated in return for improved performance for the common case where these characters are not present, not not having the capability at all just means most python applications that do XML generate with special-casing this have a bug. I suggest we definitely need some clear warnings about this in the doc.

To help anyone else struggling with this bug, based on https://lsimons.wordpress.com/2011/03/17/stripping-illegal-characters-out-of-xml-in-python/ the best workaround I've currently found is to define this:

def escape_xml_illegal_chars(unicodeString, replaceWith=u'?'):
	return re.sub(u'[\x00-\x08\x0b\x0c\x0e-\x1F\uD800-\uDFFF\uFFFE\uFFFF]', replaceWith, unicodeString)

and then copy+paste the following pattern into every bit of code that generates XML:

myfile.write(escape_xml_illegal_chars(document.toxml(encoding='utf-8').decode('utf-8')).encode('utf-8'))

It's obviously pretty grim (and unsafe) to expect every python developer to copy+paste this kind of thing into their own project to avoid buggy XML generation, so would be better to have the escape_xml_illegal_chars function in the python standard library (maybe alongside xml.sax.utils.escape - which notably does _not_ escape all the unicode characters that aren't valid XML), and built-in support for this as part of document.toxml. 

I guess we'd want it to be user-configurable for any users who are prepared to tolerate the possibility unparseable XML documents will be generated in return for improved performance for the common case where these characters are not present, not not having the capability at all just means most python applications that do XML generate with special-casing this have a bug. I suggest we definitely need some clear warnings about this in the doc.

History
Date	User	Action	Args
2018-10-19 11:28:13	benspiller	set	recipients: + benspiller, effbot, ods, strangefeatures, jwilk, eli.bendersky, flox, nvetoshkin, santoso.wijaya, martin.panter
2018-10-19 11:28:13	benspiller	set	messageid: <1539948493.31.0.788709270274.issue5166@psf.upfronthosting.co.za>
2018-10-19 11:28:13	benspiller	link	issue5166 messages
2018-10-19 11:28:13	benspiller	create