Message 340981 - Python tracker


Author	scoder
Recipients	Ben Spiller, benspiller, effbot, eli.bendersky, flox, jwilk, martin.panter, nvetoshkin, ods, santoso.wijaya, scoder, strangefeatures
Date	2019-04-27.11:39:42
Message-id	<1556365182.67.0.0753404885671.issue5166@roundup.psfhosted.org>
In-reply-to

Content

This is a tricky decision. lxml, for example, validates user input, but that's because it has to process it anyway and does it along the way directly on input (and very efficiently in C code). ET, on the other hand, is rather lenient about what it allows users to do and doesn't apply much processing to user input. It even allows invalid trees during processing and only expects the tree to be serialisable when requested to serialise it.
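
To make the leniency concrete, here is a minimal sketch using only the standard library. The element name `record` and the sample text are made up for illustration. It shows ET accepting a null character in text, serialising it without complaint, and the result then being rejected by the parser on the way back in.

```python
import xml.etree.ElementTree as ET

# Hypothetical element; ET does not inspect the text on assignment.
root = ET.Element("record")
root.text = "bad\x00value"

# Serialisation only escapes &, < and >, so the NUL passes straight through.
data = ET.tostring(root)

# The serialised bytes are not well-formed XML, so parsing them back fails.
try:
    ET.fromstring(data)
    reparsed = True
except ET.ParseError:
    reparsed = False

print(data, reparsed)
```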

I think that's fair behaviour, because most user input will be ok and shouldn't need to suffer the performance penalty of validating all input. Null characters are a very rare thing to find in text, for example, and I think it's reasonable to let users handle the few cases where they can occur by themselves.

Note that simply replacing invalid characters with the replacement character is not a good solution, at least not in the general case, since it silently corrupts data. It's probably a better solution for users to make their code scream out loudly when it has to deal with data that it cannot serialise in the end, and to do that early on input (where it's easy to debug) rather than late on serialisation, where it might be difficult to understand how the data became what it is. Trying to serialise a null character seems to be only a symptom of a more important problem somewhere else in the processing pipeline.
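
One way users could do that early check themselves is sketched below. This is not part of ET: the helper name `checked_text` and the pattern name are hypothetical, and the character ranges are taken from the `Char` production of XML 1.0. The helper raises as soon as bad data enters the pipeline instead of replacing anything.

```python
import re
import xml.etree.ElementTree as ET

# Matches any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def checked_text(value):
    """Return value unchanged, or raise ValueError if XML 1.0 cannot carry it.

    Hypothetical user-side helper, not an ElementTree API.
    """
    match = _INVALID_XML_CHAR.search(value)
    if match is not None:
        raise ValueError(
            "character %r at index %d is not allowed in XML 1.0"
            % (match.group(), match.start())
        )
    return value


# Usage: validate where the data comes in, not when the tree is written out.
elem = ET.Element("record")
elem.text = checked_text("plain text\twith a tab")

try:
    elem.text = checked_text("bad\x00value")
except ValueError as exc:
    print("rejected on input:", exc)
```

Only the input boundary pays for the check; tree manipulation and serialisation stay as fast as before.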

In the end, users who *really* care about correct output should run some kind of schema validation over it *after* serialisation, as that would detect not only data issues but also structural and logical issues (such as a missing or empty attribute), specifically for their target data format. In some cases, it might even detect random data corruption due to old non-ECC RAM in the server machine. :)

So, if someone finds a way to augment the text escaping procedure with a bit of character validation without making it slower (especially for the extremely common very short strings), then I think we can reconsider this as an enhancement. Until then, and seeing that no-one has come up with a patch in the last 10 years, I'll close this as "won't fix".

History
Date	User	Action	Args
2019-04-27 11:39:42	scoder	set	recipients: + scoder, effbot, ods, strangefeatures, jwilk, eli.bendersky, flox, nvetoshkin, santoso.wijaya, martin.panter, benspiller, Ben Spiller
2019-04-27 11:39:42	scoder	set	messageid: <1556365182.67.0.0753404885671.issue5166@roundup.psfhosted.org>
2019-04-27 11:39:42	scoder	link	issue5166 messages
2019-04-27 11:39:42	scoder	create