
Author gregory.p.smith
Recipients Mike.Lissner, apollo13, gregory.p.smith, lukasz.langa, mgorny, miss-islington, ned.deily, odd_bloke, orsenthil, sethmlarson, xtreak
Date 2021-05-07.19:15:24
Message-id <1620414924.41.0.0860851395031.issue43882@roundup.psfhosted.org>
Content
There is no less intrusive fix as far as I can see.  I believe we're down to either sticking with what we've done or doing nothing.  It doesn't have to be the same choice in all release branches; being more conservative with changes in older stable branches is okay.  (i.e., removing this from 3.6 and 3.7 seems fine even if more recent branches do otherwise)

Based on my testing, raising an exception is more intrusive to existing tests (which we can only ever hope are representative of real code) than stripping, at least as exposed by running the changes through many tens of thousands of unittest suites at work.

For example: if we raise an exception, pandas.read_json() starts failing, because it winds up using urlsplit to extract the scheme and compare it against known values as its way of deciding whether its input should be treated as a URL pointing to data rather than the data itself.  Pandas would need to be fixed.

That urlsplit() usage pattern is repeated in various other pieces of code: urlsplit is not expected to raise an exception, and the caller then has a conditional or two testing parts of the urlsplit result to guess whether the value should be considered a URL.  Based on code inspection, pandas included, such code pretty much always goes on to pass the original URL value off to some other library, be it urllib, requests, or something else.
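
To make the pattern concrete, here is a rough sketch of the kind of caller code I mean (the helper name and the scheme list are illustrative only, not taken from pandas or any particular project):

from urllib.parse import urlsplit

def looks_like_url(value):
    # Typical (fragile) pattern seen in the wild: urlsplit() is assumed
    # never to raise, and the scheme alone decides whether the value is
    # treated as a URL rather than literal data.
    try:
        parts = urlsplit(value)
    except ValueError:
        # Callers written against the historical behavior rarely have this branch.
        return False
    return parts.scheme in ("http", "https", "ftp", "file")

If the check passes, the original string - not the parsed result - is what typically gets handed off to the next library.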

The consequence of that code inspection finding?  With our existing character-stripping change, new data is allowed to pass through these urlsplit uses and be considered a URL, which leads to some code sending the URL with embedded \r\n\t chars on to other APIs - a concern expressed a couple of times above.
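
A minimal illustration of that, assuming an interpreter that includes the character-stripping change:

from urllib.parse import urlsplit

url = "http://example.com/a\r\nb"  # input with embedded CR and LF

parts = urlsplit(url)
print(parts.scheme)  # 'http' - the scheme check in the pattern above still passes
print(parts.path)    # '/ab' on versions with the stripping change
print(repr(url))     # the original string, \r\n included, is what gets forwarded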

Even though urlsplit isn't itself a validation API, it gets used as an early step in people's custom identification and validation attempts.  So *any* change we make to it at all, in any way, breaks someone's expectations, even if they probably shouldn't have had those expectations and aren't doing wise validation.

Treat this analysis as a sign that we should provide an explicit URL validator, because almost everyone is getting this wrong in some form. (issue43883)
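
As a rough sketch of what "explicit" could mean here (purely illustrative; the name and checks are not a proposal for the actual issue43883 API):

from urllib.parse import urlsplit

def validate_url(value, allowed_schemes=("http", "https")):
    # Hypothetical validator: reject unsafe characters up front instead of
    # relying on urlsplit() to strip or tolerate them.
    if any(ch in "\r\n\t" for ch in value):
        raise ValueError("URL contains embedded CR/LF/TAB characters")
    parts = urlsplit(value)
    if parts.scheme not in allowed_schemes:
        raise ValueError("unsupported scheme: %r" % parts.scheme)
    if not parts.netloc:
        raise ValueError("URL has no host")
    return parts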

I did wonder whether Mike's suggestion in https://bugs.python.org/issue43882#msg393033 - removing the characters during processing but leaving them in the final result - is feasible as remediation for this.  My gut feeling is that it isn't.  It doesn't solve the problem of preventing the bad data from going where it shouldn't: even if we happen to parse that example differently, the unwanted characters are still retained in other places they don't belong.  Fundamentally: we require people to make a different series of API calls and choices in their end-user code to **explicitly not use unvalidated inputs**.  Our stdlib API surface can't satisfy that today, and use of unvalidated data in the wrong places is a broad software-security antipattern theme.