Author Mike.Lissner
Recipients Mike.Lissner, apollo13, gregory.p.smith, lukasz.langa, mgorny, miss-islington, orsenthil, sethmlarson, xtreak
Date 2021-05-05.18:35:05
Message-id <1620239705.8.0.964913791823.issue43882@roundup.psfhosted.org>
Content
> Instead of the patches as you see them, we could've raised an exception.

In my mind the definition of a valid URL is what browsers recognize. They're moving towards the WHATWG definition, and so too must we. 

If we make Python raise an exception when a URL has a newline in the scheme (e.g., "htt\np"), we'd be raising exceptions for *valid* URLs as browsers define them. That doesn't seem right at all to me. I'd be frustrated to have to catch such an exception, and I'd wonder how to pass through valid URLs without urlparse raising something.


> Making the output 'sanitized' means that invalid input is converted into valid output.  This goes against the principle of least surprise.

Well, not quite, right? The URLs this fixes *are* valid according to browsers. Browsers say these tabs and newlines are OK. 
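
To make the browser behavior concrete, here is a minimal sketch of the stripping step, assuming the WHATWG rule that ASCII tab and newline characters (U+0009, U+000A, U+000D) are removed from the input before parsing. The helper name `strip_tab_newline` is hypothetical and not part of urllib.

```python
# Hypothetical helper, not part of urllib: mimics the WHATWG step that
# removes ASCII tab and newline characters from a URL before parsing it.
_TAB_OR_NEWLINE = str.maketrans("", "", "\t\n\r")


def strip_tab_newline(url: str) -> str:
    """Return url with every tab, line feed, and carriage return removed."""
    return url.translate(_TAB_OR_NEWLINE)


print(repr(strip_tab_newline("htt\np://example.com/a\tb")))
# prints 'http://example.com/ab'
```

Under that rule "htt\np://..." and "http://..." name the same URL, which is why raising on the former would reject something a browser accepts.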

----

I agree though that there's an issue with the approach of stripping input in a way that affects output. That doesn't seem right. 

I think the solution I'd favor (and I imagine what's coming in 43883) is to do this properly so that newlines are preserved in the output, but so that the scheme is also placed properly in the scheme attribute. 

So instead of this (from the initial report):

> In [9]: from urllib.parse import urlsplit
> In [10]: urlsplit("java\nscript:alert('bad')")
> Out[10]: SplitResult(scheme='', netloc='', path="java\nscript:alert('bad')", query='', fragment='')

We get something like this:

> In [10]: urlsplit("java\nscript:alert('bad')")
> Out[10]: SplitResult(scheme='java\nscript', netloc='', path="alert('bad')", query='', fragment='')

In other words, keep the funky characters and parse properly.
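
A rough sketch of what I mean, not a patch: the function name `split_scheme_preserving` and the `SchemeSplit` type are made up for illustration, and it only handles the scheme, not netloc, query, or fragment. It assumes the ignorable characters are tab, line feed, and carriage return, and it validates the scheme against the RFC 3986 grammar after ignoring them, while returning the original text unchanged.

```python
import re
from typing import NamedTuple

# Assumption: the characters browsers ignore are tab, LF, and CR.
_IGNORED = str.maketrans("", "", "\t\n\r")
# RFC 3986 scheme grammar: a letter, then letters, digits, "+", "-", or ".".
_SCHEME_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*\Z")


class SchemeSplit(NamedTuple):
    scheme: str  # scheme exactly as written, ignored characters included
    rest: str    # everything after the first ":"


def split_scheme_preserving(url: str) -> SchemeSplit:
    """Split off the scheme without altering any character of the input."""
    head, sep, tail = url.partition(":")
    if sep:
        # Validate a cleaned copy, but hand back the untouched original.
        if _SCHEME_RE.match(head.translate(_IGNORED)):
            return SchemeSplit(head, tail)
    # No colon, or the text before it is not a scheme: treat it all as path.
    return SchemeSplit("", url)


result = split_scheme_preserving("java\nscript:alert('bad')")
print(result)
```

Here `result.scheme` keeps the embedded newline and `result.rest` is the part after the colon, so a caller that checks for a dangerous scheme can normalize `result.scheme` itself rather than finding an empty scheme.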
History
Date                 User          Action  Args
2021-05-05 18:35:05  Mike.Lissner  set     recipients: + Mike.Lissner, gregory.p.smith, orsenthil, lukasz.langa, mgorny, apollo13, miss-islington, xtreak, sethmlarson
2021-05-05 18:35:05  Mike.Lissner  set     messageid: <1620239705.8.0.964913791823.issue43882@roundup.psfhosted.org>
2021-05-05 18:35:05  Mike.Lissner  link    issue43882 messages
2021-05-05 18:35:05  Mike.Lissner  create