Author gregory.p.smith
Recipients Mike.Lissner, gregory.p.smith, miss-islington, orsenthil, xtreak
Date 2021-05-01.17:26:20
Message-id <1619889980.34.0.454198883162.issue43882@roundup.psfhosted.org>
Content
I think there's still a flaw in the fixes implemented in 3.10 and 3.9 so far.  We're closer, but probably not quite good enough yet.

Why?  We aren't stripping the newline and tab characters early enough.

I think we need to do the stripping *right after* the _coerce_args(url, ...) call at the start of the function.

Otherwise we
  (1) store url variants containing the bad characters in _parse_cache [a mere slowdown in the worst case, as it would just overflow the cache sooner]
  (2) split the scheme off the URL before stripping.  In 3.9+ there is a check for valid scheme characters, which falls back to the default scheme when an invalid character is found.  The WHATWG basic URL parser strips these characters before any parts are split off, though, so 'ht\rtps', for example, would wind up as 'https' rather than our behavior so far of deferring to the default scheme.
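
To illustrate point (2), here is a minimal sketch of how the order of stripping and scheme splitting changes the result.  It is not the actual urllib.parse code: UNSAFE_CHARS, strip_unsafe, split_scheme, split_strip_late and split_strip_early are hypothetical names, and the scheme check is a simplified stand-in for the real one.

```python
import string

# Assumption for this sketch: the characters to remove are tab, CR and LF.
UNSAFE_CHARS = ("\t", "\r", "\n")
SCHEME_CHARS = set(string.ascii_letters + string.digits + "+-.")


def strip_unsafe(url):
    """Remove tab and newline characters from anywhere in the string."""
    for ch in UNSAFE_CHARS:
        url = url.replace(ch, "")
    return url


def split_scheme(url, default=""):
    """Return (scheme, rest); fall back to `default` if the scheme is invalid."""
    i = url.find(":")
    if i > 0 and url[0].isalpha() and all(c in SCHEME_CHARS for c in url[:i]):
        return url[:i].lower(), url[i + 1:]
    return default, url


def split_strip_late(url, default=""):
    # Scheme is split off first, so '\r' makes the scheme check fail.
    scheme, rest = split_scheme(url, default)
    return scheme, strip_unsafe(rest)


def split_strip_early(url, default=""):
    # Stripping happens before anything is split off, as in the WHATWG parser.
    return split_scheme(strip_unsafe(url), default)


print(split_strip_late("ht\rtps://example.com/"))   # scheme falls back to the default
print(split_strip_early("ht\rtps://example.com/"))  # scheme is recognized as 'https'
```

With late stripping the scheme check sees the '\r' and defers to the default scheme; with early stripping the same input parses as an ordinary 'https' URL.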

I noticed this while reviewing the pending 3.8 PR: the structure of the code there made the problem more obvious, and it would have allowed the characters through into the query and fragment in some cases.  https://github.com/python/cpython/pull/25726#pullrequestreview-649803605