Message 392611 - Python tracker

Author	gregory.p.smith
Recipients	Mike.Lissner, gregory.p.smith, miss-islington, orsenthil, xtreak
Date	2021-05-01.17:26:20
Message-id	<1619889980.34.0.454198883162.issue43882@roundup.psfhosted.org>

Content

I think there's still a flaw in the fixes implemented in 3.10 and 3.9 so far.  We're closer, but probably not quite good enough yet.

Why?  We aren't stripping the newline and tab characters early enough.

I think we need to do the stripping *right after* the _coerce_args(url, ...) call at the start of the function.

Otherwise we
  (1) are storing URL variants that contain the bad characters in _parse_cache [a mere slowdown in the worst case, as it'd just overflow the cache sooner]
  (2) are splitting the scheme off the URL prior to stripping.  In 3.9+ there is a check for valid scheme characters, which falls back to the default scheme when an invalid character is found.  The WHATWG basic URL parser strips these characters before any parts are split off though, so 'ht\rtps' - for example - would wind up as 'https' rather than our behavior so far of deferring to the default scheme.

I noticed this when reviewing the pending 3.8 PR, as the structure of its code made the problem more obvious; it would've allowed these characters through into the query and fragment in some cases.  https://github.com/python/cpython/pull/25726#pullrequestreview-649803605
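
To illustrate the ordering concern in (2), here is a minimal sketch.  It is not the urllib.parse code: split_scheme, strip_unsafe, UNSAFE_CHARS and SCHEME_CHARS are hypothetical names made up for this example, and the scheme check is a simplified stand-in for the real one.  It only shows how stripping before versus after the scheme split changes the result for 'ht\rtps'.

```python
import string

# Assumption for illustration: the characters to drop are tab, CR and LF.
UNSAFE_CHARS = "\t\r\n"
# Simplified set of characters allowed in a scheme (hypothetical constant).
SCHEME_CHARS = string.ascii_letters + string.digits + "+-."


def strip_unsafe(url):
    """Remove tab, CR and LF from anywhere in the string."""
    for ch in UNSAFE_CHARS:
        url = url.replace(ch, "")
    return url


def split_scheme(url, default=""):
    """Toy scheme splitter: use the default scheme if any character
    before the first ':' is not a valid scheme character."""
    i = url.find(":")
    if i > 0 and all(c in SCHEME_CHARS for c in url[:i]):
        return url[:i].lower(), url[i + 1:]
    return default, url


def split_then_strip(url, default=""):
    """Order criticised above: the scheme is split off before stripping."""
    scheme, rest = split_scheme(url, default)
    return scheme, strip_unsafe(rest)


def strip_then_split(url, default=""):
    """Proposed order: strip first, then split off the scheme."""
    return split_scheme(strip_unsafe(url), default)


url = "ht\rtps://example.com/"
print(split_then_strip(url))  # scheme falls back to the default ('')
print(strip_then_split(url))  # scheme is recognised as 'https'
```

With the strip done first, the '\r' never reaches the scheme check, which matches the WHATWG ordering described above; it also means any cache keyed on the URL would only ever see the cleaned string.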

History
Date	User	Action	Args
2021-05-01 17:26:20	gregory.p.smith	set	recipients: + gregory.p.smith, orsenthil, Mike.Lissner, miss-islington, xtreak
2021-05-01 17:26:20	gregory.p.smith	set	messageid: <1619889980.34.0.454198883162.issue43882@roundup.psfhosted.org>
2021-05-01 17:26:20	gregory.p.smith	link	issue43882 messages
2021-05-01 17:26:20	gregory.p.smith	create