This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author odd_bloke
Recipients Mike.Lissner, apollo13, gregory.p.smith, lukasz.langa, mgorny, miss-islington, odd_bloke, orsenthil, sethmlarson, xtreak
Date 2021-05-06.19:42:24
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <>
Hey folks,

Thanks for all the work on this: I really appreciate the efforts to keep Python as secure as possible!

This change _is_ causing us problems in the cloud-init codebase, which thankfully have been caught by our testing in Ubuntu's development release.  This is in a fairly deep part of the codebase, so apologies in advance for the detailed description.

TL;DR: cloud-init constructs mirror URLs and then sanitises them by replacing invalid characters with hyphens.  With the fix for this bug, urlsplit silently removes (some of) those characters before we can replace them, modifying the output of our sanitisation code, and therefore meaning cloud-init will, albeit in fairly specific corner cases, configure different mirrors if run with a Python including this fix vs. one that precedes it.

cloud-init constructs mirror URLs based on applying cloud metadata to user-configured (or default) templates.  As we're responsible for constructing these URLs, we also sanitise them before configuring the package manager to use them: specifically, we urlsplit to get the hostname, IDNA-encode (to handle non-ASCII input), replace any invalid URL characters with a "-", and then strip "-" off each part of the hostname (to handle leading/trailing invalid characters), then recombine the URL.  The most common case for this is a cloud which specifies values for the variables used in the template with an underscore: causes Apache mirrors with the default "HTTPProtocolOptions Strict" configuration to reject all requests to them (as that's an invalid hostname).  In contrast, *is* accepted, so is preferable.  (This is important because * exists so that local cloud admins can DNS "hijack" subdomains of it to point at internal servers: even though the Ubuntu mirrors don't reject underscored domains (any longer), this is a landmine waiting for any admins running their own mirrors.)  For more background, see the bug where we figured this all out:

So, more concretely: if we consider a post-templated URL of http://my\topenstack\tregion.mirror.internal/ubuntu, cloud-init changes from rewriting that to my-openstack-region.mirror.internal (on < 3.9.5) to myopenstackregion.mirror.internal (on 3.9.5+): if, in this notional deployment, an apt mirror is running at (exactly) my-openstack-region.mirror.internal, then new instance deployments will start failing: they won't be able to install packages.  This is the sort of breakage that we aim to avoid in cloud-init (because you just _know_ that everyone who deployed this cloud left NotionalCorp years ago, so fixing the configuration to remove these obviously-incorrect tabs is not necessarily trivial).

Given the current state of the fix here, it's not clear to me how we could (cleanly) achieve our desired behaviour.  We could perform replacement of these characters before invoking `urlsplit` but that would then substitute these characters outside of only the hostname: this is also a change in behaviour.  We could substitute those characters with magic strings, perform the split, and then replace them in the non-hostname parts with the original character and in the hostname with hyphens: we've obviously left "cleanly" behind at this point.  Another option would be to monkeypatch _UNSAFE_URL_BYTES_TO_REMOVE to an empty list: again, not a solution I'd want to have to support across Python versions!

One solution that presents itself to me: add a `strip_insecure_characters: bool = True` parameter.  We, in cloud-init, would pass this in as `False`, knowing that we're going to handle those ourselves.  Of course, this does leave the door open for API users to keep the current insecure behaviour: if library code (either public or project-internal) were to default to `False`, then the situation is no better than today.

For our use case, at least, I think a more restricted solution would work: `url_replacement_char: str = ""`.  We'd call `urlsplit(..., url_replacement_char="-")` and the rest of our code would work as it does today: from its POV, there were never these invalid chars in the first place.

Thanks once again for the work (and apologies for the wall of text)!

Date User Action Args
2021-05-06 19:42:25odd_blokesetrecipients: + odd_bloke, gregory.p.smith, orsenthil, lukasz.langa, mgorny, apollo13, Mike.Lissner, miss-islington, xtreak, sethmlarson
2021-05-06 19:42:25odd_blokesetmessageid: <>
2021-05-06 19:42:25odd_blokelinkissue43882 messages
2021-05-06 19:42:24odd_blokecreate