Title: urllib.parse: Allow more flexibility in schemes and URL resolution behavior
Messages (11)
msg410259 - (view) Author: Lincoln Auster (lincolnauster) * Date: 2022-01-10 21:50
It looks like this was discussed in 2013-2015 here:

Basically, with all the URL schemes that exist in the world (and the possibility of custom schemes), the current strategy of enumerating which schemes do what in hard-coded variables is a bit ... weird. Among the solutions proposed in 18828 were:

+ Have a global registry of what schemes do what (criticized for being overkill, and I can't say I disagree)
+ Get rid of the scheme lists altogether, and assume every scheme supports everything (isn't backwards compatible; might break with intended behavior, too).
+ Switch the uses_relative whitelist to a blacklist (maybe fine in practice, maybe not; either way it doesn't really fix the underlying issue).
+ Work around it with global state (modify the uses_* lists; this is what I'm doing in my code, and I can't say I like it much).
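
For illustration, the global-state workaround in the last bullet might look roughly like the sketch below. The scheme name `myproto` and the host are made up for the example; `uses_relative` and `uses_netloc` are the existing module-level lists in urllib.parse. The try/finally only undoes the mutation so the sketch leaves the module as it found it.

```python
import urllib.parse as up

SCHEME = "myproto"  # hypothetical custom scheme, not registered anywhere
base = f"{SCHEME}://example.org/a/b"

# Unknown scheme: urljoin does not resolve the reference against the base.
before = up.urljoin(base, "c")

# Workaround: mutate the module-level lists (global state).
up.uses_relative.append(SCHEME)
up.uses_netloc.append(SCHEME)
try:
    # Now the reference is resolved like it would be for http.
    after = up.urljoin(base, "c")
finally:
    # Undo the mutation so other code is not affected.
    up.uses_relative.remove(SCHEME)
    up.uses_netloc.remove(SCHEME)

print(before)
print(after)
```

Every caller in the process sees the modified lists while they are patched, which is the main reason this approach feels unpleasant.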

An alternative I've implemented in my fork is to have an Enum of all the weird scheme-based behaviors that may occur (urllib.parse.SchemeClass in my fork) and to allow passing a set of those Enum members to functions that rely on scheme-specific behavior; the members of that set are added to whatever has been determined by the scheme. (See the test case for a concrete example; this explanation is not great.)
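
A rough sketch of the "add the passed set to what the scheme determines" idea, not the code from the fork: the member names RELATIVE, NETLOC and PARAMS and the helper `behaviours` are assumptions made for this example, chosen to mirror the existing uses_relative, uses_netloc and uses_params lists.

```python
from enum import Enum, auto
import urllib.parse as up


class SchemeClass(Enum):
    # Hypothetical members; the fork's actual names may differ.
    RELATIVE = auto()
    NETLOC = auto()
    PARAMS = auto()


# Map each behaviour to the hard-coded list that currently controls it.
_LISTS = {
    SchemeClass.RELATIVE: up.uses_relative,
    SchemeClass.NETLOC: up.uses_netloc,
    SchemeClass.PARAMS: up.uses_params,
}


def behaviours(scheme, extra=frozenset()):
    """Behaviours implied by the scheme lists, plus any the caller adds."""
    from_lists = {cls for cls, lst in _LISTS.items() if scheme in lst}
    return from_lists | set(extra)


print(behaviours("http"))
print(behaviours("myproto", {SchemeClass.RELATIVE, SchemeClass.NETLOC}))
```

Because the extra set is only ever unioned in, known schemes keep their current behaviour, and there is no way to switch a behaviour off.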

Some things I like about this:
+ Backwards compatibility.
+ As a general strategy, it makes the functions using it a bit more pure.
+ It makes client code deal with edge cases.

Some things that could be changed:
+ There's no way to remove behaviors you *don't* want.
+ It makes client code deal with edge cases.

As a side thought: if the above could be adopted, the uses_* lists could be enforced as immutable, which, while breaking compatibility, could make client code a bit cleaner.
msg413066 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2022-02-11 12:56
I remember a discussion about this years ago.
urllib is a module that pre-dates the idea of universal parsing for URIs, where the delimiters (like ://) are enough to determine the parts of a URI and give them meaning (host, port, user, path, etc).
Backward compat for urllib is always a concern; someone said at the time that it could be good to have a new module for modern, generic parsing, but that hasn’t happened.  Maybe a new parse function, or new parameter to the existing one, could be easier to add.
msg413084 - (view) Author: Lincoln Auster (lincolnauster) * Date: 2022-02-11 16:24
> Maybe a new parse function, or new parameter to the existing one,
> could be easier to add.

If I'm understanding you right, that's what this (and the PR) is - an
extra optional parameter to urllib.parse to supplement the existing
(legacy?) hard-coded list.
msg413123 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2022-02-12 11:47
In my idea it would not be a list of things that you have to pass piecemeal to request specific behaviour, but another function or a new param (like `parse(string, universal=True)`) that implements universal parsing.

We could even handle things like #22852 in that mode (although ironically, correct behaviour for that requires having a registry of schemes).
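
As a sketch of what scheme-independent ("universal") parsing could mean, the regular expression from RFC 3986, Appendix B splits any URI reference on its delimiters alone. The function name `parse_universal` and the `Parts` tuple are invented for this example and are not part of urllib.parse.

```python
import re
from typing import NamedTuple, Optional

# Regular expression from RFC 3986, Appendix B.
_URI_RE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


class Parts(NamedTuple):
    scheme: Optional[str]
    authority: Optional[str]
    path: str
    query: Optional[str]
    fragment: Optional[str]


def parse_universal(uri: str) -> Parts:
    """Split a URI reference using delimiters only, ignoring the scheme."""
    # Every group is optional, so the pattern matches any string.
    m = _URI_RE.match(uri)
    return Parts(m.group(2), m.group(4), m.group(5), m.group(7), m.group(9))


print(parse_universal("myproto://host:1/a;p?q=1#f"))
print(parse_universal("mailto:user@example.org"))
```

This sketch leaves parameters (the `;p` part) inside the path and does no validation, so it is only a starting point for comparing against urlparse.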
msg413139 - (view) Author: Lincoln Auster (lincolnauster) * Date: 2022-02-12 18:11
> In my idea it would not be a list of things that you have to pass
> piecemeal to request specific behaviour, but another function or a new
> param (like `parse(string, universal=True)`) that implements universal
> parsing.

If I'm correct in my understanding of a universal parse function (a
function with all the SchemeClasses enabled unilaterally), some
parse_universal function would be a pretty trivial thing to add with the
API I've already got here (though it wouldn't address 22852 without some
extra work afaict). I do think keeping the 'piecemeal' options exposed
has some utility, though, especially since the uses_* lists already
treat them on such a granular level.

Do we think a parse_universal function would be helpful to add on top of
this, or just repetitive?
msg413314 - (view) Author: karl (karlcow) * Date: 2022-02-16 04:01
Just to note that there is a maintained list of officially accepted schemes at IANA.

In addition, there is a list of unofficial schemes on Wikipedia.
msg416369 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2022-03-30 14:54
Éric Araujo wrote on PR30520:
> No, we should not redefine the behavior of urlparse.
> I was always talking about adding another function. Yes it can be a one-liner,
> but my point is that I don’t see the usefulness of having the separate flags to
> pick and choose parts of standard parsing.

I suspect the usefulness comes from error checking -- if a scheme doesn't support parameters, then having what looks like parameters converted would not be helpful.
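
A small illustration of the parameter case with today's urlparse; the scheme `myproto` and the host are made up. Since `http` is in uses_params, the `;b` suffix is split off as params, while for a scheme missing from that list it stays in the path.

```python
from urllib.parse import urlparse

known = urlparse("http://example.org/a;b")
unknown = urlparse("myproto://example.org/a;b")  # hypothetical scheme

# For http the text after ';' is treated as params.
print(known.path, repr(known.params))
# For the unknown scheme it is left in the path.
print(unknown.path, repr(unknown.params))
```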

Further, while a new function is definitely safer, how many parse options do we need?  Anyone else remember `os.popen()`, `os.popen2()`, `os.popen3()`, and, finally, `os.popen4()`?

Assuming we just enhance the existing function, would it be more palatable if there was a `SchemeFlag.ALL`, so universal parsing was just `urlparse(uri_string, flags=SchemeFlag.ALL)`?  To be really user-friendly, we could have:

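    from enum import Flag, auto

    class SchemeFlag(Flag):
        RELATIVE = auto()
        NETLOC = auto()
        PARAMS = auto()
        # assumed: an alias combining every behaviour (the `ALL` mentioned above)
        UNIVERSAL = RELATIVE | NETLOC | PARAMS
        def __repr__(self):
            # show e.g. `urllib.parse.NETLOC` instead of `<SchemeFlag.NETLOC: 2>`
            return f"{self.__module__}.{self._name_}"
        __str__ = __repr__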

Then the above call becomes:

    urlparse(uri_string, flags=UNIVERSAL)
msg416462 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2022-03-31 22:10
I would like to know what Senthil is thinking before the PR with options à la carte is merged!
msg416463 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2022-03-31 22:31
Sounds good.
msg416464 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2022-03-31 22:41
I will review this in a day. 
I had been following the conversation, but couldn't look deeper into the code.
Thank you for the engagement and contributions.
msg416633 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2022-04-03 17:36
Hi all, I was looking at it. Introducing an enum as the last parameter is going to add to the cost of understanding this function's behavior. I am doing further reading on the previous discussions and PR(s) now.
