classification
Title: urllib.parse: Allow more flexibility in schemes and URL resolution behavior
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.11
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eric.araujo, ethan.furman, karlcow, lincolnauster, lukasz.langa, orsenthil
Priority: normal Keywords: patch

Created on 2022-01-10 21:50 by lincolnauster, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL Status Linked Edit
PR 30520 open lincolnauster, 2022-01-10 21:55
Messages (11)
msg410259 - (view) Author: Lincoln Auster (lincolnauster) * Date: 2022-01-10 21:50
It looks like this was discussed in 2013-2015 here: https://bugs.python.org/issue18828

Basically, with all the URL schemes that exist in the world (and the possibility of a custom scheme), the current strategy of enumerating which schemes do what in hard-coded variables is a bit ... weird. Among the proposed solutions in 18828, some were:

+ Have a global registry of what schemes do what (criticized for being overkill, and I can't say I disagree)
+ Get rid of the scheme lists altogether, and assume every scheme supports everything (isn't backwards compatible; might break with intended behavior, too).
+ Switch the uses_relative whitelist to a blacklist (maybe fine in practice, maybe not; either way it doesn't really fix the underlying issue)
+ Work around it with global state (modify the uses_* lists; this is what I'm doing in my code, and I can't say I like it much).
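To make the last workaround concrete, here is a minimal sketch of what mutating the module-level lists looks like. The scheme name "myscheme" and the URLs are made up for illustration; `uses_relative`, `uses_netloc` and `urljoin` are the existing `urllib.parse` names.

```python
from urllib import parse

base = "myscheme://host/a/b"  # "myscheme" is a made-up scheme for illustration

# Unknown scheme: urljoin does not resolve and returns the reference unchanged.
before = parse.urljoin(base, "c")

# The workaround: mutate module-level lists. This affects every caller in the
# process, so the sketch undoes the change afterwards.
parse.uses_relative.append("myscheme")
parse.uses_netloc.append("myscheme")
try:
    after = parse.urljoin(base, "c")
finally:
    parse.uses_relative.remove("myscheme")
    parse.uses_netloc.remove("myscheme")

print(before)
print(after)
```

Both lists have to be touched here: `uses_relative` enables path resolution, and `uses_netloc` lets the result inherit the host from the base.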

An alternative I've implemented in my fork (https://github.com/lincolnauster/cpython/tree/urllib-custom-schemes) is to have an Enum with all the weird scheme-based behaviors that may occur (urllib.parse.SchemeClass in my fork) and to allow passing a set of those Enum members to functions that rely on scheme-specific behavior; the members of that set are added to whatever the scheme itself determines. (See the test case for a concrete example; this explanation is not great.)
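A rough sketch of the idea, not the code from the fork: the member names and the `classes_for` helper below are assumptions made up for illustration, and only the `uses_*` lists are real `urllib.parse` attributes. It shows the "union of what the scheme implies and what the caller passes" part.

```python
from enum import Enum, auto
from urllib import parse


class SchemeClass(Enum):
    # Hypothetical members; the fork may name or split these differently.
    RELATIVE = auto()
    NETLOC = auto()
    PARAMS = auto()


def classes_for(scheme, extra=frozenset()):
    """Behaviors implied by the hard-coded lists, plus caller-supplied ones."""
    found = set(extra)
    if scheme in parse.uses_relative:
        found.add(SchemeClass.RELATIVE)
    if scheme in parse.uses_netloc:
        found.add(SchemeClass.NETLOC)
    if scheme in parse.uses_params:
        found.add(SchemeClass.PARAMS)
    return found


known = classes_for("http")
custom = classes_for("myscheme", {SchemeClass.RELATIVE})
```

Known schemes keep their existing behavior, and a custom scheme only gets what the caller asks for, which is also why nothing can be removed this way.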

Some things I like about this:
+ Backwards compatibility.
+ It makes the functions using it as a general strategy a bit more pure.
+ It makes client code deal with edge cases.

Some things that could be changed:
+ There's no way to remove behaviors you *don't* want.
+ It makes client code deal with edge cases.

As a side thought: if the above could be adopted, the uses_* lists could be enforced as immutable, which, while breaking compatibility, could make client code a bit cleaner.
msg413066 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2022-02-11 12:56
I remember a discussion about this years ago.
urllib is a module that pre-dates the idea of universal parsing for URIs, where the delimiters (like ://) are enough to determine the parts of a URI and give them meaning (host, port, user, path, etc).
Backward compat for urllib is always a concern; someone said at the time that it could be good to have a new module for modern, generic parsing, but that hasn’t happened.  Maybe a new parse function, or new parameter to the existing one, could be easier to add.
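For reference, a minimal sketch of what delimiter-only parsing can look like, using the regular expression given in RFC 3986, Appendix B. The function name `split_generic` and the sample URI are made up for illustration; this is not an existing or proposed urllib API.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split_generic(uri):
    """Split a URI into its generic components without consulting the scheme."""
    m = URI_RE.match(uri)  # every group is optional, so this always matches
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),
        "fragment": m.group(9),
    }


parts = split_generic("myscheme://user@host:8080/a/b;p?x=1#frag")
print(parts)
```

No scheme list is involved at any point; the delimiters alone decide where each component starts and ends.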
msg413084 - (view) Author: Lincoln Auster (lincolnauster) * Date: 2022-02-11 16:24
> Maybe a new parse function, or new parameter to the existing one,
> could be easier to add.

If I'm understanding you right, that's what this (and the PR) is - an
extra optional parameter to urllib.parse to supplement the existing
(legacy?) hard-coded list.
msg413123 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2022-02-12 11:47
In my idea it would not be a list of things that you have to pass piecemeal to request specific behaviour, but another function or a new param (like `parse(string, universal=True)`) that implements universal parsing.

We could even handle things like #22852 in that mode (although ironically, correct behaviour for that requires having a registry of schemes).
msg413139 - (view) Author: Lincoln Auster (lincolnauster) * Date: 2022-02-12 18:11
> In my idea it would not be a list of things that you have to pass
> piecemeal to request specific behaviour, but another function or a new
> param (like `parse(string, universal=True)`) that implements universal
> parsing.

If I'm correct in my understanding of a universal parse function (a
function with all the SchemeClasses enabled unilaterally), some
parse_universal function would be a pretty trivial thing to add with the
API I've already got here (though it wouldn't address 22852 without some
extra work afaict). I do think keeping the 'piecemeal' options exposed
has some utility, though, especially since the uses_* lists already
treat them on such a granular level.

Do we think a parse_universal function would be helpful to add on top of
this, or just repetitive?
msg413314 - (view) Author: karl (karlcow) * Date: 2022-02-16 04:01
Just to note that there is a maintained list of officially accepted schemes at IANA.
https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml

In addition, there is a list of unofficial schemes on Wikipedia:
https://en.wikipedia.org/wiki/List_of_URI_schemes#Unofficial_but_common_URI_schemes
msg416369 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2022-03-30 14:54
Éric Araujo wrote on PR30520:
----------------------------
> No, we should not redefine the behavior of urlparse.
> 
> I was always talking about adding another function. Yes it can be a one-liner,
> but my point is that I don’t see the usefulness of having the separate flags to
> pick and choose parts of standard parsing.

I suspect the usefulness comes from error checking -- if a scheme doesn't support parameters, then having what looks like parameters converted would not be helpful.
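A small illustration of the current per-scheme behavior for parameters, using only the existing `urlparse`. The scheme "myscheme" and the host are made up for illustration.

```python
from urllib.parse import urlparse

# "http" is in uses_params, so the ";b" suffix is split off as params.
known = urlparse("http://host/a;b")

# "myscheme" is a made-up scheme that is not in uses_params, so ";b" stays
# part of the path.
unknown = urlparse("myscheme://host/a;b")

print(known.path, known.params)
print(unknown.path, unknown.params)
```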

Further, while a new function is definitely safer, how many parse options do we need?  Anyone else remember `os.popen()`, `os.popen2()`, `os.popen3()`, and, finally, `os.popen4()`?

Assuming we just enhance the existing function, would it be more palatable if there was a `SchemeFlag.ALL`, so universal parsing was just `urlparse(uri_string, flags=SchemeFlag.ALL)`?  To be really user-friendly, we could have:

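    from enum import Flag, auto

    class SchemeFlag(Flag):
        RELATIVE = auto()
        NETLOC = auto()
        PARAMS = auto()
        UNIVERSAL = RELATIVE | NETLOC | PARAMS
        #
        def __repr__(self):
            # Show members as "module.NAME" instead of the default Flag repr.
            return f"{self.__module__}.{self._name_}"
        __str__ = __repr__
    # Export every member to module level. Iterating the class is not used
    # here because newer Pythons skip composite members such as UNIVERSAL.
    globals().update(SchemeFlag.__members__)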

Then the above call becomes:

    urlparse(uri_string, flags=UNIVERSAL)
msg416462 - (view) Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2022-03-31 22:10
I would like to know what Senthil is thinking before the PR with options à la carte is merged!
msg416463 - (view) Author: Ethan Furman (ethan.furman) * (Python committer) Date: 2022-03-31 22:31
Sounds good.
msg416464 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2022-03-31 22:41
I will review this in a day.
I had been following the conversation, but couldn't look deeper into the code.
Thank you for the engagement and the contributions.
msg416633 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2022-04-03 17:36
Hi all, I was looking at it. Introducing an enum as the last parameter is going to add to the cost of understanding this function's behavior. I am doing further reading on the previous discussions and PR(s) now.
History
Date User Action Args
2022-04-11 14:59:54  admin  set  github: 90495
2022-04-04 03:47:32  ned.deily  set  assignee: docs@python ->

nosy: - barry, paul.moore, ronaldoussoren, vstinner, larry, tim.golden, ned.deily, ezio.melotti, mrabarnett, r.david.murray, docs@python, zach.ware, koobs, steve.dower, lys.nikolaou, pablogsal
components: - Build, Demos and Tools, Documentation, Extension Modules, Interpreter Core, macOS, Regular Expressions, Tests, Unicode, Windows, XML, 2to3 (2.x to 3.x conversion tool), ctypes, Cross-Build, email, Argument Clinic, FreeBSD, SSL, C API, Parser
versions: - Python 3.7
2022-04-04 03:46:29  ned.deily  set  hgrepos: - hgrepo414
2022-04-04 03:40:24  ned.deily  set  files: - mitre_f188eec1268fd49bdc7375fc5b77ded657c150875fede1a4d797f818d2514e88_120.csv
2022-04-04 03:28:46  qwerazzfffs  set  files: + mitre_f188eec1268fd49bdc7375fc5b77ded657c150875fede1a4d797f818d2514e88_120.csv

nosy: + larry, paul.moore, tim.golden, koobs, r.david.murray, zach.ware, steve.dower, ned.deily, barry, pablogsal, ezio.melotti, ronaldoussoren, lys.nikolaou, docs@python, vstinner, mrabarnett
versions: + Python 3.7
hgrepos: + hgrepo414
assignee: docs@python
components: + Build, Demos and Tools, Documentation, Extension Modules, Interpreter Core, macOS, Regular Expressions, Tests, Unicode, Windows, XML, 2to3 (2.x to 3.x conversion tool), ctypes, Cross-Build, email, Argument Clinic, FreeBSD, SSL, C API, Parser
2022-04-03 17:36:07  orsenthil  set  messages: + msg416633
2022-03-31 22:41:06  orsenthil  set  messages: + msg416464
2022-03-31 22:31:52  ethan.furman  set  messages: + msg416463
2022-03-31 22:10:19  eric.araujo  set  messages: + msg416462
2022-03-30 14:54:08  ethan.furman  set  messages: + msg416369
2022-03-29 16:07:16  ethan.furman  set  nosy: + ethan.furman
2022-02-16 04:01:03  karlcow  set  nosy: + karlcow
messages: + msg413314
2022-02-14 22:11:19  brett.cannon  set  nosy: - brett.cannon
2022-02-12 18:11:25  lincolnauster  set  messages: + msg413139
2022-02-12 11:47:38  eric.araujo  set  messages: + msg413123
2022-02-11 16:24:18  lincolnauster  set  messages: + msg413084
2022-02-11 12:56:26  eric.araujo  set  nosy: + eric.araujo, brett.cannon, orsenthil, lukasz.langa

messages: + msg413066
versions: + Python 3.11
2022-01-10 21:55:16  lincolnauster  set  keywords: + patch
stage: patch review
pull_requests: + pull_request28721
2022-01-10 21:50:56  lincolnauster  create