Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

urllib.parse: Allow more flexibility in schemes and URL resolution behavior #90495

Closed
lincolnauster mannequin opened this issue Jan 10, 2022 · 14 comments
Closed

urllib.parse: Allow more flexibility in schemes and URL resolution behavior #90495

lincolnauster mannequin opened this issue Jan 10, 2022 · 14 comments
Labels
3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@lincolnauster
Copy link
Mannequin

lincolnauster mannequin commented Jan 10, 2022

BPO 46337
Nosy @orsenthil, @merwok, @karlcow, @ethanfurman, @ambv, @lincolnauster
PRs
  • bpo-46337: Urllib.parse scheme-specific behavior without reliance on URL scheme #30520
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2022-01-10.21:50:56.334>
    labels = ['type-bug', 'library', '3.11']
    title = 'urllib.parse: Allow more flexibility in schemes and URL resolution behavior'
    updated_at = <Date 2022-04-04.03:47:32.421>
    user = 'https://github.com/lincolnauster'

    bugs.python.org fields:

    activity = <Date 2022-04-04.03:47:32.421>
    actor = 'ned.deily'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation = <Date 2022-01-10.21:50:56.334>
    creator = 'lincolnauster'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 46337
    keywords = ['patch']
    message_count = 11.0
    messages = ['410259', '413066', '413084', '413123', '413139', '413314', '416369', '416462', '416463', '416464', '416633']
    nosy_count = 6.0
    nosy_names = ['orsenthil', 'eric.araujo', 'karlcow', 'ethan.furman', 'lukasz.langa', 'lincolnauster']
    pr_nums = ['30520']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue46337'
    versions = ['Python 3.11']

    @lincolnauster
    Copy link
    Mannequin Author

    lincolnauster mannequin commented Jan 10, 2022

    It looks like this was discussed in 2013-2015 here: https://bugs.python.org/issue18828

    Basically, with all the URL schemes that exist in the world (and the possibility of a custom scheme), the current strategy of enumerating what do what in a hard-coded variable is a bit ... weird. Among the proposed solutions in 18828, some were:

    + Have a global registry of what schemes do what (criticized for being overkill, and I can't say I disagree)
    + Get rid of the scheme lists altogether, and assume every scheme supports everything (isn't backwards compatible; might break with intended behavior, too).
    + Switch the use_relative whitelist to a blacklist: (maybe fine in practice, maybe not; either way it doesn't really fix the underlying issue)
    + Work around it with global state (modify the uses_* lists; this is what I'm doing in my code, and I can't say I like it much).

    An alternative implemented I've implemented in my fork (https://github.com/lincolnauster/cpython/tree/urllib-custom-schemes) is to have an Enum with all the weird scheme-based behaviors that may occur (urllib.parse.SchemeClass in my fork) and allow passing a set of those Enums to functions relying on scheme-specific behavior, and adding all the elements of that set to what's been determined by the scheme. (See the test case for a concrete example; this explanation is not great).

    Some things I like about this:
    + Backwards compatibility.
    + It makes the functions using it as a general strategy a bit more pure.
    + It makes client code deal with edge cases.

    Some things that could be changed:
    + There's no way to remove behaviors you *don't* want.
    + It makes client code deal with edge cases.

    As a side thought: if the above could be adopted, the uses_* lists could be enforced as immutable, which, while breaking compatibility, could make client code a bit cleaner.

    @lincolnauster lincolnauster mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jan 10, 2022
    @merwok
    Copy link
    Member

    merwok commented Feb 11, 2022

    I remember a discussion about this years ago.
    urllib is a module that pre-dates the idea of universal parsing for URIs, where the delimiters (like ://) are enough to determine the parts of a URI and give them meaning (host, port, user, path, etc).
    Backward compat for urllib is always a concern; someone said at the time that it could be good to have a new module for modern, generic parsing, but that hasn’t happened. Maybe a new parse function, or new parameter to the existing one, could be easier to add.

    @merwok merwok added the 3.11 only security fixes label Feb 11, 2022
    @lincolnauster
    Copy link
    Mannequin Author

    lincolnauster mannequin commented Feb 11, 2022

    Maybe a new parse function, or new parameter to the existing one,
    could be easier to add.

    If I'm understanding you right, that's what this (and the PR) is - an
    extra optional parameter to urllib.parse to supplement the existing
    (legacy?) hard-coded list.

    @merwok
    Copy link
    Member

    merwok commented Feb 12, 2022

    In my idea it would not be a list of things that you have to pass piecemeal to request specific behaviour, but another function or a new param (like parse(string, universal=True)) that implements universal parsing.

    We could even handle things like bpo-22852 in that mode (although ironically, correct behaviour for that requires having a registry of schemes).

    @lincolnauster
    Copy link
    Mannequin Author

    lincolnauster mannequin commented Feb 12, 2022

    In my idea it would not be a list of things that you have to pass
    piecemeal to request specific behaviour, but another function or a new
    param (like parse(string, universal=True)) that implements universal
    parsing.

    If I'm correct in my understanding of a universal parse function (a
    function with all the SchemeClasses enabled unilaterally), some
    parse_universal function would be a pretty trivial thing to add with the
    API I've already got here (though it wouldn't address 22852 without some
    extra work afaict). I do think keeping the 'piecemeal' options exposed
    has some utility, though, especially since the uses_* lists already
    treat them on such a granular level.

    Do we think a parse_universal function would be helpful to add on top of
    this, or just repetitive?

    @karlcow
    Copy link
    Mannequin

    karlcow mannequin commented Feb 16, 2022

    Just to note that there is a maintained list of officially accepted schemes at IANA.
    https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml

    In addition there is a list of unofficial schemes on wikipedia
    https://en.wikipedia.org/wiki/List_of_URI_schemes#Unofficial_but_common_URI_schemes

    @ethanfurman
    Copy link
    Member

    Éric Araujo wrote on PR30520:
    ----------------------------

    No, we should not redefine the behavior of urlparse.

    I was always talking about adding another function. Yes it can be a one-liner,
    but my point is that I don’t see the usefulness of having the separate flags to
    pick and choose parts of standard parsing.

    I suspect the usefulness comes from error checking -- if a scheme doesn't support parameters, then having what looks like parameters converted would not be helpful.

    Further, while a new function is definitely safer, how many parse options do we need? Anyone else remember os.popen(), os.popen2, os.popen3, and, finally, os.popen4()?

    Assuming we just enhance the existing function, would it be more palatable if there was a SchemeFlag.ALL, so universal parsing was just urlparse(uri_string, flags=SchemeFlag.ALL)? To be really user-friendly, we could have:

        class SchemeFlag(Flag):
            RELATIVE = auto()
            NETLOC = auto()
            PARAMS = auto()
            UNIVERSAL = RELATIVE | NETLOC | PARAMS
            #
            def __repr__(self):
                return f"{self.module}.{self._name_}"
            __str__ = __repr__
        RELATIVE, NETLOC, PARAMS, UNIVERSAL = SchemeFlag

    Then the above call becomes:

        urlparse(uri_string, flags=UNIVERSAL)

    @merwok
    Copy link
    Member

    merwok commented Mar 31, 2022

    I would like to know what Senthil is thinking before the PR with options à la carte are merged!

    @ethanfurman
    Copy link
    Member

    Sounds good.

    @orsenthil
    Copy link
    Member

    I will review this in a day.
    I had been following the conversation, but couldn't look deeper into the code.
    Thank you for engaging and contributions.

    @orsenthil
    Copy link
    Member

    Hi all, I was looking at it. Introducing an enum at the last parameter is going to add cost of understanding the behavior to this function. I am doing further reading on the previous discussions and PR(s) now.

    @qwerazzfffs qwerazzfffs mannequin added build The build process and cross-build docs Documentation in the Doc dir extension-modules C modules in the Modules dir interpreter-core (Objects, Python, Grammar, and Parser dirs) OS-mac topic-regex tests Tests in the Lib/test dir topic-unicode OS-windows topic-XML topic-2to3 topic-ctypes topic-email topic-argument-clinic topic-SSL labels Apr 4, 2022
    @ned-deily ned-deily removed topic-argument-clinic topic-SSL topic-C-API interpreter-core (Objects, Python, Grammar, and Parser dirs) 3.7 (EOL) end of life labels Apr 4, 2022
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @oldaccountdeadname
    Copy link

    @orsenthil any udpates? I'd like to continue with this PR as soon as is convenient :)

    @orsenthil
    Copy link
    Member

    Hi @lincolnauster , I was -1 and was thinking much on introducing a flag with the enum in the parse module.

    urlparse(url, scheme='', allow_fragments=True, flags=SchemeFlag(0)):
    

    This API signature is going to confuse people and will be huge blocker for further adoption and change (even if the default arguments are specified). I was thinking how best to mitigate that.

    1. Did we ever consider not going to flag as as parameter?
    2. Any other default for the flag rather than an enum?

    I am going to paste this comment on the PR too, and we could continue the discussion there. If design design required, we can bring it to a wider forum too.

    @orsenthil
    Copy link
    Member

    This is closed with wont fix (at least for the implementation suggested in this PR - #30520) - A simpler approach if suggested by a new issue can be considered.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants