classification
Title: urlparse doesn't validate the scheme
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: SilentGhost, devkral, orsenthil, steven.daprano
Priority: normal Keywords:

Created on 2018-12-02 15:27 by devkral, last changed 2018-12-03 22:51 by steven.daprano.

Pull Requests
URL Status Linked Edit
PR 10862 closed python-dev, 2018-12-03 10:50
Messages (8)
msg330884 - (view) Author: (devkral) Date: 2018-12-02 15:27
the scheme argument of urlsplit/urlparse is completely broken.
here two examples:

urlunsplit(urlsplit("httpbin.org", scheme="https://"))
'https://:httpbin.org'

urlunsplit(urlsplit("httpbin.org", scheme="https"))
'https:///httpbin.org'

Fix: change urlsplit logic like this:
...
url, scheme, _coerce_result = _coerce_args(url, scheme)
scheme = scheme.rstrip("://") # this removes ://
...
i = url.find('://') # harden against arbitrary :
if i > 0:
    ...
elif scheme:
    netloc, url = _splitnetloc(url, 0)  # if scheme is specified, netloc is implied

sry too lazy to create a patch from this. Most probably are all python versions affected but I checked only 2.7 and 3.7 .
msg330886 - (view) Author: SilentGhost (SilentGhost) * (Python triager) Date: 2018-12-02 15:40
While it might seem broken, it behaves according to documentation, it seems to me.
msg330888 - (view) Author: (devkral) Date: 2018-12-02 16:00
first a correction; there is one protocol where it make sense:
file
I would change:
...
elif scheme:
...
to
...
elif scheme and scheme != "file":
...


Second: no it is not a correct behaviour. Urlunsplit is supposed to create a valid url from a tuple created by urlsplit. :/// is not a valid url (except for the file protocol).

Third: rstrip could be simplified to rstrip(":/")
msg330904 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-12-03 00:26
I don't think this is broken, but I do think it could be documented better. You have to read the documentation for `urlparse` to see this:

[Quote]
Following the syntax specifications in RFC 1808, urlparse recognizes a netloc only if it is properly introduced by ‘//’. Otherwise the input is presumed to be a relative URL and thus to start with a path component.

https://docs.python.org/3/library/urllib.parse.html#urllib.parse.urlparse

so the function is correct. You're making two errors:

- providing a relative URL "httpbin.org" and expecting it to be treated as an absolute URL;

- specifying scheme+delimiter instead of the scheme alone.

So I don't think this is a bug. urlsplit rightly accepts any arbitrary string as a scheme (it used to have a whitelist of permitted schemes, and that was a problem), and we can't assume that :/// is ONLY valid for file protocols.

Unless you come up with a convincing argument for why this is a bug, I'm going to change it to a documentation issue. The docs could do with some improvement to make it more clear, although the examples are good.
msg330927 - (view) Author: (devkral) Date: 2018-12-03 10:24
Ok, then the problem is in unsplit.
Because: :/// is not a valid url.

Next try: in urlunsplit:


if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
    if url and url[:1] != '/' and scheme in ("file", ""): # my change
        url = '/' + url
msg330938 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-12-03 11:39
You haven't given a convincing reason that there is a problem that needs fixing, or if there is, that your patch is the right way to fix it.

You have already pointed out that there is at least one scheme where :/// is part of a valid URL: "file:///". Where there is one, there could be others, if not today, then in the future:

    spam:///eggs.cheese

There are well over 200 registered schemes:

https://www.iana.org/assignments/uri-schemes/uri-schemes.xhtml

I don't think it will be practical to put in a whitelist or a blacklist of schemes that are, or aren't, permitted to include :/// after the scheme. It is the caller's responsibility to use the correct scheme, without adding extra characters to the end.

I asked you to justify why this should be considered a bug in the library, rather than a bug in your code. I'm not an expert on URLs, but the functions look correct to me. If you can't justify why this is a bug in the library that needs fixing, rather than user-error, we should close this or change it to a request for better documentation.
msg330976 - (view) Author: (devkral) Date: 2018-12-03 19:48
ah, I get the idea behind urlunsplit. But still:
for e.g. http, http:/// is invalid. These urls are invalid for e.g. requests.
May I ask:
urllib.parse is for web protocol urls? If yes, then there should be an option which defaults to my version.
msg330993 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2018-12-03 22:51
I'm changing the name to better describe the problem, and suggest a better solution.

The urlparse.urlsplit and .urlunsplit functions currently don't validate the scheme argument, if given. According to the RFC:

   Scheme names consist of a sequence of characters. The lower case
   letters "a"--"z", digits, and the characters plus ("+"), period
   ("."), and hyphen ("-") are allowed. For resiliency, programs
   interpreting URLs should treat upper case letters as equivalent to
   lower case in scheme names (e.g., allow "HTTP" as well as "http").

https://www.ietf.org/rfc/rfc1738.txt

If the scheme is specified, I suggest it should be normalised to lowercase and validated, something like this:

    # untested
    if scheme:
        # scheme_chars already defined in module
        badchars = set(scheme) - set(scheme_chars)
        if badchars:
            raise ValueError('"%c" is invalid in URL schemes' % badchars.pop())
        scheme = scheme.lower()


This will help avoid errors such as passing 'http://' as the scheme.
History
Date User Action Args
2018-12-03 22:51:57steven.dapranosetversions: + Python 3.8, - Python 2.7, Python 3.7
title: urlsplit scheme argument broken -> urlparse doesn't validate the scheme
messages: + msg330993

keywords: - patch
stage: patch review ->
2018-12-03 19:48:46devkralsetmessages: + msg330976
2018-12-03 11:39:41steven.dapranosetmessages: + msg330938
2018-12-03 10:50:08python-devsetkeywords: + patch
stage: patch review
pull_requests: + pull_request10097
2018-12-03 10:24:15devkralsetmessages: + msg330927
2018-12-03 00:26:25steven.dapranosetnosy: + steven.daprano
messages: + msg330904
2018-12-02 16:00:09devkralsetmessages: + msg330888
2018-12-02 15:40:47SilentGhostsetnosy: + SilentGhost, orsenthil
type: behavior
messages: + msg330886
2018-12-02 15:27:37devkralcreate