classification
Title: urllib.parse doesn't round-trip file URI's with multiple leading slashes
Type: behavior Stage: patch review
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7, Python 3.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: chris.jerdonek, epicfaace, martin.panter, piotr.dobrogost, v2m, xtreak
Priority: normal Keywords: patch

Created on 2018-07-30 04:39 by chris.jerdonek, last changed 2019-08-27 01:16 by epicfaace.

Pull Requests
URL Status Linked Edit
PR 15297 open epicfaace, 2019-08-15 03:29
Messages (9)
msg322652 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2018-07-30 04:39
urllib.parse doesn't seem to round-trip file URI's containing multiple leading slashes.  For example, this--

    import urllib.parse

    def round_trip(url):
        parsed = urllib.parse.urlsplit(url)
        new_url = urllib.parse.urlunsplit(parsed)
        print(f'{url} [{parsed}]\n{new_url}')
        print('ROUNDTRIP: {}\n'.format(url == new_url))

    for i in range(4):
        round_trip('file://{}root/a'.format(i * '/'))

results in--

    file://root/a [SplitResult(scheme='file', netloc='root', path='/a', query='', fragment='')]
    file://root/a
    ROUNDTRIP: True

    file:///root/a [SplitResult(scheme='file', netloc='', path='/root/a', query='', fragment='')]
    file:///root/a
    ROUNDTRIP: True

    file:////root/a [SplitResult(scheme='file', netloc='', path='//root/a', query='', fragment='')]
    file://root/a
    ROUNDTRIP: False

    file://///root/a [SplitResult(scheme='file', netloc='', path='///root/a', query='', fragment='')]
    file:///root/a
    ROUNDTRIP: False

URI's of the form file:////<host>/<share>/<path> occur, for example, when one wants to git-clone a UNC path on Windows:
https://stackoverflow.com/a/2520121/262819

Here is where CPython defines urlunsplit():
https://github.com/python/cpython/blob/4e11c461ed39085b8495a35c9367b46d8a0d306d/Lib/urllib/parse.py#L465-L482
(The '//' special-casing seems to occur in this line here:
https://github.com/python/cpython/blob/4e11c461ed39085b8495a35c9367b46d8a0d306d/Lib/urllib/parse.py#L473 )
 
And here is where the round-tripping is tested:
https://github.com/python/cpython/blob/4e11c461ed39085b8495a35c9367b46d8a0d306d/Lib/test/test_urlparse.py#L156
(Three initial leading slashes is tested, but not the problem case of four or more.)
msg322671 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-07-30 12:47
This is an issue with Python 2 too which I hope can be fixed too. The original logic in the code was committed around 16 years back : https://github.com/python/cpython/commit/bbc0568a5c7d3849a22c78d545823a4b952c0933 and tests are also around 10 years old too.

➜  cpython git:(2bea771609) ✗ ./python.exe
Python 2.7.15+ (remotes/upstream/2.7:2bea771609, Jul 30 2018, 18:07:51)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
➜  cpython git:(2bea771609) ✗ ./python.exe bpo34276.py
file://root/a [SplitResult(scheme='file', netloc='root', path='/a', query='', fragment='')]
file://root/a
ROUNDTRIP: True

file:///root/a [SplitResult(scheme='file', netloc='', path='/root/a', query='', fragment='')]
file:///root/a
ROUNDTRIP: True

file:////root/a [SplitResult(scheme='file', netloc='', path='//root/a', query='', fragment='')]
file://root/a
ROUNDTRIP: False

file://///root/a [SplitResult(scheme='file', netloc='', path='///root/a', query='', fragment='')]
file:///root/a
ROUNDTRIP: False


Thanks
msg322674 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python committer) Date: 2018-07-30 13:07
I just checked back the behavior on Perl's https://github.com/libwww-perl/URI/ . It seems to handle that along with other additional cases. Maybe some of the tests can be adopted from there for better coverage too (https://github.com/libwww-perl/URI/blob/master/t/split.t)

$ cat bpo34276.pl
use URI::Split qw(uri_split uri_join);

sub print_url{
    my $uri = shift;
    print "original uri ", $uri, "\n";
    ($scheme, $auth, $path, $query, $frag) = uri_split($uri);
    $uri = uri_join($scheme, $auth, $path, $query, $frag);
    print "returned uri ", $uri, "\n";
}

print_url("file://root/a");
print_url("file:///root/a");
print_url("file:////root/a");
print_url("file://///root/a");

$ perl bpo34276.pl
original uri file://root/a
returned uri file://root/a
original uri file:///root/a
returned uri file:///root/a
original uri file:////root/a
returned uri file:////root/a
original uri file://///root/a
returned uri file://///root/a


Thanks
msg322675 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-07-30 13:10
This may be a very old regression (from 2002) caused by Issue 591713 and Mercurial rev. 554f975073a0. The original check for the double slash, added in 0d6bd391acd8, “escapes” a path beginning with a double slash by prefixing it with two more slashes (empty “netloc”). This should round-trip Chris’s problem URLs.

I think the logic in “urlsplit” should always add the extra double slash for the netloc, regardless of path, at least if a scheme is present and it is registered in “uses_netloc”. This should fix Chris’s instance of the bug, since “file:” is registered. There is already a patch in Issue 1722348 which should do this (although it includes other changes as well).

The double slash should also be escaped if no scheme is present. (The empty scheme string is already in “uses_netloc”.) This might satisfy Issue 23505.

IMO it would be better to do the escaping by default, for all schemes unknown to “urllib”, and to blacklist specific schemes like “mailto:” instead. But that would be out of scope for a bug fix.
msg322716 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2018-07-31 03:38
Thanks for all the extra info. A couple more comments:

1. I came across this issue when diagnosing the following pip issue ("pip install git+file://" not working for Windows UNC paths):
https://github.com/pypa/pip/issues/3783

2. URLs of the form "file:////root" (with four or more leading slashes) are perhaps not valid URI's technically. See Section 3. "Syntax Components" of RFC 3986, where it says, "When authority [i.e. netloc] is not present, the path cannot begin with two slash characters ('//')":
https://tools.ietf.org/html/rfc3986#section-3

However, I don't think that means Python shouldn't try to roundtrip it successfully. Also, git-clone is apparently okay with URLs of this form, and does the right thing with them.
msg322722 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-07-31 05:50
I think your URLs are valid by RFC 3986. "When authority is not present" refers to URLs without the double-slash prefix, like the "urn:example:animal:ferret:nose". The RFC treats empty authority and no authority as different cases. If authority is present, the format for hier-part has to be

"//" authority path-abempty

Authority may be an empty string:

authority = [userinfo "@"] host [":" port]
host = IP-literal / IPv4address / reg-name
reg-name = *(unreserved / pct-encoded / sub-delims)  ; May be empty

Path-abempty may begin with two slashes if the first two segments are empty strings:

path-abempty = *("/" segment)
segment = *pchar  ; May be empty
msg322737 - (view) Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2018-07-31 07:11
> The RFC treats empty authority and no authority as different cases.

I'm not well-versed on this. But I guess this means urllib.parse doesn't support this distinction. For example:

  >>> urllib.parse.urlsplit('file:/foo')
  SplitResult(scheme='file', netloc='', path='/foo', query='', fragment='')
  >>> urllib.parse.urlsplit('file:///foo')
  SplitResult(scheme='file', netloc='', path='/foo', query='', fragment='')
  >>> urllib.parse.urlsplit('file:/foo') == \
      urllib.parse.urlsplit('file:///foo')
  True

Both have authority / netloc equal to the empty string, even though in the first example the authority isn't present per your comment.
msg322756 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-07-31 10:58
Yes urllib doesn’t distinguish a missing authority/netloc from an empty string. The same for the ?query and #fragment parts. There is Issue 22852 open about that.
msg324718 - (view) Author: Vladimir Matveev (v2m) * Date: 2018-09-07 04:57
file URI scheme is covered by RFC8089, specifically https://tools.ietf.org/html/rfc8089#appendix-E.3.2.
History
Date User Action Args
2019-08-27 01:16:36epicfaacesetversions: + Python 3.9
2019-08-15 03:29:05epicfaacesetkeywords: + patch
stage: patch review
pull_requests: + pull_request15020
2019-08-15 03:28:56epicfaacesetnosy: + epicfaace
2018-09-07 04:57:35v2msetnosy: + v2m
messages: + msg324718
2018-07-31 10:58:50martin.pantersetmessages: + msg322756
2018-07-31 07:48:58piotr.dobrogostsetnosy: + piotr.dobrogost
2018-07-31 07:11:19chris.jerdoneksetmessages: + msg322737
2018-07-31 05:50:33martin.pantersetmessages: + msg322722
2018-07-31 03:38:53chris.jerdoneksetmessages: + msg322716
2018-07-30 13:10:24martin.pantersetnosy: + martin.panter
messages: + msg322675
2018-07-30 13:07:48xtreaksetmessages: + msg322674
2018-07-30 12:47:09xtreaksetmessages: + msg322671
2018-07-30 06:22:58xtreaksetnosy: + xtreak
2018-07-30 04:39:03chris.jerdonekcreate