urllib.parse doesn't round-trip file URI's with multiple leading slashes #78457

cjerdonek · 2018-07-30T04:39:03Z

BPO	34276
Nosy	@cjerdonek, @vadmium, @tirkarthi, @vladima, @epicfaace
PRs	bpo-34276: round-trip file URI's with multiple leading slashes #15297

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2018-07-30.04:39:03.225>
labels = ['3.7', '3.8', 'type-bug', 'library', '3.9']
title = "urllib.parse doesn't round-trip file URI's with multiple leading slashes"
updated_at = <Date 2019-08-27.01:16:36.907>
user = 'https://github.com/cjerdonek'

bugs.python.org fields:

activity = <Date 2019-08-27.01:16:36.907>
actor = 'epicfaace'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation = <Date 2018-07-30.04:39:03.225>
creator = 'chris.jerdonek'
dependencies = []
files = []
hgrepos = []
issue_num = 34276
keywords = ['patch']
message_count = 9.0
messages = ['322652', '322671', '322674', '322675', '322716', '322722', '322737', '322756', '324718']
nosy_count = 6.0
nosy_names = ['chris.jerdonek', 'martin.panter', 'piotr.dobrogost', 'xtreak', 'v2m', 'epicfaace']
pr_nums = ['15297']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue34276'
versions = ['Python 3.6', 'Python 3.7', 'Python 3.8', 'Python 3.9']

Linked PRs

gh-67693: Fix urlunparse() and urlunsplit() for URIs with path starting with multiple slashes and no authority #113563

cjerdonek · 2018-07-30T04:39:03Z

urllib.parse doesn't seem to round-trip file URIs containing multiple leading slashes. For example, this--

    import urllib.parse

    def round_trip(url):
        parsed = urllib.parse.urlsplit(url)
        new_url = urllib.parse.urlunsplit(parsed)
        print(f'{url} [{parsed}]\n{new_url}')
        print('ROUNDTRIP: {}\n'.format(url == new_url))

    for i in range(4):
        round_trip('file://{}root/a'.format(i * '/'))

results in--

file://root/a [SplitResult(scheme='file', netloc='root', path='/a', query='', fragment='')]
file://root/a
ROUNDTRIP: True

file:///root/a [SplitResult(scheme='file', netloc='', path='/root/a', query='', fragment='')]
file:///root/a
ROUNDTRIP: True

file:////root/a [SplitResult(scheme='file', netloc='', path='//root/a', query='', fragment='')]
file://root/a
ROUNDTRIP: False

file://///root/a [SplitResult(scheme='file', netloc='', path='///root/a', query='', fragment='')]
file:///root/a
ROUNDTRIP: False

URIs of the form file:////<host>/<share>/<path> occur, for example, when one wants to git-clone a UNC path on Windows:
https://stackoverflow.com/a/2520121/262819
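
As a rough illustration of where such URLs come from, the sketch below turns a UNC path into the four-slash form and shows how urlsplit() sees it. The helper name `unc_to_file_url` and the host and share names are made up for this example; it is not meant to describe how git or pip build the URL.

```python
from urllib.parse import urlsplit

def unc_to_file_url(unc_path):
    # Hypothetical helper: swap backslashes for forward slashes, so the
    # result starts with '//' followed by the host name.
    posix_style = unc_path.replace('\\', '/')
    # 'file://' supplies an empty authority, giving four slashes in total.
    return 'file://' + posix_style

url = unc_to_file_url(r'\\host\share\repo')
parts = urlsplit(url)
print(url)
print(repr(parts.netloc), repr(parts.path))
```

The parsed result has an empty netloc and a path that itself begins with two slashes, which is the combination that the rest of this report is about.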

Here is where CPython defines urlunsplit():

cpython/Lib/urllib/parse.py

Lines 465 to 482 in 4e11c46

    
    def urlunsplit(components):
        """Combine the elements of a tuple as returned by urlsplit() into a
        complete URL as a string. The data argument can be any five-item iterable.
        This may result in a slightly different, but equivalent URL, if the URL that
        was parsed originally had unnecessary delimiters (for example, a ? with an
        empty query; the RFC states that these are equivalent)."""
        scheme, netloc, url, query, fragment, _coerce_result = (
                                              _coerce_args(*components))
        if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
            if url and url[:1] != '/': url = '/' + url
            url = '//' + (netloc or '') + url
        if scheme:
            url = scheme + ':' + url
        if query:
            url = url + '?' + query
        if fragment:
            url = url + '#' + fragment
        return _coerce_result(url)

The '//' special-casing seems to occur on this line:

cpython/Lib/urllib/parse.py

Line 473 in 4e11c46

    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):

And here is where the round-tripping is tested:

cpython/Lib/test/test_urlparse.py

Line 156 in 4e11c46

('file:///tmp/junk.txt',

(Three leading slashes are tested, but not the problem case of four or more.)

tirkarthi · 2018-07-30T12:47:09Z

This is an issue in Python 2 as well, which I hope can also be fixed. The original logic was committed around 16 years ago (bbc0568), and the tests are around 10 years old.

➜ cpython git:(2bea771) ✗ ./python.exe
Python 2.7.15+ (remotes/upstream/2.7:2bea771609, Jul 30 2018, 18:07:51)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>
➜ cpython git:(2bea771) ✗ ./python.exe bpo-34276.py
file://root/a [SplitResult(scheme='file', netloc='root', path='/a', query='', fragment='')]
file://root/a
ROUNDTRIP: True

file:///root/a [SplitResult(scheme='file', netloc='', path='/root/a', query='', fragment='')]
file:///root/a
ROUNDTRIP: True

file:////root/a [SplitResult(scheme='file', netloc='', path='//root/a', query='', fragment='')]
file://root/a
ROUNDTRIP: False

file://///root/a [SplitResult(scheme='file', netloc='', path='///root/a', query='', fragment='')]
file:///root/a
ROUNDTRIP: False

Thanks

tirkarthi · 2018-07-30T13:07:48Z

I just checked the behavior of Perl's URI module (https://github.com/libwww-perl/URI/). It seems to handle this case along with some additional ones. Some of its tests could perhaps be adopted for better coverage (https://github.com/libwww-perl/URI/blob/master/t/split.t).

$ cat bpo34276.pl
use URI::Split qw(uri_split uri_join);

sub print_url {
    my $uri = shift;
    print "original uri ", $uri, "\n";
    ($scheme, $auth, $path, $query, $frag) = uri_split($uri);
    $uri = uri_join($scheme, $auth, $path, $query, $frag);
    print "returned uri ", $uri, "\n";
}

print_url("file://root/a");
print_url("file:///root/a");
print_url("file:////root/a");
print_url("file://///root/a");

$ perl bpo34276.pl
original uri file://root/a
returned uri file://root/a
original uri file:///root/a
returned uri file:///root/a
original uri file:////root/a
returned uri file:////root/a
original uri file://///root/a
returned uri file://///root/a

Thanks

vadmium · 2018-07-30T13:10:25Z

This may be a very old regression (from 2002) caused by bpo-591713 and Mercurial rev. 554f975073a0. The original check for the double slash, added in 0d6bd391acd8, “escapes” a path beginning with a double slash by prefixing it with two more slashes (empty “netloc”). This should round-trip Chris’s problem URLs.

I think the logic in “urlunsplit” should always add the extra double slash for the netloc, regardless of path, at least if a scheme is present and it is registered in “uses_netloc”. This should fix Chris’s instance of the bug, since “file:” is registered. There is already a patch in bpo-1722348 which should do this (although it includes other changes as well).

The double slash should also be escaped if no scheme is present. (The empty scheme string is already in “uses_netloc”.) This might satisfy bpo-23505.

IMO it would be better to do the escaping by default, for all schemes unknown to “urllib”, and to blacklist specific schemes like “mailto:” instead. But that would be out of scope for a bug fix.
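
To make the no-scheme case concrete, here is a small sketch of why the double slash needs escaping. The host name `example.com` and the variable names are assumptions for illustration only; the code uses just urlsplit() and plain string concatenation.

```python
from urllib.parse import urlsplit

# A path that happens to start with two slashes, with no scheme and no netloc.
path_only = '//example.com/x'

# Joined without escaping, the path is re-parsed as a netloc plus a path.
unescaped = urlsplit(path_only)
print(repr(unescaped.netloc), repr(unescaped.path))

# Prefixing an empty netloc ('//') keeps the whole string in the path.
escaped = urlsplit('//' + path_only)
print(repr(escaped.netloc), repr(escaped.path))
```

In the first case the first path segment is taken for a host name; in the second, the empty netloc is preserved and the path comes back unchanged.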

cjerdonek · 2018-07-31T03:38:53Z

Thanks for all the extra info. A couple more comments:

First, I came across this issue when diagnosing the following pip issue ("pip install git+file://" not working for Windows UNC paths):
pip install git+file:// doesn't work with Windows UNC paths pypa/pip#3783

Second, URLs of the form "file:////root" (with four or more leading slashes) are perhaps not technically valid URIs. See Section 3, "Syntax Components", of RFC 3986, which says, "When authority [i.e. netloc] is not present, the path cannot begin with two slash characters ('//')":
https://tools.ietf.org/html/rfc3986#section-3

However, I don't think that means Python shouldn't try to roundtrip it successfully. Also, git-clone is apparently okay with URLs of this form, and does the right thing with them.

vadmium · 2018-07-31T05:50:34Z

I think your URLs are valid by RFC 3986. "When authority is not present" refers to URLs without the double-slash prefix, such as "urn:example:animal:ferret:nose". The RFC treats empty authority and no authority as different cases. If authority is present, the format for hier-part has to be

"//" authority path-abempty

Authority may be an empty string:

authority = [userinfo "@"] host [":" port]
host = IP-literal / IPv4address / reg-name
reg-name = *(unreserved / pct-encoded / sub-delims)  ; May be empty

Path-abempty may begin with two slashes if the first two segments are empty strings:

path-abempty = *("/" segment)
segment = *pchar ; May be empty
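
One way to see the "empty versus absent" distinction is with the generic URI regular expression given in RFC 3986, Appendix B. The sketch below is only an illustration; the function name `authority_and_path` is made up here, and the regex splits components without validating them.

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(
    r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def authority_and_path(uri):
    match = URI_RE.match(uri)
    # Group 3 is None when there is no "//" at all (authority absent);
    # group 4 is '' when "//" is there but nothing follows it (authority empty).
    has_authority = match.group(3) is not None
    return has_authority, match.group(4), match.group(5)

for uri in ('urn:example:animal:ferret:nose', 'file:/foo',
            'file:///foo', 'file:////root/a'):
    print(uri, authority_and_path(uri))
```

For "file:////root/a" the authority is present but empty, and the remaining path is "//root/a", which is consistent with the path-abempty rule quoted above.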

cjerdonek · 2018-07-31T07:11:20Z

> The RFC treats empty authority and no authority as different cases.

I'm not well-versed in this, but I guess this means urllib.parse doesn't support this distinction. For example:

  >>> urllib.parse.urlsplit('file:/foo')
  SplitResult(scheme='file', netloc='', path='/foo', query='', fragment='')
  >>> urllib.parse.urlsplit('file:///foo')
  SplitResult(scheme='file', netloc='', path='/foo', query='', fragment='')
  >>> urllib.parse.urlsplit('file:/foo') == \
      urllib.parse.urlsplit('file:///foo')
  True

Both have authority / netloc equal to the empty string, even though in the first example the authority isn't present per your comment.

vadmium · 2018-07-31T10:58:51Z

Yes, urllib doesn’t distinguish a missing authority/netloc from an empty one. The same goes for the ?query and #fragment parts. bpo-22852 is open about that.
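
A quick sketch of the same effect for the query and fragment, using default arguments to urlsplit(). The URL and variable names are assumptions chosen for the example.

```python
from urllib.parse import urlsplit

# One URL with an empty query and an empty fragment, one with neither.
with_empty = urlsplit('http://example.com/a?#')
without = urlsplit('http://example.com/a')

# Both report empty strings, so the delimiters cannot be recovered.
print(repr(with_empty.query), repr(without.query))
print(repr(with_empty.fragment), repr(without.fragment))
```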

vladima · 2018-09-07T04:57:35Z

The file URI scheme is covered by RFC 8089, specifically Appendix E.3.2: https://tools.ietf.org/html/rfc8089#appendix-E.3.2.

serhiy-storchaka · 2023-12-29T09:18:57Z

Initially the condition was added in f3963b1, specifically to handle file: URLs starting with 4 slashes.

f3963b1#diff-c272b33cfc076b56637b73ae6fecd13ca0b999883a0a48b16a5c564b24bdb4c3R115

       if netloc or (scheme in uses_netloc and url[:2] == '//'):

But it was wrong for file: URLs starting with 3 slashes, so the condition was negated in 7dfb6e2 (bpo-591713/gh-36991).

7dfb6e2#diff-c272b33cfc076b56637b73ae6fecd13ca0b999883a0a48b16a5c564b24bdb4c3R131

       if netloc or (scheme in uses_netloc and url[:2] != '//'):

Since the first change was without tests, its initial purpose was lost.

I believe that the right condition should be

    if netloc or (scheme and scheme in uses_netloc) or url[:2] == '//':

Even if netloc is empty and the scheme is empty or does not require a netloc, // should be added if the path starts with //; otherwise the first component of the path will be confused with a netloc when parsed.
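
The proposed condition can be tried out in isolation with a sketch like the one below. It is a simplified, str-only stand-in for urlunsplit(), not the code that was eventually merged; `unsplit_proposed` and `USES_NETLOC` are hypothetical names, and the scheme set is an illustrative subset standing in for urllib.parse.uses_netloc.

```python
from urllib.parse import urlsplit

# Illustrative subset of schemes that take a netloc (assumption for this
# sketch); the empty scheme is included, as noted earlier in the thread.
USES_NETLOC = {'', 'file', 'http', 'https', 'ftp'}

def unsplit_proposed(scheme, netloc, path, query, fragment):
    # Hypothetical str-only variant of urlunsplit() using the proposed check.
    url = path
    if netloc or (scheme and scheme in USES_NETLOC) or url[:2] == '//':
        if url and url[:1] != '/':
            url = '/' + url
        url = '//' + netloc + url
    if scheme:
        url = scheme + ':' + url
    if query:
        url = url + '?' + query
    if fragment:
        url = url + '#' + fragment
    return url

# Same URLs as in the original report: zero to three extra leading slashes.
results = []
for i in range(4):
    original = 'file://{}root/a'.format(i * '/')
    rebuilt = unsplit_proposed(*urlsplit(original))
    results.append((original, rebuilt))
    print(original, rebuilt, original == rebuilt)
```

With this condition all four of the URLs from the original report come back unchanged, including the four- and five-slash forms.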

vadmium · 2024-05-12T05:20:40Z

After seeing that the “mailto:” RFC 6068 says slashes (/) have to be encoded in e.g. mailto:%2F%2Flocal@domain, I agree with adding the extra double-slash (//) regardless of scheme. (Previously I believed mailto://local@domain to be valid for an address of //local@domain, but I take that back.)
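
For reference, the encoded form mentioned above can be produced with urllib.parse.quote(). This is only a sketch of the string manipulation, with made-up variable names, not a full mailto: builder.

```python
from urllib.parse import quote, urlsplit

# Local part that begins with two slashes, as in the example address.
local_part = '//local'

# safe='' makes quote() percent-encode '/' as well.
url = 'mailto:' + quote(local_part, safe='') + '@domain'
print(url)

# The encoded URL has no literal '//' after the scheme, so nothing in it
# can be mistaken for a netloc.
parts = urlsplit(url)
print(repr(parts.netloc), repr(parts.path))
```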

cjerdonek added the 3.7 (EOL), 3.8, stdlib, and type-bug labels Jul 30, 2018

epicfaace mannequin added the 3.9 label Aug 27, 2019

ezio-melotti transferred this issue from another repository Apr 10, 2022

bedevere-app bot mentioned this issue Dec 29, 2023

gh-67693: Fix urlunparse() and urlunsplit() for URIs with path starting with multiple slashes and no authority #113563

Merged

serhiy-storchaka mentioned this issue Dec 29, 2023

[CVE-2015-2104] Urlparse insufficient validation leads to open redirect #67693

Open

vadmium mentioned this issue Apr 28, 2024

gh-87389: avoid treating path as URI with netloc #93894

Open

serhiy-storchaka closed this as completed May 14, 2024
