This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: urljoin since 3.5 incorrectly filters out double slashes
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7, Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: David Bell, Ido Michael, iritkatriel, op368, orsenthil, taleinat
Priority: normal Keywords:

Created on 2020-05-11 11:53 by David Bell, last changed 2022-04-11 14:59 by admin.

Messages (9)
msg368623 - (view) Author: David Bell (David Bell) Date: 2020-05-11 11:53
In Python 3.5 the urljoin function was rewritten to be RFC 3986 compliant and fix long standing issues. In the initial rewrite duplicate slashes were added by accident, and so code was added to prevent that. The discussion is here: https://bugs.python.org/issue22118

The code within urljoin is this:

# filter out elements that would cause redundant slashes on re-joining
# the resolved_path
segments[1:-1] = filter(None, segments[1:-1])

This seems sensible, you would not want double slashes in a URL, right? The problem is: double slashes are perfectly legal in a URI/URL, and for reasons I don't understand, they are in use in the wild. The code above was written to remove them because urljoin accidentally introduced them, the problem is that it also removes intentional double slashes:

>>> urljoin("http://www.example.com/", "this//double/path")
'http://www.example.com/this/double/path'

Where as the expected result should be:

'http://www.example.com/this//double/path'

I suggest that the fix for this is to remove the aforementioned filter code, e.g. remove this:

# filter out elements that would cause redundant slashes on re-joining
# the resolved_path
segments[1:-1] = filter(None, segments[1:-1])

...and remove this code too:

if base_parts[-1] != '':
    # the last item is not a directory, so will not be taken into account
    # in resolving the relative path
    del base_parts[-1]

and instead simply add:

del base_parts[-1]

...because the last part of the split base URL should always be deleted. If the last element of the list (the base URL split) is an empty string, then the URL ended with a slash, and so we should remove the last element otherwise a double slash will occur when we combine it with the second parameter to urljoin. If the last element is not an empty string then the last part of the URL was not a directory, and should be removed. Thus the last element should always be removed. 

By following this logic the "remove all double slashes" filter is not necessary, because the cause of the accidental addition of double slashes is removed.
msg370821 - (view) Author: Ido Michael (Ido Michael) * Date: 2020-06-06 13:49
I can take this
msg371092 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-06-09 11:46
>  The problem is: double slashes are perfectly legal in a URI/URL, and for reasons I don't understand, they are in use in the wild

And what is the de-facto behavior of browsers, when double URL is used?
As far as I know, the browsers remove them, combine them as a single URL. 
(Tested on Chrome with some URLs to confirm on June 2020)

```

Where as the expected result should be:

'http://www.example.com/this//double/path'
```

This statement should be supported by

a) An RFC
b) Or behavior or some popular clients.
c) Or libcurl exhibiting this behavior will gain a huge support too.


Please share if you have any references and we should confirm ourselves to those.
msg371098 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2020-06-09 12:20
According to section 3.3 of RFC 3986[1], and also RFC 2396[2] which preceded it, a path is made up of zero or more consecutive "/" + section  pieces, and a section may be empty. In other words, paths may include consecutive slashes.

Several StackOverflow answers to questions on this subject (e.g. this[3], and this[4]) also agree that consecutive slashes are valid, based on the aforementioned RFCs and other references.

.. [1]: https://tools.ietf.org/html/rfc3986#section-3.3
.. [2]: https://tools.ietf.org/html/rfc2396#section-3.3
.. [3]: https://stackoverflow.com/questions/20523318/is-a-url-with-in-the-path-section-valid
.. [4]: https://stackoverflow.com/questions/10161177/url-with-multiple-forward-slashes-does-it-break-anything
msg371101 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-06-09 12:51
Hi Tal, 

Thank you. To be specific, it is about parsing behavior, specifically urljoin behavior,

There is no question about the validity of the double-slash in the URL. 

If we find a source (like libcurl) which exhibits parsing behavior different than CPython, or sections in docs which give examples which our current test suite fails, those serve as good sources for expected results.
msg371135 - (view) Author: Tal Einat (taleinat) * (Python committer) Date: 2020-06-09 18:55
Putting aside backwards compatibility, I would argue the opposite: Since consecutive slashes are valid, I suggest we would need to see that collapsing them into a single slash is the status quo, otherwise we should avoid such collapsing.

Here's some evidence that we should keep consecutive slashes:

1.
The curl cli does not appear to collapse consecutive slashes in URLs before sending them:

$ nc -l localhost 8000 &
[1] 39380
$ curl --dump-header - --proxy localhost:8000 --silent --max-time 1 http://www.example.com/this//double/path
GET http://www.example.com/this//double/path HTTP/1.1
Host: www.example.com
User-Agent: curl/7.64.1
Accept: */*
Proxy-Connection: Keep-Alive

[1]  + 39380 done       nc -l localhost 8000


2.
With NodeJS v10.18.0 and v12.16.1:

> const urllib = require('url');
> new url.URL("this//double/path", "http://www.example.com/").href
'http://www.example.com/this//double/path'


For me this is evidence enough that urljoin should *not* be collapsing consecutive slashes.
msg371685 - (view) Author: Open Close (op368) * Date: 2020-06-16 17:15
To Senthil Kumaran:

https://bugs.python.org/issue40594#msg371092
> As far as I know, the browsers remove them, combine them as a single URL. 
> (Tested on Chrome with some URLs to confirm on June 2020)

How did you test them?
msg371691 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2020-06-16 19:29
> How did you test them?

https://about.google//products/
https://www.google.com///search?q=something

Were redirecting, and collapsing multiple slashes to single.

Now, when I try this on:

> https://en.wikipedia.org/w//index.php?search=foobar+something

was not.

Sorry. It is false to say that it was a default client behavior of chrome.
msg407503 - (view) Author: Irit Katriel (iritkatriel) * (Python committer) Date: 2021-12-01 23:04
See also 25403, 37235.
History
Date User Action Args
2022-04-11 14:59:30adminsetgithub: 84774
2021-12-01 23:04:01iritkatrielsetnosy: + iritkatriel
messages: + msg407503
2020-06-16 19:29:24orsenthilsetmessages: + msg371691
2020-06-16 17:15:44op368setnosy: + op368
messages: + msg371685
2020-06-09 18:55:11taleinatsetmessages: + msg371135
2020-06-09 12:51:29orsenthilsetmessages: + msg371101
2020-06-09 12:20:47taleinatsetmessages: + msg371098
2020-06-09 11:46:35orsenthilsetmessages: + msg371092
2020-06-07 04:14:46ned.deilysetnosy: + orsenthil
2020-06-06 13:49:09Ido Michaelsetnosy: + Ido Michael, taleinat
messages: + msg370821
2020-05-11 11:53:50David Bellcreate