Message 368623 - Python tracker

Author	David Bell
Recipients	David Bell
Date	2020-05-11.11:53:49
Message-id	<1589198030.38.0.614579410038.issue40594@roundup.psfhosted.org>
In-reply-to

Content

In Python 3.5 the urljoin function was rewritten to be RFC 3986 compliant and to fix long-standing issues. The initial rewrite accidentally added duplicate slashes, so code was added to prevent that. The discussion is here: https://bugs.python.org/issue22118

The code within urljoin is this:

# filter out elements that would cause redundant slashes on re-joining
# the resolved_path
segments[1:-1] = filter(None, segments[1:-1])

This seems sensible: you would not want double slashes in a URL, right? The problem is that double slashes are perfectly legal in a URI/URL and, for reasons I don't understand, they are in use in the wild. The code above was written to remove the double slashes that urljoin accidentally introduced, but it also removes intentional ones:

>>> urljoin("http://www.example.com/", "this//double/path")
'http://www.example.com/this/double/path'

Whereas the expected result is:

'http://www.example.com/this//double/path'
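
To see where the empty segment is lost, the segment handling can be traced by hand. This is only a minimal sketch that mirrors the lines quoted in this message; the helper name merge_with_filter is hypothetical and is not part of urllib.parse, and dot-segment handling is left out.

```python
def merge_with_filter(base_path: str, rel_path: str) -> list[str]:
    """Mimic the segment handling quoted in this message (hypothetical helper)."""
    base_parts = base_path.split("/")
    if base_parts[-1] != "":
        # The last item is not a directory, so it is dropped.
        del base_parts[-1]
    segments = base_parts + rel_path.split("/")
    # The filter under discussion: removes every empty interior segment,
    # whether it came from the join or from the caller's own path.
    segments[1:-1] = filter(None, segments[1:-1])
    return segments


segments = merge_with_filter("/", "this//double/path")
print(segments)            # ['', 'this', 'double', 'path']
print("/".join(segments))  # /this/double/path
```

The base path "/" splits into two empty strings, and the relative path contributes a third empty string between "this" and "double". The filter cannot tell these apart, so the intentional one is removed along with the accidental one.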

I suggest that the fix for this is to remove the aforementioned filter code, i.e. remove this:

# filter out elements that would cause redundant slashes on re-joining
# the resolved_path
segments[1:-1] = filter(None, segments[1:-1])

...and remove this code too:

if base_parts[-1] != '':
    # the last item is not a directory, so will not be taken into account
    # in resolving the relative path
    del base_parts[-1]

and instead simply add:

del base_parts[-1]

...because the last part of the split base URL should always be deleted. If the last element of the list (the split base URL) is an empty string, then the URL ended with a slash, and we should remove that element; otherwise a double slash will appear when we combine the base with the second parameter to urljoin. If the last element is not an empty string, then the last part of the URL was not a directory and should be removed. Either way, the last element should always be removed.

By following this logic, the "remove all double slashes" filter is not necessary, because the cause of the accidental double slashes is removed.
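
The suggested change can be sketched in isolation as follows. This is an illustration under assumptions, not a patch to urllib.parse: the function name merge_always_drop_last is hypothetical, it works on path strings only, and it omits the "." and ".." resolution that urljoin performs afterwards.

```python
def merge_always_drop_last(base_path: str, rel_path: str) -> str:
    """Sketch of the proposed merge step (hypothetical helper)."""
    base_parts = base_path.split("/")
    # Proposed change: always discard the final element. It is either an
    # empty string (base ended with "/") or a non-directory component.
    del base_parts[-1]
    if rel_path[:1] == "/":
        # An absolute path reference ignores the base path entirely.
        segments = rel_path.split("/")
    else:
        segments = base_parts + rel_path.split("/")
    # No filter here: empty segments supplied by the caller survive.
    return "/".join(segments)


print(merge_always_drop_last("/", "this//double/path"))  # /this//double/path
print(merge_always_drop_last("/a/b/", "c"))              # /a/b/c
print(merge_always_drop_last("/a/b", "c"))               # /a/c
```

Dropping the last element unconditionally appears to match the merge rule in RFC 3986 section 5.2.3, which appends the reference path to everything in the base path up to and including the right-most "/".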

History
Date	User	Action	Args
2020-05-11 11:53:50	David Bell	set	recipients: + David Bell
2020-05-11 11:53:50	David Bell	set	messageid: <1589198030.38.0.614579410038.issue40594@roundup.psfhosted.org>
2020-05-11 11:53:50	David Bell	link	issue40594 messages
2020-05-11 11:53:49	David Bell	create