This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: urlparse.urlsplit mishandles novel schemes
Type: behavior
Stage: test needed
Components: Library (Lib)
Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: orsenthil
Nosy List: eric.araujo, ezio.melotti, fdrake, mbloore, orsenthil, r.david.murray, tseaver
Priority: normal
Keywords:

Created on 2010-02-10 23:24 by mbloore, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description
fix7904.txt mbloore, 2010-02-11 18:20 svn diff of fix and unit test against 2.7 repository.
fix7904-2.txt mbloore, 2010-02-17 21:09 svn diff of fix and unit test against 2.7 repository.
Messages (14)
msg99181 - Author: mARK (mbloore) Date: 2010-02-10 23:24
urlparse.urlsplit('s3://example/files/photos/161565.jpg')
returns
('s3', '', '//example/files/photos/161565.jpg', '', '')
instead of
('s3', 'example', '/files/photos/161565.jpg', '', '')

According to RFC 3986, 's3' is a valid scheme name, so the '://' indicates a URL with netloc and path parts.
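
For reference, a minimal check of the expected result. It assumes a current Python 3 interpreter, where the function lives in urllib.parse rather than in the 2.x urlparse module discussed in this issue:

```python
from urllib.parse import urlsplit

parts = urlsplit('s3://example/files/photos/161565.jpg')

# On an interpreter that includes the fix, the authority lands in netloc
# and the path starts with a single slash.
print(parts.scheme, parts.netloc, parts.path)
```

On the affected 2.x versions the equivalent call would be urlparse.urlsplit, with the result shown in the report above.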
msg99183 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-02-10 23:28
Thanks for the report, could you provide a patch with unit tests?
msg99196 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-02-11 03:48
Does s3 stand for the Amazon S3 service? urlparse does not have it in its list of known schemes yet. Does s3 have a specification as such, or is it aligned with any of the known schemes (like http or ftp)?

s3 is a valid scheme name according to RFC 3986, but the urlparse module does not seem to recognize it. If we consider s3 to be similar to http, then it can be added to the list of known schemes. Does Amazon say anything about it?
msg99198 - Author: mARK (mbloore) Date: 2010-02-11 04:53
It's not actually necessary to have a list of known schemes.  Any URL that has a double slash after the colon is expected to follow that with an authority section (what urlparse calls "netloc"), optionally followed by a path, which starts with a slash.

There are various defined schemes with their own syntax within the URL framework, but one is free to invent new ones with the general form
scheme://netloc/path
msg99229 - Author: mARK (mbloore) Date: 2010-02-11 18:20
I have attached an svn diff of my (very simple!) fix and an added unit test for Python 2.7.
msg99256 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-02-12 02:58
Hello Mark, 
Thanks for the patch.

However, there are reasons why the check is:

"if scheme in uses_netloc and url[:2] == '//':"
It cannot be replaced by just url[:2] == '//' as in your patch.

Different protocols have different parsing requirements (for example, some treat everything after the scheme as their path).

The better way is to add 's3' to the uses_netloc list, which should work just as well. I shall add it and include your tests. Thanks.
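
To make the gated check concrete, here is a simplified model of it, not the module's actual code. The list contents and the split_gated helper are hypothetical, and the input is assumed to contain a scheme followed by a colon:

```python
# Hypothetical stand-in for the module's uses_netloc list (not its real contents).
USES_NETLOC = ['ftp', 'http', 'https']

def split_gated(url):
    """Split off a netloc only when the scheme is in the known list."""
    scheme, _, rest = url.partition(':')
    netloc = ''
    if scheme in USES_NETLOC and rest[:2] == '//':
        # The netloc runs from after '//' up to the next '/' (or the end).
        end = rest.find('/', 2)
        if end < 0:
            end = len(rest)
        netloc, rest = rest[2:end], rest[end:]
    return scheme, netloc, rest

print(split_gated('http://example/files'))
print(split_gated('s3://example/files/photos/161565.jpg'))
```

With this model, an unlisted scheme such as s3 keeps the '//example' prefix in its path, which matches the behavior reported at the top of this issue.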
msg99265 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-02-12 13:41
I think Mark is correct.  RFC 3986 says:

When authority is present, the path must either be empty or begin with a slash ("/") character.  When authority is not present, the path cannot begin with two slash characters ("//").

I think it would make sense to have urlparse fall back to doing a generic RFC 3986 parse when it does not recognize the scheme.
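
As a rough illustration of what a generic parse could look like, the sketch below uses the regular expression given in RFC 3986, Appendix B. The generic_split name is hypothetical, and this is not the code that was committed for this issue:

```python
import re

# Regular expression from RFC 3986, Appendix B.
URI_RE = re.compile(r'^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?')

def generic_split(uri):
    """Split a URI into (scheme, authority, path, query, fragment)."""
    m = URI_RE.match(uri)
    scheme, authority, path, query, fragment = m.group(2, 4, 5, 7, 9)
    # Absent components come back as None; collapse them to '' so the
    # result has the same shape as the tuples shown in this issue.
    return (scheme or '', authority or '', path, query or '', fragment or '')

print(generic_split('s3://example/files/photos/161565.jpg'))
print(generic_split('mailto:user@example.com'))
```

Note that collapsing None to '' discards the distinction between an absent component and an empty one, which a stricter parser might want to keep.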
msg99290 - Author: mARK (mbloore) Date: 2010-02-12 21:12
The case which prompted this issue was a purely private set of URLs, sent to me by a client but never sent to Amazon or anywhere else outside our systems (though I'm sure many others have invented this particular scheme for their own use).  It would have been convenient if urlparse had handled it properly.  That is true for any scheme one may invent at need.

On second thought it does make sense to enforce the use of :// for the schemes in uses_netloc, but still not to ignore its meaning for other schemes.  It also makes sense to add s3 to uses_netloc despite the fact that it is not (afaik) registered, since it is an obvious invention.

I'll make another patch, but I don't have time to do it just now.
msg99480 - Author: mARK (mbloore) Date: 2010-02-17 21:09
Doing a fallback test for // would look like
if scheme in uses_netloc and url[:2] == '//' or url[:2] == '//':

but this is equivalent to 
if url[:2] == '//':

i.e., an authority appears if and only if there is a // after the scheme.

This still allows a uses_netloc scheme to appear without //.
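
The equivalence claimed above can be checked exhaustively. In the sketch below the two boolean names are placeholders for "scheme in uses_netloc" and "url[:2] == '//'"; Python parses "a and b or c" as "(a and b) or c", so the parentheses only make the original grouping explicit:

```python
from itertools import product

# Enumerate every combination of the two conditions in the test.
rows = []
for in_uses_netloc, has_slashes in product([False, True], repeat=2):
    fallback = (in_uses_netloc and has_slashes) or has_slashes
    rows.append((in_uses_netloc, has_slashes, fallback))
    print(in_uses_netloc, has_slashes, fallback)

# The fallback form agrees with the bare '//' test in all four cases.
print(all(fallback == has_slashes for _, has_slashes, fallback in rows))
```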

I have attached a patch against Python 2.7, svn revision 78212.  It adds s3 to uses_netloc.
msg99560 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-02-19 07:47
Fixed in r78234 and merged to the other branches.
I fell back to the RFC's definition of a scheme: anything before the ://.
I did not see the need to add s3 specifically as a valid scheme type, because s3 itself is not a registered scheme type.
So, the fix should work for s3 and other undefined schemes as per the RFC.

Thanks for the patch.
msg104261 - Author: Tres Seaver (tseaver) * Date: 2010-04-26 17:38
The fix for this bug breaks any code which worked with non-standard
schemes in 2.6.4 (by working around the issue).  This kind of backward
incompatibility should be called out prominently in NEWS.txt (assuming
that such a fix is considered appropriate in a third-dot release).
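
One way calling code might tolerate both behaviors is to normalize the result after splitting. The helper below is a hypothetical sketch, not taken from any of the affected applications. It assumes the old-style result differs only in where the authority sits, and it does not address query or fragment handling:

```python
def normalize_parts(parts):
    """Return a 5-tuple with the authority moved into netloc if needed."""
    scheme, netloc, path, query, fragment = parts
    if not netloc and path.startswith('//'):
        # Old-style result: the authority is still at the front of the path.
        netloc, slash, rest = path[2:].partition('/')
        path = slash + rest
    return (scheme, netloc, path, query, fragment)

old_style = ('s3', '', '//example/files/photos/161565.jpg', '', '')
new_style = ('s3', 'example', '/files/photos/161565.jpg', '', '')
print(normalize_parts(old_style))
print(normalize_parts(new_style))
```

Both inputs yield the same tuple, so code written this way would not depend on whether the interpreter includes the fix.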
msg105078 - Author: Éric Araujo (eric.araujo) * (Python committer) Date: 2010-05-05 19:14
I remember seeing a discussion in the python-dev archives about that months or years ago. Someone pointed out to Guido that the new RFC removed the need for uses_netloc thanks to the generic syntax. Isn’t there already a bug about that?
msg123300 - Author: Fred Drake (fdrake) (Python committer) Date: 2010-12-03 22:33
Though msg104261 suggests this change be documented in NEWS.txt, it doesn't appear to have made it.

Sure enough, we just found application code that this broke.
msg123327 - Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-12-04 10:02
On Fri, Dec 03, 2010 at 10:33:50PM +0000, Fred L. Drake, Jr. wrote:
> Though msg104261 suggests this change be documented in NEWS.txt, it
> doesn't appear to have made it.

Better late than never. I just added the NEWS entry in r87014 (py3k), r87015 (release31-maint), and r87016 (release27-maint).
History
Date User Action Args
2022-04-11 14:56:57 admin set github: 52152
2010-12-04 10:02:42 orsenthil set messages: + msg123327
2010-12-03 22:33:47 fdrake set nosy: + fdrake
messages: + msg123300
2010-05-05 19:14:32 eric.araujo set nosy: + eric.araujo
messages: + msg105078
2010-04-26 17:38:54 tseaver set nosy: + tseaver
messages: + msg104261
2010-02-19 07:47:30 orsenthil set status: open -> closed
resolution: fixed
messages: + msg99560
2010-02-17 21:09:37 mbloore set files: + fix7904-2.txt
messages: + msg99480
2010-02-12 21:13:58 mbloore set nosy: orsenthil, ezio.melotti, mbloore, r.david.murray
components: + Library (Lib), - Extension Modules
versions: + Python 3.1, Python 3.2
2010-02-12 21:12:06 mbloore set nosy: orsenthil, ezio.melotti, mbloore, r.david.murray
messages: + msg99290
components: + Extension Modules, - Library (Lib)
versions: - Python 3.1, Python 3.2
2010-02-12 13:41:48 r.david.murray set nosy: + r.david.murray
messages: + msg99265
versions: + Python 3.1, Python 3.2
2010-02-12 02:58:48 orsenthil set nosy: orsenthil, ezio.melotti, mbloore
messages: + msg99256
components: + Library (Lib), - Extension Modules
2010-02-11 18:20:37 mbloore set files: + fix7904.txt
messages: + msg99229
title: urllib.urlparse mishandles novel schemes -> urlparse.urlsplit mishandles novel schemes
2010-02-11 04:53:11 mbloore set messages: + msg99198
2010-02-11 03:48:18 orsenthil set assignee: orsenthil
messages: + msg99196
nosy: + orsenthil
2010-02-10 23:28:06 ezio.melotti set priority: normal
versions: + Python 2.6, Python 2.7, - Python 2.5
nosy: + ezio.melotti
messages: + msg99183
stage: test needed
2010-02-10 23:24:49 mbloore create