Issue 8818: urlsplit and urlparse add extra slash when using scheme

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53064

classification

Title:	urlsplit and urlparse add extra slash when using scheme
Type:	behavior	Stage:	resolved
Components:	Documentation	Versions:	Python 3.1, Python 3.2, Python 2.7, Python 2.6

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	adamnelson, docs@python, fdrake, orsenthil, r.david.murray
Priority:	normal	Keywords:

Created on 2010-05-25 14:39 by adamnelson, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (12)
msg106438 - (view)	Author: AdamN (adamnelson)	Date: 2010-05-25 14:39
urlsplit and urlparse place the host into the path when using a default scheme:

(Pdb) urlsplit('regionalhelpwanted.com/browseads/?sn=2',scheme='http')
SplitResult(scheme='http', netloc='', path='regionalhelpwanted.com/browseads/', query='sn=2', fragment='')

When using default_scheme as referenced in the documentation, it simply doesn't work:

(Pdb) urlsplit('regionalhelpwanted.com/browseads/?sn=2',default_scheme='http')
*** TypeError: urlsplit() got an unexpected keyword argument 'default_scheme'
msg106443 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-05-25 15:37
The keyword in the code is 'scheme'. I've updated the docs accordingly in r81521 and r81522.
msg106448 - (view)	Author: AdamN (adamnelson)	Date: 2010-05-25 16:53
Great, thanks. However, urlsplit and urlparse still take what one would expect to be recognized as the netloc and assign it to the 'path' key. If that is by design, perhaps we should at least warn people?
msg106452 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-05-25 17:33
I've added Senthil as nosy to double check me, but my understanding is that the scheme is just the part up to the colon. If you want to have a netloc in the URL, you have to precede it with a '//'. Otherwise there's no netloc.
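The rule described above can be checked with a short sketch. It assumes Python 3, where the urlparse module lives at urllib.parse; the host and path are illustrative values, not taken from the report.

```python
from urllib.parse import urlsplit

# With a leading '//' the host is parsed as the netloc, and the
# default scheme fills in the missing scheme.
with_slashes = urlsplit('//cnn.com/world', scheme='http')

# Without '//' there is no netloc; the whole string is a relative path.
without_slashes = urlsplit('cnn.com/world', scheme='http')

print(with_slashes)
print(without_slashes)
```

In both calls the scheme keyword only supplies a default for the scheme component; it does not change how the rest of the string is split.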
msg106453 - (view)	Author: AdamN (adamnelson)	Date: 2010-05-25 17:41
Ok, you're right:

>>> urlsplit('cnn.com')
SplitResult(scheme='', netloc='', path='cnn.com', query='', fragment='')
>>> urlsplit('//cnn.com')
SplitResult(scheme='', netloc='cnn.com', path='', query='', fragment='')
>>>

Although I see that nowhere in the documentation.

It seems to me that in the scenario most people are dealing with, where they are getting 'cnn.com' or 'http://cnn.com' but don't know which ahead of time, this will be useless. I don't see who would ever have '//cnn.com' without constructing that string specifically for urlsplit.

I would propose that '/whatever' becomes the path because it starts with slash, otherwise, it becomes the netloc and everything after the first slash becomes the path.
msg106455 - (view)	Author: Fred Drake (fdrake)	Date: 2010-05-25 17:53
On Tue, May 25, 2010 at 1:41 PM, AdamN <report@bugs.python.org> wrote:
> Although I see that nowhere in the documentation.

It needn't be in the urlparse documentation; the RFCs on URL syntax apply here. None of what's going on with the urlparse module is Python specific, as far as the URL interpretation is concerned.

> It seems to me that in the scenario most people are dealing with, where
> they are getting 'cnn.com' or 'http://cnn.com' but don't know which ahead
> of time, this will be useless. I don't see who would ever have '//cnn.com'
> without constructing that string specifically for urlsplit.

'cnn.com' isn't a URL, and there's no need for urlparse to handle it directly. That just complicates things.

Doing something above and beyond what the RFCs specify means you need to really think about the heuristics you're applying. If there's a useful set of heuristics that folks can agree on, that's a good case for a new module distributed on PyPI.

-Fred
msg106456 - (view)	Author: AdamN (adamnelson)	Date: 2010-05-25 18:04
I appreciate what you're saying but nobody, I guarantee nobody, is using the '//cnn.com' semantics. Anyway, in RFC 3986 in the Syntax Components section, you'll see that the '://' is not part of scheme or netloc. I could imagine urlsplit() failing if the url was not prepended by '//' or 'scheme://', but why would being prepended with nothing cause urlsplit() to presume it's a path? Can we at least document this?
msg106458 - (view)	Author: Fred Drake (fdrake)	Date: 2010-05-25 18:16
The module is documented as supporting "Relative Uniform Resource Locators", in which a value with a non-rooted path is supported using simply "non/rooted/path". See the third paragraph in the Python 2.6 documentation, starting "The module has been designed".
msg106461 - (view)	Author: AdamN (adamnelson)	Date: 2010-05-25 18:26
I think I misspoke before. What I'm referring to is when somebody uses the 'scheme' parameter: urlsplit('cnn.com',scheme='http') Is there no way that we can document that this won't work the way that people think it will? Is it really reasonable for a high-level language to expect people to have read a 100 page RFC in order to know that regular expressions will have to be used for this type of situation?
msg106463 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2010-05-25 18:41
How would you expect urlsplit to differentiate between a relative path and a path with a netloc? I would think that most people would expect the semantics the module provides without reading any additional documentation. I certainly did, to the point where when reading your example I didn't even notice that there was any problem report other than the misnaming of the scheme keyword :) You could suggest a clarification to the docs if you like.
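To make the ambiguity concrete, here is a minimal sketch of the kind of heuristic proposed earlier in this thread. It assumes Python 3 (urllib.parse); split_guessing_host is a hypothetical helper written for illustration, not part of the standard library, and its rule (treat any string without a leading scheme or slash as starting with a host) is an assumption rather than anything the RFCs define.

```python
import re
from urllib.parse import urlsplit

# Matches a leading "scheme://" such as "http://" (illustrative pattern).
_SCHEME_PREFIX = re.compile(r'[A-Za-z][A-Za-z0-9+.-]*://')


def split_guessing_host(text, scheme='http'):
    """Hypothetical helper: parse a bare 'host/path' string as if it had a netloc."""
    if _SCHEME_PREFIX.match(text) or text.startswith('/'):
        # Already has a scheme, a '//' netloc marker, or a rooted path.
        return urlsplit(text, scheme=scheme)
    # Guess that the first segment is a host by adding the '//' marker.
    return urlsplit('//' + text, scheme=scheme)


guessed = split_guessing_host('regionalhelpwanted.com/browseads/?sn=2')
mistaken = split_guessing_host('images/logo.png')
print(guessed)
print(mistaken)
```

The second call shows the cost of the guess: a relative reference such as 'images/logo.png' is read as a host named 'images', which is exactly the case urlsplit cannot tell apart on its own.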
msg106465 - (view)	Author: AdamN (adamnelson)	Date: 2010-05-25 19:03
I would say right under:

urlparse.urlparse(urlstring[, default_scheme[, allow_fragments]])

Put:

urlstring is a pseudo-url. If the string has a scheme, it will be interpreted as a scheme, followed by a path, querystring and fragment. If it is prepended with a double-slash '//', it will be interpreted as a netloc followed by a path, querystring and fragment. Otherwise, it will be interpreted as a path followed by a querystring and fragment.

I'm still confused about when anybody would use a relative path with a default scheme and no netloc but I'll leave that decision to you guys.

Thanks,
Adam
msg106468 - (view)	Author: Fred Drake (fdrake)	Date: 2010-05-25 19:09
On Tue, May 25, 2010 at 3:03 PM, AdamN <report@bugs.python.org> wrote:
> I'm still confused about when anybody would use a relative path with a default scheme and no netloc but I'll leave that decision to you guys.

The strings are not pseudo-URLs, they're relative references, as documented. This is used all the time in HREF and SRC attributes in web pages, which is exactly the use case for urlparse.urljoin().
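The urljoin() use case mentioned above can be sketched as follows, assuming Python 3 (urllib.parse); the base URL and the references are made-up examples of what might appear in HREF or SRC attributes.

```python
from urllib.parse import urljoin

# Hypothetical page URL against which the references are resolved.
base = 'http://example.com/news/index.html'

# A non-rooted relative path resolves against the base document's directory.
relative = urljoin(base, 'images/logo.png')

# A rooted path keeps the base scheme and host.
rooted = urljoin(base, '/about')

# A '//host' reference keeps only the base scheme.
network_path = urljoin(base, '//cdn.example.com/app.js')

print(relative)      # http://example.com/news/images/logo.png
print(rooted)        # http://example.com/about
print(network_path)  # http://cdn.example.com/app.js
```

All three reference forms resolve without ambiguity only because a bare first segment is always treated as a path, never as a host.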

History
Date	User	Action	Args
2022-04-11 14:57:01	admin	set	github: 53064
2010-10-17 09:35:04	georg.brandl	set	status: open -> closed
2010-05-25 19:09:40	fdrake	set	messages: + msg106468
2010-05-25 19:03:39	adamnelson	set	messages: + msg106465
2010-05-25 18:41:58	r.david.murray	set	messages: + msg106463
2010-05-25 18:26:15	adamnelson	set	messages: + msg106461
2010-05-25 18:16:05	fdrake	set	messages: + msg106458
2010-05-25 18:04:39	adamnelson	set	messages: + msg106456
2010-05-25 17:53:24	fdrake	set	nosy: + fdrake messages: + msg106455
2010-05-25 17:41:51	adamnelson	set	messages: + msg106453
2010-05-25 17:33:00	r.david.murray	set	assignee: docs@python -> messages: + msg106452 nosy: + orsenthil
2010-05-25 16:53:55	adamnelson	set	status: closed -> open messages: + msg106448
2010-05-25 15:37:21	r.david.murray	set	status: open -> closed assignee: docs@python components: + Documentation, - Library (Lib) versions: + Python 3.1, Python 2.7, Python 3.2 nosy: + docs@python, r.david.murray messages: + msg106443 resolution: fixed stage: resolved
2010-05-25 14:40:29	adamnelson	set	type: behavior
2010-05-25 14:39:57	adamnelson	create