classification
Title: urlsplit and urlparse add extra slash when using scheme
Type: behavior Stage: resolved
Components: Documentation Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: adamnelson, docs@python, fdrake, orsenthil, r.david.murray
Priority: normal Keywords:

Created on 2010-05-25 14:39 by adamnelson, last changed 2010-10-17 09:35 by georg.brandl. This issue is now closed.

Messages (12)
msg106438 - Author: AdamN (adamnelson) Date: 2010-05-25 14:39
urlsplit and urlparse place the host into the path when using a default scheme:

(Pdb) urlsplit('regionalhelpwanted.com/browseads/?sn=2',scheme='http')
SplitResult(scheme='http', netloc='', path='regionalhelpwanted.com/browseads/', query='sn=2', fragment='')

When using default_scheme as referenced in the documentation, it simply doesn't work:

(Pdb) urlsplit('regionalhelpwanted.com/browseads/?sn=2',default_scheme='http')
*** TypeError: urlsplit() got an unexpected keyword argument 'default_scheme'
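For reference, the behavior reported above can be reproduced in a short script. This is a minimal sketch that assumes Python 3, where the same functions live in `urllib.parse` (the report itself concerns the Python 2 `urlparse` module); the `example.com` URL is an illustrative placeholder, not part of the report.

```python
from urllib.parse import urlsplit

# No "//" in the input, so the host-looking text ends up in `path`;
# `scheme` is only a default for inputs that carry no scheme of their own.
no_slashes = urlsplit("regionalhelpwanted.com/browseads/?sn=2", scheme="http")
print(no_slashes)

# An explicit scheme in the input wins over the default.
explicit = urlsplit("https://example.com/browseads/", scheme="http")
print(explicit.scheme, explicit.netloc)

# The keyword is `scheme`; `default_scheme` is rejected.
keyword_error = None
try:
    urlsplit("regionalhelpwanted.com/browseads/?sn=2", default_scheme="http")
except TypeError as exc:
    keyword_error = exc
    print("TypeError:", exc)
```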
msg106443 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-05-25 15:37
The keyword in the code is 'scheme'.  I've updated the docs accordingly in r81521 and r81522.
msg106448 - Author: AdamN (adamnelson) Date: 2010-05-25 16:53
Great, thanks.

However, urlsplit and urlparse still take what one would expect to be recognized as the netloc and assign it to the 'path' key.  If that is by design, perhaps we should at least warn people?
msg106452 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-05-25 17:33
I've added Senthil as nosy to double check me, but my understanding is that the scheme is just the part up to the colon.  If you want to have a netloc in the URL, you have to precede it with a '//'.  Otherwise there's no netloc.
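The rule described here (the scheme ends at the colon, and only a following '//' starts a netloc) can be checked directly. The sketch below assumes Python 3's `urllib.parse`; the `mailto:` address and the `http:cnn.com/world` input are illustrative values chosen for this example.

```python
from urllib.parse import urlsplit

# Everything before the first ":" is the scheme, whatever follows it.
mail = urlsplit("mailto:user@example.com")
print(mail.scheme, repr(mail.netloc), mail.path)

# Without "//" after the colon there is still no netloc.
no_authority = urlsplit("http:cnn.com/world")
print(no_authority.scheme, repr(no_authority.netloc), no_authority.path)

# "//" introduces the netloc, which runs up to the next "/", "?" or "#".
with_authority = urlsplit("http://cnn.com/world")
print(with_authority.scheme, with_authority.netloc, with_authority.path)
```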
msg106453 - Author: AdamN (adamnelson) Date: 2010-05-25 17:41
Ok, you're right:

>>> urlsplit('cnn.com')
SplitResult(scheme='', netloc='', path='cnn.com', query='', fragment='')
>>> urlsplit('//cnn.com')
SplitResult(scheme='', netloc='cnn.com', path='', query='', fragment='')
>>> 

Although I see that nowhere in the documentation.  It seems to me that in the scenario most people are dealing with, where they are getting 'cnn.com' or 'http://cnn.com' but don't know which ahead of time, this will be useless.  I don't see who would ever have '//cnn.com' without constructing that string specifically for urlsplit.

I would propose that '/whatever' becomes the path because it starts with slash, otherwise, it becomes the netloc and everything after the first slash becomes the path.
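The heuristic proposed above could be layered on top of urlsplit rather than built into it. The following is only a rough sketch under stated assumptions: it uses Python 3's `urllib.parse`, and `split_guessing_host` is a hypothetical helper name invented for this illustration, not a function in the standard library.

```python
from urllib.parse import urlsplit

def split_guessing_host(url, scheme="http"):
    """Hypothetical helper: treat a bare 'host/path' string as a netloc.

    Inputs that already have a scheme or netloc, or that start with "/",
    are passed to urlsplit unchanged. Anything else gets "//" prepended
    so that the text before the first "/" is parsed as the netloc.
    host:port inputs are not covered by this sketch.
    """
    bare = urlsplit(url)
    if bare.scheme or bare.netloc or url.startswith("/"):
        return urlsplit(url, scheme=scheme)
    return urlsplit("//" + url, scheme=scheme)

guessed = split_guessing_host("cnn.com/world?x=1")
rooted = split_guessing_host("/whatever")
full = split_guessing_host("http://cnn.com")
ambiguous = split_guessing_host("images/logo.png")
print(guessed, rooted, full, ambiguous, sep="\n")
```

Note that the last call treats `images` as a host name, because a relative path and a bare host cannot be told apart from the string alone.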
msg106455 - Author: Fred Drake (fdrake) (Python committer) Date: 2010-05-25 17:53
On Tue, May 25, 2010 at 1:41 PM, AdamN <report@bugs.python.org> wrote:
> Although I see that nowhere in the documentation.

It needn't be in the urlparse documentation; the RFCs on URL syntax
apply here.  None of what's going on with the urlparse module is
Python specific, as far as the URL interpretation is concerned.

> It seems to me that in the scenario most people are dealing with, where
> they are getting 'cnn.com' or 'http://cnn.com' but don't know which ahead
> of time, this will be useless.  I don't see who would ever have '//cnn.com'
> without constructing that string specifically for urlsplit.

'cnn.com' isn't a URL, and there's no need for urlparse to handle it
directly.  That just complicates things.

Doing something above and beyond what the RFCs specify means you need
to really think about the heuristics you're applying.  If there's a
useful set of heuristics that folks can agree on, that's a good case
for a new module distributed on PyPI.

  -Fred
msg106456 - Author: AdamN (adamnelson) Date: 2010-05-25 18:04
I appreciate what you're saying but nobody, I guarantee nobody, is using the '//cnn.com' semantics.

Anyway, in RFC 3986 in the Syntax Components section, you'll see that the '://' is not part of scheme or netloc.  I could imagine urlsplit() failing if the URL was not prefixed with '//' or 'scheme://', but why would having no prefix at all cause urlsplit() to presume it's a path?

Can we at least document this?
msg106458 - Author: Fred Drake (fdrake) (Python committer) Date: 2010-05-25 18:16
The module is documented as supporting "Relative Uniform Resource
Locators", in which a value with a non-rooted path is supported using
simply "non/rooted/path".

See the third paragraph in the Python 2.6 documentation, starting "The
module has been designed".
msg106461 - Author: AdamN (adamnelson) Date: 2010-05-25 18:26
I think I misspoke before.  What I'm referring to is when somebody uses the 'scheme' parameter:

urlsplit('cnn.com',scheme='http')

Is there no way that we can document that this won't work the way that people think it will?  Is it really reasonable for a high-level language to expect people to have read a 100-page RFC in order to know that regular expressions will have to be used for this type of situation?
msg106463 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-05-25 18:41
How would you expect urlsplit to differentiate between a relative path and a path with a netloc?  I would think that most people would expect the semantics the module provides without reading any additional documentation.  I certainly did, to the point where when reading your example I didn't even notice that there was any problem report other than the misnaming of the scheme keyword :)

You could suggest a clarification to the docs if you like.
msg106465 - Author: AdamN (adamnelson) Date: 2010-05-25 19:03
I would say right under:

urlparse.urlparse(urlstring[, default_scheme[, allow_fragments]])

Put:

urlstring is a pseudo-url.  If the string has a scheme, it will be interpreted as a scheme, followed by a path, querystring and fragment.  If it is prepended with a double-slash '//', it will be interpreted as a netloc followed by a path, querystring and fragment.  Otherwise, it will be interpreted as a path followed by a querystring and fragment.
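The three cases in the proposed wording can be laid side by side. This is a small sketch assuming Python 3's `urllib.parse.urlparse`; the sample strings are illustrative.

```python
from urllib.parse import urlparse

# Case 1: an explicit scheme followed by "//" and a netloc.
with_scheme = urlparse("http://cnn.com/world?x=1#top")

# Case 2: a leading "//" with no scheme; the netloc is still recognized.
double_slash = urlparse("//cnn.com/world")

# Case 3: neither a scheme nor "//"; the whole string is a path.
bare = urlparse("cnn.com/world")

for result in (with_scheme, double_slash, bare):
    print(result)
```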

I'm still confused about when anybody would use a relative path with a default scheme and no netloc but I'll leave that decision to you guys.  

Thanks,
Adam
msg106468 - Author: Fred Drake (fdrake) (Python committer) Date: 2010-05-25 19:09
On Tue, May 25, 2010 at 3:03 PM, AdamN <report@bugs.python.org> wrote:
> I'm still confused about when anybody would use a relative path with a default scheme and no netloc but I'll leave that decision to you guys.

The strings are not pseudo-URLs, they're relative references, as documented.

This is used all the time in HREF and SRC attributes in web pages,
which is exactly the use case for urlparse.urljoin().
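To make the HREF/SRC use case concrete, here is a brief sketch of resolving relative references against a page URL. It assumes Python 3, where the function is `urllib.parse.urljoin`; the base URL and the reference strings are invented for the example.

```python
from urllib.parse import urljoin

# Illustrative page URL; the references below stand in for HREF/SRC values.
base = "http://cnn.com/world/index.html"

# Non-rooted path: resolved relative to the directory of the base document.
relative = urljoin(base, "images/logo.png")

# Rooted path: replaces the whole path but keeps the scheme and netloc.
rooted = urljoin(base, "/about/")

# Leading "//": replaces the netloc and inherits only the scheme.
network_path = urljoin(base, "//edition.cnn.com/")

print(relative, rooted, network_path, sep="\n")
```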
History
Date                 User            Action  Args
2010-10-17 09:35:04  georg.brandl    set     status: open -> closed
2010-05-25 19:09:40  fdrake          set     messages: + msg106468
2010-05-25 19:03:39  adamnelson      set     messages: + msg106465
2010-05-25 18:41:58  r.david.murray  set     messages: + msg106463
2010-05-25 18:26:15  adamnelson      set     messages: + msg106461
2010-05-25 18:16:05  fdrake          set     messages: + msg106458
2010-05-25 18:04:39  adamnelson      set     messages: + msg106456
2010-05-25 17:53:24  fdrake          set     nosy: + fdrake
                                             messages: + msg106455
2010-05-25 17:41:51  adamnelson      set     messages: + msg106453
2010-05-25 17:33:00  r.david.murray  set     assignee: docs@python ->
                                             messages: + msg106452
                                             nosy: + orsenthil
2010-05-25 16:53:55  adamnelson      set     status: closed -> open
                                             messages: + msg106448
2010-05-25 15:37:21  r.david.murray  set     status: open -> closed
                                             assignee: docs@python
                                             components: + Documentation, - Library (Lib)
                                             versions: + Python 3.1, Python 2.7, Python 3.2
                                             nosy: + docs@python, r.david.murray
                                             messages: + msg106443
                                             resolution: fixed
                                             stage: resolved
2010-05-25 14:40:29  adamnelson      set     type: behavior
2010-05-25 14:39:57  adamnelson      create