classification
Title: urlparse goes wrong with IP:port without scheme
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.2, Python 3.1, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: orsenthil Nosy List: BreamoreBoy, dstanek, elachuni, facundobatista, georg.brandl, georg.brandl, jjlee, kruegi, orsenthil, sjones
Priority: normal Keywords: patch

Created on 2003-06-13 15:15 by kruegi, last changed 2010-08-04 04:56 by orsenthil. This issue is now closed.

Files
File name Uploaded Description Edit
issue754016-py26.patch orsenthil, 2008-06-27 18:20
754016.patch dstanek, 2010-08-03 23:58
Messages (16)
msg16377 - (view) Author: Thomas Kr├╝ger (kruegi) Date: 2003-06-13 15:15
urlparse doesnt work if IP and port are given without 
scheme:

>>> urlparse.urlparse('1.2.3.4:80','http')
('1.2.3.4', '', '80', '', '', '')

should be:

>>> urlparse.urlparse('1.2.3.4:80','http')
('http', '1.2.3.4', '80', '', '', '')
msg16378 - (view) Author: Shannon Jones (sjones) Date: 2003-06-14 02:26
Logged In: YES 
user_id=589306

urlparse.urlparse takes a url of the format:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>

And returns a 6-tuple of the format:
    (scheme, netloc, path, params, query, fragment).

An example from the library refrence takes:
msg16379 - (view) Author: Shannon Jones (sjones) Date: 2003-06-14 02:39
Logged In: YES 
user_id=589306

Sorry, previous comment got cut off...

urlparse.urlparse takes a url of the format:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>

And returns a 6-tuple of the format:
    (scheme, netloc, path, params, query, fragment).

An example from the library refrence takes:
    urlparse('http://www.cwi.nl:80/%7Eguido/Python.html')

And produces:
    ('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '',
'', '')

--------------------------------

Note that there isn't a field for the port number in the
6-tuple. Instead, it is included in the netloc. Urlparse
should handle your example as:

>>> urlparse.urlparse('1.2.3.4:80','http') 
('http', '1.2.3.4:80', '', '', '', '')

Instead, it gives the incorrect output as you indicated.
msg16380 - (view) Author: Shannon Jones (sjones) Date: 2003-06-14 04:18
Logged In: YES 
user_id=589306

Ok, I researched this a bit, and the situation isn't as
simple as it first appears. The RFC that urlparse tries to
follow is at http://www.faqs.org/rfcs/rfc1808.html and
notice specifically sections 2.1 and 2.2.

It seems to me that the source code follows rfc1808
religiously, and in that sense it does the correct thing.
According to the RFC, the netloc should begin with a '//',
and since your example didn't include one then it technical
was an invalid URL. Here is another example where it seems
Python fails to do the right thing:

>>> urlparse.urlparse('python.org')
('', '', 'python.org', '', '', '')
>>> urlparse.urlparse('python.org', 'http')
('http', '', 'python.org', '', '', '')

Note that it is putting 'python.org' as the path and not the
netloc. So the problem isn't limited to just when you use a
scheme parameter and/or a port number. Now if we put '//' at
the beginning, we get:

>>> urlparse.urlparse('//python.org')
('', 'python.org', '', '', '', '')
>>> urlparse.urlparse('//python.org', 'http')
('http', 'python.org', '', '', '', '')

So here it does the correct thing.

There are two problems though. First, it is common for
browsers and other software to just take a URL without a
scheme and '://' and assume it is http for the user. While
the URL is technically not correct, it is still common
usage. Also, urlparse does take a scheme parameter. It seems
as though this parameter should be used with a scheme-less
URL to give it a default one like web browsers do.

So somebody needs to make a decision. Should urlparse follow
the RFC religiously and require '//' in front of netlocs? If
so, I think the documentation should give an example showing
this and showing how to use the 'scheme' parameter. Or
should urlparse use the more commonly used form of a URL
where '//' is omitted when the scheme is omitted? If so,
urlparse.py will need to be changed. Or maybe another
fuction should be added to cover whichever behaviour
urlparse doesn't cover.

In any case, you can temporarily solve your problem by
making sure that URL's without a scheme have '//' at the
front. So your example becomes:

>>> urlparse.urlparse('//1.2.3.4:80', 'http')
('http', '1.2.3.4:80', '', '', '', '')

msg16381 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2004-12-26 21:55
Logged In: YES 
user_id=752496

The problem is still present in Py2.3.4.

IMO, it should support dirs without the "http://" or raise
an error if it's not present (never fail silently!).
msg16382 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2005-08-25 13:22
Logged In: YES 
user_id=1188172

Will look into it. Should be easy to fix.
msg67890 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-06-10 02:54
Attaching the patch to fix this issue. I deliberated upon this for a
while and came up with the approach to:

1) fix the port issue, wherein urlparse should technically recognize the 
':' separator for port from ':' after scheme.

2) And Doc fix wherein, it is advised that in the absence of a scheme,
use the net_loc as //net_loc (following RCF 1808).

If we go for any other fix, like internally pre-pending // when user has
not specified the scheme (like in many pratical purpose), then we stand
at chance of breaking a number of tests ( cases where url is 'g'(path
only),';x' (path with params) and cases where relative url is g:h)

Let me know your thoughts on this.

>>> urlparse('1.2.3.4:80')
ParseResult(scheme='', netloc='', path='1.2.3.4:80', params='',
query='', fragment='')
>>> urlparse('http://www.python.org:80/~guido/foo?query#fun')
ParseResult(scheme='http', netloc='www.python.org:80',
path='/~guido/foo', params='', query='query', fragment='fun')
>>>
msg68526 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-06-21 18:03
Senthil, your patch is wrong, see:

>>> import urlparse
>>> urlparse.urlparse('1.2.3.4:80','http')
ParseResult(scheme='http', netloc='', path='1.2.3.4:80', params='',
query='', fragment='')

The netloc should be "1.2.3.4:80", note the composition of an URL:

  <scheme>://<netloc>/<path>;<params>?<query>#<fragment>

Please fix it and test it applying the patch to the test I'm submitting
here...
msg68533 - (view) Author: Anthony Lenton (elachuni) * Date: 2008-06-21 18:45
I agree with facundobatista that the patch is bad, but for a different
reason: it now breaks with:

>>> import urlparse
>>> urlparse.urlparse ('http:')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anthony/svn/python26/Lib/urlparse.py", line 108, in urlparse
tuple = urlsplit(url, scheme, allow_fragments)
  File "/home/anthony/svn/python26/Lib/urlparse.py", line 148, in urlsplit
if i > 0 and not url[i+1].isdigit():
IndexError: string index out of range

I'm afraid that it it's not evident that the expected behavior isn't
evident.

Take for example:

>>> import urlparse
>>> urlparse.urlparse('some.com', 'http')
ParseResult(scheme='http', netloc='', path='some.com', params='',
query='', fragment='')

Is the url referring to the some.com domain or to a windows executable file?

If you're using urlparse to parse only absolute urls then probably you
want the first component to be considered a net_loc, but not if you're
thinking of accepting also relative urls.

It would probably be better to be explicit and raise an exception if the
url is invalid, so that the user can prepend a '//' and resubmit if
needed.  Also we'd probably stop seeing bugreports about this issue :)
msg68535 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-06-21 19:10
I agree with Anthony here, because if you let people write without the
"//" at the beginning, you'll never know if they're entering a net
location or a relative path.

So, the better behaviour to be as explicit as possible should be:

>>> urlparse.urlparse('1.2.3.4:80','http')
Traceback!!! ValueError(<nice message here>)

>>> urlparse.urlparse('//1.2.3.4:80','http')
('http', '1.2.3.4:80', '', '', '', '')


So, to close this issue, we should fix the code to behave like indicated
in the first case.

What do you think?
msg68841 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2008-06-27 18:20
I am attaching the modified patch, which addresses the port issue
properly and handles 'http:', 'https:' only URLS. Also included the
tests for it.

Facundo, I gave sufficient thought on raising an Exception for URLS not
staring with '//', and I am -1 for it.

As urlparse module is used for handling both absolute URLs as well as
relative URLS, this suggestion IMHO, would break the urlparse handling
of all relative urls. For e.g, cases which are mentioned in the RFC 1808
(Section 5.1 Normal Examples).

The way to inform the users to use '//net_loc' when they want net_loc,
would be Docs/Help message (included in the patch) and otherwise
urlparse following RFC1808, will treat it as the path.

This case may seem absurd when 'www.python.org' is treated as path but
perfect for parsing relative urls like just 'a'. More over this makes
sense when we have relative urls with parameters and query, for
e.g.'g:h','?x'

Another way to handle this would be split urlparse into two methods:
urlparse.absparse()
urlparse.relparse() 
and let the user decide what he wants to do.
I am passing a message to Web-SIG to discuss this further.

Irrespective of this, if the patch looks okay for "handling the port
issue" for 2.6 with the "Doc/Help message", then we should close this
bug and take the discussion further in Web-SIG. (I shall provide the
patch for 3.0 as well)

Comments Please.
msg69095 - (view) Author: Facundo Batista (facundobatista) * (Python committer) Date: 2008-07-02 14:40
I think this last patch is ok, but the third case that was raised in the
web-sig should be addressed:

"""
There's even a 3rd case: HTTP's Request-URI. For example, '//path' must
be treated as an abs_path consisting of two path_segments ['', 'path'],
not a net_loc, since the Request_URI must be one of ("*" | absoluteURI |
abs_path | authority).
"""

Please, address this new detail, and I'd commit this. Thanks!
msg85877 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2009-04-11 21:35
Facundo, I re-looked at this issue (after a long time; sorry for that)
and I think that the patch is good to be applied as it is for this issue.

The Web-SIG discussion, which you pointed to in the comment
(http://mail.python.org/pipermail/web-sig/2008-June/003458.html)
suggests a more "general case" of thinking and parsing of the urls.

The suggestion is specifically for "//path" kind of urls, which is will
be parsed into authority (as per RFC2396) or into netloc as in the patch.

My suggestion is to fix this issue with patch and if the corner case
like "//path" comes up, then I shall look the modifications required as
there is No RFC Specifications and Tests defining those cases.

Your comments?
msg110334 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-07-14 22:15
The patch will need to be reworked for the 2.7, 3.1 and 3.2 branches.
msg112725 - (view) Author: David Stanek (dstanek) Date: 2010-08-03 23:58
I've reworked the patch so that it applied against the py3k branch. It's been attached to this issue and is also available here: http://codereview.appspot.com/1910044.
msg112758 - (view) Author: Senthil Kumaran (orsenthil) * (Python committer) Date: 2010-08-04 04:56
Fixed in revision 83700 (release27-maint). r83701(py3k) and r83702(release31-maint).

David, thanks for reworking on the patch. Couple of comments
- I made change to the original patch where I checked 'https:' and 'http:' kind of url with url.endswith(':')  instead of len(url) == i+1 // I had a hard time to figure out why I had this way in the first place.

- In py3k, the urlparse.urlparse is changed to urllib.parse.urlparse. So that changes were required in the tests.
History
Date User Action Args
2010-08-04 04:56:30orsenthilsetstatus: open -> closed
messages: + msg112758

assignee: facundobatista -> orsenthil
resolution: accepted -> fixed
stage: patch review -> resolved
2010-08-03 23:58:04dstaneksetfiles: + 754016.patch
nosy: + dstanek
messages: + msg112725

2010-07-14 22:15:59BreamoreBoysetnosy: + BreamoreBoy
messages: + msg110334
versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2009-04-11 21:35:45orsenthilsetresolution: accepted
messages: + msg85877
2009-02-13 01:37:56ajaksu2setnosy: + jjlee
stage: patch review
type: behavior
versions: + Python 2.6, - Python 2.3
2008-07-02 14:40:56facundobatistasetmessages: + msg69095
2008-06-27 18:21:20orsenthilsetfiles: - issue754016.patch
2008-06-27 18:20:58orsenthilsetfiles: + issue754016-py26.patch
messages: + msg68841
2008-06-21 19:10:08facundobatistasetmessages: + msg68535
2008-06-21 19:02:54facundobatistasetfiles: - test_urlparse.diff
2008-06-21 18:45:44elachunisetnosy: + elachuni
messages: + msg68533
2008-06-21 18:03:19facundobatistasetfiles: + test_urlparse.diff
messages: + msg68526
2008-06-20 01:03:37facundobatistasetassignee: georg.brandl -> facundobatista
2008-06-10 02:54:28orsenthilsetfiles: + issue754016.patch
nosy: + orsenthil
messages: + msg67890
keywords: + patch
2003-06-13 15:15:34kruegicreate