Created on 2003-06-13 15:15 by kruegi, last changed 2010-08-04 04:56 by orsenthil. This issue is now closed.
|issue754016-py26.patch||orsenthil, 2008-06-27 18:20|
|754016.patch||dstanek, 2010-08-03 23:58|
|msg16377 - (view)||Author: Thomas Krüger (kruegi)||Date: 2003-06-13 15:15|
urlparse doesnt work if IP and port are given without scheme: >>> urlparse.urlparse('126.96.36.199:80','http') ('188.8.131.52', '', '80', '', '', '') should be: >>> urlparse.urlparse('184.108.40.206:80','http') ('http', '220.127.116.11', '80', '', '', '')
|msg16378 - (view)||Author: Shannon Jones (sjones)||Date: 2003-06-14 02:26|
Logged In: YES user_id=589306 urlparse.urlparse takes a url of the format: <scheme>://<netloc>/<path>;<params>?<query>#<fragment> And returns a 6-tuple of the format: (scheme, netloc, path, params, query, fragment). An example from the library refrence takes:
|msg16379 - (view)||Author: Shannon Jones (sjones)||Date: 2003-06-14 02:39|
Logged In: YES user_id=589306 Sorry, previous comment got cut off... urlparse.urlparse takes a url of the format: <scheme>://<netloc>/<path>;<params>?<query>#<fragment> And returns a 6-tuple of the format: (scheme, netloc, path, params, query, fragment). An example from the library refrence takes: urlparse('http://www.cwi.nl:80/%7Eguido/Python.html') And produces: ('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', '', '') -------------------------------- Note that there isn't a field for the port number in the 6-tuple. Instead, it is included in the netloc. Urlparse should handle your example as: >>> urlparse.urlparse('18.104.22.168:80','http') ('http', '22.214.171.124:80', '', '', '', '') Instead, it gives the incorrect output as you indicated.
|msg16380 - (view)||Author: Shannon Jones (sjones)||Date: 2003-06-14 04:18|
Logged In: YES user_id=589306 Ok, I researched this a bit, and the situation isn't as simple as it first appears. The RFC that urlparse tries to follow is at http://www.faqs.org/rfcs/rfc1808.html and notice specifically sections 2.1 and 2.2. It seems to me that the source code follows rfc1808 religiously, and in that sense it does the correct thing. According to the RFC, the netloc should begin with a '//', and since your example didn't include one then it technical was an invalid URL. Here is another example where it seems Python fails to do the right thing: >>> urlparse.urlparse('python.org') ('', '', 'python.org', '', '', '') >>> urlparse.urlparse('python.org', 'http') ('http', '', 'python.org', '', '', '') Note that it is putting 'python.org' as the path and not the netloc. So the problem isn't limited to just when you use a scheme parameter and/or a port number. Now if we put '//' at the beginning, we get: >>> urlparse.urlparse('//python.org') ('', 'python.org', '', '', '', '') >>> urlparse.urlparse('//python.org', 'http') ('http', 'python.org', '', '', '', '') So here it does the correct thing. There are two problems though. First, it is common for browsers and other software to just take a URL without a scheme and '://' and assume it is http for the user. While the URL is technically not correct, it is still common usage. Also, urlparse does take a scheme parameter. It seems as though this parameter should be used with a scheme-less URL to give it a default one like web browsers do. So somebody needs to make a decision. Should urlparse follow the RFC religiously and require '//' in front of netlocs? If so, I think the documentation should give an example showing this and showing how to use the 'scheme' parameter. Or should urlparse use the more commonly used form of a URL where '//' is omitted when the scheme is omitted? If so, urlparse.py will need to be changed. Or maybe another fuction should be added to cover whichever behaviour urlparse doesn't cover. In any case, you can temporarily solve your problem by making sure that URL's without a scheme have '//' at the front. So your example becomes: >>> urlparse.urlparse('//126.96.36.199:80', 'http') ('http', '188.8.131.52:80', '', '', '', '')
|msg16381 - (view)||Author: Facundo Batista (facundobatista) *||Date: 2004-12-26 21:55|
Logged In: YES user_id=752496 The problem is still present in Py2.3.4. IMO, it should support dirs without the "http://" or raise an error if it's not present (never fail silently!).
|msg16382 - (view)||Author: Georg Brandl (georg.brandl) *||Date: 2005-08-25 13:22|
Logged In: YES user_id=1188172 Will look into it. Should be easy to fix.
|msg67890 - (view)||Author: Senthil Kumaran (orsenthil) *||Date: 2008-06-10 02:54|
Attaching the patch to fix this issue. I deliberated upon this for a while and came up with the approach to: 1) fix the port issue, wherein urlparse should technically recognize the ':' separator for port from ':' after scheme. 2) And Doc fix wherein, it is advised that in the absence of a scheme, use the net_loc as //net_loc (following RCF 1808). If we go for any other fix, like internally pre-pending // when user has not specified the scheme (like in many pratical purpose), then we stand at chance of breaking a number of tests ( cases where url is 'g'(path only),';x' (path with params) and cases where relative url is g:h) Let me know your thoughts on this. >>> urlparse('184.108.40.206:80') ParseResult(scheme='', netloc='', path='220.127.116.11:80', params='', query='', fragment='') >>> urlparse('http://www.python.org:80/~guido/foo?query#fun') ParseResult(scheme='http', netloc='www.python.org:80', path='/~guido/foo', params='', query='query', fragment='fun') >>>
|msg68526 - (view)||Author: Facundo Batista (facundobatista) *||Date: 2008-06-21 18:03|
Senthil, your patch is wrong, see: >>> import urlparse >>> urlparse.urlparse('18.104.22.168:80','http') ParseResult(scheme='http', netloc='', path='22.214.171.124:80', params='', query='', fragment='') The netloc should be "126.96.36.199:80", note the composition of an URL: <scheme>://<netloc>/<path>;<params>?<query>#<fragment> Please fix it and test it applying the patch to the test I'm submitting here...
|msg68533 - (view)||Author: Anthony Lenton (elachuni) *||Date: 2008-06-21 18:45|
I agree with facundobatista that the patch is bad, but for a different reason: it now breaks with: >>> import urlparse >>> urlparse.urlparse ('http:') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/anthony/svn/python26/Lib/urlparse.py", line 108, in urlparse tuple = urlsplit(url, scheme, allow_fragments) File "/home/anthony/svn/python26/Lib/urlparse.py", line 148, in urlsplit if i > 0 and not url[i+1].isdigit(): IndexError: string index out of range I'm afraid that it it's not evident that the expected behavior isn't evident. Take for example: >>> import urlparse >>> urlparse.urlparse('some.com', 'http') ParseResult(scheme='http', netloc='', path='some.com', params='', query='', fragment='') Is the url referring to the some.com domain or to a windows executable file? If you're using urlparse to parse only absolute urls then probably you want the first component to be considered a net_loc, but not if you're thinking of accepting also relative urls. It would probably be better to be explicit and raise an exception if the url is invalid, so that the user can prepend a '//' and resubmit if needed. Also we'd probably stop seeing bugreports about this issue :)
|msg68535 - (view)||Author: Facundo Batista (facundobatista) *||Date: 2008-06-21 19:10|
I agree with Anthony here, because if you let people write without the "//" at the beginning, you'll never know if they're entering a net location or a relative path. So, the better behaviour to be as explicit as possible should be: >>> urlparse.urlparse('188.8.131.52:80','http') Traceback!!! ValueError(<nice message here>) >>> urlparse.urlparse('//184.108.40.206:80','http') ('http', '220.127.116.11:80', '', '', '', '') So, to close this issue, we should fix the code to behave like indicated in the first case. What do you think?
|msg68841 - (view)||Author: Senthil Kumaran (orsenthil) *||Date: 2008-06-27 18:20|
I am attaching the modified patch, which addresses the port issue properly and handles 'http:', 'https:' only URLS. Also included the tests for it. Facundo, I gave sufficient thought on raising an Exception for URLS not staring with '//', and I am -1 for it. As urlparse module is used for handling both absolute URLs as well as relative URLS, this suggestion IMHO, would break the urlparse handling of all relative urls. For e.g, cases which are mentioned in the RFC 1808 (Section 5.1 Normal Examples). The way to inform the users to use '//net_loc' when they want net_loc, would be Docs/Help message (included in the patch) and otherwise urlparse following RFC1808, will treat it as the path. This case may seem absurd when 'www.python.org' is treated as path but perfect for parsing relative urls like just 'a'. More over this makes sense when we have relative urls with parameters and query, for e.g.'g:h','?x' Another way to handle this would be split urlparse into two methods: urlparse.absparse() urlparse.relparse() and let the user decide what he wants to do. I am passing a message to Web-SIG to discuss this further. Irrespective of this, if the patch looks okay for "handling the port issue" for 2.6 with the "Doc/Help message", then we should close this bug and take the discussion further in Web-SIG. (I shall provide the patch for 3.0 as well) Comments Please.
|msg69095 - (view)||Author: Facundo Batista (facundobatista) *||Date: 2008-07-02 14:40|
I think this last patch is ok, but the third case that was raised in the web-sig should be addressed: """ There's even a 3rd case: HTTP's Request-URI. For example, '//path' must be treated as an abs_path consisting of two path_segments ['', 'path'], not a net_loc, since the Request_URI must be one of ("*" | absoluteURI | abs_path | authority). """ Please, address this new detail, and I'd commit this. Thanks!
|msg85877 - (view)||Author: Senthil Kumaran (orsenthil) *||Date: 2009-04-11 21:35|
Facundo, I re-looked at this issue (after a long time; sorry for that) and I think that the patch is good to be applied as it is for this issue. The Web-SIG discussion, which you pointed to in the comment (http://mail.python.org/pipermail/web-sig/2008-June/003458.html) suggests a more "general case" of thinking and parsing of the urls. The suggestion is specifically for "//path" kind of urls, which is will be parsed into authority (as per RFC2396) or into netloc as in the patch. My suggestion is to fix this issue with patch and if the corner case like "//path" comes up, then I shall look the modifications required as there is No RFC Specifications and Tests defining those cases. Your comments?
|msg110334 - (view)||Author: Mark Lawrence (BreamoreBoy) *||Date: 2010-07-14 22:15|
The patch will need to be reworked for the 2.7, 3.1 and 3.2 branches.
|msg112725 - (view)||Author: David Stanek (dstanek)||Date: 2010-08-03 23:58|
I've reworked the patch so that it applied against the py3k branch. It's been attached to this issue and is also available here: http://codereview.appspot.com/1910044.
|msg112758 - (view)||Author: Senthil Kumaran (orsenthil) *||Date: 2010-08-04 04:56|
Fixed in revision 83700 (release27-maint). r83701(py3k) and r83702(release31-maint). David, thanks for reworking on the patch. Couple of comments - I made change to the original patch where I checked 'https:' and 'http:' kind of url with url.endswith(':') instead of len(url) == i+1 // I had a hard time to figure out why I had this way in the first place. - In py3k, the urlparse.urlparse is changed to urllib.parse.urlparse. So that changes were required in the tests.
|2010-08-04 04:56:30||orsenthil||set||status: open -> closed|
messages: + msg112758
assignee: facundobatista -> orsenthil
resolution: accepted -> fixed
stage: patch review -> resolved
nosy: + dstanek
messages: + msg112725
messages: + msg110334
versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
|2009-04-11 21:35:45||orsenthil||set||resolution: accepted|
messages: + msg85877
stage: patch review
versions: + Python 2.6, - Python 2.3
|2008-07-02 14:40:56||facundobatista||set||messages: + msg69095|
|2008-06-27 18:21:20||orsenthil||set||files: - issue754016.patch|
messages: + msg68841
|2008-06-21 19:10:08||facundobatista||set||messages: + msg68535|
|2008-06-21 19:02:54||facundobatista||set||files: - test_urlparse.diff|
messages: + msg68533
messages: + msg68526
|2008-06-20 01:03:37||facundobatista||set||assignee: georg.brandl -> facundobatista|
nosy: + orsenthil
messages: + msg67890
keywords: + patch