Issue 754016: urlparse goes wrong with IP:port without scheme

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/38644

classification

Title:	urlparse goes wrong with IP:port without scheme
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.1, Python 3.2, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	orsenthil	Nosy List:	dstanek, elachuni, facundobatista, georg.brandl, georg.brandl, jjlee, kruegi, miss-islington, orsenthil, sjones
Priority:	normal	Keywords:	patch

Created on 2003-06-13 15:15 by kruegi, last changed 2022-04-10 16:09 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue754016-py26.patch	orsenthil, 2008-06-27 18:20
754016.patch	dstanek, 2010-08-03 23:58

Pull Requests
URL	Status	Linked	Edit
PR 661	merged	Tim.Graham, 2017-03-13 17:39
PR 16837	merged	miss-islington, 2019-10-18 13:07
PR 16839	merged	orsenthil, 2019-10-18 13:51

Messages (19)
msg16377 - (view)	Author: Thomas Krüger (kruegi)	Date: 2003-06-13 15:15
urlparse doesnt work if IP and port are given without scheme: >>> urlparse.urlparse('1.2.3.4:80','http') ('1.2.3.4', '', '80', '', '', '') should be: >>> urlparse.urlparse('1.2.3.4:80','http') ('http', '1.2.3.4', '80', '', '', '')
msg16378 - (view)	Author: Shannon Jones (sjones)	Date: 2003-06-14 02:26
Logged In: YES user_id=589306 urlparse.urlparse takes a url of the format: <scheme>://<netloc>/<path>;<params>?<query>#<fragment> And returns a 6-tuple of the format: (scheme, netloc, path, params, query, fragment). An example from the library refrence takes:
msg16379 - (view)	Author: Shannon Jones (sjones)	Date: 2003-06-14 02:39
Logged In: YES user_id=589306 Sorry, previous comment got cut off... urlparse.urlparse takes a url of the format: <scheme>://<netloc>/<path>;<params>?<query>#<fragment> And returns a 6-tuple of the format: (scheme, netloc, path, params, query, fragment). An example from the library refrence takes: urlparse('http://www.cwi.nl:80/%7Eguido/Python.html') And produces: ('http', 'www.cwi.nl:80', '/%7Eguido/Python.html', '', '', '') -------------------------------- Note that there isn't a field for the port number in the 6-tuple. Instead, it is included in the netloc. Urlparse should handle your example as: >>> urlparse.urlparse('1.2.3.4:80','http') ('http', '1.2.3.4:80', '', '', '', '') Instead, it gives the incorrect output as you indicated.
msg16380 - (view)	Author: Shannon Jones (sjones)	Date: 2003-06-14 04:18
Logged In: YES user_id=589306 Ok, I researched this a bit, and the situation isn't as simple as it first appears. The RFC that urlparse tries to follow is at http://www.faqs.org/rfcs/rfc1808.html and notice specifically sections 2.1 and 2.2. It seems to me that the source code follows rfc1808 religiously, and in that sense it does the correct thing. According to the RFC, the netloc should begin with a '//', and since your example didn't include one then it technical was an invalid URL. Here is another example where it seems Python fails to do the right thing: >>> urlparse.urlparse('python.org') ('', '', 'python.org', '', '', '') >>> urlparse.urlparse('python.org', 'http') ('http', '', 'python.org', '', '', '') Note that it is putting 'python.org' as the path and not the netloc. So the problem isn't limited to just when you use a scheme parameter and/or a port number. Now if we put '//' at the beginning, we get: >>> urlparse.urlparse('//python.org') ('', 'python.org', '', '', '', '') >>> urlparse.urlparse('//python.org', 'http') ('http', 'python.org', '', '', '', '') So here it does the correct thing. There are two problems though. First, it is common for browsers and other software to just take a URL without a scheme and '://' and assume it is http for the user. While the URL is technically not correct, it is still common usage. Also, urlparse does take a scheme parameter. It seems as though this parameter should be used with a scheme-less URL to give it a default one like web browsers do. So somebody needs to make a decision. Should urlparse follow the RFC religiously and require '//' in front of netlocs? If so, I think the documentation should give an example showing this and showing how to use the 'scheme' parameter. Or should urlparse use the more commonly used form of a URL where '//' is omitted when the scheme is omitted? If so, urlparse.py will need to be changed. Or maybe another fuction should be added to cover whichever behaviour urlparse doesn't cover. In any case, you can temporarily solve your problem by making sure that URL's without a scheme have '//' at the front. So your example becomes: >>> urlparse.urlparse('//1.2.3.4:80', 'http') ('http', '1.2.3.4:80', '', '', '', '')
msg16381 - (view)	Author: Facundo Batista (facundobatista) *	Date: 2004-12-26 21:55
Logged In: YES user_id=752496 The problem is still present in Py2.3.4. IMO, it should support dirs without the "http://" or raise an error if it's not present (never fail silently!).
msg16382 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2005-08-25 13:22
Logged In: YES user_id=1188172 Will look into it. Should be easy to fix.
msg67890 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2008-06-10 02:54
Attaching the patch to fix this issue. I deliberated upon this for a while and came up with the approach to: 1) fix the port issue, wherein urlparse should technically recognize the ':' separator for port from ':' after scheme. 2) And Doc fix wherein, it is advised that in the absence of a scheme, use the net_loc as //net_loc (following RCF 1808). If we go for any other fix, like internally pre-pending // when user has not specified the scheme (like in many pratical purpose), then we stand at chance of breaking a number of tests ( cases where url is 'g'(path only),';x' (path with params) and cases where relative url is g:h) Let me know your thoughts on this. >>> urlparse('1.2.3.4:80') ParseResult(scheme='', netloc='', path='1.2.3.4:80', params='', query='', fragment='') >>> urlparse('http://www.python.org:80/~guido/foo?query#fun') ParseResult(scheme='http', netloc='www.python.org:80', path='/~guido/foo', params='', query='query', fragment='fun') >>>
msg68526 - (view)	Author: Facundo Batista (facundobatista) *	Date: 2008-06-21 18:03
Senthil, your patch is wrong, see: >>> import urlparse >>> urlparse.urlparse('1.2.3.4:80','http') ParseResult(scheme='http', netloc='', path='1.2.3.4:80', params='', query='', fragment='') The netloc should be "1.2.3.4:80", note the composition of an URL: <scheme>://<netloc>/<path>;<params>?<query>#<fragment> Please fix it and test it applying the patch to the test I'm submitting here...
msg68533 - (view)	Author: Anthony Lenton (elachuni) *	Date: 2008-06-21 18:45
I agree with facundobatista that the patch is bad, but for a different reason: it now breaks with: >>> import urlparse >>> urlparse.urlparse ('http:') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/anthony/svn/python26/Lib/urlparse.py", line 108, in urlparse tuple = urlsplit(url, scheme, allow_fragments) File "/home/anthony/svn/python26/Lib/urlparse.py", line 148, in urlsplit if i > 0 and not url[i+1].isdigit(): IndexError: string index out of range I'm afraid that it it's not evident that the expected behavior isn't evident. Take for example: >>> import urlparse >>> urlparse.urlparse('some.com', 'http') ParseResult(scheme='http', netloc='', path='some.com', params='', query='', fragment='') Is the url referring to the some.com domain or to a windows executable file? If you're using urlparse to parse only absolute urls then probably you want the first component to be considered a net_loc, but not if you're thinking of accepting also relative urls. It would probably be better to be explicit and raise an exception if the url is invalid, so that the user can prepend a '//' and resubmit if needed. Also we'd probably stop seeing bugreports about this issue :)
msg68535 - (view)	Author: Facundo Batista (facundobatista) *	Date: 2008-06-21 19:10
I agree with Anthony here, because if you let people write without the "//" at the beginning, you'll never know if they're entering a net location or a relative path. So, the better behaviour to be as explicit as possible should be: >>> urlparse.urlparse('1.2.3.4:80','http') Traceback!!! ValueError(<nice message here>) >>> urlparse.urlparse('//1.2.3.4:80','http') ('http', '1.2.3.4:80', '', '', '', '') So, to close this issue, we should fix the code to behave like indicated in the first case. What do you think?
msg68841 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2008-06-27 18:20
I am attaching the modified patch, which addresses the port issue properly and handles 'http:', 'https:' only URLS. Also included the tests for it. Facundo, I gave sufficient thought on raising an Exception for URLS not staring with '//', and I am -1 for it. As urlparse module is used for handling both absolute URLs as well as relative URLS, this suggestion IMHO, would break the urlparse handling of all relative urls. For e.g, cases which are mentioned in the RFC 1808 (Section 5.1 Normal Examples). The way to inform the users to use '//net_loc' when they want net_loc, would be Docs/Help message (included in the patch) and otherwise urlparse following RFC1808, will treat it as the path. This case may seem absurd when 'www.python.org' is treated as path but perfect for parsing relative urls like just 'a'. More over this makes sense when we have relative urls with parameters and query, for e.g.'g:h','?x' Another way to handle this would be split urlparse into two methods: urlparse.absparse() urlparse.relparse() and let the user decide what he wants to do. I am passing a message to Web-SIG to discuss this further. Irrespective of this, if the patch looks okay for "handling the port issue" for 2.6 with the "Doc/Help message", then we should close this bug and take the discussion further in Web-SIG. (I shall provide the patch for 3.0 as well) Comments Please.
msg69095 - (view)	Author: Facundo Batista (facundobatista) *	Date: 2008-07-02 14:40
I think this last patch is ok, but the third case that was raised in the web-sig should be addressed: """ There's even a 3rd case: HTTP's Request-URI. For example, '//path' must be treated as an abs_path consisting of two path_segments ['', 'path'], not a net_loc, since the Request_URI must be one of ("*" \| absoluteURI \| abs_path \| authority). """ Please, address this new detail, and I'd commit this. Thanks!
msg85877 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2009-04-11 21:35
Facundo, I re-looked at this issue (after a long time; sorry for that) and I think that the patch is good to be applied as it is for this issue. The Web-SIG discussion, which you pointed to in the comment (http://mail.python.org/pipermail/web-sig/2008-June/003458.html) suggests a more "general case" of thinking and parsing of the urls. The suggestion is specifically for "//path" kind of urls, which is will be parsed into authority (as per RFC2396) or into netloc as in the patch. My suggestion is to fix this issue with patch and if the corner case like "//path" comes up, then I shall look the modifications required as there is No RFC Specifications and Tests defining those cases. Your comments?
msg110334 - (view)	Author: Mark Lawrence (BreamoreBoy) *	Date: 2010-07-14 22:15
The patch will need to be reworked for the 2.7, 3.1 and 3.2 branches.
msg112725 - (view)	Author: David Stanek (dstanek)	Date: 2010-08-03 23:58
I've reworked the patch so that it applied against the py3k branch. It's been attached to this issue and is also available here: http://codereview.appspot.com/1910044.
msg112758 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2010-08-04 04:56
Fixed in revision 83700 (release27-maint). r83701(py3k) and r83702(release31-maint). David, thanks for reworking on the patch. Couple of comments - I made change to the original patch where I checked 'https:' and 'http:' kind of url with url.endswith(':') instead of len(url) == i+1 // I had a hard time to figure out why I had this way in the first place. - In py3k, the urlparse.urlparse is changed to urllib.parse.urlparse. So that changes were required in the tests.
msg354892 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2019-10-18 13:07
New changeset 5a88d50ff013a64fbdb25b877c87644a9034c969 by Senthil Kumaran (Tim Graham) in branch 'master': bpo-27657: Fix urlparse() with numeric paths (#661) https://github.com/python/cpython/commit/5a88d50ff013a64fbdb25b877c87644a9034c969
msg354897 - (view)	Author: miss-islington (miss-islington)	Date: 2019-10-18 13:24
New changeset 82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f by Miss Islington (bot) in branch '3.7': bpo-27657: Fix urlparse() with numeric paths (GH-661) https://github.com/python/cpython/commit/82b5f6b16e051f8a2ac6e87ba86b082fa1c4a77f
msg354906 - (view)	Author: Senthil Kumaran (orsenthil) *	Date: 2019-10-18 15:23
New changeset 0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384 by Senthil Kumaran in branch '3.8': [3.8] bpo-27657: Fix urlparse() with numeric paths (GH-661) (#16839) https://github.com/python/cpython/commit/0f3187c1ce3b3ace60f6c1691dfa3d4e744f0384

History
Date	User	Action	Args
2022-04-10 16:09:11	admin	set	github: 38644
2019-10-18 15:23:21	orsenthil	set	messages: + msg354906
2019-10-18 13:51:49	orsenthil	set	pull_requests: + pull_request16389
2019-10-18 13:24:32	miss-islington	set	nosy: + miss-islington messages: + msg354897
2019-10-18 13:07:37	miss-islington	set	pull_requests: + pull_request16383
2019-10-18 13:07:37	orsenthil	set	messages: + msg354892
2017-03-13 19:39:47	BreamoreBoy	set	nosy: - BreamoreBoy
2017-03-13 17:39:32	Tim.Graham	set	pull_requests: + pull_request544
2010-08-04 04:56:30	orsenthil	set	status: open -> closed messages: + msg112758 assignee: facundobatista -> orsenthil resolution: accepted -> fixed stage: patch review -> resolved
2010-08-03 23:58:04	dstanek	set	files: + 754016.patch nosy: + dstanek messages: + msg112725
2010-07-14 22:15:59	BreamoreBoy	set	nosy: + BreamoreBoy messages: + msg110334 versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2009-04-11 21:35:45	orsenthil	set	resolution: accepted messages: + msg85877
2009-02-13 01:37:56	ajaksu2	set	nosy: + jjlee stage: patch review type: behavior versions: + Python 2.6, - Python 2.3
2008-07-02 14:40:56	facundobatista	set	messages: + msg69095
2008-06-27 18:21:20	orsenthil	set	files: - issue754016.patch
2008-06-27 18:20:58	orsenthil	set	files: + issue754016-py26.patch messages: + msg68841
2008-06-21 19:10:08	facundobatista	set	messages: + msg68535
2008-06-21 19:02:54	facundobatista	set	files: - test_urlparse.diff
2008-06-21 18:45:44	elachuni	set	nosy: + elachuni messages: + msg68533
2008-06-21 18:03:19	facundobatista	set	files: + test_urlparse.diff messages: + msg68526
2008-06-20 01:03:37	facundobatista	set	assignee: georg.brandl -> facundobatista
2008-06-10 02:54:28	orsenthil	set	files: + issue754016.patch nosy: + orsenthil messages: + msg67890 keywords: + patch
2003-06-13 15:15:34	kruegi	create