Issue 20271: urllib.parse.urlparse() accepts wrong URLs

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/64470

classification

Title:	urllib.parse.urlparse() accepts wrong URLs
Type:	behavior	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.11, Python 3.10, Python 3.9

process

Status:	open	Resolution:
Dependencies:	20270	Superseder:
Assigned To:		Nosy List:	demian.brecht, gvanrossum, martin.panter, orsenthil, serhiy.storchaka, xtreak, yselivanov
Priority:	normal	Keywords:	patch

Created on 2014-01-15 12:32 by serhiy.storchaka, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
urlparse.patch	vstinner, 2014-03-07 16:19		review
urlparse_2.patch	serhiy.storchaka, 2015-03-03 18:40		review

Messages (14)
msg208156 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-01-15 12:32
>>> import urllib.parse >>> p = urllib.parse.urlparse('http://[::1]spam:80') >>> p ParseResult(scheme='http', netloc='[::1]spam:80', path='', params='', query='', fragment='') >>> p.hostname '::1' >>> p.port 80 'http://[::1]spam:80' is invalid URL, but urllib.parse.urlparse() accepts it and just ignore the spam part.
msg208675 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-01-21 17:14
What should be correct behavior? Raise an exception, return '[::1]spam' as hostname, or left all as is?
msg209852 - (view)	Author: Yury Selivanov (yselivanov) *	Date: 2014-02-01 00:23
I think we should start raising an exception in 3.5 (backwards incompatible change to back-port it)
msg212888 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-03-07 16:19
Here is a patch for Python 3.5 which breaks backward compatibility: urlparse functions now raise a ValueError if the IPv6 address, port or host is invalid. Examples of invalid URLs: - "HTTP://WWW.PYTHON.ORG:65536/doc/#frag": 65536 is invalid - "http://www.example.net:foo" - "http://::1/" - "http://[127.0.0.1]/" - "http://[host]/" According to unit tests, Python 3.4 is more tolerant: it only raises an error when the port number is read (obj.port) from an URL with an invalid port. There error is not raised when the whole URL is parsed. Is it ok to break backward compatibility?
msg212889 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-03-07 16:19
Oh, by the way my patch fixes also #18191.
msg212890 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-03-07 16:21
My patch "urlparse.patch" may be modified to fix also #18191 in Python 2.7, 3.3 and 3.4: splitport() should handle IPv6 ("[::1]:80") and auth ("user:passowrd@host") but not raises an exception.
msg236990 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-03-01 22:09
splitport() shouldn't handle auth, it should be called after auth is dropped with splituser(). The patch makes URL parsing slower, especially URLs with IPv6 addresses. $ ./python -m timeit -s "from urllib.parse import urlparse, clear_cache" -- "urlparse('http://python.org:80'); clear_cache()" Unpatched: 10000 loops, best of 3: 70.3 usec per loop Patched: 10000 loops, best of 3: 105 usec per loop $ ./python -m timeit -s "from urllib.parse import urlparse, clear_cache" -- "urlparse('http://[2001:4802:7901:0:e60a:1375:0:5]:80'); clear_cache()" Unpatched: 10000 loops, best of 3: 71.1 usec per loop Patched: 1000 loops, best of 3: 239 usec per loop
msg237001 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-02 01:10
I suspect the proposed patch here would make some of the special error handling for the “port” property redundant, and would compete with my patch for Issue 20059, which raises ValueError in more cases when “port” is accessed, rather than returning None. There should also be test cases (and maybe documentation or fixes?) to illustrate the behaviour of parsing: * http://[fe80::a%25en1] (RFC 6874-style scoped IPv6 address with zone identifier) * http://[v1.fe80::a+en1] (RFC 3986 IPvFuture format, using <https://tools.ietf.org/html/draft-sweet-uri-zoneid-01> style address)
msg237157 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-03-03 18:40
With optimizations in issue23563 and weaken IPv6 check the implementation of urlsplit() can be faster. $ ./python -m timeit -s "from urllib.parse import urlparse, clear_cache" -- "urlparse('http://python.org:80'); clear_cache()" 10000 loops, best of 3: 86.3 usec per loop $ ./python -m timeit -s "from urllib.parse import urlparse, clear_cache" -- "urlparse('http://[2001:4802:7901:0:e60a:1375:0:5]:80'); clear_cache()" 10000 loops, best of 3: 88.6 usec per loop
msg237369 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-06 18:29
I think some further consideration around this change is worthwhile: Currently, urllib.parse.split* methods seem to do inconsistent validation around the data they're splitting. For example: (None for an invalid port) >>> parse.splitnport('example.com:foo') ('example.com', None) Whereas other split* methods do not: (Auth part should be URL-encoded) >>> parse.splituser('u@ser:p@ssword@example.com:80') ('u@ser:p@ssword', 'example.com:80') And others are just plain incorrect: (:bad should be the port as defined by the ABNF 'authority = [ userinfo "@" ] host [ ":" port ]') >>> parse.splitport('example.com:bad') ('example.com:bad', None) However, none of these cases (currently) raise exceptions. Looking at urllib.parse, two large motivations behind it are splitting and parsing. In my mind, splitting should solely be responsible for splitting input based on RFC-defined delimiters. Parsing on the other hand, should be responsible for both splitting as necessary as well as input validation. It may also make sense to add simple validation functions to the module to comply with the "batteries included" philosophy, but that's a topic for another issue. My concern with the proposed patch is that it adds further inconsistency to split* methods: Before patch: >>> parse.urlsplit('http://[::1]spam:80') SplitResult(scheme='http', netloc='[::1]spam:80', path='', query='', fragment='') After patch: >>> parse.urlsplit('http://[::1]spam:80') Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/Volumes/src/p/cpython/Lib/urllib/parse.py", line 350, in urlsplit netloc, url = _splitnetloc(url, 2) File "/Volumes/src/p/cpython/Lib/urllib/parse.py", line 324, in _splitnetloc raise ValueError('Invalid IPv6 URL') (While the above examples still yield the same results and don't raise exceptions) I do think that the validation should be done and I agree that an exception should be raised, but it should be done at the urlparse level, not on split (in this case, due to the change to _splitnecloc).
msg237371 - (view)	Author: Demian Brecht (demian.brecht) *	Date: 2015-03-06 18:59
> splitport() shouldn't handle auth, it should be called after auth is dropped with splituser(). Why shouldn't splitport() handle auth? Are you suggesting that in the eyes of splitport() that "user:password@host:port" should be invalid? >>> parse.splitport('user:password@host:80') ('user:password@host', '80') Seems to be reasonable behaviour to me. Why add the artificial constraint of calling splituser() first?
msg238982 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-03-23 05:26
To me, the only difference between urlsplit() and urlparse() is that urlsplit() does not treat parameters following a semicolon specially. The documentation tends to agree with this view. So whatever exceptions are raised by urlparse() should also be raised by urlsplit() for consistency. Also, I doubt it is worth worrying about the finer details of splitport() or whatever, because they are meant to be deprecated and not used externally, except for backwards compatibility.
msg338974 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-03-27 16:16
See also issue36338 for a possible security issue for host of value "benign.com[attacker.com]" (spam[::1] format) where attacker.com is parsed as the host name assuming presence of [ and ] to be a IPV6 address without validation of the value attacker.com inside [] to be a valid IPV6 address. As a datapoint input "http://[::1]spam" raises exception in Java, golang and Ruby. Browser's JS console returns invalid URL. I too would like exception being raised but not sure at which level. Ruby seems to use a regex : https://github.com/ruby/ruby/blob/trunk/lib/uri/rfc3986_parser.rb#L6 Java parseurl : http://hg.openjdk.java.net/jdk/jdk/file/c4c225b49c5f/src/java.base/share/classes/java/net/URLStreamHandler.java#l124 golang : https://github.com/golang/go/blob/50bd1c4d4eb4fac8ddeb5f063c099daccfb71b26/src/net/url/url.go#L587 See also https://url.spec.whatwg.org/#host-parsing If input starts with U+005B ([), then: If input does not end with U+005D (]), validation error, return failure. Return the result of IPv6 parsing input with its leading U+005B ([) and trailing U+005D (]) removed.
msg355315 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-10-24 09:42
With Victor's PR 16780, "http://[::1]spam:80" would raise a ValueError. ValueError: Invalid network location: '[::1]spam:80'

History
Date	User	Action	Args
2022-04-11 14:57:57	admin	set	github: 64470
2022-01-17 12:57:05	vstinner	set	nosy: - vstinner
2022-01-16 19:18:28	iritkatriel	set	versions: + Python 3.9, Python 3.10, Python 3.11, - Python 3.5, Python 3.6, Python 3.7, Python 3.8
2019-10-24 09:42:12	xtreak	set	messages: + msg355315
2019-03-27 16:16:05	xtreak	set	nosy: + xtreak messages: + msg338974 versions: + Python 3.6, Python 3.7, Python 3.8
2015-03-23 05:26:01	martin.panter	set	messages: + msg238982
2015-03-06 18:59:51	demian.brecht	set	stage: needs patch -> patch review
2015-03-06 18:59:25	demian.brecht	set	messages: + msg237371
2015-03-06 18:29:59	demian.brecht	set	nosy: + demian.brecht messages: + msg237369
2015-03-03 18:40:15	serhiy.storchaka	set	files: + urlparse_2.patch messages: + msg237157
2015-03-02 01:10:53	martin.panter	set	nosy: + martin.panter messages: + msg237001
2015-03-01 22:09:09	serhiy.storchaka	set	messages: + msg236990
2014-03-07 16:21:36	vstinner	set	messages: + msg212890
2014-03-07 16:19:40	vstinner	set	messages: + msg212889
2014-03-07 16:19:24	vstinner	set	files: + urlparse.patch nosy: + gvanrossum, vstinner messages: + msg212888 keywords: + patch
2014-02-01 00:23:38	yselivanov	set	nosy: + yselivanov messages: + msg209852 versions: + Python 3.5, - Python 2.7, Python 3.3, Python 3.4
2014-01-21 17:14:26	serhiy.storchaka	set	messages: + msg208675
2014-01-15 12:37:09	serhiy.storchaka	link	issue18191 dependencies
2014-01-15 12:32:52	serhiy.storchaka	set	dependencies: + urllib.parse doesn't work with empty port
2014-01-15 12:32:09	serhiy.storchaka	create