Issue 36338: urlparse of urllib returns wrong hostname

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/80519

classification

Title:	urlparse of urllib returns wrong hostname
Type:	security	Stage:	patch review
Components:	Library (Lib)	Versions:	Python 3.11, Python 3.10, Python 3.9, Python 3.8, Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	christian.heimes, iritkatriel, jpic, martin.panter, matrixise, orsenthil, ronaldoussoren, sanebow, xtreak
Priority:	high	Keywords:	patch

Created on 2019-03-18 08:06 by sanebow, last changed 2022-04-11 14:59 by admin.

Files
File name	Uploaded	Description	Edit
test_bug_36338.py	matrixise, 2019-03-18 08:26	Unittest for this issue.

Pull Requests
URL	Status	Linked	Edit
PR 14896	closed	jpic, 2019-07-21 22:35
PR 16780	closed	vstinner, 2019-10-14 14:01

Messages (14)
msg338171 - (view)	Author: Xianbo Wang (sanebow)	Date: 2019-03-18 08:06
The urlparse function in Python urllib returns the wrong hostname when parsing URL crafted by the malicious user. This may be caused by incorrect handling of IPv6 addresses. The bug could lead to open redirect in web applications which rely on urlparse to extract and validate the domain of redirection URL. The test case is as follows: >>> from urllib.parse import urlparse >>> urlparse(urlparse('http://benign.com\[attacker.com]').hostname >>> 'attacker.com' The correct behavior should be raising an invalid URL exception.
msg338172 - (view)	Author: Stéphane Wirtel (matrixise) *	Date: 2019-03-18 08:17
I can confirm with 3.7.2 on fedora 29
msg338173 - (view)	Author: Stéphane Wirtel (matrixise) *	Date: 2019-03-18 08:26
Here is a unittest where you can test this issue and the result on Python 3.8.0a2 and 3.7.2 >>> 3.8.0a2 ./python /tmp/test_bug_36338.py /tmp/test_bug_36338.py:8: SyntaxWarning: invalid escape sequence \[ url = 'http://demo.com\[attacker.com]' 3.8.0a2+ (heads/master:23581c018f, Mar 18 2019, 09:17:05) [GCC 8.3.1 20190223 (Red Hat 8.3.1-2)] test_bad_url (__main__.TestUrlparse) ... FAIL ====================================================================== FAIL: test_bad_url (__main__.TestUrlparse) ---------------------------------------------------------------------- Traceback (most recent call last): File "/tmp/test_bug_36338.py", line 13, in test_bad_url self.assertEqual(hostname, expected_hostname) AssertionError: 'attacker.com' != 'demo.com' - attacker.com + demo.com ---------------------------------------------------------------------- Ran 1 test in 0.001s FAILED (failures=1) >>> 3.7.2 python /tmp/test_bug_36338.py 3.7.2 (default, Jan 16 2019, 19:49:22) [GCC 8.2.1 20181215 (Red Hat 8.2.1-6)] test_bad_url (__main__.TestUrlparse) ... FAIL ====================================================================== FAIL: test_bad_url (__main__.TestUrlparse) ---------------------------------------------------------------------- Traceback (most recent call last): File "/tmp/test_bug_36338.py", line 13, in test_bad_url self.assertEqual(hostname, expected_hostname) AssertionError: 'attacker.com' != 'demo.com' - attacker.com + demo.com ---------------------------------------------------------------------- Ran 1 test in 0.000s FAILED (failures=1)
msg338599 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-03-22 11:27
See also issue20271 that discusses the other format http://[::1]spam where ::1 is returned as hostname. urlparse tries to parse the hostname as IPV6 address when there is [ and parses till ] at [0] thus "benign.com\[attacker.com]" is treated as a URL where attacker.com is assumed as the IPV6 hostname. I am not sure of the correct behavior. FWIW at least Java and golang return "benign.com[attacker.com]" and Ruby raises an exception that this is a bad URL. Java > (.getHost (java.net.URL. "http://benign.com\\[attacker.com]")) "benign.com\\[attacker.com]" golang: https://play.golang.org/p/q8pTo9ySLby [0] https://github.com/python/cpython/blob/c5c6cdada3d41148bdeeacfe7528327b481c5d18/Lib/urllib/parse.py#L199
msg338960 - (view)	Author: Ronald Oussoren (ronaldoussoren) *	Date: 2019-03-27 13:33
Given a quick scan of RFC 3986[1] I'd say that the behaviour of Ruby seems to be the most correct. That said, I'd also check what the major browsers do in this case (FWIW both FF and Safari use 'benign.com' as the hostname in this case). [1] https://tools.ietf.org/html/rfc3986#page-17
msg338972 - (view)	Author: Karthikeyan Singaravelan (xtreak) *	Date: 2019-03-27 16:00
I found this page to be uesful : https://url.spec.whatwg.org/#host-parsing and following the steps it seems that this should raise an error since at the 7th step it denotes that asciiDomain shouldn't contain forbidden host code point including "[]" . As another data point using 'new URL("http://benign.com[attacker.com]")' in browser's Javascript console also raises exception that this is a bad URL. Even if attacker.com is assumed to be the correct host by Python it's not validated to be an IPV6 address where it should fail. Ruby seems to use a regex : https://github.com/ruby/ruby/blob/trunk/lib/uri/rfc3986_parser.rb#L6 Java parseurl : http://hg.openjdk.java.net/jdk/jdk/file/c4c225b49c5f/src/java.base/share/classes/java/net/URLStreamHandler.java#l124 golang : https://github.com/golang/go/blob/50bd1c4d4eb4fac8ddeb5f063c099daccfb71b26/src/net/url/url.go#L587
msg349153 - (view)	Author: Xianbo Wang (sanebow)	Date: 2019-08-07 07:47
Python2 urlparse.urlparse and urllib2.urlparse.urlparse have a similar IPv6 hostname parsing bug. >>> urlparse.urlparse('http://nevil.com[]').hostname >>> 'evil.com[' This is less practical to exploit since the parsed domain contains a '[' in the end. Do I need to create a separate issue for this Python2 bug? I think the way PR 14896 fix the python3 bug can also be applied to this. Also, do we need a CVE ID for the python3 bug? As it may lead to some security issues in some Python apps, e.g., open-redirect. I have found such a case in a private bug bounty program.
msg351511 - (view)	Author: Christian Heimes (christian.heimes) *	Date: 2019-09-09 16:04
The guidelines https://url.spec.whatwg.org/#host-parsing make a lot of sense to me. Python should refuse hostnames with "[" unless * the hostname starts with "[" * the hostname ends with "]" * the string between [] is a valid IPv6 address (full or shortened, without or with correctly quoted scope id) Python should refuse any hostname with forbidden chars "\x00\n\r #%/:?@", too.
msg351589 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-09-10 07:58
To be clear, the \ in 'http://benign.com\[attacker.com]' is not needed to reproduce the bug: vstinner@apu$ python3 Python 3.7.4 (default, Jul 9 2019, 16:32:37) >>> from urllib.parse import urlparse; print(urlparse('http://demo.com[attacker.com]').hostname) attacker.com
msg351590 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-09-10 08:07
Python 3.5 and newer are impacted, but Python 2.7 behaves differently: vstinner@apu$ python2 Python 2.7.16 (default, Apr 30 2019, 15:54:43) >>> from urlparse import urlparse >>> urlparse('http://demo.com[attacker.com]').hostname 'emo.com[attacker.com'
msg354634 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-14 14:07
I proposed PR 16780 which makes the urllib.parse module way more stricter: * the IPv6 address is validated by ipaddress.IPv6Address() parser * invalid characters are rejected in the IPv6 zone: "%", "[" and "]" * the port number is now validated when parsing the URL: must be an integer in the [0; 65535] range Sadly, validating using ipaddress.IPv6Address() cannot be easily ported to Python 2 which doesn't have this module.
msg354746 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-15 17:08
I modified my PR 16780 to also fix bpo-33342: "urllib IPv6 parsing fails with special characters in passwords".
msg355322 - (view)	Author: STINNER Victor (vstinner) *	Date: 2019-10-24 10:38
OMG parsing an URL is a can of worms... There are so many open issues related to URL parsing! * bpo-18191: urllib.parse.splitport("::1") * bpo-20271: urllib.parse.urlparse('http://[::1]spam:80') * bpo-28841: urlparse.urlparse() parses invalid URI without generating an error (examples provided) * bpo-33342: urlsplit("//user:[@host") * bpo-34360: 'http://[::1]]' * bpo-35377: urlparse doesn't validate the scheme * bpo-35748: 'http://www.google.com\@xxx.com' * bpo-36338 (this issue): urlparse('http://demo.com[attacker.com]') * bpo-37678: urlparse('http://user:pass#?[word@example.com:80/path') Related: * bpo-3647: urlparse - relative url parsing and joins to be RFC3986 compliance * bpo-16909: urlparse: add userinfo attribute * bpo-18140: issue with 'http://auser:secr#et@192.168.0.1:8080/a/b/c.html' * bpo-22234: urllib.parse.urlparse accepts any falsy value as an url * bpo-22852: "urllib.parse wrongly strips empty #fragment, ?query, //netloc" * bpo-23328: issue with "http://someuser:a/b@10.11.12.13:1234" * bpo-23448: "urllib2 needs to remove scope from IPv6 address when creating Host header" * bpo-23505: [CVE-2015-2104] Urlparse insufficient validation leads to open redirect There are 124 open issues with "urllib" in their title and 12 open issues with "urlparse" in their title.
msg407558 - (view)	Author: Irit Katriel (iritkatriel) *	Date: 2021-12-02 23:31
It produces a deprecation warning on 3.11, but still does the same. >>> urlparse('http://benign.com\[attacker.com]').hostname <stdin>:1: DeprecationWarning: invalid escape sequence '\[' 'attacker.com'

History
Date	User	Action	Args
2022-04-11 14:59:12	admin	set	github: 80519
2021-12-03 00:01:27	vstinner	set	nosy: - vstinner
2021-12-02 23:31:38	iritkatriel	set	nosy: + iritkatriel messages: + msg407558 versions: + Python 3.10, Python 3.11, - Python 3.5, Python 3.6
2019-10-24 10:38:17	vstinner	set	messages: + msg355322
2019-10-15 17:08:57	vstinner	set	messages: + msg354746
2019-10-14 14:07:13	vstinner	set	messages: + msg354634
2019-10-14 14:01:50	vstinner	set	pull_requests: + pull_request16342
2019-09-10 08:07:39	vstinner	set	messages: + msg351590 versions: + Python 3.5, Python 3.6
2019-09-10 07:58:02	vstinner	set	nosy: + vstinner messages: + msg351589
2019-09-09 16:04:30	christian.heimes	set	priority: normal -> high versions: + Python 3.9
2019-09-09 16:04:14	christian.heimes	set	nosy: + christian.heimes messages: + msg351511
2019-08-07 07:47:54	sanebow	set	messages: + msg349153
2019-07-21 23:13:37	jpic	set	nosy: + jpic
2019-07-21 22:35:59	jpic	set	pull_requests: + pull_request14677
2019-05-15 12:01:19	methane	set	pull_requests: - pull_request13146
2019-05-10 18:42:37	pierreglaser	set	pull_requests: + pull_request13146
2019-03-27 16:00:26	xtreak	set	messages: + msg338972
2019-03-27 13:57:01	xtreak	set	pull_requests: - pull_request12526
2019-03-27 13:33:00	ronaldoussoren	set	nosy: + ronaldoussoren messages: + msg338960
2019-03-27 10:17:02	pierreglaser	set	pull_requests: + pull_request12526
2019-03-27 10:15:52	xtreak	set	pull_requests: - pull_request12525
2019-03-27 10:13:09	pierreglaser	set	stage: patch review pull_requests: + pull_request12525
2019-03-22 11:27:43	xtreak	set	nosy: + xtreak messages: + msg338599 stage: patch review -> (no value)
2019-03-21 15:11:58	xtreak	set	pull_requests: - pull_request12435
2019-03-21 15:09:27	pierreglaser	set	keywords: + patch stage: patch review pull_requests: + pull_request12435
2019-03-18 08:54:34	matrixise	set	versions: + Python 3.8
2019-03-18 08:34:57	xtreak	set	nosy: + martin.panter
2019-03-18 08:26:29	matrixise	set	files: + test_bug_36338.py messages: + msg338173
2019-03-18 08:17:10	matrixise	set	nosy: + matrixise, orsenthil messages: + msg338172
2019-03-18 08:06:11	sanebow	create