New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
urlparse library detecting wrong hostname leads to open redirect vulnerability #79929
Comments
Summary: Steps to reproduce the issue: Following code will help you in reproducing the issue:
Output: The hostname from above URL which is actually rendered by browser is : 'https://www.google.com'. In following browsers tested: (hostname detected as: https://www.google.com)
|
FWIW I understand the backslash should be percent-encoded in URLs, otherwise the URL is not valid. This reminds me of a few other bugs:
I think people some times come up with these invalid URLs because they are trying to make a URL that includes a password with unusual characters (e.g. for the “http_proxy” environment variable). So raising an exception or otherwise changing the parsing behaviour could break those cases. |
Hi, I know that \ (backslash) should be encoded to url encoding (%5c) but if the same url (without urlencoded form) typed into URL bar of browser we are getting hostname to 'https://www.google.com' |
You cannot compare a low level library like Python's urllib module with a user interface like a modern browser. Browsers do a lot of extra work to make sense of user input. For example Firefox and Chrome mangle your example URL and replace \ with /. Firefox even shows a warning when the URL contains user and password: --- Is “python.org” the site you want to visit? |
I just tested other implementations in Ruby and Go and they too return host as "evil.com" for "http://www.google.com@evil.com" along with the user info component. $ ruby -e 'require "uri"; puts URI("http://www.google.com@evil.com").hostname'
evil.com
$ cat /tmp/foo.go
package main import ( func main() { |
There are also some notes at https://tools.ietf.org/html/rfc3986#section-7.6 Because the userinfo subcomponent is rarely used and appears before ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm might lead a human user to assume that the host is 'cnn.example.com', A misleading URI, such as that above, is an attack on the user's In Firefox nightly and latest chrome pasting the above URL makes a request to 10.0.0.1/top_story.htm where in Chrome the URL in the address bar is changed to 10.0.0.1/top_story.htm and Firefox has the same URL in the address bar. Python also returns '10.0.0.1' as the hostname for the above example using urlparse. |
I believe that Python's behaviour here is correct. You are supplying a netloc which includes a username "www.google.com\" with no password. That might be what you intend to do, or it might be malicious data. That depends on context, and the urlparse module can't tell what the context is and has no reason to assume malice. If I am reading this correctly: https://tools.ietf.org/html/rfc1738#section-3.1 the colon after the username can be omitted, so the URL is legal and Python has returned the correct value for the netloc. As Christian says, Python is not an end-user application like a browser. It is right and proper for a browser to expect that the user is non-technical and may not have noticed the @ sign, and to expect malicious behaviour, or to assume that backslash \ is a typo for forward slash / but Python programmers by definition are technical users and it is their responsibility to validate their data. There are legitimate uses for the userinfo component (user:password@hostname) and it is not the library's responsibility to assume that backslashes are typos for forward slashes. So I think that the behaviour here is correct, and this should be closed. But if you disagree, please explain what you think the library should do, and why. WHen you do, remember that:
|
The “urllib.parse” module generally follows RFC 3986, which does not allow a literal backslash in the “userinfo” part: userinfo = *( unreserved / pct-encoded / sub-delims / ":" ) The RFC does not allow a backslash in the host name, path, query or fragment either. That is why I said the URL is not valid. |
And yet the parse() function seems to allow arbitrary unescaped py> from urllib.parse import urlparse If that's a bug, it is a separate bug to this issue. Backslash doesn't seem relevant to the security issue of userinfo being py> urlparse('http://www.google.com@evil.com').netloc If it is relevant, can somebody explain to me how? |
From RFC-1738: hostname = *[ domainlabel "." ] toplabel However: py> urlparse('https://foo\\\\bar/baz') The hostname's BNF doesn't allow for a backslash ('\\') character, so I'd expect urlparse to raise a ValueError for this "URL". |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: