classification
Title: urllib2 works slowly with proxy on windows
Type: performance Stage:
Components: Library (Lib), Windows Versions: Python 3.6, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, juliadolgova, orsenthil, paul.moore, schlamar, steve.dower, tim.golden, zach.ware
Priority: normal Keywords: patch

Created on 2017-02-11 08:43 by juliadolgova, last changed 2017-04-20 13:08 by schlamar.

Files
File name Uploaded Description Edit
checklib.py juliadolgova, 2017-02-11 08:43 just make a call to urllib2.urlopen("https://docs.python.org")
socket.py juliadolgova, 2017-02-11 08:43 not a patched lib. Has debug lines near 141 line. Use it with checklib.py.
test.py juliadolgova, 2017-02-11 08:44 compares performance of gethostbyaddr with gethostbyname_ex
log.txt juliadolgova, 2017-02-11 08:44 result of test.py on my computer (Windows 8, 64 bit)
socket.patch juliadolgova, 2017-02-11 08:44 not actual (wrong) patch
compare_ie_urllib.txt juliadolgova, 2017-02-19 07:52
request.patch juliadolgova, 2017-02-23 06:21 review
checklib-py3.py juliadolgova, 2017-03-30 05:13 make a request to http://python.org/ via proxy using urllib
compare_urllib_progs.png juliadolgova, 2017-04-19 13:22
Pull Requests
URL Status Linked Edit
PR 247 open juliadolgova, 2017-02-23 06:42
Messages (23)
msg287597 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-11 08:43
I've found that urllib sometimes works slowly on Windows when a proxy is configured.

To reproduce the issue:
on Windows:
1. Turn on the option "use proxy" in "browser settings" in "control panel".
No real proxy is needed: the problem appears before the proxy is contacted. Just ignore the resulting exception.
2. Make sure that the list of addresses for proxy bypass is not empty
3. Execute checklib.py with socket.py (attached here) in the same directory

The result output could be:
A (not a problem):
Before call to _socket.gethostbyaddr("docs.python.org")
After call to _socket.gethostbyaddr("docs.python.org")

B (little problem):
Before call to _socket.gethostbyaddr("docs.python.org")
Exception in call to _socket.gethostbyaddr("docs.python.org")

C (worse problem):
Before call to _socket.gethostbyaddr("docs.python.org")
(Delay)
Exception in call to _socket.gethostbyaddr("docs.python.org")

Which result you get (A, B or C) depends on the DNS server you use and the URL you pass to urllib2.urlopen(), and it can differ from one run to the next because DNS responses are not stable.
However, whatever result you get, this test shows that a hostname is passed to gethostbyaddr instead of an IP address, which is what MSDN describes and expects. It should be changed to gethostbyname_ex here.

test.py compares the performance of gethostbyaddr and gethostbyname_ex.
It sets different DNS servers on the system and calls these functions with different hostnames. A run on my computer shows that gethostbyname_ex is 3 times faster and doesn't raise exceptions.
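
The comparison described above can be sketched as follows. This is not the attached test.py: it does not change the system DNS servers, and the names time_lookup and compare_lookups are hypothetical helpers introduced only for illustration.

```python
import socket
import time


def time_lookup(lookup, hostname):
    """Call lookup(hostname) and return (elapsed_seconds, result, error)."""
    start = time.perf_counter()
    try:
        result, error = lookup(hostname), None
    except OSError as exc:
        # socket.herror and socket.gaierror are both subclasses of OSError.
        result, error = None, exc
    return time.perf_counter() - start, result, error


def compare_lookups(hostnames,
                    lookups=(socket.gethostbyaddr, socket.gethostbyname_ex)):
    """Return one (hostname, function name, seconds, succeeded) row per call."""
    rows = []
    for hostname in hostnames:
        for lookup in lookups:
            elapsed, _result, error = time_lookup(lookup, hostname)
            rows.append((hostname, lookup.__name__, elapsed, error is None))
    return rows
```

Calling compare_lookups(["docs.python.org"]) performs real DNS queries, so the timings and failures it reports depend on the DNS server in use.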

-----------------------------
Attached files:
checklib.py - just make a call to urllib2.urlopen("https://docs.python.org")
socket.py - not a patched lib. Has debug lines near 141 line. Use it with checklib.py.
test.py - compare performance of gethostbyaddr with gethostbyname_ex
log.txt - result of test.py on my computer (Windows 8, 64 bit)
socket.patch - socket.py patch
msg287905 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-16 00:05
Surely no one is concerned that programs written in Python could work better when connecting to "python.org"?
msg287944 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2017-02-16 13:44
There's a few reasons why you haven't heard a reply. First among them is that we're all volunteers with limited free time, and second is that we just migrated to github and all that free time is being consumed right now.

Python 2.7 is only receiving security fixes at this point. We might apply a fix for this in 3.5 and later, but you haven't indicated whether it applies to those and (I assume) nobody has tested it yet.

Your initial report is very good and much appreciated, we've just been busy and this doesn't jump out as urgent.
msg287945 - (view) Author: Eryk Sun (eryksun) * Date: 2017-02-16 13:52
gethostbyname_ex won't do a reverse lookup on an IP to get the fully-qualified domain name, which seems pointless for a function named getfqdn. I think calling gethostbyaddr is intentional here and goes back to the Python 1.x days.

Also, FYI, socket_gethostbyaddr in socketmodule.c doesn't pass a name to  C gethostbyaddr. It first calls setipaddr, which calls getaddrinfo to get the IP address. 

For "docs.python.org", the reverse lookup on the IP address has no data. Well, in Windows the error code is WSANO_DATA; in Linux I get HOST_NOT_FOUND.
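
A simplified sketch of the behaviour discussed here, assuming that getfqdn is built on a reverse lookup and falls back to the original name when that lookup fails. The function fqdn_sketch and its reverse_lookup parameter are hypothetical; this is not the library source.

```python
import socket


def fqdn_sketch(name, reverse_lookup=socket.gethostbyaddr):
    """Return a dotted name for `name`, or `name` itself if the lookup fails."""
    try:
        hostname, aliases, _addresses = reverse_lookup(name)
    except OSError:
        # Covers the "no data" / "host not found" errors mentioned above.
        return name
    # Prefer the first candidate that looks fully qualified.
    for candidate in [hostname] + list(aliases):
        if "." in candidate:
            return candidate
    return hostname
```

With the default reverse_lookup the call goes to the network, so any delay in the reverse lookup is paid by the caller even when the result ends up being the unchanged input name.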
msg287976 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-17 02:05
I sincerely appreciate your time. Thank you very much for your answer. I'll try to test this on python 3.5
msg288071 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-18 10:12
The issue applies to 3.6 as well. 

I agree that replacing gethostbyaddr with gethostbyname_ex is not a solution. But is there a way to check whether a hostname is in the <proxy bypass list> that doesn't trigger a reverse lookup? I suppose that IE doesn't make a reverse lookup for each request.
msg288118 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-19 07:52
I compared the behavior of IE and urllib.
I put different addresses into the <proxy bypass list> ("Do not use proxy server for address beginning with" setting), made different requests through IE and urllib, and watched whether the proxy was bypassed.

IE doesn't even make a forward DNS lookup for the given hostname to check whether it should bypass the proxy, whereas urllib does.

For example:
proxy_bypass_list       request                          IE bypasses proxy   urllib bypasses proxy
23.253.135.79           https://python.org/              no                  yes
151.101.76.223          https://docs.python.org/         no                  yes
ovinnik.canonical.com   https://ubuntu.com/              no                  yes

compare_ie_urllib.txt - full report
msg288139 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2017-02-19 14:30
My guess is that IE is implemented using lower level APIs and it can choose whether to bypass based on its own list. There's no reason for any other software to take its settings into account.

That said, it would be great if urllib can avoid adding long delays, at least more than once. I'm personally not sure how best to do that though.
msg288172 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-20 00:35
Why not take it into account?

Imagine that someone wants requests to "ovinnik.canonical.com" to bypass the proxy and requests to "ubuntu.com" not to. I don't know why; it's just a hypothetical.
He adds the hostname "ovinnik.canonical.com" to the <proxy bypass list> and checks requests in IE. He sees that requests to "ovinnik.canonical.com" bypass the proxy and requests to "ubuntu.com" go via the proxy. And that's fine.
But then he discovers that urllib requests to "ubuntu.com" bypass the proxy, which is unexpected.

I think this behavior of urllib should be at least optional.
msg288179 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-20 02:16
http://bugs.python.org/issue23384 - same problem
msg288413 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-23 06:21
I added a variable smart_proxy_bypass to the request module. If it's False, checking the IP and FQDN of the host is skipped in the proxy_bypass_registry function.

I set the default value to False, because I think it's better for the behaviour of urllib to match IE rather than previous versions of urllib. This affects only NT systems.
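
A minimal sketch of the idea, not the attached request.patch. The function bypasses_proxy, its smart flag and the injectable forward/fqdn resolvers are hypothetical names; port handling is omitted, and prefix matching with * and ? wildcards is an assumption based on the "address beginning with" wording of the setting.

```python
import re
import socket


def _entry_to_regex(entry):
    """Translate a bypass-list entry with * and ? wildcards into a regex."""
    return re.escape(entry).replace(r"\*", ".*").replace(r"\?", ".")


def bypasses_proxy(host, bypass_list, smart=False,
                   forward=socket.gethostbyname, fqdn=socket.getfqdn):
    """Return True if `host` matches an entry of `bypass_list`."""
    candidates = [host]
    if smart:
        # Only this branch touches DNS: add the IP address and the FQDN.
        for lookup in (forward, fqdn):
            try:
                resolved = lookup(host)
            except OSError:
                continue
            if resolved != host:
                candidates.append(resolved)
    for entry in bypass_list:
        if entry == "<local>":
            if "." not in host:
                return True
            continue
        pattern = _entry_to_regex(entry)
        if any(re.match(pattern, c, re.I) for c in candidates):
            return True
    return False
```

With smart=False the check compares only the hostname as given, so no DNS request is made; with smart=True an IP address in the list can match a request made by name, which is the behavioural difference from IE reported in this issue.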
msg290482 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-03-25 13:31
Could someone look into my PR, please...
msg290807 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-03-30 05:13
Maybe I didn't describe the problem clearly enough, since English is not my native language, so let me try to explain it again.

In Windows there is an option "Do not use proxy server for address beginning with". I call this option <proxy bypass list>. It was created for Internet Explorer (IE) and is used by IE. Other applications can use it too, and I think it's obvious that they should handle it the same way IE does. Maybe I'm wrong here; if so, please correct me.

The problem is that IE only compares the requested hostname with the items in this list, whereas urllib also performs a reverse lookup and a forward lookup of the hostname and compares the results of those lookups with the items in the list. To reproduce this:
1. Run a proxy on your Windows or use any other proxy, that outputs requests coming from clients.
2. On Windows in "Browser settings" (IE settings) turn on the option "use proxy", set the IP of your proxy, and set the list "Do not use proxy server for address beginning with" to '23.253.135.79' (without the quotes). (23.253.135.79 is the result of 'nslookup python.org' at the time of writing this comment.)
3. Make a request in IE to http://python.org/. Then analyze the output of your proxy. You will see that the request to python.org goes through proxy. 
4. Make a request to http://python.org/ via urllib (run checklib-py3.py). Analyze the output of your proxy. You will see that the request to python.org bypasses proxy.

Be careful: there might be redirections when you make a request to http://python.org/. If you see 'http://www.python.org/' in proxy output and don't see 'http://python.org/' it means that request to 'python.org' bypasses proxy.

This is the behavioral part of the problem, and it comes with a performance cost, because the reverse lookup on some DNS servers is slow for some hostnames (sometimes up to 10 seconds). Maybe the solution in my PR is not smart enough. But how can I move this issue forward?
msg290833 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2017-03-30 08:03
The behaviour you're describing for IE sounds like a bug to me. If you specify a host that should bypass the proxy, then that's what should happen - it shouldn't matter if you specify the host by IP address or by name.

I'm -1 on Python trying to match IE bugs.
msg290841 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-03-30 14:06
OK, but maybe there are some Windows users with a different opinion, who would prefer to put up with this bug in exchange for better performance. Could you give them a way to opt out of this behavior of urllib?
msg290884 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2017-03-30 23:26
I think the point is that we don't want to be grabbing settings like this from other configuration locations. Ideally, there'd be a way to provide a list of "don't bypass the proxy for these names", which a caller could then read from the IE configuration if they want.

The other part of the problem is it seems that nobody on this thread (apart from perhaps you) understands exactly what's going on here :) You may want to post to python-dev and see if anyone who understands the intricacies of how gethostbyaddr() should/does work is willing to chime in.
msg291222 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-04-06 11:02
Steve, do you mean that urllib should not access the IE configuration at all? I could take this on if I understand the task.

gethostbyaddr() is OK. It just makes a reverse lookup, which some DNS servers handle too slowly. The "nslookup" command is also slow under the same conditions. I think the problem is in those DNS servers.
msg291223 - (view) Author: Marc Schlaich (schlamar) * Date: 2017-04-06 12:41
This could even be a security issue.

People might rely on a proxy as a privacy feature. In this case the proxy should do forward/reverse DNS requests and not the client. Doing DNS lookups to check for proxy bypass doesn't seem right. I don't think that major browsers are doing this, at least Firefox is not (https://bugzilla.mozilla.org/show_bug.cgi?id=136789).
msg291823 - (view) Author: Marc Schlaich (schlamar) * Date: 2017-04-18 06:51
Julia, could you please add other major browsers/HTTP clients (Firefox, Chrome, curl, ...) to your comparison (compare_ie_urllib.txt). I would expect that Python/urllib is the only implementation doing DNS requests for proxy bypass handling.

Please note that curl uses the `no_proxy` environment variable, so the syntax is slightly different.

For anyone who doesn't fully grasp the details of this issue, there might be a better explanation at https://github.com/kennethreitz/requests/issues/2988.
msg291884 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-04-19 13:22
I compared the behaviour of urllib with these browsers: Firefox ("use system proxy" selected), Google Chrome, Yandex. And also Skype (requests to login.live.com). None of them makes DNS requests for proxy bypass handling, as Marc expected.
The result is attached: compare_urllib_progs.png
msg291953 - (view) Author: Marc Schlaich (schlamar) * Date: 2017-04-20 08:15
BTW, you can work around this issue by defining the `http_proxy` and `no_proxy` environment variables.

In this case urllib isn't doing any DNS request.
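
A minimal sketch of this workaround, applied from inside the program rather than system-wide. The helper use_explicit_proxy and the proxy address in the usage note are hypothetical; getproxies_environment and proxy_bypass_environment are the urllib.request functions that read these variables.

```python
import os
import urllib.request


def use_explicit_proxy(proxy_url, no_proxy_hosts):
    """Point urllib at proxy_url for this process via environment variables."""
    os.environ["http_proxy"] = proxy_url
    os.environ["https_proxy"] = proxy_url
    # no_proxy is a comma-separated list of host name suffixes.
    os.environ["no_proxy"] = ",".join(no_proxy_hosts)


def bypass_decision(host):
    """Report whether the environment settings tell urllib to skip the proxy."""
    return bool(urllib.request.proxy_bypass_environment(host))
```

For example, after use_explicit_proxy("http://127.0.0.1:3128", ["python.org"]), urllib.request.getproxies_environment() reports the proxy, and bypass_decision("docs.python.org") is decided by comparing names only.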
msg291964 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-04-20 11:44
I'm not sure that users of my program will like it if I define such variables on their systems.
msg291972 - (view) Author: Marc Schlaich (schlamar) * Date: 2017-04-20 13:08
Well, you can read the proxy settings from registry and write them to os.environ (no_proxy needs to be transformed as it has a different format).

This will only take effect for the current process.
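
A rough sketch of that approach, under several assumptions: the settings live in the per-user "Internet Settings" registry key, ProxyServer holds a single host:port value, and bypass entries with wildcards other than a leading "*." are dropped because no_proxy has no equivalent. The helper names override_to_no_proxy and export_registry_proxy_to_environ are hypothetical.

```python
import os

INTERNET_SETTINGS = (
    r"Software\Microsoft\Windows\CurrentVersion\Internet Settings"
)


def override_to_no_proxy(proxy_override):
    """Convert a ';'-separated bypass list into a ','-separated no_proxy value."""
    hosts = []
    for entry in proxy_override.split(";"):
        entry = entry.strip()
        if not entry or entry == "<local>":
            continue
        if entry.startswith("*."):
            entry = entry[2:]  # no_proxy already matches by suffix
        if "*" in entry or "?" in entry:
            continue  # not expressible as a plain suffix; skipped in this sketch
        hosts.append(entry)
    return ",".join(hosts)


def export_registry_proxy_to_environ():
    """Copy the Windows proxy settings into os.environ (Windows only)."""
    import winreg  # imported here so the module still loads elsewhere

    with winreg.OpenKey(winreg.HKEY_CURRENT_USER, INTERNET_SETTINGS) as key:
        enabled = winreg.QueryValueEx(key, "ProxyEnable")[0]
        server = str(winreg.QueryValueEx(key, "ProxyServer")[0])
        try:
            override = str(winreg.QueryValueEx(key, "ProxyOverride")[0])
        except OSError:
            override = ""
    if not enabled or not server or "=" in server:
        return False  # disabled, or per-protocol form not handled here
    os.environ["http_proxy"] = "http://" + server
    os.environ["https_proxy"] = "http://" + server
    os.environ["no_proxy"] = override_to_no_proxy(override)
    return True
```

The conversion is lossy: the registry list is matched by prefix with wildcards, while no_proxy is matched by suffix, so the two lists do not always select the same hosts.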
History
Date User Action Args
2017-04-20 13:08:10 schlamar set messages: + msg291972
2017-04-20 11:44:32 juliadolgova set messages: + msg291964
2017-04-20 08:15:41 schlamar set messages: + msg291953
2017-04-19 13:22:09 juliadolgova set files: + compare_urllib_progs.png; messages: + msg291884
2017-04-18 06:51:59 schlamar set messages: + msg291823
2017-04-06 12:41:01 schlamar set nosy: + schlamar; messages: + msg291223
2017-04-06 11:02:15 juliadolgova set messages: + msg291222
2017-03-30 23:26:24 steve.dower set messages: + msg290884
2017-03-30 14:06:37 juliadolgova set messages: + msg290841
2017-03-30 08:03:51 paul.moore set messages: + msg290833
2017-03-30 05:13:19 juliadolgova set files: + checklib-py3.py; messages: + msg290807
2017-03-25 13:31:42 juliadolgova set messages: + msg290482
2017-02-23 06:42:14 juliadolgova set pull_requests: + pull_request212
2017-02-23 06:21:02 juliadolgova set files: + request.patch; messages: + msg288413
2017-02-20 02:16:17 juliadolgova set messages: + msg288179
2017-02-20 00:35:06 juliadolgova set messages: + msg288172
2017-02-19 14:30:53 steve.dower set messages: + msg288139
2017-02-19 07:52:52 juliadolgova set files: + compare_ie_urllib.txt; messages: + msg288118
2017-02-18 10:12:16 juliadolgova set messages: + msg288071; versions: + Python 3.6
2017-02-17 02:05:26 juliadolgova set messages: + msg287976
2017-02-16 13:52:42 eryksun set nosy: + eryksun; messages: + msg287945
2017-02-16 13:44:23 steve.dower set nosy: + orsenthil; messages: + msg287944
2017-02-16 00:05:44 juliadolgova set messages: + msg287905
2017-02-11 08:44:48 juliadolgova set files: + socket.patch; keywords: + patch
2017-02-11 08:44:34 juliadolgova set files: + log.txt
2017-02-11 08:44:24 juliadolgova set files: + test.py
2017-02-11 08:44:04 juliadolgova set files: + socket.py
2017-02-11 08:43:27 juliadolgova create