classification
Title: urllib2 works slowly with proxy on windows
Type: performance Stage:
Components: Library (Lib), Windows Versions: Python 3.6, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, juliadolgova, orsenthil, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal Keywords: patch

Created on 2017-02-11 08:43 by juliadolgova, last changed 2017-02-23 06:42 by juliadolgova.

Files
File name Uploaded Description
checklib.py juliadolgova, 2017-02-11 08:43
socket.py juliadolgova, 2017-02-11 08:43
test.py juliadolgova, 2017-02-11 08:44
log.txt juliadolgova, 2017-02-11 08:44
socket.patch juliadolgova, 2017-02-11 08:44
compare_ie_urllib.txt juliadolgova, 2017-02-19 07:52
request.patch juliadolgova, 2017-02-23 06:21 review
Pull Requests
URL Status Linked
PR 247 open juliadolgova, 2017-02-23 06:42
Messages (11)
msg287597 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-11 08:43
I've found that urllib sometimes works slowly on Windows when a proxy is configured.

To reproduce the issue on Windows:
1. Turn on the "use proxy" option under "browser settings" in the Control Panel.
No real proxy is needed. The problem appears before the proxy is contacted, so just ignore the resulting exception.
2. Make sure that the list of addresses for proxy bypass is not empty.
3. Run checklib.py with socket.py (both attached here) in the same directory.

The result output could be:
A (not a problem):
Before call to _socket.gethostbyaddr("docs.python.org")
After call to _socket.gethostbyaddr("docs.python.org")

B (little problem):
Before call to _socket.gethostbyaddr("docs.python.org")
Exception in call to _socket.gethostbyaddr("docs.python.org")

C (worse problem):
Before call to _socket.gethostbyaddr("docs.python.org")
(Delay)
Exception in call to _socket.gethostbyaddr("docs.python.org")

Whether you get result A, B or C depends on which DNS server you use and which URL you pass to urllib2.urlopen(), and it can vary over time because DNS responses are not stable.
However, whichever result you get, this test shows that a hostname is passed to gethostbyaddr instead of an IP address, which is what MSDN describes as the expected argument. The call should be changed to gethostbyname_ex here.

test.py compares the performance of gethostbyaddr and gethostbyname_ex.
It sets different DNS servers on the system and calls both functions with different hostnames. The run on my computer shows that gethostbyname_ex is about 3 times faster and doesn't raise exceptions.
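
The kind of comparison described above can be sketched as follows. This is a minimal illustration, not the attached test.py: the helper names time_resolver and compare_real_lookups are hypothetical, the hostnames are only examples, and unlike test.py it does not change the system DNS servers.

```python
import socket
import time


def time_resolver(resolver, hostnames, repeats=3):
    """Call resolver(host) for every hostname, `repeats` times over.

    Returns (elapsed_seconds, failures), where failures counts calls that
    raised OSError (socket.herror and socket.gaierror are subclasses).
    """
    failures = 0
    start = time.perf_counter()
    for _ in range(repeats):
        for host in hostnames:
            try:
                resolver(host)
            except OSError:
                failures += 1
    return time.perf_counter() - start, failures


def compare_real_lookups(hostnames=("docs.python.org", "python.org")):
    """Hypothetical driver: time both socket functions against real DNS."""
    for fn in (socket.gethostbyaddr, socket.gethostbyname_ex):
        elapsed, failed = time_resolver(fn, hostnames)
        print(f"{fn.__name__}: {elapsed:.3f}s, {failed} failed lookups")
```

Calling compare_real_lookups() performs real DNS queries, so the timings depend on the resolver in use and will differ between runs.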

-----------------------------
Attached files:
checklib.py - just makes a call to urllib2.urlopen("https://docs.python.org")
socket.py - the unpatched library module with debug lines added near line 141. Use it with checklib.py.
test.py - compares the performance of gethostbyaddr with gethostbyname_ex
log.txt - result of test.py on my computer (Windows 8, 64 bit)
socket.patch - patch for socket.py
msg287905 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-16 00:05
Surely no one is concerned that programs written in Python could work better when accessing "python.org"?
msg287944 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2017-02-16 13:44
There are a few reasons why you haven't heard a reply. First among them is that we're all volunteers with limited free time, and second is that we just migrated to GitHub and all that free time is being consumed right now.

Python 2.7 is only receiving security fixes at this point. We might apply a fix for this in 3.5 and later, but you haven't indicated whether it applies to those and (I assume) nobody has tested it yet.

Your initial report is very good and much appreciated, we've just been busy and this doesn't jump out as urgent.
msg287945 - (view) Author: Eryk Sun (eryksun) * Date: 2017-02-16 13:52
gethostbyname_ex won't do a reverse lookup on an IP to get the fully-qualified domain name, which seems pointless for a function named getfqdn. I think calling gethostbyaddr is intentional here and goes back to the Python 1.x days.

Also, FYI, socket_gethostbyaddr in socketmodule.c doesn't pass a name to the C gethostbyaddr function. It first calls setipaddr, which calls getaddrinfo to get the IP address.

For "docs.python.org", the reverse lookup on the IP address has no data. Well, in Windows the error code is WSANO_DATA; in Linux I get HOST_NOT_FOUND.
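
For readers following along, the lookup discussed above can be sketched roughly as below. This is a paraphrase of how socket.getfqdn behaves, not the exact standard library source; the name getfqdn_sketch and the injectable gethostbyaddr parameter are assumptions added so the logic can be exercised without DNS.

```python
import socket


def getfqdn_sketch(name="", gethostbyaddr=socket.gethostbyaddr):
    """Approximate the getfqdn logic: one gethostbyaddr call, then pick
    the first returned name that contains a dot."""
    name = name.strip()
    if not name or name == "0.0.0.0":
        name = socket.gethostname()
    try:
        # Resolves the name to an address and then does a reverse lookup.
        hostname, aliases, _ipaddrs = gethostbyaddr(name)
    except OSError:
        # Reverse lookup failed (e.g. no PTR data): fall back to the input.
        return name
    for candidate in [hostname] + list(aliases):
        if "." in candidate:
            return candidate
    return hostname
```

When the reverse lookup has no data, as reported for "docs.python.org", the function just returns the name it was given, but only after the failed lookup has completed.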
msg287976 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-17 02:05
I sincerely appreciate your time. Thank you very much for your answer. I'll try to test this on Python 3.5.
msg288071 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-18 10:12
The issue applies to 3.6 as well. 

I agree that replacing gethostbyaddr with gethostbyname_ex is not a solution. But is there a way to check whether a hostname is in the <proxy bypass list> without triggering a reverse lookup? I suppose that IE doesn't make a reverse lookup for each request.
msg288118 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-19 07:52
I compared the behavior of IE and urllib.
I put different addresses into the <proxy bypass list> (the "Do not use proxy server for addresses beginning with" setting), made different requests through IE and urllib, and checked whether the proxy was bypassed.

IE doesn't even make a forward DNS lookup for the given hostname to check whether it should bypass the proxy, whereas urllib does.

For example:
proxy_bypass_list       request                          IE bypasses proxy   urllib bypasses proxy
23.253.135.79           https://python.org/              no                  yes
151.101.76.223          https://docs.python.org/         no                  yes
ovinnik.canonical.com   https://ubuntu.com/              no                  yes

compare_ie_urllib.txt - full report
msg288139 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2017-02-19 14:30
My guess is that IE is implemented using lower level APIs and it can choose whether to bypass based on its own list. There's no reason for any other software to take its settings into account.

That said, it would be great if urllib can avoid adding long delays, at least more than once. I'm personally not sure how best to do that though.
msg288172 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-20 00:35
Why not take it into account?

Imagine that someone wants requests to "ovinnik.canonical.com" to bypass the proxy and requests to "ubuntu.com" not to. I don't know why; it's just an assumption.
He adds the hostname "ovinnik.canonical.com" to the <proxy bypass list> and checks requests in IE. He sees that requests to "ovinnik.canonical.com" bypass the proxy and requests to "ubuntu.com" go via the proxy, as expected.
But then he discovers that urllib requests to "ubuntu.com" bypass the proxy, which is unexpected.

I think this behavior of urllib should be at least optional.
msg288179 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-20 02:16
http://bugs.python.org/issue23384 - same problem
msg288413 - (view) Author: Julia Dolgova (juliadolgova) * Date: 2017-02-23 06:21
I added a variable, smart_proxy_bypass, to the request module. If it is False, the check of the host's IP address and FQDN is skipped in the proxy_bypass_registry function.

I set the default value to False, because I think it is better for urllib's behaviour to match IE than to match previous versions of urllib. This affects only NT systems.
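
The effect of such a switch can be illustrated with a small sketch. This is not the attached request.patch and not the real proxy_bypass_registry code: the function should_bypass, its smart parameter and the injectable resolvers are hypothetical, and pattern matching is simplified to fnmatch-style wildcards.

```python
import fnmatch
import socket


def should_bypass(host, bypass_list, smart=False,
                  resolve_ip=socket.gethostbyname,
                  resolve_fqdn=socket.getfqdn):
    """Return True if host matches an entry in bypass_list.

    With smart=False only the literal hostname is compared (no DNS).
    With smart=True the resolved IP and FQDN are compared as well,
    which is where the extra lookups and delays come from.
    """
    candidates = [host]
    if smart:
        for resolver in (resolve_ip, resolve_fqdn):
            try:
                resolved = resolver(host)
            except OSError:
                continue  # lookup failed: nothing more to compare
            if resolved != host:
                candidates.append(resolved)
    return any(
        fnmatch.fnmatch(candidate.lower(), pattern.lower())
        for pattern in bypass_list
        for candidate in candidates
    )
```

With a bypass list containing only "23.253.135.79" and a resolver that maps "python.org" to that address, smart=False reports no bypass for "python.org" (the IE result in the comparison) and smart=True reports a bypass (the urllib result).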
History
Date User Action Args
2017-02-23 06:42:14 juliadolgova set pull_requests: + pull_request212
2017-02-23 06:21:02 juliadolgova set files: + request.patch; messages: + msg288413
2017-02-20 02:16:17 juliadolgova set messages: + msg288179
2017-02-20 00:35:06 juliadolgova set messages: + msg288172
2017-02-19 14:30:53 steve.dower set messages: + msg288139
2017-02-19 07:52:52 juliadolgova set files: + compare_ie_urllib.txt; messages: + msg288118
2017-02-18 10:12:16 juliadolgova set messages: + msg288071; versions: + Python 3.6
2017-02-17 02:05:26 juliadolgova set messages: + msg287976
2017-02-16 13:52:42 eryksun set nosy: + eryksun; messages: + msg287945
2017-02-16 13:44:23 steve.dower set nosy: + orsenthil; messages: + msg287944
2017-02-16 00:05:44 juliadolgova set messages: + msg287905
2017-02-11 08:44:48 juliadolgova set files: + socket.patch; keywords: + patch
2017-02-11 08:44:34 juliadolgova set files: + log.txt
2017-02-11 08:44:24 juliadolgova set files: + test.py
2017-02-11 08:44:04 juliadolgova set files: + socket.py
2017-02-11 08:43:27 juliadolgova create