classification
Title: Error code 10035 calling socket.recv() on a socket with a timeout (WSAEWOULDBLOCK - A non-blocking socket operation could not be completed immediately)
Type: behavior Stage: patch review
Components: IO, Windows Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Ben.Smith, Okko.Willeboordse, bjourne, ehohenstein, exarkun, giampaolo.rodola, gregory.p.smith, jcea, kristjan.jonsson, neologix, pitrou, python-dev
Priority: normal Keywords: patch

Created on 2010-06-27 01:38 by ehohenstein, last changed 2013-03-22 17:27 by kristjan.jonsson. This issue is now closed.

Files
File name Uploaded Description Edit
sock_recv.patch ehohenstein, 2010-06-27 01:38 Fix for 10035 error calling recv() on Windows
selectretry.patch pitrou, 2010-09-18 22:13
selectretry2.patch pitrou, 2010-09-28 16:19 review
socket_10035.patch ehohenstein, 2010-12-22 18:12
socket.patch kristjan.jonsson, 2013-03-18 20:05
Messages (25)
msg108770 - (view) Author: Eric Hohenstein (ehohenstein) Date: 2010-06-27 01:38
This error is unfortunately difficult to reproduce. I've only seen it happen on Windows XP running on a dual core VMWare VM. I haven't been able to reproduce it on a non-VM system running Windows 7. The only way I've been able to reproduce it is to run the following unit test repeatedly on the XP VM repeatedly until it fails:

import unittest
import urllib2

class DownloadUrlTest(unittest.TestCase):
    def testDownloadUrl(self):
        opener = urllib2.build_opener()
        handle = opener.open('http://localhost/', timeout=60)
        handle.info()
        data = handle.read()
        self.assertNotEqual(data, '')

if __name__ == "__main__":
    unittest.main()

This unit test obviously depends on a web server running on localhost. In the test environment where I was able to reproduce this problem the web server is Win32 Apache 2.0.54 with mod_php. When the test fails, it fails with Windows error code 10035 (WSAEWOULDBLOCK) being generated by the call to the recv() method rougly once every 50-100 times the test is run. The following is a the final entry in the stack when the error occurs:

  File "c:\slave\h05b15\build\Ext\Python26\lib\socket.py", line 353, in read (self=<socket._fileobject ...03B78170>, size=1027091)
    data = self._sock.recv(left)

The thing to note is that the socket is being created with a timeout of 60. The implementation of the socket.recv() method in socketmodule.c in the _socket import module is to use select() to wait for a socket to become readable for socket objects with a timeout and then to call recv() on the socket only if select() did not return indicating that the timeout period elapsed without the socket becoming readable. The fact that Windows error code 10035 (WSAEWOULDBLOCK) is being generated in the sock_recv_guts() method in socketmodule.c indicates that select() returned without timing out which means that Windows must have indicated that the socket is readable when in fact it wasn't. It appears that there is a known issue with Windows sockets where this type of problem may occur with non-blocking sockets. It is described in the msdn documentation for WSAAsyncSelect() (http://msdn.microsoft.com/en-us/library/ms741540%28VS.85%29.aspx). The code for socketmodule.c doesn't seem to handle this type of situation correctly. The patch I've included with this issue report retries the select() if the recv() call fails with WSAWOULDBLOCK (only if MS_WINDOWS is defined). With the patch in place the test ran approximately 23000 times without failure on the system where it was failing without the patch.
msg116498 - (view) Author: Ben Smith (Ben.Smith) Date: 2010-09-15 23:42
I also see this issue on occasion on windows XP SP 3, using python 2.6.5 to fetch large files via http.

The error is infrequent, but it is happening in my situation without a VM.
msg116499 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-16 00:02
> It appears that there is a known issue with Windows sockets where this 
> type of problem may occur with non-blocking sockets. It is described in 
> the msdn documentation for WSAAsyncSelect()
> (http://msdn.microsoft.com/en-us/library/ms741540%28VS.85%29.aspx).

That documentation doesn't seem to describe the same kind of situation; it is about delayed notification through Windows messages (if you read the sequence they given in example, it's quite logical why it can fail).

Have you tried instrumenting sock_recv_guts() and dumping the actual return values and errnos (from both internal_select() and recv())?
msg116553 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-09-16 14:29
Actually, it's possible that select(2) incorrectly reports sockets as ready for reading : for example if the socket receives data, but then discards it because of an invalid checksum (and I guess you're more likely to get this type of problem on a VM or while copying large files).
msg116554 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-16 14:36
So it means we should indeed retry on a socket with timeout... But we must take care not to exceed the original timeout, so we must measure the time taken by each select() call.
msg116814 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-09-18 17:09
Unfortunately, select doesn't necessarily update the timeout variable with the remaining time, so we can't rely on this. This would mean having the select enclosed within gettimeofday and friends, which seems a bit overkill...
msg116815 - (view) Author: Jean-Paul Calderone (exarkun) * (Python committer) Date: 2010-09-18 17:11
> Unfortunately, select doesn't necessarily update the timeout variable with the remaining time, so we can't rely on this. This would mean having the select enclosed within gettimeofday and friends, which seems a bit overkill...

How is it "overkill" to be correct?
msg116816 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-18 17:39
> Unfortunately, select doesn't necessarily update the timeout variable
> with the remaining time, so we can't rely on this. This would mean
> having the select enclosed within gettimeofday and friends, which
> seems a bit overkill...

Well, given the general cost of Python function calls and bytecode
interpretation, it would probably not be much of a runtime overhead. So
it's mainly some additional, not very exciting code to write :-)

(luckily, in 3.2 we have a cross-platform gettimeofday() abstraction in
pytime.h which will ease things quite a bit)
msg116817 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-09-18 18:01
Yes, I was concerned with the runtime overhead (especially under VMs, where clocks are emulated) and strace output ;-)
As for the dull code writting, I could give it a shot unless someone else wants to go ahead.
msg116820 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2010-09-18 18:19
@ehohenstein: sorry, I hadn't seen you'd already submitted a patch, so you may as well update it to take into account what was said above. Also, please note that this issue is not specific to windows, so it would be nice to fix it for Unices too.
msg116830 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-18 22:06
Here is a proof of concept patch for Python 3.2. It only wraps recv() and recv_into(). recvfrom() and recvfrom_into() should receive the same treatment, at least.
msg117535 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-28 16:19
Here is an updated patch wrapping all variants of recv() and send(), except sendall() which already has its own retry loop.
msg117549 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-28 21:24
Committed in 3.2 in r85074. I don't plan to backport it, since the _PyTime_gettimeofday abstraction is not available on earlier versions.
msg124515 - (view) Author: Eric Hohenstein (ehohenstein) Date: 2010-12-22 18:12
I have ported the changes related to this problem from the 3.2 branch to the 2.6 version of socketmodule.c. The changes are attached as a diff from Python 2.6.2. The changes apply to all platforms but I've only tested them on Windows.

The _PyTime_gettimeofday method is not available in 2.6 which is why the changes in 3.2 weren't originally back ported. I admit to adding a disgusting hack which was to copy some of the _PyTime_gettimeofday interface code from 3.2 to the socketmodule.c file and implement it using the time.time() method, falling back to the crt time() method. It's not as efficient as the implementation in 3.2 but I believe it should be equally correct.

The motivation for doing this was that I continued to see 10035 errors happening using Python 2.6 though in different code paths. Specifically, errors were being thrown when uploading a file using a PUT request using httplib which calls sendall(). It's noteworthy that analysing the changes made for this issue to Python 3.2 revealed that no change was made to the sendall() method. sendall() is actually problematic in that the timeout on the socket may actually be exceeded without error if any one call to select() doesn't exceed the socket's timeout but in aggregate the calls to select do wait longer than the timeout. The same generic solution that was applied to the other socket methods is not appropriate for sendall(). I elected to continue this behavior by just checking for EAGAIN and EWOULDBLOCK if the socket has a positive timeout value and the call to send failed and continuing the select/send loop in that case. As far as I can tell, sendall() will still fail with these recoverable errors in Python 3.2. I won't feel bad if this patch is rejected for 2.6 but the changes to sendall() should really be considered for the 3.2 branch.
msg124607 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-12-24 20:31
2.6 is closed except for security fixes, which this does not seem to be. If the problem is in 2.7, then it potentially could be fixed there, but with the same caveats. I will let Antoine reclose if he thinks appropriate.
msg125674 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-07 18:30
I will not bother backporting myself but an other core developer can do it if (s)he desires.
msg141221 - (view) Author: Björn Lindqvist (bjourne) Date: 2011-07-27 11:52
I don't have the expertise to backport it myself, but the problem certainly is still present in python 2.7.1 on Windows 7. It is especially pronounced when using threading to read from multiple url files.
msg148736 - (view) Author: Okko Willeboordse (Okko.Willeboordse) Date: 2011-12-01 20:01
I bumped into this issue at one of my customers that use Python to control a machine through Beckhoff EtherCAT. The Python code communicates with the Beckhoff EtherCAT program using TCP/IP.
They use non blocking sockets like so;
s = select.select([self._socket], [], [], timeout)[0]
if not s:
  raise NoData
self._socket.recv(length)

They also found that the recv occasionally raises a 10035.
I changed the code in to;
s = select.select([self._socket], [], [], timeout)[0]
if not s:
  raise NoData
try:
  buffer_ = self._socket.recv(length)
except socket.error as inst:
  if (gaius.utils.Platform.WINDOWS and 10035 == inst.args[0]) or \
     (gaius.utils.Platform.LINUX and 11 == inst.args[0]):
    raise NoData

So this issue applies also to sockets without timeout, albeit it can be worked around easily. 

Also note that this also applies to Linux as the man page of select states in the BUG section;

Under Linux, select() may report a socket file descriptor as "ready for
reading",  while  nevertheless a subsequent read blocks. This could for
example happen when data has arrived but  upon  examination  has  wrong
checksum  and is discarded. There may be other circumstances in which a
file descriptor is spuriously reported as ready.  Thus it may be  safer
to use O_NONBLOCK on sockets that should not block.

Note that Linux select is not Posix compliant as Posix states;

A descriptor shall be considered ready for reading when a call to an input function with O_NONBLOCK clear would not block, whether or not the function would transfer data successfully.

See https://lkml.org/lkml/2011/6/18/76

MSDN select says;

For other sockets, readability means that queued data is available for reading such that a call to recv, WSARecv, WSARecvFrom, or recvfrom is guaranteed not to block.

http://support.microsoft.com/kb/177346 says (for Windows 95);

The Winsock select() API might fail to block on a nonblocking socket and return WSAEWOULDBLOCK as an error code when either send() or recv() is subsequently called. For example, select() may return indicating there is data to read, yet a call to recv() returns with the error code WSAEWOULDBLOCK, indicating there is no data immediately available. Windows NT 4.0 does not exhibit this behavior.


Finally I have 2 questions;

Is this select behavior on Windows 'normal'? Noting that it is not documented. or does it behaves this way due to crippled NIC's or drivers (VMWare)?

Can this behavior be reproduced? I need this for automatic testing. Now these exceptional paths cannot be tested.
msg173351 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-10-19 18:22
Backport to 2.7 should be done: See Issue #16274.
msg184390 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2013-03-17 18:33
I will backport this.  I have recently seen this happening in 2.7 in our company and it would make sense to fix this before 2.7.4 is released.
msg184511 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2013-03-18 20:05
Here is a patch for 2.7
Since 2.7 doesn't have pytime.c, we export floattime() as _Py_floattime out of time.c
msg184654 - (view) Author: Roundup Robot (python-dev) Date: 2013-03-19 18:08
New changeset 8ec39bfd1f01 by Kristján Valur Jónsson in branch '2.7':
Issue #9090 : Error code 10035 calling socket.recv() on a socket with a timeout
http://hg.python.org/cpython/rev/8ec39bfd1f01
msg184667 - (view) Author: Roundup Robot (python-dev) Date: 2013-03-19 20:02
New changeset f22b93b318a5 by Kristján Valur Jónsson in branch '2.7':
issue #9090 : Limit the fix to windows since getting a portable simple
http://hg.python.org/cpython/rev/f22b93b318a5
msg184679 - (view) Author: Roundup Robot (python-dev) Date: 2013-03-19 20:57
New changeset 0f5e1e642dc3 by Kristján Valur Jónsson in branch '2.7':
issue #9090 : Take the same approach for socketmodule as daytimemodule
http://hg.python.org/cpython/rev/0f5e1e642dc3
msg184990 - (view) Author: Kristján Valur Jónsson (kristjan.jonsson) * (Python committer) Date: 2013-03-22 17:27
fixed for 2.7
History
Date User Action Args
2013-03-22 17:27:24kristjan.jonssonsetstatus: open -> closed
resolution: fixed
messages: + msg184990
2013-03-19 20:57:20python-devsetmessages: + msg184679
2013-03-19 20:02:17python-devsetmessages: + msg184667
2013-03-19 18:08:12python-devsetnosy: + python-dev
messages: + msg184654
2013-03-18 20:05:33kristjan.jonssonsetfiles: + socket.patch

messages: + msg184511
2013-03-17 18:33:59kristjan.jonssonsetnosy: + kristjan.jonsson
messages: + msg184390
2012-10-19 18:22:40jceasetnosy: + jcea
messages: + msg173351
2011-12-02 01:57:07terry.reedysetnosy: - terry.reedy
2011-12-01 20:01:10Okko.Willeboordsesetnosy: + Okko.Willeboordse
messages: + msg148736
2011-07-27 11:52:45bjournesetnosy: + bjourne
messages: + msg141221
2011-01-07 18:30:24pitrousetnosy: terry.reedy, gregory.p.smith, exarkun, pitrou, giampaolo.rodola, neologix, ehohenstein, Ben.Smith
messages: + msg125674
2010-12-24 20:31:55terry.reedysetversions: + Python 2.7, - Python 2.6
nosy: + terry.reedy

messages: + msg124607

stage: resolved -> patch review
2010-12-22 18:12:52ehohensteinsetstatus: closed -> open
files: + socket_10035.patch
versions: + Python 2.6, - Python 3.2
nosy: gregory.p.smith, exarkun, pitrou, giampaolo.rodola, neologix, ehohenstein, Ben.Smith
messages: + msg124515

resolution: fixed -> (no value)
2010-09-29 10:52:40pitrousetstatus: pending -> closed
2010-09-28 21:24:13pitrousetstatus: open -> pending
versions: - Python 3.1, Python 2.7
messages: + msg117549

resolution: fixed
stage: patch review -> resolved
2010-09-28 16:19:22pitrousetfiles: + selectretry2.patch

messages: + msg117535
2010-09-18 22:13:23pitrousetfiles: + selectretry.patch
2010-09-18 22:13:04pitrousetfiles: - selectretry.patch
2010-09-18 22:06:44pitrousetnosy: + gregory.p.smith
2010-09-18 22:06:17pitrousetfiles: + selectretry.patch

messages: + msg116830
stage: patch review
2010-09-18 18:19:30neologixsetmessages: + msg116820
2010-09-18 18:01:49neologixsetmessages: + msg116817
2010-09-18 17:39:24pitrousetmessages: + msg116816
2010-09-18 17:11:13exarkunsetmessages: + msg116815
2010-09-18 17:09:45neologixsetmessages: + msg116814
2010-09-16 14:36:50pitrousetmessages: + msg116554
2010-09-16 14:29:12neologixsetnosy: + neologix
messages: + msg116553
2010-09-16 00:02:08pitrousetnosy: + pitrou
messages: + msg116499
2010-09-15 23:46:45pitrousetnosy: + exarkun, giampaolo.rodola

versions: + Python 3.1, Python 2.7, Python 3.2, - Python 2.6
2010-09-15 23:42:29Ben.Smithsetnosy: + Ben.Smith
messages: + msg116498
2010-06-27 01:38:45ehohensteincreate