classification
Title: SO_REUSEADDR doesn't have the same semantics on Windows as on Unix
Type: behavior Stage: resolved
Components: Library (Lib), Windows Versions: Python 3.1, Python 3.2, Python 2.7, Python 2.6, Python 2.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: trent Nosy List: amak, exarkun, forest, gvanrossum, nnorwitz, pitrou, trent
Priority: high Keywords: 26backport, patch

Created on 2008-04-04 15:57 by trent, last changed 2010-04-27 21:26 by pitrou. This issue is now closed.

Files
File name Uploaded Description
test_socket.py.patch trent, 2008-04-04 15:57 Patch to trunk/Lib/test/test_socket.py
trunk.2550.patch trent, 2008-04-06 21:24
trunk.2550-2.patch trent, 2008-04-08 11:49
Messages (12)
msg64933 - Author: Trent Nelson (trent) * (Python committer) Date: 2008-04-04 15:57
Background: I came across this issue when trying to track down why 
test_asynchat would periodically wedge python processes on the Windows 
buildbots, to the point that they wouldn't even respond to SIGKILL (or 
ctrl-c on the console).

What I found after a bit of digging is that Windows doesn't raise 
EADDRINUSE socket.errors when you bind() two sockets to identical 
host/ports *IFF* SO_REUSEADDR has been set as a socket option.

Decided to brighten up my tube journey into work this morning by 
reading the Gospel's take on the situation.  As per the 'SO_REUSEADDR 
and SO_REUSEPORT Socket Options' section in chapter 7.5 of Stevens' 
UNIX Network Programming Volume 1 (2nd Ed):

"With TCP, we are never able to start multiple servers that bind
 the same IP address and same port: a completely duplicate binding.
 That is, we cannot start one server that binds 198.69.10.2 port 80
 and start another that also binds 198.69.10.2 port 80, even if we
 set the SO_REUSEADDR socket option for the second server."

So it seems Windows isn't adhering to this, at least on XP 
and Server 2008 with Python 2.5-2.6.  I've patched test_socket.py to 
explicitly test for this situation -- as expected, it passes on Unix 
(tested on FreeBSD in particular), and fails on Windows.  I'd like to 
commit this to trunk to see if any of the buildbots for different 
platforms match the behaviour of Windows.
msg65050 - Author: Trent Nelson (trent) * (Python committer) Date: 2008-04-06 21:20
[Updating the issue with relevant mailing list conversation]
Interesting results!  I committed the patch to test_socket.py in 
r62152.  I was expecting all other platforms except for Windows to 
behave consistently (i.e. pass).  That is, given the following:

        import socket
        host = '127.0.0.1'
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock.bind((host, 0))
        port = sock.getsockname()[1]
        sock.close()
        del sock

        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock1.bind((host, port))
        sock2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        sock2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock2.bind((host, port))
        ^^^^

....the second bind should fail with EADDRINUSE, at least according to 
the 'SO_REUSEADDR and SO_REUSEPORT Socket Options' section in chapter 
7.5 of Stevens' UNIX Network Programming Volume 1 (2nd Ed):

"With TCP, we are never able to start multiple servers that bind
 the same IP address and same port: a completely duplicate binding.
 That is, we cannot start one server that binds 198.69.10.2 port 80
 and start another that also binds 198.69.10.2 port 80, even if we
 set the SO_REUSEADDR socket option for the second server."

The results: both Windows *and* Linux fail the patched test; none of 
the buildbots for either platform encountered an EADDRINUSE 
socket.error after the second bind().  FreeBSD, OS X, Solaris and Tru64 
pass the test -- EADDRINUSE is raised on the second bind.  (Interesting 
that all the ones that passed have a BSD lineage.)

I've just reverted the test in r62156 as planned.  The real issue now 
is that there are tests that call test_support.bind_port() 
on the assumption that the port it returns is 'unbound', 
when in fact the current implementation can't guarantee this:

def bind_port(sock, host='', preferred_port=54321):
    for port in [preferred_port, 9907, 10243, 32999, 0]:
        try:
            sock.bind((host, port))
            if port == 0:
                port = sock.getsockname()[1]
            return port
        except socket.error, (err, msg):
            if err != errno.EADDRINUSE:
                raise
            print >>sys.__stderr__, \
                '  WARNING: failed to listen on port %d, trying another' % port

This logic is only correct for platforms other than Windows and Linux.  
I haven't looked into all the networking test cases that rely on 
bind_port(), but I would think an implementation such as this would be 
much more reliable than what we've got for returning an unused port:

def bind_port(sock, host='127.0.0.1', *args):
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((host, 0))
    port = s.getsockname()[1]
    s.close()
    del s

    sock.bind((host, port))
    return port

Actually, FWIW, I just ran a full regrtest.py against trunk on Win32 
with this change in place and all the tests still pass.

Thoughts?

    Trent.
msg65051 - Author: Trent Nelson (trent) * (Python committer) Date: 2008-04-06 21:21
[Updating issue with mailing list discussion; Jean-Paul's reply]
On Fri, 4 Apr 2008 13:24:49 -0700, Trent Nelson <tnelson@onresolve.com> 
wrote:
>Interesting results!  I committed the patch to test_socket.py in
>r62152.  I was expecting all other platforms except for Windows to
>behave consistently (i.e. pass).  That is, given the following:
>
>        import socket
>        host = '127.0.0.1'
>        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>        sock.bind((host, 0))
>        port = sock.getsockname()[1]
>        sock.close()
>        del sock
>
>        sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>        sock1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>        sock1.bind((host, port))
>        sock2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>        sock2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
>        sock2.bind((host, port))
>        ^^^^
>
>....the second bind should fail with EADDRINUSE, at least according to
>the 'SO_REUSEADDR and SO_REUSEPORT Socket Options' section in chapter
>7.5 of Stevens' UNIX Network Programming Volume 1 (2nd Ed):
>
>"With TCP, we are never able to start multiple servers that bind
> the same IP address and same port: a completely duplicate binding.
> That is, we cannot start one server that binds 198.69.10.2 port 80
> and start another that also binds 198.69.10.2 port 80, even if we
> set the SO_REUSEADDR socket option for the second server."
>
>The results: both Windows *and* Linux fail the patched test; none of
>the buildbots for either platform encountered an EADDRINUSE
>socket.error after the second bind().  FreeBSD, OS X, Solaris and Tru64
>pass the test -- EADDRINUSE is raised on the second bind.  (Interesting
>that all the ones that passed have a BSD lineage.)

Notice that the quoted text explains that you cannot start multiple
servers that etc.  Since you didn't call listen on either socket, it's
arguable that you didn't start any servers, so there should be no
surprise regarding the behavior.  Try adding listen calls at various
places in the example and you'll see something different happen.

FWIW, AIUI, SO_REUSEADDR behaves just as described in the above quote on
Linux/BSD/UNIX/etc.  On Windows, however, that option actually means
something quite different.  It means that the address should be stolen
from any process which happens to be using it at the moment.

There is another option, SO_EXCLUSIVEADDRUSE, only on Windows I think,
which, AIUI, makes it impossible for another process to steal the port
using SO_REUSEADDR.

Hope this helps,

Jean-Paul
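
The suggestion above about adding listen calls can be tried with a small script along these lines. This is only a sketch: second_bind_errno() is a made-up helper name, not something from test_socket.py, and the outcome depends on the platform (the thread reports that without listen() the second bind succeeds on Windows and Linux and fails on the BSD-derived systems).

```python
import errno
import socket


def second_bind_errno(listen_first, host="127.0.0.1"):
    """Bind two SO_REUSEADDR sockets to the same host/port.

    Returns the errno raised by the second bind(), or None if the
    second bind() succeeded.  (Hypothetical helper for illustration.)
    """
    sock1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock1.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock1.bind((host, 0))          # let the OS pick a port
        port = sock1.getsockname()[1]
        if listen_first:
            sock1.listen(1)            # now sock1 really is a "server"

        sock2.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock2.bind((host, port))   # completely duplicate binding
        except socket.error as exc:
            return exc.errno
        return None
    finally:
        sock1.close()
        sock2.close()


if __name__ == "__main__":
    for listen_first in (False, True):
        err = second_bind_errno(listen_first)
        print("listen_first=%s -> %s"
              % (listen_first, errno.errorcode.get(err, err)))
```

Running it on each platform of interest, with and without the listen() call, shows whether that platform treats a bound-but-not-listening socket differently from a listening one.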
msg65052 - Author: Trent Nelson (trent) * (Python committer) Date: 2008-04-06 21:21
[Updating issue with mailing list discussion; my reply to Jean-Paul]
> >"With TCP, we are never able to start multiple servers that bind
> > the same IP address and same port: a completely duplicate binding.
> > That is, we cannot start one server that binds 198.69.10.2 port 80
> > and start another that also binds 198.69.10.2 port 80, even if we
> > set the SO_REUSEADDR socket option for the second server."

> Notice that the quoted text explains that you cannot start multiple
> servers that etc.  Since you didn't call listen on either socket, it's
> arguable that you didn't start any servers, so there should be no
> surprise regarding the behavior.  Try adding listen calls at various
> places in the example and you'll see something different happen.

I agree in principle; Stevens says nothing about what happens if you 
*do* try to bind two sockets to identical host/port addresses.  
Even so, test_support.bind_port() assumes that bind() will 
raise EADDRINUSE if the port is not available, which, as has been 
demonstrated, won't be the case on Windows or Linux.

> FWIW, AIUI, SO_REUSEADDR behaves just as described in the above quote
> on Linux/BSD/UNIX/etc.  On Windows, however, that option actually
> means something quite different.  It means that the address should be
> stolen from any process which happens to be using it at the moment.

Probably explains why the python process wedges when this happens on 
Windows...

> There is another option, SO_EXCLUSIVEADDRUSE, only on Windows I think,
> which, AIUI, makes it impossible for another process to steal the port
> using SO_REUSEADDR.

Nod, if SO_EXCLUSIVEADDRUSE is used instead in the code I posted, 
Windows raises EADDRINUSE on the second bind().  I don't have access to 
any Linux boxes at the moment, so I can't test what sort of error is 
raised with the example I posted if listen() and accept() are called on 
the two sockets bound to identical addresses.  Can anyone else shed 
some light on this?  I'd be interested in knowing if the process wedges 
on Linux as badly as it does on Windows (to the point where it's not 
respecting ctrl-c or sigkill).


        Trent.
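
For reference, the SO_EXCLUSIVEADDRUSE variant mentioned above might look roughly like this. It is a sketch only: bind_exclusively() is a hypothetical name, and the fallback for platforms that lack the option (a plain bind() with no SO_REUSEADDR) is an assumption, not something taken from any of the attached patches.

```python
import errno
import socket


def bind_exclusively(sock, address):
    """Bind sock to address, asking for exclusive use where possible.

    socket.SO_EXCLUSIVEADDRUSE is only defined on Windows, so look it
    up defensively; elsewhere fall back to a plain bind() without
    SO_REUSEADDR.  (Hypothetical helper for illustration.)
    """
    exclusive = getattr(socket, "SO_EXCLUSIVEADDRUSE", None)
    if exclusive is not None:
        sock.setsockopt(socket.SOL_SOCKET, exclusive, 1)
    sock.bind(address)
    return sock.getsockname()[1]


if __name__ == "__main__":
    first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        port = bind_exclusively(first, ("127.0.0.1", 0))
        try:
            bind_exclusively(second, ("127.0.0.1", port))
            print("second bind succeeded")
        except socket.error as exc:
            print("second bind failed: %s"
                  % errno.errorcode.get(exc.errno, exc.errno))
    finally:
        first.close()
        second.close()
```

The point of the getattr() lookup is that the same code can run on every buildbot without a platform check.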
msg65054 - Author: Trent Nelson (trent) * (Python committer) Date: 2008-04-06 21:24
I've attached another patch that fixes test_support.bind_port() as well 
as a bunch of files that used that method.  The new implementation 
always uses an ephemeral port in order to elicit an unused port for 
subsequent binding.  Tested on Windows 32-bit & x64 and FreeBSD 6.2.  
Would like to apply sooner rather than later unless anyone has any 
objections as it'll fix my two Windows buildbots that are on the same 
machine from both hanging if they test asynchat at the same time (which 
happens more often than you'd think).
msg65055 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2008-04-06 22:04
Trent, go ahead and try this out.  We should definitely be moving in
this direction.  So I'd rather fix the problem than keep suffering with
the current problems of not being able to run the test suite
concurrently.  I think bind_port might be documented, so you should
update the docs if so.  Also, please add a Misc/NEWS entry.
msg65075 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-07 15:10
I don't like that the patch changes the API of a function in
test_support() (in particular changing the return type; adding optional
arguments is not a problem).  This could trip up 3rd party users of this
API.  I recommend creating a new API bind_host_and_port() (or whatever
you'd like to name it) and implement the original API in terms of the
new one.  (You can even add a warning if you think the original API is
always unsafe.)
msg65077 - Author: Trent Nelson (trent) * (Python committer) Date: 2008-04-07 16:03
To be honest, I wasn't really happy with having to return HOST either; 
it's somewhat redundant given that all these tests should be binding 
against localhost.  What about something like this for bind_port():

def bind_port(sock, host=''):
    """Bind the socket to a free port and return the port number.
    Relies on ephemeral ports in order to ensure we are using an
    unbound port.  This is important as many tests may be running
    simultaneously, especially in a buildbot environment."""

    # Use a temporary socket object to ensure we're not 
    # affected by any socket options that have already 
    # been set on the 'sock' object we're passed. 
    tempsock = socket.socket(sock.family, sock.type)
    tempsock.bind((host, 0))
    port = tempsock.getsockname()[1]
    tempsock.close()
    del tempsock

    sock.bind((host, port))
    return port

The tests would then look something like:

HOST = 'localhost'
PORT = None

class Foo(TestCase):
    def setUp(self):
        sock = socket.socket()
        global PORT
        PORT = test_support.bind_port(sock, HOST)

So, the return value is the port bound to, no change there, but we're 
abolishing preferred_port as an optional argument, which is important, 
IMO, as none of these tests should be stipulating which port they want 
to listen on.  That's actually the root of this entire problem.
msg65078 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-07 17:20
Thanks, that's much better (though I'm not the authority on all details
of this patch).
msg65155 - Author: Trent Nelson (trent) * (Python committer) Date: 2008-04-08 11:49
Invested quite a few cycles on this issue last night.  The more time I 
spent on it, the more I became convinced that every single test working 
with sockets should be changed in one fell swoop in order to facilitate 
(virtually unlimited) parallel test execution without fear of port 
conflicts.

I've attached a second patch, trunk.2550-2.patch, which is my progress 
so far on doing just this.  The main changes can be expressed by the 
following two points:

a) do whatever it takes in network-oriented tests to ensure
   unique ports are obtained (relying on the bind_port() and
   find_unused_port() methods exposed by test_support)

b) never, ever, ever set SO_REUSEADDR on a socket from a test;
   because we're putting so much effort into obtaining a unique
   port, this should never be necessary -- in the rare cases that
   our attempts to obtain a unique port fail, then we absolutely
   should fail with EADDRINUSE, as the ability to obtain a unique
   port for the duration of a client/server test is an invariant
   that we *must* be able to depend upon.  If the invariant is
   broken, fail immediately (don't mask the problem with 
   SO_REUSEADDR).
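
Points (a) and (b) could be sketched roughly as follows. This is an illustration, not the code in trunk.2550-2.patch: the two function names are the ones mentioned in (a), but the bodies, the default host, and the choice of RuntimeError are assumptions made for the example.

```python
import socket


def find_unused_port(family=socket.AF_INET, socktype=socket.SOCK_STREAM,
                     host="127.0.0.1"):
    """Return a port number that was unused a moment ago.

    A temporary socket is bound to port 0 so the OS hands out an
    ephemeral port; the socket is closed again before returning, so
    the port is free but not reserved.
    """
    tempsock = socket.socket(family, socktype)
    try:
        tempsock.bind((host, 0))
        return tempsock.getsockname()[1]
    finally:
        tempsock.close()


def bind_port(sock, host="127.0.0.1"):
    """Bind sock to an ephemeral port and return the port number.

    Refuses sockets that already have SO_REUSEADDR set, so that a port
    conflict surfaces as an error instead of being masked (point b).
    """
    if sock.getsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR):
        raise RuntimeError("tests should never set SO_REUSEADDR")
    sock.bind((host, 0))
    return sock.getsockname()[1]


if __name__ == "__main__":
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print("bound to port %d" % bind_port(server))
    finally:
        server.close()
```

Binding the caller's socket straight to port 0 keeps the port held from the moment it is chosen; find_unused_port() cannot give that guarantee, because another process may grab the port after the temporary socket is closed.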

With this patch applied, I can spawn a handful of Python processes and 
run the entire test suite without any errors*.  I ran it without -r, so 
that all tests run in the same order, which should encourage port 
conflicts if there were any.  Doing that now is completely and utterly 
impossible.

[*] Well, almost without error.  All the I/O related tests that try and 
open @test fail.

I believe there's still outstanding work to do with this patch with 
regard to how the intricacies of SO_REUSEADDR and SO_EXCLUSIVEADDRUSE 
should be handled in the rest of the stdlib.  I'm still thinking about 
the best approach for this.  However, the patch as it currently stands 
is still quite substantial so I wanted to get it out sooner rather than 
later for review.

(I'll forward this to python-dev@ to try and encourage more eyes from 
people with far more network-fu than I.)
msg65224 - Author: Trent Nelson (trent) * (Python committer) Date: 2008-04-08 23:48
Committed updates to relevant network-oriented tests, as well as 
test_support changes discussed, in r62234.
msg104365 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-04-27 21:26
This is now fixed, right? Personal experience as well as buildbot behaviour seems to show that parallel test execution (either through -j, or by running several test suites at the same time) works ok.
History
Date User Action Args
2010-04-27 21:26:06 pitrou set status: open -> closed; nosy: + pitrou, exarkun; messages: + msg104365; resolution: accepted -> fixed; stage: test needed -> resolved
2010-03-20 17:44:28 r.david.murray set stage: test needed; versions: + Python 3.1, Python 2.7, Python 3.2, - Python 3.0
2008-09-18 22:05:37 forest set nosy: + forest
2008-05-13 18:23:06 amak set nosy: + amak
2008-04-08 23:48:17 trent set messages: + msg65224
2008-04-08 11:49:32 trent set files: + trunk.2550-2.patch; messages: + msg65155
2008-04-07 17:20:41 gvanrossum set messages: + msg65078
2008-04-07 16:03:51 trent set messages: + msg65077
2008-04-07 15:10:09 gvanrossum set nosy: + gvanrossum; messages: + msg65075
2008-04-06 22:04:54 nnorwitz set resolution: accepted; messages: + msg65055; nosy: + nnorwitz
2008-04-06 21:25:02 trent set files: + trunk.2550.patch; messages: + msg65054
2008-04-06 21:21:34 trent set messages: + msg65052
2008-04-06 21:21:02 trent set messages: + msg65051
2008-04-06 21:20:26 trent set messages: + msg65050
2008-04-04 15:57:33 trent create