classification
Title: resolver not thread safe
Type:
Stage:
Components: Interpreter Core
Versions:
process
Status: closed
Resolution: works for me
Dependencies:
Superseder:
Assigned To:
Nosy List: akuchling, dustin, loewis, nnorwitz, tim.peters
Priority: normal
Keywords:

Created on 2002-04-04 09:54 by dustin, last changed 2006-12-22 04:24 by dustin. This issue is now closed.

Files
File name Uploaded Description
resolv-bug.py akuchling, 2006-12-21 15:13 Test script
Messages (11)
msg10149 - Author: dustin sallings (dustin) Date: 2002-04-04 09:54
I've got an application that does SNMP monitoring and
has a thread listening with SimpleXMLRPCServer for
remote control.  I noticed the XMLRPC listener logging
an incorrect address while snmp jobs were processing:

sw1.west.spy.net - - [04/Apr/2002 01:16:37] "POST /RPC2
HTTP/1.0" 200 -
localhost.west.spy.net - - [04/Apr/2002 01:16:43] "POST
/RPC2 HTTP/1.0" 200 -

sw1 is one of the machines that is being queried, but
the XMLRPC requests are happening over localhost.

gethostbyname() and gethostbyaddr() both return static
data, thus they aren't reentrant.

As a workaround, I copied socket.py to my working
directory and added the following to it:

try:
    import threading
except ImportError, ie:
    sys.stderr.write(str(ie) + "\n")

# mutex for DNS lookups
__dns_mutex=None
try:
    __dns_mutex=threading.Lock()
except NameError:
    pass

def __lock():
    if __dns_mutex!=None:
        __dns_mutex.acquire()
    
def __unlock():
    if __dns_mutex!=None:
        __dns_mutex.release()
    
def gethostbyaddr(addr):
    """Override gethostbyaddr to try to get some thread safety."""
    # Acquire before the try block so a failed acquire never triggers a release
    __lock()
    try:
        return _socket.gethostbyaddr(addr)
    finally:
        __unlock()

def gethostbyname(name):
    """Override gethostbyname to try to get some thread safety."""
    # Same mutex as gethostbyaddr, so the two calls can never overlap
    __lock()
    try:
        return _socket.gethostbyname(name)
    finally:
        __unlock()
msg10150 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-04-04 20:06
I'm not sure what problem you are reporting. Python does not
attempt to invoke gethostbyname from two threads
simultaneously; this is prevented by the GIL.

On some systems, gethostbyname is reentrant (in the
gethostbyname_r incarnation); Python uses that where
available, and releases the GIL before calling it.

So I fail to see the bug.
msg10151 - Author: dustin sallings (dustin) Date: 2002-04-04 21:08
The XMLRPC request is clearly being logged as coming from my
cisco switch when it was, in fact, coming from localhost.

I can't find any clear documentation, but it seems that on
at least some systems gethostbyname and gethostbyaddr
reference the same static variable, so having separate locks
for each one (as seen in socketmodule.c) doesn't help
anything.  It's not so much that they're not reentrant, but
you can't call any combination of the two of them at the
same time.  Here's some test code:

#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <assert.h>

int main(int argc, char **argv) {
    struct hostent *byaddr, *byname;
    struct in_addr addr;

    /* 66.149.231.227, converted to network byte order */
    addr.s_addr=htonl(1117120483);

    byaddr=gethostbyaddr(&addr, sizeof(addr), AF_INET);
    assert(byaddr);
    printf("byaddr:  %s\n", byaddr->h_name);

    byname=gethostbyname("mail.west.spy.net");
    assert(byname);
    printf("byname:  %s\n", byname->h_name);

    /* If both calls share static storage, both lines now print the same name */
    printf("\nReprinting:\n\n");

    printf("byaddr:  %s\n", byaddr->h_name);
    printf("byname:  %s\n", byname->h_name);
    return 0;
}
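
The effect the C program checks for can be modelled without touching the network. The sketch below is a toy model only: `StaticResolver` and the `.example` names are hypothetical, and it imitates a library that returns a pointer to one shared record rather than calling the real resolver.

```python
class StaticResolver:
    """Toy model of a C library that returns a pointer to one static struct."""

    def __init__(self, table):
        self._table = table                # hypothetical key -> host name map
        self._static = {"h_name": None}    # the single shared "hostent"

    def lookup(self, key):
        # Overwrite the shared record in place and hand back the same object,
        # the way a non-reentrant C function hands back the same pointer.
        self._static["h_name"] = self._table[key]
        return self._static


resolver = StaticResolver({"10.0.0.1": "sw1.example", "mail": "mail.example"})
byaddr = resolver.lookup("10.0.0.1")
first_name = byaddr["h_name"]              # copy taken before the next call
byname = resolver.lookup("mail")

print(byaddr is byname)                    # True: both results are one object
print(first_name, byaddr["h_name"])        # sw1.example mail.example
```

Copying the fields out before the next lookup, as `first_name` does, is the only way to keep the first result in this model.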
msg10152 - Author: dustin sallings (dustin) Date: 2002-04-04 22:21
Looking over the code a bit more, I see that my last message
wasn't entirely accurate.  There does seem to be only one
lock for both gethostbyname and gethostbyaddr
(gethostbyname_lock is used for both).

This is a pretty simple test that illustrates the problem
I'm seeing.  My previous work was on my OS X machine, but
this is Python 2.2 (#3, Mar  6 2002, 18:30:37) [C] on irix6.

#!/usr/bin/env python
#
# Copyright (c) 2002  Dustin Sallings <dustin@spy.net>
# $Id$

import threading
import socket
import time

class ResolveMe(threading.Thread):

    hosts=['propaganda.spy.net', 'bleu.west.spy.net',
           'mail.west.spy.net']

    def __init__(self):
        threading.Thread.__init__(self)
        self.setDaemon(1)

    def run(self):
        # Run 100 times
        for i in range(100):
            for h in self.hosts:
                nrv=socket.gethostbyname_ex(h)
                arv=socket.gethostbyaddr(nrv[2][0])

                try:
                    # Verify the hostname is correct
                    assert(h==nrv[0])
                    # Verify the two hostnames match
                    assert(nrv[0]==arv[0])
                    # Verify the two addresses match
                    assert(nrv[2]==arv[2])
                except AssertionError:
                    print "Failed!  Checking " + `h` + " got, " \
                        + `nrv` + " and " + `arv`

if __name__=='__main__':
    for i in range(1,10):
        print "Starting " + `i` + " threads."
        threads=[]
        for n in range(i):
            rm=ResolveMe()
            rm.start()
            threads.append(rm)
        for t in threads:
            t.join()
        print `i` + " threads complete."
    time.sleep(60)

The output looks like this:

verde:/tmp 190> ./pytest.py
Starting 1 threads.
1 threads complete.
Starting 2 threads.
Failed!  Checking 'propaganda.spy.net' got,
('mail.west.spy.net', [], ['66.149.231.226']) and
('mail.west.spy.net', [], ['66.149.231.226'])
Failed!  Checking 'bleu.west.spy.net' got,
('mail.west.spy.net', [], ['66.149.231.226']) and
('mail.west.spy.net', [], ['66.149.231.226'])

[...]
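
For anyone re-running the check on a current interpreter, here is a rough Python 3 sketch of the same test. `check_resolver` is a hypothetical helper; the `forward` and `reverse` parameters are an addition so the lookups can be swapped out, and the host names passed in are up to the caller.

```python
import socket
import threading


def check_resolver(hosts, n_threads, rounds=100,
                   forward=socket.gethostbyname_ex,
                   reverse=socket.gethostbyaddr):
    """Resolve each host forward and back from several threads.

    Returns a list of (host, forward_result, reverse_result) tuples, one
    for every lookup whose results disagree.
    """
    failures = []
    failures_lock = threading.Lock()

    def worker():
        for _ in range(rounds):
            for host in hosts:
                nrv = forward(host)
                arv = reverse(nrv[2][0])
                # Same three checks as the script above: the name is the one
                # asked for, and both directions agree on name and addresses.
                ok = host == nrv[0] and nrv[0] == arv[0] and nrv[2] == arv[2]
                if not ok:
                    with failures_lock:
                        failures.append((host, nrv, arv))

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return failures
```

An empty list from `check_resolver(hosts, n_threads=2)` would correspond to a run of the script above with no "Failed!" lines. Hosts whose reverse DNS does not match their forward name will be reported as failures even when the resolver is behaving correctly.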
msg10153 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2002-04-05 08:56
Can you spot the error in the Python socket module? I still
fail to see our bug, and I would assume it is a C library
bug; I also cannot reproduce the problem on any of my machines.

Can you please report the settings of the various HAVE_
defines for irix?
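
On a current interpreter, one way to list those settings is through the `sysconfig` module; this is a sketch only, and on a 2.2 build the values would have to be read from pyconfig.h directly. `resolver_defines` is a hypothetical helper, and on platforms where the pyconfig.h values are not exposed the result may simply be empty.

```python
import sysconfig


def resolver_defines(prefix="HAVE_GETHOSTBY"):
    """Return the build-time config variables whose names start with prefix."""
    config = sysconfig.get_config_vars()
    return {name: value for name, value in config.items()
            if name.startswith(prefix)}


# Print whatever resolver-related defines this build exposes.
for name, value in sorted(resolver_defines().items()):
    print(name, "=", value)
```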
msg10154 - Author: Tim Peters (tim.peters) * (Python committer) Date: 2002-04-05 21:31
Just a reminder that the first thing to try on any SGI box 
is to recompile Python with optimization disabled.  I can't 
remember the last time we had "a Python bug" on SGI that 
wasn't traced to a compiler -O bug.
msg10155 - Author: dustin sallings (dustin) Date: 2002-04-05 21:44
I first noticed this problem on my OS X box.

Since it's affecting me, it's not obvious to anyone else,
and I'm perfectly capable of fixing it myself, I'll try to
spend some time figuring out what's going on this weekend. 
It seems like it might be making a decision to not use the
lock at compile time.  I will investigate further and submit
a patch.
msg10156 - Author: Neal Norwitz (nnorwitz) * (Python committer) Date: 2002-08-11 15:04
Dustin, any progress on a patch or diagnosing this further?
msg10157 - Author: dustin sallings (dustin) Date: 2002-08-11 19:27
No, unfortunately, I haven't been able to look at it in a
while.  Short of locking it in python, I wasn't able to
avoid the failure.

I'm sorry I haven't updated this at all.  As far as I can
tell, it's still a problem, but I haven't been able to
find a solution in the C code.  I suppose I spoke with too
much haste when I said I was perfectly capable of fixing the
problem myself.  The locking in the C code did seem correct,
but the memory was still getting stomped.
msg10158 - Author: A.M. Kuchling (akuchling) * (Python committer) Date: 2006-12-21 15:13
Attaching the test script.

The script now fails because some of the spy.net addresses are resolved to hostnames such as
adsl-69-230-8-158.dsl.pltn13.pacbell.net.  When I changed the test script to use python.org machine names and ran it with Python 2.5 on Linux, no errors were reported.

Does this still fail on current OS X?  If not, I suggest calling this a platform C library bug and closing this report.

File Added: resolv-bug.py
msg10159 - Author: dustin sallings (dustin) Date: 2006-12-22 04:24
I'll go ahead and close it.  It does not fail under 2.4 on any of my machines (tried OS X/intel, PPC G3, and FreeBSD/intel).
History
Date User Action Args
2002-04-04 09:54:15 dustin create