classification
Title: random.choice hits ValueError: cannot convert float NaN to integer
Type: Stage:
Components: Versions: Python 3.2, Python 3.3, Python 2.7
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: rhettinger Nosy List: gregory.p.smith, mark.dickinson, rhettinger, tim.peters
Priority: normal Keywords:

Created on 2012-02-16 07:57 by gregory.p.smith, last changed 2012-02-17 01:23 by rhettinger. This issue is now closed.

Messages (7)
msg153464 - Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-16 07:57
Using a 32-bit Python 2.6.5 on a Linux system at work, we observed the following:

  File "/.../lib/python2.6/tempfile.py", line 349, in mktemp
    name = names.next()
  File "/.../lib/python2.6/tempfile.py", line 134, in next
    letters = [choose(c) for dummy in "123456"]
  File "/.../lib/python2.6/random.py", line 261, in choice
    return seq[int(self.random() * len(seq))]  # raises IndexError if seq is empty
ValueError: cannot convert float NaN to integer
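The failing expression is the one on line 261 of the 2.6 random.py in the traceback: the float from random() is scaled by len(seq) and truncated with int(). The minimal sketch below mirrors that one expression. The function name choice_like_26 and its injectable rand argument are hypothetical, added only so a NaN can be fed in. It shows that the ValueError comes from int() as soon as random() yields NaN. It does not come from an empty sequence, which would give IndexError instead.

```python
import random


def choice_like_26(seq, rand=random.random):
    """Hypothetical mirror of the Python 2.6 random.choice body:
    scale a float in [0.0, 1.0) by len(seq) and truncate it to an index."""
    # int() of a NaN float raises ValueError before seq is ever indexed.
    return seq[int(rand() * len(seq))]


letters = "abcdefghijklmnopqrstuvwxyz"

# Normal case: rand() returns a value in [0.0, 1.0), so the index is valid.
print(choice_like_26(letters, lambda: 0.5))  # prints n

# Pathological case: a NaN from rand() propagates through the multiply.
try:
    choice_like_26(letters, lambda: float("nan"))
except ValueError as exc:
    print(exc)  # cannot convert float NaN to integer
```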

This is rare and hard to reproduce.  The hardware appears to be healthy and this was on a server with ECC.


Some searching reveals that other people have hit this in random.choice in Python 2.7 as well:  https://bugs.launchpad.net/ubuntu/+source/desktopcouch/+bug/886159

The Ubuntu developer seems to think this is related to time.time() returning NaN at some point (I haven't looked into that myself).
msg153473 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-02-16 10:43
Hmm, this is a little odd.  For 2.7 at least, the error message is coming from PyLong_FromDouble in Objects/longobject.c.  I can't immediately see how PyLong_FromDouble could be called by the random seeding process.

So it seems more likely that the error is really coming from the int() call in the traceback.  But now that implies that the random call is returning NaN, which looks unpossible from the code (random_random in Modules/_randommodule.c).


static PyObject *
random_random(RandomObject *self)
{
    unsigned long a=genrand_int32(self)>>5, b=genrand_int32(self)>>6;
    return PyFloat_FromDouble((a*67108864.0+b)*(1.0/9007199254740992.0));
}


So despite your comments about healthy hardware, my bet's on corrupted memory. :-)
msg153474 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-02-16 10:57
The bugs.launchpad.net URL shows a call to 'entropy.choice'.  Any idea what 'entropy' is?  Could it be that they're using their own Random subclass, not tied to the Python MT implementation?
msg153475 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-02-16 11:09
The hypothesis that time.time() is returning NaN doesn't match the provided traceback.  If time.time() had returned NaN, the exception would have happened earlier, on line 113 in random.py:  long(time.time() * 256)
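Raymond's point is about ordering. In the 2.x time-based seeding path, the timestamp is converted to an integer before any random number is drawn. A NaN from time.time() would therefore fail at seeding time, not inside choice(). The sketch below models that step under two assumptions: Python 3's int() stands in for 2.x long(), and seed_from_timestamp is a hypothetical helper name.

```python
import time


def seed_from_timestamp(t=None):
    """Hypothetical stand-in for the 2.x fallback seed `long(time.time() * 256)`."""
    if t is None:
        t = time.time()
    # Scaling by 256 keeps fractional seconds before truncation; a NaN
    # timestamp makes int() raise right here, long before choice() runs.
    return int(t * 256)


print(seed_from_timestamp(1.5))  # prints 384

try:
    seed_from_timestamp(float("nan"))
except ValueError as exc:
    print("seeding failed:", exc)
```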

I'm wondering if the NaN arises in the C code for random():

random_random(RandomObject *self)
{
    unsigned long a=genrand_int32(self)>>5, b=genrand_int32(self)>>6;
    return PyFloat_FromDouble((a*67108864.0+b)*(1.0/9007199254740992.0));
}

Upstream from that, only integers are used, so this would be the earliest a NaN could arise when running the code in choice():  ``return seq[int(self.random() * len(seq))]``
msg153476 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2012-02-16 11:29
> I'm wondering if the NaN arises in the C code for random():

I don't think that's possible.  In the second line:

    return PyFloat_FromDouble((a*67108864.0+b)*(1.0/9007199254740992.0));

a and b are already C unsigned longs, so no matter what their value, the result of the expression is well in range for an IEEE 754 double, and on a normal machine there's just no realistic way that this calculation could produce a NaN.  PyFloat_FromDouble does no manipulation of the C double, but just stores it directly in the PyFloat object.
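Mark's bound can be checked by porting the arithmetic of random_random to Python:

* a keeps the top 27 bits of one 32-bit word, and b keeps the top 26 bits of another.
* 67108864.0 is 2**26 and 9007199254740992.0 is 2**53.
* The result is therefore a 53-bit integer divided by 2**53.

In the sketch below, random_from_words is a hypothetical name. It takes the two 32-bit outputs as arguments rather than calling the Mersenne Twister. Evaluating it at the extreme inputs shows the result is always a finite value in [0.0, 1.0).

```python
import math


def random_from_words(w1, w2):
    """Python port of the arithmetic in random_random (Modules/_randommodule.c).

    w1 and w2 stand in for two outputs of genrand_int32(); they are
    masked to 32 bits to mimic that function's range."""
    a = (w1 & 0xFFFFFFFF) >> 5  # top 27 bits: 0 .. 2**27 - 1
    b = (w2 & 0xFFFFFFFF) >> 6  # top 26 bits: 0 .. 2**26 - 1
    # a * 2**26 + b is an integer in [0, 2**53 - 1], exactly representable
    # as a double, and 1/2**53 is an exact power of two, so no NaN can arise.
    return (a * 67108864.0 + b) * (1.0 / 9007199254740992.0)


assert 67108864.0 == 2.0 ** 26 and 9007199254740992.0 == 2.0 ** 53

lo = random_from_words(0, 0)
hi = random_from_words(0xFFFFFFFF, 0xFFFFFFFF)
print(lo, hi)                    # 0.0 0.9999999999999999
print(math.isnan(hi), hi < 1.0)  # False True
```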


I think there are two different things going on here.

(1) The Ubuntu error reporter seems to be using something other than the standard Random class, so all bets are off there without knowing more about what's being used.  Chances seem good that whatever random number generator they're using really *is* producing a NaN.

(2) That leaves Greg's report above, where the standard Random class is apparently what's being used.  Here I'm baffled---I can't see any realistic mechanism that might produce that traceback.
msg153488 - Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2012-02-16 17:29
I think my claim that the hardware appeared healthy was premature.  I misread our initial internal error report about where the code ran and was looking at the wrong host.  Doh, my bad.

Several more of these have been found in the last week, and they all suspiciously ran on the same machine.  One of them had a _different_ failure that is an even stronger suggestion of bad hardware:

 File "/.../lib/python2.6/random.py", line 57, in <module>
   NV_MAGICCONST = 4 * _exp(-0.5)/_sqrt(2.0)
ValueError: math domain error

Sorry for the false alarm.
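The second traceback is telling because NV_MAGICCONST is computed from constants when random.py is imported. In that module, _exp and _sqrt are the math module's exp and sqrt imported under those aliases. On working hardware, that line cannot raise. math.sqrt reports a domain error for a negative argument, so seeing one on sqrt(2.0) points away from the code itself. The short check below is a sketch. The exact wording of the error message may differ between Python versions, so only the exception type is relied on.

```python
import math

# Same expression as line 57 of the 2.6 random.py, with the aliases spelled out.
NV_MAGICCONST = 4 * math.exp(-0.5) / math.sqrt(2.0)
print(NV_MAGICCONST)  # about 1.7155

# A domain error from sqrt requires a negative input, which the
# literal 2.0 can never be on correctly functioning hardware.
try:
    math.sqrt(-2.0)
except ValueError as exc:
    print("sqrt(-2.0) raised:", exc)
```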
msg153521 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2012-02-17 01:23
Well, at least it was an interesting bug report ;-)
History
Date                 User             Action  Args
2012-02-17 01:23:20  rhettinger       set     messages: + msg153521
2012-02-16 17:29:47  gregory.p.smith  set     status: open -> closed
                                              resolution: not a bug
                                              messages: + msg153488
2012-02-16 11:29:10  mark.dickinson   set     messages: + msg153476
2012-02-16 11:09:07  rhettinger       set     messages: + msg153475
2012-02-16 10:57:57  mark.dickinson   set     messages: + msg153474
2012-02-16 10:43:26  mark.dickinson   set     messages: + msg153473
2012-02-16 10:28:08  rhettinger       set     assignee: rhettinger
2012-02-16 10:24:39  mark.dickinson   set     nosy: + mark.dickinson
2012-02-16 08:10:09  loewis           set     versions: - Python 2.6, Python 3.1
2012-02-16 08:06:47  gregory.p.smith  set     nosy: + rhettinger
2012-02-16 07:57:52  gregory.p.smith  create