classification
Title: calling email.utils.make_msgid frequently has a non-trivial probability of generating colliding ids
Type: behavior Stage:
Components: Library (Lib) Versions: Python 3.2, Python 3.1, Python 2.7, Python 2.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: allenap, gagenellina, mwh, pitrou (4)
Priority: Keywords patch

Created on 2009-07-30 00:32 by mwh, last changed 2009-11-15 10:12 by gagenellina.

Files
File name Uploaded Description Edit Remove
make_msgid.diff gagenellina, 2009-11-15 10:12
Messages (6)
msg91073 - (view) Author: Michael Hudson (mwh) Date: 2009-07-30 00:32
If you call email.utils.make_msgid a number of times within the same
second, the uniqueness of the results depends on random.randint(100000)
returning different values each time.

A little mathematics proves that you don't have to call make_msgid
*that* often to get the same message id twice: if you call it 'n' times,
the probability of a collision is approximately "1 -
math.exp(-n*(n-1)/200000.0)", and for n == 100, that's about 5%.  For n
== 1000, it's over 99%.

These numbers are born out by experiment:

>>> def collisions(n):
...     msgids = [make_msgid() for i in range(n)]
...     return len(msgids) - len(set(msgids))
... 
>>> sum((collisions(100)>0) for i in range(1000))
49
>>> sum((collisions(1000)>0) for i in range(1000))
991

I think probably having a counter in addition to the randomness would be
a good fix for the problem, though I guess then you have to worry about
thread safety.
msg91074 - (view) Author: Michael Hudson (mwh) Date: 2009-07-30 00:34
A higher resolution timer would also help, of course.

(Thanks to James Knight for the prod).
msg91251 - (view) Author: Gabriel Genellina (gagenellina) Date: 2009-08-04 04:30
This patch replaces the random part with an increasing sequence (in a 
thread safe way).
Also, added a test case for make_msgid (there was none previously)
msg91378 - (view) Author: Antoine Pitrou (pitrou) Date: 2009-08-06 18:18
Is it ok if the message id is predictable?
Besides, _gen_next_number() can more efficiently be written as:

_gen_next_number = itertools.cycle(xrange(N))
msg91410 - (view) Author: Gabriel Genellina (gagenellina) Date: 2009-08-07 19:34
--- El jue 6-ago-09, Antoine Pitrou <report@bugs.python.org> escribió:

> Antoine Pitrou <pitrou@free.fr>
> added the comment:
> 
> Is it ok if the message id is predictable?

I don't know of any use of message ids apart from uniquely identifying the message, but we could still keep a random part followed by the sequence.

> Besides, _gen_next_number() can more efficiently be written
> as:
> 
> _gen_next_number = itertools.cycle(xrange(N))

Written that way it will eventually hold a list with 100000 integers forever.

      Yahoo! Cocina

Encontra las mejores recetas con Yahoo! Cocina.

http://ar.mujer.yahoo.com/cocina/
msg95281 - (view) Author: Gabriel Genellina (gagenellina) Date: 2009-11-15 10:12
An updated version of the make_msgid patch.
Includes a random part plus a sequential part. Testing takes at most 3 
seconds now (we're interested in those msgids generated in a whole 
second)
History
Date User Action Args
2009-11-15 10:12:49gagenellinasetfiles: - utils.diff
2009-11-15 10:12:43gagenellinasetfiles: - test_email.diff
2009-11-15 10:12:29gagenellinasetfiles: + make_msgid.diff

messages: + msg95281
2009-08-07 19:59:53gagenellinasetfiles: + utils.diff
2009-08-07 19:59:18gagenellinasetfiles: - utils.diff
2009-08-07 19:34:15gagenellinasetmessages: + msg91410
2009-08-06 18:18:59pitrousetnosy: + pitrou
messages: + msg91378
2009-08-04 04:32:45gagenellinasetversions: - Python 2.5, Python 2.4, 3rd party, Python 3.0
2009-08-04 04:31:30gagenellinasetfiles: + test_email.diff
2009-08-04 04:30:54gagenellinasetfiles: + utils.diff

nosy: + gagenellina
messages: + msg91251

keywords: + patch
2009-07-30 08:21:31allenapsetnosy: + allenap
2009-07-30 00:34:46mwhsetmessages: + msg91074
2009-07-30 00:32:15mwhcreate