New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
calling email.utils.make_msgid frequently has a non-trivial probability of generating colliding ids #50847
Comments
If you call email.utils.make_msgid a number of times within the same A little mathematics proves that you don't have to call make_msgid These numbers are born out by experiment: >>> def collisions(n):
... msgids = [make_msgid() for i in range(n)]
... return len(msgids) - len(set(msgids))
...
>>> sum((collisions(100)>0) for i in range(1000))
49
>>> sum((collisions(1000)>0) for i in range(1000))
991 I think probably having a counter in addition to the randomness would be |
A higher resolution timer would also help, of course. (Thanks to James Knight for the prod). |
This patch replaces the random part with an increasing sequence (in a |
Is it ok if the message id is predictable? _gen_next_number = itertools.cycle(xrange(N)) |
--- El jue 6-ago-09, Antoine Pitrou <report@bugs.python.org> escribió:
I don't know of any use of message ids apart from uniquely identifying the message, but we could still keep a random part followed by the sequence.
Written that way it will eventually hold a list with 100000 integers forever.
Encontra las mejores recetas con Yahoo! Cocina. |
An updated version of the make_msgid patch. |
Can we have a patch review please. If nothing else xrange will have to change for Python 3. |
The probability of a collision when generated n numbers from 0 to m-1 is
When n << sqrt(m), it is approximately equal to n**2/(2*m). When m = 1000000, we have problems for n about 1000. When increase m to 2**32, the probability of collisions is decreased to 0.01%. When increase m to 2**64 as recommended in [1], it is decreased to 2.7e-14. I.e. when generate 1000 IDs per second, collisions happen one per a million years. We could also take datetime with larger precision. E.g with 2 digits after the dot, as recommended in [1]. When apply both changes, we could generate 100000 IDs per second with one collision per 10000 years or 10000 IDs per second with one collision per 1000000 years. [1] https://tools.ietf.org/html/draft-ietf-usefor-message-id-01 |
Here is a patch that changes the generating of message IDs:
This increases the length of generated part of the ID from 26 to 39 characters in average. Perhaps in new releases we can use non-decimal or even non-alphanumeric characters in ID for compactness. |
Looking at my mail store, it looks like one more more mail clients use a GUID for the messageid, which is 36 characters. So 39 doesn't seem so bad. There's even one that is 74 characters long. There are also shorter ones, though, so compactifying the result wouldn't be a bad thing. |
An increase of 13 characters doesn't seem so bad. |
Currently the maximal length of local part is
With the patch it will be
If encode all three numbers with hexadecimal, it will be
If encode their with base32:
Using base64 or base85 can be not safe, because message ID can be used as file name in case-insensitive file system. |
New changeset 6969bac411fa by Serhiy Storchaka in branch '2.7': New changeset ea878f847eee by Serhiy Storchaka in branch '3.4': New changeset 933addbc7041 by Serhiy Storchaka in branch 'default': |
Now there is a question. Is it worth to use base16 (hexadecimal) to compact message id to 34 characters or base32 to compact it to 27 characters? Using base16 is pretty easy: just replace %d with %x. Using base32 is more complicated. With base64 and base85 the length can be decreased to 23 and 21, but I doubt that this is safe. |
On May 19, 2015, at 07:23 AM, Serhiy Storchaka wrote:
It might not be relevant, but we're using base 32 in Message-ID-Hash: http://wiki.list.org/DEV/Stable%20URLs We settled on base 32 after consulting with the curators of mail-archive.com |
http://buildbot.python.org/all/builders/x86%20Tiger%202.7/builds/3246/steps/test/logs/stdio Traceback (most recent call last):
File "/Users/db3l/buildarea/2.7.bolen-tiger/build/Lib/email/test/test_email.py", line 2434, in test_make_msgid_collisions
pass
File "/Users/db3l/buildarea/2.7.bolen-tiger/build/Lib/contextlib.py", line 24, in __exit__
self.gen.next()
File "/Users/db3l/buildarea/2.7.bolen-tiger/build/Lib/test/test_support.py", line 1570, in start_threads
raise AssertionError('Unable to join %d threads' % len(started))
AssertionError: Unable to join 5 threads Threads that should be finished for 3 seconds can't be finished for 15 minutes. It looks as clock() wrapped around. From clock (3) manpage:
The solution is to use monotonic() (time() on older releases) instead of clock(). |
New changeset 3c0a817ab616 by Serhiy Storchaka in branch '3.4': New changeset e259c0ab7a69 by Serhiy Storchaka in branch '3.5': New changeset d1c11a78b43a by Serhiy Storchaka in branch 'default': New changeset bdd257d60da2 by Serhiy Storchaka in branch '2.7': |
Perhaps will open a separate issue for more compact non-decimal message ID. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: