Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calling email.utils.make_msgid frequently has a non-trivial probability of generating colliding ids #50847

Closed
mwhudson opened this issue Jul 30, 2009 · 18 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@mwhudson
Copy link

BPO 6598
Nosy @mwhudson, @warsaw, @pitrou, @allenap, @bitdancer, @serhiy-storchaka
Files
  • make_msgid.diff
  • make_msgid_2.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/serhiy-storchaka'
    closed_at = <Date 2015-11-13.11:57:05.364>
    created_at = <Date 2009-07-30.00:32:14.943>
    labels = ['type-bug', 'library']
    title = 'calling email.utils.make_msgid frequently has a non-trivial probability of generating colliding ids'
    updated_at = <Date 2015-11-13.11:57:05.362>
    user = 'https://github.com/mwhudson'

    bugs.python.org fields:

    activity = <Date 2015-11-13.11:57:05.362>
    actor = 'serhiy.storchaka'
    assignee = 'serhiy.storchaka'
    closed = True
    closed_date = <Date 2015-11-13.11:57:05.364>
    closer = 'serhiy.storchaka'
    components = ['Library (Lib)']
    creation = <Date 2009-07-30.00:32:14.943>
    creator = 'mwh'
    dependencies = []
    files = ['15339', '39405']
    hgrepos = []
    issue_num = 6598
    keywords = ['patch']
    message_count = 18.0
    messages = ['91073', '91074', '91251', '91378', '91410', '95281', '236504', '236527', '243394', '243398', '243411', '243417', '243561', '243562', '243585', '254458', '254461', '254600']
    nosy_count = 10.0
    nosy_names = ['mwh', 'barry', 'ggenellina', 'pitrou', 'allenap', 'r.david.murray', 'BreamoreBoy', 'python-dev', 'serhiy.storchaka', 'varun']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue6598'
    versions = ['Python 2.7', 'Python 3.4', 'Python 3.5']

    @mwhudson
    Copy link
    Author

    If you call email.utils.make_msgid a number of times within the same
    second, the uniqueness of the results depends on random.randint(100000)
    returning different values each time.

    A little mathematics proves that you don't have to call make_msgid
    *that* often to get the same message id twice: if you call it 'n' times,
    the probability of a collision is approximately "1 -
    math.exp(-n*(n-1)/200000.0)", and for n == 100, that's about 5%. For n
    == 1000, it's over 99%.

    These numbers are born out by experiment:

    >>> def collisions(n):
    ...     msgids = [make_msgid() for i in range(n)]
    ...     return len(msgids) - len(set(msgids))
    ... 
    >>> sum((collisions(100)>0) for i in range(1000))
    49
    >>> sum((collisions(1000)>0) for i in range(1000))
    991

    I think probably having a counter in addition to the randomness would be
    a good fix for the problem, though I guess then you have to worry about
    thread safety.

    @mwhudson mwhudson added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Jul 30, 2009
    @mwhudson
    Copy link
    Author

    A higher resolution timer would also help, of course.

    (Thanks to James Knight for the prod).

    @ggenellina
    Copy link
    Mannequin

    ggenellina mannequin commented Aug 4, 2009

    This patch replaces the random part with an increasing sequence (in a
    thread safe way).
    Also, added a test case for make_msgid (there was none previously)

    @pitrou
    Copy link
    Member

    pitrou commented Aug 6, 2009

    Is it ok if the message id is predictable?
    Besides, _gen_next_number() can more efficiently be written as:

    _gen_next_number = itertools.cycle(xrange(N))

    @ggenellina
    Copy link
    Mannequin

    ggenellina mannequin commented Aug 7, 2009

    --- El jue 6-ago-09, Antoine Pitrou <report@bugs.python.org> escribió:

    Antoine Pitrou <pitrou@free.fr>
    added the comment:

    Is it ok if the message id is predictable?

    I don't know of any use of message ids apart from uniquely identifying the message, but we could still keep a random part followed by the sequence.

    Besides, _gen_next_number() can more efficiently be written
    as:

    _gen_next_number = itertools.cycle(xrange(N))

    Written that way it will eventually hold a list with 100000 integers forever.

      Yahoo! Cocina
    

    Encontra las mejores recetas con Yahoo! Cocina.

    http://ar.mujer.yahoo.com/cocina/

    @ggenellina
    Copy link
    Mannequin

    ggenellina mannequin commented Nov 15, 2009

    An updated version of the make_msgid patch.
    Includes a random part plus a sequential part. Testing takes at most 3
    seconds now (we're interested in those msgids generated in a whole
    second)

    @BreamoreBoy
    Copy link
    Mannequin

    BreamoreBoy mannequin commented Feb 24, 2015

    Can we have a patch review please. If nothing else xrange will have to change for Python 3.

    @serhiy-storchaka
    Copy link
    Member

    The probability of a collision when generated n numbers from 0 to m-1 is

    1 - factorial(m)/factorial(m-n)/m**n
    

    When n << sqrt(m), it is approximately equal to n**2/(2*m). When m = 1000000, we have problems for n about 1000. When increase m to 2**32, the probability of collisions is decreased to 0.01%. When increase m to 2**64 as recommended in [1], it is decreased to 2.7e-14. I.e. when generate 1000 IDs per second, collisions happen one per a million years.

    We could also take datetime with larger precision. E.g with 2 digits after the dot, as recommended in [1].

    When apply both changes, we could generate 100000 IDs per second with one collision per 10000 years or 10000 IDs per second with one collision per 1000000 years.

    [1] https://tools.ietf.org/html/draft-ietf-usefor-message-id-01

    @serhiy-storchaka
    Copy link
    Member

    Here is a patch that changes the generating of message IDs:

    1. The datetime is taken with higher precision, 2 decimal digits after dot.
    2. The datetime is written as Unix time in 0.01 second units, not as YYYYmmddHHMMSS. This is more compact and faster.
    3. The random part is taken as the random integer in the range 0 from to 2**64, not from 0 to 10**5.

    This increases the length of generated part of the ID from 26 to 39 characters in average. Perhaps in new releases we can use non-decimal or even non-alphanumeric characters in ID for compactness.

    @bitdancer
    Copy link
    Member

    Looking at my mail store, it looks like one more more mail clients use a GUID for the messageid, which is 36 characters. So 39 doesn't seem so bad. There's even one that is 74 characters long. There are also shorter ones, though, so compactifying the result wouldn't be a bad thing.

    @warsaw
    Copy link
    Member

    warsaw commented May 17, 2015

    An increase of 13 characters doesn't seem so bad.

    @serhiy-storchaka
    Copy link
    Member

    Currently the maximal length of local part is

    14+1+len(str(2**16))+1+len(str(10**5-1)) == 26
    

    With the patch it will be

    len(str(2**31*100))+1+len(str(2**16))+1+len(str(2**64)) = 39
    

    If encode all three numbers with hexadecimal, it will be

    len('%x'%(2**31*100))+1+len('%x'%(2**16))+1+len('%x'%(2**64)) = 34
    

    If encode their with base32:

    math.ceil((31+math.log(100)/math.log(2))/5)+1+math.ceil(16/5)+1+math.ceil(64/5) = 27
    

    Using base64 or base85 can be not safe, because message ID can be used as file name in case-insensitive file system.

    @serhiy-storchaka serhiy-storchaka self-assigned this May 19, 2015
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented May 19, 2015

    New changeset 6969bac411fa by Serhiy Storchaka in branch '2.7':
    Issue bpo-6598: Increased time precision and random number range in
    https://hg.python.org/cpython/rev/6969bac411fa

    New changeset ea878f847eee by Serhiy Storchaka in branch '3.4':
    Issue bpo-6598: Increased time precision and random number range in
    https://hg.python.org/cpython/rev/ea878f847eee

    New changeset 933addbc7041 by Serhiy Storchaka in branch 'default':
    Issue bpo-6598: Increased time precision and random number range in
    https://hg.python.org/cpython/rev/933addbc7041

    @serhiy-storchaka
    Copy link
    Member

    Now there is a question. Is it worth to use base16 (hexadecimal) to compact message id to 34 characters or base32 to compact it to 27 characters? Using base16 is pretty easy: just replace %d with %x. Using base32 is more complicated.

    With base64 and base85 the length can be decreased to 23 and 21, but I doubt that this is safe.

    @warsaw
    Copy link
    Member

    warsaw commented May 19, 2015

    On May 19, 2015, at 07:23 AM, Serhiy Storchaka wrote:

    Now there is a question. Is it worth to use base16 (hexadecimal) to compact
    message id to 34 characters or base32 to compact it to 27 characters? Using
    base16 is pretty easy: just replace %d with %x. Using base32 is more
    complicated.

    It might not be relevant, but we're using base 32 in Message-ID-Hash:

    http://wiki.list.org/DEV/Stable%20URLs

    We settled on base 32 after consulting with the curators of mail-archive.com
    and others.

    @serhiy-storchaka
    Copy link
    Member

    http://buildbot.python.org/all/builders/x86%20Tiger%202.7/builds/3246/steps/test/logs/stdio
    ======================================================================
    FAIL: test_make_msgid_collisions (email.test.test_email.TestMiscellaneous)
    ----------------------------------------------------------------------

    Traceback (most recent call last):
      File "/Users/db3l/buildarea/2.7.bolen-tiger/build/Lib/email/test/test_email.py", line 2434, in test_make_msgid_collisions
        pass
      File "/Users/db3l/buildarea/2.7.bolen-tiger/build/Lib/contextlib.py", line 24, in __exit__
        self.gen.next()
      File "/Users/db3l/buildarea/2.7.bolen-tiger/build/Lib/test/test_support.py", line 1570, in start_threads
        raise AssertionError('Unable to join %d threads' % len(started))
    AssertionError: Unable to join 5 threads

    Threads that should be finished for 3 seconds can't be finished for 15 minutes. It looks as clock() wrapped around. From clock (3) manpage:

    Note that the time can wrap around.  On a 32-bit system where CLOCKS_PER_SEC equals 1000000 this function will return the same value approximately every 72 minutes.
    

    The solution is to use monotonic() (time() on older releases) instead of clock().

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Nov 10, 2015

    New changeset 3c0a817ab616 by Serhiy Storchaka in branch '3.4':
    Issue bpo-6598: Avoid clock wrapping around in test_make_msgid_collisions.
    https://hg.python.org/cpython/rev/3c0a817ab616

    New changeset e259c0ab7a69 by Serhiy Storchaka in branch '3.5':
    Issue bpo-6598: Avoid clock wrapping around in test_make_msgid_collisions.
    https://hg.python.org/cpython/rev/e259c0ab7a69

    New changeset d1c11a78b43a by Serhiy Storchaka in branch 'default':
    Issue bpo-6598: Avoid clock wrapping around in test_make_msgid_collisions.
    https://hg.python.org/cpython/rev/d1c11a78b43a

    New changeset bdd257d60da2 by Serhiy Storchaka in branch '2.7':
    Issue bpo-6598: Avoid clock wrapping around in test_make_msgid_collisions.
    https://hg.python.org/cpython/rev/bdd257d60da2

    @serhiy-storchaka
    Copy link
    Member

    Perhaps will open a separate issue for more compact non-decimal message ID.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants