Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add randbytes() method to random.Random #84466

Closed
vstinner opened this issue Apr 14, 2020 · 25 comments
Closed

Add randbytes() method to random.Random #84466

vstinner opened this issue Apr 14, 2020 · 25 comments
Labels
3.9 only security fixes stdlib Python modules in the Lib dir

Comments

@vstinner
Copy link
Member

BPO 40286
Nosy @tim-one, @rhettinger, @mdickinson, @vstinner, @serhiy-storchaka, @vedgar
PRs
  • bpo-40286: Add randbytes() method to random.Random #19527
  • bpo-40286: Makes simpler the relation between randbytes() and getrandbits() #19574
  • bpo-40286: Use random.randbytes() in tests #19575
  • bpo-40286: Remove C implementation of Random.randbytes() #19797
  • bpo-40286: Put methods in correct sections. Add security notice #19870
  • Files
  • run_random.py: Custom PRNG where randbytes() doesn't fit
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2020-04-29.16:51:01.179>
    created_at = <Date 2020-04-14.22:30:18.004>
    labels = ['library', '3.9']
    title = 'Add randbytes() method to random.Random'
    updated_at = <Date 2020-05-05.05:52:17.085>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2020-05-05.05:52:17.085>
    actor = 'rhettinger'
    assignee = 'none'
    closed = True
    closed_date = <Date 2020-04-29.16:51:01.179>
    closer = 'vstinner'
    components = ['Library (Lib)']
    creation = <Date 2020-04-14.22:30:18.004>
    creator = 'vstinner'
    dependencies = []
    files = ['49078']
    hgrepos = []
    issue_num = 40286
    keywords = ['patch']
    message_count = 25.0
    messages = ['366454', '366457', '366496', '366498', '366500', '366502', '366505', '366507', '366512', '366514', '366524', '366665', '366666', '366676', '366677', '366860', '366861', '366869', '366894', '366911', '366919', '367180', '367676', '367677', '368104']
    nosy_count = 6.0
    nosy_names = ['tim.peters', 'rhettinger', 'mark.dickinson', 'vstinner', 'serhiy.storchaka', 'veky']
    pr_nums = ['19527', '19574', '19575', '19797', '19870']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue40286'
    versions = ['Python 3.9']

    @vstinner
    Copy link
    Member Author

    The random module lacks a getrandbytes() method which leads developers to be creative how to generate bytes:
    https://stackoverflow.com/questions/5495492/random-byte-string-in-python

    It's a common use request:

    Python already has three functions to generate random bytes:

    • os.getrandom(): specific to Linux, not portable
    • os.urandom()
    • secrets.token_bytes()

    These 3 functions are based on system entropy and they block on Linux until the kernel collected enough entropy: PEP-524.

    While many users are fine with these functions, there are also use cases for simulation where the security doesn't matter, and it's more about being able to get reproducible experience from a seed. That's what random.Random is about.

    The numpy module provides numpy.random.bytes(length) function for such use case:
    https://docs.scipy.org/doc/numpy-1.15.0/reference/generated/numpy.random.bytes.html

    One example can be to generate UUID4 with the ability to reproduce the random UUID from a seed for testing purpose, or to get reproducible behavior.

    Attached PR implements the getrandbytes() method.

    @vstinner vstinner added 3.9 only security fixes stdlib Python modules in the Lib dir labels Apr 14, 2020
    @rhettinger
    Copy link
    Contributor

    If we have to have this, the method name should be differentiated from getrandbits() because the latter returns an integer. I suggest just random.bytes(n), the same as numpy.

    Python already has three functions to generate random bytes:

    Now, there will be four ;-)

    @vedgar
    Copy link
    Mannequin

    vedgar mannequin commented Apr 15, 2020

    I suggest just random.bytes(n), the same as numpy.

    The problem with this is that people who from random import * (some schools insist on this, probably because most functions they need already start with rand) will shadow builtin bytes. Not that those schools do anything with bytes, but still, it might be inconvenient.

    (The metaproblem is of course that some functions already do the "poor man's namespacing" in C-style by starting with rand, and some don't. I'm always for user control of namespacing, but I'm just saying that it doesn't correspond to how many beginners use random module.)

    @rhettinger
    Copy link
    Contributor

    Do you have another name suggestion that doesn't have a parallelism problem with the existing name? The names getrandbytes() and getrandbits() suggest a parallelism that is incorrect.

    @vedgar
    Copy link
    Mannequin

    vedgar mannequin commented Apr 15, 2020

    I think that the "module owner";-P must decide whether the random module should follow the C-namespacing or not. Of course, I'm in the "not" camp, so I believe those two "rand..." functions (randrange is completely redundant with random.choice(range)) should be supplemented with random.int and random.float. And then random.bytes will be completely natural. And people might be gently nudged into the right direction when using Python module namespaces.

    @rhettinger
    Copy link
    Contributor

    I concur that bytes() isn't a good name, but am still concerned that the proposed name is a bad API decision.

    @serhiy-storchaka
    Copy link
    Member

    Maybe randbytes()?

    @vstinner
    Copy link
    Member Author

    I like "from random import randbytes" name. I concur that "from random import bytes" overrides bytes() builtin type and so can likely cause troubles.

    @vstinner
    Copy link
    Member Author

    I updated my PR to rename the method to randbytes().

    @vstinner vstinner changed the title Add getrandbytes() method to random.Random Add randbytes() method to random.Random Apr 15, 2020
    @vstinner vstinner changed the title Add getrandbytes() method to random.Random Add randbytes() method to random.Random Apr 15, 2020
    @vstinner
    Copy link
    Member Author

    The performance of the new method is not my first motivation.

    My first motivation is to avoid consumers of the random to write a wrong implementation which would be biased. It's too easy to write biased functions without notifying.

    Moreover, it seems like we can do something to get reproducible behavior on different architectures (different endianness) which would also be a nice feature.

    For example, in bpo-13396, Amaury found this two functions in the wild:

    • struct.pack("Q", random.getrandbits(64))
    • sha1(str(random.getrandbits(8*20))).digest()

    As I wrote, users are creative to workaround missing features :-) I don't think that these two implementations give the same result on big and little endian.

    @serhiy-storchaka
    Copy link
    Member

    All Random methods give the same result independently of endianess and bitness of the platform.

    I don't think that these two implementations give the same result on big and little endian.

    The second one does.

    @vstinner
    Copy link
    Member Author

    I wrote a quick benchmark:
    ---

    import pyperf
    import random
    
    gen = random.Random()
    # gen = random.SystemRandom()
    gen.seed(850779834)
    
    if 1: #hasattr(gen, 'randbytes'):
        func = type(gen).randbytes
    elif 0:
        def py_randbytes(gen, n):
            data = bytearray(n)
            i = 0
            while i < n:
                chunk = 4
                word = gen.getrandbits(32)
                word = word.to_bytes(4, 'big')
                chunk = min(n, 4)
                data[i:i+chunk] = word[:chunk]
                i += chunk
            return bytes(data)
    
        func = py_randbytes
    else:
        def getrandbits_to_bytes(gen, n):
            return gen.getrandbits(n * 8).to_bytes(n, 'little')
    
        func = getrandbits_to_bytes
    
    runner = pyperf.Runner()
    for nbytes in (1, 4, 16, 1024, 1024 * 1024):
        runner.bench_func(f'randbytes({nbytes})', func, gen, nbytes)

    Results on Linux using gcc -O3 (without LTO or PGO) using the C randbytes() implementation as the reference:

    +--------------------+-------------+----------------------------------+-------------------------------+
    | Benchmark | c_randbytes | py_randbytes | getrandbits_to_bytes |
    +====================+=============+==================================+===============================+
    | randbytes(1) | 71.4 ns | 1.04 us: 14.51x slower (+1351%) | 244 ns: 3.42x slower (+242%) |
    +--------------------+-------------+----------------------------------+-------------------------------+
    | randbytes(4) | 71.4 ns | 1.03 us: 14.48x slower (+1348%) | 261 ns: 3.66x slower (+266%) |
    +--------------------+-------------+----------------------------------+-------------------------------+
    | randbytes(16) | 81.9 ns | 3.07 us: 37.51x slower (+3651%) | 321 ns: 3.92x slower (+292%) |
    +--------------------+-------------+----------------------------------+-------------------------------+
    | randbytes(1024) | 1.05 us | 173 us: 165.41x slower (+16441%) | 3.66 us: 3.49x slower (+249%) |
    +--------------------+-------------+----------------------------------+-------------------------------+
    | randbytes(1048576) | 955 us | 187 ms: 196.30x slower (+19530%) | 4.37 ms: 4.58x slower (+358%) |
    +--------------------+-------------+----------------------------------+-------------------------------+

    • c_randbytes: PR 19527, randbytes() methods implemented in C
    • py_randbytes: bytearray, getrandbits(), .to_bytes()
    • getrandbits_to_bytes: Serhiy's implementation: gen.getrandbits(n * 8).to_bytes(n, 'little')

    So well, the C randbytes() implementation is always the fastest.

    random.SystemRandom().randbytes() (os.urandom(n)) performance using random.Random().randbytes() (Mersenne Twister) as a reference:

    +--------------------+-------------+---------------------------------+
    | Benchmark | c_randbytes | systemrandom |
    +====================+=============+=================================+
    | randbytes(1) | 71.4 ns | 994 ns: 13.93x slower (+1293%) |
    +--------------------+-------------+---------------------------------+
    | randbytes(4) | 71.4 ns | 1.04 us: 14.60x slower (+1360%) |
    +--------------------+-------------+---------------------------------+
    | randbytes(16) | 81.9 ns | 1.02 us: 12.49x slower (+1149%) |
    +--------------------+-------------+---------------------------------+
    | randbytes(1024) | 1.05 us | 6.22 us: 5.93x slower (+493%) |
    +--------------------+-------------+---------------------------------+
    | randbytes(1048576) | 955 us | 5.64 ms: 5.91x slower (+491%) |
    +--------------------+-------------+---------------------------------+

    os.urandom() is way slower than Mersenne Twister.

    Well, that's not surprising: os.urandom() requires at least one syscall (getrandom() syscall on my Linux machine).

    @vstinner
    Copy link
    Member Author

    New changeset 9f5fe79 by Victor Stinner in branch 'master':
    bpo-40286: Add randbytes() method to random.Random (GH-19527)
    9f5fe79

    @serhiy-storchaka
    Copy link
    Member

    New changeset 223221b by Serhiy Storchaka in branch 'master':
    bpo-40286: Makes simpler the relation between randbytes() and getrandbits() (GH-19574)
    223221b

    @vstinner
    Copy link
    Member Author

    New changeset 87502dd by Victor Stinner in branch 'master':
    bpo-40286: Use random.randbytes() in tests (GH-19575)
    87502dd

    @rhettinger
    Copy link
    Contributor

    The randbytes() method needs to depend on genrandbits(). It is documented that custom generators can supply there own random() and genrandbits() methods and expect that the other downstream generators all follow. See the attached example which demonstrates that randbytes() bypasses this framework pattern.

    Also, I don't want randbytes() in the C extension. We're tried to keep as much of the code as possible in pure Python and only have the MersenneTwister specific code in the C module. The improves maintainability and makes the code more accessible to a broader audience.

    Also, please don't change the name of the genrand_int32() function. It was a goal to change as little as possible from the official, standard version of the C code at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html . For the most part, we just want to wrap that code for Python bindings, but not modify it.

    @rhettinger rhettinger reopened this Apr 20, 2020
    @rhettinger rhettinger reopened this Apr 20, 2020
    @rhettinger
    Copy link
    Contributor

    Direct link to MT code that I would like to leave mostly unmodified: http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/CODES/mt19937ar.c

    @rhettinger
    Copy link
    Contributor

    When a new method gets added to a module, it should happen in a way that is in harmony with the module's design.

    @vstinner
    Copy link
    Member Author

    I created bpo-40346: "Redesign random.Random class inheritance".

    @serhiy-storchaka
    Copy link
    Member

    $ ./python -m timeit -s 'import random' 'random.randbytes(10**6)'
    200 loops, best of 5: 1.36 msec per loop
    
    $ ./python -m timeit -s 'import random' 'random.getrandbits(10**6*8).to_bytes(10**6, "little")'
    50 loops, best of 5: 6.31 msec per loop

    The Python implementation is only 5 times slower than the C implementation. I am fine with implementing randbytes() in Python. This would automatically make it depending on the getrandbits() implementation.

    @vstinner
    Copy link
    Member Author

    Raymond:

    Also, I don't want randbytes() in the C extension. We're tried to keep as much of the code as possible in pure Python and only have the MersenneTwister specific code in the C module. The improves maintainability and makes the code more accessible to a broader audience.

    I don't see how 30 lines makes Python so harder to maintain. These lines make the function 4x to 5x faster. We are not talking about 5% or 10% faster. I think that such optimization is worth it. When did we decide to stop optimizing Python?

    Raymond:

    The randbytes() method needs to depend on genrandbits().

    I created bpo-40346: "Redesign random.Random class inheritance" for a more generic fix, not just randbytes().

    Raymond:

    Also, please don't change the name of the genrand_int32() function. It was a goal to change as little as possible from the official, standard version of the C code at http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html .

    This code was already modified to replace "unsigned long" with "uint32_t" for example. I don't think that renaming genrand_int32() to genrand_uint32() makes the code impossible to maintain. Moreover, it seems like http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/MT2002/emt19937ar.html was not updated for 13 years.

    @vstinner
    Copy link
    Member Author

    Raymond:

    The randbytes() method needs to depend on genrandbits().

    I created PR 19700 which allows to keep the optimization (C implementation in _randommodule.c) and Random subclasses implement randbytes() with getrandbits().

    @vstinner
    Copy link
    Member Author

    New changeset 2d87577 by Victor Stinner in branch 'master':
    bpo-40286: Remove C implementation of Random.randbytes() (GH-19797)
    2d87577

    @vstinner
    Copy link
    Member Author

    It removed the C implementation of randbytes(): it was the root issue which started discussions here and in bpo-40346. I rejected bpo-40346 (BaseRandom) and related PRs.

    I close the issue.

    @rhettinger
    Copy link
    Contributor

    New changeset f01d1be by Raymond Hettinger in branch 'master':
    bpo-40286: Put methods in correct sections. Add security notice to use secrets for session tokens. (GH-19870)
    f01d1be

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.9 only security fixes stdlib Python modules in the Lib dir
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants