
Hash collision security issue #57912

Closed
warsaw opened this issue Jan 3, 2012 · 328 comments
Labels
3.11 (only security fixes), interpreter-core (Objects, Python, Grammar, and Parser dirs), release-blocker, type-security (a security issue)

Comments

@warsaw
Member

warsaw commented Jan 3, 2012

BPO 13703
Nosy @malemburg, @gvanrossum, @tim-one, @loewis, @warsaw, @birkenfeld, @terryjreedy, @gpshead, @jcea, @mdickinson, @pitrou, @tiran, @benjaminp, @serwy, @merwok, @alex, @skrah, @davidmalcolm, @bz2, @markshannon, @ericsnowcurrently, @JimJJewett, @PaulMcMillan
Dependencies
  • bpo-13704: Random number generator in Python core
Files
  • hash-attack.patch
  • SafeDict.py: SafeDict implementation
  • bench_startup.py
  • random-8.patch
  • hash-collision-counting-dmalcolm-2012-01-20-001.patch
  • amortized-probe-counting-dmalcolm-2012-01-20-002.patch
  • amortized-probe-counting-dmalcolm-2012-01-21-003.patch
  • hash-attack-2.patch
  • hash-attack-3.patch
  • integercollision.py
  • backport-of-hash-randomization-to-2.7-dmalcolm-2012-01-23-001.patch: Backport of haypo's random-8.patch to 2.7
  • hybrid-approach-dmalcolm-2012-01-25-001.patch: Hybrid approach to solving dict DoS attack
  • hybrid-approach-dmalcolm-2012-01-25-002.patch
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-27-001.patch
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-28-001.patch
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-29-001.patch
  • unnamed
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-001.patch
  • optin-hash-randomization-for-3.1-dmalcolm-2012-01-30-002.patch
  • optin-hash-randomization-for-2.6-dmalcolm-2012-01-30-001.patch
  • results-16.txt
  • add-randomization-to-2.6-dmalcolm-2012-02-01-001.patch
  • fix-broken-tests-on-2.6-dmalcolm-2012-02-01-001.patch
  • add-randomization-to-3.1-dmalcolm-2012-02-01-001.patch
  • fix-broken-tests-on-3.1-dmalcolm-2012-02-01-001.patch
  • add-randomization-to-2.6-dmalcolm-2012-02-06-001.patch
  • fix-broken-tests-on-2.6-dmalcolm-2012-02-06-001.patch
  • add-randomization-to-3.1-dmalcolm-2012-02-06-001.patch
  • fix-broken-tests-on-3.1-dmalcolm-2012-02-06-001.patch
  • add-randomization-to-2.6-dmalcolm-2012-02-11-001.patch
  • add-randomization-to-3.1-dmalcolm-2012-02-11-001.patch
  • add-randomization-to-2.6-dmalcolm-2012-02-13-001.patch
  • add-randomization-to-3.1-dmalcolm-2012-02-13-001.patch
  • hash-patch-3.1-gb-03.patch
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2012-03-13.22:25:45.919>
    created_at = <Date 2012-01-03.19:36:49.855>
    labels = ['type-security', 'interpreter-core', 'release-blocker', '3.11']
    title = 'Hash collision security issue'
    updated_at = <Date 2021-11-08.16:57:10.080>
    user = 'https://github.com/warsaw'

    bugs.python.org fields:

    activity = <Date 2021-11-08.16:57:10.080>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2012-03-13.22:25:45.919>
    closer = 'gregory.p.smith'
    components = ['Interpreter Core']
    creation = <Date 2012-01-03.19:36:49.855>
    creator = 'barry'
    dependencies = ['13704']
    files = ['24151', '24169', '24223', '24259', '24286', '24288', '24289', '24295', '24299', '24300', '24304', '24320', '24324', '24343', '24353', '24365', '24366', '24370', '24371', '24375', '24385', '24391', '24392', '24393', '24394', '24434', '24435', '24436', '24437', '24490', '24491', '24514', '24515', '24563']
    hgrepos = []
    issue_num = 13703
    keywords = ['patch']
    message_count = 328.0
    messages = ['150522', '150525', '150526', '150529', '150531', '150532', '150533', '150534', '150541', '150543', '150558', '150559', '150560', '150562', '150563', '150565', '150568', '150569', '150570', '150577', '150589', '150592', '150601', '150609', '150613', '150616', '150619', '150620', '150621', '150622', '150625', '150634', '150635', '150636', '150637', '150638', '150639', '150641', '150642', '150643', '150644', '150645', '150646', '150647', '150648', '150649', '150650', '150651', '150652', '150655', '150656', '150659', '150662', '150665', '150668', '150694', '150699', '150702', '150706', '150707', '150708', '150712', '150713', '150718', '150719', '150724', '150725', '150726', '150727', '150738', '150748', '150756', '150766', '150768', '150769', '150771', '150795', '150829', '150832', '150835', '150836', '150840', '150847', '150856', '150857', '150859', '150865', '150866', '150934', '151012', '151017', '151031', '151033', '151047', '151048', '151061', '151062', '151063', '151064', '151065', '151069', '151070', '151071', '151073', '151074', '151078', '151092', '151120', '151121', '151122', '151157', '151158', '151159', '151167', '151353', '151401', '151402', '151419', '151422', '151448', '151449', '151468', '151472', '151474', '151484', '151519', '151528', '151560', '151561', '151565', '151566', '151567', '151574', '151582', '151583', '151584', '151585', '151586', '151589', '151590', '151596', '151604', '151617', '151620', '151625', '151626', '151628', '151629', '151632', '151633', '151647', '151662', '151664', '151677', '151679', '151680', '151681', '151682', '151684', '151685', '151689', '151691', '151699', '151700', '151701', '151703', '151707', '151714', '151731', '151734', '151735', '151737', '151739', '151744', '151745', '151747', '151748', '151753', '151754', '151756', '151758', '151794', '151796', '151798', '151812', '151813', '151814', '151815', '151825', '151826', '151847', '151850', '151867', '151869', '151870', '151939', '151941', '151942', '151944', '151956', '151959', '151960', '151961', '151965', '151966', '151967', '151970', '151973', '151977', '151984', '152030', '152033', '152037', '152039', '152040', '152041', '152043', '152046', '152051', '152057', '152060', '152066', '152070', '152104', '152112', '152117', '152118', '152125', '152146', '152149', '152183', '152186', '152199', '152200', '152203', '152204', '152270', '152271', '152275', '152276', '152299', '152300', '152309', '152311', '152315', '152335', '152344', '152352', '152362', '152364', '152422', '152452', '152453', '152723', '152730', '152731', '152732', '152734', '152740', '152747', '152753', '152754', '152755', '152758', '152760', '152763', '152764', '152767', '152768', '152769', '152777', '152780', '152781', '152784', '152787', '152789', '152797', '152811', '152855', '153055', '153074', '153081', '153082', '153140', '153141', '153143', '153144', '153297', '153301', '153369', '153395', '153682', '153683', '153690', '153695', '153750', '153753', '153798', '153802', '153817', '153833', '153848', '153849', '153850', '153852', '153853', '153854', '153860', '153861', '153862', '153868', '153872', '153873', '153877', '153975', '153980', '154428', '154430', '154432', '154853', '155293', '155472', '155527', '155680', '155681', '155682', '405727', '405745']
    nosy_count = 36.0
    nosy_names = ['lemburg', 'gvanrossum', 'tim.peters', 'loewis', 'barry', 'georg.brandl', 'terry.reedy', 'gregory.p.smith', 'jcea', 'mark.dickinson', 'pitrou', 'christian.heimes', 'benjamin.peterson', 'roger.serwy', 'eric.araujo', 'grahamd', 'Arfrever', 'v+python', 'alex', 'cvrebert', 'zbysz', 'skrah', 'dmalcolm', 'gz', 'neologix', 'Arach', 'Mark.Shannon', 'python-dev', 'eric.snow', 'Zhiping.Deng', 'Huzaifa.Sidhpurwala', 'Jim.Jewett', 'PaulMcMillan', 'fx5', 'skorgu', 'jsvaughan']
    pr_nums = []
    priority = 'release blocker'
    resolution = 'fixed'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'security'
    url = 'https://bugs.python.org/issue13703'
    versions = ['Python 3.11']

    @warsaw
    Member Author

    warsaw commented Jan 3, 2012

    This is already publicly known and in deep discussion on python-dev. The proper fix is still TBD. Essentially, hash collisions can be exploited to DoS a web framework that automatically parses input forms into dictionaries.

    Start here:

    http://mail.python.org/pipermail/python-dev/2011-December/115116.html
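    To make the cost being discussed concrete, here is a minimal sketch in pure Python (a hypothetical Key class with a constant hash, not an actual exploit payload): inserting k keys that all land in the same bucket costs O(k**2) comparisons in total, while well-distributed keys stay close to O(k).

    import time

    class Key:
        """Toy key whose hash always collides."""
        def __init__(self, i):
            self.i = i
        def __hash__(self):
            return 0                    # every Key lands in the same bucket
        def __eq__(self, other):
            return self.i == other.i

    def fill(n, key=Key):
        start = time.perf_counter()
        d = {}
        for i in range(n):
            d[key(i)] = None
        return time.perf_counter() - start

    # Doubling n roughly quadruples the time with colliding keys,
    # but only roughly doubles it with ordinary int keys.
    print(fill(2000), fill(4000))
    print(fill(2000, key=int), fill(4000, key=int))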

    @warsaw added the interpreter-core (Objects, Python, Grammar, and Parser dirs), release-blocker, and type-security (a security issue) labels on Jan 3, 2012
    @tiran
    Member

    tiran commented Jan 3, 2012

    I had a short chat with Guido yesterday. I'll try to sum up the conversation. Guido, please correct me if I got something wrong or missed a point.

    Guido wants the fix to be as simple and unintrusive as possible, as he wants to provide/apply a patch for Python 2.4 to 3.3. This means any new stuff is off the table unless it's really, really necessary. Say goodbye to my experimental MurmurHash3 patch.

    We haven't agreed whether the randomization should be enabled by default or disabled by default. IMHO it should be disabled for all releases except for the upcoming 3.3 release. The env var PYTHONRANDOMHASH=1 would enable the randomization. It's simple to set the env var in e.g. Apache for mod_python and mod_wsgi.

    @pitrou
    Member

    pitrou commented Jan 3, 2012

    We haven't agreed whether the randomization should be enabled by
    default or disabled by default. IMHO it should be disabled for all
    releases except for the upcoming 3.3 release.

    I think on the contrary it must be enabled by default. Leaving security
    holes open is wrong.

    @tiran
    Member

    tiran commented Jan 3, 2012

    I think on the contrary it must be enabled by default. Leaving security
    holes open is wrong.

    We can't foresee the implications of the randomization, and only a small number of deployments are affected by the problem. But I won't start a fight on the matter. ;)

    @gvanrossum
    Member

    I'm with Antoine -- turn it on by default. Maybe there should be a release candidate to test the waters.

    @warsaw
    Member Author

    warsaw commented Jan 3, 2012

    On Jan 03, 2012, at 08:24 PM, Antoine Pitrou wrote:

    I think on the contrary it must be enabled by default. Leaving security
    holes open is wrong.

    Unless there's evidence of performance regressions or backward
    incompatibilities, I agree.

    @vstinner
    Member

    vstinner commented Jan 3, 2012

    Unless there's evidence of performance regressions
    or backward incompatibilities, I agree.

    If hash() is modified, str(dict) and str(set) will change, for example. That may break doctests. Can we consider that applications should not rely (even indirectly) on hash values, and so should fix (for example) their doctests? Or is it a backward incompatibility?

    hash() was already modified in major Python versions.

    For this specific issue, I consider that security is more important than str(dict).
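    To make the doctest concern concrete, a minimal sketch (hypothetical example in modern Python 3 syntax, not taken from any real project): a doctest that prints a set (or, on the interpreters of that era, a dict) bakes one particular ordering into its expected output, and hash randomization changes that ordering from run to run.

    def unique_words(text):
        """Return the set of words in *text*.

        >>> unique_words("to be or not to be")
        {'to', 'be', 'or', 'not'}
        """
        return set(text.split())

    if __name__ == "__main__":
        import doctest
        doctest.testmod()   # passes or fails depending on the hash seed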

    @benjaminp
    Contributor

    Barry, when this gets fixed, shall we coordinate release times?

    @warsaw
    Member Author

    warsaw commented Jan 3, 2012

    On Jan 03, 2012, at 09:43 PM, Benjamin Peterson wrote:

    Barry, when this gets fixed, shall we coordinate release times?

    Yes!

    @tiran
    Member

    tiran commented Jan 3, 2012

    Randomized hashing destabilizes the unit tests of Python, too. Here are the outputs of four test runs:

    11 tests failed:
    test_collections test_dbm test_dis test_gdb test_inspect
    test_packaging test_set test_symtable test_ttk_textonly
    test_urllib test_urlparse

    9 tests failed:
    test_dbm test_dis test_gdb test_json test_packaging test_set
    test_symtable test_urllib test_urlparse

    10 tests failed:
    test_dbm test_dis test_gdb test_inspect test_packaging test_set
    test_symtable test_ttk_textonly test_urllib test_urlparse

    9 tests failed:
    test_collections test_dbm test_dict test_dis test_gdb
    test_packaging test_symtable test_urllib test_urlparse

    @PaulMcMillan
    Mannequin

    PaulMcMillan mannequin commented Jan 3, 2012

    I agree that we should enable randomness by default, and provide an easy way for users to disable it if necessary (unit test suites that explicitly depend on order being an obvious candidate).

    I'll link my proposed algorithm change here, for the record:
    https://gist.github.com/0a91e52efa74f61858b5

    I've gotten confirmation from several other sources that the fix recommended by the presenters (just a random initialization seed) only prevents the most basic form of the attack.

    @vstinner
    Member

    vstinner commented Jan 4, 2012

    Christian Heimes proposes the following change in his randomhash branch (see issue bpo-13704):

    -    x = (Py_uhash_t) *p << 7;
    +    x = Py_RndHashSeed + ((Py_uhash_t) *p << 7);
         for (i = 0; i < len; i++)
             x = (1000003U * x) ^ (Py_uhash_t) *p++;
         x ^= (Py_uhash_t) len;

    This change doesn't add any security if the attacker can inject any string and retrieve the hash value. You can retrieve Py_RndHashSeed directly using:

    Py_RndHashSeed = intmask((hash("a") ^ len("a") ^ ord("a")) * DIVIDE) - (ord("a") << 7)

    where intmask() truncates to a long (x mod 2^(long bits)) and DIVIDE = 1/1000003 mod 2^(long bits). For example, DIVIDE = 2021759595 for a 32-bit long.
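    A runnable pure-Python model of that recovery (an illustrative sketch only, assuming a 64-bit build, modelling intmask() as a 64-bit mask, and requiring Python 3.8+ for the modular inverse via pow()):

    MASK = 2**64 - 1                          # model of a 64-bit long

    def patched_hash(s, seed):
        """Model of the proposed change: the seed is added before the multiply/xor loop."""
        x = (seed + (ord(s[0]) << 7)) & MASK
        for ch in s:
            x = ((1000003 * x) ^ ord(ch)) & MASK
        return (x ^ len(s)) & MASK

    DIVIDE = pow(1000003, -1, 2**64)          # modular inverse of the multiplier

    def recover_seed(h, ch="a"):
        """Invert patched_hash() for a one-character string, per the formula above."""
        x = h ^ 1 ^ ord(ch)                   # undo "^ len(s)" and "^ ord(ch)"
        x = (x * DIVIDE) & MASK               # undo "* 1000003"
        return (x - (ord(ch) << 7)) & MASK    # undo the initial "seed + (ord << 7)"

    import os
    seed = int.from_bytes(os.urandom(8), "little")
    assert recover_seed(patched_hash("a", seed)) == seed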

    @tiran
    Member

    tiran commented Jan 4, 2012

    Victor, please ignore my code related to hash randomization for now. I've deliberately not linked my branch to this bug report. I'm well aware that it's not secure and that it's pretty easy to reverse engineer the seed from a hash of a short string. The code is a proof of concept to detect failing tests and other issues.

    I'm in private contact with Paul and we are working together. He has done extended research and I'll gladly follow his expertise. I've already discussed the issue with small strings, but I can't recall if it was a private mail to Paul or a public one to the dev list.

    Paul:
    I still think that you should special-case short strings (five or fewer chars sounds good). An attacker can't do much harm with one- to five-char strings, but such short strings may make it too easy to calculate the seed.

    16 KB of seed is still a lot. Most CPUs have about 16 to 32, maybe 64 KB of L1 data cache. 1024 to 4096 bytes should increase cache locality and reduce the speed impact.

    PS: I'm going to reply to your last mail tomorrow.

    @terryjreedy
    Member

    In bpo-13707 I suggest a change to the current hash() entry which is needed independently of this issue, because the default hash (for object()), being tied to id(), is already limited to an object's lifetime. But this change will become more imperative if hash() is made run-dependent for numbers and strings.

    There does not presently seem to *be* a security hole for 64-bit builds, so if there is any noticeable slowdown on 64-bit builds and it is reasonably easy to tie the default to the bitness, I would think it should be off for such builds.

    @vstinner
    Member

    vstinner commented Jan 4, 2012

    Paul's first proposal (on python-dev) was to replace:

    ...
    x = (ord(s[0]) << 7)
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i]))
        ...
    

    by:

    ...
    x = (ord(s[0]) << 7)
    while i < length:
        x = intmask((1000003*x) ^ ord(s[i])) ^ r[x % len_r]
        ...
    

    This change has a vulnerability similar to the one in Christian's suggested change. The "r" array can be retrieved directly with:

    r2 = []
    for i in xrange(len(r)):
        s = chr(intmask(i * UNSHIFT7) % len(r))
        h = intmask(hash(s) ^ len(s) ^ ord(s) ^ ((ord(s) << 7) * MOD))
        r2.append(chr(h))
    r2 = ''.join(r2)

    where UNSHIFT7 = 1/2**7 mod 2^(long bits).

    By the way, this change always uses r[0] to hash all strings of one ASCII character (U+0000-U+007F).

    @pitrou
    Member

    pitrou commented Jan 4, 2012

    I'm in private contact with Paul and we are working together. He has
    done extended research and I'll gladly follow his expertise. I've
    already discussed the issue with small strings, but I can't recall if
    it was a private mail to Paul or a public one to the dev list.

    Can all this be discussed on this issue now that it's the official point
    of reference? It will avoid the repetition of arguments we see here and
    there.

    (I don't think special-casing small strings makes sense, because then
    you have two algorithms to audit rather than one)

    @vstinner
    Member

    vstinner commented Jan 4, 2012

    https://gist.github.com/0a91e52efa74f61858b5

    Please attach a file directly to the issue, or copy/paste the code in your comment. The interesting part of the code:
    ---

    #Proposed replacement
    #--------------------------------------

    import os, array
    size_exponent = 14 #adjust as a memory/security tradeoff
    r = array.array('l', os.urandom(2**size_exponent))
    len_r = len(r)
    
    def _hash_string2(s):
        """The algorithm behind compute_hash() for a string or a unicode."""
        length = len(s)
        #print s
        if length == 0:
            return -1
        x = (ord(s[0]) << 7) ^ r[length % len_r]
        i = 0
        while i < length:
            x = intmask((1000003*x) ^ ord(s[i]))
            x ^= r[x % len_r]
            i += 1
        x ^= length
        return intmask(x)

    r = array.array('l', os.urandom(2**size_exponent))
    len_r = len(r)

    r size should not depend on the size of a long. You should write something like:

    sizeof_long = ctypes.sizeof(ctypes.c_long)
    r_bits = 8
    r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
    r_mask = 2**r_bits-1

    and then replace "% len_r" by "& r_mask".

    What is the minimum value of r_bits? For example, would it be safe to use a single long integer? (r_bits=1)
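    As a side note on the "% len_r" vs "& r_mask" point: the two are equivalent whenever the table length is a power of two, which is what makes the cheaper mask usable here (a trivial check, not part of any proposed patch):

    len_r = 2 ** 8            # any power of two
    r_mask = len_r - 1
    for x in range(100000):
        assert x % len_r == x & r_mask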

    @pitrou
    Member

    pitrou commented Jan 4, 2012

    > r = array.array('l', os.urandom(2**size_exponent))
    > len_r = len(r)

    r size should not depend on the size of a long. You should write something like:

    sizeof_long = ctypes.sizeof(ctypes.c_long)
    r_bits = 8
    r = array.array('l', os.urandom((2**r_bits) * sizeof_long))
    r_mask = 2**r_bits-1

    The final code will be in C and will use neither ctypes nor array.array.
    Arguing about this looks quite pointless IMO.

    @pitrou
    Member

    pitrou commented Jan 4, 2012

    For the record, here is what "man urandom" says about random seed size:

    “[...] no cryptographic primitive available today can hope to promise
    more than 256 bits of security, so if any program reads more than
    256 bits (32 bytes) from the kernel random pool per invocation, or per
    reasonable reseed interval (not less than one minute), that should be
    taken as a sign that its cryptography is not skilfully implemented.”

    In that light, reading a 64-byte seed from /dev/urandom is already a lot, and 4096 bytes is simply insane.

    @vstinner
    Member

    vstinner commented Jan 4, 2012

    I read that the attack cannot be carried out with current computers (it's too expensive) against 64-bit Python. I tried to change str.__hash__ in 32-bit Python to compute the hash in 64 bits and then truncate it to 32 bits: it doesn't change anything, the hash values are the same, so it doesn't improve the security.
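    That observation follows from the structure of the hash: multiplication and xor commute with reduction modulo 2**32, so the low 32 bits of the 64-bit computation are exactly the 32-bit result. A small model of the (unseeded) string hash demonstrating this (my own sketch, not the C code):

    import random

    M32, M64 = 2**32 - 1, 2**64 - 1

    def fnv(s, mask):
        x = (ord(s[0]) << 7) & mask
        for ch in s:
            x = ((1000003 * x) ^ ord(ch)) & mask
        return (x ^ len(s)) & mask

    for _ in range(1000):
        s = "".join(chr(random.randrange(32, 127)) for _ in range(10))
        assert fnv(s, M64) & M32 == fnv(s, M32)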

    @vstinner
    Member

    vstinner commented Jan 4, 2012

    Yet another random hash function, a simplified version of Paul's function. It always uses exactly 256 bits of entropy, and so 32 bytes of memory, and doesn't keep the loop. I don't expect my function to be secure; it just gives the attacker more work to compute the data for an attack against our dict implementation.

    ---

    import os, array, struct
    sizeof_long = struct.calcsize("l")
    r_bits = 256
    r_len = r_bits // (sizeof_long * 8)
    r_mask = r_len - 1
    r = array.array('l', os.urandom(r_len * sizeof_long))
    
    def randomhash(s):
        length = len(s)
        if length == 0:
            return -2
        x = ord(s[0])
        x ^= r[x & r_mask]
        x <<= 7
        for ch in s:
            x = intmask(1000003 * x)
            x ^= ord(ch)
        x ^= length
        x ^= r[x & r_mask]
        return intmask(x)

    The first "x ^= r[x & r_mask]" may be replaced by "x ^= r[(x ^ length) & r_mask]".

    The binary shift is done after the first xor with r because 2**7 and r_len are not coprime (r_len divides 2**7), and so (ord(s[0]) << 7) & r_mask is always zero.

    randomhash(s) == hash(s) if the same index in the r array happens to be used twice. I don't know if this case gives useful information.

    @PaulMcMillan
    Mannequin

    PaulMcMillan mannequin commented Jan 4, 2012

    A couple of things here:

    First, my proposed change is not cryptographically secure. There simply aren't any cryptographic hashing algorithms available that are in the performance class we need. My proposal does make the collision attack quite difficult to carry out, even if the raw output values from the hash are available to the attacker (they should not usually be).

    I favor using the same algorithm between 32- and 64-bit builds for consistency of behavior. Developers would be startled to find that ordering stays consistent on a 64-bit build but varies on 32-bit builds. Additionally, the impracticality of attacking 64-bit builds rests on the fact that these particular researchers didn't devise a way to do it. I'd hate to make this change and then have a clever mathematician publish some elegant point requiring us to go fix the problem all over again.

    I could be convinced either way on small strings. I like that they can't be used to attack the secret. At the same time, I worry that combining 2 different hashing routines into the same output space may introduce unexpected collisions and other difficult to debug edge-case conditions. It also means that the order of the hashes of long strings will vary while the order of short strings will not - another inconsistency which will encourage bugs.

    Thank you Victor for the improvements to the python demonstration code. As Antoine said, it's only demo code, but better demo code is always good.

    Antoine: That section of the manpage is referring to the overall security of a key generated using urandom. 256 bits is overkill for this application. We could take 256 bits and use them to generate a key using a cryptographically appropriate algorithm, but it's simpler to read more bits and use them directly as the key.

    Additionally, that verbiage has been in the man page for urandom for quite some time (probably since the earliest version in the mid 90's). The PRNG has been improved since then.

    Minimum length of r is a hard question. The shorter it is, the higher the correlation of the output. In my tests, 16kb was the amount necessary to generally do reasonably well on my test suite for randomness even with problematic input. Obviously our existing output isn't random, so it doesn't pass those tests at all. Using a fairly small value (4k) should not make the results much worse from a security perspective, but might be problematic from a collision/distribution standpoint. It's clear that we don't need cryptographically good randomness here, but passing the test suite is not a bad thing when considering the distribution.

    When we settle on a C implementation, I'd like to run it through the smhasher set of tests to make sure we aren't making distribution worse, especially for very small values of r.

    @pitrou
    Member

    pitrou commented Jan 4, 2012

    Using a fairly small value (4k) should not make the results much worse
    from a security perspective, but might be problematic from a
    collision/distribution standpoint.

    Keep in mind the average L1 data cache size is between 16KB and 64KB. 4KB is already a significant chunk of that.

    Given a hash function's typical loop is to feed back the current result into the next computation, I don't see why a small value (e.g. 256 bytes) would be detrimental.

    @merwok
    Member

    merwok commented Jan 4, 2012

    If test_packaging fails because it relies on dict order / hash details, that’s a bug. Can you copy the full tb (possibly in another report, I can fix it independently of this issue)?

    @warsaw
    Member Author

    warsaw commented Jan 4, 2012

    On Jan 04, 2012, at 06:00 AM, Paul McMillan wrote:

    Developers would be startled to find that ordering stays consistent on a 64
    bit build but varies on 32 bit builds.

    Well, one positive outcome of this issue is that users will finally viscerally
    understand that dictionary (and set) order should never be relied upon, even
    between successive runs of the same Python executable.

    @malemburg
    Member

    Some comments:

    1. The security implications of all this are being somewhat overemphasized.

    There are many ways you can do a DoS attack on web servers. It's the
    responsibility of the web frameworks and servers in use to deal with
    the possible cases.

    It's a good idea to provide some way to protect against hash
    collision attacks, but that will only solve one possible way of
    causing a resource attack on a server.

    There are other ways you can generate lots of CPU overhead with
    little data input (e.g. think of targeting the search feature on
    many Zope/Plone sites).

    In order to protect against such attacks in general, we'd have to
    provide a way to control CPU time and e.g. raise an exception if too
    much time is being spent on a simple operation such as a key insertion.
    This can be done using timers, signals or even under OS control.

    The easiest way to protect against the hash collision attack is by
    limiting the POST/GET/HEAD request size.

    The second best way would be to limit the number of parameters that a
    web framework accepts for POST/GET/HEAD requests.

    2. Changing the semantics of hashing in a dot release is not allowed.

    If randomization of the hash start vector or some other method is
    enabled by default in a dot release, this will change the semantics
    of any application switching to that dot release.

    The hash values of Python objects are not only used by the Python
    dictionary implementation, but also by other storage mechanisms
    such as on-disk dictionaries, inter-process object exchange via
    shared memory, memcache, etc.

    Hence, if changed, the hash change should be disabled by default
    for dot releases and enabled for 3.3.

    3. Changing the way strings are hashed doesn't solve the problem.

    Hash values of other types can easily be guessed as well, e.g.
    take integers which use a trivial hash function.

    We'd have to adapt all hash functions of the basic types in Python
    or come up with a generic solution using e.g. double-hashing
    in the dictionary/set implementations.

    4. By just using a random start vector you change the absolute
      hash values for specific objects, but not the overall hash sequence
      or its period.

    An attacker only needs to create many hash collisions, not
    specific ones. It's the period of the hash function that's
    important in such attacks and that doesn't change when moving to
    a different start vector.

    5. Hashing needs to be fast.

    It's one of the most used operations in Python. Please get experts on
    board, like Tim Peters and Christian Tismer, who both have worked
    on the dict implementation and the hash functions, before experimenting
    with ad-hoc fixes.

    6. Counting collisions could solve the issue without having to
      change hashing. (A sketch of this idea appears at the end of this comment.)

    Another idea would be counting the collisions and raising an
    exception if the number of collisions exceeds a certain
    threshold.

    Such a change would work for all hashable Python objects and
    protect against the attack without changing any hash function.

    Thanks,

    Marc-Andre Lemburg
    eGenix.com


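    A minimal sketch of the collision-counting idea from point 6 above (a toy open-addressing table with linear probing, not CPython's actual dict code; the threshold value is made up):

    MAX_COLLISIONS = 1000                     # hypothetical threshold

    class TooManyCollisions(RuntimeError):
        pass

    class CountingTable:
        """Toy hash table that aborts pathological insertions instead of crawling."""
        def __init__(self, size=2**16):
            self.slots = [None] * size        # size must be a power of two

        def insert(self, key, value):
            mask = len(self.slots) - 1
            i = hash(key) & mask
            probes = 0
            while self.slots[i] is not None and self.slots[i][0] != key:
                probes += 1
                if probes > MAX_COLLISIONS:
                    raise TooManyCollisions("likely hash-flooding attack")
                i = (i + 1) & mask            # linear probing; CPython perturbs instead
            self.slots[i] = (key, value)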

    @malemburg
    Member

    Marc-Andre Lemburg wrote:

    3. Changing the way strings are hashed doesn't solve the problem.

    Hash values of other types can easily be guessed as well, e.g.
    take integers which use a trivial hash function.

    Here's an example for integers on a 64-bit machine:

    >>> g = ((x*(2**64 - 1), hash(x*(2**64 - 1))) for x in xrange(1, 1000000))
    >>> d = dict(g)

    This takes ages to complete and only uses very little memory. The input
    data has some 32MB (32397634 bytes) if written down in decimal numbers -
    not all that much data either. (All of these keys collide because the hash
    of a Python 2 long is essentially its value reduced modulo 2**64 - 1 on a
    64-bit build, so every multiple of 2**64 - 1 hashes to the same value.)

    @gpshead
    Member

    gpshead commented Feb 21, 2012

    The bug report is the easiest thing to search for and follow when checking whether something is resolved, so it is nice to have a link to the relevant patch(es) for each branch. I just wanted to note the major commit here so that all planned branches had a note recorded. I don't care that it wasn't automatic. :)

    For observers: there have been several more commits related to fixing this (test dict/set order fixes, bug/typo/merge oops fixes for the linked-to patches, etc.). Anyone interested in seeing the full list of diffs should look at their specific branch on or around the time of the linked-to changesets. Too many to list here.

    @gpshead
    Member

    gpshead commented Feb 21, 2012

    Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0? It is now.

    Saying yes "working as intended" is fine by me.

    sys.flags.hash_randomization seems to simply indicate that doing something with the hash seed was explicitly specified as opposed to defaulting to off, not that the hash seed was actually chosen randomly.

    What this implies for 3.3 after we make hash randomization default to on is that sys.flags.hash_randomization will always be 1.

    @birkenfeld
    Member

    That is a good question. I don't really care either way, but let's say +0 for turning it off when seed == 0.

    -R still needs to be made default in 3.3 - that's one reason this issue is still open.

    @vstinner
    Member

    Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0?  It is now.

    Saying yes "working as intended" is fine by me.

    It is documented that PYTHONHASHSEED=0 disables the randomization, so
    sys.flags.hash_randomization must be False (0).
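    For reference, the behaviour under discussion can be observed from outside the interpreter like this (illustrative snippet, assuming a CPython new enough to have sys.flags.hash_randomization):

    import os, subprocess, sys

    code = "import sys; print(hash('spam'), sys.flags.hash_randomization)"
    for seed in ("0", "12345", "random"):
        env = dict(os.environ, PYTHONHASHSEED=seed)
        out = subprocess.check_output([sys.executable, "-c", code], env=env)
        print("PYTHONHASHSEED=%s -> %s" % (seed, out.decode().strip()))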

    @malemburg
    Member

    Gregory P. Smith wrote:

    Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0? It is now.

    The flag should probably be removed - simply because
    the env var is not a flag, it's a configuration parameter.

    Exposing the seed value as sys.hashseed would be better and more useful
    to applications.

    @malemburg
    Member

    STINNER Victor wrote:

    > Question: Should sys.flags.hash_randomization be True (1) when PYTHONHASHSEED=0? It is now.
    >
    > Saying yes "working as intended" is fine by me.

    It is documented that PYTHONHASHSEED=0 disables the randomization, so
    sys.flags.hash_randomization must be False (0).

    PYTHONHASHSEED=1 will disable randomization as well :-)

    Only setting PYTHONHASHSEED=random actually enables randomization.

    @pitrou
    Member

    pitrou commented Feb 21, 2012

    That is a good question. I don't really care either way, but let's
    say +0 for turning it off when seed == 0.

    +1

    @warsaw
    Member Author

    warsaw commented Feb 21, 2012

    On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:

    Exposing the seed value as sys.hashseed would be better and more useful
    to applications.

    That makes the most sense to me.

    @warsaw
    Member Author

    warsaw commented Feb 21, 2012

    On Feb 21, 2012, at 09:48 AM, Marc-Andre Lemburg wrote:

    The flag should probably be removed - simply because
    the env var is not a flag, it's a configuration parameter.

    Exposing the seed value as sys.hashseed would be better and more useful
    to applications.

    Okay, after chatting with __ap__ on irc, here's what I think the behavior
    should be:

    sys.flags.hash_randomization should contain just the value given by the -R
    flag. It should only be True if the flag is present, False otherwise.

    sys.hash_seed contains the hash seed, set by virtue of the flag or envar. It
    should contain the *actual* seed value used. E.g. it might be zero, the
    explicitly set integer, or the randomly selected seed value in use during this
    Python execution if a random seed was requested.

    If you really need the envar value, getenv('PYTHONHASHSEED') is good enough
    for that.

    @gpshead
    Member

    gpshead commented Feb 21, 2012

    +1 to what barry and __ap__ discussed and settled on.

    @warsaw
    Member Author

    warsaw commented Feb 22, 2012

    I have to amend my suggestion about sys.flags.hash_randomization. It needs to be non-zero even if $PYTHONHASHSEED is given instead of -R. Many other flags that also have envars work the same way, e.g. -O and $PYTHONOPTIMIZE. So hash_randomization has to work the same way.

    I'll still work on a patch for exposing the seed in sys.

    @warsaw
    Member Author

    warsaw commented Feb 22, 2012

    Never mind about sys.hash_seed. See my follow-up on python-dev. I consider this issue closed wrt the 2.6 branch.

    @serwy
    Mannequin

    serwy mannequin commented Feb 27, 2012

    After pulling the latest code, random.py no longer works since it tries to import urandom from os on both 3.3 and 2.7.

    @benjaminp
    Contributor

    Can you paste the error you're getting?

    2012/2/26 Roger Serwy <report@bugs.python.org>:

    Roger Serwy <roger.serwy@gmail.com> added the comment:

    After pulling the latest code, random.py no longer works since it tries to import urandom from os on both 3.3 and 2.7.



    @serwy
    Mannequin

    serwy mannequin commented Feb 27, 2012

    It was a false alarm. I didn't recompile python before running it with the latest /Lib files. My apologies.

    @cvrebert
    Mannequin

    cvrebert mannequin commented Mar 3, 2012

    The Design and History FAQ (will) need a minor corresponding update:
    http://docs.python.org/dev/faq/design.html#how-are-dictionaries-implemented

    @kseifriedredhatcom
    Mannequin

    kseifriedredhatcom mannequin commented Mar 10, 2012

    I have assigned CVE-2012-1150 for this issue as per http://www.openwall.com/lists/oss-security/2012/03/10/3

    @jsvaughan
    Mannequin

    jsvaughan mannequin commented Mar 12, 2012

    FWIW I upgraded to ubuntu pangolin beta over the weekend, which includes 2.7.3rc1, and I'm also experiencing a problem with urandom.

    File "/usr/lib/python2.7/email/utils.py", line 27, in <module>
    import random
    File "/usr/lib/python2.7/random.py", line 47, in <module>
    from os import urandom as _urandom
    ImportError: cannot import name urandom

    Given Roger Serwy's comment it sounds like a beta ubuntu problem, but thought it worth mentioning.

    @vstinner
    Member

    FWIW I upgraded to ubuntu pangolin beta over the weekend,
    which includes 2.7.3rc1, ...

    File "/usr/lib/python2.7/random.py", line 47, in <module>
    from os import urandom as _urandom
    ImportError: cannot import name urandom

    It looks like you are using the random.py from Python 2.7.3 with a Python 2.7.2 interpreter, because os.urandom() is now always available in Python 2.7.3.

    @jsvaughan
    Mannequin

    jsvaughan mannequin commented Mar 13, 2012

    Victor - yes that was it; a mixture of a 2.7.2 virtual env and 2.7.3. Apologies for any nuisance caused.

    @vstinner
    Member

    Can we close this issue?

    @gpshead
    Member

    gpshead commented Mar 13, 2012

    I believe so. This is in all of the release candidates.

    The expat/xmlparse.c hash collision DoS issue is being handled on its own via http://bugs.python.org/issue14234.

    @gpshead closed this as completed Mar 13, 2012
    @ahmedsayeed1982 (mannequin) added the 3.11 (only security fixes) label and removed the interpreter-core (Objects, Python, Grammar, and Parser dirs) label on Nov 4, 2021
    @erlend-aasland added the interpreter-core (Objects, Python, Grammar, and Parser dirs) label on Nov 4, 2021
    @gvanrossum
    Member

    Hey Erlend, why did you add so many people to the nosy list of this old
    issue?

    On Thu, Nov 4, 2021 at 07:33 Erlend E. Aasland <report@bugs.python.org>
    wrote:

    Change by Erlend E. Aasland <erlend.aasland@innova.no>:

    ----------
    components: +Interpreter Core -Argument Clinic
    nosy: +Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon,
    PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson,
    christian.heimes, cvrebert, dmalcolm, eric.araujo, eric.snow, fx5,
    georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, jsvaughan,
    lemburg, loewis, mark.dickinson, neologix, pitrou, python-dev, roger.serwy,
    skorgu, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
    -ahmedsayeed1982, larry




    --
    --Guido (mobile)

    @terryjreedy
    Member

    Because today's spammer, whose message was removed, deleted us all. Restoring the version to 3.3 is not possible.

    @ezio-melotti transferred this issue from another repository Apr 10, 2022