Message 151625 - Python tracker

Author	lemburg
Recipients	Arach, Arfrever, Huzaifa.Sidhpurwala, Jim.Jewett, Mark.Shannon, PaulMcMillan, Zhiping.Deng, alex, barry, benjamin.peterson, christian.heimes, dmalcolm, eric.araujo, eric.snow, fx5, georg.brandl, grahamd, gregory.p.smith, gvanrossum, gz, jcea, lemburg, mark.dickinson, neologix, pitrou, skrah, terry.reedy, tim.peters, v+python, vstinner, zbysz
Date	2012-01-19.14:27:35
Message-id	<4F182851.5090506@egenix.com>
In-reply-to	<CAMpsgwav+2G-BZ4fw_7HU9zev-L2zk4Jwzsb_UBpjn-TNPDtCA@mail.gmail.com>

Content

STINNER Victor wrote:
> ...
> So I expect something similar in applications: no change in the
> applications, but a lot of hacks/tricks in tests.

Tests usually check output of an application given a certain
input. If those fail with the randomization, then it's likely
real-world application uses will show the same kinds of failures
due to the application changing from deterministic to
non-deterministic via the randomization.
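
A minimal sketch of the kind of output this refers to, assuming a
hypothetical helper that renders a set of strings. Both function names
are made up for illustration and do not come from the patch or from any
application discussed here.

```python
def render_unstable(tags):
    # Iteration order of a set of strings follows their hash values,
    # so this output can differ between runs once hashing is randomized.
    return ",".join(set(tags))


def render_stable(tags):
    # Sorting removes the dependency on hash order.
    return ",".join(sorted(set(tags)))


if __name__ == "__main__":
    tags = ["b", "a", "c", "a"]
    print(render_unstable(tags))  # order not guaranteed
    print(render_stable(tags))
```

A test comparing the first output against a fixed string is the kind
that breaks; the same ordering dependency exists when the application
runs outside the test suite.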

>> BTW: The patch still includes the unnecessary _Py_unicode_hash_secret.suffix
>> which needlessly complicates the code and doesn't add any additional
>> protection against hash value collisions
> 
> How does it complicate the code? It adds an extra XOR to hash(str) and
> 4 or 8 bytes in memory, that's all. It is more difficult to compute
> the secret from hash(str) output if there is a prefix *and* a suffix.
> If there is only a prefix, knowing a single hash(str) value is just
> enough to retrieve directly the secret.

The suffix only introduces a constant change in all hash values
output, so even if you don't know the suffix, you can still
generate data sets with collisions by just having the prefix.
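
To illustrate the point, here is a toy sketch, not the code from the
patch. It assumes a string hash where the prefix is mixed in at the start
and the suffix is XORed in at the very end. The 16-bit mask, the function
names and the key format are all invented so that collisions are quick to
find.

```python
MASK = 0xFFFF  # toy 16-bit hash space (assumption, for fast collisions)


def toy_hash(s, prefix, suffix):
    """Toy hash: prefix mixed in first, suffix XORed in last."""
    if not s:
        return 0
    x = (prefix ^ (ord(s[0]) << 7)) & MASK
    for ch in s:
        x = ((1000003 * x) ^ ord(ch)) & MASK
    x ^= len(s)
    x ^= suffix  # the only place the suffix is used
    return x & MASK


def find_collision(prefix, suffix):
    """Return two distinct strings with the same toy hash."""
    seen = {}
    for i in range(100000):  # more candidates than hash values
        s = "k%d" % i
        h = toy_hash(s, prefix, suffix)
        if h in seen:
            return seen[h], s
        seen[h] = s
    raise RuntimeError("no collision found")


if __name__ == "__main__":
    prefix = 0x1234
    # The attacker knows the prefix and simply guesses suffix = 0.
    a, b = find_collision(prefix, suffix=0)
    for real_suffix in (0x0001, 0xBEEF, 0xFFFF):
        same = toy_hash(a, prefix, real_suffix) == toy_hash(b, prefix, real_suffix)
        print(hex(real_suffix), a, b, same)
```

Because the suffix is applied as one final XOR, two strings that collide
for one suffix value collide for every suffix value.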

>> I don't think it affects more than 0.01% of applications/users :)
> 
> It would help to try a patched Python on a real world application like
> Django to realize how much code is broken (or not) by a randomized
> hash function.

That would help for both approaches, indeed.

Please note that you'd have to extend the randomization to
all other Python data types as well in order to reach the same level
of security as the collision counting approach.

As is, the randomization patch does not solve the integer key attack.
Even though parsers such as JSON and XML-RPC aren't directly affected,
it is quite possible that stringified integers such as IDs are converted
back to integers later during processing, thereby triggering the
attack.
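
A small sketch of that scenario, assuming CPython 3.2 or later, where
integer hashes are reduced modulo sys.hash_info.modulus. The ID values
and variable names are invented for the example, and other Python
versions hash integers differently.

```python
import sys

# Modulus used for numeric hashes on this build (CPython 3.2+).
M = sys.hash_info.modulus

# IDs as they might arrive in a JSON or XML-RPC payload: plain strings.
string_ids = [str(i * M) for i in range(1, 6)]

# Later processing converts them back to integers ...
int_ids = [int(s) for s in string_ids]

# ... and all of them end up with the same hash value.
int_hashes = {hash(n) for n in int_ids}

if __name__ == "__main__":
    print(int_hashes)
```

Randomizing only hash(str) leaves these integer hashes unchanged, so
keys built this way still land in the same hash bucket chain.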

Note that the integer attack also applies to other number types
in Python:

(3, 3, 3)

See Tim's post I referenced earlier on for the reasons. Here's
a quick summary ;-) ...

{3: 3}
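
The two outputs above are consistent with a session along the following
lines. The exact expressions are an assumption; the sketch only relies on
the fact that numbers which compare equal hash equal in Python,
regardless of their type.

```python
# An int, a float and a complex that all compare equal to 3.
values = (3, 3.0, 3 + 0j)

# Equal numbers must have equal hashes, whatever their type.
hashes = tuple(hash(v) for v in values)

# As dict keys they are one and the same key: the first key object is
# kept and the last value assigned wins.
d = {3: 1, 3.0: 2, 3 + 0j: 3}

if __name__ == "__main__":
    print(hashes)  # (3, 3, 3)
    print(d)       # {3: 3}
```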

History
Date	User	Action	Args
2012-01-19 14:27:36	lemburg	set	recipients: + lemburg, gvanrossum, tim.peters, barry, georg.brandl, terry.reedy, gregory.p.smith, jcea, mark.dickinson, pitrou, vstinner, christian.heimes, benjamin.peterson, eric.araujo, grahamd, Arfrever, v+python, alex, zbysz, skrah, dmalcolm, gz, neologix, Arach, Mark.Shannon, eric.snow, Zhiping.Deng, Huzaifa.Sidhpurwala, Jim.Jewett, PaulMcMillan, fx5
2012-01-19 14:27:36	lemburg	link	issue13703 messages
2012-01-19 14:27:35	lemburg	create