classification
Title: Expose siphash
Type: enhancement
Stage:
Components: Extension Modules
Versions: Python 3.8

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Dima.Tisnek, benjamin.peterson, steven.daprano
Priority: normal
Keywords:

Created on 2018-12-28 02:47 by Dima.Tisnek, last changed 2018-12-28 07:09 by steven.daprano.

Messages (5)
msg332633 - Author: Dima Tisnek (Dima.Tisnek) Date: 2018-12-28 02:47
Just recently, I found myself rolling my own simple hash for strings.
(The task was to distribute tests across executors, stably across invocations, with no external input and no security requirement.)

In the old days I'd just `hash(some_variable)` but of course now I cannot. `hashlib.sha*` seemed too complex and I ended up with something like `sum(map(ord, str(some_variable)))`.

How much easier this would be if the `siphash` implementation that CPython uses internally were available to me!
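
For the use case described here (spreading tests over executors the same way on every run and every machine, with no security requirement), a `hashlib` digest reduced modulo the executor count may already be enough. The following is only a sketch under that assumption; `stable_bucket` and the test names are hypothetical and are not part of CPython.

```python
import hashlib


def stable_bucket(name: str, n_executors: int) -> int:
    """Map a string to an executor index, independent of PYTHONHASHSEED."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer. The byte order is
    # given explicitly, so the result does not depend on the machine.
    return int.from_bytes(digest[:8], "big") % n_executors


tests = ["test_login", "test_logout", "test_signup"]  # hypothetical names
assignment = {t: stable_bucket(t, 4) for t in tests}
```

Because the digest depends only on the encoded bytes, every machine computes the same `assignment` for the same names.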
msg332636 - Author: Steven D'Aprano (steven.daprano) (Python committer) Date: 2018-12-28 03:35
> In the old days I'd just `hash(some_variable)` but of course now I cannot.

I'm sorry, I don't understand... why can't you?

py> text = "NOBODY expects the Spanish Inquisition!"
py> hash(text)
1245575277


There's also this:

py> import hashlib
py> hashlib.md5(text.encode('utf-8')).digest()
b'@\xfb[&\t]\x9c\xc0\xc5\xfcvH\xe8:\x1b]'


although it might be a bit expensive if you don't care about security and too weak if you do. Can you explain why hash() isn't suitable?

For what it's worth, I wouldn't use sum() to generate a hash since it may be unbounded and may not be "mixed up" enough. If you can't hash a string, perhaps you can hash a tuple of ints?

py> hash(tuple(map(ord, text)))
-816773268
py> hash(tuple(map(ord, text+"\0")))
667761418
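
To illustrate the "not mixed up enough" concern: the `sum(map(ord, ...))` approach from the first message ignores character order, so any two strings made of the same characters collide. A small sketch; `sum_hash` is a hypothetical name for that ad-hoc hash.

```python
def sum_hash(value) -> int:
    # The ad-hoc hash from the first message.
    return sum(map(ord, str(value)))


# Reordering characters never changes the sum, so anagrams always collide.
collide = sum_hash("test_ab") == sum_hash("test_ba")

# Hashing a tuple of code points mixes in each element's position, so these
# two values are expected (though not formally guaranteed) to differ.
a = hash(tuple(map(ord, "test_ab")))
b = hash(tuple(map(ord, "test_ba")))
```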
msg332639 - Author: Dima Tisnek (Dima.Tisnek) Date: 2018-12-28 04:04
Steven, my requirement calls for the same hash on multiple machines. Python's hash (for strings) is keyed with a random value.

You are correct that `hash(tuple(map(ord, str(something))))` is stable.

In the worst case, I could override `PYTHONHASHSEED` globally.
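
A rough way to check the `PYTHONHASHSEED` route is to start two fresh interpreters with the same seed and compare their string hashes. The helper name `hash_in_child` is hypothetical. Note that a fixed seed only pins the key: the result is not guaranteed to match across Python versions or builds, since the underlying hash function is an implementation detail.

```python
import os
import subprocess
import sys


def hash_in_child(text: str, seed: str) -> int:
    """Return hash(text) as computed by a fresh interpreter with a fixed seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(out.stdout)


# Two separate processes, same seed: the hashes should agree.
first = hash_in_child("some_test_name", "0")
second = hash_in_child("some_test_name", "0")
```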

I suppose this relegates my suggestion to the "why not" or "because we can" category.
msg332644 - Author: Benjamin Peterson (benjamin.peterson) (Python committer) Date: 2018-12-28 06:10
How about using one of these modules? https://pypi.org/search/?q=siphash
msg332646 - Author: Steven D'Aprano (steven.daprano) (Python committer) Date: 2018-12-28 07:09
> Steven, my requirement calls for same hash on multiple machines. 
> Python's hash (for strings) is keyed with a random value.

Ah, of course it does, I forgot about that.

The only problem with exposing siphash is that we are exposing a private 
implementation detail (the specific hash function used) as a public 
interface. That means that we'd need to keep siphash forever, even if we 
want to use a different hash function in the future.

Now maybe we're willing to do that, perhaps exposing it through the 
hashlib module, with no guarantee that it is related in any way to what 
hash() calls. But I think now we're moving into Python-Ideas mailing list
territory, and as Benjamin points out, there is a third-party library.
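
As a point of comparison, `hashlib` already offers a keyed hash with a configurable digest size in `hashlib.blake2b`. It is not SipHash and has no relation to what hash() computes, but a fixed key and an 8-byte digest give a 64-bit value that is the same on every machine. A sketch only; `keyed_hash64` and `KEY` are hypothetical names.

```python
import hashlib


def keyed_hash64(data: bytes, key: bytes) -> int:
    """64-bit keyed hash built on BLAKE2b (not SipHash)."""
    digest = hashlib.blake2b(data, digest_size=8, key=key).digest()
    # Fix the byte order so the integer is identical across platforms.
    return int.from_bytes(digest, "little")


KEY = b"shared-across-machines"  # hypothetical fixed key
value = keyed_hash64(b"some_test_name", KEY)
```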
History
Date | User | Action | Args
2018-12-28 07:09:18 | steven.daprano | set | messages: + msg332646
2018-12-28 06:10:57 | benjamin.peterson | set | nosy: + benjamin.peterson; messages: + msg332644
2018-12-28 04:04:26 | Dima.Tisnek | set | messages: + msg332639
2018-12-28 03:35:19 | steven.daprano | set | nosy: + steven.daprano; messages: + msg332636
2018-12-28 02:47:05 | Dima.Tisnek | create |