This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Deprecate ob_shash in BytesObject
Type: compile error Stage: resolved
Components: Interpreter Core Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: brandtbucher, christian.heimes, malin, methane, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2022-02-26 08:49 by methane, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
remove-bytes-hash.patch methane, 2022-02-27 02:52
Pull Requests
URL Status Linked Edit
PR 31598 merged methane, 2022-02-26 08:49
PR 32042 merged methane, 2022-03-22 05:48
PR 32176 merged christian.heimes, 2022-03-29 15:34
Messages (21)
msg414083 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-02-26 08:49
Code objects have more and more bytes attributes for now.
To reduce the RAM by code, I want to remove ob_shash (cached hash value) from bytes object.

Sets and dicts have own hash cache.
Unless checking same bytes object against dicts/sets many times, this don't cause big performance loss.
msg414096 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-02-26 12:14
I think it is a legacy of Python 2. Attributes and variable names are Unicode strings in Python 3, so the main reason of this optimization is no longer relevant.

But some programs can still work with encoded bytes instead of strings. In particular os.environ and os.environb are implemented as dict of bytes on non-Windows.
msg414134 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-02-27 02:35
> But some programs can still work with encoded bytes instead of strings. In particular os.environ and os.environb are implemented as dict of bytes on non-Windows.

This change doesn't affect to os.environ.

os.environ[key] does `key.encode(sys.getfilesystemencoding(), "surrogateescape")` internally. So the encoded key doesn't have cached hash.
On the other hand, dict (`self._data`) has own hash cache. So it don't use hash cached in the bytes objects.

On the other hand, this change will affect `os.environb[key]` if key is used repeatedly.
msg414136 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-02-27 02:52
When removed shash:

```
## small key
$ ./python -m pyperf timeit --compare-to ../cpython/python -s 'd={b"foo":1, b"bar":2, b"buzz":3}' -- 'b"key" in d'
/home/inada-n/work/python/cpython/python: ..................... 23.2 ns +- 1.7 ns
/home/inada-n/work/python/remove-bytes-hash/python: ..................... 40.0 ns +- 1.5 ns

Mean +- std dev: [/home/inada-n/work/python/cpython/python] 23.2 ns +- 1.7 ns -> [/home/inada-n/work/python/remove-bytes-hash/python] 40.0 ns +- 1.5 ns: 1.73x slower

## large key
$ ./python -m pyperf timeit --compare-to ../cpython/python -s 'd={b"foo":1, b"bar":2, b"buzz":3};k=b"key"*100' -- 'k in d'
/home/inada-n/work/python/cpython/python: ..................... 22.3 ns +- 1.2 ns
/home/inada-n/work/python/remove-bytes-hash/python: ..................... 108 ns +- 2 ns

Mean +- std dev: [/home/inada-n/work/python/cpython/python] 22.3 ns +- 1.2 ns -> [/home/inada-n/work/python/remove-bytes-hash/python] 108 ns +- 2 ns: 4.84x slower
```


I will reconsider the removal before remove the cache.
We changed code object too often. If Python 3.13 don't use so much bytes objects, we don't need to remove the hash to save some RAM.
msg414605 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-03-06 02:39
New changeset 2d8b764210c8de10893665aaeec8277b687975cd by Inada Naoki in branch 'main':
bpo-46864: Deprecate PyBytesObject.ob_shash. (GH-31598)
https://github.com/python/cpython/commit/2d8b764210c8de10893665aaeec8277b687975cd
msg415684 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2022-03-21 15:47
I'm getting tons of deprecation warnings from deepfreeze script. Please add _Py_COMP_DIAG_IGNORE_DEPR_DECL to Tools/scripts/deepfreeze.py.
msg415743 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-03-22 05:49
I'm sorry. Maybe, ccache hides the warning from me.
msg415744 - (view) Author: Ma Lin (malin) * Date: 2022-03-22 06:20
If run this code, would it be slower?

    bytes_hash = hash(bytes_data)
    bytes_hash = hash(bytes_data)  # get hash twice
msg415747 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-03-22 06:51
Since Python 3.13, yes. It will be bit slower.
msg415748 - (view) Author: Ma Lin (malin) * Date: 2022-03-22 07:06
Since hash() is a public function, maybe some users use hash value to manage bytes objects in their own way, then there may be a performance regression.

For a rough example, dispatch data to 16 servers.

    h = hash(b)
    sendto(server_number=h & 0xF, data=b)
msg415749 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-03-22 07:48
Since the hash is randomized, using hash(bytes) for such use case is not recommended. User should use stable hash functions instead.

I agree that there is few use cases this change cause performance regression. But it is really few compared to overhead of adding 8bytes for all bytes instances.
msg415854 - (view) Author: Ma Lin (malin) * Date: 2022-03-23 05:26
RAM is now relatively cheaper than CPU.
1 million bytes object additionally use 7.629 MiB RAM for ob_shash. (100_0000*8/1024/1024).
This causes hash() performance regression anyway.
msg415855 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-03-23 05:40
Average RAM capacity doesn't grow as CPU cores grows.
Additionally, L1+L2 cache is really limited resource compared to CPU or RAM.

Bytes object is used for co_code that is hot. So cache efficiency is important.

Would you give us more realistic (or real world) example for caching bytes hash is important?
msg415857 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-03-23 08:11
New changeset 894d0ea5afa822c23286e9e68ed80bb1122b402d by Inada Naoki in branch 'main':
bpo-46864: Suppress deprecation warnings for ob_shash. (GH-32042)
https://github.com/python/cpython/commit/894d0ea5afa822c23286e9e68ed80bb1122b402d
msg415865 - (view) Author: Ma Lin (malin) * Date: 2022-03-23 11:50
If put a bytes object into multiple dicts/sets, the hash need to be computed multiple times. This seems a common usage.

bytes is a very basic type, users may use it in various ways. And unskilled users may checking the same bytes object against dicts/sets many times.

FYI, 1 GiB data:

function         seconds
hash()           0.40
binascii.crc32() 1.66   (Gregory P. Smith is trying to improve this)
zlib.crc32()     0.65
msg415923 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-03-24 04:32
First of all, this is just deprecating direct access of `ob_shash`. This makes users need to use `PyObject_Hash()`.
We don't make the final decision about removing it. We just make we can remove it in Python 3.13.

RAM and CACHE efficiency is not the only motivation for this.
There is a discussion about (1) increasing CoW efficiency, and (2) sharing data between subinterpreters after per-interpreter GIL.
Removing ob_shash will help them, especially about the (2).

But if we stop using bytes objects in code objects by Python 3.13, there is no need to remove ob_shash.


> If put a bytes object into multiple dicts/sets, the hash need to be computed multiple times. This seems a common usage.

Doesn't it lose only some milliseconds?
I posted remove-bytes-hash.patch in this issue. Would you measure how this affects whole application performance rather than micro benchmarks?
msg415925 - (view) Author: Ma Lin (malin) * Date: 2022-03-24 07:22
> I posted remove-bytes-hash.patch in this issue. Would you measure how this affects whole application performance rather than micro benchmarks?

I guess not much difference in benchmarks.
But if put a bytes object into multiple dicts/sets, and len(bytes_key) is large, it will take a long time. (1 GiB 0.40 seconds on i5-11500 DDR4-3200)
The length of bytes can be arbitrary,so computing time may be very different.

Is it possible to let code objects use other types? In addition to ob_hash, maybe the extra byte \x00 at the end can be saved.
msg415927 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-03-24 07:51
> I guess not much difference in benchmarks.
> But if put a bytes object into multiple dicts/sets, and len(bytes_key) is large, it will take a long time. (1 GiB 0.40 seconds on i5-11500 DDR4-3200)
> The length of bytes can be arbitrary,so computing time may be very different.

I don't think calculating hash() for large bytes is not so common use case.
Rare use cases may not justify adding 8bytes to basic types, especially users expect it is compact.

Balance is important. Microbenchmark for specific case doesn't guarantee the good balance.
So I want real world examples. Do you know some popular libraries that are depending on hash(bytes) performance?


> Is it possible to let code objects use other types? In addition to ob_hash, maybe the extra byte \x00 at the end can be saved.

Of course, it is possible. But it needs large refactoring around code, including pyc cache file format.
I will try it before 3.13.
msg415958 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2022-03-24 16:40
Just a note: as of this week (GH-31888), we no longer use bytes objects to store bytecode. Instead, the instructions are stored as part of the PyCodeObject struct.

(Of course, we still use bytes objects for the various exception handling and debugging tables attached to code objects, but these are very cold fields by comparison.)

Not to say this work isn't worthwhile, but it probably isn't worth doing for `co_code` *alone*.
msg415984 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2022-03-25 02:15
OK. Cache efficiency is dropped from motivations list.
Current motivations are:

* Memory saving (currently, 4 BytesObject (= 32 bytes of ob_shash) per code object.
* Make bytes objects immutable
  * Share objects among multi interpreters.
  * CoW efficiency.

I close this issue for now, because this issue is just for making direct access of ob_shash deprecated.

After Python 3.12 become beta, we will reconsider about we should remove ob_shash or keep it.
msg416325 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2022-03-30 06:35
New changeset d8f530fe329c6bd9ad6e1a9db9aa32b465c2d67f by Christian Heimes in branch 'main':
bpo-46864: Suppress even more ob_shash deprecation warnings (GH-32176)
https://github.com/python/cpython/commit/d8f530fe329c6bd9ad6e1a9db9aa32b465c2d67f
History
Date User Action Args
2022-04-11 14:59:56adminsetgithub: 91020
2022-03-30 06:35:35christian.heimessetmessages: + msg416325
2022-03-29 15:34:34christian.heimessetpull_requests: + pull_request30254
2022-03-25 02:15:16methanesetstatus: open -> closed
resolution: fixed
messages: + msg415984

stage: patch review -> resolved
2022-03-24 16:40:38brandtbuchersetmessages: + msg415958
2022-03-24 07:51:33methanesetmessages: + msg415927
2022-03-24 07:22:52malinsetmessages: + msg415925
2022-03-24 04:32:00methanesetmessages: + msg415923
2022-03-23 11:50:49malinsetmessages: + msg415865
2022-03-23 08:11:30methanesetmessages: + msg415857
2022-03-23 05:40:52methanesetmessages: + msg415855
2022-03-23 05:26:05malinsetmessages: + msg415854
2022-03-22 07:48:09methanesetmessages: + msg415749
2022-03-22 07:06:58malinsetmessages: + msg415748
2022-03-22 06:51:17methanesetmessages: + msg415747
2022-03-22 06:20:57malinsetnosy: + malin
messages: + msg415744
2022-03-22 05:49:16methanesetmessages: + msg415743
2022-03-22 05:48:24methanesetstage: needs patch -> patch review
pull_requests: + pull_request30132
2022-03-21 15:47:12christian.heimessetstatus: closed -> open

type: compile error

nosy: + christian.heimes
messages: + msg415684
resolution: fixed -> (no value)
stage: resolved -> needs patch
2022-03-06 02:39:32methanesetmessages: + msg414605
2022-03-06 02:39:25methanesetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2022-03-03 00:51:21brandtbuchersetnosy: + brandtbucher
2022-02-27 02:52:55methanesetfiles: + remove-bytes-hash.patch

messages: + msg414136
2022-02-27 02:35:42methanesetmessages: + msg414134
2022-02-26 12:14:43serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg414096
2022-02-26 08:49:46methanesetkeywords: + patch
stage: patch review
pull_requests: + pull_request29721
2022-02-26 08:49:05methanecreate