Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Deprecate ob_shash in BytesObject #91020

Closed
methane opened this issue Feb 26, 2022 · 34 comments
Closed

Deprecate ob_shash in BytesObject #91020

methane opened this issue Feb 26, 2022 · 34 comments
Labels
3.11 only security fixes build The build process and cross-build interpreter-core (Objects, Python, Grammar, and Parser dirs)

Comments

@methane
Copy link
Member

methane commented Feb 26, 2022

BPO 46864
Nosy @tiran, @methane, @serhiy-storchaka, @animalize, @brandtbucher
PRs
  • bpo-46864: Deprecate PyBytesObject.ob_shash. #31598
  • bpo-46864: Suppress deprecation warnings for ob_shash. #32042
  • bpo-46864: Suppress even more ob_shash deprecation warnings (GH-32176) #32176
  • Files
  • remove-bytes-hash.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2022-03-25.02:15:16.155>
    created_at = <Date 2022-02-26.08:49:05.075>
    labels = ['interpreter-core', 'build', '3.11']
    title = 'Deprecate ob_shash in BytesObject'
    updated_at = <Date 2022-03-30.06:35:35.207>
    user = 'https://github.com/methane'

    bugs.python.org fields:

    activity = <Date 2022-03-30.06:35:35.207>
    actor = 'christian.heimes'
    assignee = 'none'
    closed = True
    closed_date = <Date 2022-03-25.02:15:16.155>
    closer = 'methane'
    components = ['Interpreter Core']
    creation = <Date 2022-02-26.08:49:05.075>
    creator = 'methane'
    dependencies = []
    files = ['50649']
    hgrepos = []
    issue_num = 46864
    keywords = ['patch']
    message_count = 21.0
    messages = ['414083', '414096', '414134', '414136', '414605', '415684', '415743', '415744', '415747', '415748', '415749', '415854', '415855', '415857', '415865', '415923', '415925', '415927', '415958', '415984', '416325']
    nosy_count = 5.0
    nosy_names = ['christian.heimes', 'methane', 'serhiy.storchaka', 'malin', 'brandtbucher']
    pr_nums = ['31598', '32042', '32176']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'compile error'
    url = 'https://bugs.python.org/issue46864'
    versions = ['Python 3.11']

    @methane
    Copy link
    Member Author

    methane commented Feb 26, 2022

    Code objects have more and more bytes attributes for now.
    To reduce the RAM by code, I want to remove ob_shash (cached hash value) from bytes object.

    Sets and dicts have own hash cache.
    Unless checking same bytes object against dicts/sets many times, this don't cause big performance loss.

    @methane methane added 3.11 only security fixes interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Feb 26, 2022
    @serhiy-storchaka
    Copy link
    Member

    I think it is a legacy of Python 2. Attributes and variable names are Unicode strings in Python 3, so the main reason of this optimization is no longer relevant.

    But some programs can still work with encoded bytes instead of strings. In particular os.environ and os.environb are implemented as dict of bytes on non-Windows.

    @methane
    Copy link
    Member Author

    methane commented Feb 27, 2022

    But some programs can still work with encoded bytes instead of strings. In particular os.environ and os.environb are implemented as dict of bytes on non-Windows.

    This change doesn't affect to os.environ.

    os.environ[key] does key.encode(sys.getfilesystemencoding(), "surrogateescape") internally. So the encoded key doesn't have cached hash.
    On the other hand, dict (self._data) has own hash cache. So it don't use hash cached in the bytes objects.

    On the other hand, this change will affect os.environb[key] if key is used repeatedly.

    @methane
    Copy link
    Member Author

    methane commented Feb 27, 2022

    When removed shash:

    ## small key
    $ ./python -m pyperf timeit --compare-to ../cpython/python -s 'd={b"foo":1, b"bar":2, b"buzz":3}' -- 'b"key" in d'
    /home/inada-n/work/python/cpython/python: ..................... 23.2 ns +- 1.7 ns
    /home/inada-n/work/python/remove-bytes-hash/python: ..................... 40.0 ns +- 1.5 ns
    
    Mean +- std dev: [/home/inada-n/work/python/cpython/python] 23.2 ns +- 1.7 ns -> [/home/inada-n/work/python/remove-bytes-hash/python] 40.0 ns +- 1.5 ns: 1.73x slower
    
    ## large key
    $ ./python -m pyperf timeit --compare-to ../cpython/python -s 'd={b"foo":1, b"bar":2, b"buzz":3};k=b"key"*100' -- 'k in d'
    /home/inada-n/work/python/cpython/python: ..................... 22.3 ns +- 1.2 ns
    /home/inada-n/work/python/remove-bytes-hash/python: ..................... 108 ns +- 2 ns
    
    Mean +- std dev: [/home/inada-n/work/python/cpython/python] 22.3 ns +- 1.2 ns -> [/home/inada-n/work/python/remove-bytes-hash/python] 108 ns +- 2 ns: 4.84x slower
    

    I will reconsider the removal before remove the cache.
    We changed code object too often. If Python 3.13 don't use so much bytes objects, we don't need to remove the hash to save some RAM.

    @methane methane closed this as completed Mar 6, 2022
    @methane methane closed this as completed Mar 6, 2022
    @methane
    Copy link
    Member Author

    methane commented Mar 6, 2022

    New changeset 2d8b764 by Inada Naoki in branch 'main':
    bpo-46864: Deprecate PyBytesObject.ob_shash. (GH-31598)
    2d8b764

    @tiran
    Copy link
    Member

    tiran commented Mar 21, 2022

    I'm getting tons of deprecation warnings from deepfreeze script. Please add _Py_COMP_DIAG_IGNORE_DEPR_DECL to Tools/scripts/deepfreeze.py.

    @tiran tiran reopened this Mar 21, 2022
    @tiran tiran added the build The build process and cross-build label Mar 21, 2022
    @tiran tiran reopened this Mar 21, 2022
    @tiran tiran added the build The build process and cross-build label Mar 21, 2022
    @methane
    Copy link
    Member Author

    methane commented Mar 22, 2022

    I'm sorry. Maybe, ccache hides the warning from me.

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 22, 2022

    If run this code, would it be slower?

        bytes_hash = hash(bytes_data)
        bytes_hash = hash(bytes_data)  # get hash twice

    @methane
    Copy link
    Member Author

    methane commented Mar 22, 2022

    Since Python 3.13, yes. It will be bit slower.

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 22, 2022

    Since hash() is a public function, maybe some users use hash value to manage bytes objects in their own way, then there may be a performance regression.

    For a rough example, dispatch data to 16 servers.

        h = hash(b)
        sendto(server_number=h & 0xF, data=b)

    @methane
    Copy link
    Member Author

    methane commented Mar 22, 2022

    Since the hash is randomized, using hash(bytes) for such use case is not recommended. User should use stable hash functions instead.

    I agree that there is few use cases this change cause performance regression. But it is really few compared to overhead of adding 8bytes for all bytes instances.

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 23, 2022

    RAM is now relatively cheaper than CPU.
    1 million bytes object additionally use 7.629 MiB RAM for ob_shash. (100_0000*8/1024/1024).
    This causes hash() performance regression anyway.

    @methane
    Copy link
    Member Author

    methane commented Mar 23, 2022

    Average RAM capacity doesn't grow as CPU cores grows.
    Additionally, L1+L2 cache is really limited resource compared to CPU or RAM.

    Bytes object is used for co_code that is hot. So cache efficiency is important.

    Would you give us more realistic (or real world) example for caching bytes hash is important?

    @methane
    Copy link
    Member Author

    methane commented Mar 23, 2022

    New changeset 894d0ea by Inada Naoki in branch 'main':
    bpo-46864: Suppress deprecation warnings for ob_shash. (GH-32042)
    894d0ea

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 23, 2022

    If put a bytes object into multiple dicts/sets, the hash need to be computed multiple times. This seems a common usage.

    bytes is a very basic type, users may use it in various ways. And unskilled users may checking the same bytes object against dicts/sets many times.

    FYI, 1 GiB data:

    function seconds
    hash() 0.40
    binascii.crc32() 1.66 (Gregory P. Smith is trying to improve this)
    zlib.crc32() 0.65

    @methane
    Copy link
    Member Author

    methane commented Mar 24, 2022

    First of all, this is just deprecating direct access of ob_shash. This makes users need to use PyObject_Hash().
    We don't make the final decision about removing it. We just make we can remove it in Python 3.13.

    RAM and CACHE efficiency is not the only motivation for this.
    There is a discussion about (1) increasing CoW efficiency, and (2) sharing data between subinterpreters after per-interpreter GIL.
    Removing ob_shash will help them, especially about the (2).

    But if we stop using bytes objects in code objects by Python 3.13, there is no need to remove ob_shash.

    If put a bytes object into multiple dicts/sets, the hash need to be computed multiple times. This seems a common usage.

    Doesn't it lose only some milliseconds?
    I posted remove-bytes-hash.patch in this issue. Would you measure how this affects whole application performance rather than micro benchmarks?

    @animalize
    Copy link
    Mannequin

    animalize mannequin commented Mar 24, 2022

    I posted remove-bytes-hash.patch in this issue. Would you measure how this affects whole application performance rather than micro benchmarks?

    I guess not much difference in benchmarks.
    But if put a bytes object into multiple dicts/sets, and len(bytes_key) is large, it will take a long time. (1 GiB 0.40 seconds on i5-11500 DDR4-3200)
    The length of bytes can be arbitrary,so computing time may be very different.

    Is it possible to let code objects use other types? In addition to ob_hash, maybe the extra byte \x00 at the end can be saved.

    @methane
    Copy link
    Member Author

    methane commented Mar 24, 2022

    I guess not much difference in benchmarks.
    But if put a bytes object into multiple dicts/sets, and len(bytes_key) is large, it will take a long time. (1 GiB 0.40 seconds on i5-11500 DDR4-3200)
    The length of bytes can be arbitrary,so computing time may be very different.

    I don't think calculating hash() for large bytes is not so common use case.
    Rare use cases may not justify adding 8bytes to basic types, especially users expect it is compact.

    Balance is important. Microbenchmark for specific case doesn't guarantee the good balance.
    So I want real world examples. Do you know some popular libraries that are depending on hash(bytes) performance?

    Is it possible to let code objects use other types? In addition to ob_hash, maybe the extra byte \x00 at the end can be saved.

    Of course, it is possible. But it needs large refactoring around code, including pyc cache file format.
    I will try it before 3.13.

    @brandtbucher
    Copy link
    Member

    Just a note: as of this week (GH-31888), we no longer use bytes objects to store bytecode. Instead, the instructions are stored as part of the PyCodeObject struct.

    (Of course, we still use bytes objects for the various exception handling and debugging tables attached to code objects, but these are very cold fields by comparison.)

    Not to say this work isn't worthwhile, but it probably isn't worth doing for co_code alone.

    @methane
    Copy link
    Member Author

    methane commented Mar 25, 2022

    OK. Cache efficiency is dropped from motivations list.
    Current motivations are:

    • Memory saving (currently, 4 BytesObject (= 32 bytes of ob_shash) per code object.
    • Make bytes objects immutable
      • Share objects among multi interpreters.
      • CoW efficiency.

    I close this issue for now, because this issue is just for making direct access of ob_shash deprecated.

    After Python 3.12 become beta, we will reconsider about we should remove ob_shash or keep it.

    @tiran
    Copy link
    Member

    tiran commented Mar 30, 2022

    New changeset d8f530f by Christian Heimes in branch 'main':
    bpo-46864: Suppress even more ob_shash deprecation warnings (GH-32176)
    d8f530f

    @seberg
    Copy link
    Contributor

    seberg commented Apr 11, 2022

    This has been brought up in NumPy, NumPy subclasses bytes, and I am not immediately sure how to best initialize the hash value. We can't really call __new__ from C, but FromString would give the wrong type.

    Would it make sense to move the -1 initialization to a custom tp_alloc? Or am I missing an obvious solution? If this the slow is removed entirely, then it does not matter of course.

    @methane
    Copy link
    Member Author

    methane commented Apr 12, 2022

    This has been brought up in NumPy, NumPy subclasses bytes, and I am not immediately sure how to best initialize the hash value.

    Would you give me a link to the code?

    Would it make sense to move the -1 initialization to a custom tp_alloc? Or am I missing an obvious solution? If this the slow is removed entirely, then it does not matter of course.

    How about tp_init?

    @seberg
    Copy link
    Contributor

    seberg commented Apr 12, 2022

    @methane the code is here:
    https://github.com/numpy/numpy/blob/46a61b3aed3cf674586b456b05c6c9b7b9c5020d/numpy/core/src/multiarray/scalarapi.c#L687

    With the initialization a bit later.

    The point being, this creates a bytes subclass the manual way currently. I am not condoning that, but I am also not sure there is an obvious replacement for creating a byte subclass instances. FromStringAndSize would be the right thing, but that won't give the subclass.

    EDIT: The point about init/new is that they are Python level API, and as such expect args/kwargs. And I don't really think that would be ideal here either. I don't mind this requiring a bit of a hackish solution, but right now I am not sure what the solution is at all.

    @methane
    Copy link
    Member Author

    methane commented Apr 12, 2022

    Thank you. tp_alloc seems good.

    @serhiy-storchaka
    Copy link
    Member

    What are problems with calling __new__ (tp_new)?

    @seberg
    Copy link
    Contributor

    seberg commented Apr 12, 2022

    @serhiy-storchaka it requires passing a Python object to create the new bytes (subclass) instance. NumPy does not have such an object available. We would have to create it just to call bytes.__new__()... There is a reason the C-API provides PyBytes_FromStringAndSize, but we can't use it for a subclass.

    Now, I don't care about using "unstable" API. So if the category of "unstable" is created (and the discussion and Petr's last mail sound a bit like it), then it would make sense to me if this ABI detail was available only through that? Otherwise a custom tp_alloc initializing this to -1 makes perfect sense to me, but maybe there are e.g. micro-optimization reasons against it.

    @methane
    Copy link
    Member Author

    methane commented Apr 13, 2022

    Current ideas:

    • Create custom .tp_alloc()
    • Add new PyTypeObject *type parameter to _PyBytes_FromSize() and expose it as a cpython API.

    BTW, Since bytes object is PyVarObject, every subclasses won't extend the PyBytesObject, am I right?

    @seberg
    Copy link
    Contributor

    seberg commented Apr 13, 2022

    BTW, Since bytes object is PyVarObject, every subclasses won't extend the PyBytesObject, am I right?

    NumPy certainly does not, I suppose you could dynamically add more "items" at the end in a custom tp_alloc, though? Again, I am happy with changes or removal of ob_shash (so long it is not in a bug-fix release).
    It just would be nice to avoid the DeprecationWarning in a clean way

    @mattip
    Copy link
    Contributor

    mattip commented Apr 19, 2022

    The happy path of the code in question is:

        PyTypeObject *type = /* get the type object */;
        int itemsize = /* get the string 'length' */;
        PyObject *obj;
        void *destptr;
        obj = type->tp_alloc(type, itemsize);
        destptr = PyBytes_AS_STRING(obj);
        ((PyBytesObject *)obj)->ob_shash = -1;
        memcpy(destptr, data, itemsize);
        return obj;

    How should NumPy refactor that?

    @methane
    Copy link
    Member Author

    methane commented Apr 19, 2022

    #if PY_VERSION_HEX < 0x030b00b0
        ((PyBytesObject *)obj)->ob_shash = -1;
    #endif

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Nov 4, 2022

    I recommend keeping the ob_shash field. The memory use is negligible — 25% on tiny bytes instances and diminishing further in significance as the sizes get larger.

    The caching benefit doesn't happen often but when it does the benefits are significant. Crypto hashes are always stored as bytes and we use them to check for duplicates (i.e. the way git works). The output of str.encode() is bytes and it is legitimate to use this as a cache key (with the lru_cache for example). If I recall correctly, DNA sequence fragments get stored as bytes and are used for repeated dict/set lookups.

    This code has been stable for a very long time and it is almost certain that some code in the wild depends on cached hash values for efficient computation. ISTM the deprecation damages that code for no good reason.

    Looking back over the thread, it seems that the original motivations are mostly gone or are insignificant. Unless someone really wants this, I request that the field be undeprecated.

    @rhettinger rhettinger reopened this Nov 4, 2022
    @methane
    Copy link
    Member Author

    methane commented Nov 4, 2022

    Memory saving is not the only reason for removing the hash cache.
    Please read my previous comment.

    Additionally, this issue doesn't remove the hash cache.
    This issue only deprecate direct access to PyBytesObjects.ob_shash.
    Even if we won't remove the hash cache, direct access to struct member is not recommended.

    @rhettinger
    Copy link
    Contributor

    rhettinger commented Nov 5, 2022

    The hash cache should remain. If you want to block direct access to the structure member that is reasonable.

    For code that does use the hash cache, it is performance critical. The feature has been in a place a long time. Code in the wild almost certainly relies on it (and reasonably so) for good performance. Saying "that CPUs are fast" is dismissive and ignore the essential fact that bytes can be of any size. At some size (likely somewhat small), a fast CPU computation loses to a field lookup, especially for a field that is likely already loaded in the cache line. In contrast, the data array is much less likely to be in the current cache line, so its memory accesses will be expensive.

    I've ready you "reasons" and still object to removing the hash cache. Please undo the deprecation.

    @methane methane closed this as completed May 31, 2023
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.11 only security fixes build The build process and cross-build interpreter-core (Objects, Python, Grammar, and Parser dirs)
    Projects
    None yet
    Development

    No branches or pull requests

    7 participants