Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add pickler hook for the user to customize the serialization of user defined functions and types. #80081

Closed
pierreglaser mannequin opened this issue Feb 5, 2019 · 8 comments
Labels
3.8 only security fixes type-feature A feature request or enhancement

Comments

@pierreglaser
Copy link
Mannequin

pierreglaser mannequin commented Feb 5, 2019

BPO 35900
Nosy @pitrou, @avassalotti, @serhiy-storchaka, @ogrisel, @pierreglaser
PRs
  • bpo-35900: Enable custom reduction callback registration in _pickle #12499
  • bpo-35900: Add a state_setter arg to save_reduce #12588
  • Files
  • test_hook.py
  • pickler_hook.patch
  • pickler_hook.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-05-08.21:08:54.715>
    created_at = <Date 2019-02-05.14:40:22.937>
    labels = ['type-feature', '3.8']
    title = 'Add pickler hook for the user to customize the serialization of user defined functions and types.'
    updated_at = <Date 2019-05-08.21:08:54.715>
    user = 'https://github.com/pierreglaser'

    bugs.python.org fields:

    activity = <Date 2019-05-08.21:08:54.715>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-05-08.21:08:54.715>
    closer = 'pitrou'
    components = []
    creation = <Date 2019-02-05.14:40:22.937>
    creator = 'pierreglaser'
    dependencies = []
    files = ['48101', '48102', '48203']
    hgrepos = []
    issue_num = 35900
    keywords = ['patch']
    message_count = 7.0
    messages = ['334870', '334872', '335404', '337673', '341933', '341944', '341945']
    nosy_count = 5.0
    nosy_names = ['pitrou', 'alexandre.vassalotti', 'serhiy.storchaka', 'Olivier.Grisel', 'pierreglaser']
    pr_nums = ['12499', '12588']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue35900'
    versions = ['Python 3.8']

    @pierreglaser
    Copy link
    Mannequin Author

    pierreglaser mannequin commented Feb 5, 2019

    Pickler objects provide a dispatch_table attribute, where the user can specify
    custom saving functions depending on the object-to-be-saved type. However, for
    performance purposes, this table is predated (in the C implementation only) by
    a hardcoded switch that will take care of the saving for many built-in types,
    without a lookup in the dispatch_table.

    Especially, it is not possible to define custom saving methods for functions
    and classes, although the current default (save_global, that saves an object
    using its module attribute path) is likely to fail at pickling or unpickling
    time in many cases.

    The aforementioned failures exist on purpose in the standard library (as a way
    to allow for the serialization of functions accessible from non-dynamic (*)
    modules only). However, there exist cases where serializing functions from
    dynamic modules matter. These cases are currently handled thanks the
    cloudpickle module (https://github.com/cloudpipe/cloudpickle), that is used by
    many distributed data-science frameworks such as pyspark, ray and dask. For the
    reasons explained above, cloudpickle's Pickler subclass derives from the python
    Pickler class instead of its C class, which severely harms its performance.

    While prototyping with Antoine Pitrou, we came to the conclusion that a hook
    could be added to the C Pickler class, in which an optional user-defined
    callback would be invoked (if defined) when saving functions and classes
    instead of the traditional save_global. Here is a patch so that we can have
    something concrete of which to discuss.

    (*) dynamic module are modules that cannot be imported by name as traditional
    python file backed module. Examples include the __main__ module that can be
    populated dynamically by running a script or by a, user writing code in a
    python shell / jupyter notebook.

    @pierreglaser pierreglaser mannequin added 3.8 only security fixes type-feature A feature request or enhancement labels Feb 5, 2019
    @SilentGhost SilentGhost mannequin changed the title Add pickler hoor for the user to customize the serialization of user defined functions and types. Add pickler hook for the user to customize the serialization of user defined functions and types. Feb 5, 2019
    @pitrou
    Copy link
    Member

    pitrou commented Feb 5, 2019

    FYI, I've removed the duplicate message :-) Also adding Serhiy as cc.

    @ogrisel
    Copy link
    Mannequin

    ogrisel mannequin commented Feb 13, 2019

    Adding such a hook would make it possible to reimplement cloudpickle.CloudPickler by deriving from the fast _pickle.Pickler class (instead of the slow pickle._Pickler as done currently). This would mean rewriting most of the CloudPickler method to only rely on a save_reduce-style design instead of directly calling pickle._Pickler.write and pickle._Pickler.save. This is tedious but doable.

    There is however a blocker with the current way closures are set: when we pickle a dynamically defined function (e.g. lambda, nested function or function in __main__), we currently use a direct call to memoize (https://github.com/cloudpipe/cloudpickle/blob/v0.7.0/cloudpickle/cloudpickle.py#L594) so as to be able to refer to the function itself in its own closure without causing an infinite loop in CloudPickler.dump. This also makes possible to pickle mutually recursive functions.

    The easiest way to avoid having to call memoize explicitly would be to be able to pass the full __closure__ attribute in the state dict of the reduce call. Indeed the save_reduce function calls memoize automatically after saving the reconstructor and its args but prior to saving the state:

    https://github.com/python/cpython/blob/v3.7.2/Modules/_pickle.c#L3903-L3931

    It would therefore be possible to pass a (state, slotstate) tuple with the closure in slotstate that so it could be reconstructed at unpickling time with a setattr:

    https://github.com/python/cpython/blob/v3.7.2/Modules/_pickle.c#L6258-L6272

    However, it is currently not possible to setattr __closure__ at the moment. We can only set individual closure cell contents (which is not compatible with the setattr state trick described above).

    To summarize, we need to implement the setter function for the __closure__ attribute of functions and methods to make it natural to reimplement the CloudPickler by inheriting from _pickle.Pickler using the hook described in this issue.

    @pierreglaser
    Copy link
    Mannequin Author

    pierreglaser mannequin commented Mar 11, 2019

    Update:

    Instead of changing permission on some attributes of function objects (globals and __closure__), we added an optional argument called state_setter to save_reduce. This expects a callable that will be saved inside the object's pickle string, and called when setting the state of the object instead of using the default way in load_build.
    This allows for external flexibility when setting custom pickling behavior of built-in types (in our use-cases: function and classes). I updated the patches so that anyone interested can take a look.

    Also, we tested the cloudpickle package against these patches (see cloudpipe/cloudpickle#253). The tests run fine, and we observe a 10-30x speedup for real-life use-cases. We are starting to hit convergence on the implementation :)

    @pitrou
    Copy link
    Member

    pitrou commented May 8, 2019

    New changeset 65d98d0 by Antoine Pitrou (Pierre Glaser) in branch 'master':
    bpo-35900: Add a state_setter arg to save_reduce (GH-12588)
    65d98d0

    @pitrou
    Copy link
    Member

    pitrou commented May 8, 2019

    New changeset 289f1f8 by Antoine Pitrou (Pierre Glaser) in branch 'master':
    bpo-35900: Enable custom reduction callback registration in _pickle (GH-12499)
    289f1f8

    @pitrou
    Copy link
    Member

    pitrou commented May 8, 2019

    Both PRs are now merged. Thank you Pierre!

    @pitrou pitrou closed this as completed May 8, 2019
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @serhiy-storchaka
    Copy link
    Member

    Sorry, there were hard times to me, so I missed this issue. Adding the 6th item breaks the pickle protocol and the code which expects __reduce__() returning a tuple with 2 to 5 items. See for example #90494. I think this change required introducing a new version of the pickle protocol. It may be too late now.

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants
    @pitrou @serhiy-storchaka and others