Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

_pickle's whichmodule() is slow #66866

Closed
kbengine mannequin opened this issue Oct 20, 2014 · 24 comments
Closed

_pickle's whichmodule() is slow #66866

kbengine mannequin opened this issue Oct 20, 2014 · 24 comments
Labels
extension-modules C modules in the Modules dir performance Performance or resource usage

Comments

@kbengine
Copy link
Mannequin

kbengine mannequin commented Oct 20, 2014

BPO 22676
Nosy @brettcannon, @birkenfeld, @pitrou, @vstinner, @avassalotti, @zware, @zooba
Files
  • 20141020193031.jpg
  • whichmodule_speedup.diff
  • whichmodule.patch
  • whichmodule_error.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2014-12-01.23:20:33.084>
    created_at = <Date 2014-10-20.11:38:22.549>
    labels = ['extension-modules', 'performance']
    title = "_pickle's whichmodule() is slow"
    updated_at = <Date 2014-12-01.23:20:33.084>
    user = 'https://bugs.python.org/kbengine'

    bugs.python.org fields:

    activity = <Date 2014-12-01.23:20:33.084>
    actor = 'pitrou'
    assignee = 'none'
    closed = True
    closed_date = <Date 2014-12-01.23:20:33.084>
    closer = 'pitrou'
    components = ['Extension Modules']
    creation = <Date 2014-10-20.11:38:22.549>
    creator = 'kbengine'
    dependencies = []
    files = ['36980', '36985', '36986', '37310']
    hgrepos = []
    issue_num = 22676
    keywords = ['patch']
    message_count = 24.0
    messages = ['229723', '229727', '229746', '229748', '229749', '229751', '229752', '229753', '229754', '229755', '229756', '229757', '229758', '229759', '229900', '229904', '230545', '230546', '231632', '231829', '231836', '231876', '231929', '231966']
    nosy_count = 9.0
    nosy_names = ['brett.cannon', 'georg.brandl', 'pitrou', 'vstinner', 'alexandre.vassalotti', 'python-dev', 'zach.ware', 'steve.dower', 'kbengine']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'performance'
    url = 'https://bugs.python.org/issue22676'
    versions = ['Python 3.5']

    @kbengine
    Copy link
    Mannequin Author

    kbengine mannequin commented Oct 20, 2014

    I have an application, the use of Python3.2.3 development.
    When I upgrade to Python3.4.2, found a problem.

    I have an extended "xxx.c (c-python)" module, I call pickle to serialize and deserialize, _pickle.c calls the whichmodule to look for this module, The final will be to find the module from sys.modules.

    But probably there are 200 elements in sys.modules, Use the "obj = getattribute (module, global_name, allow_qualname)" to try to get the object:

    static PyObject *
    getattribute(PyObject *obj, PyObject *name, int allow_qualname) {
    ...
    ...
    ...

            tmp = PyObject_GetAttr(obj, subpath);
            Py_DECREF(obj);
     
    // There will be hundreds of times to return to NULL
    // There will be hundreds of times calls "PyErr_Format           
    // (PyExc_AttributeError," Can't get attribute%R on%R ", name, obj);"
    // This process will lead to hundreds of calls to "moduleobject.c-
    // module_repr(PyModuleObject *m)".

    // So I frequently call pickle.dumps CPU consumption sharply.

            if (tmp == NULL) {
                if (PyErr_ExceptionMatches(PyExc_AttributeError)) {
                    PyErr_Clear();
    
    
                    PyErr_Format(PyExc_AttributeError,
                                 "Can't get attribute %R on %R", name, obj);
                }
                Py_DECREF(dotted_path);
                return NULL;
            }
           ...
           ...
    }

    ncalls tottime percall cumtime percall filename:lineno(function)
    315 0.001 0.000 0.004 0.000 <frozen importlib._bootstrap>:690(_module_repr)

    I went wrong?
    Look forward to answer, thanks!

    @kbengine kbengine mannequin added extension-modules C modules in the Modules dir performance Performance or resource usage labels Oct 20, 2014
    @brettcannon
    Copy link
    Member

    In Python 3.3 the import machinery changed to use importlib. This means the code to create the representation of a module now calls into Python code (the <frozen importlib._bootstrap>:690(_module_repr) you're seeing).

    But my question is why are you not calling PyObject_HasAttr() before calling PyObject_GetAttr()? Exceptions may be relatively cheap but they are not free.

    @brettcannon brettcannon changed the title _pickle.c Creating the string representation of a module is slower Oct 20, 2014
    @kbengine
    Copy link
    Mannequin Author

    kbengine mannequin commented Oct 21, 2014

    This is a misunderstanding, "getattribute (PyObject *obj, PyObject *name, int allow_qualname)" is the Python code, in the Python/Modules/_pickle.c (1538).

    Do efficiency is decreased a lot.

    @birkenfeld
    Copy link
    Member

    HasAttr would just call GetAttr and discard the exception.

    @ OP: what objects are you pickling? You can give them a __module__ attribute to save the lookup in sys.modules.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 21, 2014

    There's no doubt that whichmodule() has grown more complex since 3.2. We probably cannot eliminate the O(<number of modules>) component, but could at least make it faster.

    @pitrou pitrou changed the title Creating the string representation of a module is slower _pickle's whichmodule() is slow Oct 21, 2014
    @pitrou
    Copy link
    Member

    pitrou commented Oct 21, 2014

    Note the problem is easily reproduce. For example we can take numpy where some ufuncs (i.e. function-like objects implemented in C) don't have a __module__.

    $ PYTHONHASHSEED=0 python3.4 -m timeit -s "import pickle, numpy; d=numpy.add" "pickle.dumps(d)"
    1000 loops, best of 3: 280 usec per loop
    $ PYTHONHASHSEED=0 python3.4 -m timeit -s "import pickle, numpy; d=numpy.diff" "pickle.dumps(d)"
    100000 loops, best of 3: 2.74 usec per loop

    We see that pickling numpy.add (which doesn't have a __module__) is 100x slower than numpy.diff (which has a __module__)! Note I'm forcing PYTHONHASHSEED for consistent results, since whichmodule() uses dict iteration.

    For comparison, 2.7 is fast enough:

    $ PYTHONHASHSEED=0 python2.7 -m timeit -s "import cPickle as pickle, numpy; d=numpy.add" "pickle.dumps(d)"
    100000 loops, best of 3: 6.12 usec per loop
    $ PYTHONHASHSEED=0 python2.7 -m timeit -s "import cPickle as pickle, numpy; d=numpy.diff" "pickle.dumps(d)"
    100000 loops, best of 3: 2.35 usec per loop

    (varying PYTHONHASHSEED didn't produce any slower results)

    @birkenfeld
    Copy link
    Member

    Attached patch addresses the repeated __repr__ calling and gives a 4x speedup here.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 21, 2014

    Actually, numpy.add takes a different path. Sorry.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 21, 2014

    So, a proper way to reproduce it is through Ellipsis (which does go through whichmodule()):

    $ PYTHONHASHSEED=0 python3.4 -m timeit -s "import pickle; d=Ellipsis" "pickle.dumps(d)"
    1000 loops, best of 3: 201 usec per loop

    @vstinner
    Copy link
    Member

    For example we can take numpy where some ufuncs (i.e. function-like objects implemented in C) don't have a __module__.

    Oh. I was not aware of that. Is there a way to fix numpy to set the __module__ attribute? Maybe we should warn users that serialiaing objects without __module__ is much slower?

    @vstinner
    Copy link
    Member

    whichmodule_speedup.diff: I proposed once a more generic solution which is much more complex to implement. Lazy formatting of Exception message: in most cases, the message is not used. Replace AttributeError(message) with AttributeError(attr=name), only format when str(exc) is called.

    @pitrou
    Copy link
    Member

    pitrou commented Oct 21, 2014

    Victor, see numpy/numpy#4952

    @pitrou
    Copy link
    Member

    pitrou commented Oct 21, 2014

    Attached patch produces a 8x speedup on Ellipsis.

    @birkenfeld
    Copy link
    Member

    I didn't have numpy anyway, I tested on a function with __module__ set to None manually. But I see the same speedup for Ellipsis.

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Oct 23, 2014

    New changeset e5ad1f27fb54 by Antoine Pitrou in branch 'default':
    Issue bpo-22676: Make the pickling of global objects which don't have a __module__ attribute less slow.
    https://hg.python.org/cpython/rev/e5ad1f27fb54

    @pitrou
    Copy link
    Member

    pitrou commented Oct 23, 2014

    Now applied. As Georg said, though, the definitive fix is to add a __module__ attribute to your global objects.

    @pitrou pitrou closed this as completed Oct 23, 2014
    @zooba
    Copy link
    Member

    zooba commented Nov 3, 2014

    The fix for this now uses module without initializing it, which could lead to a crash if get_dotted_path() tries to raise an exception (it puts module into the error string with %R). See Modules/_pickle.c#l1656 and Modules/_pickle.c#l1538.

    I think the fix is to initialize module with Py_None and currently there's no need to bump the refcount (though I'm always wary of doing that in case things change in the future). If that sounds right I'm happy to put the fix in, else Antoine can do it :)

    @zooba
    Copy link
    Member

    zooba commented Nov 3, 2014

    (Looks like I was outsmarted by the tracker when trying to put line numbers in my links... is there a way to do that?)

    @zware
    Copy link
    Member

    zware commented Nov 25, 2014

    This is still causing a somewhat serious warning on Windows, see [1] for example. The condition warned about affects every platform.

    It took me some time to make sense of that function, but I think I finally understand what's going on. I think Steve's suggestion of initializing module to Py_None would be fine, or just pass in a known good object like global instead since the 'obj' parameter is only used for error message formatting. Either way, I think the error check at Modules/_pickle.c:1657 [2] should either use reformat_attribute_error() or do the reformatting itself (to provide a different message) because I don't think the AttributeError message from get_dotted_path would be anywhere close to correct (especially if 'obj' is given as Py_None).

    [1] http://buildbot.python.org/all/builders/x86%20Windows7%203.x/builds/8969/steps/compile/logs/stdio
    [2] Steve: the 'Comment:' label in the form is a handy link to the devguide section about how magic links work. In this case, it's filepath.ext:number

    @zware zware reopened this Nov 25, 2014
    @vstinner
    Copy link
    Member

    On Windows with Visual Studio, I got a compiler warning. In whichmodule(), get_dotted_path() is called with module whereas module is not initialiazed:

        dotted_path = get_dotted_path(module, global_name, allow_qualname);

    @pitrou
    Copy link
    Member

    pitrou commented Nov 28, 2014

    Here is a patch.

    @zooba
    Copy link
    Member

    zooba commented Nov 30, 2014

    That patch looks good to me (better than my fix would have been :) )

    @vstinner
    Copy link
    Member

    vstinner commented Dec 1, 2014

    This issue may have introduce a regression, please see:
    Issue bpo-22971: test_pickle: "Fatal Python error: Cannot recover from stack overflow." on FreeBSD buildbot

    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Dec 1, 2014

    New changeset 3e3bec66409c by Antoine Pitrou in branch 'default':
    Fix uninitialized variable after bpo-22676.
    https://hg.python.org/cpython/rev/3e3bec66409c

    @pitrou pitrou closed this as completed Dec 1, 2014
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    extension-modules C modules in the Modules dir performance Performance or resource usage
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants