This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: _pickle's whichmodule() is slow
Type: performance Stage: resolved
Components: Extension Modules Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: alexandre.vassalotti, brett.cannon, georg.brandl, kbengine, pitrou, python-dev, steve.dower, vstinner, zach.ware
Priority: normal Keywords: patch

Created on 2014-10-20 11:38 by kbengine, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
20141020193031.jpg kbengine, 2014-10-20 11:38
whichmodule_speedup.diff georg.brandl, 2014-10-21 07:44 review
whichmodule.patch pitrou, 2014-10-21 08:17 review
whichmodule_error.patch pitrou, 2014-11-28 22:50 review
Messages (24)
msg229723 - (view) Author: kbengine (kbengine) Date: 2014-10-20 11:38
I have an application, the use of Python3.2.3 development.
When I upgrade to Python3.4.2, found a problem.

I have an extended "xxx.c (c-python)" module, I call pickle to serialize and deserialize, _pickle.c calls the whichmodule to look for this module, The final will be to find the module from sys.modules.

But probably there are 200 elements in sys.modules, Use the "obj = getattribute (module, global_name, allow_qualname)" to try to get the object:

static PyObject *
getattribute(PyObject *obj, PyObject *name, int allow_qualname) {
        ...
        ...
        ...

        tmp = PyObject_GetAttr(obj, subpath);
        Py_DECREF(obj);
 
// There will be hundreds of times to return to NULL
// There will be hundreds of times calls "PyErr_Format           
// (PyExc_AttributeError," Can't get attribute%R on%R ", name, obj);"
// This process will lead to hundreds of calls to "moduleobject.c-
// module_repr(PyModuleObject *m)".

// So I frequently call pickle.dumps CPU consumption sharply.

        if (tmp == NULL) {
            if (PyErr_ExceptionMatches(PyExc_AttributeError)) {
                PyErr_Clear();


                PyErr_Format(PyExc_AttributeError,
                             "Can't get attribute %R on %R", name, obj);
            }
            Py_DECREF(dotted_path);
            return NULL;
        }
       ...
       ...
}

------------------------------------------------------
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      315    0.001    0.000    0.004    0.000 <frozen importlib._bootstrap>:690(_module_repr)




I went wrong?
Look forward to answer, thanks!
msg229727 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2014-10-20 15:02
In Python 3.3 the import machinery changed to use importlib. This means the code to create the representation of a module now calls into Python code (the `<frozen importlib._bootstrap>:690(_module_repr)` you're seeing).

But my question is why are you not calling PyObject_HasAttr() before calling PyObject_GetAttr()? Exceptions may be relatively cheap but they are not free.
msg229746 - (view) Author: kbengine (kbengine) Date: 2014-10-21 01:22
This is a misunderstanding, "getattribute (PyObject *obj, PyObject *name, int allow_qualname)" is the Python code, in the Python/Modules/_pickle.c (1538).

Do efficiency is decreased a lot.
msg229748 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-10-21 07:14
HasAttr would just call GetAttr and discard the exception.

@ OP: what objects are you pickling? You can give them a __module__ attribute to save the lookup in sys.modules.
msg229749 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-21 07:20
There's no doubt that whichmodule() has grown more complex since 3.2. We probably cannot eliminate the O(<number of modules>) component, but could at least make it faster.
msg229751 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-21 07:29
Note the problem is easily reproduce. For example we can take numpy where some ufuncs (i.e. function-like objects implemented in C) don't have a __module__.

$ PYTHONHASHSEED=0 python3.4 -m timeit -s "import pickle, numpy; d=numpy.add" "pickle.dumps(d)"
1000 loops, best of 3: 280 usec per loop
$ PYTHONHASHSEED=0 python3.4 -m timeit -s "import pickle, numpy; d=numpy.diff" "pickle.dumps(d)"
100000 loops, best of 3: 2.74 usec per loop

We see that pickling numpy.add (which doesn't have a __module__) is 100x slower than numpy.diff (which has a __module__)! Note I'm forcing PYTHONHASHSEED for consistent results, since whichmodule() uses dict iteration.

For comparison, 2.7 is fast enough:

$ PYTHONHASHSEED=0 python2.7 -m timeit -s "import cPickle as pickle, numpy; d=numpy.add" "pickle.dumps(d)"
100000 loops, best of 3: 6.12 usec per loop
$ PYTHONHASHSEED=0 python2.7 -m timeit -s "import cPickle as pickle, numpy; d=numpy.diff" "pickle.dumps(d)"
100000 loops, best of 3: 2.35 usec per loop

(varying PYTHONHASHSEED didn't produce any slower results)
msg229752 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-10-21 07:44
Attached patch addresses the repeated __repr__ calling and gives a 4x speedup here.
msg229753 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-21 08:05
Actually, numpy.add takes a different path. Sorry.
msg229754 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-21 08:08
So, a proper way to reproduce it is through Ellipsis (which does go through whichmodule()):

$ PYTHONHASHSEED=0 python3.4 -m timeit -s "import pickle; d=Ellipsis" "pickle.dumps(d)"
1000 loops, best of 3: 201 usec per loop
msg229755 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-10-21 08:10
> For example we can take numpy where some ufuncs (i.e. function-like objects implemented in C) don't have a __module__.

Oh. I was not aware of that. Is there a way to fix numpy to set the __module__ attribute? Maybe we should warn users that serialiaing objects without __module__ is much slower?
msg229756 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-10-21 08:13
whichmodule_speedup.diff: I proposed once a more generic solution which is much more complex to implement. Lazy formatting of Exception message: in most cases, the message is not used. Replace AttributeError(message) with AttributeError(attr=name), only format when str(exc) is called.
msg229757 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-21 08:17
Victor, see https://github.com/numpy/numpy/issues/4952
msg229758 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-21 08:17
Attached patch produces a 8x speedup on Ellipsis.
msg229759 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2014-10-21 08:22
I didn't have numpy anyway, I tested on a function with __module__ set to None manually.  But I see the same speedup for Ellipsis.
msg229900 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-10-23 20:52
New changeset e5ad1f27fb54 by Antoine Pitrou in branch 'default':
Issue #22676: Make the pickling of global objects which don't have a __module__ attribute less slow.
https://hg.python.org/cpython/rev/e5ad1f27fb54
msg229904 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-10-23 21:00
Now applied. As Georg said, though, the definitive fix is to add a __module__ attribute to your global objects.
msg230545 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2014-11-03 16:43
The fix for this now uses module without initializing it, which could lead to a crash if get_dotted_path() tries to raise an exception (it puts module into the error string with %R). See Modules/_pickle.c#l1656 and Modules/_pickle.c#l1538.

I think the fix is to initialize module with Py_None and currently there's no need to bump the refcount (though I'm always wary of doing that in case things change in the future). If that sounds right I'm happy to put the fix in, else Antoine can do it :)
msg230546 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2014-11-03 16:43
(Looks like I was outsmarted by the tracker when trying to put line numbers in my links... is there a way to do that?)
msg231632 - (view) Author: Zachary Ware (zach.ware) * (Python committer) Date: 2014-11-25 04:35
This is still causing a somewhat serious warning on Windows, see [1] for example.  The condition warned about affects every platform.

It took me some time to make sense of that function, but I think I finally understand what's going on.  I think Steve's suggestion of initializing module to Py_None would be fine, or just pass in a known good object like global instead since the 'obj' parameter is only used for error message formatting.  Either way, I think the error check at Modules/_pickle.c:1657 [2] should either use reformat_attribute_error() or do the reformatting itself (to provide a different message) because I don't think the AttributeError message from get_dotted_path would be anywhere close to correct (especially if 'obj' is given as Py_None).

[1] http://buildbot.python.org/all/builders/x86%20Windows7%203.x/builds/8969/steps/compile/logs/stdio
[2] Steve: the 'Comment:' label in the form is a handy link to the devguide section about how magic links work.  In this case, it's filepath.ext:number
msg231829 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-11-28 21:08
On Windows with Visual Studio, I got a compiler warning. In whichmodule(), get_dotted_path() is called with module whereas module is not initialiazed:

    dotted_path = get_dotted_path(module, global_name, allow_qualname);
msg231836 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-11-28 22:50
Here is a patch.
msg231876 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2014-11-30 01:32
That patch looks good to me (better than my fix would have been :) )
msg231929 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2014-12-01 10:59
This issue may have introduce a regression, please see:
Issue #22971: test_pickle: "Fatal Python error: Cannot recover from stack overflow." on FreeBSD buildbot
msg231966 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2014-12-01 23:20
New changeset 3e3bec66409c by Antoine Pitrou in branch 'default':
Fix uninitialized variable after #22676.
https://hg.python.org/cpython/rev/3e3bec66409c
History
Date User Action Args
2022-04-11 14:58:09adminsetgithub: 66866
2014-12-01 23:20:33pitrousetstatus: open -> closed
resolution: fixed
stage: commit review -> resolved
2014-12-01 23:20:15python-devsetmessages: + msg231966
2014-12-01 10:59:16vstinnersetmessages: + msg231929
2014-11-30 01:32:50steve.dowersetmessages: + msg231876
2014-11-28 22:50:04pitrousetfiles: + whichmodule_error.patch

messages: + msg231836
2014-11-28 21:08:32vstinnersetmessages: + msg231829
2014-11-25 04:35:15zach.waresetstatus: closed -> open

nosy: + zach.ware
messages: + msg231632

resolution: fixed -> (no value)
stage: resolved -> commit review
2014-11-03 16:43:57steve.dowersetmessages: + msg230546
2014-11-03 16:43:16steve.dowersetnosy: + steve.dower
messages: + msg230545
2014-10-23 21:00:24pitrousetstatus: open -> closed
resolution: fixed
messages: + msg229904

stage: needs patch -> resolved
2014-10-23 20:52:52python-devsetnosy: + python-dev
messages: + msg229900
2014-10-21 08:22:11georg.brandlsetmessages: + msg229759
2014-10-21 08:17:49pitrousetfiles: + whichmodule.patch

messages: + msg229758
2014-10-21 08:17:18pitrousetmessages: + msg229757
2014-10-21 08:13:37vstinnersetmessages: + msg229756
2014-10-21 08:10:32vstinnersetmessages: + msg229755
2014-10-21 08:08:56pitrousetmessages: + msg229754
2014-10-21 08:05:32pitrousetmessages: + msg229753
2014-10-21 07:44:02georg.brandlsetfiles: + whichmodule_speedup.diff
keywords: + patch
messages: + msg229752
2014-10-21 07:29:44pitrousetmessages: + msg229751
2014-10-21 07:20:56pitrousettitle: Creating the string representation of a module is slower -> _pickle's whichmodule() is slow
stage: needs patch
messages: + msg229749
versions: + Python 3.5, - Python 3.4
2014-10-21 07:14:08georg.brandlsetnosy: + georg.brandl
messages: + msg229748
2014-10-21 01:22:19kbenginesetstatus: pending -> open

messages: + msg229746
2014-10-20 15:02:26brett.cannonsetstatus: open -> pending
title: _pickle.c -> Creating the string representation of a module is slower
nosy: + brett.cannon

messages: + msg229727
2014-10-20 12:21:35vstinnersetnosy: + vstinner
2014-10-20 12:16:31serhiy.storchakasetnosy: + pitrou, alexandre.vassalotti
2014-10-20 11:38:22kbenginecreate