classification
Title: Add *Calloc functions to CPython memory allocation API
Type: enhancement Stage:
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: haypo, josh.r, jtaylor, neologix, njs, pitrou, python-dev, skrah
Priority: normal Keywords: patch

Created on 2014-04-15 08:56 by njs, last changed 2014-06-02 20:29 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
calloc.patch haypo, 2014-04-15 21:27 review
calloc-2.patch haypo, 2014-04-16 04:21 review
calloc-3.patch haypo, 2014-04-16 19:48 review
bench_alloc.py haypo, 2014-04-27 10:36
test.c neologix, 2014-04-27 18:31
calloc-4.patch haypo, 2014-04-27 23:03 review
use_calloc.patch haypo, 2014-04-27 23:03 review
bench_alloc2.py haypo, 2014-04-27 23:15
calloc-5.patch haypo, 2014-04-28 09:01 review
calloc-6.patch haypo, 2014-04-29 20:59 review
Messages (95)
msg216281 - (view) Author: Nathaniel Smith (njs) * Date: 2014-04-15 08:55
Numpy would like to switch to using the CPython allocator interface in order to take advantage of the new tracemalloc infrastructure in 3.4. But, numpy relies on the availability of calloc(), and the CPython allocator API does not expose calloc().
  https://docs.python.org/3.5/c-api/memory.html#c.PyMemAllocator

So, we should add *Calloc variants. This met general approval on python-dev. Thread here:
  https://mail.python.org/pipermail/python-dev/2014-April/133985.html

This would involve adding a new .calloc field to the PyMemAllocator struct, exposed through new API functions PyMem_RawCalloc, PyMem_Calloc, PyObject_Calloc. [It's not clear that all 3 would really be used, but since we have only one PyMemAllocator struct that they all share, it'd be hard to add support to only one or two of these domains and not the rest. And the higher-level calloc variants might well be used. Numpy array buffers are often small (e.g., holding only a single value), and these small buffers benefit from small-alloc optimizations; meanwhile, large buffers benefit from calloc optimizations. So it might be optimal to use a single allocator that has both.]

We might also have to rename the PyMemAllocator struct to ensure that compiling old code with the new headers doesn't silently leave garbage in the .calloc field and lead to crashes.
msg216390 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-15 21:27
Here is a first patch adding the following functions:

  void* PyMem_RawCalloc(size_t n);
  void* PyMem_Calloc(size_t n);
  void* PyObject_Calloc(size_t n);
  PyObject* _PyObject_GC_Calloc(size_t);

It adds the following field, after the malloc field, to the PyMemAllocator structure:

  void* (*calloc) (void *ctx, size_t size);

It changes the tracemalloc module to trace "calloc" allocations, adds new tests and documents the new functions.

The patch also contains an important change: PyType_GenericAlloc() uses calloc instead of malloc+memset(0). It may be faster, I didn't check.
msg216394 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-15 21:39
So what is the point of _PyObject_GC_Calloc ?
msg216399 - (view) Author: Josh Rosenberg (josh.r) * Date: 2014-04-15 22:05
General comment on patch: For the flag value that toggles zero-ing, perhaps use a different name, e.g. setzero, clearmem, initzero or somesuch instead of calloc? calloc already gets used to refer to both the C standard function and the function pointer structure member; it's mildly confusing to have it *also* refer to a boolean flag as well.
msg216403 - (view) Author: Josh Rosenberg (josh.r) * Date: 2014-04-15 22:17
Additional comment on clarity: Might it make sense to make the calloc structure member take both the num and size arguments that the underlying calloc takes? That is, instead of:

void* (*calloc) (void *ctx, size_t size);

Declare it as:

void* (*calloc) (void *ctx, size_t num, size_t size);

Beyond potentially allowing more detailed tracing info at some later point (and, much like the original calloc, potentially allowing us to verify that the components do not overflow on multiply, instead of assuming every caller must multiply and check for themselves), it also seems a bit friendlier to have the prototype for the structure's calloc follow the same pattern as the other members for consistency (Principle of Least Surprise): a context pointer, plus the arguments expected by the equivalent C function.
msg216404 - (view) Author: Josh Rosenberg (josh.r) * Date: 2014-04-15 22:20
Sorry for breaking it up, but the same comment on consistent prototypes mirroring the C standard lib calloc would apply to all the API functions as well, e.g. PyMem_RawCalloc, PyMem_Calloc, PyObject_Calloc and _PyObject_GC_Calloc, not just the structure function pointer.
msg216422 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-16 02:40
> So what is the point of _PyObject_GC_Calloc ?

It calls calloc(size) instead of malloc(size), calloc() which can be faster than malloc()+memset(), see:
https://mail.python.org/pipermail/python-dev/2014-April/133985.html

_PyObject_GC_Calloc() is used by PyType_GenericAlloc(). If I understand correctly, it is the default allocator for Python objects.
msg216425 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-16 02:49
In numpy, I found the two following functions:


/*NUMPY_API
 * Allocates memory for array data.
 */
void* PyDataMem_NEW(size_t size);

/*NUMPY_API
 * Allocates zeroed memory for array data.
 */
void* PyDataMem_NEW_ZEROED(size_t size, size_t elsize);

So it looks like it needs two size_t parameters. Prototype of the C function calloc():

void *calloc(size_t nmemb, size_t size);

I agree that it's better to provide the same prototype as calloc().
msg216431 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-16 04:21
New patch:

- replace "size_t size" with "size_t nelem, size_t elsize" in the prototype of calloc functions (the parameter names come from the POSIX standard)
- replace "int calloc" with "int zero" in helper functions
msg216433 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-16 05:34
Le 16/04/2014 04:40, STINNER Victor a écrit :
>
> STINNER Victor added the comment:
>
>> So what is the point of _PyObject_GC_Calloc ?
>
> It calls calloc(size) instead of malloc(size)

No, the question is why you didn't simply change _PyObject_GC_Malloc 
(which is a private function).
msg216444 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-16 07:18
>> So what is the point of _PyObject_GC_Calloc ?
>
> It calls calloc(size) instead of malloc(size), calloc() which can be faster than malloc()+memset(), see:
> https://mail.python.org/pipermail/python-dev/2014-April/133985.html

It will only make a difference if the allocated region is large enough
to be allocated by mmap (so not for 90% of objects).
msg216451 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-16 08:04
>>> So what is the point of _PyObject_GC_Calloc ?
>>
>> It calls calloc(size) instead of malloc(size)
>
> No, the question is why you didn't simply change _PyObject_GC_Malloc
> (which is a private function).

Oh ok, I didn't understand. I don't like changing the behaviour of
functions, but it's maybe fine if the function is private.
msg216452 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-16 08:06
2014-04-16 3:18 GMT-04:00 Charles-François Natali <report@bugs.python.org>:
>> It calls calloc(size) instead of malloc(size), calloc() which can be faster than malloc()+memset(), see:
>> https://mail.python.org/pipermail/python-dev/2014-April/133985.html
>
> It will only make a difference if the allocated region is large enough
> to be allocated by mmap (so not for 90% of objects).

Even if there are only 10% of cases where it may be faster, I think
that it's interesting to use calloc() to allocate Python objects. You
may create large Python objects ;-)

I didn't check which objects use (indirectly) _PyObject_GC_Calloc().
msg216455 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-04-16 09:54
I left a Rietveld comment, which probably did not get mailed.
msg216515 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-16 17:47
On mer., 2014-04-16 at 08:06 +0000, STINNER Victor wrote:
> I didn't check which objects use (indirectly) _PyObject_GC_Calloc().

I've checked: lists, tuples, dicts and sets at least seem to use it.
Obviously, objects which are not tracked by the GC (such as str and
bytes) won't use it.
msg216567 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-16 19:48
Patch version 3: remove _PyObject_GC_Calloc(); modify _PyObject_GC_Malloc() to use calloc() instead of malloc()+memset(0).
msg216668 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-17 07:19
Do you have benchmarks?

(I'm not looking for an improvement, just no regression.)
msg216671 - (view) Author: Julian Taylor (jtaylor) Date: 2014-04-17 08:04
Won't replacing _PyObject_GC_Malloc with a calloc variant cause var objects (PyObject_NewVar) to be completely zeroed, which I believe they weren't before?
Some numeric programs stuff a lot of data into var objects and would care about Python suddenly zeroing memory they don't need zeroed.
An example would be tinyarray.
msg216681 - (view) Author: Josh Rosenberg (josh.r) * Date: 2014-04-17 10:35
Julian: No. See the diff: http://bugs.python.org/review/21233/diff/11644/Objects/typeobject.c

The original GC_Malloc was explicitly memset-ing after confirming that it received a non-NULL pointer from the underlying malloc call; that memset is removed in favor of using the calloc call.
msg216682 - (view) Author: Josh Rosenberg (josh.r) * Date: 2014-04-17 10:39
Well, to be more specific, PyType_GenericAlloc was originally calling one of two methods that didn't zero the memory (one of which was GC_Malloc), then memset-ing. Just realized you're talking about something else; not sure if you're correct about this now, but I have to get to work, will check later if no one else does.
msg216686 - (view) Author: Julian Taylor (jtaylor) Date: 2014-04-17 11:35
I just tested it: PyObject_NewVar seems to use RawMalloc, not the GC malloc, so it's probably fine.
msg217228 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 00:05
I read again some remarks about alignment: it was suggested to provide allocators returning an address aligned to a requested alignment. This topic was already discussed in #18835.

If Python doesn't provide such memory allocators, it was suggested to provide a "trace" function which can be called on the result of a successful allocation to "trace" it (and a similar function for free). But this is very different from the design of PEP 445 (the new malloc API). Basically, it would require rewriting PEP 445.
msg217242 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 08:30
> I read again some remarks about alignment: it was suggested to provide allocators returning an address aligned to a requested alignment. This topic was already discussed in #18835.

The alignment issue is really orthogonal to the calloc one, so IMO
this shouldn't be discussed here. And FWIW I don't think we should
expose those: alignment only matters either for concurrency or SIMD
instructions, and I don't think we should try to standardize this kind
of API, it's way too special-purpose (then we'd have to think about
huge pages, etc.). Whereas calloc is a simple and immediately useful
addition, not only for Numpy but also for CPython.
msg217246 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 09:51
2014-04-27 10:30 GMT+02:00 Charles-François Natali <report@bugs.python.org>:
>> I read again some remarks about alignment: it was suggested to provide allocators returning an address aligned to a requested alignment. This topic was already discussed in #18835.
>
> The alignment issue is really orthogonal to the calloc one, so IMO
> this shouldn't be discussed here. And FWIW I don't think we should
> expose those: alignment only matters either for concurrency or SIMD
> instructions, and I don't think we should try to standardize this kind
> of API, it's way too special-purpose (then we'd have to think about
> huge pages, etc.). Whereas calloc is a simple and immediately useful
> addition, not only for Numpy but also for CPython.

This issue was opened to be able to use tracemalloc on numpy. I would
like to make sure that calloc is enough for numpy. I would prefer to
change the malloc API only once.
msg217251 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 10:20
> This issue was opened to be able to use tracemalloc on numpy. I would
> like to make sure that calloc is enough for numpy. I would prefer to
> change the malloc API only once.

Then please at least rename the issue. Also, I don't see why
everything should be done at once: calloc support is a self-contained
change, which is useful outside of numpy. Enhanced tracemalloc support
for numpy certainly belongs to another issue.

Regarding the *Calloc functions: how about we provide a sane API
instead of reproducing the cumbersome C API?

I mean, why not expose:
PyAPI_FUNC(void *) PyMem_Calloc(size_t size);
instead of
PyAPI_FUNC(void *) PyMem_Calloc(size_t nelem, size_t elsize);

AFAICT, the two arguments are purely historical (it was used when
malloc() didn't guarantee suitable alignment, and has the advantage of
performing overflow check when doing the multiplication, but in our
code we always check for it anyway).
See
https://groups.google.com/forum/#!topic/comp.lang.c/jZbiyuYqjB4
http://stackoverflow.com/questions/4083916/two-arguments-to-calloc

And http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/malloc/malloc.c?view=markup
to check that calloc(nelem, elsize) is implemented as calloc(nelem *
elsize)

I'm also concerned about the change to _PyObject_GC_Malloc(): it now
calls calloc() instead of malloc(): it can definitely be slower.
msg217252 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 10:32
Note to numpy devs: it would be great if some of you followed the
python-dev mailing list (I know it can be quite volume intensive, but
maybe simple filters could help keep the noise down): you guys have
definitely both expertise and real-life applications that could be
very valuable in helping us design the best possible public/private
APIs. It's always great to have downstream experts/end-users!
msg217253 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 10:36
I wrote a short microbenchmark allocating objects using my benchmark.py script.

It looks like the operation "(None,) * N" is slower with calloc-3.patch, but it's unclear how many times slower it is. I don't understand why only this operation shows a different speed.

Do you have ideas for other benchmarks?

Using the timeit module:

$ ./python.orig -m timeit '(None,) * 10**5'
1000 loops, best of 3: 357 usec per loop
$ ./python.calloc -m timeit '(None,) * 10**5'
1000 loops, best of 3: 698 usec per loop

But with different parameters, the difference is smaller:

$ ./python.orig -m timeit -r 20 -n '1000' '(None,) * 10**5'
1000 loops, best of 20: 362 usec per loop
$ ./python.calloc -m timeit -r 20 -n '1000' '(None,) * 10**5'
1000 loops, best of 20: 392 usec per loop


Results of bench_alloc.py:

Common platform:
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
CPU model: Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz
Python unicode implementation: PEP 393
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Timer: time.perf_counter
SCM: hg revision=462470859e57+ branch=default date="2014-04-26 19:01 -0400"
Platform: Linux-3.13.8-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
Bits: int=32, long=64, long long=64, size_t=64, void*=64

Platform of campaign orig:
Timer precision: 42 ns
Date: 2014-04-27 12:27:26
Python version: 3.5.0a0 (default:462470859e57, Apr 27 2014, 11:52:55) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

Platform of campaign calloc:
Timer precision: 45 ns
Date: 2014-04-27 12:29:10
Python version: 3.5.0a0 (default:462470859e57+, Apr 27 2014, 12:04:57) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

-----------------------------------+--------------+---------------
Tests                              |         orig |         calloc
-----------------------------------+--------------+---------------
object()                           |    61 ns (*) |          62 ns
b'A' * 10                          |    55 ns (*) |    51 ns (-7%)
b'A' * 10**3                       |    99 ns (*) |          94 ns
b'A' * 10**6                       |  37.5 us (*) |        36.6 us
'A' * 10                           |    62 ns (*) |    58 ns (-7%)
'A' * 10**3                        |   107 ns (*) |         104 ns
'A' * 10**6                        |    37 us (*) |        36.6 us
'A' * 10**8                        |  16.2 ms (*) |        16.4 ms
decode 10 null bytes from ASCII    |   253 ns (*) |         248 ns
decode 10**3 null bytes from ASCII |   359 ns (*) |         357 ns
decode 10**6 null bytes from ASCII |  78.8 us (*) |        78.7 us
decode 10**8 null bytes from ASCII |  26.2 ms (*) |        25.9 ms
(None,) * 10**0                    |    30 ns (*) |          30 ns
(None,) * 10**1                    |    78 ns (*) |          77 ns
(None,) * 10**2                    |   427 ns (*) |   460 ns (+8%)
(None,) * 10**3                    |   3.5 us (*) |   3.7 us (+6%)
(None,) * 10**4                    |  34.7 us (*) |  37.2 us (+7%)
(None,) * 10**5                    |   357 us (*) |   390 us (+9%)
(None,) * 10**6                    |  3.86 ms (*) | 4.43 ms (+15%)
(None,) * 10**7                    |  50.4 ms (*) |        50.3 ms
(None,) * 10**8                    |   505 ms (*) |         504 ms
([None] * 10)[1:-1]                |   121 ns (*) |         120 ns
([None] * 10**3)[1:-1]             |  3.57 us (*) |        3.57 us
([None] * 10**6)[1:-1]             |  4.61 ms (*) |        4.59 ms
([None] * 10**8)[1:-1]             |   585 ms (*) |         582 ms
-----------------------------------+--------------+---------------
Total                              | 1.19 sec (*) |       1.19 sec
-----------------------------------+--------------+---------------
msg217254 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-27 11:02
> Regarding the *Calloc functions: how about we provide a sane API
> instead of reproducing the cumbersome C API?

Isn't the point of reproducing the C API to allow quickly switching from calloc() to PyObject_Calloc()?
(besides, it seems the OpenBSD guys like the two-argument form :-))
msg217255 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-04-27 11:05
Just to add another data point, I don't find the calloc() API
cumbersome.
msg217256 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 11:12
It looks like calloc-3.patch is wrong: it modifies _PyObject_GC_Malloc() to fill the newly allocated buffer with zeros, but _PyObject_GC_Malloc() is not only called by PyType_GenericAlloc(): it is also used by _PyObject_GC_New() and _PyObject_GC_NewVar(). The patch may be a little slower because it writes zeros twice.

calloc.patch adds "PyObject* _PyObject_GC_Calloc(size_t);" and doesn't have this issue.
msg217257 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-04-27 11:26
Actually, I think we have to match the C-API:  For instance, in
Modules/_decimal/_decimal.c:5527 the libmpdec allocators are
set to the Python allocators.

So I'd need to do:

mpd_callocfunc = PyMem_Calloc;


I suppose that's a common use case.
msg217262 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 13:19
> It looks like calloc-3.patch is wrong: it modifies _PyObject_GC_Malloc() to fill the newly allocated buffer with zeros, but _PyObject_GC_Malloc() is not only called by PyType_GenericAlloc(): it is also used by _PyObject_GC_New() and _PyObject_GC_NewVar(). The patch may be a little slower because it writes zeros twice.

Exactly (sorry, I thought you'd already seen that, otherwise I could
have told you!)

> Actually, I think we have to match the C-API:  For instance, in
> Modules/_decimal/_decimal.c:5527 the libmpdec allocators are
> set to the Python allocators.

Hmm, ok then, I didn't know we were plugging our allocators into
external libraries: that's indeed a very good reason to keep the same
prototype.

But I still find this API cumbersome: calloc is exactly like malloc
except for the zeroing, so the prototype could be simpler (a quick
look at Victor's patch shows a lot of calloc(1, n), which is a sign
something's wrong). Maybe it's just me ;-)

Otherwise, a random thought: by changing PyType_GenericAlloc() from
malloc() + memset(0) to calloc(), there could be a subtle side effect:
if a given type relies on the 0-setting (which is documented), and
doesn't do any other work on the allocated area behind the scenes
(think about a mmap-like object), we could lose our capacity to detect
MemoryError, and run into segfaults instead.

Because if a code creates many such objects which basically just do
calloc(), on operating systems with memory overcommitting (such as
Linux), the calloc() allocations will pretty much always succeed, but
will segfault when the page is first written to in case of low memory.

I don't think such use cases should be common: I would expect most
types to use tp_alloc(type, 0) and then use an internal additional
pointer for the allocations it needs, or immediately write to the
allocated memory area right after allocation, but that's something to
keep in mind.
msg217274 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 15:43
"And http://www.eglibc.org/cgi-bin/viewvc.cgi/trunk/libc/malloc/malloc.c?view=markup
to check that calloc(nelem, elsize) is implemented as calloc(nelem *
elsize)"

__libc_calloc() starts with a check on integer overflow.
msg217276 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 15:59
> __libc_calloc() starts with a check on integer overflow.

Yes, see my previous message:
"""
AFAICT, the two arguments are purely historical (it was used when
malloc() didn't guarantee suitable alignment, and has the advantage of
performing overflow check when doing the multiplication, but in our
code we always check for it anyway).
"""
msg217282 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 16:31
list: items are allocated in a second memory block. PyList_New() uses memset(0) to set all items to NULL.

tuple: header and items are stored in a single structure (PyTupleObject), in a single memory block. PyTuple_New() fills the items with NULL (so null bytes are written again). Something can be optimized here.

dict: header, keys and values are stored in 3 different memory blocks. It may be interesting to use calloc() to allocate keys and values. Initialization of keys and values to NULL uses a dummy loop. I expect that memset(0) would be faster.

Anyway, I expect that all items of builtin containers (tuple, list, dict, etc.) are set to non-NULL values. So the lazy initialization to zeros may be useless for them.

It means that benchmarking builtin containers should not show any speedup. Something else (numpy?) should be used to see an interesting speedup.
msg217283 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 16:38
"Because if a code creates many such objects which basically just do
calloc(), on operating systems with memory overcommitting (such as
Linux), the calloc() allocations will pretty much always succeed, but
will segfault when the page is first written to in case of low memory."

Overcommit leads to a segmentation fault when there is no more memory, but I don't see how calloc() is worse than malloc()+memset(0). It will crash in both cases, no?

In my experience (embedded device with low memory), programs crash because they don't check the result of malloc() (return NULL on allocation failure), not because of overcommit.
msg217284 - (view) Author: Nathaniel Smith (njs) * Date: 2014-04-27 16:39
@Charles-François: I think your worries about calloc and overcommit are unjustified. First, calloc and malloc+memset actually behave the same way here -- with a large allocation and overcommit enabled, malloc and calloc will both go ahead and return the large allocation, and then the actual out-of-memory (OOM) event won't occur until the memory is accessed. In the malloc+memset case this access will occur immediately after the malloc, during the memset -- but this is still too late for us to detect the malloc failure. Second, OOM does not cause segfaults on any system I know. On Linux it wakes up the OOM killer, which shoots some random (possibly guilty) process in the head. The actual program which triggered the OOM is quite likely to escape unscathed. In practice, the *only* cases where you can get a MemoryError on modern systems are (a) if the user has turned overcommit off, (b) you're on a tiny embedded system that doesn't have overcommit, (c) if you run out of virtual address space. None of these cases are affected by the differences between malloc and calloc.

Regarding the calloc API: it's a wart, but it seems like a pretty unavoidable wart at this point, and the API compatibility argument is strong. I think we should just keep the two argument form and live with it...
msg217291 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 17:29
> @Charles-François: I think your worries about calloc and overcommit are unjustified. First, calloc and malloc+memset actually behave the same way here -- with a large allocation and overcommit enabled, malloc and calloc will both go ahead and return the large allocation, and then the actual out-of-memory (OOM) event won't occur until the memory is accessed. In the malloc+memset case this access will occur immediately after the malloc, during the memset -- but this is still too late for us to detect the malloc failure.

Not really: what you describe only holds for a single object.
But if you allocate let's say 1000 such objects at once:
- in the malloc + memset case, the committed pages are progressively
accessed (i.e. the pages for object N are accessed before the memory
is allocated for object N+1), so they will be counted not only as
committed, but also as active (for example the RSS will increase
gradually): so at some point, even though by default the Linux VM
subsystem is really lenient toward overcommitting, you'll likely have
malloc/mmap return NULL because of this
- in the calloc() case, all the memory is first committed, but not
touched: the kernel will likely happily overcommit all of this. Only
when you start progressively accessing the pages will the OOM kick in.

> Second, OOM does not cause segfaults on any system I know. On Linux it wakes up the OOM killer, which shoots some random (possibly guilty) process in the head. The actual program which triggered the OOM is quite likely to escape unscathed.

Ah, did I say segfault?
Sorry, I of course meant that the process will get nuked by the OOM killer.

> In practice, the *only* cases where you can get a MemoryError on modern systems are (a) if the user has turned overcommit off, (b) you're on a tiny embedded system that doesn't have overcommit, (c) if you run out of virtual address space. None of these cases are affected by the differences between malloc and calloc.

That's a common misconception: provided that the memory allocated is
accessed progressively (see above point), you'll often get ENOMEM,
even with overcommitting:

$ /sbin/sysctl -a | grep overcommit
vm.nr_overcommit_hugepages = 0
vm.overcommit_memory = 0
vm.overcommit_ratio = 50

$ cat /tmp/test.py
l = []

with open('/proc/self/status') as f:
    try:
        for i in range(50000000):
            l.append(i)
    except MemoryError:
        for line in f:
            if 'VmPeak' in line:
                print(line)
        raise

$ python /tmp/test.py
VmPeak:   720460 kB

Traceback (most recent call last):
  File "/tmp/test.py", line 7, in <module>
    l.append(i)
MemoryError

I have a 32-bit machine, but the process definitely has more than
720MB of address space ;-)

If your statement were true, this would mean that it's almost
impossible to get ENOMEM with overcommitting on a 64-bit machine,
which is - fortunately - not true. Just try python -c "[i for i in
range(<large value>)]" on a 64-bit machine, I'll bet you'll get a
MemoryError (ENOMEM).
msg217294 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-27 17:40
> Just try python -c "[i for i in
> range(<large value>)]" on a 64-bit machine, I'll bet you'll get a
> MemoryError (ENOMEM).

Hmm, I get an OOM kill here.
msg217295 - (view) Author: Nathaniel Smith (njs) * Date: 2014-04-27 17:41
On my laptop (x86-64, Linux 3.13, 12 GB RAM):

$ python3 -c "[i for i in range(999999999)]"
zsh: killed     python3 -c "[i for i in range(999999999)]"

$ dmesg | tail -n 2
[404714.401901] Out of memory: Kill process 10752 (python3) score 687 or sacrifice child
[404714.401903] Killed process 10752 (python3) total-vm:17061508kB, anon-rss:10559004kB, file-rss:52kB

And your test.py produces the same result. Are you sure you don't have a ulimit set on address space?
msg217297 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 17:48
> And your test.py produces the same result. Are you sure you don't have a ulimit set on address space?

Yep, I'm sure:
$  ulimit -v
unlimited

It's probably due to the exponential over-allocation used by the array
(to guarantee amortized constant cost).

How about:
python -c "b = bytes('x' * <large>)"
msg217298 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 17:53
Dammit, read:

python -c 'b"x" * (2**48)'
msg217302 - (view) Author: Nathaniel Smith (njs) * Date: 2014-04-27 18:27
Right, python3 -c 'b"x" * (2 ** 48)' does give an instant MemoryError for me. So I was wrong about it being the VM limit indeed.

The documentation on this is terrible! But, if I'm reading this right:
   http://lxr.free-electrons.com/source/mm/util.c#L434
the actual rules are:

overcommit mode 1: allocating a VM range always succeeds.
overcommit mode 2: (Slightly simplified) You can allocate total VM ranges up to (swap + RAM * overcommit_ratio), and overcommit_ratio is 50% by default. So that's a bit odd, but whatever. This is still entirely a limit on VM size.
overcommit mode 0 ("guess", the default): when allocating a VM range, the kernel imagines what would happen if you immediately used all those pages. If that would put you OOM, then we fall back to mode 2 rules. If that would *not* put you OOM, then the allocation unconditionally succeeds.

So yeah, touching pages can affect whether a later malloc returns ENOMEM.

I'm not sure any of this actually matters in the Python case though :-). There's still no reason to go touching pages pre-emptively just in case we might write to them later -- all that does is increase the interpreter's memory footprint, which can't help anything. If people are worried about overcommit, then they should turn off overcommit, not try to disable it on a piece-by-piece basis by getting individual programs to touch memory before they need it.
msg217303 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 18:31
Alright, it bothered me so I wrote a small C testcase (attached),
which calls malloc in a loop, and can call memset upon the allocated
block right after allocation:

$ gcc -o /tmp/test /tmp/test.c; /tmp/test
malloc() returned NULL after 3050MB
$ gcc -DDO_MEMSET -o /tmp/test /tmp/test.c; /tmp/test
malloc() returned NULL after 2130MB

Without memset, the kernel happily allocates until we reach the 3GB
user address space limit.
With memset, it bails out way before.

I don't know what this'll give on 64-bit, but I assume one should get
comparable results.
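
For reference, the attached test.c is essentially the following (a hedged reconstruction, since the file itself isn't inlined here; the `max_mb` cap is an addition so the sketch stays safe to run):

```c
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#define CHUNK (1024 * 1024)  /* 1 MB per allocation */

/* malloc() fixed-size chunks in a loop, optionally touching every page
   with memset(); return how many MB were obtained before malloc()
   returned NULL (or before the safety cap was reached). */
size_t alloc_until_fail(size_t max_mb, int do_memset)
{
    size_t mb;
    for (mb = 0; mb < max_mb; mb++) {
        char *p = malloc(CHUNK);
        if (p == NULL)
            break;                  /* the kernel refused more memory */
        if (do_memset)
            memset(p, 'x', CHUNK);  /* fault every page in, committing it */
        /* Deliberately leaked, as in the original testcase, to keep
           pressure on the address space. */
    }
    return mb;
}
```

Running it once plain and once with the memset path enabled reproduces the two numbers above: with overcommit, the untouched allocations can go much further than the touched ones.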

I would guess that the reason why the Python list allocation fails is
because of the exponential allocation scheme: since memory is
allocated in large chunks before being used, the kernel happily
overallocates.
With a more progressive allocation+usage, it should return ENOMEM at some point.
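
The exponential over-allocation scheme mentioned here can be sketched as follows (the doubling factor is illustrative, not CPython's exact list growth formula): large chunks are requested from the allocator well before they are actually written, which is exactly the pattern an overcommitting kernel accepts happily.

```c
#include <stdlib.h>
#include <assert.h>

typedef struct {
    char *data;
    size_t len;       /* bytes actually in use */
    size_t capacity;  /* bytes allocated, grows geometrically */
} buf_t;

/* Append one byte, doubling capacity when full. ENOMEM can only
   surface at a growth step, i.e. at allocation time, long before
   most of the reserved memory is touched. */
int buf_append(buf_t *b, char c)
{
    if (b->len == b->capacity) {
        size_t newcap = b->capacity ? b->capacity * 2 : 8;
        char *p = realloc(b->data, newcap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->capacity = newcap;
    }
    b->data[b->len++] = c;
    return 0;
}
```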

Anyway, that's probably off-topic!
msg217304 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 18:36
> So yeah, touching pages can affect whether a later malloc returns ENOMEM.
>
> I'm not sure any of this actually matters in the Python case though :-). There's still no reason to go touching pages pre-emptively just in case we might write to them later -- all that does is increase the interpreter's memory footprint, which can't help anything. If people are worried about overcommit, then they should turn off overcommit, not try and disable it on a piece-by-piece basis by trying to get individual programs to memory before they need it.

Absolutely: that's why I'm really in favor of exposing calloc; this
could definitely help many workloads.

Victor, did you run any non-trivial benchmark, like pybench & Co?

As I said, I'm not expecting any improvement, I just want to make sure
there's no hidden regression somewhere (like the one for GC-tracked
objects above).
msg217305 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-27 18:37
> $ gcc -o /tmp/test /tmp/test.c; /tmp/test
> malloc() returned NULL after 3050MB
> $ gcc -DDO_MEMSET -o /tmp/test /tmp/test.c; /tmp/test
> malloc() returned NULL after 2130MB
> 
> Without memset, the kernel happily allocates until we reach the 3GB
> user address space limit.
> With memset, it bails out way before.
> 
> I don't know what this'll give on 64-bit, but I assume one should get
> comparable result.

Both OOM here (3.11.0-20-generic, 64-bit, Ubuntu).
msg217306 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-04-27 18:49
This is probably offtopic, but I think people who want reliable
MemoryErrors can use limits, e.g. via djb's softlimit (daemontools):

$ softlimit -m 100000000 ./python
Python 3.5.0a0 (default:462470859e57+, Apr 27 2014, 19:34:06)
[GCC 4.7.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> [i for i in range(9999999)]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
MemoryError
msg217307 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 19:03
> Both OOM here (3.11.0-20-generic, 64-bit, Ubuntu).

Hm...
What's /proc/sys/vm/overcommit_memory ?
If it's set to 0, then the kernel will always overcommit.

If you set it to 2, normally you'd definitely get ENOMEM (which is IMO
much nicer than getting nuked by the OOM killer, especially because,
like in real life, there's often collateral damage ;-)
msg217308 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 19:07
> Hm...
> What's /proc/sys/vm/overcommit_memory ?
> If it's set to 0, then the kernel will always overcommit.

I meant 1 (damn, I need sleep).
msg217309 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-27 19:09
> Hm...
> What's /proc/sys/vm/overcommit_memory ?
> If it's set to 0, then the kernel will always overcommit.

Ah, indeed.

> If you set it to 2, normally you'd definitely get ENOMEM

You're right, but with weird results:

$ gcc -o /tmp/test test.c; /tmp/test
malloc() returned NULL after 600MB
$ gcc -DDO_MEMSET -o /tmp/test test.c; /tmp/test
malloc() returned NULL after 600MB

(I'm supposed to have gigabytes free?!)
msg217310 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-27 19:15
>> Hm...
>> What's /proc/sys/vm/overcommit_memory ?
>> If it's set to 0, then the kernel will always overcommit.
>
> Ah, indeed.

See above, I mistyped: 0 is the default (which is already quite
optimistic), 1 is always.

>> If you set it to 2, normally you'd definitely get ENOMEM
>
> You're right, but with weird results:
>
> $ gcc -o /tmp/test test.c; /tmp/test
> malloc() returned NULL after 600MB
> $ gcc -DDO_MEMSET -o /tmp/test test.c; /tmp/test
> malloc() returned NULL after 600MB
>
> (I'm supposed to have gigabytes free?!)

The formula is RAM * vm.overcommit_ratio / 100 + swap

So if you don't have swap, or a low overcommit_ratio, it could explain
why it returns so early.
Or maybe you have some processes with a lot of mapped-yet-unused
memory (chromium is one of those for example).

Anyway, it's really a mess!
msg217323 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 23:03
I split my patch into two parts:

- calloc-4.patch: add new "Calloc" functions including _PyObject_GC_Calloc()
- use_calloc.patch: patch types (bytes, dict, list, set, tuple, etc.) and various modules to use calloc

I reverted my changes to _PyObject_GC_Malloc() and added _PyObject_GC_Calloc(); the performance regressions are gone. Creating a large tuple is a little bit (8%) faster. But the real speedup is in building a large bytes string of null bytes:


$ ./python.orig -m timeit 'bytes(50*1024*1024)'
100 loops, best of 3: 5.7 msec per loop
$ ./python.calloc -m timeit 'bytes(50*1024*1024)'
100000 loops, best of 3: 4.12 usec per loop

On Linux, no memory is allocated, even if you read the content of the bytes object. RSS is almost unchanged.
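
A minimal standalone check of the calloc() contract that makes this work (this only verifies the zero-fill guarantee, not the RSS behaviour; on Linux a large calloc() block is typically backed by copy-on-write zero pages, so reading it commits nothing):

```c
#include <stdlib.h>
#include <assert.h>

/* Returns 1 if a freshly calloc()ed 50 MB block reads back as zero at a
   few sampled offsets, 0 on failure or if any sample is non-zero. */
int large_calloc_is_zeroed(void)
{
    size_t n = 50 * 1024 * 1024;
    unsigned char *p = calloc(n, 1);
    if (p == NULL)
        return 0;
    /* Reading is cheap: these reads can be served by the zero page. */
    int ok = (p[0] == 0 && p[n / 2] == 0 && p[n - 1] == 0);
    free(p);
    return ok;
}
```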

Ok, now the real use case where it becomes faster: I implemented the same optimization for bytearray.

$ ./python.orig -m timeit 'bytearray(50*1024*1024)'
100 loops, best of 3: 6.33 msec per loop
$ ./python.calloc -m timeit 'bytearray(50*1024*1024)'
100000 loops, best of 3: 4.09 usec per loop

If you overallocate a bytearray and only write a few bytes, the bytes at the end of the bytearray will not be allocated (at least on Linux).


Result of bench_alloc.py comparing original Python to patched Python (calloc-4.patch + use_calloc.patch).

Common platform:
SCM: hg revision=4b97092aa4bd+ tag=tip branch=default date="2014-04-27 18:02 +0100"
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Python unicode implementation: PEP 393
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
Bits: int=32, long=64, long long=64, size_t=64, void*=64
Timer: time.perf_counter
CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
Platform: Linux-3.13.9-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug

Platform of campaign orig:
Timer precision: 42 ns
Date: 2014-04-28 00:27:19
Python version: 3.5.0a0 (default:4b97092aa4bd, Apr 28 2014, 00:24:03) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

Platform of campaign calloc:
Timer precision: 54 ns
Date: 2014-04-28 00:28:35
Python version: 3.5.0a0 (default:4b97092aa4bd+, Apr 28 2014, 00:25:56) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

-----------------------------------+-------------+--------------
Tests                              |        orig |        calloc
-----------------------------------+-------------+--------------
object()                           |   61 ns (*) |  71 ns (+16%)
b'A' * 10                          |   54 ns (*) |         52 ns
b'A' * 10**3                       |  124 ns (*) | 110 ns (-12%)
b'A' * 10**6                       | 38.4 us (*) |       38.5 us
'A' * 10                           |   59 ns (*) |         62 ns
'A' * 10**3                        |  132 ns (*) | 107 ns (-19%)
'A' * 10**6                        | 38.5 us (*) |       38.5 us
'A' * 10**8                        | 10.3 ms (*) |       10.6 ms
decode 10 null bytes from ASCII    |  264 ns (*) |        263 ns
decode 10**3 null bytes from ASCII |  403 ns (*) |  379 ns (-6%)
decode 10**6 null bytes from ASCII | 80.5 us (*) |       80.5 us
decode 10**8 null bytes from ASCII | 17.7 ms (*) |       17.3 ms
(None,) * 10**0                    |   29 ns (*) |         28 ns
(None,) * 10**1                    |   75 ns (*) |         76 ns
(None,) * 10**2                    |  461 ns (*) |        460 ns
(None,) * 10**3                    |  3.6 us (*) |       3.57 us
(None,) * 10**4                    | 35.7 us (*) |       35.7 us
(None,) * 10**5                    |  364 us (*) |        365 us
(None,) * 10**6                    | 4.12 ms (*) |       4.11 ms
(None,) * 10**7                    | 43.5 ms (*) | 40.3 ms (-7%)
(None,) * 10**8                    |  433 ms (*) |  400 ms (-8%)
([None] * 10)[1:-1]                |  121 ns (*) | 134 ns (+11%)
([None] * 10**3)[1:-1]             | 3.62 us (*) |       3.61 us
([None] * 10**6)[1:-1]             | 4.24 ms (*) |       4.22 ms
([None] * 10**8)[1:-1]             |  440 ms (*) |  402 ms (-9%)
-----------------------------------+-------------+--------------
Total                              |  954 ms (*) |  880 ms (-8%)
-----------------------------------+-------------+--------------
msg217324 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 23:15
bench_alloc2.py: updated benchmark script. I added bytes(n) and bytearray(n) tests and removed the test decoding from ASCII.

Common platform:
Timer: time.perf_counter
Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
Platform: Linux-3.13.9-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
SCM: hg revision=4b97092aa4bd+ tag=tip branch=default date="2014-04-27 18:02 +0100"
Python unicode implementation: PEP 393
Bits: int=32, long=64, long long=64, size_t=64, void*=64
CFLAGS: -Wno-unused-result -Werror=declaration-after-statement -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
CPU model: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz

Platform of campaign orig:
Date: 2014-04-28 01:11:49
Timer precision: 39 ns
Python version: 3.5.0a0 (default:4b97092aa4bd, Apr 28 2014, 01:02:01) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

Platform of campaign calloc:
Date: 2014-04-28 01:12:29
Timer precision: 44 ns
Python version: 3.5.0a0 (default:4b97092aa4bd+, Apr 28 2014, 01:06:54) [GCC 4.8.2 20131212 (Red Hat 4.8.2-7)]

-----------------------+-------------+----------------
Tests                  |        orig |          calloc
-----------------------+-------------+----------------
object()               |   62 ns (*) |    72 ns (+16%)
b'A' * 10              |   53 ns (*) |           52 ns
b'A' * 10**3           |   96 ns (*) |   110 ns (+15%)
b'A' * 10**6           | 38.5 us (*) |         38.6 us
'A' * 10               |   59 ns (*) |           61 ns
'A' * 10**3            |  105 ns (*) |          108 ns
'A' * 10**6            | 38.6 us (*) |         38.6 us
'A' * 10**8            | 10.3 ms (*) |         10.4 ms
(None,) * 10**0        |   29 ns (*) |           29 ns
(None,) * 10**1        |   75 ns (*) |           76 ns
(None,) * 10**2        |  432 ns (*) |    461 ns (+7%)
(None,) * 10**3        | 3.58 us (*) |          3.6 us
(None,) * 10**4        | 35.8 us (*) |         35.7 us
(None,) * 10**5        |  365 us (*) |          365 us
(None,) * 10**6        |  4.1 ms (*) |         4.13 ms
(None,) * 10**7        | 43.6 ms (*) |   40.3 ms (-8%)
(None,) * 10**8        |  433 ms (*) |    401 ms (-7%)
([None] * 10)[1:-1]    |  122 ns (*) |   134 ns (+10%)
([None] * 10**3)[1:-1] |  3.6 us (*) |         3.62 us
([None] * 10**6)[1:-1] | 4.22 ms (*) |          4.2 ms
([None] * 10**8)[1:-1] |  441 ms (*) |    402 ms (-9%)
bytes(10)              |  137 ns (*) |          136 ns
bytes(10**3)           |  181 ns (*) |    191 ns (+5%)
bytes(10**6)           | 38.7 us (*) |         39.2 us
bytes(10**8)           | 10.3 ms (*) | 4.36 us (-100%)
bytearray(10)          |  138 ns (*) |   153 ns (+11%)
bytearray(10**3)       |  184 ns (*) |   211 ns (+14%)
bytearray(10**6)       | 38.7 us (*) |         39.3 us
bytearray(10**8)       | 10.3 ms (*) | 4.32 us (-100%)
-----------------------+-------------+----------------
Total                  |  957 ms (*) |   862 ms (-10%)
-----------------------+-------------+----------------
msg217325 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-27 23:16
> Common platform:
> Timer: time.perf_counter
> Timer info: namespace(adjustable=False, implementation='clock_gettime(CLOCK_MONOTONIC)', monotonic=True, resolution=1e-09)
> Platform: Linux-3.13.9-200.fc20.x86_64-x86_64-with-fedora-20-Heisenbug
                                                               ^^^^^^^^^
Are you sure this is a good platform for performance reports? :)
msg217326 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 23:20
> Are you sure this is a good platform for performance reports? :)

Don't hesitate to rerun my benchmark on more different platforms?
msg217330 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 23:34
> Don't hesitate to rerun my benchmark on more different platforms?

Oops, I wanted to write ";-)" not "?".
msg217331 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-04-27 23:35
> Ok, now the real use case where it becomes faster: I implemented the
> same optimization for bytearray.

The real use case I envision is with huge powers of two. If I write:

  x = 2 ** 1000000

then all of x's bytes except the highest one will be zeros. If we map those to /dev/zero, it will be a massive saving for programs using huge powers of two.
msg217333 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-28 00:09
> The real use case I envision is with huge powers of two.

I'm not sure that it's a common use case, but it can be nice to optimize this case if it doesn't make longobject.c more complex. It looks like calloc() becomes interesting for objects larger than 1 MB.
msg217346 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-28 07:51
It looks like Windows also supports lazy commitment of zero-initialized memory pages.

According to my microbenchmark on Linux and Windows, only bytes(n) and bytearray(n) are really faster with use_calloc.patch. Most other changes in use_calloc.patch may be useless, since all bytes are initialized to zero but are immediately overwritten with new data afterwards.

Results of bench_alloc2.py on Windows 7: original vs calloc-4.patch+use_calloc.patch:

Common platform:
Timer: time.perf_counter
Python unicode implementation: PEP 393
Bits: int=32, long=32, long long=64, size_t=32, void*=32
Platform: Windows-7-6.1.7601-SP1
CFLAGS: None
Timer info: namespace(adjustable=False, implementation='QueryPerformanceCounter()', monotonic=True, resolution=1e-08)

Platform of campaign orig:
SCM: hg revision=4b97092aa4bd branch=default date="2014-04-27 18:02 +0100"
Date: 2014-04-28 09:35:30
Python version: 3.5.0a0 (default, Apr 28 2014, 09:33:30) [MSC v.1600 32 bit (Intel)]
Timer precision: 4.47 us

Platform of campaign calloc:
SCM: hg revision=4f0aaa8804c6 tag=tip branch=default date="2014-04-28 09:27 +0200"
Date: 2014-04-28 09:37:37
Python version: 3.5.0a0 (default:4f0aaa8804c6, Apr 28 2014, 09:37:03) [MSC v.1600 32 bit (Intel)]
Timer precision: 4.44 us

-----------------------+-------------+----------------
Tests                  |        orig |          calloc
-----------------------+-------------+----------------
object()               |  121 ns (*) |   109 ns (-10%)
b'A' * 10              |   77 ns (*) |           79 ns
b'A' * 10**3           |  159 ns (*) |    168 ns (+5%)
b'A' * 10**6           |  428 us (*) |          415 us
'A' * 10               |   87 ns (*) |           89 ns
'A' * 10**3            |  175 ns (*) |          177 ns
'A' * 10**6            |  429 us (*) |    454 us (+6%)
'A' * 10**8            | 48.4 ms (*) |           49 ms
(None,) * 10**0        |   49 ns (*) |           51 ns
(None,) * 10**1        |  115 ns (*) |    99 ns (-14%)
(None,) * 10**2        |  433 ns (*) |          422 ns
(None,) * 10**3        | 3.58 us (*) |         3.57 us
(None,) * 10**4        | 34.9 us (*) |         34.9 us
(None,) * 10**5        |  347 us (*) |          351 us
(None,) * 10**6        | 5.14 ms (*) |   4.85 ms (-6%)
(None,) * 10**7        | 53.2 ms (*) |   50.2 ms (-6%)
(None,) * 10**8        |  563 ms (*) |    515 ms (-9%)
([None] * 10)[1:-1]    |  217 ns (*) |          217 ns
([None] * 10**3)[1:-1] | 3.89 us (*) |         3.92 us
([None] * 10**6)[1:-1] | 5.13 ms (*) |         5.17 ms
([None] * 10**8)[1:-1] |  634 ms (*) |   533 ms (-16%)
bytes(10)              |  193 ns (*) |    206 ns (+7%)
bytes(10**3)           |  266 ns (*) |   296 ns (+12%)
bytes(10**6)           |  414 us (*) |  3.89 us (-99%)
bytes(10**8)           | 44.2 ms (*) | 4.56 us (-100%)
bytearray(10)          |  229 ns (*) |    243 ns (+6%)
bytearray(10**3)       |  301 ns (*) |   330 ns (+10%)
bytearray(10**6)       |  421 us (*) |  3.89 us (-99%)
bytearray(10**8)       | 44.4 ms (*) | 4.56 us (-100%)
-----------------------+-------------+----------------
Total                  | 1.4 sec (*) | 1.16 sec (-17%)
-----------------------+-------------+----------------
msg217348 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-28 08:31
Changes to the pickle module don't look like an interesting optimization; they even look slower.

$ python perf.py -b fastpickle,fastunpickle,pickle,pickle_dict,pickle_list,slowpickle,slowunpickle,unpickle ../default/python.orig ../default/python.calloc
...

Report on Linux selma 3.13.9-200.fc20.x86_64 #1 SMP Fri Apr 4 12:13:05 UTC 2014 x86_64 x86_64
Total CPU cores: 4

### fastpickle ###
Min: 0.364510 -> 0.374144: 1.03x slower
Avg: 0.367882 -> 0.377714: 1.03x slower
Significant (t=-11.54)
Stddev: 0.00493 -> 0.00347: 1.4209x smaller

The following not significant results are hidden, use -v to show them:
fastunpickle, pickle_dict, pickle_list.
msg217349 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-28 09:01
Patch version 5. This patch is ready for a review.

Summary of calloc-5.patch:

- add the following functions:

  * void* PyMem_RawCalloc(size_t nelem, size_t elsize)
  * void* PyMem_Calloc(size_t nelem, size_t elsize)
  * void* PyObject_Calloc(size_t nelem, size_t elsize)
  * PyObject* _PyObject_GC_Calloc(size_t basicsize)

- add "void* calloc(void *ctx, size_t nelem, size_t elsize)" field to the PyMemAllocator structure
- optimize bytes(n) and bytearray(n) to allocate objects using calloc() instead of malloc()
- update tracemalloc to trace also calloc()
- document new functions and add unit tests for the calloc "hook" (in _testcapi)


Changes between versions 4 and 5:

- revert all changes of use_calloc.patch except bytes(n) and bytearray(n): they were useless according to benchmarks
- _PyObject_GC_Calloc() now takes a single parameter
- add versionadded and versionchanged fields in the documentation


According to benchmarks, calloc() is only useful for large allocations (1 MB?) where only part of the memory block is modified (with non-zero bytes) just after the allocation. Untouched memory pages don't consume physical memory and don't count towards RSS, but it is still possible to read their content (null bytes). Using calloc() instead of malloc()+memset(0) doesn't seem to be faster (it may be a little bit slower) if all bytes are set just after the allocation.

I chose to only use one parameter for _PyObject_GC_Calloc() because this function is used to allocate Python objects. A structure of a Python object must start with PyObject_HEAD or PyObject_VAR_HEAD and so the total size of an object cannot be expressed as NELEM * ELEMSIZE.

I have no use case for _PyObject_GC_Calloc(), but it makes sense to use it to allocate a large Python object tracked by the GC and using a single memory block for the Python header + data.

PyObject_Calloc() simply uses memset(0) for small objects (<= 512 bytes). It delegates the allocation to PyMem_RawCalloc(), and so indirectly to calloc(), for larger objects.
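
As a simplified sketch (not CPython's actual obmalloc code), that dispatch looks like this; the threshold value and the overflow check on nelem * elsize are the essential parts:

```c
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <assert.h>

#define SMALL_REQUEST_THRESHOLD 512  /* pymalloc's small-object limit */

void *object_calloc_sketch(size_t nelem, size_t elsize)
{
    /* Reject nelem * elsize overflow before multiplying. */
    if (elsize != 0 && nelem > SIZE_MAX / elsize)
        return NULL;
    size_t nbytes = nelem * elsize;
    if (nbytes <= SMALL_REQUEST_THRESHOLD) {
        /* Stand-in for the pymalloc pool path: allocate, then zero by hand. */
        void *p = malloc(nbytes ? nbytes : 1);
        if (p != NULL)
            memset(p, 0, nbytes);
        return p;
    }
    /* Large request: calloc() lets the OS hand out lazy zero pages. */
    return calloc(nelem, elsize);
}
```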

Note: use_calloc.patch is no longer needed; I merged the two patches since only bytes(n) and bytearray(n) now use calloc().
msg217351 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-28 09:15
Demo of calloc-5.patch on Linux. Thanks to calloc(), bytes(50 * 1024 * 1024) doesn't allocate memory for null bytes and so the RSS memory is unchanged (+148 kB, not +50 MB), but tracemalloc says that 50 MB were allocated.

$ ./python -X tracemalloc
Python 3.5.0a0 (default:4b97092aa4bd+, Apr 28 2014, 10:40:53) 
[GCC 4.8.2 20131212 (Red Hat 4.8.2-7)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os, tracemalloc
>>> os.system("grep RSS /proc/%s/status" % os.getpid())
VmRSS:	   10736 kB
0
>>> before = tracemalloc.get_traced_memory()[0]
>>> large = bytes(50 * 1024 * 1024)
>>> import sys
>>> sys.getsizeof(large) / 1024.
51200.0478515625
>>> (tracemalloc.get_traced_memory()[0] - before) / 1024.
51198.1962890625
>>> os.system("grep RSS /proc/%s/status" % os.getpid())
VmRSS:	   10884 kB
0
msg217357 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-04-28 10:43
With the latest patch the decimal benchmark with a lot of small
allocations is consistently 2% slower. Large factorials (where
the operands are initialized to zero for the number-theoretic
transform) have the same performance with and without the patch.

It would be interesting to see some NumPy benchmarks (Nathaniel?).
msg217375 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-28 14:09
> With the latest patch the decimal benchmark with a lot of small
> allocations is consistently 2% slower.

Does your benchmark use bytes(int) or bytearray(int)? If not, I guess that your benchmark is not reliable, because only these two functions are changed by calloc-5.patch, unless there is a bug in my patch.
msg217380 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-04-28 15:52
Hmm, obmalloc.c changed as well, so the gcc optimizer can already take
different paths and produce different results.

Also I did set mpd_callocfunc to PyMem_Calloc(). 2% slowdown is far
from being a tragic result, so I guess we can ignore that.

The bytes() speedup is very nice. Allocations that took one second
are practically instant now.
msg217382 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-28 16:31
> Also I did set mpd_callocfunc to PyMem_Calloc(). 2% slowdown is far
> from being a tragic result, so I guess we can ignore that.

Agreed.

> The bytes() speedup is very nice. Allocations that took one second
> are practically instant now.

Indeed.
Victor, thanks for the great work!
msg217423 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-28 21:16
> Hmm, obmalloc.c changed as well, so already the gcc optimizer can take
> different paths and produce different results.

If decimal depends on allocator performances, you should maybe try to
implement a freelist.

> Also I did set mpd_callocfunc to PyMem_Calloc().

I don't understand. 2% slowdown is when you use calloc? Do you have the
same speed if you don't use calloc? According to my benchmarks, calloc is
slower if some bytes are modified later.

> The bytes() speedup is very nice. Allocations that took one second
> are practically instant now.

Is it really useful? Who needs a bytes(10**8) object?

Faster creation of bytearray(int) may be useful in real applications. I
really like bytearray and memoryview to avoid memory copies.
msg217436 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-04-28 22:48
The order of the nelem/elsize matters for readability. Otherwise it is
not intuitive what happens after the jump to redirect in _PyObject_Alloc().

Why would you assert that 'nelem' is one?
msg217445 - (view) Author: Nathaniel Smith (njs) * Date: 2014-04-28 23:33
> It would be interesting to see some NumPy benchmarks (Nathaniel?).

What is it you want to see? NumPy already uses calloc; we benchmarked it when we added it and it made a huge difference to various realistic workloads :-). What NumPy gets out of this isn't calloc, it's access to tracemalloc.
msg217549 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-29 20:59
Patch version 6:

- I renamed the "int zero" parameter to "int use_calloc" and moved the new parameter to the first position to avoid confusion with nelem. For example, _PyObject_Alloc(ctx, 1, nbytes, 0) becomes _PyObject_Alloc(0, ctx, 1, nbytes). It is also more logical to put it in the first position. In bytesobject.c, I left the parameter at the end since its meaning there is different IMO (fill the bytes with zero or not).

- I removed my hack (premature optimization) "assert(nelem == 1); ... malloc(elsize);" and replaced it with a less surprising "... malloc(nelem * elsize);"

Stefan & Charles-François: I hope that the patch looks better to you.
msg217553 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2014-04-29 21:14
LGTM!
msg217594 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-30 10:02
@Stefan: Can you please review calloc-6.patch? Charles-François wrote that the patch looks good, but for such critical operation (memory allocation), I would prefer a second review ;)
msg217617 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-04-30 12:58
Victor, sure, maybe not right away.  If you prefer to commit very soon,
I promise to do a post commit review.
msg217619 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-30 13:23
>  If you prefer to commit very soon,
> I promise to do a post commit review.

There is no need to hurry.
msg217785 - (view) Author: Roundup Robot (python-dev) Date: 2014-05-02 20:31
New changeset 5b0fda8f5718 by Victor Stinner in branch 'default':
Issue #21233: Add new C functions: PyMem_RawCalloc(), PyMem_Calloc(),
http://hg.python.org/cpython/rev/5b0fda8f5718
msg217786 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-02 20:35
> There is no need to hurry.

I changed my mind :-p It should be easier for numpy to test the development version of Python.

Let's wait for buildbots.
msg217794 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-02 21:13
Antoine Pitrou wrote:
>  The real use case I envision is with huge powers of two. If I write:
> x = 2 ** 1000000

I created the issue #21419 for this idea.
msg217797 - (view) Author: Roundup Robot (python-dev) Date: 2014-05-02 21:26
New changeset 62438d1b11c7 by Victor Stinner in branch 'default':
Issue #21233: Oops, Fix _PyObject_Alloc(): initialize nbytes before going to
http://hg.python.org/cpython/rev/62438d1b11c7
msg217826 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-05-03 19:29
I did a post-commit review.  A couple of things:


1) I think Victor and I have a different view of the calloc() parameters.

      calloc(size_t nmemb, size_t size)

   If a memory region of nbytes bytes is allocated, IMO 'nbytes' should be in the
   place of 'nmemb' and '1' should be in the place of 'size'. That is,
   "allocate nbytes elements of size 1":

      calloc(nbytes, 1)


   In the commit the parameters are reversed in many places, which confuses
   me quite a bit, since it means "allocate one element of size nbytes".

      calloc(1, nbytes)


2) I'm not happy with the refactoring in bytearray_init(). I think it would
   be safer to make focused minimal changes in PyByteArray_Resize() instead.
   In fact, there is a behavior change which isn't correct:

    Before:
    =======
        >>> x = bytearray(0)
        >>> m = memoryview(x)
        >>> x.__init__(10)
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        BufferError: Existing exports of data: object cannot be re-sized

     Now:
     ====
        >>> x = bytearray(0)
        >>> m = memoryview(x)
        >>> x.__init__(10)
        >>> x[0]
        0
        >>> m[0]
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        IndexError: index out of bounds

3) Somewhat similarly, I wonder if it was necessary to refactor
   PyBytes_FromStringAndSize(). I find the new version more difficult
   to understand.


4) _PyObject_Alloc(): assert(nelem <= PY_SSIZE_T_MAX / elsize) can be called
   with elsize = 0.
msg217829 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-05-03 21:39
I forgot one thing:

5) If WITH_VALGRIND is defined, nbytes is uninitialized in _PyObject_Alloc().
msg217831 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-05-03 21:57
Another thing:

6) We need some kind of prominent documentation that existing
   programs need to be changed:

Python 3.5.0a0 (default:62438d1b11c7+, May  3 2014, 23:35:03) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import failmalloc
>>> failmalloc.enable()
>>> bytes(1)
Segmentation fault (core dumped)
msg217832 - (view) Author: Nathaniel Smith (njs) * Date: 2014-05-03 22:01
A simple solution would be to change the name of the struct, so that non-updated libraries will get a compile error instead of a runtime crash.
msg217838 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-03 22:49
> 6) We need some kind of prominent documentation that existing
>    programs need to be changed:

My final commit includes an addition to the What's New in Python 3.5 doc,
including a notice in the porting section. Is that not enough?

Even if the API is public, the PyMemAllocator thing is low level. It's not
part of the stable ABI. Except failmalloc, I don't know of any user. I don't
expect a lot of complaints, and it's easy to port the code.
msg217839 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-03 22:51
> 5) If WITH_VALGRIND is defined, nbytes is uninitialized in
> _PyObject_Alloc().

Did you see my second commit? Isn't it already fixed?
msg217840 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-05-03 22:59
> > 5) If WITH_VALGRIND is defined, nbytes is uninitialized in
> _PyObject_Alloc().
> 
> Did you see my second commit? Isn't it already fixed?

I don't think so, I have revision 5d076506b3f5 here.
msg217841 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-05-03 23:00
>    "allocate nbytes elements of size 1"

PyObject_Malloc(100) asks to allocate one object of 100 bytes.

For PyMem_Malloc() and PyMem_RawMalloc(), it's more difficult to guess, but
IMO it's sane to bet that a single memory block of 'size' bytes is requested.

I consider that char data[100] is one object of 100 bytes, but you call it
100 objects of 1 byte.

I don't think that using nelem or elsize matters in practice.
msg217844 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-05-03 23:25
STINNER Victor <report@bugs.python.org> wrote:
> PyObject_Malloc(100) asks to allocate one object of 100 bytes.

Okay, then let's please call it:

_PyObject_Calloc(void *ctx, size_t nobjs, size_t objsize)

_PyObject_Alloc(int use_calloc, void *ctx, size_t nobjs, size_t objsize)
msg217866 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2014-05-04 11:12
STINNER Victor <report@bugs.python.org> wrote:
> My final commit includes an addition to What's New in Python 3.5 doc,
> including a notice in the porting section. It is not enough?

I'm not sure: The usual case with ABI changes is that extensions may segfault
if they are *not* recompiled [1].  In that case documenting it in What's New is
standard procedure.

Here the extension *is* recompiled and still segfaults.

> Even if the API is public, the PyMemAllocator structure is low level. It's not
> part of the stable ABI. Except for failmalloc, I don't know of any users. I don't
> expect many complaints, and it's easy to port the code.

Perhaps it's worth asking on python-dev. Nathaniel's suggestion isn't bad
either (e.g. name it PyMemAllocatorEx).

[1] I was told on python-dev that many people in fact do not recompile.
msg217972 - (view) Author: Roundup Robot (python-dev) Date: 2014-05-06 09:32
New changeset 358a12f4d4bc by Victor Stinner in branch 'default':
Issue #21233: Fix _PyObject_Alloc() when compiled with WITH_VALGRIND defined
http://hg.python.org/cpython/rev/358a12f4d4bc
msg219627 - (view) Author: Roundup Robot (python-dev) Date: 2014-06-02 19:57
New changeset 6374c2d957a9 by Victor Stinner in branch 'default':
Issue #21233: Rename the C structure "PyMemAllocator" to "PyMemAllocatorEx" to
http://hg.python.org/cpython/rev/6374c2d957a9
msg219628 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-06-02 20:13
> I'm not sure: The usual case with ABI changes is that extensions may segfault if they are *not* recompiled [1].

Ok, I renamed the structure PyMemAllocator to PyMemAllocatorEx, so compilation fails because the PyMemAllocator name is no longer defined. Modules compiled for Python 3.4 will crash on Python 3.5 if they are not recompiled, but I hope that you recompile your modules when you don't use the stable ABI.

Using PyMemAllocator is now more complex because it depends on the Python version. See for example the patch for pyfailmalloc:
https://bitbucket.org/haypo/pyfailmalloc/commits/9db92f423ac5f060d6ff499ee4bb74ebc0cf4761

Using the C preprocessor, it's possible to limit the changes.
msg219630 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-06-02 20:16
"Okay, then let's please call it:
_PyObject_Calloc(void *ctx, size_t nobjs, size_t objsize)
_PyObject_Alloc(int use_calloc, void *ctx, size_t nobjs, size_t objsize)"

"void * PyMem_RawCalloc(size_t nelem, size_t elsize);" prototype comes from the POSIX standard:
http://pubs.opengroup.org/onlinepubs/009695399/functions/calloc.html

I don't want to change the prototype in Python. Extract from the Python documentation:

.. c:function:: void* PyMem_RawCalloc(size_t nelem, size_t elsize)

   Allocates *nelem* elements each whose size in bytes is *elsize* (...)
msg219631 - (view) Author: Roundup Robot (python-dev) Date: 2014-06-02 20:23
New changeset dff6b4b61cac by Victor Stinner in branch 'default':
Issue #21233: Revert bytearray(int) optimization using calloc()
http://hg.python.org/cpython/rev/dff6b4b61cac
msg219634 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-06-02 20:28
"2) I'm not happy with the refactoring in bytearray_init(). (...)

3) Somewhat similarly, I wonder if it was necessary to refactor
   PyBytes_FromStringAndSize(). (...)"

Ok, I reverted the change on bytearray(int) and opened the issue #21644 to discuss these two optimizations.
msg219635 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-06-02 20:29
I reread the issue and I hope that I have now addressed all the points raised. The remaining one, bytearray(int), is now tracked by the new issue #21644.
History
Date User Action Args
2014-06-02 20:29:45hayposetstatus: open -> closed
resolution: fixed
messages: + msg219635
2014-06-02 20:28:08hayposetmessages: + msg219634
2014-06-02 20:23:26python-devsetmessages: + msg219631
2014-06-02 20:16:21hayposetmessages: + msg219630
2014-06-02 20:13:25hayposetmessages: + msg219628
2014-06-02 19:57:40python-devsetmessages: + msg219627
2014-05-06 09:32:48python-devsetmessages: + msg217972
2014-05-04 11:12:04skrahsetmessages: + msg217866
2014-05-03 23:25:44skrahsetmessages: + msg217844
2014-05-03 23:00:51hayposetmessages: + msg217841
2014-05-03 22:59:28skrahsetmessages: + msg217840
2014-05-03 22:51:32hayposetmessages: + msg217839
2014-05-03 22:49:18hayposetmessages: + msg217838
2014-05-03 22:01:03njssetmessages: + msg217832
2014-05-03 21:57:18skrahsetmessages: + msg217831
2014-05-03 21:39:14skrahsetmessages: + msg217829
2014-05-03 19:29:59skrahsetmessages: + msg217826
2014-05-02 21:26:17python-devsetmessages: + msg217797
2014-05-02 21:13:20hayposetmessages: + msg217794
2014-05-02 20:35:08hayposetmessages: + msg217786
2014-05-02 20:31:29python-devsetnosy: + python-dev
messages: + msg217785
2014-04-30 13:23:43hayposetmessages: + msg217619
2014-04-30 12:58:52skrahsetmessages: + msg217617
2014-04-30 10:02:30hayposetmessages: + msg217594
2014-04-29 21:14:30neologixsetmessages: + msg217553
2014-04-29 20:59:54hayposetfiles: + calloc-6.patch

messages: + msg217549
2014-04-28 23:33:05njssetmessages: + msg217445
2014-04-28 22:48:12skrahsetmessages: + msg217436
2014-04-28 21:16:01hayposetmessages: + msg217423
2014-04-28 16:31:59neologixsetmessages: + msg217382
2014-04-28 15:52:17skrahsetmessages: + msg217380
2014-04-28 14:09:39hayposetmessages: + msg217375
2014-04-28 10:43:19skrahsetmessages: + msg217357
2014-04-28 09:15:06hayposetmessages: + msg217351
2014-04-28 09:01:14hayposetfiles: + calloc-5.patch

messages: + msg217349
2014-04-28 08:31:41hayposetmessages: + msg217348
2014-04-28 07:51:02hayposetmessages: + msg217346
2014-04-28 00:09:33hayposetmessages: + msg217333
2014-04-27 23:35:24pitrousetmessages: + msg217331
2014-04-27 23:34:41hayposetmessages: + msg217330
2014-04-27 23:20:04hayposetmessages: + msg217326
2014-04-27 23:16:51pitrousetmessages: + msg217325
2014-04-27 23:15:48hayposetfiles: + bench_alloc2.py

messages: + msg217324
2014-04-27 23:03:40hayposetfiles: + use_calloc.patch
2014-04-27 23:03:31hayposetfiles: + calloc-4.patch

messages: + msg217323
2014-04-27 19:15:06neologixsetmessages: + msg217310
2014-04-27 19:09:37pitrousetmessages: + msg217309
2014-04-27 19:07:37neologixsetmessages: + msg217308
2014-04-27 19:03:45neologixsetmessages: + msg217307
2014-04-27 18:49:35skrahsetmessages: + msg217306
2014-04-27 18:37:28pitrousetmessages: + msg217305
2014-04-27 18:36:51neologixsetmessages: + msg217304
2014-04-27 18:31:50neologixsetfiles: + test.c

messages: + msg217303
2014-04-27 18:27:31njssetmessages: + msg217302
2014-04-27 17:53:22neologixsetmessages: + msg217298
2014-04-27 17:48:44neologixsetmessages: + msg217297
2014-04-27 17:41:49njssetmessages: + msg217295
2014-04-27 17:40:15pitrousetmessages: + msg217294
2014-04-27 17:29:04neologixsetmessages: + msg217291
2014-04-27 16:39:10njssetmessages: + msg217284
2014-04-27 16:38:51hayposetmessages: + msg217283
2014-04-27 16:31:56hayposetmessages: + msg217282
2014-04-27 15:59:55neologixsetmessages: + msg217276
2014-04-27 15:43:01hayposetmessages: + msg217274
2014-04-27 13:19:03neologixsetmessages: + msg217262
2014-04-27 11:26:55skrahsetmessages: + msg217257
2014-04-27 11:12:44hayposetmessages: + msg217256
2014-04-27 11:05:55skrahsetmessages: + msg217255
2014-04-27 11:02:55pitrousetmessages: + msg217254
2014-04-27 10:36:05hayposetfiles: + bench_alloc.py

messages: + msg217253
2014-04-27 10:32:47neologixsetmessages: + msg217252
2014-04-27 10:21:00neologixsetmessages: + msg217251
2014-04-27 09:51:46hayposetmessages: + msg217246
2014-04-27 08:30:36neologixsetmessages: + msg217242
2014-04-27 00:05:50hayposetmessages: + msg217228
2014-04-17 11:35:08jtaylorsetmessages: + msg216686
2014-04-17 10:39:32josh.rsetmessages: + msg216682
2014-04-17 10:35:42josh.rsetmessages: + msg216681
2014-04-17 08:04:36jtaylorsetnosy: + jtaylor
messages: + msg216671
2014-04-17 07:19:26neologixsetmessages: + msg216668
2014-04-16 19:48:02hayposetfiles: + calloc-3.patch

messages: + msg216567
2014-04-16 17:47:13pitrousetmessages: + msg216515
2014-04-16 09:54:37skrahsetmessages: + msg216455
2014-04-16 08:06:13hayposetmessages: + msg216452
2014-04-16 08:04:31hayposetmessages: + msg216451
2014-04-16 07:18:37neologixsetmessages: + msg216444
2014-04-16 05:34:58pitrousetmessages: + msg216433
2014-04-16 04:21:26hayposetfiles: + calloc-2.patch

messages: + msg216431
2014-04-16 02:49:46hayposetmessages: + msg216425
2014-04-16 02:40:56hayposetmessages: + msg216422
2014-04-15 22:20:09josh.rsetmessages: + msg216404
2014-04-15 22:17:19josh.rsetmessages: + msg216403
2014-04-15 22:05:23josh.rsetmessages: + msg216399
2014-04-15 21:39:08pitrousetmessages: + msg216394
2014-04-15 21:30:03josh.rsetnosy: + josh.r
2014-04-15 21:28:08hayposetnosy: + pitrou, neologix
2014-04-15 21:27:57hayposetfiles: + calloc.patch
keywords: + patch
messages: + msg216390
2014-04-15 15:37:10eric.araujosetnosy: + haypo
2014-04-15 09:41:10skrahsetnosy: + skrah
2014-04-15 08:56:01njscreate