Issue 18835: Add PyMem_AlignedAlloc()

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/63035

classification

Title:	Add PyMem_AlignedAlloc()
Type:	enhancement	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.7

process

Status:	closed	Resolution:	rejected
Dependencies:		Superseder:
Assigned To:		Nosy List:	benjamin.peterson, neologix, njs, pitrou, rhettinger, skrah, tim.peters, trent, vstinner, wscullin, xdegaye
Priority:	normal	Keywords:	patch

Created on 2013-08-26 02:32 by rhettinger, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
align.diff	rhettinger, 2013-09-01 04:40	Use case for alignment functions and macros	review
pymem_alignedalloc.patch	vstinner, 2017-11-24 01:07

Pull Requests
URL	Status	Linked	Edit
PR 4089	closed	vstinner, 2017-10-23 17:08
PR 4200	merged	vstinner, 2017-10-31 16:17
PR 15333		vstinner, 2019-08-19 10:30

Messages (57)
msg196174 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2013-08-26 02:32
Currently, we have little to no control over the alignment of memory returned by the PyMem functions. I suggest variants such as PyMem_Alloc32(n) and PyMem_Alloc64(n) to return suitably aligned data blocks. GNU C provides memalign() for this purpose: http://www.delorie.com/gnu/docs/glibc/libc_31.html and the Microsoft C compiler has _aligned_malloc() for the same purpose: http://msdn.microsoft.com/en-us/library/8z34s9c6%28VS.80%29.aspx A principal use case would be PyObject pointers where we want to keep all or most of the data fields in the same cache line (i.e. the fields for list, tuple, dict, and set objects). Deques would benefit from having the deque blocks aligned to 64byte boundaries and never crossing page boundaries. Set entries would benefit from 32byte alignment.
msg196176 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2013-08-26 02:38
I should also mention posix_memalign() for Unix/Linux systems. On systems without a clean way to provide aligned memory, we can always fall back to regular malloc(). The API wouldn't guarantee alignment but would comply when possible.
msg196194 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-08-26 10:23
Unless the memory allocator actually supports it, this means you lose a whole lot of memory for padding, though... Memory which will sit there unused at the end of another cacheline. Note that the current small object allocator, if not disabled, should already return you aligned memory, by construction (each allocation size has dedicated pools from which memory blocks are carved).
msg196196 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-08-26 10:52
The default allocator for PyObject is PyType_GenericAlloc(). If the type has the Py_TPFLAGS_HAVE_GC flag, PyType_GenericAlloc() calls _PyObject_GC_Malloc(). It is the case for the set type. _PyObject_GC_Malloc() adds an header of sizeof(PyGC_Head) (12 bytes on x86) before the PyObject data and then calls PyObject_MALLOC(). By default, Python uses pymalloc for PyObject_MALLOC() which uses an alignment of 8 bytes. For set, I suppose that you are talking about the allocation of the "table", not of the set object itself. set_table_resize() uses PyMem_NEW() to allocate the table. If we provide a PyMem_MallocAligned(alignment, size), table entries would be aligned, but not entries of the smalltable. Does it matter? > I suggest variants such as PyMem_Alloc32(n) and PyMem_Alloc64(n) to return suitably aligned data blocks. I would prefer a parameter rather than a new function per alignment! Which API does you propose exactly? On Linux, I see at least 3 functions: int posix_memalign(void *memptr, size_t alignment, size_t size); void valloc(size_t size); void *memalign(size_t boundary, size_t size); Do you propose aligned variant for PyMem, PyMem_Raw and PyObject, or only PyMem? "Unless the memory allocator actually supports it, this means you lose a whole lot of memory for padding, though... Memory which will sit there unused at the end of another cacheline." What is the alignment of a cacheline? Can a line starts at any address? Do you have an idea of performance benefit of memory alignment? Adding yet another API to allocate memory has a cost. Python 3.4 already implemented the PEP 445 which added many new functions. If we add new functions, they should conform the PEP 445 (need get/set allocator functions to support hooks to track the memory usage).
msg196202 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-08-26 12:18
> What is the alignment of a cacheline? Can a line starts at any > address? If it could, Raymond wouldn't be asking for this feature ;-) Cachelines are typically aligned at whatever their size is. So, a 64-byte cacheline will be aligned at a 64 bytes boundary. Perhaps some CPUs operate differently, but mainstream CPUs are generally like that (for good reason: a cache is much simpler to implement if there can't be some overlapping between cache lines). > Do you have an idea of performance benefit of memory alignment? > > Adding yet another API to allocate memory has a cost. Agreed. Aligned memory allocation is useful if your algorithms benefit from alignment (e.g. some SIMD-optimized code, or something relying on page tables). But aligning every data structure on a cacheline boundary doesn't sound like a very good idea: if it was, the system allocator would do it for you, IMHO ;-)
msg196693 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2013-09-01 00:36
> Adding yet another API to allocate memory has a cost Please don't FUD this one to death. Aligned memory access is sometimes important and we currently have no straight-forward way to achieve it. If you're truly worried about adding single new function to the public C API, we can create just a single internal function: void * _PyMem_RawMallocAligned(size_t size, size_t alignment). > aligning every data structure on a cacheline boundary > doesn't sound like a very good idea We don't have to align EVERY data structure. But I do have immediate beneficial use cases for set tables and for data blocks in deque objects. I need this function and would appreciate your help in fitting it in nicely with the current memory management functions and macros.
msg196700 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2013-09-01 04:40
Attaching a patch for what I would like to do with the alignment functions and macros.
msg196713 - (view)	Author: Charles-François Natali (neologix) *	Date: 2013-09-01 07:30
> Please don't FUD this one to death. Aligned memory access is > sometimes important and we currently have no straight-forward > way to achieve it. I guess that a simple way to cut the discussion short would be to have a first implementation, and run some benchmarks to measure the benefits. I can certainly see the benefit of cacheline-aligned data structures in multithreaded code (to avoid false sharing/cacheline bouncing): I'm really curious to see how much this would benefit in a single-threaded workload.
msg196828 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2013-09-03 08:32
> We don't have to align EVERY data structure. But I do have immediate > beneficial use cases for set tables and for data blocks in deque > objects. Can you explain what the use cases are, and post some benchmarking code? Also, what would be the strategy? Would you align every set/deque, or only the bigger ones?
msg196834 - (view)	Author: STINNER Victor (vstinner) *	Date: 2013-09-03 09:40
Linux provides the following functions: int posix_memalign(void *memptr, size_t alignment, size_t size); void valloc(size_t size); # obsolete void memalign(size_t boundary, size_t size); # obsolete Windows provides the following functions: void _aligned_malloc(size_t size, size_t alignment); void _aligned_free(void memblock); _aligned_malloc() has a warning: "Using free is illegal." Do all platform provide at least a function to allocate aligned memory? Windows, Mac OS X, FreeBSD, old operating systems, etc.? If no, how should we handle these operating systems? Refuse to start Python? Disable some optimizations? How should we check in the source code (ex: setobject.c) than aligned allocations are not supported? An #ifdef? ** Because of the Windows API, wee need at least two functions: void* PyMem_MallocAligned(size_t size, size_t alignment); void PyMem_FreeAligned(void ptr); The GIL must be held when callling these functions. Windows provides also a "realloc" function: void _aligned_realloc(void memblock, size_t size, size_t alignment); If the OS does not provide a realloc function, we can reimplement it (malloc, memcpy, free). void PyMem_ReallocAligned(void ptr, size_t size, size_t alignment); ** For the PEP 445: the new API is different than the PyMemAllocator structure because malloc and realloc functions have an extra "alignment" parameter. We can drop the parameter if the allocator has always the size alignment, but each object may require a different aligment?
msg217229 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-04-27 00:11
It looks like a memory allocator with a given alignment would help numpy, for SIMD instructions: https://mail.python.org/pipermail/python-dev/2014-April/134092.html (but "Numpy does not currently use aligned allocation, and it's not clear how important it is") See also this old discussion on python-dev: https://mail.python.org/pipermail/python-dev/2010-September/103911.html Related to this website: http://mallocv2.wordpress.com/
msg232201 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-12-05 18:25
Benchmarks and Intel's recommendation show that aligned allocation is actually important for AVX performance, and NumPy depends on CPython providing the right allocation APIs (for integration with tracemalloc): https://github.com/numpy/numpy/issues/5312 So I think for 3.5 we should start providing the APIs. Whether we use them in Python core is another discussion. Nathaniel, what APIs would you need exactly? See Victor's proposal in msg196834.
msg232209 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-12-05 21:20
Windows provides: void * _aligned_malloc( size_t size, size_t alignment ); http://msdn.microsoft.com/en-US/library/8z34s9c6%28v=vs.80%29.aspx How should we handle platforms which don't provide a memory allocator with an alignment? The simplest option is to return NULL (MemoryError). Allocating more memory and skip first bytes may work, but how do we retrieve the original address if the function releasing the memory block? What about Solaris, Mac OS X, FreeBSD, etc.
msg232215 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-12-05 21:49
> How should we handle platforms which don't provide a memory allocator > with an alignment? The simplest option is to return NULL (MemoryError). Are there such platforms? posix_memalign() is a POSIX standard, even OpenBSD has it.
msg232219 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-12-05 22:15
PyMem_GetAllocator() and PyMem_SetAllocator() have a domain parameter which can take 3 values: PYMEM_DOMAIN_RAW, PYMEM_DOMAIN_MEM and PYMEM_DOMAIN_OBJ. I don't think that we need 3 flavors of allocators (PyMem_Raw, PyMem, PyObject). Maybe the PYMEM_DOMAIN_RAW domain is enough: OS functions don't require the GIL. In this case, should we add a new pair of Get/Set functions with an associated structure? Or maybe PyMem_SetAllocator() may ignore the aligned members of the patched PyMemAllocatorEx structure for domains other than PYMEM_DOMAIN_RAW? And PyMem_GetAllocator() would fill members with NULL.
msg232221 - (view)	Author: Nathaniel Smith (njs) *	Date: 2014-12-05 22:30
It's not terribly difficult to write a crude-but-effective aligned allocator on top of raw malloc: def aligned_malloc(size, alignment): assert alignment < 255 raw_pointer = (uint8) malloc(size + alignment) shift = alignment - (raw_pointer % alignment) assert 0 < shift <= alignment aligned_pointer = raw_pointer + shift (aligned_pointer - 1) = shift return aligned_pointer def aligned_free(uint8* pointer): shift = (pointer - 1) free(pointer - shift) But, this fallback and the official Win32 API both disallow the use of plain free() (like Victor points out in msg196834), so we can't just add an aligned_malloc slot to the PyMemAllocator struct. This kind of aligned allocation is effectively its own memory domain. If native aligned allocation support were added to PyMalloc then it could potentially do better (e.g. by noticing that it already has a block on its freelist with the requested alignment and just returning that instead of overallocating). This might be the ideal solution for Raymond's use case, but I have no idea how much work it would be to mess around with PyMalloc innards. Numpy doesn't currently use aligned allocation for anything, but we'd like to keep our options open. If we do end up using it in the future then there's a reasonable chance we might want to use it without* the GIL held (e.g. for allocating temporary buffers inside C loops). OTOH we are also happy to implement the aligned allocation ourselves (either on top of the system APIs or directly) -- we just don't want to lose tracemalloc support when we do. For numpy's purposes, I think the best approach would be to add a tracemalloc "escape valve", with an interface like: PyMem_RecordAlloc(const char* domain, void* tag, size_t quantity, PyMem_RecordRealloc(const char* domain, void* old_tag, void* new_tag, size_t new_quantity) PyMem_RecordFree(const char* domain, void* tag) where the idea is that if after someone allocates memory (or potentially other discrete resources) directly without going through PyMem_*, they could then call these functions to tell tracemalloc what they just did. This would be useful in a number of cases: in addition to tracking aligned allocations, it would make it possible to re-use the tracemalloc infrastructure to track GPU buffers allocated by CUDA/GPGPU-type code, mmap usage, hugetlbfs usage, etc. Potentially even open file descriptors if one wants to go there (seems pretty useful, actually).
msg232222 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-12-05 22:36
Le 05/12/2014 23:15, STINNER Victor a écrit : > > > I don't think that we need 3 flavors of allocators (PyMem_Raw, > PyMem, PyObject). > > Maybe the PYMEM_DOMAIN_RAW domain is enough: OS functions don't require the GIL. In this case, should we add a new pair of Get/Set functions with an associated structure? How about a new domain instead? PYMEM_DOMAIN_RAW_ALIGNED?
msg232223 - (view)	Author: Nathaniel Smith (njs) *	Date: 2014-12-05 22:43
Re: msg232219: If you go down the route of adding both aligned_malloc and aligned_free to the Allocator structure, I think you might as well support it for all domains? For the PyMem and PyObject domains you can just literally set the default functions to be PyMem_RawAlignedMalloc and PyMem_RawAlignedFree, and that leaves the door open to fancier implementations in the future (e.g. if PyMalloc starts implementing aligned allocation directly). Re: msg232222: Currently all the domains share the same vtable struct, though, whereas aligned allocator functions have different signatures. So you can't literally just add an entry to the existing domain enum and be done. It also occurs to me that if numpy ever gets serious about using aligned memory then we might also need aligned_realloc (numpy allows arrays to be resized, sigh), which is possible to do but I cannot recommend python attempt to provide as a standard interface. (It's not available at all in POSIX.) I guess this is another argument that it might be best to just give numpy an escape valve and worry about CPython's internal needs separately.
msg232225 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-12-05 22:58
You cannot just add a new domain because the function prototypes are different (need an extra alignement parameter). You have to add new members to the structure or add a new structure.
msg234086 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2015-01-15 16:50
Due to the realloc() problem, I'm thinking that the best course for Numpy would be to use its own allocator wrapper like Nathaniel outligned. Also, Numpy wants calloc() for faster allocation of zeroed arrays...
msg286074 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-01-23 11:15
Antoine Pitrou: "Benchmarks and Intel's recommendation show that aligned allocation is actually important for AVX performance, and NumPy depends on CPython providing the right allocation APIs (for integration with tracemalloc): https://github.com/numpy/numpy/issues/5312" I don't think that NumPy was ever fully integrated with tracemalloc. Since Python 3.6, NumPy doesn't have to use Python memory allocators to benefit of tracemalloc debugger: I added a new C API to be able to manually track/untrack memory blocks which are not directly allocated by Python, see the issue #26530. I implemented this feature for NumPy, but since I never got any feedback from NumPy, I left the API private. Moreover, I also added second feature to tracemalloc: it's now possible to track memory allocation in different address spaces. The feature was also designed for NumPy which can allocate memory in the GPU address space. See the issue #26588. With these new tracemalloc features, I don't think that NumPy can still be used to request this feature in CPython core. -- Raymond: "A principal use case would be PyObject pointers where we want to keep all or most of the data fields in the same cache line (i.e. the fields for list, tuple, dict, and set objects). Deques would benefit from having the deque blocks aligned to 64byte boundaries and never crossing page boundaries. Set entries would benefit from 32byte alignment." Victor (me!): "Do you have an idea of performance benefit of memory alignment?" Since Raymond never provided any evidence that a new aligned memory allocator would give a significant speedup, and there issue is inactive for 2 years, I close it. See also the change 6e16b0045cf1, it seems like Raymond doesn't plan to use this feature anymore. -- If someone wants this feature, we need good reasons to implement it.
msg304803 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-23 13:26
I need this too. I would like to set this https://github.com/plures/ndtypes/commit/c260fdbae707da0dfefef499621a0a9f37a3e509#diff-2402fff6223084b74d97237c0d620b29R50 to something line PyMem_AlignedAlloc(), because the Python allocator is faster. I think many people would welcome this in scientific computing: The Arrow memory format for example recommends 64 bit alignment.
msg304805 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-10-23 13:35
Do you need aligned allocation even on small objects? The Python allocator doesn't handle allocations > 512 bytes.
msg304807 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-23 14:11
Yes, I think it is partly convenience. I want to set ... ndt_mallocfunc = PyMem_Malloc; ndt_alignedallocfunc = PyMem_AlignedAlloc; ndt_callocfunc = PyMem_Calloc; ndt_reallocfunc = PyMem_Realloc; ndt_freefunc = PyMem_Free; ... so I can always just call ndt_free(), because there's only one memory allocator. But the other part is that datashape allows to specify alignment regardless of the size of the type. Example: >>> from ndtypes import * >>> from xnd import * >>> t = ndt("{a: int64, b: uint64, align=16}") >>> xnd(t, {'a': 111, 'b': 222}) <xnd.xnd object at 0x7f82750c2a30> The xnd object essentially wraps a typed data pointer. In the above case, the 'align' keyword has the same purpose as gcc's __attribute__((aligned(16))). There are several other cases in datashape where alignment can specified explicitly. For the convenience case it would already help if PyMem_AlignedAlloc() did not use the fast allocator, but just delegated to _aligned_malloc() (MSVC) or aligned_alloc() (C11), ...
msg304826 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-23 17:16
I added _PyTraceMalloc_Track() and _PyTraceMalloc_Untrack() private functions to the C API in Python 3.6. These functions were made public in Python 3.7: renamed to PyTraceMalloc_Track() and PyTraceMalloc_Untrack(). I made this change to allow numpy to trace memory allocations, to debug memory leaks. numpy cannot use Python memory allocators because numpy requires aligned memory blocks which are required to use CPU SIMD instructions. Stefan Krah: > I think many people would welcome this in scientific computing: The Arrow memory format for example recommends 64 bit alignment. Ah, that's an interesting use case. I created attached PR 4089 to implement PyMem_AlignedAlloc(): void* PyMem_AlignedAlloc(size_t alignment, size_t size); void PyMem_AlignedFree(void *ptr); Memory allocated by PyMem_AlignedAlloc() must be freed with PyMem_AlignedFree(). We cannot reuse PyMem_Free(). On Windows, PyMem_AlignedAlloc() is implemented with _aligned_malloc() which requires to release the memory with _aligned_free(). Raymond Hettinger: >> Adding yet another API to allocate memory has a cost > Please don't FUD this one to death. Statistics (size) on my PR: Doc/c-api/memory.rst \| 43 +- Doc/whatsnew/3.7.rst \| 4 + Include/internal/mem.h \| 6 +- Include/objimpl.h \| 2 + Include/pymem.h \| 16 +- .../2017-10-23-19-03-38.bpo-18835.8XEjtG.rst \| 9 + Modules/_testcapimodule.c \| 83 ++- Modules/_tracemalloc.c \| 131 +++- Objects/obmalloc.c \| 616 +++++++++++++------ 9 files changed, 655 insertions(+), 255 deletions(-) That's quite a big change "just to add PyMem_AlignedAlloc()" :-)
msg304834 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-23 19:11
On Mon, Oct 23, 2017 at 05:16:53PM +0000, STINNER Victor wrote: > Memory allocated by PyMem_AlignedAlloc() must be freed with PyMem_AlignedFree(). > > We cannot reuse PyMem_Free(). On Windows, PyMem_AlignedAlloc() is implemented with _aligned_malloc() which requires to release the memory with _aligned_free(). Ah, too bad. Of course Windows does something different again. This weakens my use case somewhat, but I guess it would still be nice to have the functions (if you think it's maintainable).
msg304837 - (view)	Author: Nathaniel Smith (njs) *	Date: 2017-10-23 19:53
There's also aligned calloc, which no native APIs support but is still quite useful.
msg304843 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-10-23 21:16
By the way: > The Arrow memory format for example recommends 64 bit alignment. I presume you mean 64 bytes?
msg304844 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-23 21:23
On Mon, Oct 23, 2017 at 09:16:08PM +0000, Antoine Pitrou wrote: > > The Arrow memory format for example recommends 64 bit alignment. > > I presume you mean 64 bytes? Yes, I was typing too fast.
msg304859 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2017-10-24 05:57
Having the ability to allocated aligned memory could help avoid some undefined behavior. See #27987 (though, we only need 16-byte alignment there)
msg304878 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-24 09:08
[me] > This weakens my use case somewhat [...] I looked at Victor's patch, and thanks to the alignment <= ALIGNMENT optimization it seems that always using the aligned_alloc() and aligned_free() versions for a specific pointer is fast. Nice! So I retract the weakening of my use case (still shame on Microsoft for not implementing C11). :)
msg304989 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-25 14:33
Currently, the main question on my PR 4089 was raised by Antoine Pitrou: "Do people have to provide aligned_alloc and aligned_free? Or can they leave those pointers NULL and get a default implementation?" My reply: "Currently, you must provide all allocator functions, included aligned_alloc and aligned_free. Technically, we can implement a fallback, but I'm not sure that I want to do that :-)" I'm not sure about that. I can imagine platforms which provide a special malloc/free and that's all: no calloc, posix_memalign or _aligned_malloc(). But is Python suppose to fills the holes? For example, implement calloc() as malloc()+memset()? Or is the user of the PyMem_SetAllocator() API responsible to reimplement them? In Python 3.5, we added the requirement of a working calloc(). In Python 3.7, should we also add the "aligned alloc" requirement? In case of doubt, I prefer not to guess, and leave the decision to the caller of the API: require all functions to be implemented. I'm not sure that it's a good idea to provide a "aligned malloc" fallback if such fallback would be inefficient. For example, we would have to overallocate the memory block not only for the requested alignement, but also allocates extra sizeof(size_t) bytes, in each aligned memmory block, to store the size of the alignment itself, to recover the original pointer... to finally be able to call free(). An aligned memory block would look like: [AAAAA SSS DDD...DDDDDDD] where AAAAA are bytes lost for alignment, SSS bytes storing the alignment size (size of "AAAAA" in this example), and "DDD...DDDDDDD" would be the actual data.
msg305003 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-25 16:52
> In Python 3.7, should we also add the "aligned alloc" requirement? Linux, BSD, OSX, MSVC should be covered. According to Stackoverflow MinGW has an internal function. Android, I don't know. Xavier?
msg305009 - (view)	Author: Xavier de Gaye (xdegaye) *	Date: 2017-10-25 20:46
Android has both memalign() [1] and posix_memalign() [2] and does not have aligned_alloc(), posix_memalign() is a wrapper around memalign() [3]. [1] https://android.googlesource.com/platform/bionic/+/master/libc/include/malloc.h#38 [2] https://android.googlesource.com/platform/bionic/+/master/libc/include/stdlib.h#80 [3] https://android.googlesource.com/platform/bionic/+/85aad90%5E%21/
msg305012 - (view)	Author: Nathaniel Smith (njs) *	Date: 2017-10-25 21:38
> I'm not sure that it's a good idea to provide a "aligned malloc" fallback if such fallback would be inefficient. For example, we would have to overallocate the memory block not only for the requested alignement, but also allocates extra sizeof(size_t) bytes, in each aligned memmory block, to store the size of the alignment itself, to recover the original pointer... to finally be able to call free(). You can re-use the same bytes for padding and to store the offset. The main tricky thing is that for an alignment of N bytes you need to overallocate N bytes instead of (N-1). (In the worst case, malloc returns you a pointer that's already N-byte aligned, and then you have to advance it by a full N bytes so that you have some space before the pointer to store the offset.) Also you want to do a little N = max(N, sizeof(whatever int type you use)) at the beginning to make sure it's always big enough, but this is trivial (and really even a uint16_t is plenty big to store all reasonable alignments). And make sure that N is a power-of-two, which guarantees that whatever value malloc returns will be shifted by at least malloc's regular alignment, which is guaranteed to be large enough to store a standard int type (on reasonable systems).
msg305114 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-27 12:03
Should we care about the C11 restriction? http://en.cppreference.com/w/c/memory/aligned_alloc "size - number of bytes to allocate. An integral multiple of alignment" posix_memalign and _aligned_malloc don't care about the multiple. On the other hand, sane requests will have the exact multiple most of the time anyway.
msg305122 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-27 14:14
PR 4089 becomes much more larger than what I expected, so I propose to defer enhancements to following PR, especially the idea of "emulating" PyMem_AlignedAlloc() on top of PyMem_Malloc() if the user calls PyMem_SetAllocators() without implemented aligned_alloc.
msg305136 - (view)	Author: Nathaniel Smith (njs) *	Date: 2017-10-27 21:19
> On the other hand, sane requests will have the exact multiple most of the time anyway. The ways we've discussed using aligned allocation in numpy wouldn't follow this requirement without special checking. Which isn't necessarily a big deal, and numpy won't necessarily use this API anyway. But I would suggest being very clear about exactly what you guarantee and what you don't :-).
msg305160 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-28 12:16
> The ways we've discussed using aligned allocation in numpy wouldn't follow this requirement without special checking. Which isn't necessarily a big deal, and numpy won't necessarily use this API anyway. But I would suggest being very clear about exactly what you guarantee and what you don't :-). In the GitHub issue we sort of decided to make the more relaxed Posix semantics official: 'alignment' must be a power of 2 and a multiple of 'sizeof(void )'. 'size' can be really anything, so it should work for numpy. It's a pity that Posix does not round up align={1,2,4} to 'sizeof(void )' automatically (why not?), so the applications will have to do that.
msg305177 - (view)	Author: Nathaniel Smith (njs) *	Date: 2017-10-29 05:02
Given the complexities here, and that the Track/Untrack functions are public now, I do wonder if the actual aligned allocation routines should just be an internal API (i.e., not exposed in Python.h).
msg305301 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-31 14:18
Nathaniel: "(...) and numpy won't necessarily use this API anyway." Can you elaborate why numpy wouldn't use this new API? I designed it with numpy in mind :-) Using PyMem_AlignedAlloc() instead of using directly posix_memalign()/_aligned_alloc() provides the debug features for free: * tracemalloc is able to trace memory allocations * detect buffer underflow * detect buffer overflow * detect API misuse like PyMem_Free(PyMem_AlignedAlloc()) -- it doesn't detect free(PyMem_AlignedAlloc()) which is plain wrong on Windows (but this one should crash immediately ;-)) Other advantages: * PyMem_AlignedAlloc(alignment, 0) is well defined: it never returns NULL * PyMem_AlignedAlloc(alignment, size) checks on alignment value are the same on all operating systems Moreover, Python takes care of the portability for you :-)
msg305302 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-31 14:23
Nathaniel Smith: "Given the complexities here, and that the Track/Untrack functions are public now, I do wonder if the actual aligned allocation routines should just be an internal API (i.e., not exposed in Python.h)." I don't see why we would hide PyMem_AlignedAlloc() but requires to implement aligned_alloc in PyMem_SetAllocators(). The plan is also to slowly use PyMem_AlignedAlloc() internally for performance. Can you elaborate the "complexities"? Do you mean that the proposed PyMem_AlignedAlloc() API is more complex than calling directly posix_memalign()? PyMem_AlignedAlloc() is designed for performance. For best performances, CPUs require memory to be aligned on convenient values like powers of 2 ;-) I also understand that alignment must be a multiple of sizeof(void*) because CPU work on "CPU words". On a 64-bit CPU, a word is 8 bytes. If the memory is aligned on 4 bytes, it may have to fetch two words, you loose the advantage of memory alignment. I understand that PyMem_AlignedAlloc() requirements come from the CPU arhcitecture, it's not an arbitrary limitation just for the fun ;-)
msg305304 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-31 14:26
Stefan Krah: "we care about the C11 restriction? (...) "size - number of bytes to allocate. An integral multiple of alignment" (...) posix_memalign and _aligned_malloc don't care about the multiple." I prefer to ignore this restriction at this point. I wouldn't be surprised if posix_memalign() and _aligned_malloc() already align the size for us internally. We can add the restriction later, if needed.
msg305308 - (view)	Author: Nathaniel Smith (njs) *	Date: 2017-10-31 14:55
> Can you elaborate why numpy wouldn't use this new API? I designed it with numpy in mind :-) The reasons I had in mind are: 1) numpy hasn't actually come to a decision about whether to use aligned allocation at all, or under what circumstances. 2) if we do use it, we'll probably need our own implementation anyway to support old pythons. 3) also it's not clear what the best approach will look like, given that we care a lot about using calloc when possible, and have reason to prefer using regular freeing functions whenever possible. I wasn't making a criticism of your API; "it's not you, it's us" :-). But this is a complicated and subtle area that's not really part of CPython's core competency, and coming at a time when people are fretting about how to shrink the C APIs surface area. E.g. I can think of more interesting ways for the PyPy folks to spend their time than implementing an aligned_alloc wrapper...
msg305310 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-10-31 15:01
Le 31/10/2017 à 15:55, Nathaniel Smith a écrit : > > 1) numpy hasn't actually come to a decision about whether to use aligned allocation at all, or under what circumstances. This isn't the Numpy bug tracker, but I can't help but mention that if Numpy grew a facility for users to override the memory allocators it invokes to allocate array data, Numpy may not have to come to a decision about this at all... ;-) And it would also help specialized accelerators, which may want to direct Numpy arrays to e.g. memory that's cheaply shared with the GPU. (see https://github.com/numpy/numpy/pull/5470) > I wasn't making a criticism of your API; "it's not you, it's us" :-). But this is a complicated and subtle area that's not really part of CPython's core competency, and coming at a time when people are fretting about how to shrink the C APIs surface area. E.g. I can think of more interesting ways for the PyPy folks to spend their time than implementing an aligned_alloc wrapper... The same argument can be made for any part of the stdlib or core language that PyPy has to reproduce. Besides, I don't think implementing an aligned_alloc wrapper is very difficult. The hard part is getting an agreement over the exposed APIs, and that's CPython's job, not PyPy ;-)
msg305311 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-31 15:14
On Tue, Oct 31, 2017 at 02:55:04PM +0000, Nathaniel Smith wrote: > 3) also it's not clear what the best approach will look like, given that we care a lot about using calloc when possible, and have reason to prefer using regular freeing functions whenever possible. I actually have the same problems. But since no fast (kernel-zeroed) aligned_calloc() exists, I must use memset() anyway. So an emulated aligned_calloc() should probably not go into CPython since it doesn't provide any performance advantages.
msg305327 - (view)	Author: Nathaniel Smith (njs) *	Date: 2017-10-31 18:26
> But since no fast (kernel-zeroed) aligned_calloc() exists, I must use memset() anyway. For large allocations, you'll probably be better off implementing your own aligned allocator on top of calloc than implementing your own calloc on top of an aligned allocator. (It's O(1) overhead versus O(n).) And once you're doing that you might want to use the same code for regular allocations too, so that you don't need to keep track of whether each memory block used aligned_calloc or aligned_malloc and can treat them the same... Depends on your exact circumstances.
msg305331 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-10-31 19:18
New changeset 9ed83c40855b57c10988f76770a4eb825e034cd8 by Victor Stinner in branch 'master': bpo-18835: Cleanup pymalloc (#4200) https://github.com/python/cpython/commit/9ed83c40855b57c10988f76770a4eb825e034cd8
msg305339 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-10-31 20:57
> For large allocations, you'll probably be better off implementing your own aligned allocator on top of calloc than implementing your own calloc on top of an aligned allocator. (It's O(1) overhead versus O(n).) And once you're doing that you might want to use the same code for regular allocations too, so that you don't need to keep track of whether each memory block used aligned_calloc or aligned_malloc and can treat them the same... Depends on your exact circumstances. Yes, but if the whole array is initialized with actual values, then the memset() overhead is not very large (something like 16% here). If uninitialized (or very sparse), the overhead is of course gigantic. What is more, in some crude tests the posix_memalign() performance isn't that great compared to malloc()/calloc(). C11 aligned_alloc() is also quite a bit faster than posix_memalign() here. I think you're right that a hand-rolled solution on top of calloc() is best for my use case.
msg305344 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-11-01 01:13
I merged my PR 4199 (Document PyObject_Malloc()) and PR 4200 (Cleanup pymalloc) to prepare PR 4089. PR 4089 should now be completed and well tested. The real question is now if we need PyMem_AlignedAlloc()? Stefan Krah and Nathaniel Smith are interested by aligned memory allocations, but both wrote that they don't plan to use PyMem_AlignedAlloc() if I understood correctly. What's the point of adding PyMem_AlignedAlloc() in that case?
msg305345 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-11-01 01:15
msg196194: Antoine Pitrou: "Note that the current small object allocator, if not disabled, should already return you aligned memory, by construction (each allocation size has dedicated pools from which memory blocks are carved)." In my current implementation of _PyObject_AlignedAlloc(), I only call pymalloc_alloc() for alignment <= pymalloc ALIGNMENT. Maybe we can do better?
msg305366 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2017-11-01 10:41
I agree, if the two primary users say they won't use it, it doesn't make sense for us to maintain 600 hundred lines of low-level C code.
msg305397 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-11-01 21:21
Yes, it may be better not to add it. To summarize the problems again: -------------------------------- - C11 aligned_alloc() / free() would be more comfortable but isn't available on MSVC. - posix_memalign() performance isn't that great. - hand-rolled aligned_calloc() is the fastest. The feature could still be useful for fixing #31912 and #27987, if someone has an idea how to integrate the aligned version.
msg305417 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-11-02 11:37
"C11 aligned_alloc() / free() would be more comfortable but isn't available on MSVC." Is it common to require the allocated memory block to be filled with zeros? In Python, calloc() is not the most common allocator. I only found 5 places where calloc is used: Modules/_ssl.c:5172: _ssl_locks = PyMem_Calloc(_ssl_locks_count, Modules/gcmodule.c:1689: g = (PyGC_Head )PyObject_Calloc(1, size); Modules/gcmodule.c:1717:_PyObject_GC_Calloc(size_t basicsize) Objects/bytesobject.c:83: op = (PyBytesObject )PyObject_Calloc(1, PyBytesObject_SIZE + size); Objects/listobject.c:176: op->ob_item = (PyObject *) PyMem_Calloc(size, sizeof(PyObject )); "posix_memalign() performance isn't that great. hand-rolled aligned_calloc() is the fastest." I'm not sure that the cost of the memory allocator itself defeats the gain of aligned memory on algorithms. I expect data processing to be much more expensive than the memory allocation, no? Again, the unknown remains the benchmark result. Can anyone write a quick benchmark to measure the gain of aligned memory? 4 years ago, Raymond Hettinger wanted to use it for the set and collection.deque types.
msg305418 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-11-02 11:40
About the API itself, I'm not sure that PyMem_AlignedAlloc(alignment, size) is flexible enough. If we want to get data aligned in a Python object, we would have to pass an offset to the data, since Python objects have headers of variable size (depending on the type). Windows has such API: void * _aligned_offset_malloc( size_t size, size_t alignment, size_t offset ); This function is based on malloc, so likely adds padding bytes for you depending on size, alignment and offset. https://msdn.microsoft.com/fr-fr/library/ec852tkw.aspx See bpo-27987: "obmalloc's 8-byte alignment causes undefined behavior".
msg305466 - (view)	Author: Stefan Krah (skrah) *	Date: 2017-11-03 09:53
> I'm not sure that the cost of the memory allocator itself defeats the gain of aligned memory on algorithms. I expect data processing to be much more expensive than the memory allocation, no? I guess this issue isn't easy to focus due to the vast variety of use cases. So the is only about numpy/ndtypes: What you write is true, but I'm simply getting cold feet w.r.t locking myself into memset(). calloc() uses mmap() for large allocations, so I think one can happily allocate a large number of huge arrays without any cost on Linux, as long as they're not accessed. At least that's what my tests indicate. Couple that with the fact that one has to use aligned_free() anyway, and that posix_memalign() isn't that great, and the use case seems less solid for scientific computing. So I rather waste a couple of bytes per allocation and deal with some Valgrind macros to get proper bounds checking. Note that CPython still knows any allocation from ndtypes, because ndt_callocfunc will be set to PyMem_Calloc() [1] and the custom ndt_aligned_calloc() uses ndt_callocfunc. [1] If #31912 is solved.
msg306864 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-11-24 01:07
Because of the lack of strong motivition compared to the size of the change and the cost to maintain it, I reject this issue. For the archives, I attach a copy of the latest version of my implementation.

History
Date	User	Action	Args
2022-04-11 14:57:50	admin	set	github: 63035
2019-08-19 10:30:41	vstinner	set	pull_requests: + pull_request15052
2017-11-24 01:08:11	vstinner	set	status: open -> closed files: + pymem_alignedalloc.patch messages: + msg306864 resolution: rejected stage: patch review -> resolved
2017-11-03 09:53:27	skrah	set	messages: + msg305466
2017-11-02 11:40:59	vstinner	set	messages: + msg305418
2017-11-02 11:37:03	vstinner	set	messages: + msg305417
2017-11-01 21:21:00	skrah	set	messages: + msg305397
2017-11-01 10:41:48	pitrou	set	messages: + msg305366
2017-11-01 01:15:54	vstinner	set	messages: + msg305345
2017-11-01 01:13:44	vstinner	set	title: Add aligned memory variants to the suite of PyMem functions/macros -> Add PyMem_AlignedAlloc()
2017-11-01 01:13:19	vstinner	set	messages: + msg305344
2017-10-31 20:57:50	skrah	set	messages: + msg305339
2017-10-31 19:18:13	vstinner	set	messages: + msg305331
2017-10-31 18:26:17	njs	set	messages: + msg305327
2017-10-31 16:17:32	vstinner	set	pull_requests: + pull_request4168
2017-10-31 15:14:41	skrah	set	messages: + msg305311
2017-10-31 15:01:12	pitrou	set	messages: + msg305310
2017-10-31 14:55:04	njs	set	messages: + msg305308
2017-10-31 14:26:16	vstinner	set	messages: + msg305304
2017-10-31 14:23:42	vstinner	set	messages: + msg305302
2017-10-31 14:18:30	vstinner	set	messages: + msg305301
2017-10-29 05:02:59	njs	set	messages: + msg305177
2017-10-28 12:16:23	skrah	set	messages: + msg305160
2017-10-27 21:19:17	njs	set	messages: + msg305136
2017-10-27 14:14:48	vstinner	set	messages: + msg305122
2017-10-27 12:03:40	skrah	set	messages: + msg305114
2017-10-25 21:38:08	njs	set	messages: + msg305012
2017-10-25 20:46:55	xdegaye	set	messages: + msg305009
2017-10-25 16:52:32	skrah	set	nosy: + xdegaye messages: + msg305003
2017-10-25 14:33:21	vstinner	set	messages: + msg304989
2017-10-24 09:08:34	skrah	set	messages: + msg304878
2017-10-24 05:57:54	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg304859
2017-10-23 21:23:19	skrah	set	messages: + msg304844
2017-10-23 21:16:08	pitrou	set	messages: + msg304843
2017-10-23 19:53:15	njs	set	messages: + msg304837
2017-10-23 19:11:27	skrah	set	messages: + msg304834
2017-10-23 17:16:53	vstinner	set	messages: + msg304826
2017-10-23 17:08:02	vstinner	set	stage: needs patch -> patch review pull_requests: + pull_request4059
2017-10-23 14:14:11	pitrou	set	versions: + Python 3.7, - Python 3.5
2017-10-23 14:11:28	skrah	set	messages: + msg304807
2017-10-23 13:35:53	pitrou	set	messages: + msg304805
2017-10-23 13:26:46	skrah	set	status: closed -> open nosy: + skrah messages: + msg304803 resolution: rejected -> (no value)
2017-01-23 11:15:19	vstinner	set	status: open -> closed resolution: rejected messages: + msg286074
2015-01-15 16:50:11	pitrou	set	messages: + msg234086
2015-01-14 17:57:47	wscullin	set	nosy: + wscullin
2014-12-08 14:25:34	trent	set	nosy: + trent
2014-12-05 22:58:03	vstinner	set	messages: + msg232225
2014-12-05 22:43:08	njs	set	messages: + msg232223
2014-12-05 22:36:22	pitrou	set	messages: + msg232222
2014-12-05 22:30:35	njs	set	messages: + msg232221
2014-12-05 22:15:18	vstinner	set	messages: + msg232219
2014-12-05 21:49:47	pitrou	set	messages: + msg232215
2014-12-05 21:20:21	vstinner	set	messages: + msg232209
2014-12-05 18:25:35	pitrou	set	versions: + Python 3.5, - Python 3.4 nosy: + njs messages: + msg232201 type: enhancement
2014-04-27 00:11:24	vstinner	set	messages: + msg217229
2013-09-03 09:40:10	vstinner	set	messages: + msg196834
2013-09-03 08:32:12	pitrou	set	messages: + msg196828
2013-09-01 07:30:38	neologix	set	nosy: + neologix messages: + msg196713
2013-09-01 04:40:17	rhettinger	set	files: + align.diff keywords: + patch messages: + msg196700
2013-09-01 00:36:11	rhettinger	set	messages: + msg196693
2013-09-01 00:35:47	rhettinger	set	messages: - msg196692
2013-09-01 00:33:44	rhettinger	set	title: Add aligned memroy variants to the suite of PyMem functions/macros -> Add aligned memory variants to the suite of PyMem functions/macros
2013-09-01 00:33:22	rhettinger	set	messages: + msg196692
2013-08-26 12:18:46	pitrou	set	messages: + msg196202
2013-08-26 10:52:46	vstinner	set	messages: + msg196196
2013-08-26 10:23:02	pitrou	set	nosy: + vstinner, pitrou, tim.peters messages: + msg196194
2013-08-26 02:38:03	rhettinger	set	messages: + msg196176
2013-08-26 02:32:51	rhettinger	create