classification
Title: Add aligned memory variants to the suite of PyMem functions/macros
Type: enhancement Stage: needs patch
Components: Interpreter Core Versions: Python 3.5
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: haypo, neologix, njs, pitrou, rhettinger, tim.peters, trent, wscullin
Priority: normal Keywords: patch

Created on 2013-08-26 02:32 by rhettinger, last changed 2017-01-23 11:15 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
align.diff rhettinger, 2013-09-01 04:40 Use case for alignment functions and macros review
Messages (21)
msg196174 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-26 02:32
Currently, we have little to no control over the alignment of memory returned by the PyMem functions.   

I suggest variants such as PyMem_Alloc32(n) and PyMem_Alloc64(n) to return suitably aligned data blocks.

GNU C provides memalign() for this purpose:  http://www.delorie.com/gnu/docs/glibc/libc_31.html
and the Microsoft C compiler has _aligned_malloc() for the same purpose: http://msdn.microsoft.com/en-us/library/8z34s9c6%28VS.80%29.aspx

A principal use case would be PyObject pointers where we want to keep all or most of the data fields in the same cache line (i.e. the fields for list, tuple, dict, and set objects).

Deques would benefit from having the deque blocks aligned to 64byte boundaries and never crossing page boundaries.  Set entries would benefit from 32byte alignment.
msg196176 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-08-26 02:38
I should also mention posix_memalign() for Unix/Linux systems.

On systems without a clean way to provide aligned memory, we can always fall back to regular malloc().   The API wouldn't guarantee alignment but would comply when possible.
msg196194 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-26 10:23
Unless the memory allocator actually supports it, this means you lose a whole lot of memory for padding, though... Memory which will sit there unused at the end of another cacheline.

Note that the current small object allocator, if not disabled, *should* already return you aligned memory, by construction (each allocation size has dedicated pools from which memory blocks are carved).
msg196196 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-08-26 10:52
The default allocator for PyObject is PyType_GenericAlloc(). If the type has the Py_TPFLAGS_HAVE_GC flag, PyType_GenericAlloc() calls _PyObject_GC_Malloc(). It is the case for the set type. _PyObject_GC_Malloc() adds an header of sizeof(PyGC_Head) (12 bytes on x86) before the PyObject data and then calls PyObject_MALLOC(). By default, Python uses pymalloc for PyObject_MALLOC() which uses an alignment of 8 bytes.

For set, I suppose that you are talking about the allocation of the "table", not of the set object itself. set_table_resize() uses PyMem_NEW() to allocate the table.

If we provide a PyMem_MallocAligned(alignment, size), table entries would be aligned, but not entries of the smalltable. Does it matter?

> I suggest variants such as PyMem_Alloc32(n) and PyMem_Alloc64(n) to return suitably aligned data blocks.

I would prefer a parameter rather than a new function per alignment! Which API does you propose exactly? On Linux, I see at least 3 functions:

int posix_memalign(void **memptr, size_t alignment, size_t size);
void *valloc(size_t size);
void *memalign(size_t boundary, size_t size);

Do you propose aligned variant for PyMem, PyMem_Raw and PyObject, or only PyMem?

"Unless the memory allocator actually supports it, this means you lose a whole lot of memory for padding, though... Memory which will sit there unused at the end of another cacheline."

What is the alignment of a cacheline? Can a line starts at any address?

Do you have an idea of performance benefit of memory alignment?

Adding yet another API to allocate memory has a cost.

Python 3.4 already implemented the PEP 445 which added many new functions. If we add new functions, they should conform the PEP 445 (need get/set allocator functions to support hooks to track the memory usage).
msg196202 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-26 12:18
> What is the alignment of a cacheline? Can a line starts at any
> address?

If it could, Raymond wouldn't be asking for this feature ;-)
Cachelines are typically aligned at whatever their size is. So,
a 64-byte cacheline will be aligned at a 64 bytes boundary.
Perhaps some CPUs operate differently, but mainstream CPUs are
generally like that (for good reason: a cache is much simpler
to implement if there can't be some overlapping between cache
lines).

> Do you have an idea of performance benefit of memory alignment?
> 
> Adding yet another API to allocate memory has a cost.

Agreed. Aligned memory allocation is useful if your
*algorithms* benefit from alignment (e.g. some SIMD-optimized
code, or something relying on page tables). But aligning
every data structure on a cacheline boundary doesn't sound
like a very good idea: if it was, the system allocator
would do it for you, IMHO ;-)
msg196693 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-09-01 00:36
> Adding yet another API to allocate memory has a cost

Please don't FUD this one to death.  Aligned memory access is sometimes important and we currently have no straight-forward way to achieve it.  If you're truly worried about adding single new function to the public C API, we can create  just a single internal function:  void * _PyMem_RawMallocAligned(size_t size, size_t alignment).

> aligning every data structure on a cacheline boundary 
> doesn't sound like a very good idea

We don't have to align EVERY data structure.  But I do have immediate beneficial use cases for set tables and for data blocks in deque objects.  I need this function and would appreciate your help in fitting it in nicely with the current memory management functions and macros.
msg196700 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2013-09-01 04:40
Attaching a patch for what I would like to do with the alignment functions and macros.
msg196713 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-09-01 07:30
> Please don't FUD this one to death.  Aligned memory access is
> sometimes important and we currently have no straight-forward
> way to achieve it.

I guess that a simple way to cut the discussion short would be to have a first implementation, and run some benchmarks to measure the benefits.

I can certainly see the benefit of cacheline-aligned data structures in multithreaded code (to avoid false sharing/cacheline bouncing): I'm really curious to see how much this would benefit in a single-threaded workload.
msg196828 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-09-03 08:32
> We don't have to align EVERY data structure.  But I do have immediate
> beneficial use cases for set tables and for data blocks in deque
> objects.

Can you explain what the use cases are, and post some benchmarking code?

Also, what would be the strategy? Would you align every set/deque, or only
the bigger ones?
msg196834 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-09-03 09:40
Linux provides the following functions:

int posix_memalign(void **memptr, size_t alignment, size_t size);
void *valloc(size_t size);  # obsolete
void *memalign(size_t boundary, size_t size);  # obsolete

Windows provides the following functions:

void* _aligned_malloc(size_t size,  size_t alignment);
void _aligned_free(void *memblock);

_aligned_malloc() has a warning: "Using free is illegal."

Do all platform provide at least a function to allocate aligned memory? Windows, Mac OS X, FreeBSD, old operating systems, etc.? If no, how should we handle these operating systems? Refuse to start Python? Disable some optimizations? How should we check in the source code (ex: setobject.c) than aligned allocations are not supported? An #ifdef?

***

Because of the Windows API, wee need at least two functions:

void* PyMem_MallocAligned(size_t size, size_t alignment);
void PyMem_FreeAligned(void *ptr);

The GIL must be held when callling these functions.


Windows provides also a "realloc" function:

void* _aligned_realloc(void *memblock, size_t size,  size_t alignment);

If the OS does not provide a realloc function, we can reimplement it (malloc, memcpy, free).

void* PyMem_ReallocAligned(void *ptr, size_t size, size_t alignment);

***

For the PEP 445: the new API is different than the PyMemAllocator structure because malloc and realloc functions have an extra "alignment" parameter. We can drop the parameter if the allocator has always the size alignment, but each object may require a different aligment?
msg217229 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-04-27 00:11
It looks like a memory allocator with a given alignment would help numpy, for SIMD instructions:
https://mail.python.org/pipermail/python-dev/2014-April/134092.html
(but "Numpy does not currently use aligned allocation, and it's not clear how important it is")

See also this old discussion on python-dev:
https://mail.python.org/pipermail/python-dev/2010-September/103911.html

Related to this website:
http://mallocv2.wordpress.com/
msg232201 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-12-05 18:25
Benchmarks and Intel's recommendation show that aligned allocation is actually important for AVX performance, and NumPy depends on CPython providing the right allocation APIs (for integration with tracemalloc): https://github.com/numpy/numpy/issues/5312

So I think for 3.5 we should start providing the APIs. Whether we use them in Python core is another discussion.

Nathaniel, what APIs would you need exactly? See Victor's proposal in 
msg196834.
msg232209 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-12-05 21:20
Windows provides:

void * _aligned_malloc(
    size_t size, 
    size_t alignment
);

http://msdn.microsoft.com/en-US/library/8z34s9c6%28v=vs.80%29.aspx

How should we handle platforms which don't provide a memory allocator with an alignment? The simplest option is to return NULL (MemoryError).

Allocating more memory and skip first bytes may work, but how do we retrieve the original address if the function releasing the memory block?

What about Solaris, Mac OS X, FreeBSD, etc.
msg232215 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-12-05 21:49
> How should we handle platforms which don't provide a memory allocator
> with an alignment? The simplest option is to return NULL (MemoryError).

Are there such platforms? posix_memalign() is a POSIX standard, even OpenBSD has it.
msg232219 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-12-05 22:15
PyMem_GetAllocator() and PyMem_SetAllocator() have a domain parameter which can take 3 values: PYMEM_DOMAIN_RAW, PYMEM_DOMAIN_MEM and PYMEM_DOMAIN_OBJ.

I don't think that we need 3 flavors of allocators (PyMem_Raw, PyMem, PyObject).

Maybe the PYMEM_DOMAIN_RAW domain is enough: OS functions don't require the GIL. In this case, should we add a new pair of Get/Set functions with an associated structure? Or maybe PyMem_SetAllocator() may ignore the aligned members of the patched PyMemAllocatorEx structure for domains other than PYMEM_DOMAIN_RAW? And PyMem_GetAllocator() would fill members with NULL.
msg232221 - (view) Author: Nathaniel Smith (njs) * Date: 2014-12-05 22:30
It's not terribly difficult to write a crude-but-effective aligned allocator on top of raw malloc:

def aligned_malloc(size, alignment):
    assert alignment < 255
    raw_pointer = (uint8*) malloc(size + alignment)
    shift = alignment - (raw_pointer % alignment)
    assert 0 < shift <= alignment
    aligned_pointer = raw_pointer + shift
    *(aligned_pointer - 1) = shift
    return aligned_pointer

def aligned_free(uint8* pointer):
    shift = *(pointer - 1)
    free(pointer - shift)

But, this fallback and the official Win32 API both disallow the use of plain free() (like Victor points out in msg196834), so we can't just add an aligned_malloc slot to the PyMemAllocator struct. This kind of aligned allocation is effectively its own memory domain.

If native aligned allocation support were added to PyMalloc then it could potentially do better (e.g. by noticing that it already has a block on its freelist with the requested alignment and just returning that instead of overallocating). This might be the ideal solution for Raymond's use case, but I have no idea how much work it would be to mess around with PyMalloc innards.

Numpy doesn't currently use aligned allocation for anything, but we'd like to keep our options open. If we do end up using it in the future then there's a reasonable chance we might want to use it *without* the GIL held (e.g. for allocating temporary buffers inside C loops). OTOH we are also happy to implement the aligned allocation ourselves (either on top of the system APIs or directly) -- we just don't want to lose tracemalloc support when we do.

For numpy's purposes, I think the best approach would be to add a tracemalloc "escape valve", with an interface like:

PyMem_RecordAlloc(const char* domain, void* tag, size_t quantity, 
PyMem_RecordRealloc(const char* domain, void* old_tag, void* new_tag, size_t new_quantity)
PyMem_RecordFree(const char* domain, void* tag)

where the idea is that if after someone allocates memory (or potentially other discrete resources) directly without going through PyMem_*, they could then call these functions to tell tracemalloc what they just did.

This would be useful in a number of cases: in addition to tracking aligned allocations, it would make it possible to re-use the tracemalloc infrastructure to track GPU buffers allocated by CUDA/GPGPU-type code, mmap usage, hugetlbfs usage, etc. Potentially even open file descriptors if one wants to go there (seems pretty useful, actually).
msg232222 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-12-05 22:36
Le 05/12/2014 23:15, STINNER Victor a écrit :
> 
> 
> I don't think that we need 3 flavors of allocators (PyMem_Raw,
> PyMem,
PyObject).
> 
> Maybe the PYMEM_DOMAIN_RAW domain is enough: OS functions don't
require the GIL. In this case, should we add a new pair of Get/Set
functions with an associated structure?

How about a new domain instead? PYMEM_DOMAIN_RAW_ALIGNED?
msg232223 - (view) Author: Nathaniel Smith (njs) * Date: 2014-12-05 22:43
Re: msg232219: If you go down the route of adding both aligned_malloc and aligned_free to the Allocator structure, I think you might as well support it for all domains? For the PyMem and PyObject domains you can just literally set the default functions to be PyMem_RawAlignedMalloc and PyMem_RawAlignedFree, and that leaves the door open to fancier implementations in the future (e.g. if PyMalloc starts implementing aligned allocation directly).

Re: msg232222: Currently all the domains share the same vtable struct, though, whereas aligned allocator functions have different signatures. So you can't literally just add an entry to the existing domain enum and be done.

It also occurs to me that if numpy ever gets serious about using aligned memory then we might also need aligned_realloc (numpy allows arrays to be resized, sigh), which is possible to do but I *cannot* recommend python attempt to provide as a standard interface. (It's not available at all in POSIX.) I guess this is another argument that it might be best to just give numpy an escape valve and worry about CPython's internal needs separately.
msg232225 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2014-12-05 22:58
You cannot just add a new domain because the function prototypes are
different (need an extra alignement parameter). You have to add new members
to the structure or add a new structure.
msg234086 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-01-15 16:50
Due to the realloc() problem, I'm thinking that the best course for Numpy would be to use its own allocator wrapper like Nathaniel outligned.
Also, Numpy wants calloc() for faster allocation of zeroed arrays...
msg286074 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2017-01-23 11:15
Antoine Pitrou: "Benchmarks and Intel's recommendation show that aligned allocation is actually important for AVX performance, and NumPy depends on CPython providing the right allocation APIs (for integration with tracemalloc): https://github.com/numpy/numpy/issues/5312"

I don't think that NumPy was ever fully integrated with tracemalloc.

Since Python 3.6, NumPy doesn't have to use Python memory allocators to benefit of tracemalloc debugger: I added a new C API to be able to manually track/untrack memory blocks which are not directly allocated by Python, see the issue #26530. I implemented this feature for NumPy, but since I never got any feedback from NumPy, I left the API private.

Moreover, I also added second feature to tracemalloc: it's now possible to track memory allocation in different address spaces. The feature was also designed for NumPy which can allocate memory in the GPU address space. See the issue #26588.

With these new tracemalloc features, I don't think that NumPy can still be used to request this feature in CPython core.

--

Raymond: "A principal use case would be PyObject pointers where we want to keep all or most of the data fields in the same cache line (i.e. the fields for list, tuple, dict, and set objects).

Deques would benefit from having the deque blocks aligned to 64byte boundaries and never crossing page boundaries.  Set entries would benefit from 32byte alignment."

Victor (me!): "Do you have an idea of performance benefit of memory alignment?"

Since Raymond never provided any evidence that a new aligned memory allocator would give a significant speedup, and there issue is inactive for 2 years, I close it.

See also the change 6e16b0045cf1, it seems like Raymond doesn't plan to use this feature anymore.

--

If someone wants this feature, we need good reasons to implement it.
History
Date User Action Args
2017-01-23 11:15:19hayposetstatus: open -> closed
resolution: rejected
messages: + msg286074
2015-01-15 16:50:11pitrousetmessages: + msg234086
2015-01-14 17:57:47wscullinsetnosy: + wscullin
2014-12-08 14:25:34trentsetnosy: + trent
2014-12-05 22:58:03hayposetmessages: + msg232225
2014-12-05 22:43:08njssetmessages: + msg232223
2014-12-05 22:36:22pitrousetmessages: + msg232222
2014-12-05 22:30:35njssetmessages: + msg232221
2014-12-05 22:15:18hayposetmessages: + msg232219
2014-12-05 21:49:47pitrousetmessages: + msg232215
2014-12-05 21:20:21hayposetmessages: + msg232209
2014-12-05 18:25:35pitrousetversions: + Python 3.5, - Python 3.4
nosy: + njs

messages: + msg232201

type: enhancement
2014-04-27 00:11:24hayposetmessages: + msg217229
2013-09-03 09:40:10hayposetmessages: + msg196834
2013-09-03 08:32:12pitrousetmessages: + msg196828
2013-09-01 07:30:38neologixsetnosy: + neologix
messages: + msg196713
2013-09-01 04:40:17rhettingersetfiles: + align.diff
keywords: + patch
messages: + msg196700
2013-09-01 00:36:11rhettingersetmessages: + msg196693
2013-09-01 00:35:47rhettingersetmessages: - msg196692
2013-09-01 00:33:44rhettingersettitle: Add aligned memroy variants to the suite of PyMem functions/macros -> Add aligned memory variants to the suite of PyMem functions/macros
2013-09-01 00:33:22rhettingersetmessages: + msg196692
2013-08-26 12:18:46pitrousetmessages: + msg196202
2013-08-26 10:52:46hayposetmessages: + msg196196
2013-08-26 10:23:02pitrousetnosy: + haypo, pitrou, tim.peters
messages: + msg196194
2013-08-26 02:38:03rhettingersetmessages: + msg196176
2013-08-26 02:32:51rhettingercreate