Issue 26601: Use new madvise()'s MADV_FREE on the private heap



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/70788

classification

Title:	Use new madvise()'s MADV_FREE on the private heap
Type:	enhancement	Stage:	needs patch
Components:	Interpreter Core	Versions:	Python 3.7

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	StyXman, bar.harel, dw, jtaylor, neologix, pitrou, vstinner, ztane
Priority:	normal	Keywords:

Created on 2016-03-21 10:51 by StyXman, last changed 2022-04-11 14:58 by admin.

Messages (23)
msg262117 - (view)	Author: Marcos Dione (StyXman) *	Date: 2016-03-21 10:51
Linux kernel's new madvise() MADV_FREE[1] could be used in the memory allocator to signal unused parts of the private heap as such, allowing the kernel to use those pages for resolving lowmem pressure situations. From an LWN article[2]:

[...] Rather than reclaiming the pages immediately, this operation marks them for "lazy freeing" at some future point. Should the kernel run low on memory, these pages will be among the first reclaimed for other uses; should the application try to use such a page after it has been reclaimed, the kernel will give it a new, zero-filled page. But if memory is not tight, pages marked with MADV_FREE will remain in place; a future access to those pages will clear the "lazy free" bit and use the memory that was there before the MADV_FREE call.

[...] MADV_FREE appears to be aimed at user-space memory allocator implementations. When an application frees a set of pages, the allocator will use an MADV_FREE call to tell the kernel that the contents of those pages no longer matter. Should the application quickly allocate more memory in the same address range, it will use the same pages, thus avoiding much of the overhead of freeing the old pages and allocating and zeroing the new ones. In short, MADV_FREE is meant as a way to say "I don't care about the data in this address range, but I may reuse the address range itself in the near future."

Also note that this feature already exists in BSD kernels.

--
[1] http://kernelnewbies.org/Linux_4.5#head-42578a3e087d5bcc2940954a38ce794fe2cd642c
[2] https://lwn.net/Articles/590991/
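For readers who want to see the semantics described in the LWN quote, here is a minimal sketch using Python's own mmap module rather than the C allocator this issue is about. It assumes a Unix platform and Python 3.8 or later (where mmap objects have a madvise() method); the function name lazy_free_demo and the constant ARENA_SIZE are illustrative only.

```python
import mmap

ARENA_SIZE = 256 * 1024  # arena size quoted later in this thread


def lazy_free_demo(size=ARENA_SIZE):
    """Return the first byte read back after MADV_FREE, or None if unsupported."""
    if not hasattr(mmap, "MADV_FREE") or not hasattr(mmap.mmap, "madvise"):
        return None  # platform or Python version lacks the call
    # MADV_FREE applies to private anonymous memory, so request MAP_PRIVATE.
    buf = mmap.mmap(-1, size, flags=mmap.MAP_PRIVATE)
    try:
        buf[:] = b"x" * size              # dirty every page
        try:
            buf.madvise(mmap.MADV_FREE)   # tell the kernel the contents no longer matter
        except OSError:
            return None                   # e.g. a kernel that rejects MADV_FREE
        # Either the old data (pages not reclaimed) or zeros (pages reclaimed).
        return buf[0:1]
    finally:
        buf.close()
```

Which of the two values comes back depends on memory pressure at the time, so code using this advice must not rely on either outcome.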
msg262120 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-03-21 12:05
Are you aware of unused memory in the heap memory? The pymalloc memory allocator uses munmap() to release a whole arena as soon as the last memory block of an arena is freed.
msg263180 - (view)	Author: Antti Haapala (ztane) *	Date: 2016-04-11 14:13
... and it turns out that munmapping is not always that smart thing to do: http://stackoverflow.com/questions/36548518/variable-assignment-faster-than-one-liner
msg263181 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-11 14:29
> ... and it turns out that munmapping is not always that smart thing to do: http://stackoverflow.com/questions/36548518/variable-assignment-faster-than-one-liner

py -3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
10000 loops, best of 3: 97.7 usec per loop

py -3 -m timeit "a = tuple(range(2000)); b = tuple(range(2000)); a==b"
10000 loops, best of 3: 70.7 usec per loop

Hum, it looks like this specific benchmark spends a lot of time to allocate one arena and then release it. Maybe we should keep one "free" arena to avoid the slow mmap/munmap. But it means that we keep 256 KB of unused memory. Maybe we need an heuristic to release the free arena after N calls to object allocator functions which don't need this free arena.
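The two command lines in this message can also be reproduced from a script, which makes it easier to rerun them on different builds. This is only a sketch: the helper name compare_statements and the loop counts are arbitrary choices, and the absolute numbers will differ by machine and Python version.

```python
import timeit

ONE_LINER = "tuple(range(2000)) == tuple(range(2000))"
TWO_STEP = "a = tuple(range(2000)); b = tuple(range(2000)); a==b"


def compare_statements(number=200, repeat=3):
    """Return best-of-`repeat` timings in microseconds per loop for both statements."""
    results = {}
    for label, stmt in (("one_liner", ONE_LINER), ("two_step", TWO_STEP)):
        best = min(timeit.repeat(stmt, number=number, repeat=repeat))
        results[label] = best / number * 1e6
    return results


if __name__ == "__main__":
    for label, usec in compare_statements().items():
        print(f"{label}: {usec:.1f} usec per loop")
```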
msg263184 - (view)	Author: Antti Haapala (ztane) *	Date: 2016-04-11 14:49
> Maybe we need an heuristic to release the free arena after N calls to object allocator functions which don't need this free arena.

That'd be my thought; again, I believe that `madvise` could be useful there. I believe `mmap`/`munmap` is particularly slow here because the kernel actually needs to supply 256 kbytes of zeroed pages each time.
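The heuristic being discussed (keep one free arena, release it after N allocator calls that did not need it) can be modelled with a toy counter. This is not pymalloc code: the class ArenaPool, its methods, and the release_after parameter are all hypothetical, and the "mmap"/"munmap" calls are only counted, not performed.

```python
class ArenaPool:
    """Toy model: count simulated mmap/munmap calls with an optional spare arena."""

    def __init__(self, release_after=None):
        # release_after=None models the eager behaviour: munmap at once.
        self.release_after = release_after
        self.spare = False       # is one freed arena being kept around?
        self.idle_calls = 0      # allocator calls since the spare was parked
        self.mmap_calls = 0
        self.munmap_calls = 0

    def acquire(self):
        """A new arena is needed: reuse the spare if there is one."""
        if self.spare:
            self.spare = False
        else:
            self.mmap_calls += 1

    def release(self):
        """An arena became completely free."""
        if self.release_after is None or self.spare:
            self.munmap_calls += 1   # eager mode, or a spare is already parked
        else:
            self.spare = True
            self.idle_calls = 0

    def tick(self):
        """An allocator call that did not need a new arena."""
        if self.spare:
            self.idle_calls += 1
            if self.idle_calls >= self.release_after:
                self.spare = False
                self.munmap_calls += 1


def churn(pool, cycles=1000):
    """Repeatedly need one arena and free it again, like the benchmark above."""
    for _ in range(cycles):
        pool.acquire()
        pool.release()
    return pool.mmap_calls, pool.munmap_calls
```

In this model, churn(ArenaPool()) performs one simulated mmap and one munmap per cycle, while churn(ArenaPool(release_after=64)) maps once and keeps reusing the spare; the spare is only unmapped after 64 consecutive tick() calls.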
msg263192 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2016-04-11 17:35
> ... and it turns out that munmapping is not always that smart thing to do:

I don't think a silly benchmark says anything about the efficiency of our allocation strategy. If you have a real-world use case where this turns up, then please post about it.
msg263201 - (view)	Author: Antti Haapala (ztane) *	Date: 2016-04-11 19:42
I said that munmapping is not the smart thing to do: and it is not, if you're going to mmap soon again.
msg263202 - (view)	Author: Antti Haapala (ztane) *	Date: 2016-04-11 19:56
Also, what is important to notice is that the behaviour occurs exactly because the current heuristics work; the allocations were successfully organized so that one arena could be freed as soon as possible. The question is whether it is sane to try to free the few bits of free memory asap. Say you're now holding 100M of memory: it does not often matter much if you hold the 100M of memory for one second longer than you actually ended up needing.
msg263207 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2016-04-11 21:06
Another question is how often this situation occurs in practice and whether it's worth spending some bits, CPU cycles and developer time on "fixing" this.
msg263212 - (view)	Author: Bar Harel (bar.harel) *	Date: 2016-04-11 22:04
Any idea how to test it then? I found this happening by chance because I care about efficiency too much. We can't just stick timeit in random areas and hope to get results.
msg263937 - (view)	Author: Julian Taylor (jtaylor)	Date: 2016-04-21 22:21
The simplest way to fix this would be to use malloc instead of mmap in the allocator; then you also get MADV_FREE for free when malloc uses it. The rationale for using mmap is kind of weak; the source just says "heap fragmentation". The usual argument for using mmap is not that but the instant return of memory to the system, quite the opposite of what the Python memory pool does.
msg263939 - (view)	Author: David Wilson (dw) *	Date: 2016-04-21 22:39
@Julian note that ARENA_SIZE is double the threshold after which at least glibc resorts to calling mmap directly, so using malloc in place of mmap on at least Linux would have zero effect
msg263940 - (view)	Author: Julian Taylor (jtaylor)	Date: 2016-04-21 23:11
ARENA_SIZE is 256kb, the threshold in glibc is up to 32 MB
msg263941 - (view)	Author: David Wilson (dw) *	Date: 2016-04-21 23:16
It defaults to 128kb, and messing with global state like the system allocator is a fine way to tempt regressions in third party code
msg263942 - (view)	Author: Julian Taylor (jtaylor)	Date: 2016-04-21 23:18
it defaulted to 128kb ten years ago; it has been a dynamic threshold for ages.
msg263968 - (view)	Author: Charles-François Natali (neologix) *	Date: 2016-04-22 06:56
> Julian Taylor added the comment:
>
> it defaulted to 128kb ten years ago, its a dynamic threshold since ages.

Indeed, and that's what encouraged switching the allocator to use mmap. The problem with the dynamic mmap threshold is that since the Python allocator uses fixed-size arenas, basically malloc always ends up allocating from the heap (brk). Which means that given that we don't use a - compacting - garbage collector, after a while the heap would end up quite fragmented, or never shrink: for example, let's say you allocate 1GB - on the heap - and then you free it, but a single object is allocated at the top of the heap: your heap never shrinks back. This has bitten people (and myself a couple times at work).

Now, I see several options:
- revert to using malloc, but this will re-introduce the original problem
- build some form of hysteresis in the arena allocation
- somewhat orthogonally, I'd be interested to see if we couldn't increase the arena size
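The interaction between a dynamic mmap threshold and fixed-size arenas can be illustrated with a toy model. This is a simplification written for this thread, not glibc's code: the class ToyMalloc is hypothetical, the per-chunk overhead of one 4 KiB page is an assumption, and the rule "freeing an mmapped block larger than the threshold raises the threshold to that block's size" is taken from the behaviour described in the mallopt(3) man page.

```python
KIB = 1024
ARENA_SIZE = 256 * KIB           # pymalloc arena size quoted in this thread
INITIAL_THRESHOLD = 128 * KIB    # initial mmap threshold quoted in this thread
THRESHOLD_MAX = 32 * KIB * KIB   # upper bound quoted in this thread (32 MB)
CHUNK_OVERHEAD = 4 * KIB         # assumption: an mmapped chunk is about a page larger


class ToyMalloc:
    """Toy model of a dynamic mmap threshold; not glibc's real implementation."""

    def __init__(self):
        self.threshold = INITIAL_THRESHOLD
        self.log = []            # where each request was served from

    def malloc(self, size):
        source = "mmap" if size >= self.threshold else "heap"
        self.log.append(source)
        return (source, size)

    def free(self, block):
        source, size = block
        chunk = size + CHUNK_OVERHEAD
        # Freeing a large mmapped block raises the threshold to its size.
        if source == "mmap" and self.threshold < chunk <= THRESHOLD_MAX:
            self.threshold = chunk


def arena_sources(cycles=4):
    """Allocate and free one arena-sized block `cycles` times."""
    m = ToyMalloc()
    for _ in range(cycles):
        m.free(m.malloc(ARENA_SIZE))
    return m.log
```

Under these assumptions only the first arena-sized request is served by mmap; once it has been freed, the raised threshold sends every later request of the same size to the brk heap, which is the situation described in the message above.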
msg263979 - (view)	Author: Julian Taylor (jtaylor)	Date: 2016-04-22 07:52
glibc's malloc is not obstack; it's not a simple linear heap where one object on top means everything below is not freeable. It also uses MADV_DONTNEED to give sbrk'd memory back to the system. This is the place where MADV_FREE can now be used, as the latter does not guarantee a page fault. But that said, of course you can construct workloads which lead to increased memory usage also with malloc, and maybe Python triggers them more often than other applications. Is there an existing issue showing the problem? It would be a good form of documentation in the source.
msg263980 - (view)	Author: Charles-François Natali (neologix) *	Date: 2016-04-22 08:06
The heap on Linux is still a linear contiguous address space. I agree that MADV_DONTNEED allows returning committed memory back to the VM subsystem, but it is still using a large virtual memory area. Not everyone runs on 64-bit, or can waste address space. Also, not every Unix is Linux. But it might make sense to use malloc on Linux, maybe only on 64-bit.
msg263983 - (view)	Author: STINNER Victor (vstinner) *	Date: 2016-04-22 08:20
I'm not sure that I understood correctly, but if you are proposing to use malloc()/free() instead of mmap()/munmap() to allocate arenas in pymalloc, you have to know that we already use different allocators depending on the platform: https://docs.python.org/dev/c-api/memory.html#the-pymalloc-allocator

By the way, it is possible to modify the arena allocator at runtime: https://docs.python.org/dev/c-api/memory.html#customize-pymalloc-arena-allocator
msg263997 - (view)	Author: Julian Taylor (jtaylor)	Date: 2016-04-22 11:24
I know one can change the allocator, but the default is mmap, which I don't think is a very good choice for the current arena size. All the arguments about fragmentation and memory space also apply to Python's arena allocator itself, and I am not convinced that fragmentation of the libc allocator is a real problem for Python, as Python's allocation pattern is very well behaved _due_ to its own arena allocator. I don't doubt it, but I think it would be very valuable to document the actual real-world use case that triggered this change, just to avoid people stumbling over this again and again. But then I also don't think that anything necessarily needs to be changed either; I have not seen the mmaps being a problem in any profiles of applications I work with.
msg263998 - (view)	Author: Antti Haapala (ztane) *	Date: 2016-04-22 11:35
mmap is not the problem; the eagerness of munmap is a possible source of problems. The munmap eagerness does not show problems in all programs because the arena allocation heuristics do not work as intended. A proper solution in Linux and other operating systems where it is supported is to put the freed arenas in a list, then mark them freed with MADV_FREE. Only if the memory pressure grows will the OS reclaim these. At any time the application can start reusing these arenas/pages; if they're not reclaimed, the old contents will be still present there; if the operating system reclaimed them, they'd be remapped with zeroes. Really the only downside of all this that I can foresee is that `ps/top/whatever` output would show Python using way more memory in its RSS/virt/whatever than it is actually using.
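The free-list idea in this message can be sketched at the Python level with the mmap module. This only illustrates the technique and is not a patch for pymalloc: the class LazyArenaList, its max_cached parameter, and the choice to keep an arena cached even when madvise() fails are all assumptions made for the example, and it needs a Unix platform (the flags argument) with Python 3.8 or later for madvise().

```python
import mmap

ARENA_SIZE = 256 * 1024                       # arena size quoted in this thread
MADV_FREE = getattr(mmap, "MADV_FREE", None)  # None where the platform lacks it


class LazyArenaList:
    """Keep freed arenas mapped but marked MADV_FREE, and reuse them first."""

    def __init__(self, max_cached=4):
        self.free = []            # freed arenas, still mapped
        self.max_cached = max_cached
        self.mapped = 0           # how many fresh mappings were created

    def alloc(self):
        if self.free:
            # Reuse: contents are either the old data or zero-filled pages.
            return self.free.pop()
        self.mapped += 1
        return mmap.mmap(-1, ARENA_SIZE, flags=mmap.MAP_PRIVATE)

    def release(self, arena):
        if len(self.free) >= self.max_cached:
            arena.close()         # cache is full: really unmap
            return
        if MADV_FREE is not None and hasattr(arena, "madvise"):
            try:
                arena.madvise(MADV_FREE)   # let the kernel reclaim under pressure
            except OSError:
                pass              # advice rejected: just keep the mapping
        self.free.append(arena)
```

A short usage example: after `pool = LazyArenaList()`, `a = pool.alloc()` and `pool.release(a)`, the next `pool.alloc()` returns the same mapping without a new mmap call, so `pool.mapped` stays at 1.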
msg264002 - (view)	Author: Julian Taylor (jtaylor)	Date: 2016-04-22 11:45
which is exactly what malloc is already doing; thus my point is that by using malloc we would fulfill your request. But do you have an actual real-world application where this would help? It is pretty easy to figure out: just run the application under perf and see if there is a relevant amount of time spent in page_fault/clear_pages. And as mentioned, you can already change the allocator for arenas at runtime, so you could also try changing it to malloc and see if your application gets any faster.
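One way to try the "switch to malloc and measure" suggestion without writing C is the PYTHONMALLOC environment variable (Python 3.6 and later). Note the substitution: PYTHONMALLOC=malloc bypasses pymalloc entirely rather than replacing only the arena allocator, so this is a rough proxy for the experiment described above, not the same thing. The helper name time_with_allocator and the loop counts are arbitrary.

```python
import os
import subprocess
import sys

# Benchmark statement taken from earlier in this thread.
BENCH = (
    "import timeit;"
    "print(min(timeit.repeat("
    "'tuple(range(2000)) == tuple(range(2000))', number=200, repeat=3)))"
)


def time_with_allocator(name):
    """Run the benchmark in a child interpreter with PYTHONMALLOC=name.

    Returns the best time in seconds for 200 loops.
    """
    env = dict(os.environ, PYTHONMALLOC=name)
    proc = subprocess.run(
        [sys.executable, "-c", BENCH],
        env=env,
        check=True,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    return float(proc.stdout)


if __name__ == "__main__":
    for name in ("pymalloc", "malloc"):
        print(name, time_with_allocator(name))
```

Timings from a micro-benchmark like this say little about a real application; running the actual workload under perf, as suggested in the message, remains the more meaningful check.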
msg264014 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2016-04-22 13:59
All this discussion is in the context of the GNU libc allocator, but please remember that Python works on many platforms, including OS X, Windows, the *BSDs...

History
Date	User	Action	Args
2022-04-11 14:58:28	admin	set	github: 70788
2017-08-23 14:25:57	pitrou	set	stage: needs patch versions: + Python 3.7, - Python 3.6
2016-04-22 13:59:44	pitrou	set	messages: + msg264014
2016-04-22 11:45:14	jtaylor	set	messages: + msg264002
2016-04-22 11:35:10	ztane	set	messages: + msg263998
2016-04-22 11:24:02	jtaylor	set	messages: + msg263997
2016-04-22 08:20:15	vstinner	set	messages: + msg263983
2016-04-22 08:06:30	neologix	set	messages: + msg263980
2016-04-22 07:52:18	jtaylor	set	messages: + msg263979
2016-04-22 06:56:08	neologix	set	messages: + msg263968
2016-04-21 23:18:34	jtaylor	set	messages: + msg263942
2016-04-21 23:16:06	dw	set	messages: + msg263941
2016-04-21 23:11:12	jtaylor	set	messages: + msg263940
2016-04-21 22:39:22	dw	set	nosy: + dw messages: + msg263939
2016-04-21 22:21:31	jtaylor	set	nosy: + jtaylor messages: + msg263937
2016-04-11 22:04:42	bar.harel	set	messages: + msg263212
2016-04-11 21:06:31	pitrou	set	messages: + msg263207
2016-04-11 19:56:10	ztane	set	messages: + msg263202
2016-04-11 19:42:07	ztane	set	messages: + msg263201
2016-04-11 17:35:34	pitrou	set	nosy: + pitrou messages: + msg263192
2016-04-11 17:33:15	pitrou	link	issue26734 superseder
2016-04-11 17:17:47	bar.harel	set	nosy: + bar.harel
2016-04-11 14:49:12	ztane	set	messages: + msg263184
2016-04-11 14:29:05	vstinner	set	messages: + msg263181
2016-04-11 14:13:19	ztane	set	nosy: + ztane messages: + msg263180
2016-03-21 12:05:15	vstinner	set	messages: + msg262120
2016-03-21 11:49:36	pitrou	set	nosy: + neologix
2016-03-21 11:07:42	SilentGhost	set	nosy: + vstinner
2016-03-21 10:51:54	StyXman	create