classification
Title: Use new madvise()'s MADV_FREE on the private heap
Type: enhancement Stage: needs patch
Components: Interpreter Core Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: StyXman, bar.harel, dw, jtaylor, neologix, pitrou, vstinner, ztane
Priority: normal Keywords:

Created on 2016-03-21 10:51 by StyXman, last changed 2017-08-23 14:25 by pitrou.

Messages (23)
msg262117 - (view) Author: Marcos Dione (StyXman) * Date: 2016-03-21 10:51
The Linux kernel's new madvise() MADV_FREE[1] could be used in the memory allocator to signal unused parts of the private heap as such, allowing the kernel to use those pages for resolving lowmem pressure situations. From a LWN article[2]:

[...] Rather than reclaiming the pages immediately, this operation marks them for "lazy freeing" at some future point. Should the kernel run low on memory, these pages will be among the first reclaimed for other uses; should the application try to use such a page after it has been reclaimed, the kernel will give it a new, zero-filled page. But if memory is not tight, pages marked with MADV_FREE will remain in place; a future access to those pages will clear the "lazy free" bit and use the memory that was there before the MADV_FREE call. 

[...] MADV_FREE appears to be aimed at user-space memory allocator implementations. When an application frees a set of pages, the allocator will use an MADV_FREE call to tell the kernel that the contents of those pages no longer matter. Should the application quickly allocate more memory in the same address range, it will use the same pages, thus avoiding much of the overhead of freeing the old pages and allocating and zeroing the new ones. In short, MADV_FREE is meant as a way to say "I don't care about the data in this address range, but I may reuse the address range itself in the near future." 

Also note that this feature already exists in BSD kernels.

--
[1] http://kernelnewbies.org/Linux_4.5#head-42578a3e087d5bcc2940954a38ce794fe2cd642c

[2] https://lwn.net/Articles/590991/
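
A minimal sketch of the call being proposed, written in Python for brevity (CPython's allocator itself is C). It assumes a Unix platform and Python 3.8+, where `mmap.madvise()` exists; `mmap.MADV_FREE` is only defined when the platform provides it. The helper name `lazy_free` is hypothetical and simply reports whether the hint was accepted.

```python
import mmap

ARENA_SIZE = 256 * 1024  # size of a pymalloc arena, as quoted in this thread


def lazy_free(region):
    """Hint that the contents of `region` are no longer needed.

    Returns True if the hint was accepted, False if MADV_FREE is
    unavailable (older kernel, other platform, Python < 3.8).
    """
    advice = getattr(mmap, "MADV_FREE", None)
    if advice is None or not hasattr(region, "madvise"):
        return False
    try:
        region.madvise(advice)
    except OSError:
        # For example EINVAL on Linux kernels older than 4.5.
        return False
    return True


# MADV_FREE only applies to private anonymous memory.
region = mmap.mmap(-1, ARENA_SIZE, flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
region[:4] = b"data"
accepted = lazy_free(region)

# The address range stays valid. The old bytes may still be there, or the
# kernel may already have replaced the page with zeroes.
leftover = bytes(region[:4])

# Writing to the range turns it back into ordinary memory.
region[:4] = b"new!"
```

The point of the sketch is that no munmap()/mmap() pair is needed to reuse the range: the caller must only treat the contents as unspecified after the hint.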
msg262120 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-03-21 12:05
Are you aware of unused memory in the heap memory?

The pymalloc memory allocator uses munmap() to release a whole arena as
soon as the last memory block of an arena is freed.
msg263180 - (view) Author: Antti Haapala (ztane) * Date: 2016-04-11 14:13
... and it turns out that munmapping is not always that smart thing to do: http://stackoverflow.com/questions/36548518/variable-assignment-faster-than-one-liner
msg263181 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-11 14:29
> ... and it turns out that munmapping is not always that smart thing to do: http://stackoverflow.com/questions/36548518/variable-assignment-faster-than-one-liner

py -3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
10000 loops, best of 3: 97.7 usec per loop
py -3 -m timeit "a = tuple(range(2000));  b = tuple(range(2000)); a==b"
10000 loops, best of 3: 70.7 usec per loop

Hum, it looks like this specific benchmark spends a lot of time to allocate one arena and then release it.

Maybe we should keep one "free" arena to avoid the slow mmap/munmap. But it means that we keep 256 KB of unused memory.

Maybe we need an heuristic to release the free arena after N calls to object allocator functions which don't need this free arena.
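
The idea in the two paragraphs above can be sketched as a toy model that only counts the mmap/munmap calls a policy would issue; it does not touch real memory and is not pymalloc code. The class name, the method names, and the value 64 for "N" are all hypothetical. `release_after=0` stands for the current behaviour of releasing an arena as soon as it is empty.

```python
class SpareArenaPolicy:
    """Toy model: count the mmap/munmap calls a release policy would make."""

    def __init__(self, release_after):
        self.release_after = release_after  # the "N" from the comment above
        self.has_spare = False
        self.idle_calls = 0
        self.mmap_calls = 0
        self.munmap_calls = 0

    def need_arena(self):
        """The allocator ran out of room and needs a fresh arena."""
        if self.has_spare:
            self.has_spare = False  # reuse the parked arena, no syscall
        else:
            self.mmap_calls += 1
        self.idle_calls = 0

    def arena_emptied(self):
        """The last block of an arena was freed."""
        if self.has_spare or self.release_after == 0:
            self.munmap_calls += 1  # eager policy, or a spare already exists
        else:
            self.has_spare = True
            self.idle_calls = 0

    def allocation_without_new_arena(self):
        """An allocation was served from existing arenas."""
        if self.has_spare:
            self.idle_calls += 1
            if self.idle_calls >= self.release_after:
                self.has_spare = False
                self.munmap_calls += 1


def run_benchmark_like_loop(policy, rounds):
    """Mimic the benchmark: each round needs one arena, then empties it."""
    for _ in range(rounds):
        policy.need_arena()
        policy.arena_emptied()
    return policy.mmap_calls, policy.munmap_calls


eager = run_benchmark_like_loop(SpareArenaPolicy(release_after=0), 1000)
lazy = run_benchmark_like_loop(SpareArenaPolicy(release_after=64), 1000)
```

In this model the eager policy pays one mmap and one munmap per round, while the policy with one spare arena pays a single mmap for the whole loop, at the cost of holding 256 KB until the spare has gone unused for N allocations.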
msg263184 - (view) Author: Antti Haapala (ztane) * Date: 2016-04-11 14:49
> Maybe we need an heuristic to release the free arena after N calls to object allocator functions which don't need this free arena.

That'd be my thought; again I believe that `madvise` could be useful there; now `mmap`/`munmap` I believe is particularly slow because it actually needs to supply 256kbytes of *zeroed* pages.
msg263192 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-04-11 17:35
> ... and it turns out that munmapping is not always that smart thing to do:

I don't think a silly benchmark says anything about the efficiency of our allocation strategy. If you have a real-world use case where this turns up, then please post about it.
msg263201 - (view) Author: Antti Haapala (ztane) * Date: 2016-04-11 19:42
I said that *munmapping* is not the smart thing to do: and it is not, if you're going to *mmap* soon again.
msg263202 - (view) Author: Antti Haapala (ztane) * Date: 2016-04-11 19:56
Also what is important to notice is that the behaviour occurs *exactly* because the current heuristics *work*; the allocations were successfully organized so that one arena could be freed as soon as possible. The question is whether it is sane to try to free the few bits of free memory ASAP. Say you're now holding 100M of memory: it does not often matter much if you hold the 100M of memory for *one second longer* than you actually ended up needing.
msg263207 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-04-11 21:06
Another question is how often this situation occurs in practice and whether it's worth spending some bits, CPU cycles and developer time on "fixing" this.
msg263212 - (view) Author: Bar Harel (bar.harel) * Date: 2016-04-11 22:04
Any idea how to test it then? I found this happening by chance because I care about efficiency too much. We can't just stick timeit in random areas and hope to get results.
msg263937 - (view) Author: Julian Taylor (jtaylor) Date: 2016-04-21 22:21
The simplest way to fix this would be to use malloc instead of mmap in the allocator; then you also get MADV_FREE for free when malloc uses it.
The rationale for using mmap is kind of weak, the source just says "heap fragmentation". The usual argument for using mmap is not that but the instant return of memory to the system, quite the opposite of what the Python memory pool does.
msg263939 - (view) Author: David Wilson (dw) * Date: 2016-04-21 22:39
@Julian note that ARENA_SIZE is double the threshold after which at least glibc resorts to calling mmap directly, so using malloc in place of mmap on at least Linux would have zero effect
msg263940 - (view) Author: Julian Taylor (jtaylor) Date: 2016-04-21 23:11
ARENA_SIZE is 256kb, the threshold in glibc is up to 32 MB
msg263941 - (view) Author: David Wilson (dw) * Date: 2016-04-21 23:16
It defaults to 128kb, and messing with global state like the system allocator is a fine way to tempt regressions in third party code
msg263942 - (view) Author: Julian Taylor (jtaylor) Date: 2016-04-21 23:18
it defaulted to 128kb ten years ago, it has been a dynamic threshold for ages.
msg263968 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2016-04-22 06:56
> Julian Taylor added the comment:
>
> it defaulted to 128kb ten years ago, it has been a dynamic threshold for ages.

Indeed, and that's what encouraged switching the allocator to use mmap.
The problem with dynamic mmap threshold is that since the Python
allocator uses fixed-size arenas, basically malloc always ends up
allocating from the heap (brk).
Which means that given that we don't use a - compacting - garbage
collector, after a while the heap would end up quite fragmented, or
never shrink: for example let's say you allocate 1GB - on the heap -
and then you free them, but a single object is allocated at the top
of the heap, your heap never shrinks back.
This has bitten people (and myself a couple times at work).
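
Why fixed-size arenas end up on the heap under a dynamic threshold can be illustrated with a deliberately simplified model. This is not glibc's code: the 128 KB default and the 32 MB ceiling are the figures quoted in this thread, while the 4096-byte page size, the 16-byte chunk header and the exact update rule are assumptions made for the sketch.

```python
PAGE_SIZE = 4096                    # assumed page size
CHUNK_HEADER = 16                   # assumed per-chunk bookkeeping overhead
DEFAULT_THRESHOLD = 128 * 1024      # "defaulted to 128kb"
MAX_THRESHOLD = 32 * 1024 * 1024    # "up to 32 MB"
ARENA_SIZE = 256 * 1024


class DynamicThresholdModel:
    """Toy model of a malloc whose mmap threshold grows on free()."""

    def __init__(self):
        self.threshold = DEFAULT_THRESHOLD

    def malloc(self, size):
        """Return (source, chunk_size) for a request of `size` bytes."""
        if size >= self.threshold:
            # Large request: gets its own mapping, rounded up to whole pages.
            needed = size + CHUNK_HEADER
            mapped = -(-needed // PAGE_SIZE) * PAGE_SIZE
            return ("mmap", mapped)
        return ("heap", size + CHUNK_HEADER)

    def free(self, chunk):
        source, chunk_size = chunk
        # Freeing a mapped chunk bigger than the threshold raises the
        # threshold, so later requests of that size go to the heap.
        if source == "mmap" and self.threshold < chunk_size <= MAX_THRESHOLD:
            self.threshold = chunk_size


model = DynamicThresholdModel()
first = model.malloc(ARENA_SIZE)   # above the default threshold: mapped
model.free(first)                  # threshold now exceeds ARENA_SIZE
second = model.malloc(ARENA_SIZE)  # same request, now served from the heap
```

Under these assumptions only the very first arena is mapped; once it has been freed, every later arena of the same size falls below the raised threshold.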

Now, I see several options:
- revert to using malloc, but this will re-introduce the original problem
- build some form of hysteresis in the arena allocation
- somewhat orthogonally, I'd be interested to see if we couldn't
increase the arena size
msg263979 - (view) Author: Julian Taylor (jtaylor) Date: 2016-04-22 07:52
glibc's malloc is not obstack; it's not a simple linear heap where one object on top means everything below is not freeable. It also uses MADV_DONTNEED to give sbrk'd memory back to the system. This is the place where MADV_FREE can now be used, as the latter does not guarantee a page fault.
But that said, of course you can construct workloads which lead to increased memory usage also with malloc, and maybe Python triggers them more often than other applications. Is there an existing issue showing the problem? It would be a good form of documentation in the source.
msg263980 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2016-04-22 08:06
The heap on Linux is still a linear contiguous *address space*. I
agree that MADV_DONTNEED allows returning committed memory back to
the VM subsystem, but it is still using a large virtual memory area.
Not everyone runs on 64-bit, or can waste address space.
Also, not every Unix is Linux.

But it might make sense to use malloc on Linux, maybe only on 64-bit.
msg263983 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-04-22 08:20
I'm not sure that I understood correctly, but if you are proposing to use malloc()/free() instead of mmap()/munmap() to allocate arenas in pymalloc, you have to know that we already use different allocators depending on the platform:
https://docs.python.org/dev/c-api/memory.html#the-pymalloc-allocator

By the way, it is possible to modify the arena allocator at runtime:
https://docs.python.org/dev/c-api/memory.html#customize-pymalloc-arena-allocator
msg263997 - (view) Author: Julian Taylor (jtaylor) Date: 2016-04-22 11:24
I know one can change the allocator, but the default is mmap which I don't think is a very good choice for the current arena size.
All the arguments about fragmentation and memory space also apply to Python's arena allocator itself, and I am not convinced that fragmentation of the libc allocator is a real problem for Python, as Python's allocation pattern is very well behaved _due_ to its own arena allocator. I don't doubt it, but I think it would be very valuable to document the actual real world use case that triggered this change, just to avoid people stumbling over this again and again.

But then I also don't think that anything necessarily needs to be changed either; I have not seen the mmaps being a problem in any profiles of applications I work with.
msg263998 - (view) Author: Antti Haapala (ztane) * Date: 2016-04-22 11:35
mmap is not the problem; the eagerness of munmap is a possible source of problems.

The munmap eagerness does not show problems in all programs because the arena allocation heuristics do not work as intended. A proper solution in Linux and other operating systems where it is supported, is to put the freed arenas in a list, then mark freed with MADV_FREE. Now if the memory pressure grows, only *then* will the OS reclaim these. At any time the application can start reusing these arenas/pages; if they're not reclaimed, the old contents will be still present there; if operating system reclaimed them, they'd be remapped with zeroes.

Really the only downside of all this that I can foresee is that `ps/top/whatever` output would see Python using way more memory in its RSS/virt/whatever than it is actually using.
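
The scheme described above can be sketched as follows. It is an illustration in Python rather than a patch to the C allocator, and it assumes a Unix platform; `LazyArenaPool` and its `max_parked` bound are hypothetical (the comment above does not specify a limit on the list). Where `mmap.MADV_FREE` is not available the pool still works, it just keeps the parked pages resident.

```python
import mmap

ARENA_SIZE = 256 * 1024
MADV_FREE = getattr(mmap, "MADV_FREE", None)  # None where unsupported


class LazyArenaPool:
    """Hypothetical pool that parks freed arenas instead of unmapping them."""

    def __init__(self, max_parked=4):
        self.max_parked = max_parked
        self.parked = []
        self.mmap_calls = 0

    def alloc(self):
        if self.parked:
            # Reuse a parked arena. Its contents are unspecified: either the
            # old bytes or zeroes, depending on whether the OS reclaimed it.
            return self.parked.pop()
        self.mmap_calls += 1
        return mmap.mmap(-1, ARENA_SIZE,
                         flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)

    def free(self, arena):
        if len(self.parked) >= self.max_parked:
            arena.close()  # munmap: give the address range back as well
            return
        if MADV_FREE is not None and hasattr(arena, "madvise"):
            try:
                arena.madvise(MADV_FREE)
            except OSError:
                pass  # hint rejected; the arena is still usable
        self.parked.append(arena)


pool = LazyArenaPool()
for _ in range(100):
    arena = pool.alloc()
    arena[:5] = b"hello"
    pool.free(arena)
```

In this loop the pool maps memory once and reuses the same arena for the remaining 99 rounds, which is the allocate-then-release pattern the benchmark earlier in the thread exercises.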
msg264002 - (view) Author: Julian Taylor (jtaylor) Date: 2016-04-22 11:45
which is exactly what malloc is already doing, thus my point is that by using malloc we would fulfill your request.

But do you have an actual real world application where this would help?
it is pretty easy to figure out, just run the application under perf and see if there is a relevant amount of time spent in page_fault/clear_pages.

And as mentioned you can already change the allocator for arenas at runtime, so you could also try changing it to malloc and see if your application gets any faster.
msg264014 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-04-22 13:59
All this discussion is in the context of the GNU libc allocator, but please remember that Python works on many platforms, including OS X, Windows, the *BSDs...
History
Date User Action Args
2017-08-23 14:25:57  pitrou  set  stage: needs patch
versions: + Python 3.7, - Python 3.6
2016-04-22 13:59:44  pitrou  set  messages: + msg264014
2016-04-22 11:45:14  jtaylor  set  messages: + msg264002
2016-04-22 11:35:10  ztane  set  messages: + msg263998
2016-04-22 11:24:02  jtaylor  set  messages: + msg263997
2016-04-22 08:20:15  vstinner  set  messages: + msg263983
2016-04-22 08:06:30  neologix  set  messages: + msg263980
2016-04-22 07:52:18  jtaylor  set  messages: + msg263979
2016-04-22 06:56:08  neologix  set  messages: + msg263968
2016-04-21 23:18:34  jtaylor  set  messages: + msg263942
2016-04-21 23:16:06  dw  set  messages: + msg263941
2016-04-21 23:11:12  jtaylor  set  messages: + msg263940
2016-04-21 22:39:22  dw  set  nosy: + dw
messages: + msg263939
2016-04-21 22:21:31  jtaylor  set  nosy: + jtaylor
messages: + msg263937
2016-04-11 22:04:42  bar.harel  set  messages: + msg263212
2016-04-11 21:06:31  pitrou  set  messages: + msg263207
2016-04-11 19:56:10  ztane  set  messages: + msg263202
2016-04-11 19:42:07  ztane  set  messages: + msg263201
2016-04-11 17:35:34  pitrou  set  nosy: + pitrou
messages: + msg263192
2016-04-11 17:33:15  pitrou  link  issue26734 superseder
2016-04-11 17:17:47  bar.harel  set  nosy: + bar.harel
2016-04-11 14:49:12  ztane  set  messages: + msg263184
2016-04-11 14:29:05  vstinner  set  messages: + msg263181
2016-04-11 14:13:19  ztane  set  nosy: + ztane
messages: + msg263180
2016-03-21 12:05:15  vstinner  set  messages: + msg262120
2016-03-21 11:49:36  pitrou  set  nosy: + neologix
2016-03-21 11:07:42  SilentGhost  set  nosy: + vstinner
2016-03-21 10:51:54  StyXman  create