classification
Title: Use VirtualAlloc to allocate memory arenas
Type: resource usage
Stage: commit review
Components: Interpreter Core, Windows
Versions: Python 3.4
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: brian.curtin, giampaolo.rodola, haypo, loewis, neologix, pitrou, python-dev, tim.golden, trent
Priority: low
Keywords: patch

Created on 2011-11-26 13:12 by pitrou, last changed 2013-06-27 10:24 by loewis. This issue is now closed.

Files
File name Uploaded
va.patch pitrou, 2011-11-26 13:12
tuples.py loewis, 2012-06-21 17:10
va.diff loewis, 2012-06-21 17:11
Messages (23)
msg148399 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-26 13:12
Similar to issue #11849, this patch proposes to use VirtualAlloc/VirtualFree to allocate the Python allocator's memory arenas (rather than malloc() / free()). It might help release more memory if there is some fragmentation, although I don't know how Microsoft's malloc() works.
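For readers unfamiliar with the calls involved, here is a minimal sketch of the idea. It is not the patch itself (va.patch is C code inside obmalloc). It assumes a 256 KiB arena and uses ctypes to call VirtualAlloc/VirtualFree on Windows; on other platforms it falls back to an anonymous mmap so the snippet still runs. The helper names alloc_arena and free_arena are hypothetical.

```python
import ctypes
import mmap
import sys

ARENA_SIZE = 256 * 1024  # assumed arena size (256 KiB)

# Win32 constants for VirtualAlloc / VirtualFree.
MEM_COMMIT = 0x1000
MEM_RESERVE = 0x2000
MEM_RELEASE = 0x8000
PAGE_READWRITE = 0x04


def alloc_arena(size=ARENA_SIZE):
    """Hypothetical helper: get `size` bytes straight from the OS."""
    if sys.platform == "win32":
        k32 = ctypes.WinDLL("kernel32", use_last_error=True)
        k32.VirtualAlloc.restype = ctypes.c_void_p
        k32.VirtualAlloc.argtypes = (ctypes.c_void_p, ctypes.c_size_t,
                                     ctypes.c_ulong, ctypes.c_ulong)
        address = k32.VirtualAlloc(None, size,
                                   MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE)
        if not address:
            raise ctypes.WinError(ctypes.get_last_error())
        return address
    # Non-Windows fallback so the sketch runs anywhere: anonymous mapping.
    return mmap.mmap(-1, size)


def free_arena(arena):
    """Hypothetical helper: hand the block back to the OS."""
    if sys.platform == "win32":
        k32 = ctypes.WinDLL("kernel32", use_last_error=True)
        k32.VirtualFree.restype = ctypes.c_int
        k32.VirtualFree.argtypes = (ctypes.c_void_p, ctypes.c_size_t,
                                    ctypes.c_ulong)
        # With MEM_RELEASE the size argument must be 0.
        if not k32.VirtualFree(arena, 0, MEM_RELEASE):
            raise ctypes.WinError(ctypes.get_last_error())
    else:
        arena.close()
```

Usage would be `arena = alloc_arena()` followed by `free_arena(arena)`. The point of going to the OS directly is that a freed arena is returned as whole pages and does not sit in a heap free list.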
msg148605 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-29 20:48
The patch looks good to me.

To study Microsoft's malloc, see VC\crt\src\malloc.c. Typically, it uses HeapAlloc from the CRT heap, unless it's in 32-bit mode, and __active_heap is either __V6_HEAP or __V5_HEAP. This is determined at startup by __heap_select, inspecting an environment variable __MSVCRT_HEAP_SELECT. If that's not set, the CRT heap is used.

The CRT heap, in turn, is created with HeapCreate (no flags).

As an alternative approach, Python could consider completely dropping obmalloc on Windows, and using a Windows Low Fragmentation Heap (LFH) instead, with HEAP_NO_SERIALIZE (as the heap would be protected by the GIL).

If we take the route proposed by this patch, I recommend also dropping all other CRT malloc() calls in Python, and make allocations from the process heap instead (that's a separate issue, though).
msg148611 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-29 21:02
> The patch looks good to me.
> 
> To study Microsoft's malloc, see VC\crt\src\malloc.c. Typically, it
> uses HeapAlloc from the CRT heap, unless it's in 32-bit mode, and
> __active_heap is either __V6_HEAP or __V5_HEAP. This is determined at
> startup by __heap_select, inspecting an environment variable
> __MSVCRT_HEAP_SELECT. If that's not set, the CRT heap is used.

Ah, right, I guessed it was using HeapAlloc indeed. What would be more
interesting is how HeapAlloc works :)

I think it would be nice to know whether the patch has a chance of being
useful before committing it. I did it as a thought experiment after the
similar change was committed for Unix, but I'm not an expert in Windows
internals. Perhaps HeapAlloc deals fine with fragmentation? Tim, Brian,
do you know anything about this?

> As an alternative approach, Python could consider completely dropping
> obmalloc on Windows, and using a Windows Low Fragmentation Heap (LFH)
> instead, with HEAP_NO_SERIALIZE (as the heap would be protected by the
> GIL).

I'm not sure that would serve the same purpose as obmalloc, which
(AFAIU) is very fast at the expense of compactness.
msg148612 - (view) Author: Tim Golden (tim.golden) (Python committer) Date: 2011-11-29 21:04
'fraid not. I've never had to dig into the allocation stuff at this level.
msg148621 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-11-29 22:13
> I think it would be nice to know whether the patch has a chance of being
> useful before committing it. I did it as a thought experiment after the
> similar change was committed for Unix, but I'm not an expert in Windows
> internals. Perhaps HeapAlloc deals fine with fragmentation?

Unfortunately, the implementation of HeapAlloc isn't really documented.
If ReactOS is right, it looks like this: http://bit.ly/t2NPHh

Blocks < 1024 bytes are allocated from per-size free lists.

Blocks < Heap->VirtualMemoryThreshold are allocated through the free
list for variable-sized blocks of the heap.

Other blocks are allocated through ZwAllocateVirtualMemory, adding
sizeof(HEAP_VIRTUAL_ALLOC_ENTRY) at the beginning. I think this header
will cause malloc() to allocate one extra page in front of an arena.
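To make the three paths and the extra-page effect concrete, here is a rough sketch. The 4096-byte page and 256 KiB arena are assumptions, and HEADER_SIZE and VIRTUAL_MEMORY_THRESHOLD are placeholder values I made up for illustration; they are not the real sizeof(HEAP_VIRTUAL_ALLOC_ENTRY) or Heap->VirtualMemoryThreshold.

```python
PAGE_SIZE = 4096                       # assumed page size
ARENA_SIZE = 256 * 1024                # assumed arena size (256 KiB)
HEADER_SIZE = 32                       # placeholder, not the real header size
VIRTUAL_MEMORY_THRESHOLD = 128 * 1024  # placeholder, not the real threshold


def heap_path(size, threshold=VIRTUAL_MEMORY_THRESHOLD):
    """Name the allocation path described above for a request of `size` bytes."""
    if size < 1024:
        return "per-size free list"
    if size < threshold:
        return "variable-size free list"
    return "ZwAllocateVirtualMemory"


def pages_needed(nbytes, page_size=PAGE_SIZE):
    """Round a byte count up to whole pages."""
    return -(-nbytes // page_size)


# An arena alone fits exactly in 64 pages; any header pushes it to 65.
bare_pages = pages_needed(ARENA_SIZE)
with_header_pages = pages_needed(ARENA_SIZE + HEADER_SIZE)
```

Under these assumptions an arena requested through malloc() lands on the ZwAllocateVirtualMemory path and costs one page more than the same arena requested with VirtualAlloc.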

>> As an alternative approach, Python could consider completely dropping
>> obmalloc on Windows, and using a Windows Low Fragmentation Heap (LFH)
>> instead, with HEAP_NO_SERIALIZE (as the heap would be protected by the
>> GIL).
> 
> I'm not sure that would serve the same purpose as obmalloc, which
> (AFAIU) is very fast at the expense of compactness.

I'd expect LFH heaps to be very fast as well. The major difference I can
see is that blocks in the LFH heap still have an 8-byte header (possibly
more on a 64-bit system). So I wouldn't expect obmalloc to save any time
over LFH, but it may save a (possibly relevant) amount of memory.
msg148623 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2011-11-29 22:39
> Tim, Brian, do you know anything about this?

Unfortunately, no. It's on my todo list of things to understand but I don't see that happening in the near future.

I'm willing to run tests or benchmarks for this issue, but that's likely the most I can provide.
msg148625 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-11-29 23:12
On Tuesday 29 November 2011 at 22:39 +0000, Brian Curtin wrote:
> Brian Curtin <brian@python.org> added the comment:
> 
> > Tim, Brian, do you know anything about this?
> 
> Unfortunately, no. It's on my todo list of things to understand but I
> don't see that happening in the near future.
> 
> I'm willing to run tests or benchmarks for this issue, but that's
> likely the most I can provide.

Benchmarks would be nice indeed.
msg163350 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-06-21 17:10
Here is a benchmark. Based on my assumption that this patch may reduce allocation overhead by minimizing padding and fragmentation, it allocates a lot of memory and then waits 20 s so you can check the "Commit Size" of the process in Process Explorer.
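For readers without the attachment, a benchmark of this kind might look roughly like the sketch below. This is a hypothetical stand-in, not the contents of tuples.py: the function names, the tuple shape and the default count are all assumptions.

```python
import time


def allocate_tuples(count):
    """Build `count` distinct small tuples and keep them all alive."""
    return [(i, i + 1, i + 2) for i in range(count)]


def run(count=5 * 10**6, pause_seconds=20):
    """Allocate, then pause so the commit size can be read externally."""
    data = allocate_tuples(count)
    time.sleep(pause_seconds)
    return len(data)
```

Calling `run()` and reading the process's commit size during the pause, once per build, gives the two figures to compare.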

For the current 3.3 tree, in 32-bit mode, on a 64-bit Windows 7 installation, I get 464,756K for the unpatched version, and 450,436K for the patched version.

This is a 3% saving, which seems good enough for me.
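The percentage follows directly from the two commit sizes quoted above; the variable names here are just for illustration.

```python
unpatched_k = 464756   # commit size without the patch, in K
patched_k = 450436     # commit size with the patch, in K

saved_k = unpatched_k - patched_k
saved_percent = 100.0 * saved_k / unpatched_k

print("saved %d K (%.1f%%)" % (saved_k, saved_percent))
# prints: saved 14320 K (3.1%)
```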
msg163351 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-06-21 17:11
Here is an updated patch.
msg189760 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-05-21 14:21
Martin, do you think your latest patch can be committed?
msg189771 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-05-21 16:24
Antoine's request for benchmarks still stands. I continue to think that the patch should be applied even in the absence of benchmarks, but without a third opinion on this specific point, I don't think it can be.
msg189824 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2013-05-22 17:03
I can't speak for Antoine, but I guess that the result of pybench
would be enough to make sure it doesn't introduce any regression
(which would be *really* surprising).
As for the memory savings, the benchmark you posted earlier is
conclusive enough IMO (especially since it can be difficult to
come up with a scheme leading to heap fragmentation).
msg189825 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-05-22 17:24
I asked for benchmarks because I don't know anything about Windows virtual memory management, but if other people think this patch should go in then it's fine.

The main point of using VirtualAlloc/VirtualFree was, in my mind, to allow *releasing* memory in more cases than when relying on free() (assuming Windows uses some sbrk() equivalent). But perhaps Windows is already tuned to release memory on most free() calls.
msg189850 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-05-23 06:34
Ah ok. I guess tuples.py then indeed demonstrates a saving. I'll apply the patch.
msg189876 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-05-23 20:07
See also issue #3329, which proposes an API to define memory allocators.
msg190427 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-05-31 23:29
I tested VirtualAlloc/VirtualFree versus malloc/free on Windows 7 SP1 64-bit. In my small test, when using VirtualAlloc/VirtualFree, the memory peak is lower (e.g. 58.1 MB vs 59.0 MB), and the memory usage is the same or sometimes lower. The difference is small: the malloc() implementation on Windows 7 is efficient! But I am in favor of using VirtualAlloc/VirtualFree because it is the native API and the gain may be bigger in a real application.

--

I used the following script for my test:
https://bitbucket.org/haypo/misc/raw/98eb42a3ed2144141d62c75e3d07933839fe2a0c/python/python_memleak.py

I reused get_process_mem_info() code from psutil to get current and peak memory usage (I failed to install psutil, I don't understand why).

I also replaced func() in my script with the code from tuples.py to create many tuples.

--

Python < 3.3 wastes a lot of memory with python_memleak.py. Python 3.3 behaves much better thanks to the usage of mmap() on Linux, and the fixed threshold on 64-bit (min=512 bytes, instead of 256).
msg191252 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-06-16 02:05
Martin von Loewis: "If we take the route proposed by this patch, I recommend also dropping all other CRT malloc() calls in Python, and make allocations from the process heap instead (that's a separate issue, though)."

=> see issue #18203
msg191253 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-06-16 02:11
haypo> I tested VirtualAlloc/VirtualFree versus malloc/free
haypo> on Windows 7 SP1 64-bit. On my small test, ...

I realized that I was not precise: I tested the attached va.diff patch. I didn't try to replace malloc() completely.
msg191461 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-06-19 12:03
> Ah ok. I guess tuples.py then indeed demonstrates a saving. I'll apply the patch.

According to my test, the memory usage is a little bit better with the patch. So Martin, do you plan to commit the patch?

Or is a benchmark required? Or should we first check the Low Fragmentation Allocator?

I plan to test the Low Fragmentation Allocator, at least on Windows 7. But I prefer to do it later; I'm working on PEP 445 right now.
msg191504 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-06-20 08:24
> I plan to test the Low Fragmentation Allocator, at least on Windows 7.

I don't think it can be any better than raw mmap() / VirtualAlloc()...
msg191506 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-06-20 08:45
>> I plan to test the Low Fragmentation Allocator, at least on Windows 7.
> I don't think it can be any better than raw mmap() / VirtualAlloc()...

I mean using the Low Fragmentation Allocator for PyObject_Malloc()
instead of pymalloc.

Martin wrote (msg148605):
"As an alternative approach, Python could consider completely dropping
obmalloc on Windows, and using a Windows Low Fragmentation Heap (LFH)
instead, with HEAP_NO_SERIALIZE (as the heap would be protected by the
GIL)."
msg191507 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2013-06-20 10:50
Ok, I'm going to commit this patch. Any further revisions (including reversions) can be done then.
msg191939 - (view) Author: Roundup Robot (python-dev) Date: 2013-06-27 10:24
New changeset 44f455e6163d by Martin v. Löwis in branch 'default':
Issue #13483: Use VirtualAlloc in obmalloc on Windows.
http://hg.python.org/cpython/rev/44f455e6163d
History
Date User Action Args
2013-06-27 10:24:52 loewis set status: open -> closed; resolution: fixed
2013-06-27 10:24:20 python-dev set nosy: + python-dev; messages: + msg191939
2013-06-20 10:50:56 loewis set messages: + msg191507
2013-06-20 08:45:06 haypo set messages: + msg191506
2013-06-20 08:24:56 pitrou set messages: + msg191504
2013-06-19 12:03:01 haypo set messages: + msg191461
2013-06-17 22:42:29 trent set nosy: + trent
2013-06-16 02:11:36 haypo set messages: + msg191253
2013-06-16 02:05:20 haypo set messages: + msg191252
2013-06-06 14:17:53 giampaolo.rodola set nosy: + giampaolo.rodola
2013-05-31 23:29:07 haypo set messages: + msg190427
2013-05-23 20:07:33 haypo set nosy: + haypo; messages: + msg189876
2013-05-23 06:34:07 loewis set messages: + msg189850
2013-05-22 17:24:57 pitrou set stage: commit review; messages: + msg189825; versions: + Python 3.4, - Python 3.3
2013-05-22 17:03:57 neologix set messages: + msg189824
2013-05-21 16:24:28 loewis set messages: + msg189771
2013-05-21 14:21:57 neologix set messages: + msg189760
2012-06-21 17:11:04 loewis set files: + va.diff; messages: + msg163351
2012-06-21 17:10:39 loewis set files: + tuples.py; messages: + msg163350
2011-11-29 23:12:56 pitrou set messages: + msg148625
2011-11-29 22:39:18 brian.curtin set messages: + msg148623
2011-11-29 22:13:23 loewis set messages: + msg148621
2011-11-29 21:04:01 tim.golden set messages: + msg148612
2011-11-29 21:02:08 pitrou set messages: + msg148611
2011-11-29 20:48:33 loewis set nosy: + loewis; messages: + msg148605
2011-11-26 13:12:42 pitrou create