classification
Title: Customized malloc implementation on SunOS and AIX
Type: resource usage
Stage:
Components: Interpreter Core
Versions: Python 3.1, Python 2.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: loewis, pitrou, sable, tim_one (4)
Priority: normal
Keywords: patch

Created on 2008-08-08 10:11 by sable, last changed 2008-09-10 16:33 by sable.

Files
File name Uploaded Description
customized_malloc_SUN.pdf sable, 2008-08-08 10:11
customized_malloc_AIX.pdf sable, 2008-08-08 10:13
patch_dlmalloc.diff sable, 2008-08-08 10:15
patch_dlmalloc2.diff sable, 2008-09-09 15:58
patch_dlmalloc3.diff sable, 2008-09-10 16:33
Messages (13)
msg70897 - (view) Author: Sébastien Sablé (sable) Date: 2008-08-08 10:11
Hi,

We run a big application mostly written in Python (with Pyrex/C
extensions) on different systems including Linux, SunOS and AIX.

The memory footprint of our application on Linux is fine; however we
found that on AIX and SunOS, any memory that has been allocated by our
application at some stage will never be freed at the system level.

After doing some analysis (see the 2 attached pdf documents), we found
that this is linked to the implementation of malloc on those various
systems:

The malloc used on Linux (glibc) is based on dlmalloc as described in
this document:
http://g.oswego.edu/dl/html/malloc.html

This implementation will use sbrk to allocate small chunks of memory,
but it will use mmap to allocate big chunks. This ensures that the
memory will actually get freed when free is called.

AIX and Sun have a more naive malloc implementation, so that the memory
allocated by an application through malloc is never actually freed until
the application leaves (this behavior has been confirmed by some experts
at IBM and Sun when we asked them for some feedback on this problem -
there is a 'memory disclaim' option on AIX but it is disabled by default
as it brings some major performance penalties).

For long running Python applications which may allocate a lot of memory
at some stage, this is a major drawback.

In order to bypass this limitation of the system on AIX and SunOS, we
have modified Python so that it will use the customized malloc
implementation dlmalloc like in glibc (see attached patch) - dlmalloc is
released in the public domain.

This patch adds a --enable-dlmalloc option to configure. When activated,
we observed a dramatic reduction of the memory used by our application.
I think many AIX and SunOS Python users could be interested in such an
improvement.

--
Sébastien Sablé
Sungard
msg70908 - (view) Author: Antoine Pitrou (pitrou) Date: 2008-08-08 19:11
This is very interesting, although it should probably go through
discussion on python-dev since it involves integrating a big chunk of
external code.
msg70920 - (view) Author: Martin v. Löwis (loewis) Date: 2008-08-08 22:46
I cannot quite see why the problem is serious: even though the memory is
not returned to the system, it will be swapped out to the swap file, so
it doesn't consume any real memory (just swap space).

I don't think Python should integrate a separate malloc implementation.
Instead, Python's own memory allocator (obmalloc) should be changed to
directly use the virtual memory interfaces of the operating system (i.e.
mmap), bypassing the malloc of the C library.

So I'm -1 on this patch.
msg70929 - (view) Author: Antoine Pitrou (pitrou) Date: 2008-08-09 10:57
On Friday 08 August 2008 at 22:46 +0000, Martin v. Löwis wrote:
> Instead, Python's own memory allocator (obmalloc) should be changed to
> directly use the virtual memory interfaces of the operating system (i.e.
> mmap), bypassing the malloc of the C library.

How would that interact with fork()?
msg70940 - (view) Author: Martin v. Löwis (loewis) Date: 2008-08-09 17:25
>> Instead, Python's own memory allocator (obmalloc) should be changed to
>> directly use the virtual memory interfaces of the operating system (i.e.
>> mmap), bypassing the malloc of the C library.
> 
> How would that interact with fork()?

Nicely, why do you ask? Any anonymous mapping will be copied
(typically COW) to the child process. In fact, malloc itself
uses anonymous mappings (at least on Linux).
msg70945 - (view) Author: Antoine Pitrou (pitrou) Date: 2008-08-09 17:53
On Saturday 09 August 2008 at 17:28 +0000, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> >> Instead, Python's own memory allocator (obmalloc) should be changed to
> >> directly use the virtual memory interfaces of the operating system (i.e.
> >> mmap), bypassing the malloc of the C library.
> > 
> > How would that interact with fork()?
> 
> Nicely, why do you ask?

Because I didn't know :)
But looking at the dlmalloc implementation bundled in the patch, it
seems that using mmap/munmap (or VirtualAlloc/VirtualFree under Windows)
should be ok.

Do you think we should create a separate issue for this improvement? It
could also solve #3531.
msg72382 - (view) Author: Sébastien Sablé (sable) Date: 2008-09-03 10:28
[sorry for the late reply, I have been on holidays]

Martin:
you are right that this memory is moved to swap and does not consume any
"real" memory; however we decided to work on this patch because we
observed some performance degradation in our application due to this
memory not being deallocated correctly.

Since then we have done some quite extensive tests (with the help of a
consultant at Sun): they have shown that this unnecessary swapping has a
noticeable impact on performance and, at worst, when the system memory
is saturated, can bring a server to its knees for several
minutes (we are talking about top-of-the-line SunOS and AIX servers with
hundreds of GB of memory).

I will write a complete document explaining the tests and observations
that we made, but this memory issue was critical for us given the
performance degradation it was causing on our production servers.

Concerning dlmalloc, you are right that it would be cleaner to improve
obmalloc so that it uses mmap when necessary, instead of adding another
layer with dlmalloc (even though that is what actually currently happens
on Linux systems where dlmalloc is integrated in libc).

I will try to do that patch in the coming weeks (obmalloc mostly allocates
some 256KB arenas so it should nearly always use mmap).
msg72750 - (view) Author: Martin v. Löwis (loewis) Date: 2008-09-07 19:45
> I will try to do that patch in the coming weeks (obmalloc mostly allocates
> some 256KB arenas so it should nearly always use mmap).

Exactly so. If you can, please also consider supporting Windows, in the
same way.

Anything in obmalloc that is not arena space should continue to come
from malloc, I believe.
msg72758 - (view) Author: Tim Peters (tim_one) Date: 2008-09-08 00:52
> Anything in obmalloc that is not arena space should continue to come
> from malloc, I believe.

Sorry, but I don't understand why arena space should be different.  If a
platform's libc implementers think mmap should be used to obtain 256KB
chunks (i.e., arenas), then surely they implement the platform malloc to
defer to mmap in such cases.  If they don't but "should", then bugging
the platform vendor to improve the system malloc in this respect is the
best idea (then all apps on the platform benefit, and Python stays simpler).

OTOH, if for some compelling reason it's believed Python knows better
than platform vendors, then obmalloc should be uglied-up on all paths to
make the enlightened choice.
msg72761 - (view) Author: Martin v. Löwis (loewis) Date: 2008-09-08 03:21
> OTOH, if for some compelling reason it's believed Python knows better
> than platform vendors, then obmalloc should be uglied-up on all paths to
> make the enlightened choice.

I'm proposing that obmalloc is changed to know better than system malloc
on systems supporting anonymous mmap, and Windows, and that the call

   malloc(ARENA_SIZE)

is replaced by mmap. This has the advantage of doing better than system
malloc on Solaris, plus it also might guarantee that arenas will be
POOL_SIZE aligned.

OTOH, the calls

  realloc(arenas, nbytes)
  malloc(nbytes)

should continue to go to system malloc, because they are typically
not multiples of the system page size.
msg72762 - (view) Author: Tim Peters (tim_one) Date: 2008-09-08 03:26
I have to admit that if Python /didn't/ know better than platform libc
implementers in some cases, there would be no point to having obmalloc
at all :-(

What you (Martin) suggest is reasonable enough.
msg72876 - (view) Author: Sébastien Sablé (sable) Date: 2008-09-09 15:58
Here is a new patch so that pymalloc can be combined with dlmalloc.

I first added the --with-pymalloc-mmap option to configure.in which
ensures that pymalloc arenas are allocated through mmap when possible.

However I found this was not enough: PyObject_Malloc uses arenas only
when handling objects smaller than 256 bytes. For bigger objects, it
relies directly on the system malloc. There are also some big buffers
which can be directly allocated through PyMem_MALLOC.

This patch can be activated by compiling Python with:
--with-pymalloc --with-pymalloc-mmap --with-dlmalloc

The behavior is then as follows:
* PyObject_MALLOC will allocate arenas with mmap

* when allocating an object smaller than 256 bytes with 
PyObject_MALLOC, it will be stored in an arena (like before)

* when allocating an object bigger than 256 bytes with PyObject_MALLOC,
it will be allocated by dlmalloc (if it is smaller than 256KB it will go
in a dlmalloc pool, otherwise it will be mmaped)

* allocation through PyMem_MALLOC is handled by dlmalloc
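
The routing of PyObject_MALLOC requests listed above can be summarized in a small C sketch. This is not code from the patch: the names route_t and route_for_size are hypothetical, and the two thresholds are taken from the figures quoted in this message (256 bytes and 256KB). How requests of exactly 256 bytes or exactly 256KB are treated is an assumption here, as the description does not say.

```c
#include <stddef.h>

/* Thresholds quoted in the message above; the exact boundary handling
   below is an assumption, not something the patch description states. */
#define SMALL_REQUEST_THRESHOLD 256            /* bytes */
#define MMAP_THRESHOLD          (256 * 1024)   /* 256KB */

typedef enum {
    ROUTE_PYMALLOC_ARENA,   /* small object, stored in a pymalloc arena */
    ROUTE_DLMALLOC_POOL,    /* medium object, served from a dlmalloc pool */
    ROUTE_DLMALLOC_MMAP     /* large object, mmaped by dlmalloc */
} route_t;

/* Hypothetical classifier for a PyObject_MALLOC request of nbytes. */
route_t route_for_size(size_t nbytes)
{
    if (nbytes <= SMALL_REQUEST_THRESHOLD)
        return ROUTE_PYMALLOC_ARENA;
    if (nbytes < MMAP_THRESHOLD)
        return ROUTE_DLMALLOC_POOL;
    return ROUTE_DLMALLOC_MMAP;
}
```

Under these assumptions, a 64-byte request lands in an arena, a 4096-byte request in a dlmalloc pool, and a 1MB request is mmaped. PyMem_MALLOC requests are not classified by this function, since they all go to dlmalloc.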

I think it is a good compromise:
On systems like Linux, where the system malloc is already clever enough,
compiling with only --with-pymalloc should behave like before. On
systems like SunOS and AIX, this patch ensures that Python can benefit
from the speed of pymalloc for small objects, while ensuring that most of
the memory allocated can be correctly released at the system level.
msg72975 - (view) Author: Sébastien Sablé (sable) Date: 2008-09-10 16:33
My previous patch has a small problem: I assumed that dlmalloc always
returns a non-NULL value, even when asked for 0 bytes.

It turns out not to be the case, so here is a new patch
(patch_dlmalloc3.diff) which must be applied after the previous one
(patch_dlmalloc2.diff) to correct this problem.
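
For readers unfamiliar with the zero-byte case: the C standard leaves it implementation-defined whether malloc(0) returns NULL or a unique pointer, so callers that treat NULL as "out of memory" need a guard. One common guard is sketched below in C. The wrapper name malloc_nonnull_on_zero is hypothetical, it wraps the system malloc rather than dlmalloc, and it is not necessarily what patch_dlmalloc3.diff does.

```c
#include <stdlib.h>

/* Hypothetical wrapper: never passes a zero size to the underlying
   allocator, so a NULL return always means allocation failure. */
void *malloc_nonnull_on_zero(size_t nbytes)
{
    if (nbytes == 0)
        nbytes = 1;   /* request one byte instead of zero */
    return malloc(nbytes);
}
```

The returned pointer is released with free() as usual.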
History
Date                 User     Action  Args
2008-09-10 16:33:03  sable    set     files: + patch_dlmalloc3.diff; messages: + msg72975
2008-09-09 15:59:06  sable    set     files: + patch_dlmalloc2.diff; messages: + msg72876
2008-09-08 03:26:34  tim_one  set     messages: + msg72762
2008-09-08 03:21:22  loewis   set     messages: + msg72761
2008-09-08 00:52:07  tim_one  set     nosy: + tim_one; messages: + msg72758
2008-09-07 19:45:13  loewis   set     messages: + msg72750
2008-09-03 10:28:09  sable    set     messages: + msg72382
2008-08-09 17:53:52  pitrou   set     messages: + msg70945
2008-08-09 17:25:56  loewis   set     messages: + msg70940
2008-08-09 10:57:09  pitrou   set     messages: + msg70929
2008-08-08 22:46:50  loewis   set     nosy: + loewis; messages: + msg70920
2008-08-08 19:11:07  pitrou   set     priority: normal; nosy: + pitrou; messages: + msg70908; components: + Interpreter Core; versions: + Python 3.1, Python 2.7
2008-08-08 10:15:35  sable    set     files: + patch_dlmalloc.diff; keywords: + patch
2008-08-08 10:13:45  sable    set     files: + customized_malloc_AIX.pdf
2008-08-08 10:11:58  sable    create