classification
Title: Customized malloc implementation on SunOS and AIX
Type: resource usage
Stage:
Components: Interpreter Core
Versions: Python 3.1, Python 3.2, Python 2.7
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: BreamoreBoy, flub, loewis, neologix, pitrou, sable, tim.peters
Priority: normal
Keywords: patch

Created on 2008-08-08 10:11 by sable, last changed 2013-08-12 11:21 by pitrou. This issue is now closed.

Files
File name Uploaded Description
customized_malloc_SUN.pdf sable, 2008-08-08 10:11
customized_malloc_AIX.pdf sable, 2008-08-08 10:13
patch_dlmalloc.diff sable, 2008-08-08 10:15
patch_dlmalloc2.diff sable, 2008-09-09 15:58
patch_dlmalloc3.diff sable, 2008-09-10 16:33
patch_dlmalloc_Python_2_7_1.diff sable, 2011-07-19 16:03 Patch to use dlmalloc in Python - updated for Python 2.7.1 and to only use mmap
Messages (38)
msg70897 - (view) Author: Sébastien Sablé (sable) Date: 2008-08-08 10:11
Hi,

We run a big application mostly written in Python (with Pyrex/C
extensions) on different systems including Linux, SunOS and AIX.

The memory footprint of our application on Linux is fine; however we
found that on AIX and SunOS, any memory that has been allocated by our
application at some stage will never be freed at the system level.

After doing some analysis (see the 2 attached pdf documents), we found
that this is linked to the implementation of malloc on those various
systems:

The malloc used on Linux (glibc) is based on dlmalloc as described in
this document:
http://g.oswego.edu/dl/html/malloc.html

This implementation will use sbrk to allocate small chunks of memory,
but it will use mmap to allocate big chunks. This ensures that the
memory will actually get freed when free is called.

AIX and Sun have a more naive malloc implementation, so that the memory
allocated by an application through malloc is never actually freed until
the application leaves (this behavior has been confirmed by some experts
at IBM and Sun when we asked them for some feedback on this problem -
there is a 'memory disclaim' option on AIX but it is disabled by default
as it brings some major performance penalties).

For long running Python applications which may allocate a lot of memory
at some stage, this is a major drawback.

In order to bypass this limitation of the system on AIX and SunOS, we
have modified Python so that it will use the customized malloc
implementation dlmalloc like in glibc (see attached patch) - dlmalloc is
released in the public domain.

This patch adds a --enable-dlmalloc option to configure. When activated,
we observed a dramatic reduction of the memory used by our application.
I think many AIX and SunOS Python users could be interested in such an
improvement.

--
Sébastien Sablé
Sungard
msg70908 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-08 19:11
This is very interesting, although it should probably go through
discussion on python-dev since it involves integrating a big chunk of
external code.
msg70920 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-08-08 22:46
I cannot quite see why the problem is serious: even though the memory is
not returned to the system, it will be swapped out to the swap file, so
it doesn't consume any real memory (just swap space).

I don't think Python should integrate a separate malloc implementation.
Instead, Python's own memory allocator (obmalloc) should be changed to
directly use the virtual memory interfaces of the operating system (i.e.
mmap), bypassing the malloc of the C library.

So I'm -1 on this patch.
msg70929 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-09 10:57
On Friday 8 August 2008 at 22:46 +0000, Martin v. Löwis wrote:
> Instead, Python's own memory allocator (obmalloc) should be changed to
> directly use the virtual memory interfaces of the operating system (i.e.
> mmap), bypassing the malloc of the C library.

How would that interact with fork()?
msg70940 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-08-09 17:25
>> Instead, Python's own memory allocator (obmalloc) should be changed to
>> directly use the virtual memory interfaces of the operating system (i.e.
>> mmap), bypassing the malloc of the C library.
> 
> How would that interact with fork()?

Nicely, why do you ask? Any anonymous mapping will be copied
(typically COW) to the child process, in fact, malloc itself
uses anonymous mapping (at least on Linux).
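
The fork() behaviour described here can be checked with a minimal sketch. It assumes a POSIX system that provides MAP_ANONYMOUS (as Linux does); the helper name child_write_is_private is made up for illustration and is not part of Python or of the patch.

```c
#define _DEFAULT_SOURCE         /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if the child inherits the mapping's
   contents and its own write stays invisible to the parent, 0 if not,
   -1 on error. */
int child_write_is_private(void)
{
    size_t len = 256 * 1024;    /* same size as an obmalloc arena */
    int *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = 1;                   /* written before fork(), so the child sees it */

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, len);
        return -1;
    }
    if (pid == 0) {
        /* Child: check the inherited value, then modify its own copy. */
        int inherited = (p[0] == 1);
        p[0] = 2;
        _exit(inherited ? 0 : 1);
    }

    /* Parent: wait for the child, then check that our copy is unchanged. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        munmap(p, len);
        return -1;
    }
    int ok = WIFEXITED(status) && WEXITSTATUS(status) == 0 && p[0] == 1;
    munmap(p, len);
    return ok;
}
```

Calling child_write_is_private() from a test program should return 1 on such a system: the child starts with the parent's data, and its later write lands in a private copy.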
msg70945 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-08-09 17:53
On Saturday 9 August 2008 at 17:28 +0000, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> >> Instead, Python's own memory allocator (obmalloc) should be changed to
> >> directly use the virtual memory interfaces of the operating system (i.e.
> >> mmap), bypassing the malloc of the C library.
> > 
> > How would that interact with fork()?
> 
> Nicely, why do you ask?

Because I didn't know :)
But looking at the dlmalloc implementation bundled in the patch, it
seems that using mmap/munmap (or VirtualAlloc/VirtualFree under Windows)
should be ok.

Do you think we should create a separate issue for this improvement? It
could also solve #3531.
msg72382 - (view) Author: Sébastien Sablé (sable) Date: 2008-09-03 10:28
[sorry for the late reply, I have been on holidays]

Martin:
you are right that this memory is moved to swap and does not consume any
"real" memory; however, we decided to work on this patch because we
observed some performance degradation in our application due to this
memory not being deallocated correctly.

Since then we have done some quite extensive tests (with the help of a
consultant at Sun): they have shown that this unnecessary swapping has a
noticeable impact on performance and, at worst, when the system memory
is saturated, can bring a server to its knees for several
minutes (we're talking about top-of-the-line SunOS and AIX servers with
hundreds of GB of memory).

I will write a complete document explaining the tests and observations
that we made, but this memory issue was critical for us given the
performance degradation it was causing on our production servers.

Concerning dlmalloc, you are right that it would be cleaner to improve
obmalloc so that it uses mmap when necessary, instead of adding another
layer with dlmalloc (even though that is what actually currently happens
on linux systems where dlmalloc is integrated in libc).

I will try to do that patch in coming weeks (obmalloc mostly allocates
some 256KB arenas so it should nearly always use mmap).
msg72750 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-07 19:45
> I will try to do that patch in coming weeks (obmalloc mostly allocates
> some 256KB arenas so it should nearly always use mmap).

Exactly so. If you can, please also consider supporting Windows, in the
same way.

Anything in obmalloc that is not arena space should continue to come
from malloc, I believe.
msg72758 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2008-09-08 00:52
> Anything in obmalloc that is not arena space should continue to come
> from malloc, I believe.

Sorry, but I don't understand why arena space should be different.  If a
platform's libc implementers think mmap should be used to obtain 256KB
chunks (i.e., arenas), then surely they implement the platform malloc to
defer to mmap in such cases.  If they don't but "should", then bugging
the platform vendor to improve the system malloc in this respect is the
best idea (then all apps on the platform benefit, and Python stays simpler).

OTOH, if for some compelling reason it's believed Python knows better
than platform vendors, then obmalloc should be uglied-up on all paths to
make the enlightened choice.
msg72761 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-08 03:21
> OTOH, if for some compelling reason it's believed Python knows better
> than platform vendors, then obmalloc should be uglied-up on all paths to
> make the enlightened choice.

I'm proposing that obmalloc is changed to know better than system malloc
on systems supporting anonymous mmap, and Windows, and that the call

   malloc(ARENA_SIZE)

is replaced by mmap. This has the advantage of doing better than system
malloc on Solaris, plus it also might guarantee that arenas will be
POOL_SIZE aligned.

OTOH, the calls

  realloc(arenas, nbytes)
  malloc(nbytes)

should continue to go to system malloc, because they are typically
not multiples of the system page size.
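
A minimal sketch of what replacing malloc(ARENA_SIZE) by mmap could look like, assuming a POSIX system with MAP_ANONYMOUS. The helper names (arena_alloc, arena_free, arena_is_page_aligned) are hypothetical and this is not the code from any attached patch; it only illustrates that such an arena comes back page-aligned and can be returned to the system with munmap.

```c
#define _DEFAULT_SOURCE           /* exposes MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

#define ARENA_SIZE (256 * 1024)   /* 256KB, the arena size quoted in this thread */

/* Hypothetical helper: obtain one arena directly from the kernel.
   Returns NULL on failure, like malloc. */
void *arena_alloc(void)
{
    void *p = mmap(NULL, ARENA_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}

/* Hypothetical helper: give the arena back to the system; 0 on success. */
int arena_free(void *arena)
{
    return munmap(arena, ARENA_SIZE);
}

/* Hypothetical helper: 1 if the arena starts on a page boundary. */
int arena_is_page_aligned(const void *arena)
{
    long page = sysconf(_SC_PAGESIZE);
    return page > 0 && ((uintptr_t)arena % (uintptr_t)page) == 0;
}
```

The alignment only matches POOL_SIZE when POOL_SIZE equals the system page size, which is an assumption here. The realloc and malloc(nbytes) calls mentioned above would stay on the system malloc in this scheme.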
msg72762 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2008-09-08 03:26
I have to admit that if Python /didn't/ know better than platform libc
implementers in some cases, there would be no point to having obmalloc
at all :-(

What you (Martin) suggest is reasonable enough.
msg72876 - (view) Author: Sébastien Sablé (sable) Date: 2008-09-09 15:58
Here is a new patch so that pymalloc can be combined with dlmalloc.

I first added the --with-pymalloc-mmap option to configure.in which
ensures that pymalloc arenas are allocated through mmap when possible.

However I found this was not enough: PyObject_Malloc uses arenas only
when handling objects smaller than 256 bytes. For bigger objects, it
relies directly on the system malloc. There are also some big buffers
which can be directly allocated through PyMem_MALLOC.

This patch can be activated by compiling Python with:
--with-pymalloc --with-pymalloc-mmap --with-dlmalloc

The behavior is then as follows:
* PyObject_MALLOC will allocate arenas with mmap

* when allocating an object smaller than 256 bytes with 
PyObject_MALLOC, it will be stored in an arena (like before)

* when allocating an object bigger than 256 bytes with PyObject_MALLOC,
it will be allocated by dlmalloc (if it is smaller than 256KB it will go
in a dlmalloc pool, otherwise it will be mmaped)

* allocation through PyMem_MALLOC is handled by dlmalloc

I think it is a good compromise:
On systems like Linux, where the system malloc is already clever enough,
compiling with only --with-pymalloc should behave like before. On
systems like SunOS and AIX, this patch ensures that Python can benefit
from the speed of pymalloc for small objects, while ensuring that most of
the memory allocated can be correctly released at the system level.
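
The routing described for PyObject_MALLOC in the list above can be summarised as a small decision function. This is only an illustrative sketch: the names are hypothetical, the two thresholds are the figures quoted in this message, and treating a request of exactly 256 bytes as "small" is an assumption, since the message does not say which side that case falls on.

```c
#include <stddef.h>

#define SMALL_OBJECT_LIMIT 256            /* bytes, pymalloc limit quoted above */
#define MMAP_LIMIT         (256 * 1024)   /* bytes, dlmalloc limit quoted above */

/* Hypothetical labels for where a request ends up. */
typedef enum {
    FROM_PYMALLOC_ARENA,   /* arena itself obtained with mmap */
    FROM_DLMALLOC_POOL,    /* carved out of a dlmalloc pool */
    FROM_MMAP              /* dlmalloc maps the block directly */
} alloc_source;

/* Hypothetical helper: classify a PyObject_MALLOC request by size. */
alloc_source object_alloc_source(size_t nbytes)
{
    if (nbytes <= SMALL_OBJECT_LIMIT)     /* assumption: 256 counts as small */
        return FROM_PYMALLOC_ARENA;
    if (nbytes < MMAP_LIMIT)
        return FROM_DLMALLOC_POOL;
    return FROM_MMAP;
}
```

For example, object_alloc_source(64) gives FROM_PYMALLOC_ARENA and object_alloc_source(1024 * 1024) gives FROM_MMAP. A PyMem_MALLOC request would skip the first branch, since it goes to dlmalloc whatever its size.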
msg72975 - (view) Author: Sébastien Sablé (sable) Date: 2008-09-10 16:33
My previous patch has a small problem as I believed dlmalloc was always
returning a non-NULL value, even when asking for 0 bytes.

It turns out not to be the case, so here is a new patch
(patch_dlmalloc3.diff) which must be applied after the previous one
(patch_dlmalloc2.diff) to correct this problem.
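
For context, the C standard allows an allocator to return either NULL or a unique pointer for a zero-byte request. One common guard is to round a zero size up to one byte. The sketch below shows that idea with the system malloc standing in for dlmalloc; the wrapper name is hypothetical, and whether patch_dlmalloc3.diff uses this exact approach is not stated in this thread.

```c
#include <stdlib.h>

/* Hypothetical wrapper: never pass a zero size to the underlying allocator,
   so a NULL result always means "out of memory". */
void *malloc_nonzero(size_t nbytes)
{
    return malloc(nbytes ? nbytes : 1);
}
```

A pointer returned for a zero-byte request is still released with the matching free function.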
msg110893 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-20 13:32
Any SunOS/AIX people interested in keeping this open?
msg111255 - (view) Author: Sébastien Sablé (sable) Date: 2010-07-23 09:38
Well I am still interested in getting this patch officially integrated in Python.

This patch is integrated in the version of Python that we deploy to our customers with our products (Sungard GP3). So it runs in production at various client sites (some European banks with massive SunOS and AIX servers running thousands of sessions of our application) and it has provided some huge memory consumption improvements.

The problem appears quite obviously when you run a relatively big application on SunOS or AIX: if you allocate some memory in a Python process at some stage, this memory will never be released to the system until you leave that process, even if that memory is not used by Python anymore.
With my patch, the process can actually release the memory to the system so that it can be used by other processes.

Linux is not impacted by this problem because the GNU libc implements the same memory allocation mechanism based on dlmalloc.

I guess there are not that many people running Python applications with a big memory footprint on AIX or SunOS, otherwise this problem would be more popular.
msg115620 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-04 23:48
> I guess there are not that many people running Python applications with 
> a big memory footprint on AIX or SunOS, otherwise this problem would be 
> more popular.

Not only, but integrating a big chunk of foreign code in something as critical as the memory allocation routines is not an easy decision to make. Also, the dlmalloc copy should then be regularly kept in sync with upstream.
msg134330 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-04-24 11:25
Sébastien:
I'm chiming in late, but doesn't AIX have something like LD_PRELOAD?
Why not use it to transparently replace AIX's legacy malloc by another malloc implementation like dlmalloc or ptmalloc?
That would not require any patching of Python, and could also be used for other applications.

As a side note, while mmap has some advantages, it is way slower than brk (because pages must be zero-filled, and since mmap/munmap is called at every malloc/free call, this zero-filling is done every time, unlike with brk pools). See http://sources.redhat.com/ml/libc-alpha/2006-03/msg00033.html
msg134470 - (view) Author: Sébastien Sablé (sable) Date: 2011-04-26 14:52
Hi Charles-François,

it is possible to impact the memory allocation system on AIX using some environment variables (MALLOCOPTIONS and others), but it is not very elegant (it will impact all applications running with this environment and it is difficult to ensure that those environment variables will be correctly set when distributing an application to a customer) and I am afraid most users will never hear about that and will just use the default behavior.

Concerning mmap performance, dlmalloc has a pool mechanism and Python has its own pool mechanism on top of that.
As a result, system calls to allocate memory do not happen frequently since the memory allocation is usually handled internally in those pools and dlmalloc is often faster than the native malloc.

I have been distributing a version of Python which integrates this patch with the application on which I work to various customers for the last few years and the benchmarks have not shown any significant performance degradation. On the other hand, the decrease in memory consumption has been clearly noticed and appreciated.

Also note that dlmalloc (or a derivative - ptmalloc) is part of GNU glibc which is used by most Linux systems, and is what you get when you call malloc.
http://en.wikipedia.org/wiki/Malloc#dlmalloc_and_its_derivatives

So by using dlmalloc on SunOS and AIX you would get the same level of performance for memory operations that you already probably can appreciate on Linux systems.
msg134485 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-04-26 17:47
> it is possible to impact the memory allocation system on AIX using some environment variables (MALLOCOPTIONS and others)

LD_PRELOAD won't impact AIX's malloc behaviour, but allows you to
replace it transparently by any other implementation you like
(dlmalloc, ptmalloc, ...), without touching either CPython or your
application.

For example, let's say I want a Python version where getpid always returns 42.

$ cat /tmp/pid.c
int getpid(void)
{
        return 42;
}

$ gcc -o /tmp/pid.so /tmp/pid.c -fpic -shared

Now,

$ LD_PRELOAD=/tmp/pid.so python -c 'import os; print(os.getpid())'
42

That's it. If you replace pid.so by dlmalloc.so, you'll be using
dlmalloc instead of AIX's malloc, without having modified a single
line of code.
If you're concerned with impacting other applications, then you could
do something like:

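$ cat python.c
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        /* Set LD_PRELOAD only for the real interpreter started below. */
        setenv("LD_PRELOAD", "/tmp/pid.so", 1);
        execv(<path to real python>, argv);

        /* Only reached if execv() failed. */
        return 1;
}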

And then:
$ ./python -c 'import os; print(os.getpid())'
42

> Also note that dlmalloc (or a derivative - ptmalloc) is part of GNU glibc which is used by most Linux systems, and is what you get when you call malloc.
> http://en.wikipedia.org/wiki/Malloc#dlmalloc_and_its_derivatives
>

Actually, glibc/eglibc versions have diverged quite a lot from the
original ptmalloc2, see for example http://bugs.python.org/issue11849
(that's one reason why embedding such a huge piece of code into Python
is probably not a good idea as highlighted by Antoine, it's updated
fairly frequently).

> So by using dlmalloc on SunOS and AIX you would get the same level of performance for memory operations that you already probably can appreciate on Linux systems.

Yes, but with the above "trick", you can do that without patching
either Python or your app.
I mean, if you start embedding malloc in python, why stop there, and
not embed the whole glibc ;-)
Note that I realize this won't solve the problem for other AIX users
(if there are any left :-), but since this patch doesn't seem to be
gaining traction, I'm just proposing an alternative that I find
cleaner, simpler and easier to maintain.
msg134489 - (view) Author: Floris Bruynooghe (flub) Date: 2011-04-26 19:16
> > So by using dlmalloc on SunOS and AIX you would get the same level
> > of performance for memory operations that you already probably can
> > appreciate on Linux systems.
>
> Yes, but with the above "trick", you can do that without patching
> either Python or your app.
> I mean, if you start embedding malloc in python, why stop there, and
> not embed the whole glibc ;-)
> Note that I realize this won't solve the problem for other AIX users
> (if there are any left :-), but since this patch doesn't seem to be
> gaining traction, I'm just proposing an alternative that I find
> cleaner, simpler and easier to maintain.

This trick is hard to find however and I don't think it serves Solaris
and AIX users very much (and sadly IBM keeps pushing AIX so yes it's
used more than I like :-( ).

So how about a --with-dlmalloc=path/to/dlmalloc.c?  This way the
dlmalloc code does not live inside Python and doesn't need to be
maintained by python.  But python still supports the code and will
easily be built using it.  Add a note in the README for AIX and
Solaris and I think this would be a lot friendlier to users.  This is
similar to how python uses e.g. openssl to provide optional extra
functionality/performance.
msg134491 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-26 19:21
> So how about a --with-dlmalloc=path/to/dlmalloc.c?

Can't you just add dlmalloc to LDFLAGS or something? Or would the
default malloc still be selected?

> This is
> similar to how python uses e.g. openssl to provide optional extra
> functionality/performance.

It's not really similar. OpenSSL provides functionality that's not
available through the standard library. Here, we're talking about an
alternative implementation of the standard C routines.
msg134495 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-04-26 20:46
I just noticed there's already a version of dlmalloc in Modules/_ctypes/libffi/src/dlmalloc.c

Compiling with gcc -shared -fpic -o /tmp/dlmalloc.so ./Modules/_ctypes/libffi/src/dlmalloc.c

Then LD_PRELOAD=/tmp/dlmalloc.so ./python

works just fine (and by the way, it solves the problem with glibc's version in #11849, it's somewhat slower though).

Or am I missing something?
msg134774 - (view) Author: Sébastien Sablé (sable) Date: 2011-04-29 15:30
> I'm just proposing an alternative that I find cleaner, simpler and easier to maintain.

I understand how LD_PRELOAD works but I find it neither clean nor simple to maintain.

Also by using a wrapper to call Python you still impact all the applications that may be executed from Python since the environment variables are propagated. You also need to configure the path to the alternative malloc library at runtime.

And as I said above, I am afraid most AIX and SunOS users will never hear about that and will just use the default behavior, with their Python application taking much more memory than necessary.

As mentioned by Floris, AIX is being pushed by IBM quite a lot, and in some markets it is very common (if not predominant in finance for example - 50/50 with SunOS for my clients I would say).

> I mean, if you start embedding malloc in python, why stop there, and
> not embed the whole glibc ;-)

Concerning AIX, that would not be such a bad idea given the number of bugs in the native C library (cf some of my other issues reported in python bug tracker) - just kidding ;-)

Concerning the fact that dlmalloc or ptmalloc evolve "quickly":
* dlmalloc V2.8.3 Thu Sep 22 11:16:32 2005
* ptmalloc2 release Jun 5th, 2006
* ptmalloc3 release May 31st, 2006
* dlmalloc V2.8.4 Wed May 27 09:56:23 2009
I think we can cope with that kind of "fast" evolution ;-)

Also an old dlmalloc is better than no dlmalloc at all.
And as you noticed, an old dlmalloc is already provided in libffi.

> So how about a --with-dlmalloc=path/to/dlmalloc.c?

That looks like a good alternative. I can implement that if that can help to get the patch in Python.

> Can't you just add dlmalloc to LDFLAGS or something? Or would the
> default malloc still be selected?

There is a USE_DL_PREFIX in malloc.c. If this flag is defined, all functions will be prefixed by dl (dlmalloc, dlfree, dlrealloc...).
If it is not set, the functions will be named as usual (malloc, free...).

In my patch, I preferred to set USE_DL_PREFIX and call dlmalloc/dlfree explicitly where needed.

Since I want PyMem_MALLOC to call dlmalloc, I would need to export the "malloc" symbol from libpython so that Python extensions could use it when calling PyMem_MALLOC, but that would impact all malloc calls in applications which embed Python for example.

So I think it is probably better to explicitly distinguish when you want to call dlmalloc and leave the native malloc for the host application.
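
A minimal sketch of that explicit distinction, under stated assumptions: WITH_DLMALLOC and the APPMEM_* macro names are hypothetical and are not taken from the patch. The dlmalloc, dlrealloc and dlfree prototypes are the ones dlmalloc exports when it is built with USE_DL_PREFIX. Without the flag, the macros fall back to the platform allocator, which is untouched either way.

```c
#include <stdlib.h>

#ifdef WITH_DLMALLOC   /* hypothetical build flag, e.g. set by configure */
/* Exported by dlmalloc when it is compiled with USE_DL_PREFIX. */
void *dlmalloc(size_t);
void *dlrealloc(void *, size_t);
void  dlfree(void *);
#define APPMEM_MALLOC(n)      dlmalloc(n)
#define APPMEM_REALLOC(p, n)  dlrealloc((p), (n))
#define APPMEM_FREE(p)        dlfree(p)
#else
/* Fallback: the platform malloc family. */
#define APPMEM_MALLOC(n)      malloc(n)
#define APPMEM_REALLOC(p, n)  realloc((p), (n))
#define APPMEM_FREE(p)        free(p)
#endif
```

With this layout, a block obtained through APPMEM_MALLOC must be resized and released through APPMEM_REALLOC and APPMEM_FREE, while plain malloc/free calls in the host application keep using the native implementation.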

Also this only addresses the --with-dlmalloc part of my patch.
The other part concerning --with-pymalloc-mmap ensures that pymalloc uses mmap to allocate arenas rather than malloc.

I perfectly understand that people are reluctant to make the memory allocation system more complex than it is already in Python in order to bypass some limitations of systems which are not very widespread among Python users.

But Python eating a lot of memory on SunOS and AIX does not look very good either.

I have some strong requirements as far as memory is concerned for my application so I have maintained this patch internally and distributed it as part of my application.

I will probably change jobs soon and will not have access to AIX systems anymore. I don't really expect this patch to be accepted soon as few people have expressed some interest and I don't have much time/interest to push it on python-dev, but I will update the patch for Python 2.7 and 3.2 before leaving so that people impacted by this problem can at least manually patch their Python if they find this issue.
msg134775 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-29 15:45
> Since I want PyMem_MALLOC to call dlmalloc, I would need to export the
> "malloc" symbol from libpython so that Python extensions could use it
> when calling PyMem_MALLOC, but that would impact all malloc calls in
> applications which embed Python for example.

Well, that would be a rather good thing. There are, IIRC, Python API
calls which require that the caller manually frees memory. If the API
call malloc()s memory with a certain allocator and the caller free()s it
with another allocator, the result won't be pretty :)

(a similar discrepancy occurs between function-based APIs and
macro-based APIs: functions get compiled inside the Python library while
macros get compiled within the embedding executable; if library and
application have an incompatible malloc()/free() pair, you will get
similarly funny results)
msg134777 - (view) Author: Sébastien Sablé (sable) Date: 2011-04-29 16:03
Yes, I was probably not clear:
When --with-dlmalloc is activated, PyMem_MALLOC/PyMem_Malloc will call dlmalloc, PyMem_REALLOC/PyMem_Realloc will call dlrealloc and PyMem_FREE/PyMem_Free will call dlfree.

While calls to malloc/free/realloc will use the platform implementation.

So I think there should not be any mix, since as it is mentioned in pymem.h, people should not mix PyMem_MALLOC/PyMem_FREE with malloc/free:

/* BEWARE:

   Each interface exports both functions and macros.  Extension modules should
   use the functions, to ensure binary compatibility across Python versions.
   Because the Python implementation is free to change internal details, and
   the macros may (or may not) expose details for speed, if you do use the
   macros you must recompile your extensions with each Python release.

   Never mix calls to PyMem_ with calls to the platform malloc/realloc/
   calloc/free.  For example, on Windows different DLLs may end up using
   different heaps, and if you use PyMem_Malloc you'll get the memory from the
   heap used by the Python DLL; it could be a disaster if you free()'ed that
   directly in your own extension.  Using PyMem_Free instead ensures Python
   can return the memory to the proper heap.  As another example, in
   PYMALLOC_DEBUG mode, Python wraps all calls to all PyMem_ and PyObject_
   memory functions in special debugging wrappers that add additional
   debugging info to dynamic memory blocks.  The system routines have no idea
   what to do with that stuff, and the Python wrappers have no idea what to do
   with raw blocks obtained directly by the system routines then.

   The GIL must be held when using these APIs.
*/
msg134780 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-29 16:16
> Yes, I was probably not clear:
> When --with-dlmalloc is activated, PyMem_MALLOC/PyMem_Malloc will call
> dlmalloc, PyMem_REALLOC/PyMem_Realloc will call dlrealloc and
> PyMem_FREE/PyMem_Free will call dlfree.
> 
> While calls to malloc/free/realloc will use the platform implementation.

I'm not sure why you would want that. If dlmalloc is clearly superior,
why not use it for all allocations inside the application (not only
Python ones)?
msg134783 - (view) Author: Floris Bruynooghe (flub) Date: 2011-04-29 16:26
On 29 April 2011 17:16, Antoine Pitrou <report@bugs.python.org> wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
>> Yes, I was probably not clear:
>> When --with-dlmalloc is activated, PyMem_MALLOC/PyMem_Malloc will call
>> dlmalloc, PyMem_REALLOC/PyMem_Realloc will call dlrealloc and
>> PyMem_FREE/PyMem_Free will call dlfree.
>>
>> While calls to malloc/free/realloc will use the platform implementation.
>
> I'm not sure why you would want that. If dlmalloc is clearly superior,
> why not use it for all allocations inside the application (not only
> Python ones)?

For the same reason that extension modules can choose between
PyMem_Malloc and plain malloc (or whatever else).  Python has never
forced its malloc on extension modules, why should it now?
msg134785 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-04-29 16:36
> For the same reason that extension modules can choose between
> PyMem_Malloc and plain malloc (or whatever else).  Python has never
> forced its malloc on extension modules, why should it now?

We're talking about a platform-specific feature request due to the fact
that dlmalloc is (supposedly) superior to AIX malloc(). If it's superior
then I don't see any *practical* reason not to want to use it for other
purposes than allocating Python objects.
msg134794 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-04-29 17:19
Even worse than that, mixing two malloc implementations could lead to trouble.
For example, the trimming code ensures that the heap is where it last set it. So if an allocation has been made by another implementation in the meantime, the heap won't be trimmed, and your memory usage won't decrease. Also, it'll increase memory fragmentation.
Finally, if you've got two threads inside different malloc implementations at the same time, well, some really bad things could happen.
And there are probably many other reasons why it's a bad idea.
msg134808 - (view) Author: Sébastien Sablé (sable) Date: 2011-04-29 19:10
I share the opinion of Floris on this: just because you link your application with python does not mean you want it to handle all memory management.

If you want the memory to be handled by Python, you should call PyMem_Malloc.

Otherwise people may want to use different malloc implementations in different parts of their application/libraries for different reasons (dmalloc for debugging http://dmalloc.com/ for example - we have seen that libffi bundles its own dlmalloc - someone may prefer a derivative of ptmalloc for performance reasons with threads...).

My application is linked with various libraries including libpython, glib and gmp, and I sometimes like to be able to distinguish how much memory is allocated by which library for profiling/debugging purpose for example.

I don't understand the point concerning trimming/fragmentation/threading by Charles-Francois: dlmalloc will allocate its own memory segment using mmap and handle memory inside that segment when you do a dlmalloc/dlfree/dlrealloc. Other malloc implementations will work in their own separate space and so won't impact or be impacted by what happens in dlmalloc segments.

dlmalloc is not that much different from pymalloc in that regard: it handles its own memory pool on top of the system memory implementations.
Yet you can have an application that uses the ordinary malloc while calling some Python code which uses pymalloc without any trimming/fragmentation/threading issues.
msg134810 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-04-29 21:01
> I don't understand the point concerning trimming/fragmentation/threading by
> Charles-Francois: dlmalloc will allocate its own memory segment using mmap
> and handle memory inside that segment when you do a
> dlmalloc/dlfree/dlrealloc. Other malloc implementations will work in their
> own separate space and so won't impact or be impacted by what happens in
> dlmalloc segments.

Most of the allocations come from the heap - through sbrk - which is a
shared resource, and is a contiguous space. mmap is only used for big
allocations.

>
> dlmalloc is not that much different from pymalloc in that regard: it handles
> its own memory pool on top of the system memory implementations.
> Yet you can have an application that uses the ordinary malloc while calling
> some Python code which uses pymalloc without any
> trimming/fragmentation/threading issues.

It's completely different. Pymalloc is used *on top* of libc's malloc,
while dlmalloc would be used in parallel.
msg135130 - (view) Author: Sébastien Sablé (sable) Date: 2011-05-04 13:49
Another reason why you should not force dlmalloc for all applications linked with libpython is because dlmalloc is (by default) not thread safe, while the system malloc is (generally) thread-safe. It is possible to define a constant in dlmalloc to make it thread-safe (using locks) but it will be slower and it is not needed in Python since the GIL must be held when using PyMem_ functions.

If a thread-safe implementation was needed, it would be better to switch to ptmalloc2.

Also that addresses the issue of "two threads inside different malloc implementations at the same time": it is currently not allowed with PyMem_Malloc.

> Most of the allocations come from the heap - through sbrk

Most python objects will be allocated in pymalloc arenas (if they are smaller than 256 bytes) which (if compiled with --with-pymalloc-mmap) will be directly allocated by calling mmap, or (without --with-pymalloc-mmap) will be allocated in dlmalloc by calling mmap (because arenas are 256KB).
So most of the python objects will end up in mmap segments separate from the heap.

The only allocations that will end up in the heap are for the medium python objects (>256 bytes and <256KB) or for allocations made directly by calling PyMem_Malloc (and for a size <256KB). Also, dlmalloc will not call sbrk for each of those allocations: dlmalloc allocates some large memory pools and manages the smaller allocations within those pools in a very efficient way. So heap fragmentation should indeed be reduced by using dlmalloc.
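
As a rough illustration of the routing described above, the sketch below classifies a request by its size. The thresholds come from the text; the names `alloc_route` and `classify_request` are made up for this example, and the handling of the exact boundary values (256 bytes, 256KB) is an assumption.

```c
#include <stddef.h>

/* Thresholds quoted in the message above. */
#define SMALL_OBJECT_LIMIT 256u            /* bytes handled by pymalloc */
#define MMAP_THRESHOLD     (256u * 1024u)  /* size at which dlmalloc uses mmap */

/* Hypothetical labels for where a request ends up. */
typedef enum {
    ROUTE_PYMALLOC_ARENA,  /* small object, carved out of a 256KB arena */
    ROUTE_DLMALLOC_HEAP,   /* medium request, served from heap-backed pools */
    ROUTE_DLMALLOC_MMAP    /* large request, placed in its own mmap segment */
} alloc_route;

/* Classify a request of nbytes; boundary handling is an assumption. */
alloc_route classify_request(size_t nbytes)
{
    if (nbytes <= SMALL_OBJECT_LIMIT)
        return ROUTE_PYMALLOC_ARENA;
    if (nbytes < MMAP_THRESHOLD)
        return ROUTE_DLMALLOC_HEAP;
    return ROUTE_DLMALLOC_MMAP;
}
```

Note that an arena itself is a 256KB request, which is why it falls in the last category.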

Most modern malloc implementations are also using pools/arenas anyway, so the heap will mostly contain a mix of native malloc arenas and dlmalloc pools. So the fragmentation should not be too much of a concern if you mix 2 malloc implementations.
Here is the OpenSolaris malloc implementation, for example:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/lib/libmalloc/common/malloc.c#514

Concerning trimming: the reason why I am proposing to use dlmalloc on AIX and Solaris is that the native malloc/free do not correctly trim the heap in the first place on those platforms! If malloc/free correctly worked on those platforms and the heap was trimmed when possible, I would not have taken the trouble of proposing this patch and using dlmalloc, I would happily use the native malloc/free.

So mixing 2 malloc implementations should not be a problem as long as you keep track of the right 'free' implementation to use for each pointer (which should already be the case when you call PyMem_Malloc/PyMem_Free instead of malloc/free).
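
A minimal sketch of that bookkeeping, assuming each allocation domain keeps its own pair of functions so that a pointer is always released by the implementation that produced it. All names here (`allocator_pair`, `pair_alloc`, `pair_release`, `python_domain`, `native_domain`) are hypothetical, and both domains are backed by the C library so the example links without dlmalloc.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct {
    void *(*alloc)(size_t nbytes);   /* e.g. dlmalloc or the native malloc */
    void  (*release)(void *ptr);     /* the matching dlfree or free */
    long   live;                     /* blocks handed out, not yet released */
} allocator_pair;

/* Allocate through the domain's own allocation function. */
void *pair_alloc(allocator_pair *a, size_t nbytes)
{
    void *p = a->alloc(nbytes);
    if (p != NULL)
        a->live++;
    return p;
}

/* Release through the same domain's release function. */
void pair_release(allocator_pair *a, void *p)
{
    if (p == NULL)
        return;
    a->release(p);
    a->live--;
}

/* Stand-ins: the patch would use dlmalloc/dlfree for the first domain. */
allocator_pair python_domain = { malloc, free, 0 };
allocator_pair native_domain = { malloc, free, 0 };
```

The caller never picks a free function by hand; it only has to remember which domain a pointer came from, which is what PyMem_Malloc/PyMem_Free already impose.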

If you are really concerned about mixing 2 malloc implementations in the heap, you can define "HAVE_MORECORE 0" in dlmalloc and that way dlmalloc will always use mmap and not use the heap at all.

My application uses the provided patch so that dlmalloc is used for Python objects and the native malloc for all the rest (which consumes much less memory than the Python part) on AIX and SunOS. It has been in production for years and we never experienced any crash related to memory problems.
msg135148 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-05-04 18:46
> Also that addresses the issue of "two threads inside different malloc implementations at the same time": it is currently not allowed with PyMem_Malloc.
>

That's not true.
You can perfectly have one thread inside PyMem_Malloc while another
one is inside libc's malloc.
For example, posix_listdir does:

     Py_BEGIN_ALLOW_THREADS
     dirp = opendir(name);
     Py_END_ALLOW_THREADS

Where opendir calls malloc internally. Since the GIL is released, you
can have another thread inside PyMem_Malloc at the same time. This is
perfectly safe, as long as the libc's malloc version is thread-safe.

But with your patch, such code wouldn't be thread-safe anymore. This
patch implies that a thread can't call malloc directly or indirectly
(printf, opendir, and many others) while it doesn't hold the GIL. This
is going to break a lot of existing code.
This thread-safety issue is not theoretical: I wrote a small
program with two threads, one allocating/freeing memory in a loop with
glibc's malloc and the other one with dlmalloc: it crashes immediately
on a Linux box.

> Most python objects will be allocated in pymalloc arenas (if they are smaller than 256 bytes) which (if compiled with --with-pymalloc-mmap) will be directly allocated by calling mmap, or (without --with-pymalloc-mmap) will be allocated in dlmalloc by calling mmap (because arenas are 256KB).
> So most of the python objects will end up in mmap segments separate from the heap.
>
> The only allocations that will end up in the heap are for the medium python objects (>256 bytes and <256KB) or for allocations directly by calling  PyMem_Malloc (and for a size <256KB).

Note that there are actually many objects falling into this category:
for example, on 64-bit, a dictionary exceeds 256B, and is thus
allocated directly from the heap (well, it changed really recently
actually), the same holds for medium-sized lists and strings. So,
depending on your workload, the heap can extend and shrink quite a
bit.

> If you are really concerned about mixing 2 malloc implementations in the heap, you can define "HAVE_MORECORE 0" in dlmalloc and that way dlmalloc will always use mmap and not use the heap at all.
>

It will also be slower, and consume more memory.
msg140678 - (view) Author: Sébastien Sablé (sable) Date: 2011-07-19 16:02
Sorry for the very late reply; I have been quite busy recently with the 
birth of my second daughter, a new job, a new home town and soon a new home.

...
> But with your patch, such code wouldn't be thread-safe anymore. This
> patch implies that a thread can't call malloc directly or indirectly
> (printf, opendir, and many others) while it doesn't hold the GIL. This
> is going to break a lot of existing code.

I didn't have this problem since the threads in my application are 
handled by Python and so hold the GIL. But you are right it is a concern.

Fortunately, it is easy to solve by defining the following in dlmalloc:
#define HAVE_MORECORE 0

That way, all the memory allocations handled by Python will go in a 
dedicated mmaped memory segment controlled by dlmalloc, while all the 
calls to the system malloc will work as before (probably going into a 
segment handled by sbrk).
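
To illustrate what an mmap-only configuration relies on, here is a minimal sketch of obtaining and releasing an anonymous mapping directly. `segment_alloc` and `segment_free` are hypothetical helpers, not dlmalloc functions; dlmalloc additionally manages many small blocks inside such segments.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* exposes MAP_ANONYMOUS on glibc in strict modes */
#endif
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON   /* older BSD-style spelling */
#endif

/* Obtain a private anonymous mapping; the sbrk heap is not involved. */
void *segment_alloc(size_t nbytes)
{
    void *p = mmap(NULL, nbytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Unmap the segment so its pages go back to the system.
   Returns 0 on success, -1 on failure (as munmap does). */
int segment_free(void *p, size_t nbytes)
{
    return munmap(p, nbytes);
}
```

Because each segment is unmapped independently, releasing it does not depend on what the system malloc has placed at the top of the heap.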

> It will also be slower, and consume more memory.

It should be noted that sbrk is deprecated on some platforms where mmap 
is suggested as a better replacement (Mac OS X, FreeBSD...).
sbrk is generally considered quite archaic.

I attach a new patch that can be applied to Python 2.7.1. It includes 
the dlmalloc modification and uses only mmap in this case (no sbrk).

We have delivered it in production with the new version of our software 
that works on AIX 6.1 and it works fine.

I also did some benchmarks and did not notice any slowdown compared to
a pristine Python 2.7.1 (actually it was slightly faster; YMMV).
It also consumes a lot less memory, but that is the reason for this 
patch in the first place.

Since I am changing jobs, I won't be working on AIX anymore (yeah!); I 
also don't expect this patch to be integrated spontaneously without 
someone interested in AIX pushing for it. So I leave this patch more as 
a reference for someone who would be impacted by this problem and would 
like to integrate it in his own Python. I hope it helps.
msg140681 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-07-19 16:12
> Since I am changing jobs, I won't be working on AIX anymore (yeah!);

You seem happy about that :)
Does it mean the project to have an AIX buildbot is abandoned?

> I also don't expect this patch to be integrated spontaneously without 
> someone interested in AIX pushing for it. So I leave this patch more as 
> a reference for someone who would be impacted by this problem and would 
> like to integrate it in his own Python. I hope it helps.

Indeed, thanks for your contributions.
msg140682 - (view) Author: Sébastien Sablé (sable) Date: 2011-07-19 16:22
> Does it mean the project to have an AIX buildbot is abandoned?

We have a buildbot running internally on AIX. I could not get the necessary modifications integrated upstream in the official Python buildbot so that we could plug into it directly.

cf this thread:
http://mail.python.org/pipermail/python-dev/2010-October/thread.html#104714

I will try to get someone at my company to keep this buildbot running and report any outstanding bug, but I can't guarantee anything. 

> Indeed, thanks for your contributions.

Thanks! And thank you for your help in most of the issues related to AIX.
msg141177 - (view) Author: Charles-François Natali (neologix) * (Python committer) Date: 2011-07-26 18:31
> Fortunately, it is easy to solve by defining the following in
> dlmalloc:
> #define HAVE_MORECORE 0

I was expecting this answer ;-)
Here's a quick demo, on a Linux box:

cf@neobox:~/cpython$ ./python Tools/pybench/pybench.py -n 1
-------------------------------------------------------------------------------
Totals:                          19787ms  19787ms

cf@neobox:~/cpython$ MALLOC_MMAP_THRESHOLD_=0 ./python Tools/pybench/pybench.py -n 1
[...]
-------------------------------------------------------------------------------
Totals:                          33375ms  33375ms

That's a mere 70% slowdown, and without pymalloc, it would be much worse. malloc with mmap() is way slower than with sbrk() (see http://sources.redhat.com/ml/libc-alpha/2006-03/msg00033.html for more details). Since your benchmarks don't show this type of regression it probably means that AIX's malloc implementation is really broken (there's also the fact that part of the allocations are still routed to the libc's malloc, or maybe your workload is too specific to demonstrate this behavior).
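
For reference, the relative slowdown implied by the two totals above can be computed as follows; `slowdown_percent` is a hypothetical helper used only for this arithmetic.

```c
/* Relative slowdown, in percent, of a run taking slow_ms compared to
   a baseline run taking base_ms. */
double slowdown_percent(double base_ms, double slow_ms)
{
    return (slow_ms - base_ms) / base_ms * 100.0;
}
```

With the totals above, slowdown_percent(19787, 33375) comes out slightly under 69, consistent with the rough 70% figure.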

> sbrk is generally considered quite archaic.

I wouldn't say that; see the above link on malloc's dynamic mmap() threshold.

> I also don't expect this patch to be integrated spontaneously without
> someone interested in AIX pushing for it.

Indeed.
As far as I'm concerned, there are two "showstoppers":
- shipping an implementation of dlmalloc with Python
- mixing dlmalloc with the host's malloc implementation

But I think the main problem with this patch is that AIX represents such a tiny fraction of the user base. This might change in the future, especially if IBM is successful in its effort to push AIX (I hope they'll finally fix AIX's malloc by then...).

> I have been quite busy recently with the birth of my second daughter,
> a new job, a new home town and soon a new home.

Congratulations, and good luck!
msg194939 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2013-08-12 11:21
PEP 445 allows you to customize the Python memory allocators, which is a better solution than shipping several ones with Python ;-)
History
Date User Action Args
2013-08-12 11:21:33 pitrou set status: open -> closed
    resolution: wont fix
    messages: + msg194939
2011-07-26 18:31:50 neologix set messages: + msg141177
2011-07-19 16:22:30 sable set messages: + msg140682
2011-07-19 16:12:22 pitrou set messages: + msg140681
2011-07-19 16:04:14 sable set files: + patch_dlmalloc_Python_2_7_1.diff
2011-07-19 16:02:06 sable set messages: + msg140678
2011-05-04 18:46:41 neologix set messages: + msg135148
2011-05-04 13:49:54 sable set messages: + msg135130
2011-04-29 21:01:06 neologix set messages: + msg134810
2011-04-29 19:10:16 sable set messages: + msg134808
2011-04-29 17:19:48 neologix set messages: + msg134794
2011-04-29 16:36:37 pitrou set messages: + msg134785
2011-04-29 16:26:57 flub set messages: + msg134783
2011-04-29 16:16:48 pitrou set messages: + msg134780
2011-04-29 16:04:00 sable set messages: + msg134777
2011-04-29 15:45:40 pitrou set messages: + msg134775
2011-04-29 15:30:13 sable set messages: + msg134774
2011-04-26 20:46:05 neologix set messages: + msg134495
2011-04-26 19:21:56 pitrou set messages: + msg134491
2011-04-26 19:16:10 flub set messages: + msg134489
2011-04-26 17:47:58 neologix set messages: + msg134485
2011-04-26 14:52:16 sable set messages: + msg134470
2011-04-24 11:25:11 neologix set nosy: + neologix
    messages: + msg134330
2010-10-18 13:28:07 flub set nosy: + flub
2010-09-04 23:48:58 pitrou set messages: + msg115620
2010-07-23 09:38:21 sable set messages: + msg111255
2010-07-20 13:32:28 BreamoreBoy set nosy: + BreamoreBoy
    messages: + msg110893
    versions: + Python 3.2
2008-09-10 16:33:03 sable set files: + patch_dlmalloc3.diff
    messages: + msg72975
2008-09-09 15:59:06 sable set files: + patch_dlmalloc2.diff
    messages: + msg72876
2008-09-08 03:26:34 tim.peters set messages: + msg72762
2008-09-08 03:21:22 loewis set messages: + msg72761
2008-09-08 00:52:07 tim.peters set nosy: + tim.peters
    messages: + msg72758
2008-09-07 19:45:13 loewis set messages: + msg72750
2008-09-03 10:28:09 sable set messages: + msg72382
2008-08-09 17:53:52 pitrou set messages: + msg70945
2008-08-09 17:25:56 loewis set messages: + msg70940
2008-08-09 10:57:09 pitrou set messages: + msg70929
2008-08-08 22:46:50 loewis set nosy: + loewis
    messages: + msg70920
2008-08-08 19:11:07 pitrou set priority: normal
    nosy: + pitrou
    messages: + msg70908
    components: + Interpreter Core
    versions: + Python 3.1, Python 2.7
2008-08-08 10:15:35 sable set files: + patch_dlmalloc.diff
    keywords: + patch
2008-08-08 10:13:45 sable set files: + customized_malloc_AIX.pdf
2008-08-08 10:11:58 sable create