
classification
Title: Test test_maxcontext_exact_arith (_decimal) consumes all memory on AIX
Type: behavior
Stage: resolved
Components: Extension Modules
Versions: Python 3.10

process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: skrah
Nosy List: David.Edelsohn, T.Rex, miss-islington, sanket, skrah
Priority: low
Keywords: patch

Created on 2020-08-13 12:15 by T.Rex, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL       Status  Linked
PR 21890  merged  skrah, 2020-08-15 17:41
PR 21893  merged  miss-islington, 2020-08-15 18:19
Messages (24)
msg375302 - (view) Author: Tony Reix (T.Rex) Date: 2020-08-13 12:15
Python master of 2020/08/11

Test test_maxcontext_exact_arith (test.test_decimal.CWhitebox) checks that Python correctly handles a case where an allocation of 421052631578947376 bytes is attempted.

maxcontext = Context(prec=C.MAX_PREC, Emin=C.MIN_EMIN, Emax=C.MAX_EMAX)

Both on Linux and AIX, we have:
Context(prec=999999999999999999,
        rounding=ROUND_HALF_EVEN,
        Emin=-999999999999999999,
        Emax=999999999999999999,
        capitals=1, clamp=0, flags=[],
        traps=[InvalidOperation, DivisionByZero, Overflow])

The test appears in:
  Lib/test/test_decimal.py
    5665     def test_maxcontext_exact_arith(self):
and on AIX the issue appears exactly at:
                self.assertEqual(Decimal(4) / 2, 2)
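
A minimal standalone sketch of the same code path (my adaptation, not the test itself; it assumes the 64-bit C implementation of decimal, where MAX_PREC is 999999999999999999):

from decimal import Context, Decimal, localcontext, MAX_PREC, MAX_EMAX, MIN_EMIN

maxcontext = Context(prec=MAX_PREC, Emin=MIN_EMIN, Emax=MAX_EMAX)

with localcontext(maxcontext):
    # Exact division: libmpdec first requests a working buffer sized for the
    # full precision (the ~4.2e17-byte allocation reported above).  On Linux
    # that malloc() returns NULL and the exact result 2 is still produced;
    # on AIX with "ulimit -d unlimited" the allocation is attempted for real.
    print(Decimal(4) / 2)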

The issue is due to this code in Objects/obmalloc.c:
void *
PyMem_RawMalloc(size_t size)
{
    /*
     * Limit ourselves to PY_SSIZE_T_MAX bytes to prevent security holes.
     * Most python internals blindly use a signed Py_ssize_t to track
     * things without checking for overflows or negatives.
     * As size_t is unsigned, checking for size < 0 is not required.
     */
    if (size > (size_t)PY_SSIZE_T_MAX)
        return NULL;
    return _PyMem_Raw.malloc(_PyMem_Raw.ctx, size);
}

Both on Fedora/x86_64 and AIX, we have:
 size:            421052631578947376
 PY_SSIZE_T_MAX: 9223372036854775807
thus: size < PY_SSIZE_T_MAX and _PyMem_Raw.malloc() is called.

On Linux, malloc() returns a NULL pointer in that case, Python handles this, and the test passes.
On AIX, however, malloc() tries to allocate the requested memory, and the OS appears to hang until the Python process is killed by the OS.

Either the size check is too permissive, or PY_SSIZE_T_MAX is not correctly computed:
./Include/pyport.h :
  /* Largest positive value of type Py_ssize_t. */
  #define PY_SSIZE_T_MAX ((Py_ssize_t)(((size_t)-1)>>1))
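
For what it's worth, the same constant is visible from Python as sys.maxsize, and it matches the value shown above on both systems:

import sys

# sys.maxsize is defined from PY_SSIZE_T_MAX; it prints 9223372036854775807
# on both the AIX and the Fedora/x86_64 64-bit builds discussed here.
print(sys.maxsize)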

Anyway, the following code, added in PyMem_RawMalloc() before the call to _PyMem_Raw.malloc() (which in turn calls malloc()):
    if (size == 421052631578947376) {
        printf("TONY: 421052631578947376: --> PY_SSIZE_T_MAX: %ld \n", PY_SSIZE_T_MAX);
        return NULL;
    }
does fix the issue on AIX.
However, this is simply a way to show where the issue can be fixed.
A proper solution (fixing the size < PY_SSIZE_T_MAX check) is needed.
msg375303 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-08-13 12:39
Could be a duplicate/related to the problem described in https://bugs.python.org/issue39576
msg375304 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2020-08-13 12:43
Unfortunately, I do not understand why you conclude that the problem is in PyMem_RawMalloc. That code seems correct. Could you provide evidence that the value of PY_SSIZE_T_MAX is miscalculated in AIX? Alternatively, could you elaborate on what makes you believe that the problem is in PyMem_RawMalloc?
msg375305 - (view) Author: Tony Reix (T.Rex) Date: 2020-08-13 13:12
Some more explanations.

On AIX, the memory is controlled by the ulimit command.
"Global memory" comprises the physical memory and the paging space, associated with the Data Segment.

By default, both Memory and Data Segment are limited:
# ulimit -a
data seg size           (kbytes, -d) 131072
max memory size         (kbytes, -m) 32768
...

However, it is possible to remove the limit, like:
# ulimit -d unlimited

Now, when the "data seg size" is limited, the malloc() routine checks if enough memory/paging-space are available, and it immediately returns a NULL pointer.

But, when the "data seg size" is unlimited, the malloc() routine first tries to allocate and quickly consumes the paging space, which is much slower than acquiring memory since it consumes disk space. And it nearly hangs the OS. Thus, in that case, it does NOT check if enough memory of data segments are available. Bad.

So, this issue appears on AIX only if we have:
# ulimit -d unlimited

Anyway, the test:
    if (size > (size_t)PY_SSIZE_T_MAX)
in:
    Objects/obmalloc.c: PyMem_RawMalloc()
seems odd to me, since the largest size ever requested is still lower than PY_SSIZE_T_MAX.
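
The difference can also be observed independently of Python's allocator by calling the platform malloc() directly, e.g. with ctypes (a sketch using the size quoted above; careful: on AIX with "ulimit -d unlimited" this may itself start consuming paging space):

import ctypes, ctypes.util

libc = ctypes.CDLL(ctypes.util.find_library("c"))
libc.malloc.restype = ctypes.c_void_p
libc.malloc.argtypes = [ctypes.c_size_t]
libc.free.argtypes = [ctypes.c_void_p]

# Linux refuses this request and returns NULL; AIX with an unlimited
# data segment starts committing memory and paging space instead.
p = libc.malloc(421052631578947376)
print("malloc ->", "NULL" if not p else hex(p))
if p:
    libc.free(p)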
msg375306 - (view) Author: Tony Reix (T.Rex) Date: 2020-08-13 13:18
Hi Pablo,
I'm only surprised that the maximum size generated in the test is always lower than PY_SSIZE_T_MAX. And this is the case on both AIX and Linux, which compute the same values.

On AIX, it appears (I've just discovered this now) that malloc() does not ALWAYS check that there is enough memory available before starting to claim memory (and thus paging space). This happens when the data segment size is unlimited.

On Linux/Fedora, I had no limit either, but it behaves differently: malloc() always checks that the request can be satisfied.
msg375309 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-13 13:57
We need more information.

Is this 64-bit AIX?

How much physical memory does the machine have?

Linux also has over-allocation and the default for ulimit is unlimited.
But it does not attempt to over-allocate such an outrageous amount of
memory.

Neither do FreeBSD, Windows, etc.
msg375310 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-13 14:16
Also, perhaps build Python with -bmaxdata?

https://www.enterprisedb.com/edb-docs/d/postgresql/reference/manual/11.1/installation-platform-notes.html
msg375313 - (view) Author: Tony Reix (T.Rex) Date: 2020-08-13 14:57
Is it 64-bit AIX? Yes, AIX has been 64-bit by default for ages, and it runs 32-bit applications as well as 64-bit applications.

The experiments were done with 64bit Python executables on both AIX and Linux.

The AIX machine has 16GB Memory and 16GB Paging Space.

The Linux Fedora 32/x86_64 machine has 16GB memory and 8269820 KB of paging space (swapon -s).

Yes, I agree that the behavior of AIX malloc() under "ulimit -d unlimited" is... surprising. And the malloc() manual does not mention this.

Anyway, was the test   if (size > (size_t)PY_SSIZE_T_MAX)   meant to prevent calling malloc() with such a huge size? If so, it does not work here.
msg375316 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-13 16:51
The test (size > (size_t)PY_SSIZE_T_MAX) has nothing to do with it. Within Python, most sizes are ssize_t, so a value larger than SSIZE_MAX is suspicious.


AIX is an unsupported platform.

Realistically, if people want AIX to be supported, someone has to give core devs full ssh access to an AIX system.

Perhaps I'll just skip this test on AIX.
msg375317 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2020-08-13 16:53
Core developers have full access to AIX system for the asking.  Back to you, Stefan.
msg375371 - (view) Author: Tony Reix (T.Rex) Date: 2020-08-14 05:22
I forgot to say that this behavior was not present in stable version 3.8.5 . Sorry.

On 2 machines AIX 7.2, testing Python 3.8.5 with:
+ cd /opt/freeware/src/packages/BUILD/Python-3.8.5
+ ulimit -d unlimited
+ ulimit -m unlimited
+ ulimit -s unlimited
+ export LIBPATH=/opt/freeware/src/packages/BUILD/Python-3.8.5/64bit:/usr/lib64:/usr/lib:/opt/lib
+ export PYTHONPATH=/opt/freeware/src/packages/BUILD/Python-3.8.5/64bit/Modules
+ ./python Lib/test/regrtest.py -v test_decimal
...
gave:

507 tests in 227 items.
507 passed and 0 failed.
Test passed.

So, this issue with v3.10 (master) appeared to me to be a regression. However, after hours of debugging the issue, I forgot to say so in this report, sorry.

(Previously, I used limits of at most 4GB for -d, -m, and -s. However, that became a problem when running the tests with the Python test option -M12Gb, which needs 12GB or more of my 16GB machine in order to run a large part of the Python big-memory tests. So I removed the limits on these three resources, with no problem at all under version 3.8.5.)
msg375378 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-14 08:22
> Core developers have full access to AIX system for the asking.  Back to you, Stefan.

That sounds great. Can we contact you directly, or have I missed an earlier announcement from someone else giving out AIX access?

Or are you working on it and I misunderstand the idiomatic English? :)
msg375380 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-14 08:29
> So, this issue with v3.10 (master) appeared to me as a regression.

I understand that from your point of view it appears as a regression.

However, quoting the C standard, 7.20.3 Memory management functions:

"The pointer returned points to the start (lowest byte
address) of the allocated space. If the space cannot be allocated, a null pointer is
returned."


So, for a system that is not currently officially supported, I don't
consider it my problem if AIX thinks the space can be allocated.

But as I said, I can disable that test on AIX.  If I get AIX access,
I can look at this more.
msg375399 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2020-08-14 12:41
AIX systems at OSUOSL have been part of the GNU Compile Farm for a decade. It also is the system on which I have been running the Python Buildbot.  Any Compile Farm user has access to AIX.

https://cfarm.tetaneutral.net/users/new/

Also, IBM is in the process of creating a dedicated VM at OSUOSL for the Python Buildbot to support more builders. We probably can provide access to Python core members to log in to that system as well.
msg375462 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-15 12:51
Thank you, David!
  
Now that I can test on AIX, I can confirm that the data limit is the
culprit:

libmpdec deliberately calls malloc(52631578947368422ULL) in the
maxprec tests, which is supposed to fail, but succeeds.

However, instead of freezing the machine, the process gets a proper
SIGKILL almost instantly.



As I suggested earlier, using -bmaxdata prevents this from happening
and the test passes.

./configure CC=xlc AR="ar -X64" CFLAGS="-q64 -qmaxmem=70000 -qlanglvl=extc99 -qcpluscmt -qkeyword=inline -qalias=ansi -qthreaded -D_THREAD_SAFE -D__VACPP_MULTI__" LDFLAGS="-L/usr/lib64 -q64 -Wl,-bmaxdata:0x800000000"


I have not figured out a similar gcc option yet, but only searched for 5 min.



The question now is: Since this is expected behavior on AIX, and xlc (and probably
gcc as well) have limit command line switches, do we need to disable the test?



I tried to set rlimit in the test, which works on Linux but is broken on AIX:

test test_decimal failed -- Traceback (most recent call last):
  File "/home/skrah/cpython/Lib/test/test_decimal.py", line 5684, in test_maxcontext_exact_arith
    resource.setrlimit(resource.RLIMIT_DATA, (8000000, hardlimit))
OSError: [Errno 14] Bad address


So disabling the test would be the best option, unless we want to educate users
about -bmaxdata.
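
For reference, the rlimit approach in sketch form (illustrative helper name, not the actual test code; per the above it works on Linux but raises OSError on AIX):

import resource

def run_with_data_limit(limit_bytes, func, *args, **kwargs):
    # Temporarily cap the data segment (soft limit only), run func, restore.
    soft, hard = resource.getrlimit(resource.RLIMIT_DATA)
    resource.setrlimit(resource.RLIMIT_DATA, (limit_bytes, hard))
    try:
        return func(*args, **kwargs)
    finally:
        resource.setrlimit(resource.RLIMIT_DATA, (soft, hard))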
msg375464 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2020-08-15 13:49
AIX uses a "late" memory allocation scheme by default.  If the test wants to malloc(52631578947368422ULL) and intends it to fail, it should run with the AIX

$ export PSALLOC=early

environment variable set. This matters more than all of the other maxdata changes.

Separate from all of this, you are configuring as 64 bit (-q64).  -qmaxmem affects the compiler optimization.

-Wl,-bmaxdata:0x800000000 is a GCC command line option.  "-Wl," passes the appended flag to the linker.  So somehow you're using GCC to invoke the linker, although building with XLC.  XLC knows about -bmaxdata directly.

On the other hand, -bmaxdata behaves differently in 32 bit mode and 64 bit mode.  In 32 bit mode, it increases the heap size from the default 256MB.  In 64 bit mode, it sets a guaranteed maximum size for the heap.

So I think that -bmaxdata may be helping, but not for the reason that you believe.  -Wl,-bmaxdata:0x80000000 may work, although I don't understand how that correctly interacts with XLC.  If you truly are running in 64 bit mode, then -bmaxdata has an effect like PSALLOC.

I sort of think that the solution desired for the testcase is PSALLOC=early to match traditional Unix/Linux malloc() behavior.
msg375468 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-15 15:06
> -qmaxmem affects the compiler optimization.

I know, that's just from the Python README.AIX.  I didn't expect it to have any influence.


> -Wl,-bmaxdata:0x800000000 is a GCC command line option.

That is indeed surprising. Linking is prepared by a script:

$ ./Modules/makexp_aix Modules/python.exp . libpython3.10.a;  xlc -L/usr/lib64 -q64 -Wl,-bmaxdata:0x800000000    -Wl,-bE:Modules/python.exp -lld -o python Programs/python.o libpython3.10.a -lintl -ldl  -lpthread -lm   -lm

The xlc command runs without warnings or errors.



Without Wl,-bmaxdata:0x800000000
================================

$ ./python -m test -uall test_decimal
0:00:00 Run tests sequentially
0:00:00 [1/1] test_decimal
Killed


With Wl,-bmaxdata:0x800000000
=============================

$ ./python -m test -uall test_decimal
0:00:00 Run tests sequentially
0:00:00 [1/1] test_decimal

== Tests result: SUCCESS ==

1 test OK.

Total duration: 17.3 sec
Tests result: SUCCESS


> On the other hand, -bmaxdata behaves differently in 32 bit mode and 64 bit mode.  In 32 bit mode, it increases the heap size from the default 256MB.  In 64 bit mode, it sets a guaranteed maximum size for the heap.

Yes, that's what I expected. The test only allocates that much memory
for 64-bit builds.  The workaround only needs to be enabled for 64-bit.

So a memory softlimit, same as e.g. djb uses for qmail with his
softlimit program, is exactly what I was looking for.


> I sort of think that the solution desired for the testcase is PSALLOC=early to match traditional Unix/Linux malloc() behavior.

I can try that, but our test suite might complain about the environment
being modified.
msg375480 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-15 17:05
To recap for people who find this: The problem occurs because of AIX's
extreme over-allocation and is specific to the 64-bit build.

Workarounds:

  1) Something like  ulimit -d 8000000.

  2) xlc: LDFLAGS="-L/usr/lib64 -q64 -bmaxdata:0x800000000" or
     LDFLAGS="-L/usr/lib64 -q64 -Wl,-bmaxdata:0x800000000".

     The first version seems more natural for xlc.

  3) gcc: LDFLAGS="-L/usr/lib64 -Wl,-bmaxdata:0x800000000 -maix64"



PSALLOC=early works really well for the libmpdec tests but is extremely
slow with the Python interpreter. Also, setting the environment in the
tests does not work.  It looks like it needs to be set before main()
starts.



So I'll just skip that test on AIX. It is not that important, and
the libmpdec maxprec tests, which are way more thorough, all pass
with PSALLOC=early.
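
A guard along these lines (a sketch only; the actual change is in PR 21890):

import sys
import unittest

class CWhitebox(unittest.TestCase):
    @unittest.skipIf(sys.platform.startswith("aix"),
                     "flaky on AIX with the default (unlimited) data segment ulimit")
    def test_maxcontext_exact_arith(self):
        ...  # body as in Lib/test/test_decimal.py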
msg375486 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-15 18:19
New changeset 39dab24621122338d01c1219bb0acc46ba9c9956 by Stefan Krah in branch 'master':
bpo-41540: AIX: skip test that is flaky with a default ulimit. (#21890)
https://github.com/python/cpython/commit/39dab24621122338d01c1219bb0acc46ba9c9956
msg375490 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-15 18:40
New changeset 28bf82661ac9dfaf1b2d0fd0ac98fc0b31cd95bb by Miss Islington (bot) in branch '3.9':
bpo-41540: AIX: skip test that is flaky with a default ulimit. (GH-21890) (#21893)
https://github.com/python/cpython/commit/28bf82661ac9dfaf1b2d0fd0ac98fc0b31cd95bb
msg375537 - (view) Author: Tony Reix (T.Rex) Date: 2020-08-17 09:45
Hi Stefan,
In your message https://bugs.python.org/issue41540#msg375462 , you said:
 "However, instead of freezing the machine, the process gets a proper SIGKILL almost instantly."
That's probably due to a very small paging space on the AIX machine you used for testing. With a very small paging space, the OS quickly reaches the point where paging space and memory are full and tries to kill likely culprits (though it often kills innocent processes, like my bash shell). With a large paging space (the size of memory, or half of it), it takes some time for the OS to consume it, and during that time (many seconds, if not minutes) the OS looks frozen and it takes many seconds or minutes for a "kill -9 PID" to take effect.

About -bmaxdata: I have always used it to extend the default memory of a 32-bit process, but never to reduce the memory available to a 64-bit process, since some users may want to use Python with hundreds of gigabytes of memory. And the python executable used for the tests is the same one that is delivered to users.

About PSALLOC=early, I confirm that it perfectly fixes the issue, so we'll use it when testing Python.
Our customers should use it, or use ulimit -d.
But building the 64-bit python executable with -bmaxdata would reduce what the python process can do.
In the future, we'll probably improve compatibility with Linux so that this (rare) case no longer appears.

BTW, on AIX we have only 12 test cases failing out of about 32,471 run in 64-bit mode, with probably only 5 remaining serious failures, both with GCC and XLC. Not bad. Fewer in 32-bit. We are now studying these few remaining issues and the still-skipped tests.
msg375570 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-17 22:17
> That's probably due to a very small size of the Paging Space of the AIX machine you used for testing.

That is the case: the machine has 160GB of memory and 1GB of paging space. I guess it is configured specifically so that it does not freeze.


> About PSALLOC=early , I confirm that it perfectly fixes the issue.

I'm surprised, because it is unspeakably slow on this machine,
even with the skips in place:

PSALLOC=early time ./python -m test -uall test_decimal

I hit Ctrl-C after 10min, so it takes even longer:

Real   622.11
User   11.63
System 350.12  (!)



The -bmaxdata approach has no speed penalty. Note that you can
use 10 petabytes for the value; it will still prevent this issue.
msg375571 - (view) Author: David Edelsohn (David.Edelsohn) * Date: 2020-08-17 22:23
> About PSALLOC=early , I confirm that it perfectly fixes the issue.

> I'm surprised, because it is unspeakably slow on this machine,

These statements are not contradictory.  No one is suggesting that Python should always run with PSALLOC=early.
msg375573 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2020-08-17 22:35
Well, I misunderstood this sentence then, so it's just for testing. :)

> Our customers should use it or use ulimit -d.


One will also hit this issue when following the MAX_PREC section
in the FAQ, but that is a rare case:

https://docs.python.org/3.10/library/decimal.html#decimal-faq
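
(The FAQ recipe in question sets up a context roughly like this, which with the 64-bit C module is the same configuration that triggers the huge allocation above:)

from decimal import Context, setcontext, MAX_PREC, MAX_EMAX, MIN_EMIN

# Roughly the context suggested in the decimal FAQ's MAX_PREC section.
setcontext(Context(prec=MAX_PREC, Emax=MAX_EMAX, Emin=MIN_EMIN))
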
History
Date                 User            Action  Args
2022-04-11 14:59:34  admin           set     github: 85712
2020-08-17 22:35:10  skrah           set     messages: + msg375573
2020-08-17 22:23:13  David.Edelsohn  set     messages: + msg375571
2020-08-17 22:17:05  skrah           set     messages: + msg375570
2020-08-17 09:45:10  T.Rex           set     messages: + msg375537
2020-08-15 18:46:09  skrah           set     status: open -> closed
                                             assignee: skrah
                                             resolution: fixed
                                             stage: patch review -> resolved
2020-08-15 18:40:17  skrah           set     messages: + msg375490
2020-08-15 18:19:22  miss-islington  set     nosy: + miss-islington
                                             pull_requests: + pull_request21012
2020-08-15 18:19:16  skrah           set     messages: + msg375486
2020-08-15 17:41:58  skrah           set     keywords: + patch
                                             stage: patch review
                                             pull_requests: + pull_request21009
2020-08-15 17:05:38  skrah           set     messages: + msg375480
2020-08-15 15:06:39  skrah           set     messages: + msg375468
2020-08-15 13:49:26  David.Edelsohn  set     messages: + msg375464
2020-08-15 12:51:40  skrah           set     messages: + msg375462
2020-08-14 12:41:07  David.Edelsohn  set     messages: + msg375399
2020-08-14 08:29:03  skrah           set     messages: + msg375380
2020-08-14 08:22:51  skrah           set     messages: + msg375378
2020-08-14 05:22:11  T.Rex           set     messages: + msg375371
2020-08-13 16:53:45  David.Edelsohn  set     messages: + msg375317
2020-08-13 16:51:26  skrah           set     priority: normal -> low
                                             nosy: + David.Edelsohn
                                             messages: + msg375316
                                             components: + Extension Modules, - C API
                                             type: crash -> behavior
2020-08-13 14:57:19  T.Rex           set     messages: + msg375313
2020-08-13 14:16:03  skrah           set     messages: + msg375310
2020-08-13 13:57:32  skrah           set     nosy: + skrah
                                             messages: + msg375309
2020-08-13 13:20:02  sanket          set     nosy: + sanket
2020-08-13 13:18:57  T.Rex           set     messages: + msg375306
2020-08-13 13:12:55  T.Rex           set     nosy: - rhettinger, facundobatista, mark.dickinson, skrah, pablogsal
                                             messages: + msg375305
2020-08-13 12:43:29  pablogsal       set     messages: + msg375304
2020-08-13 12:39:19  pablogsal       set     nosy: + pablogsal
                                             messages: + msg375303
2020-08-13 12:28:27  vstinner        set     nosy: + rhettinger, facundobatista, mark.dickinson
2020-08-13 12:28:13  vstinner        set     nosy: + skrah
2020-08-13 12:15:27  T.Rex           create