This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Title: pymalloc is not aware of Memory Tagging Extension (MTE) and crashes
Type: Stage:
Components: Library (Lib) Versions: Python 3.9
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: christian.heimes, ggardet, nascheme, tim.peters, vstinner
Priority: normal Keywords:

Created on 2021-03-22 13:29 by ggardet, last changed 2022-04-11 14:59 by admin.

Messages (13)
msg389312 - (view) Author: (ggardet) Date: 2021-03-22 13:29
When Memory Tagging Extension (MTE) [0] is enabled on aarch64, pymalloc makes programs crash.
I noticed it while trying to use GDB with MTE enabled in user-space [1]: gdb crashed on start-up. Rebuilding Python (3.8) with the '--without-pymalloc' option works around the problem.
For glibc, you need version 2.33 or later, built with the '--enable-memory-tagging' option. I guess patches similar to glibc's are required for pymalloc.

msg389313 - (view) Author: (ggardet) Date: 2021-03-22 13:39
Is there any runtime option (env variable or something else) to use glibc malloc instead of pymalloc? Or is the choice made at compile time immutable?
msg389315 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2021-03-22 13:45
pymalloc is a compile-time option. The configure flag sets or unsets WITH_PYMALLOC. The define is then used to change the internal allocator.

The flag may also affect the ABI of Python and affect binary wheels. It might break compatibility with manylinux wheels for aarch64.

Victor, please take a look.
msg389316 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-22 13:53
> When Memory Tagging Extension (MTE) [0] is enabled on aarch64, pymalloc makes programs crash.

Would you mind elaborating on what exactly crashes?

Can you please test if that fixes your issue?
msg389318 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-22 13:54
A workaround is to disable pymalloc at runtime using PYTHONMALLOC=malloc environment variable. But it makes Python 10-20% slower.
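For reference, the workaround can also be exercised from a script; a minimal sketch (the child command is illustrative):

```python
import os
import subprocess
import sys

# Launch a child interpreter with pymalloc disabled via PYTHONMALLOC.
# The variable must be set before interpreter startup, hence the
# subprocess rather than changing os.environ in the running process.
out = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    env={**os.environ, "PYTHONMALLOC": "malloc"},
    capture_output=True,
    text=True,
)
print(out.stdout.strip())   # ok
```

On a debug build, `PYTHONMALLOC=malloc_debug` keeps the debug hooks while still bypassing pymalloc.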
msg389686 - (view) Author: (ggardet) Date: 2021-03-29 11:59
With PYTHONMALLOC=malloc, gdb is not crashing anymore. Thanks for the workaround.

pymalloc is not aware of MTE, so a segfault occurs on any access to the memory once the tag check fails.
msg389693 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-03-29 12:28
Can you please test if that fixes your issue?
msg389783 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2021-03-30 02:55
I've merged PR 14474, so you can just test with an up-to-date "main" branch and see if that fixes the problem.  I would expect it to, since with the radix-tree arena tracking, no unsanitary memory behaviour remains.
msg389928 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-03-31 19:11
I'm skeptical ;-) If MTE is actually being used, system software assigns "random" values to 4 of the higher-order bits. When obmalloc punts to the system malloc, presumably those bits will be randomized in the addresses returned by malloc. Then it's just not possible that obmalloc's

    assert(HIGH_BITS(p) == HIGH_BITS(&arena_map_root));

can always succeed - we're insisting there that _all_ the high-order bits are exactly the same as in the `&arena_map_root` file static.  If `p` was actually obtained from the system `malloc()`, it should fail about 15 times out of 16, regardless of which of the 16 bit patterns the platform C assigns to &arena_map_root.

But, of course, that failure would only be seen in a debug build.
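To make the 15-out-of-16 argument concrete, here is a toy simulation; the tag placement (bits 56-59) and `high_bits` are modelling assumptions, not obmalloc's actual macros:

```python
import random

TAG_SHIFT = 56   # assume MTE stores the 4-bit tag in bits 56-59

def tag_pointer(addr):
    # Model the system assigning a "random" tag to each pointer.
    return addr | (random.randrange(16) << TAG_SHIFT)

def high_bits(p):
    # Stand-in for obmalloc's HIGH_BITS(): everything above bit 47.
    return p >> 48

root = tag_pointer(0x7F00_0000_0000)      # &arena_map_root, tagged once
trials = 100_000
passes = sum(
    high_bits(tag_pointer(0x7F00_0000_2000)) == high_bits(root)
    for _ in range(trials)
)
print(passes / trials)   # ~0.0625: the assert would pass about 1 time in 16
```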
msg389943 - (view) Author: Neil Schemenauer (nascheme) * (Python committer) Date: 2021-04-01 00:56
> If MTE is actually being used, system software assigns "random" values to 4 of the higher-order bits.

Oh, interesting.

Two ideas about handling that: we could change our assertion check to be different on ARM platforms that we know have a certain size physical address space.  Probably just turn off that high-bits check.

Second idea, we could change the radix tree to not assume high address bits are unused.  That's trickier to do without performance or memory usage degradations.  I have a work-in-progress patch that adds a cache on top of the radix tree lookup.  It looks like that cache can be made to have a pretty high hit rate.  Based on a small amount of testing, the radix tree lookup for address_in_range() only happens about 1% of the time.  If that approach works, we could add another layer to the tree and handle the full 64-bit address space.

Based on my WIP testing, my benchmark showed about equal performance with and without the cache.  So, no benefit to offset the increase in code complexity.  Handling the MTE high-bits trick might be enough to justify the cache addition.
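The WIP patch itself isn't shown in this thread; purely as an illustration of the cache idea, a direct-mapped cache in front of a slow lookup might look like this (all names hypothetical, and the split-pool edge cases discussed later in the thread are ignored):

```python
# Toy direct-mapped cache keyed on arena-sized address blocks.
ARENA_BITS = 20
CACHE_SIZE = 64

class CachedInRange:
    def __init__(self, radix_lookup):
        self.radix_lookup = radix_lookup      # the slow radix-tree walk
        self.keys = [None] * CACHE_SIZE
        self.vals = [False] * CACHE_SIZE

    def address_in_range(self, addr):
        key = addr >> ARENA_BITS              # one cache entry per 1 MiB block
        slot = key & (CACHE_SIZE - 1)
        if self.keys[slot] != key:            # miss: fall back to the tree
            self.keys[slot] = key
            self.vals[slot] = self.radix_lookup(addr)
        return self.vals[slot]

# A fake radix lookup that counts how often the slow path runs:
calls = 0
def fake_lookup(addr):
    global calls
    calls += 1
    return addr < 2**21        # pretend the low 2 MiB belongs to obmalloc

c = CachedInRange(fake_lookup)
for _ in range(1000):
    c.address_in_range(0x12345)   # repeated lookups within one block
print(calls)   # 1: the radix walk ran only on the first call
```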
msg390135 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-04-03 19:08
Can't really know without a system to try it on, but my best guess is that these asserts are the only thing that will fail with tagging enabled. The obvious "fix" is indeed just to skip them on a platform with tagging enabled. They're meant as a sanity check that only 48 bits are really used for addresses. Which remains true even when tagging is enabled - the tag bits are part of the _pointer_ on ARM, but not part of the address.

Taking tagging seriously instead would be a significant project, relying on platform-specific instructions. For a start, obmalloc would have to generate a random 4-bit integer for each object it returns, plug that into 4 specific "high order" bits of the pointer it returns, and tell the OS to associate those 4 bits with each 16-byte chunk of the object's space.  mmap()-like calls would also need to be changed, to tell the OS to enable tag checking on the memory slice returned.

While caching may or may not speed things up, I'm not seeing how it _could_ help move to 64-bit addresses.  As is, the tree needs 8 bytes of bottom-node space for each arena in use, and that's independent of how many address bits there are (it only depends on the arena granularity).  I think that could be cut by a factor of 4 by keeping track of arena pool (instead of byte) boundaries in the bottom nodes, meaning that, with the _current_ settings, we'd only need to keep track of the 64=2^6 possible pool addresses in an arena, instead of the 2^20 possible byte addresses.  6 bits fits fine in a signed 8-bit int (but we need a 32-bit int now to hold the 2^20 possible byte addresses in an arena).

So the clearest way to ease the space burden of keeping track of truly expansive address spaces is to boost arena size. And, if the tree bottom changed to keep track of pool (instead of byte) addresses, possibly boost pool size too.
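The space figures above follow from the current settings (20 arena bits and 14 pool bits):

```python
# Per-arena address counts under the current obmalloc settings.
ARENA_BITS = 20   # 1 MiB arenas
POOL_BITS = 14    # 16 KiB pools

byte_addresses = 2 ** ARENA_BITS                # byte offsets per arena
pool_addresses = 2 ** (ARENA_BITS - POOL_BITS)  # pool offsets per arena

print(byte_addresses)   # 1048576 -> needs a 32-bit int per bottom-node entry
print(pool_addresses)   # 64      -> fits in a signed 8-bit int
```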
msg390143 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-04-03 21:43
BTW, your cache WIP partly moves to tracking pool (instead of byte) addresses, but any such attempt faces a subtlety:  it's not necessarily the case that a pool is entirely "owned" by obmalloc or by the system.  It can be split.

To be concrete, picture a decimal machine with arenas at 10,000 byte boundaries and pools at 100-byte boundaries, with the system malloc returning 20-byte aligned addresses.

If obmalloc gets an arena starting at address 20,020, the pool range(20000, 20100) has its first 20 bytes owned by the system, but its last 80 bytes owned by obmalloc. Pass 20050 to the PR's address_in_range, and it will say it's owned by the system (because its _pool_ address, 20000, is). But it's actually owned by obmalloc.

I'm not sure it matters, but it's easy to get a headache trying to out-think it ;-) In that case, obmalloc simply ignores the partial pool at the start, and the first address it can pass out is 20100. So it would never be valid for free() or realloc() to get 20050 as an input.

On the other end, the arena would end at byte 20020 + 10000 - 1 = 30019. This seems worse! If 30040 were passed in, that's a system address, but its _pool_ address is 30000, which obmalloc does control.

That reminds me now why the current scheme tracks byte addresses instead ;-) It appears it would require more tricks to deal correctly in all cases when system-supplied arenas aren't necessarily aligned to pool addresses (which was never a consideration in the old scheme, since a pool was at largest a system page, and all mmap()-like functions (used to get an arena) in real life return an address at worst page-aligned).

Or we'd need to find ways to force mmap() (across different systems' spellings) to return a pool-aligned address for arenas to begin with.
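The decimal-machine example can be checked numerically; this sketch just replays the message's numbers:

```python
# Replaying the decimal machine: 10,000-byte arenas, 100-byte pools,
# and the system malloc hands obmalloc an arena at address 20,020.
ARENA_SIZE = 10_000
POOL_SIZE = 100

arena_lo = 20_020
arena_hi = arena_lo + ARENA_SIZE        # one past the last byte: 30,020

def pool_of(addr):
    return addr - addr % POOL_SIZE

# Both end pools are split between obmalloc and the system:
assert pool_of(arena_lo) == 20_000      # front pool starts before the arena
assert pool_of(arena_hi - 1) == 30_000  # back pool extends past the arena

# 20050 shares a pool with system bytes 20000-20019, yet lies inside
# the arena, so a check keyed on pool 20000 misclassifies it:
assert arena_lo <= 20_050 < arena_hi

# 30040 shares a pool with obmalloc bytes 30000-30019, yet lies
# outside the arena:
assert not (arena_lo <= 30_040 < arena_hi)
print("both split-pool cases confirmed")
```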
msg390206 - (view) Author: Tim Peters (tim.peters) * (Python committer) Date: 2021-04-04 18:17
I think it's time to change what address_in_range() tries to answer. It currently gives a precise answer to "is this byte address in a region obmalloc owns?". But that's stronger than what it needs to do its job: the real question is "is this an address that obmalloc could return from its malloc() or realloc() functions?". Because it's only used by free() and realloc(), and they only care where the address they're _passed_ came from.

The difference is when an arena is not pool-aligned. Oddball chunks outside of full pools, at both ends of the arena, are never returned by obmalloc then.

Instead the tree could be changed to record the starting addresses of the _full_ pools obmalloc controls.  Those contain the only addresses obmalloc will ever pass out (from malloc() or realloc()).

Like so, where p is the arena base address. This hasn't been tested, and may contain typos or logical errors. I'm far more interested in the latter for now ;-)

ideal1 = p & ~ARENA_SIZE_MASK # round down to ideal
ideal2 = ideal1 + ARENA_SIZE
offset = p - ideal1
b1 = bottom_node(ideal1)
b2 = bottom_node(ideal2)
if not offset:
    # already ideal
    b1.hi = -1
    assert b2.lo == 0
elif offset & POOL_SIZE_MASK == 0:
    # arena is pool-aligned
    b1.hi = b2.lo = offset
else:
    # Not pool-aligned.
    # obmalloc won't pass out an address higher than in
    # the last full pool.
    # Round offset down to next lower pool address.
    offset &= ~POOL_SIZE_MASK
    b2.lo = offset
    # And won't pass out an address lower than in the
    # first full pool.
    offset += POOL_SIZE # same as rounding original offset up
    # That's almost right for b1.hi, but one exception: if
    # offset is ARENA_SIZE now, the first full pool is at the
    # start of ideal2, and no feasible address is in ideal1.
    assert offset <= ARENA_SIZE
    b1.hi = offset & ARENA_SIZE_MASK

Note that, except for the oddball -1, everything stored in a bottom node is a pool address then, and so there's no need to store their all-zero lowest POOL_BITS bits. .lo and .hi can be shrunk to a signed 8-bit int with the current settings (20 arena bits and 14 pool bits).

And caching pool addresses wouldn't have any obscure failing end-cases either: address_in_range() could just as well be passed a pool address to begin with. It would only care about pool addresses, not byte addresses.
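As a sanity check, the full-pool rule applied to the earlier decimal-machine example resolves both awkward end cases (a sketch with made-up helper names, not the real patch):

```python
# Full-pool variant of address_in_range() on the decimal machine:
# record only the full pools inside the arena at 20,020.
ARENA_SIZE = 10_000
POOL_SIZE = 100

arena_lo = 20_020
arena_hi = arena_lo + ARENA_SIZE                    # 30,020

# Round the split ends inward to pool boundaries:
first_full = -(-arena_lo // POOL_SIZE) * POOL_SIZE  # round up:   20,100
end_full = arena_hi // POOL_SIZE * POOL_SIZE        # round down: 30,000

def could_be_from_obmalloc(addr):
    # True only for addresses obmalloc could have returned.
    return first_full <= addr < end_full

assert not could_be_from_obmalloc(20_050)  # split front pool: never handed out
assert not could_be_from_obmalloc(30_040)  # system memory past the last full pool
assert could_be_from_obmalloc(20_100)      # first address obmalloc can pass out
print("full-pool rule handles both end cases")
```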
Date                 User              Action  Args
2022-04-11 14:59:43  admin             set     github: 87759
2021-04-04 18:17:33  tim.peters        set     messages: + msg390206
2021-04-03 21:43:16  tim.peters        set     messages: + msg390143
2021-04-03 19:08:25  tim.peters        set     messages: + msg390135
2021-04-01 00:56:53  nascheme          set     messages: + msg389943; versions: + Python 3.9, - Python 3.8
2021-03-31 19:11:29  tim.peters        set     nosy: + tim.peters; messages: + msg389928
2021-03-30 02:55:43  nascheme          set     nosy: + nascheme; messages: + msg389783
2021-03-29 12:28:37  vstinner          set     messages: + msg389693
2021-03-29 11:59:55  ggardet           set     messages: + msg389686
2021-03-22 13:54:34  vstinner          set     messages: + msg389318
2021-03-22 13:53:36  vstinner          set     messages: + msg389316
2021-03-22 13:45:46  christian.heimes  set     nosy: + vstinner, christian.heimes; messages: + msg389315
2021-03-22 13:39:46  ggardet           set     messages: + msg389313
2021-03-22 13:29:47  ggardet           create