classification
Title: free(): invalid pointer in list_ass_item() in Python 3.7.3
Type: crash Stage: resolved
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: Nosy List: Howard_Landman, christian.heimes
Priority: normal Keywords:

Created on 2020-07-18 18:38 by Howard_Landman, last changed 2020-08-04 13:47 by Howard_Landman. This issue is now closed.

Messages (13)
msg373911 - (view) Author: Howard A. Landman (Howard_Landman) Date: 2020-07-18 18:38
I have a program qtd.py that reliably dies with free(): invalid pointer after about 13 hours of runtime (on a RPi3B+). This is hard to debug because (1) try:except: will not catch SIGABRT
(2) signal.signal(signal.SIGABRT, sigabrt_handler) also fails to catch SIGABRT
(3) python3-dbg does not work with libraries like RPi.GPIO and spidev()
This happens under both Buster and Stretch, so I don't think it's OS-dependent.

I managed to get a core dump and gdb back trace gives:
warning: core file may not match specified executable file.
[New LWP 10277]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/arm-linux-gnueabihf/libthread_db.so.1".
Core was generated by `python3 qtd.py'.
Program terminated with signal SIGABRT, Aborted.
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50	../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x76c5e308 in __GI_abort () at abort.c:100
#2  0x76cae51c in __libc_message (action=action@entry=do_abort, 
    fmt=<optimized out>) at ../sysdeps/posix/libc_fatal.c:181
#3  0x76cb5044 in malloc_printerr (str=<optimized out>) at malloc.c:5341
#4  0x76cb6d50 in _int_free (av=0x76d917d4 <main_arena>, 
    p=0x43417c <small_ints.lto_priv+72>, have_lock=<optimized out>)
    at malloc.c:4165
#5  0x001b3bb4 in list_ass_item (a=<optimized out>, i=<optimized out>, 
    v=<optimized out>, a=<optimized out>, i=<optimized out>, v=<optimized out>)
    at ../Objects/listobject.c:739
Backtrace stopped: Cannot access memory at address 0x17abeff8

So, as far as I can tell, it looks like list_ass_item() is where the error happens. I'm willing to help debug and maybe even fix this, but is there any way to remove any of the roadblocks (1-3) above?

NOTE: qtd.py will exit unless there is a Texas Instruments TDC7201 chip attached to the RPI's SPI pins.
msg374096 - (view) Author: Howard A. Landman (Howard_Landman) Date: 2020-07-22 16:03
This is not a memory leak problem. "Top" reports VIRT 21768 RES 13516 for the whole run, and Python internal resource reporting says 13564 kb for the whole run. So that's less than 1 kb leaked in 118.6M measurement cycles; most likely zero.
msg374133 - (view) Author: Steve Stagg (stestagg) Date: 2020-07-23 12:34
It's possible that RPi.GPIO (or another C extension) is mis-managing reference counts, resulting in it freeing or trying to free an object at the wrong time.  

If that object is a common value, or the branch that causes the refcount issue is uncommon, it may take many millions of cycles for the problem to occur.
msg374136 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-07-23 14:02
int_free() and small_ints in the traceback likely point to a reference counting issue for small integers in range of -5 to +256. These integers are cached and should never get freed in the middle of a process.
msg374137 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-07-23 14:15
Correction, _int_free() is an internal function of malloc().

small_ints.lto_priv+72 looks like deallocation of an integer object at location +72. This is either int 4 (on 64bit platforms) or int 13 (on 32bit platforms):

>>> NSMALLNEGINTS = 5
>>> (4 + NSMALLNEGINTS) * 8
72
>>> (13 + NSMALLNEGINTS) * 4
72

Could you please report the issue with RPi package owner? This looks like a bug in a 3rd party module.
msg374338 - (view) Author: Howard A. Landman (Howard_Landman) Date: 2020-07-26 19:14
After a quick glance at the source code for the spidev library, I think it is unlikely but not impossible to be the home for the bug. It does do malloc() and free(), but only for data that is greater than 256 bytes. Short tx and rx data is kept in static local buffers. Also, these calls do not match the partial stack trace I got. There is a small amount of allocating and deallocating Python objects, however, including calls to PyList_New(), PyList_SET_ITEM(), and Py_TYPE(self)->tp_free((PyObject *)self), so it's possible that the bug is buried under one of those.
msg374423 - (view) Author: Howard A. Landman (Howard_Landman) Date: 2020-07-27 20:39
I'm running under 32-bit Raspbian, so let's assume the magic number is 13. There are only two places in my own code where the number 13 appears:

(1) My result_list is 14 items long, i.e. 0 to 13. Relevant code from qtd.py:
cum_results = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
...
while batches != 0:
    ...
    result_list = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    for m in range(ITERS):
        ...
        result = tdc.measure(...)
        result_list[result] += 1
        ...
    for i in range(len(result_list)):
        cum_results[i] += result_list[i]

I notice that result_list is getting thrown away each cycle, and thus must be garbage-collected. I could try changing that and see if it has any effect.

(2) in the tdc7201 library, 13 occurs as a possible (error) result code from measure(). However several of the failing runs had zero errors of this class, which means the code with 13 in it never got executed even once. I am not using pin 13 of the RPi's header, so I never send that number to RPi.GPIO. GPIO13 is pin 33 of the header, and I am not using that pin either, so I don't think RPi.GPIO is translating one of my pin number arguments to 13 internally. (But I should check Broadcom Mode numbers.)
msg374704 - (view) Author: Howard A. Landman (Howard_Landman) Date: 2020-08-03 00:18
Made some progress. By running it under gdb and breaking in malloc_printerr(), I got a better stack trace:

Breakpoint 1, malloc_printerr (str=0x76e028f8 "free(): invalid pointer")
    at malloc.c:5341
5341	malloc.c: No such file or directory.
(gdb) bt
#0  malloc_printerr (str=0x76e028f8 "free(): invalid pointer") at malloc.c:5341
#1  0x76d44d50 in _int_free (av=0x76e1f7d4 <main_arena>, 
    p=0x43417c <small_ints.lto_priv+72>, have_lock=<optimized out>)
    at malloc.c:4165
#2  0x001b4f40 in list_dealloc (op=0x760ef288) at ../Objects/listobject.c:324
#3  0x001be784 in frame_dealloc (
    f=Frame 0x765fcc30, for file /home/pi/src/QTD/src/tdc7201/tdc7201/__init__.py, line 719, in read_regs24 (i=40, reg=28)) at ../Objects/frameobject.c:470
#4  function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, 
    args=<optimized out>, co=<optimized out>) at ../Objects/call.c:291
#5  _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, 
    nargs=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:408
#6  0x0011f984 in call_function (kwnames=0x0, oparg=<optimized out>, 
    pp_stack=0x7effed40) at ../Python/ceval.c:4554
#7  _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at ../Python/ceval.c:3110
#8  0x0011d63c in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x760e8960, for file /home/pi/src/QTD/src/tdc7201/tdc7201/__init__.py, line 962, in measure (self=<TDC7201(_spi=<SpiDev at remote 0x7618d2f0>, reg1=[130, 66, 0, 7, 255, 255, 6, 64, 0, 1, 65535, 1600, Breakpoint 1, malloc_printerr (str=0x76e028f8 "free(): invalid pointer")
    at malloc.c:5341
5341	malloc.c: No such file or directory.
(gdb) bt
#0  malloc_printerr (str=0x76e028f8 "free(): invalid pointer") at malloc.c:5341
#1  0x76d44d50 in _int_free (av=0x76e1f7d4 <main_arena>, 
    p=0x43417c <small_ints.lto_priv+72>, have_lock=<optimized out>)
    at malloc.c:4165
#2  0x001b4f40 in list_dealloc (op=0x760ef288) at ../Objects/listobject.c:324
#3  0x001be784 in frame_dealloc (
    f=Frame 0x765fcc30, for file /home/pi/src/QTD/src/tdc7201/tdc7201/__init__.py, line 719, in read_regs24 (i=40, reg=28)) at ../Objects/frameobject.c:470
#4  function_code_fastcall (globals=<optimized out>, nargs=<optimized out>, 
    args=<optimized out>, co=<optimized out>) at ../Objects/call.c:291
#5  _PyFunction_FastCallKeywords (func=<optimized out>, stack=<optimized out>, 
    nargs=<optimized out>, kwnames=<optimized out>) at ../Objects/call.c:408
#6  0x0011f984 in call_function (kwnames=0x0, oparg=<optimized out>, 
    pp_stack=0x7effed40) at ../Python/ceval.c:4554
#7  _PyEval_EvalFrameDefault (f=<optimized out>, throwflag=<optimized out>)
    at ../Python/ceval.c:3110
#8  0x0011d63c in PyEval_EvalFrameEx (throwflag=0, 
    f=Frame 0x760e8960, for file /home/pi/src/QTD/src/tdc7201/tdc7201/__init__.py, line 962, in measure (self=<TDC7201(_spi=<SpiDev at remote 0x7618d2f0>, reg1=[130, 66, 0, 7, 255, 255, 6, 64, 0, 1, 65535, 1600, 1, None, None, None, 1284, 322, 2371, 0, 0, 0, 0, 0, 0, 0, 0, 2322, 23172, None], chip_select=1, sclk=23, miso=21, mosi=19, cs1=24, cs2=26, enable=12, osc_enable=16, trig1=7, int1=37, trig2=None, int2=None, start=18, stop=22, meas_mode=2, cal_pers=10, cal_count=<float at remote 0x76225e50>, norm_lsb=<float at remote 0x760ee8f0>, tof1=<float at remote 0x760ee900>, tof2=<float at remote 0x760ee920>, tof3=0, tof4=0, tof5=0) at remote 0x76191750>, simulate=True, error_prefix='59835 ', log_file=<_io.TextIOWrapper at remote 0x761df4b0>, cf1=130, timeout=<float at remote 0x76225d00>, n_stop=3, threshold=<float at remote 0x760ee880>, pulse=2))
    at ../Python/ceval.c:3930
1, None, None, None, 1284, 322, 2371, 0, 0, 0, 0, 0, 0, 0, 0, 2322, 23172, None], chip_select=1, sclk=23, miso=21, mosi=19, cs1=24, cs2=26, enable=12, osc_enable=16, trig1=7, int1=37, trig2=None, int2=None, start=18, stop=22, meas_mode=2, cal_pers=10, cal_count=<float at remote 0x76225e50>, norm_lsb=<float at remote 0x760ee8f0>, tof1=<float at remote 0x760ee900>, tof2=<float at remote 0x760ee920>, tof3=0, tof4=0, tof5=0) at remote 0x76191750>, simulate=True, error_prefix='59835 ', log_file=<_io.TextIOWrapper at remote 0x761df4b0>, cf1=130, timeout=<float at remote 0x76225d00>, n_stop=3, threshold=<float at remote 0x760ee880>, pulse=2))
    at ../Python/ceval.c:3930
...

read_regs24 (i=40, reg=28) performs an SPI read of 13 24-bit registers (so 39 bytes); the corresponding Python code is:

    def read_regs24(self):
        """Read all 24-bit chip registers, using auto-increment feature."""
        result24 = self._spi.xfer([self.MINREG24|self._AI,
                                   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                                   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                                   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                                   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
                                   0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])
        #print("AI read 24-bits =", result24)
        #print("length =", len(result24))
        i = 1
        for reg in range(self.MINREG24, self.MAXREG24+1):
            # Data comes in MSB first.
            self.reg1[reg] = (result24[i] << 16) | (result24[i+1] << 8) | result24[i+2]
            i += 3

The zero padding is necessary to make sure that enough SPI clock cycles are sent for the data bytes to get clocked back in. The _AI flag turns on auto-increment, so each byte is from the next address.

So it looks like the bug is either in the spidev library, or in Python deallocation of this frame. I'll try pulling the sent data list outside the method, since it is always the same; that will avoid the list alloc/dealloc, and may even slightly speed up the program. If that fixes the problem, it's in Python; if it doesn't, it's in spidev. Probably.
msg374746 - (view) Author: Howard A. Landman (Howard_Landman) Date: 2020-08-03 17:36
Getting closer to isolating this. A small program that does nothing but call read_regs24() over and over dies with the same error after about 107.4M iterations. So it seems to be definitely in that method. But that only takes a few hours, not half a day. Going to try to separate the SPI operation from the remaining Python code next.
msg374756 - (view) Author: Howard A. Landman (Howard_Landman) Date: 2020-08-03 21:14
It appears to be in the spidev library xfer method, when used for reading. Calling that 107.4M times (using the same code as inside my read_regs24() method) causes the free() error.

Breakpoint 1, malloc_printerr (str=0x76e028f8 "free(): invalid pointer")
    at malloc.c:5341
5341	malloc.c: No such file or directory.
(gdb) bt
#0  malloc_printerr (str=0x76e028f8 "free(): invalid pointer") at malloc.c:5341
#1  0x76d44d50 in _int_free (av=0x76e1f7d4 <main_arena>, 
    p=0x43417c <small_ints.lto_priv+72>, have_lock=<optimized out>) at malloc.c:4165
#2  0x001b4f40 in list_dealloc (op=0x766d0f58) at ../Objects/listobject.c:324
#3  0x7669b660 in ?? ()
   from /usr/lib/python3/dist-packages/spidev.cpython-37m-arm-linux-gnueabihf.so
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Should I be worried about the "corrupt stack" message?
msg374774 - (view) Author: Howard A. Landman (Howard_Landman) Date: 2020-08-04 00:07
OK, this has been filed against the spidev library:
https://github.com/doceme/py-spidev/issues/107

Do you want it closed, or left open until that gets resolved?
msg374797 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2020-08-04 09:38
Yeah, my bet is on SpiDev_xfer(), https://github.com/doceme/py-spidev/blob/master/spidev_module.c#L458
msg374807 - (view) Author: Howard A. Landman (Howard_Landman) Date: 2020-08-04 13:47
As far as we can tell, this is a known Py_DECREF problem with spidev==3.4. Testing on spidev==3.5 has not triggered the bug so far, so it appears to be already fixed.

Under 3.4, changing the list to a tuple did not affect the behavior.
History
Date User Action Args
2020-08-04 13:47:46Howard_Landmansetmessages: + msg374807
2020-08-04 09:38:39christian.heimessetnosy: + christian.heimes
messages: + msg374797
2020-08-04 04:20:35ned.deilysetstatus: open -> closed
stage: resolved
2020-08-04 00:07:27Howard_Landmansetresolution: third party
messages: + msg374774
2020-08-03 21:14:23Howard_Landmansetmessages: + msg374756
2020-08-03 17:36:52Howard_Landmansetmessages: + msg374746
2020-08-03 00:18:34Howard_Landmansetmessages: + msg374704
2020-07-27 20:39:52Howard_Landmansetmessages: + msg374423
2020-07-26 19:14:58Howard_Landmansetnosy: - christian.heimes, stestagg
messages: + msg374338
2020-07-23 14:15:20christian.heimessetmessages: + msg374137
2020-07-23 14:02:32christian.heimessetnosy: + christian.heimes
messages: + msg374136
2020-07-23 12:34:32stestaggsetnosy: + stestagg
messages: + msg374133
2020-07-22 16:03:56Howard_Landmansetmessages: + msg374096
2020-07-18 19:04:44Howard_Landmansethgrepos: - hgrepo389
2020-07-18 18:38:21Howard_Landmancreate