Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

android: test_functools hangs on armv7 #71126

Closed
xdegaye mannequin opened this issue May 3, 2016 · 22 comments
Closed

android: test_functools hangs on armv7 #71126

xdegaye mannequin opened this issue May 3, 2016 · 22 comments
Assignees
Labels
3.7 (EOL) end of life build The build process and cross-build tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@xdegaye
Copy link
Mannequin

xdegaye mannequin commented May 3, 2016

BPO 26939
Nosy @rhettinger, @ncoghlan, @pitrou, @vstinner, @xdegaye, @moreati, @yan12125
PRs
  • [Do Not Merge] Convert Misc/NEWS so that it is managed by towncrier #552
  • Files
  • test_output.txt: gdb back trace
  • python_tests_on_arm.txt
  • atomic_explicit.patch
  • build-failure.txt
  • atomic_explicit2.patch
  • pycond_timedwait.patch
  • test_lru_cache_threaded-x86-printf.txt
  • test_lru_cache_threaded-armv7-printf.txt
  • pycond_timedwait_2.patch
  • setswitchinterval.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/xdegaye'
    closed_at = <Date 2016-12-08.11:02:40.986>
    created_at = <Date 2016-05-03.15:37:46.487>
    labels = ['3.7', 'type-bug', 'tests', 'build']
    title = 'android: test_functools hangs on armv7'
    updated_at = <Date 2017-03-31.16:36:11.709>
    user = 'https://github.com/xdegaye'

    bugs.python.org fields:

    activity = <Date 2017-03-31.16:36:11.709>
    actor = 'dstufft'
    assignee = 'xdegaye'
    closed = True
    closed_date = <Date 2016-12-08.11:02:40.986>
    closer = 'xdegaye'
    components = ['Tests', 'Cross-Build']
    creation = <Date 2016-05-03.15:37:46.487>
    creator = 'xdegaye'
    dependencies = []
    files = ['42701', '42750', '42753', '42755', '42756', '42874', '43191', '43192', '43193', '45249']
    hgrepos = []
    issue_num = 26939
    keywords = ['patch']
    message_count = 22.0
    messages = ['264742', '264756', '264922', '264957', '264963', '264965', '264974', '264976', '264981', '264988', '264995', '265003', '265020', '265027', '265701', '265704', '265707', '265721', '267250', '267410', '279588', '282703']
    nosy_count = 9.0
    nosy_names = ['rhettinger', 'ncoghlan', 'pitrou', 'vstinner', 'neologix', 'xdegaye', 'python-dev', 'Alex.Willmer', 'yan12125']
    pr_nums = ['552']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue26939'
    versions = ['Python 3.6', 'Python 3.7']

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 3, 2016

    test_functools hangs on an android emulator running an armv7 system image (but not on x86) at API level 21. The test suite hangs at

    test_lru_cache_threaded (test.test_functools.TestLRUC) ...

    A cause of this problem may be related to the fact that the android arm emulator runs very, very slowly.

    @xdegaye xdegaye mannequin added stdlib Python modules in the Lib dir build The build process and cross-build type-bug An unexpected behavior, bug, or error labels May 3, 2016
    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 3, 2016

    The attached test_output.txt file is the corresponding gdb backtrace[1].

    Interestingly, all the tests (issues bpo-26938, bpo-26939, bpo-26940 and bpo-26941) that hang on arm use sys.setswitchinterval() aggressively.

    [1] The backtrace is interleaved with:

    • Python Exception <type 'exceptions.UnicodeDecodeError'> 'utf8' codec can't decode byte 0xb8 in posit ion 0: invalid start byte:
      that seem to come from libpython, the python that is embeded in gdb to pretty-print gdb output
    • ---Type <return> to continue, or q <return> to quit---
      sorry, had to resort to copy/paste, gdb 'set logging' does not seem to work

    @pitrou
    Copy link
    Member

    pitrou commented May 5, 2016

    Can you post the backtraces of all threads (at least the innermost frames)?
    Also, can you reproduce on an actual armv7 machine? Just wondering if there may be something peculiar in the emulator.

    @yan12125
    Copy link
    Mannequin

    yan12125 mannequin commented May 6, 2016

    FWIW, test_functools passes on ASUS Zenfone 2 Laser ZE500KL with stock ROM (rooted) and my patchset. [1] The CPU is ARM64 and the Python is built as 32-bit ARM binaries. test_threading (bpo-26941) and test_importlib (bpo-26940) pass, too. test_concurrent_futures (bpo-26938) fails with 27 errors, all with "OSError: [Errno 38] Function not implemented", which seems the same as bpo-26924.

    The attached file contains logs of the four tests mentioned here.

    [1] https://github.com/yan12125/python3-android

    @pitrou
    Copy link
    Member

    pitrou commented May 6, 2016

    I suspect this bug is caused by memory ordering differences between x86 and ARM (see https://en.wikipedia.org/wiki/Memory_ordering#Runtime_memory_ordering ), which is why it may not reproduce easily (may depend on the particular CPU implementation and/or various timing subtleties in the tests themselves).

    Xavier / Chi, can you tell if Android has stdatomic.h? See Include/pyatomic.h for the difference it makes.

    @pitrou
    Copy link
    Member

    pitrou commented May 6, 2016

    Xavier, as a strating point, can you try the following patch? (atomic_explicit.patch)
    If it solves the issue, can you also measure the performance drop compared to unpatched (on a non-debug build)? python -m test.pystone should be sufficient.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 6, 2016

    Android has stdatomic.h and HAVE_STD_ATOMIC is defined.

    The build fails with the patch, see the errors in the attached file. This is the native compilation of python, the patches in my build system are also applied to the native build so that they can be checked. I can prevent that if needed.

    @pitrou
    Copy link
    Member

    pitrou commented May 6, 2016

    Oops, sorry, hadn't seen the patch didn't compile. Can you tried with this new patch? (atomic_explicit2.patch)

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 6, 2016

    Still hangs at test_lru_cache_threaded with this last patch :(

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 6, 2016

    Can you post the backtraces of all threads (at least the innermost frames)?

    There are two threads, gdb does not print the backtrace of the other thread. My current gdb setup with a connection to a gdb-server misses some stuff (access to the loader or some libraries I guess) and needs to be completed with the same setup as the one used by the ndk-gdb script (released by google for android applications).

    Also, can you reproduce on an actual armv7 machine? Just wondering if there may be something peculiar in the emulator.

    I don't have an armv7 device that can be used for testing.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 6, 2016

    The strace of test_lru_cache_threaded:

    ...
    gettimeofday({1462553238, 580860}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 581276772}) = 0
    gettimeofday({1462553238, 581672}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 582078767}) = 0
    gettimeofday({1462553238, 583956}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 585865959}) = 0
    gettimeofday({1462553238, 587597}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 589381724}) = 0
    gettimeofday({1462553238, 589973}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 591858839}) = 0
    gettimeofday({1462553238, 593003}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 593452750}) = 0
    gettimeofday({1462553238, 593886}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 594364345}) = 0
    gettimeofday({1462553238, 594764}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 595227771}) = 0
    gettimeofday({1462553238, 595626}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 596027590}) = 0
    gettimeofday({1462553238, 596448}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 596848736}) = 0
    gettimeofday({1462553238, 597259}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 597665484}) = 0
    gettimeofday({1462553238, 598060}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 598869963}) = 0
    gettimeofday({1462553238, 599547}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, {1462553238, 601053152}) = 0
    gettimeofday({1462553238, 601593}, NULL) = 0
    clock_gettime(CLOCK_REALTIME, ^CProcess 908 detached
    <detached ...>

    @pitrou
    Copy link
    Member

    pitrou commented May 6, 2016

    Still hangs at test_lru_cache_threaded with this last patch :(

    What about the other tests (test_importlib, test_threading)?

    Do you know in which precise test it hangs? (use "-v" to print test names)

    There are two threads, gdb does not print the backtrace of the other thread.

    Even with "thread apply all backtrace" (Google tells me this is the incantation)?

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 6, 2016

    Do you know in which precise test it hangs?

    They are named in the first msg of each issue, the issues are:
    issue bpo-26938: android: test_concurrent_futures hangs on armv7
    issue bpo-26939: android: test_functools hangs on armv7
    issue bpo-26940: android: test_importlib hangs on armv7
    issue bpo-26941: android: test_threading hangs on armv7

    The test_threading and test_importlib issues have also now a gdb backtrace, so we know not only the test name but also the full python frame stack with line numbers thanks to python-gdb.py magic. It seems that test_functools, test_threading and test_importlib all hang within take_gil().

    test_concurrent_futures is different and hangs in a futex().

    With "thread apply all backtrace", nothing changes and gdb still prints:
    Cannot access memory at address 0x0
    #0 0xb6efae6c in ?? ()
    #1 0xb6efd830 in ?? ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

    The problem seems to be that gdb does not have enough information to unwind the stack for the other threads and for thread #1 it only works after an interrupted continue command, and not on gdb startup.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 6, 2016

    Forgot to say that the backtraces for test_threading and test_importlib were obtained with the patch applied while they were hanging.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 16, 2016

    Got now a full fledged gdb. Thread #1 loops infinitely in the "while (_Py_atomic_load_relaxed(&gil_locked))" loop in take_gil(). Thread #2 is stuck in the MUTEX_LOCK(gil_mutex) statement in drop_gil() as can be seen by setting a breakpoint at the following line and seeing that this breakpoint is never hit.

    Setting gil_interval to 10 (from the initial value of 1) from gdb, allows the test to pass.

    With this patch of PyCOND_TIMEDWAIT():

      diff -r eee959fee5f5 Python/condvar.h
      --- a/Python/condvar.h	Sat May 07 21:13:50 2016 +0300
      +++ b/Python/condvar.h	Mon May 16 15:52:49 2016 +0200
      @@ -99,6 +99,7 @@
           PyCOND_ADD_MICROSECONDS(deadline, us);
           ts.tv_sec = deadline.tv_sec;
           ts.tv_nsec = deadline.tv_usec * 1000;
      +    PyCOND_GETTIMEOFDAY(&deadline);

    and a breakpoint set on pthread_cond_timedwait(). When this breakpoint is hit, there is 20 usec elapsed time instead of the value of 1 (as requested by 'us') between ts and deadline as shown here:

    (gdb) p ts
    $1 = {tv_sec = 1463407197, tv_nsec = 468546000}
    (gdb) p deadline
    $2 = {tv_sec = 1463407197, tv_usec = 468566}

    And when iterating the loop, the measured elapsed times are: 20, 12, 12, 11, 11, 11.

    Clearly, the ts date is in the past when pthread_cond_timedwait() is called, and the function returns immediately without releasing the mutex. This is the expected behavior[1].

    [1] http://linux.die.net/man/3/pthread_cond_wait

    @vstinner
    Copy link
    Member

    I don't know if it can fix the issue, but you may see my issue bpo-23428: "Use the monotonic clock for thread conditions on POSIX platforms".

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 16, 2016

    On a heavily loaded system, there is no guarantee that the date obtained from a call to gettimeofday() in PyCOND_TIMEDWAIT(), is not in the past when pthread_cond_timedwait() is called, as there may be a context switch in between. When this happens, a Python thread cannot take the gil even if it is available.

    This tentative patch monitors the load of the system and dynamically increases the value used for the switch interval when needed. With this patch on the android-21-armv7 emulator:

    python -m test -m test_lru_cache_threaded -v test_functools
    Ok
    python -m test -m test_deadlock -v test_importlib
    Ok
    python -m test -m test_is_alive_after_fork -v test_threading
    Ok

    The measured elapsed time is not correct in this patch when it is above 1 second.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented May 16, 2016

    I don't know if it can fix the issue, but you may see my issue bpo-23428: "Use the monotonic clock for thread conditions on POSIX platforms".

    pthread_condattr_setclock() is declared in the Android API 21 headers.

    When a monotonic clock is set for pthread_cond_timedwait(), one could measure the monotonic time elapsed to execute the instructions from _PyTime_monotonic() to the setting of the ts fields in PyCOND_TIMEDWAIT(), i.e the time to execute those statements taken from your patch:

    #ifdef MONOTONIC
        _PyTime_monotonic(&deadline);
    #else
        _PyTime_gettimeofday(&deadline);
    #endif
    /* TODO: add overflow and truncation checks */
    assert(us <= LONG_MAX);
    deadline.tv_usec += (long)us;
    deadline.tv_sec += deadline.tv_usec / 1000000;
    deadline.tv_usec %= 1000000;
    

    and prevent gil_interval to be set below that measured value. This would solve the problem for this soooo slow emulator. _PyTime_monotonic() does not exist, is this _PyTime_GetMonotonicClock() ? So I did not try to test it, not knowing the status of your patch.

    IMHO PyCOND_TIMEDWAIT() should use a monotonic clock when available.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Jun 4, 2016

    pycond_timedwait_2.patch prints the delta time in microseconds when deadline is in the past. Running the patch with the following command:
    python -m test -m test_lru_cache_threaded -v test_functools
    gives these results:

    a) natively:
    deadline is never in the past and the test run prints only a bunch of 'Ok's.
    b) on the x86 emulator:
    Although the test does not hang when python is not patched, the output of the test run, test_lru_cache_threaded-x86-printf.txt, shows that deadline is almost always in the past.
    c) on the armv7 emulator:
    The output of the test run, test_lru_cache_threaded-armv7-printf.txt (this is a truncated version, the original file is half a Mbytes long), shows that deadline is always in the past. With the patch, the test does not hang.

    It is possible that this problem could be a quirk in the implementation of the qemu emulator, given the results of b), and the fact that the python test suite runs at about the same speed when run natively and on the x86 emulator.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Jun 5, 2016

    The problem can be fixed by changing the switch interval to a higher value when the test is run on an Android emulator. bpo-26855 can provide a way to know when this is the case through the value of the ro.kernel.qemu property.

    @xdegaye
    Copy link
    Mannequin Author

    xdegaye mannequin commented Oct 28, 2016

    Patch attached.

    @xdegaye xdegaye mannequin added tests Tests in the Lib/test dir 3.7 (EOL) end of life and removed stdlib Python modules in the Lib dir labels Oct 28, 2016
    @xdegaye xdegaye mannequin self-assigned this Oct 28, 2016
    @python-dev
    Copy link
    Mannequin

    python-dev mannequin commented Dec 8, 2016

    New changeset fd1718badb67 by Xavier de Gaye in branch '3.6':
    Issue bpo-26939: Add the support.setswitchinterval() function to fix
    https://hg.python.org/cpython/rev/fd1718badb67

    New changeset c5d7e46926ac by Xavier de Gaye in branch 'default':
    Issue bpo-26939: Merge 3.6.
    https://hg.python.org/cpython/rev/c5d7e46926ac

    @xdegaye xdegaye mannequin closed this as completed Dec 8, 2016
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life build The build process and cross-build tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants