classification
Title: android: test_functools hangs on armv7
Type: behavior Stage: resolved
Components: Cross-Build, Tests Versions: Python 3.7, Python 3.6
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: xdegaye Nosy List: Alex.Willmer, ncoghlan, neologix, pitrou, python-dev, rhettinger, vstinner, xdegaye, yan12125
Priority: normal Keywords: patch

Created on 2016-05-03 15:37 by xdegaye, last changed 2017-03-31 16:36 by dstufft. This issue is now closed.

Files
File name Uploaded Description Edit
test_output.txt xdegaye, 2016-05-03 19:07 gdb back trace
python_tests_on_arm.txt yan12125, 2016-05-06 10:20
atomic_explicit.patch pitrou, 2016-05-06 11:30 review
build-failure.txt xdegaye, 2016-05-06 14:53
atomic_explicit2.patch pitrou, 2016-05-06 14:58 review
pycond_timedwait.patch xdegaye, 2016-05-16 16:09 review
test_lru_cache_threaded-x86-printf.txt xdegaye, 2016-06-04 09:16
test_lru_cache_threaded-armv7-printf.txt xdegaye, 2016-06-04 09:16
pycond_timedwait_2.patch xdegaye, 2016-06-04 09:20 review
setswitchinterval.patch xdegaye, 2016-10-28 07:15 review
Pull Requests
URL Status Linked Edit
PR 552 closed dstufft, 2017-03-31 16:36
Messages (22)
msg264742 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-03 15:37
test_functools hangs on an android emulator running an armv7 system image (but not on x86) at API level 21. The test suite hangs at

test_lru_cache_threaded (test.test_functools.TestLRUC) ...

A cause of this problem may be related to the fact that the android arm emulator runs very, very slowly.
msg264756 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-03 19:07
The attached test_output.txt file is the corresponding gdb backtrace[1].

Interestingly, all the tests (issues #26938, #26939, #26940 and #26941) that hang on arm use sys.setswitchinterval() aggressively.


[1] The backtrace is interleaved with:
  * Python Exception <type 'exceptions.UnicodeDecodeError'> 'utf8' codec can't decode byte 0xb8 in posit ion 0: invalid start byte:
    that seem to come from libpython, the python that is embeded in gdb to pretty-print gdb output
  * ---Type <return> to continue, or q <return> to quit---
    sorry, had to resort to copy/paste, gdb 'set logging' does not seem to work
msg264922 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-05-05 21:28
Can you post the backtraces of all threads (at least the innermost frames)?
Also, can you reproduce on an actual armv7 machine? Just wondering if there may be something peculiar in the emulator.
msg264957 - (view) Author: Chih-Hsuan Yen (yan12125) * Date: 2016-05-06 10:20
FWIW, test_functools passes on ASUS Zenfone 2 Laser ZE500KL with stock ROM (rooted) and my patchset. [1] The CPU is ARM64 and the Python is built as 32-bit ARM binaries. test_threading (#26941) and test_importlib (#26940) pass, too. test_concurrent_futures (#26938) fails with 27 errors, all with "OSError: [Errno 38] Function not implemented", which seems the same as #26924.

The attached file contains logs of the four tests mentioned here.

[1] https://github.com/yan12125/python3-android
msg264963 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-05-06 11:11
I suspect this bug is caused by memory ordering differences between x86 and ARM (see https://en.wikipedia.org/wiki/Memory_ordering#Runtime_memory_ordering ), which is why it may not reproduce easily (may depend on the particular CPU implementation and/or various timing subtleties in the tests themselves).

Xavier / Chi, can you tell if Android has stdatomic.h? See Include/pyatomic.h for the difference it makes.
msg264965 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-05-06 11:30
Xavier, as a strating point, can you try the following patch? (atomic_explicit.patch)
If it solves the issue, can you also measure the performance drop compared to unpatched (on a non-debug build)? `python -m test.pystone` should be sufficient.
msg264974 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-06 14:53
Android has stdatomic.h and HAVE_STD_ATOMIC is defined.

The build fails with the patch, see the errors in the attached file. This is the native compilation of python, the patches in my build system are also applied to the native build so that they can be checked. I can prevent that if needed.
msg264976 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-05-06 14:58
Oops, sorry, hadn't seen the patch didn't compile. Can you tried with this new patch? (atomic_explicit2.patch)
msg264981 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-06 15:28
Still hangs at test_lru_cache_threaded with this last patch :(
msg264988 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-06 15:55
> Can you post the backtraces of all threads (at least the innermost frames)?

There are two threads, gdb does not print the backtrace of the other thread. My current gdb setup with a connection to a gdb-server misses some stuff (access to the loader or some libraries I guess) and needs to be completed with the same setup as the one used by the ndk-gdb script (released by google for android applications).

> Also, can you reproduce on an actual armv7 machine? Just wondering if there may be something peculiar in the emulator.

I don't have an armv7 device that can be used for testing.
msg264995 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-06 16:55
The strace of test_lru_cache_threaded:

...
gettimeofday({1462553238, 580860}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 581276772}) = 0
gettimeofday({1462553238, 581672}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 582078767}) = 0
gettimeofday({1462553238, 583956}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 585865959}) = 0
gettimeofday({1462553238, 587597}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 589381724}) = 0
gettimeofday({1462553238, 589973}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 591858839}) = 0
gettimeofday({1462553238, 593003}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 593452750}) = 0
gettimeofday({1462553238, 593886}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 594364345}) = 0
gettimeofday({1462553238, 594764}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 595227771}) = 0
gettimeofday({1462553238, 595626}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 596027590}) = 0
gettimeofday({1462553238, 596448}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 596848736}) = 0
gettimeofday({1462553238, 597259}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 597665484}) = 0
gettimeofday({1462553238, 598060}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 598869963}) = 0
gettimeofday({1462553238, 599547}, NULL) = 0
clock_gettime(CLOCK_REALTIME, {1462553238, 601053152}) = 0
gettimeofday({1462553238, 601593}, NULL) = 0
clock_gettime(CLOCK_REALTIME, ^CProcess 908 detached
 <detached ...>
msg265003 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2016-05-06 17:40
> Still hangs at test_lru_cache_threaded with this last patch :(

What about the other tests (test_importlib, test_threading)?

Do you know in which precise test it hangs? (use "-v" to print test names)

> There are two threads, gdb does not print the backtrace of the other thread.

Even with "thread apply all backtrace" (Google tells me this is the incantation)?
msg265020 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-06 20:11
> Do you know in which precise test it hangs?

They are named in the first msg of each issue, the issues are:
    issue #26938: android: test_concurrent_futures hangs on armv7
    issue #26939: android: test_functools hangs on armv7
    issue #26940: android: test_importlib hangs on armv7
    issue #26941: android: test_threading hangs on armv7

The test_threading and test_importlib issues have also now a gdb backtrace, so we know not only the test name but also the full python frame stack with line numbers thanks to python-gdb.py magic. It seems that test_functools, test_threading and test_importlib all hang within take_gil().

test_concurrent_futures is different and hangs in a futex().

With "thread apply all backtrace",  nothing changes and gdb still prints:
    Cannot access memory at address 0x0
    #0  0xb6efae6c in ?? ()
    #1  0xb6efd830 in ?? ()
    Backtrace stopped: previous frame identical to this frame (corrupt stack?)

The problem seems to be that gdb does not have enough information to unwind the stack for the other threads and for thread #1 it only works after an interrupted continue command, and not on gdb startup.
msg265027 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-06 20:45
Forgot to say that the backtraces for test_threading and test_importlib were obtained with the patch applied while they were hanging.
msg265701 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-16 15:52
Got now a full fledged gdb. Thread #1 loops infinitely in the "while (_Py_atomic_load_relaxed(&gil_locked))" loop in take_gil(). Thread #2 is stuck in the MUTEX_LOCK(gil_mutex) statement in drop_gil() as can be seen by setting a breakpoint at the following line and seeing that this breakpoint is never hit.

Setting gil_interval to 10 (from the initial value of 1) from gdb, allows the test to pass.

With this patch of PyCOND_TIMEDWAIT():

  diff -r eee959fee5f5 Python/condvar.h
  --- a/Python/condvar.h	Sat May 07 21:13:50 2016 +0300
  +++ b/Python/condvar.h	Mon May 16 15:52:49 2016 +0200
  @@ -99,6 +99,7 @@
       PyCOND_ADD_MICROSECONDS(deadline, us);
       ts.tv_sec = deadline.tv_sec;
       ts.tv_nsec = deadline.tv_usec * 1000;
  +    PyCOND_GETTIMEOFDAY(&deadline);

and a breakpoint set on pthread_cond_timedwait(). When this breakpoint is hit, there is 20 usec  elapsed time instead of the value of 1 (as requested by 'us') between ts and deadline as shown here:

  (gdb) p ts
  $1 = {tv_sec = 1463407197, tv_nsec = 468546000}
  (gdb) p deadline
  $2 = {tv_sec = 1463407197, tv_usec = 468566}

And when iterating the loop, the measured elapsed times are: 20, 12, 12, 11, 11, 11.

Clearly, the ts date is in the past when pthread_cond_timedwait() is called, and the function returns immediately without releasing the mutex. This is the expected behavior[1].

[1] http://linux.die.net/man/3/pthread_cond_wait
msg265704 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-05-16 16:02
I don't know if it can fix the issue, but you may see my issue #23428: "Use the monotonic clock for thread conditions on POSIX platforms".
msg265707 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-16 16:09
On a heavily loaded system, there is no guarantee that the date obtained from a call to gettimeofday() in PyCOND_TIMEDWAIT(), is not in the past when pthread_cond_timedwait() is called, as there may be a context switch in between. When this happens, a Python thread cannot take the gil even if it is available.

This tentative patch monitors the load of the system and dynamically increases the value used for the switch interval when needed. With this patch on the android-21-armv7 emulator:

python -m test -m test_lru_cache_threaded -v test_functools
Ok
python -m test -m test_deadlock -v test_importlib
Ok
python -m test -m test_is_alive_after_fork -v test_threading
Ok

The measured elapsed time is not correct in this patch when it is above 1 second.
msg265721 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-05-16 19:07
> I don't know if it can fix the issue, but you may see my issue #23428: "Use the monotonic clock for thread conditions on POSIX platforms".

pthread_condattr_setclock() is declared in the Android API 21 headers.

When a monotonic clock is set for pthread_cond_timedwait(), one could measure the monotonic time elapsed to execute the instructions from _PyTime_monotonic() to the setting of the ts fields in PyCOND_TIMEDWAIT(), i.e the time to execute those statements taken from your patch:

#ifdef MONOTONIC
    _PyTime_monotonic(&deadline);
#else
    _PyTime_gettimeofday(&deadline);
#endif

    /* TODO: add overflow and truncation checks */
    assert(us <= LONG_MAX);
    deadline.tv_usec += (long)us;
    deadline.tv_sec += deadline.tv_usec / 1000000;
    deadline.tv_usec %= 1000000;


and prevent gil_interval to be set below that measured value. This would solve the problem for this soooo slow emulator. _PyTime_monotonic() does not exist, is this _PyTime_GetMonotonicClock() ? So I did not try to test it, not knowing the status of your patch.

IMHO PyCOND_TIMEDWAIT() should use a monotonic clock when available.
msg267250 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-06-04 09:20
pycond_timedwait_2.patch prints the delta time in microseconds when deadline is in the past.  Running the patch with the following command:
  python -m test -m test_lru_cache_threaded -v test_functools
gives these results:

a) natively:
  deadline is never in the past and the test run prints only a bunch of 'Ok's.
b) on the x86 emulator:
  Although the test does not hang when python is not patched, the output of the test run, test_lru_cache_threaded-x86-printf.txt, shows that deadline is almost always in the past.
c) on the armv7 emulator:
  The output of the test run, test_lru_cache_threaded-armv7-printf.txt (this is a truncated version, the original file is half a Mbytes long), shows that deadline is always in the past. With the patch, the test does not hang.

It is possible that this problem could be a quirk in the implementation of the qemu emulator, given the results of b), and the fact that the python test suite runs at about the same speed when run natively and on the x86 emulator.
msg267410 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-06-05 11:56
The problem can be fixed by changing the switch interval to a higher value when the test is run on an Android emulator. Issue 26855 can provide a way to know when this is the case through the value of the ro.kernel.qemu property.
msg279588 - (view) Author: Xavier de Gaye (xdegaye) * (Python triager) Date: 2016-10-28 07:15
Patch attached.
msg282703 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2016-12-08 10:10
New changeset fd1718badb67 by Xavier de Gaye in branch '3.6':
Issue #26939: Add the support.setswitchinterval() function to fix
https://hg.python.org/cpython/rev/fd1718badb67

New changeset c5d7e46926ac by Xavier de Gaye in branch 'default':
Issue #26939: Merge 3.6.
https://hg.python.org/cpython/rev/c5d7e46926ac
History
Date User Action Args
2017-03-31 16:36:11dstufftsetpull_requests: + pull_request871
2016-12-08 11:02:40xdegayesetstatus: open -> closed
resolution: fixed
stage: commit review -> resolved
2016-12-08 10:10:36python-devsetnosy: + python-dev
messages: + msg282703
2016-11-17 16:28:25xdegayesetstage: patch review -> commit review
2016-10-28 07:29:02xdegayelinkissue26941 dependencies
2016-10-28 07:25:09xdegayelinkissue26940 dependencies
2016-10-28 07:15:16xdegayesetfiles: + setswitchinterval.patch

dependencies: - android: add platform.android_ver()
assignee: xdegaye
components: + Tests, - Library (Lib)
versions: + Python 3.7
messages: + msg279588
stage: patch review
2016-06-05 11:56:32xdegayesetdependencies: + android: add platform.android_ver()
messages: + msg267410
2016-06-04 09:20:20xdegayesetfiles: + pycond_timedwait_2.patch

messages: + msg267250
2016-06-04 09:16:56xdegayesetfiles: + test_lru_cache_threaded-armv7-printf.txt
2016-06-04 09:16:42xdegayesetfiles: + test_lru_cache_threaded-x86-printf.txt
2016-05-21 07:06:39xdegayelinkissue26865 dependencies
2016-05-16 19:07:19xdegayesetmessages: + msg265721
2016-05-16 16:09:21xdegayesetfiles: + pycond_timedwait.patch

messages: + msg265707
2016-05-16 16:02:31vstinnersetnosy: + vstinner
messages: + msg265704
2016-05-16 15:52:27xdegayesetmessages: + msg265701
2016-05-06 20:45:56xdegayesetmessages: + msg265027
2016-05-06 20:11:19xdegayesetmessages: + msg265020
2016-05-06 17:40:20pitrousetmessages: + msg265003
2016-05-06 16:55:47xdegayesetmessages: + msg264995
2016-05-06 15:55:47xdegayesetmessages: + msg264988
2016-05-06 15:28:15xdegayesetmessages: + msg264981
2016-05-06 14:58:38pitrousetfiles: + atomic_explicit2.patch

messages: + msg264976
2016-05-06 14:53:04xdegayesetfiles: + build-failure.txt

messages: + msg264974
2016-05-06 11:30:24pitrousetfiles: + atomic_explicit.patch
keywords: + patch
messages: + msg264965
2016-05-06 11:12:00pitrousetnosy: + neologix
2016-05-06 11:11:51pitrousetmessages: + msg264963
2016-05-06 10:20:55yan12125setfiles: + python_tests_on_arm.txt
nosy: + yan12125
messages: + msg264957

2016-05-05 21:28:32pitrousetnosy: + pitrou
messages: + msg264922
2016-05-03 19:07:46xdegayesetfiles: + test_output.txt

messages: + msg264756
2016-05-03 15:37:46xdegayecreate