classification
Title: test_posix: TestPosixSpawn fails on PPC64 Fedora 3.x
Type: behavior Stage: resolved
Components: Tests Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: alex, cstratak, martin.panter, pablogsal, serhiy.storchaka, vstinner, xiang.zhang
Priority: normal Keywords: patch

Created on 2018-05-24 02:18 by vstinner, last changed 2018-06-19 09:10 by vstinner. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 7663 closed pablogsal, 2018-06-12 12:44
PR 7685 merged pablogsal, 2018-06-13 18:18
Messages (31)
msg317509 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-05-24 02:18
http://buildbot.python.org/all/#/builders/53/builds/1062

(...)
test_close_file (test.test_posix.TestPosixSpawn) ... ok
test_dup2 (test.test_posix.TestPosixSpawn) ... ok
test_empty_file_actions (test.test_posix.TestPosixSpawn) ... ok
test_multiple_file_actions (test.test_posix.TestPosixSpawn) ... FAIL
test_no_such_executable (test.test_posix.TestPosixSpawn) ... ok
test_open_file (test.test_posix.TestPosixSpawn) ... ERROR
test_returns_pid (test.test_posix.TestPosixSpawn) ... ok
Warning -- files was modified by test_posix
  Before: []
  After:  ['\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb'] 
test test_posix failed
test_specify_environment (test.test_posix.TestPosixSpawn) ... ok

======================================================================
ERROR: test_open_file (test.test_posix.TestPosixSpawn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/shager/cpython-buildarea/3.x.edelsohn-fedora-ppc64/build/Lib/test/test_posix.py", line 1542, in test_open_file
    with open(outfile) as f:
FileNotFoundError: [Errno 2] No such file or directory: '@test_42482_tmp'

======================================================================
FAIL: test_multiple_file_actions (test.test_posix.TestPosixSpawn)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/shager/cpython-buildarea/3.x.edelsohn-fedora-ppc64/build/Lib/test/test_posix.py", line 1488, in test_multiple_file_actions
    self.assertEqual(os.waitpid(pid, 0), (pid, 0))
AssertionError: Tuples differ: (31319, 32512) != (31319, 0)

First differing element 1:
32512
0

- (31319, 32512)
?         ^^^^^

+ (31319, 0)
?         ^


----------------------------------------------------------------------
Ran 101 tests in 2.175s

FAILED (failures=1, errors=1, skipped=9)
1 test failed again:
    test_posix
msg317524 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-24 05:12
What is \udcdb\udcdb\udcdb...? Looks like a corrupted memory.
msg317527 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-24 05:35
Seems there are 2 or 3 different bugs here. Race condition, using freed memory, and maybe something other. I take this.
msg317797 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-05-27 17:09
@serhiy.storchaka Is the race condition/freed memory reusage in the test suite or in the posix module?
msg317827 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-05-28 07:27
The race condition is in tests (we can open a file before it be written in a child). The reed memory reusage is in the posix module. Didn't look at it close, but seems the content of a temporary Python object (the result of encoding/decoding) is passed to C library function, and this object is freed before the content be used.
msg319385 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-12 14:03
I'm not sure that the attached PR fixes all race conditions.
msg319386 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-12 14:05
The part that is related to \udcdb\udcdb\udcdb... is a problem in posix_spawn() implementation, not in tests. It can't be fixed by changing tests.
msg319387 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-06-12 14:10
In PR 7663 I am only fixing the race condition.
msg319388 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-06-12 14:12
A separate PR is needed for the apparent reusing of freed memory in the implementation of posix_spawn. I am currently looking into it.
msg319389 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-12 14:12
For reference: posix_spawn() was implemented in issue20104.
msg319390 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-06-12 14:14
Should we merge PR6693 and PR6725 before fixing the memory problem?
msg319394 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-06-12 15:53
Serhiy, I am checking and the only temporary Python object (the result of encoding/decoding) that is passed to C library function that I can find is in:


                if (!PyArg_ParseTuple(file_action, "OiO&ik"
                         ";A open file_action tuple must have 5 elements",
                         &tag_obj, &fd, PyUnicode_FSConverter, &path,
                         &oflag, &mode))
                 {
                     goto fail;
                 }
                 errno = posix_spawn_file_actions_addopen(file_actionsp,
                         fd, PyBytes_AS_STRING(path), oflag, (mode_t)mode);
                  Py_DECREF(path);  /* addopen copied it. */


And according to the manpage of posix_spawn_file_actions_addopen: "The string described by path shall be copied by the posix_spawn_file_actions_addopen() function.".
The object created by `PyArg_ParseTuple` (path) is only freed after calling `posix_spawn_file_actions_addopen` and in that function the contents are copied inside 
`file_actionsp`. So it should not be a problem.

In case there is still a problem, could you elaborate more about what object is being freed and how is being passed to a C library function?
msg319427 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2018-06-13 05:56
I encounter the same failure when running test_posix suite with a debug build Python. 

[root@SZX1000328166 cpython]# ./python -m test test_posix
Run tests sequentially
0:00:00 load avg: 0.91 [1/1] test_posix
Warning -- files was modified by test_posix
  Before: []
  After:  ['\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcdb\udcf0\udcd3B\udcb4\udc8b\x7f']
test test_posix failed -- multiple errors occurred; run in verbose mode for details
test_posix failed

== Tests result: FAILURE ==

1 test failed:
    test_posix

Total duration: 459 ms
Tests result: FAILURE

Believe it or not, when I comment out the `Py_DECREF(path)`, the failure goes away.
msg319429 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2018-06-13 06:14
I checked glibc changelog and the copy behaviour is not implemented until 2014, glic 2.20. There is a CVE-2014-4043. My environment has glic 2.17 which seems still suffer the problem.
msg319431 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-13 06:19
I'm puzzled. It is not just documented that posix_spawn_file_actions_addopen() should make a copy, but the glibc implementation actually makes it: see for example https://github.com/bminor/glibc/blob/09533208febe923479261a27b7691abef297d604/posix/spawn_faction_addopen.c . I don't see a way of corrupting this memory.

Can non-glibc implementation be used here?
msg319432 - (view) Author: Xiang Zhang (xiang.zhang) * (Python committer) Date: 2018-06-13 06:29
Serhiy, you need to check an early version, for example 2.17: https://github.com/bminor/glibc/blob/glibc-2.17/posix/spawn_faction_addopen.c
msg319434 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-13 06:51
Thank you Xiang! Thus this is a bug in old glibc. It was fixed in glibc 2.20: https://sourceware.org/bugzilla/show_bug.cgi?id=17048 .

What should we do with this? Add a workaround for the bug (this will complicate the code)? Or skip the test on glibc <2.20.
msg319454 - (view) Author: Charalampos Stratakis (cstratak) * Date: 2018-06-13 12:13
The latest stable Fedora's have glibc >= 2.26

Maybe the buildbot needs to be updated?
msg319457 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-06-13 12:51
If we want to support older versions of glibc a temporary (and somewhat inelegant) workaround is storing the temporaries in a list that the caller passes and destroy the list after calling `posix_spawn`.
msg319458 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-13 12:54
> If we want to support older versions of glibc a temporary (and somewhat inelegant) workaround is storing the temporaries in a list that the caller passes and destroy the list after calling `posix_spawn`.

Creating a copy seems to be a reasonable tradeoff.

Another option is to detect the bug at runtime and raise an exception if the bug is detected. But I dislike this option: I hate when a function is available in the os module but calling it raises a NotImplementedError :-( Moreover, posix_spawn() seems to be a lot of benefit, so I would like to be able to use it on "old" glibc versions.
msg319463 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-06-13 13:28
I have updated the PR with a workaround.
msg319481 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-06-13 18:25
Serhiy commented in the PR:

>__GLIBC__ and __GLIBC_MINOR__ give you the version of glibc used at >compile time. But can not the different version be dynamically linked at >run time?

Should we use the list approach always to avoid problems with this?
msg319485 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-13 20:02
Yes, as I wrote on GitHub.
msg319594 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-15 08:23
As I wrote on GitHub, I think that it is better to use the list approach always for two reasons:

1) Additional #if's and duplications clutter the code.

2) It is hard to test the alternate branch if the other branch is used on most developer computers.

It is better to have a single branch and add a comment that explains a workaround. It can be removed after platforms with buggy glibc versions will grow out of use.
msg319598 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-15 09:02
"It is better to have a single branch and add a comment that explains
a workaround. It can be removed after platforms with buggy glibc
versions will grow out of use."

Recently, I proposed to drop support for old Linux kernels, but some
people still use them and the code to support these kernels don't need
a lot of maintenance. So we keep the code :-)
https://mail.python.org/pipermail/python-dev/2018-January/151821.html
msg319605 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-06-15 11:07
I have updated PR7685 to use the temporary list as per Serhiy comments.
msg319606 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2018-06-15 11:18
As I commented on GitHub, it seems that the race in the test (problem covered in PR7663) disappears once the memory problem is fixed (PR33630). As is a race condition I am not sure that this affirmation is true, but running 3 test suites in parallel that executes test_posix 200 times does not raise any problem on the buildbot after applying the patch in PR33630
msg319736 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2018-06-16 13:04
I doubt it is a race condition. “Waitpid” only returns after the child has terminated, and the file should be created as the child starts, before it terminates. Surely it is just the buggy glibc opening the DBDBDB file, when the test case expects it to create outfile.
msg319936 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-19 08:20
New changeset cb970730e3ca2522e9b1700dcaf0a06b7e898db6 by Serhiy Storchaka (Pablo Galindo) in branch 'master':
bpo-33630: Fix using of freed memory in old versions of glicb for posix_spawn(). (GH-7685)
https://github.com/python/cpython/commit/cb970730e3ca2522e9b1700dcaf0a06b7e898db6
msg319937 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2018-06-19 08:24
Thank you Pablo! Seems failures in other tests that look like a race condition actually are caused by the glibc bug. I think that after passing tests on builtbots this issue can be closed.
msg319946 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-19 09:10
PPC64 Fedora 3.x, PPC64LE Fedora 3.x, s390x Debian 3.x and s390x RHEL 3.x buildbots are back to green thanks to this fix.

Oh wow, so it was a glibc bug: that was unexpected :-( Hopefully, the workaround is tiny and cheap!

> I think that after passing tests on builtbots this issue can be closed.

Right. I closed the PR 7663 and I close this issue.

Thank you all for investigating this issue, thanks Serhiy and Xiang for the review, and thanks Pablo for the fix!
History
Date User Action Args
2018-06-19 09:10:16vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg319946

stage: patch review -> resolved
2018-06-19 08:24:10serhiy.storchakasetmessages: + msg319937
2018-06-19 08:20:00serhiy.storchakasetmessages: + msg319936
2018-06-16 13:04:37martin.pantersetnosy: + martin.panter
messages: + msg319736
2018-06-15 11:18:07pablogsalsetmessages: + msg319606
2018-06-15 11:07:27pablogsalsetmessages: + msg319605
2018-06-15 09:02:07vstinnersetmessages: + msg319598
2018-06-15 08:23:25serhiy.storchakasetmessages: + msg319594
2018-06-13 20:02:10serhiy.storchakasetmessages: + msg319485
2018-06-13 18:25:18pablogsalsetmessages: + msg319481
2018-06-13 18:18:35pablogsalsetpull_requests: + pull_request7298
2018-06-13 13:28:05pablogsalsetmessages: + msg319463
2018-06-13 12:54:41vstinnersetmessages: + msg319458
2018-06-13 12:51:18pablogsalsetmessages: + msg319457
2018-06-13 12:13:13cstrataksetnosy: + cstratak
messages: + msg319454
2018-06-13 06:51:52serhiy.storchakasetnosy: + alex
messages: + msg319434
2018-06-13 06:29:52xiang.zhangsetmessages: + msg319432
2018-06-13 06:19:37serhiy.storchakasetmessages: + msg319431
2018-06-13 06:14:50xiang.zhangsetmessages: + msg319429
2018-06-13 05:56:57xiang.zhangsetnosy: + xiang.zhang
messages: + msg319427
2018-06-12 15:53:28pablogsalsetmessages: + msg319394
2018-06-12 14:14:27pablogsalsetmessages: + msg319390
2018-06-12 14:12:56serhiy.storchakasetmessages: + msg319389
2018-06-12 14:12:49pablogsalsetmessages: + msg319388
2018-06-12 14:10:25pablogsalsetmessages: + msg319387
2018-06-12 14:05:49serhiy.storchakasetmessages: + msg319386
2018-06-12 14:03:53vstinnersetmessages: + msg319385
2018-06-12 12:44:46pablogsalsetpull_requests: + pull_request7278
2018-05-28 07:27:50serhiy.storchakasetmessages: + msg317827
2018-05-27 17:09:54pablogsalsetnosy: + pablogsal
messages: + msg317797
2018-05-24 14:39:10vstinnersetpull_requests: - pull_request6734
2018-05-24 12:48:53twisteroid ambassadorsetkeywords: + patch
stage: patch review
pull_requests: + pull_request6734
2018-05-24 05:35:19serhiy.storchakasetassignee: serhiy.storchaka
type: behavior
messages: + msg317527
2018-05-24 05:12:11serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg317524
2018-05-24 02:18:53vstinnercreate