Title: Inconsistent returncode/exitcode for terminated child processes on Windows
Type: behavior
Stage: needs patch
Components: Library (Lib), Windows
Versions: Python 3.7
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Akos Kiss, davin, eryksun, paul.moore, pitrou, steve.dower, tim.golden, zach.ware
Priority: normal
Keywords:

Created on 2017-10-24 15:46 by Akos Kiss, last changed 2017-11-05 17:44 by paul.moore.

Messages (9)
msg304920 - Author: Akos Kiss (Akos Kiss) Date: 2017-10-24 15:46
I've been trying various approaches for running and terminating subprocesses on Windows, and I got surprisingly different results depending on which module and which termination method I used. The script below uses the `subprocess` and `multiprocessing` modules to start child processes, and terminates them either with the modules' own `terminate` methods or with `os.kill`.

```py
import multiprocessing
import os
import signal
import subprocess
import sys
import time

def kill_with_os_kill(proc):
    print('kill with os.kill(pid,SIGTERM)')
    os.kill(proc.pid, signal.SIGTERM)

def kill_with_terminate(proc):
    print('kill child with proc.terminate()')
    proc.terminate()

def run_and_kill_subprocess(killfn, procarg):
    print('run subprocess child with %s' % procarg)
    with subprocess.Popen([sys.executable, __file__, procarg]) as proc:
        time.sleep(1)
        killfn(proc)
        proc.wait()
    print('child terminated with %s' % proc.returncode)

def run_and_kill_multiprocessing(killfn, procarg):
    print('run multiprocessing child with %s' % procarg)
    proc = multiprocessing.Process(target=childmain, args=(procarg,))
    proc.start()
    time.sleep(1)
    killfn(proc)
    proc.join()
    print('child terminated with %s' % proc.exitcode)

def childmain(arg):
    print('child process started with %s' % arg)
    while True:
        pass

if __name__ == '__main__':
    if len(sys.argv) < 2:
        print('parent process started')
        run_and_kill_subprocess(kill_with_os_kill, 'subprocess-oskill')
        run_and_kill_subprocess(kill_with_terminate, 'subprocess-terminate')
        run_and_kill_multiprocessing(kill_with_os_kill, 'multiprocessing-oskill')
        run_and_kill_multiprocessing(kill_with_terminate, 'multiprocessing-terminate')
    else:
        childmain(sys.argv[1])
```

On macOS, everything works as expected (and I expect Linux to behave the same way):

```
$ python3 killtest.py 
parent process started
run subprocess child with subprocess-oskill
child process started with subprocess-oskill
kill with os.kill(pid,SIGTERM)
child terminated with -15
run subprocess child with subprocess-terminate
child process started with subprocess-terminate
kill child with proc.terminate()
child terminated with -15
run multiprocessing child with multiprocessing-oskill
child process started with multiprocessing-oskill
kill with os.kill(pid,SIGTERM)
child terminated with -15
run multiprocessing child with multiprocessing-terminate
child process started with multiprocessing-terminate
kill child with proc.terminate()
child terminated with -15
```

But on Windows, I got:

```
>py -3 killtest.py
parent process started
run subprocess child with subprocess-oskill
child process started with subprocess-oskill
kill with os.kill(pid,SIGTERM)
child terminated with 15
run subprocess child with subprocess-terminate
child process started with subprocess-terminate
kill child with proc.terminate()
child terminated with 1
run multiprocessing child with multiprocessing-oskill
child process started with multiprocessing-oskill
kill with os.kill(pid,SIGTERM)
child terminated with 15
run multiprocessing child with multiprocessing-terminate
child process started with multiprocessing-terminate
kill child with proc.terminate()
child terminated with -15
```

Notes:
- On Windows with `os.kill(pid, sig)`, "sig will cause the process to be unconditionally killed by the TerminateProcess API, and the exit code will be set to sig." I.e., it is not possible to detect on Windows whether a process was terminated by a signal or exited normally, because `kill` does not actually raise a signal and no Windows API distinguishes a normal exit from a forced termination.
- The `multiprocessing` module has a workaround for this by terminating the process with a designated exit code (`TERMINATE = 0x10000`) and checking for that value afterwards, rewriting it to `-SIGTERM` if found. The related documentation is a bit misleading, as `exitcode` is meant to have "negative value -N [which] indicates that the child was terminated by signal N" -- however, if the process was indeed killed with `SIGTERM` (and not via `terminate`), then `exitcode` will be `SIGTERM` and not `-SIGTERM` (see above). (The documentation of `terminate` does not clarify the situation much by stating that "on Windows TerminateProcess() is used", since it does not mention the special exit code -- and well, it's not even a signal after all, so it's not obvious whether negative or positive exit code is to be expected.)
- The `subprocess` module chooses the quite arbitrary exit code of 1 and documents that "negative value -N indicates that the child was terminated by signal N" is POSIX only, without saying anything about what to expect on Windows.
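
To make the `multiprocessing` note above concrete, here is a minimal sketch of the exit code rewriting it describes. Only the `TERMINATE = 0x10000` value and the `-SIGTERM` result come from the notes; the function name `normalize_exitcode` is hypothetical, and this is an illustration rather than the module's actual code.

```python
import signal

# Designated exit code named in the note above.
TERMINATE = 0x10000


def normalize_exitcode(raw_code):
    """Hypothetical helper: map the designated TERMINATE code to -SIGTERM.

    Any other exit code is returned unchanged, which is why a child killed
    with os.kill(pid, SIGTERM) reports 15 rather than -15 on Windows.
    """
    if raw_code == TERMINATE:
        return -int(signal.SIGTERM)
    return raw_code


if __name__ == '__main__':
    for code in (TERMINATE, int(signal.SIGTERM), 1):
        print('%#x -> %d' % (code, normalize_exitcode(code)))
```

The sketch shows why only `terminate()` yields a negative value: the mapping is keyed on one specific exit code, not on how the process actually ended.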

Long story short: on Windows, the observable exit code of a forcibly terminated child process is quite inconsistent even across standard modules and termination methods, unlike on other (Linux/macOS) platforms. I think that having results consistent with `os.kill(,SIGTERM)` would be desirable even if that means non-negative values.

I'm willing to post a PR if the issue is deemed to be valid.
msg305020 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2017-10-26 03:24
Setting the exit code to the negative of a C signal value isn't generally meaningful in Windows. It seems multiprocessing doesn't have a significant use for this, other than getting a formatted exit code in the repr via its _exitcode_to_name dict. For example:

    p = multiprocessing.Process(target=time.sleep, args=(30,))
    p.start()
    p.terminate()

    >>> p
    <Process(Process-1, stopped[SIGTERM])>

This may mislead people into incorrectly thinking that Windows implements POSIX signals. Python uses the C runtime's emulation of the basic set of required signals. SIGSEGV, SIGFPE, and SIGILL are based on exceptions. SIGINT and SIGBREAK are based on console control events. SIGABRT and SIGTERM are for use with C `raise`. Additionally it implements os.kill via TerminateProcess and GenerateConsoleCtrlEvent. (The latter takes process group IDs, so it should have been used to implement os.killpg instead. Its use in os.kill is wrong and confusing.)

The normal exit code for a forced shutdown is 1, which you can confirm via Task Manager or `taskkill /F`. subprocess is correct here. I think multiprocessing should follow suit.
msg305044 - Author: Akos Kiss (Akos Kiss) Date: 2017-10-26 10:14
`taskkill /F` sets exit code to 1, indeed. (Confirmed by experiment. Cannot find this behaviour documented, though.)

On the other hand, MS Docs state (https://docs.microsoft.com/en-us/cpp/c-runtime-library/reference/signal#remarks) that termination by a signal "terminates the calling program with exit code 3". (So, there may be other "valid" exit codes, too.)
msg305047 - Author: Akos Kiss (Akos Kiss) Date: 2017-10-26 12:59
A follow-up: in addition to `taskkill`, I've taken a look at another "official" way of killing processes, the `Stop-Process` PowerShell cmdlet (https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.management/stop-process?view=powershell-5.1). Yet again, documentation is scarce on what the exit code of the terminated process will be. But the PowerShell and .NET code bases are open source, so I dug a bit deeper and found that `Stop-Process` is based on `System.Diagnostics.Process.Kill()` (https://github.com/PowerShell/PowerShell/blob/master/src/Microsoft.PowerShell.Commands.Management/commands/management/Process.cs#L1240), while `Process.Kill()` uses the `TerminateProcess` Win32 API (https://github.com/dotnet/corefx/blob/master/src/System.Diagnostics.Process/src/System/Diagnostics/Process.Windows.cs#L93). Interestingly, `TerminateProcess` is called with -1 (this was surprising, to me at least, as the exit code is unsigned on Windows AFAIK).

Therefore, I've added two new "kill" implementations to my original experiment (I won't repeat the whole code here, just the additions):

```py
def kill_with_taskkill(proc):
    print('kill child with taskkill /F')
    subprocess.run(['taskkill', '/F', '/pid', '%s' % proc.pid], check=True)

def kill_with_stopprocess(proc):
    print('kill child with powershell stop-process')
    subprocess.run(['powershell', 'stop-process', '%s' % proc.pid], check=True)
```

And I got:

```
run subprocess child with subprocess-taskkill
child process started with subprocess-taskkill
kill child with taskkill /F
SUCCESS: The process with PID 4024 has been terminated.
child terminated with 1
run subprocess child with subprocess-stopprocess
child process started with subprocess-stopprocess
kill child with powershell stop-process
child terminated with 4294967295

run multiprocessing child with multiprocessing-taskkill
child process started with multiprocessing-taskkill
kill child with taskkill /F
SUCCESS: The process with PID 5988 has been terminated.
child terminated with 1
run multiprocessing child with multiprocessing-stopprocess
child process started with multiprocessing-stopprocess
kill child with powershell stop-process
child terminated with 4294967295
```

My takeaways from the above are that
1) Windows is not consistent with itself,
2) 1 is not the only "valid" "terminated forcibly" exit code, and
3) a negative exit code does not work, even if MS itself tries to use it.
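
The 4294967295 values in the output above are the unsigned 32-bit view of the -1 passed to `TerminateProcess`. A small sketch of the conversion in both directions follows; the helper names `to_signed32` and `to_unsigned32` are made up for illustration.

```python
def to_signed32(code):
    """Reinterpret an exit code as a signed 32-bit integer."""
    code &= 0xFFFFFFFF
    # Values with the top bit set wrap around to negative numbers.
    return code - 0x100000000 if code >= 0x80000000 else code


def to_unsigned32(code):
    """Reinterpret an exit code as an unsigned 32-bit integer."""
    return code & 0xFFFFFFFF


if __name__ == '__main__':
    print(to_signed32(4294967295))   # signed view of the Stop-Process result
    print(to_unsigned32(-1))         # unsigned view of the value .NET passes
```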

BTW, I really think that killing a process with a code of 1 is questionable, as quite a few apps return 1 themselves just to signal an error (while still exiting normally). This makes it hard to tell an application's own error signaling apart from a forced kill. But that's a personal opinion.
msg305099 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2017-10-27 04:18
A C/C++ program returns EXIT_FAILURE for a generic failure. Microsoft defines this macro value as 1. Most tools that a user might use to forcibly terminate a process don't allow specifying the reason; they just use the generic value of 1. This includes Task Manager, taskkill.exe /f, the WDK's kill.exe -f, and Sysinternals pskill.exe and Process Explorer. subprocess and multiprocessing should also use 1 to be consistent.

The system itself doesn't distinguish a forced termination from a normal exit. Ultimately every thread and process gets terminated by the system calls NtTerminateThread and NtTerminateProcess (or the equivalent Process Manager private functions PspTerminateThreadByPointer, PspTerminateProcess, etc). Windows API TerminateThread and TerminateProcess are light wrappers around the corresponding system calls.

ExitThread and ExitProcess (actually implemented as RtlExitUserThread and RtlExitUserProcess in ntdll.dll) are within-process calls that integrate with the loader's LdrShutdownThread and LdrShutdownProcess routines. This allows the loader to call the entry points for loaded DLLs with DLL_THREAD_DETACH or DLL_PROCESS_DETACH, respectively. ExitThread also handles deallocating the thread's stack. Beyond that, the bulk of the work is handled by NtTerminateThread and NtTerminateProcess. For ExitProcess, NtTerminateProcess is actually called twice -- the first time it's called with a NULL process handle to kill the other threads in the current process. After LdrShutdownProcess returns, NtTerminateProcess is called again to truly terminate the process.

> PowerShell and .NET ... `System.Diagnostics.Process.Kill()` ... 
> `TerminateProcess` is called with -1

.NET is in its own (cross-platform) managed-code universe. I don't know why the developers decided to make Kill() use -1 (0xFFFFFFFF) as the exit code. I can guess that they negated the conventional EXIT_FAILURE value to indicate a signal-like kill. I think it's an odd decision, and I'm not inclined to favor it over behaviors that predate the existence of .NET. 

Making the ExitCode property a signed integer in .NET is easy to understand, and not a cause for concern since it's only a matter of interpretation. Note that the return value from wmain() or wWinMain() is a signed integer. Also, the two fundamental status result types in Windows -- NTSTATUS [1] and HRESULT [2] -- are 32-bit signed integers (warnings and errors are negative). Internally, the NT Process object's EPROCESS structure defines ExitStatus as an NTSTATUS value. You can see in a kernel debugger that it's a 32-bit signed integer (Int4B):

    lkd> dt nt!_eprocess ExitStatus
       +0x624 ExitStatus : Int4B

Python also wants the exit code to be a signed value. If we try to exit with an unsigned value that exceeds 0x7FFF_FFFF, it instead uses a default code of -1 (0xFFFF_FFFF). For example:

    >>> hex(subprocess.call('python -c "raise SystemExit(0x8000_0000)"'))
    '0xffffffff'

Using the corresponding signed integer works fine:

    >>> 0x8000_0000 - 2**32
    -2147483648
    >>> hex(subprocess.call('python -c "raise SystemExit(-2_147_483_648)"'))
    '0x80000000'

[1]: https://msdn.microsoft.com/en-us/library/cc231200
[2]: https://msdn.microsoft.com/en-us/library/cc231198


> termination by a signal "terminates the calling program with 
> exit code 3"

MS C raise() defaults to calling exit(3). I don't know why it uses the value 3; it's a legacy value from the MS-DOS era. Python doesn't directly expose C raise(), so this exit code only occurs in rare circumstances.

Note that SIGINT and SIGBREAK are based on console control events, and in this case the default behavior (i.e. SIG_DFL) is not to call exit(3) but rather to continue to the next registered console control handler. This is normally the Windows default handler (i.e. kernelbase!DefaultHandler), which calls ExitProcess with STATUS_CONTROL_C_EXIT. When closing the console itself (i.e. CTRL_CLOSE_EVENT), if a control handler in a console client returns TRUE, the default handler doesn't get called, but (starting with NT 6.0) the process still has to be terminated. In this case the session server, csrss.exe, calls NtTerminateProcess with STATUS_CONTROL_C_EXIT.

The exit code also isn't normally 3 for SIGABRT when abort() (i.e. os.abort in Python) gets called. In a release build, abort() defaults to using the __fastfail intrinsic (i.e. INT 0x29 on x64 systems) with the code FAST_FAIL_FATAL_APP_EXIT. This terminates the process with a STATUS_STACK_BUFFER_OVERRUN exception. By design, a __fastfail exception cannot be handled. An attached debugger only sees it as a second-chance exception. (Ideally they should have split this functionality into multiple status codes, since a __fastfail isn't necessarily due to a stack buffer overrun.) The error-reporting dialog may change the exit status to 255 in this case, but you can suppress this dialog via SetErrorMode(SEM_NOGPFAULTERRORBOX) or by using a Job object that's flagged to suppress it. You can also override the CRT's default abort() behavior to skip __fastfail: either set a SIGABRT handler that exits the process, or call _set_abort_behavior to unset _CALL_REPORTFAULT, in which case the exit code will be 3.
msg305103 - Author: Akos Kiss (Akos Kiss) Date: 2017-10-27 05:57
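
As an illustration of the status codes named above, a returned exit code can be matched against them with a small lookup. This is only a sketch: the numeric values are assumed to match the definitions in the Windows SDK header ntstatus.h, and the helper name `describe_exit_status` is hypothetical.

```python
# Assumed values, as defined in the Windows SDK's ntstatus.h.
KNOWN_STATUS = {
    0xC000013A: 'STATUS_CONTROL_C_EXIT',
    0xC0000409: 'STATUS_STACK_BUFFER_OVERRUN',
}


def describe_exit_status(code):
    """Hypothetical helper: name a known NTSTATUS exit code, else show hex.

    The code is masked to 32 bits first, so both the signed and the
    unsigned representation of the same status are recognized.
    """
    unsigned = code & 0xFFFFFFFF
    return KNOWN_STATUS.get(unsigned, hex(unsigned))


if __name__ == '__main__':
    print(describe_exit_status(0xC000013A))
    print(describe_exit_status(3))
```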
And I thought that my analysis was thorough... Exit code 1 is the way to go, I agree now.
msg305138 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2017-10-27 21:28
If a multiprocessing Process gets terminated by any means other than its terminate() method, it won't get this special TERMINATE (0x10000) exit code that allows the object to pretend the exit status is POSIX -SIGTERM. In general, the exit code will be 1. IMO, Process.terminate should be consistent with the typical exit code of 1 and thus consistent with Popen.terminate. However, I'm adding Davin and Antoine to the nosy list in case they disagree -- before you go to the trouble of creating a PR.
msg305598 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2017-11-05 15:08
I would like to know what our resident Windows users think about this (Paul, Steve, Zach).

Reading the above arguments, I'd be inclined to settle on 15 (that is, the non-negative "signal" number).  While it is not consistent with what "taskkill" or other APIs do, it makes it clear that the process was terminated in a certain way.  Certainly, there is a slight chance that 15 is a legitimate error code returned by the process, but that is far less likely than returning 1 as a legitimate error code, which I presume is extremely common.

In any case, this can't go in a bugfix release, so marking as 3.7-only.
msg305604 - Author: Paul Moore (paul.moore) * (Python committer) Date: 2017-11-05 17:44
I'm not actually sure what the proposal here is. Are we suggesting that all Python's means of terminating a process should use the same exit code?

Note that doing so would be a backward compatibility break, as os.kill() is documented as having the behaviour seen here (it's just that SIGTERM isn't a particularly meaningful value to use on Windows). subprocess terminate() doesn't document the exit code sent on Windows, and maybe should - but 1 seems a reasonable value (it's the C EXIT_FAILURE code after all). I don't fully understand the issue multiprocessing is trying to solve, but it seems to be around signals, which are very different between Windows and Unix anyway.

So, in summary - I'd need to see a specific proposal, but my instinct is that this is only an issue if you're trying to cover over the differences between Unix and Windows, and this isn't a case where I think that's advisable (the current situation is "good enough" if you don't care, and if you do, you have the means to do it right, you just need to cater for the platform differences yourself, in a way that suits your application.).
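
One possible way to cater for the platform differences in application code is sketched below. It relies only on behavior described in this thread (negative return codes mean "killed by signal" on POSIX, while Windows exit codes cannot distinguish a forced termination from a normal exit). The function name `describe_returncode` and its `platform` parameter are hypothetical and exist purely for illustration.

```python
import sys


def describe_returncode(returncode, platform=sys.platform):
    """Hypothetical helper: describe a child's exit status per platform."""
    if returncode is None:
        return 'still running'
    if platform == 'win32':
        # Windows does not record whether the process exited on its own
        # or was terminated, so only the raw code can be reported.
        if returncode == 0:
            return 'exited normally'
        return 'exited or was terminated with code %d' % returncode
    if returncode < 0:
        # POSIX convention used by subprocess and multiprocessing.
        return 'terminated by signal %d' % -returncode
    return 'exited with code %d' % returncode


if __name__ == '__main__':
    print(describe_returncode(-15, 'linux'))
    print(describe_returncode(1, 'win32'))
```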
History
Date User Action Args
2017-11-05 17:44:11 paul.moore set messages: + msg305604
2017-11-05 15:08:20 pitrou set messages: + msg305598; versions: - Python 3.6, Python 3.8
2017-10-27 21:28:24 eryksun set nosy: + pitrou, davin; messages: + msg305138; versions: + Python 3.8
2017-10-27 05:57:45 Akos Kiss set messages: + msg305103
2017-10-27 04:18:19 eryksun set messages: + msg305099
2017-10-26 12:59:59 Akos Kiss set messages: + msg305047
2017-10-26 10:14:01 Akos Kiss set messages: + msg305044
2017-10-26 03:26:27 eryksun set stage: needs patch; versions: + Python 3.6, Python 3.7
2017-10-26 03:24:56 eryksun set nosy: + eryksun; messages: + msg305020
2017-10-24 15:46:53 Akos Kiss create