Title: [subprocess] Better Unicode support for shell=True on Windows
Type: enhancement Stage: resolved
Components: Library (Lib), Windows Versions: Python 3.8, Python 3.7, Python 3.6
Status: closed Resolution: third party
Dependencies: Superseder:
Assigned To: Nosy List: Yoni Rozenshein, eryksun, giampaolo.rodola, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords:

Created on 2018-06-06 10:01 by Yoni Rozenshein, last changed 2021-03-16 03:46 by eryksun. This issue is now closed.

Messages (5)
msg318807 - (view) Author: Yoni Rozenshein (Yoni Rozenshein) Date: 2018-06-06 10:01
In subprocess, shell=True is implemented on Windows by launching a subprocess using {comspec} /c "{args}" (normally comspec=cmd.exe).

By default, the output of cmd is encoded with the "active" codepage. In Python 3.6, you can decode this using encoding='oem'.

However, this actually loses information. For example, try creating a file with a filename in a language that is not your active codepage, and then doing subprocess.check_output('dir', shell=True). In the output, the filename is replaced with question marks (not by Python, by cmd!).
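
As a rough analogy only, the loss can be imitated in pure Python: encoding a name into a code page that lacks its characters substitutes question marks, and no later decode can recover them. The sample filename and the choice of cp437 as the active code page are assumptions made for illustration; CMD's own conversion goes through the Windows API, not Python codecs.

```python
# Hypothetical filename with Hebrew letters, which cp437 cannot represent.
filename = "שלום.txt"

# Encoding into the (assumed) active code page replaces each unmappable
# letter with "?", similar to what shows up in the `dir` output.
lossy = filename.encode("cp437", errors="replace")
print(lossy)

# UTF-16-LE can represent every character, so the name round-trips intact.
wide = filename.encode("utf-16-le")
print(wide.decode("utf-16-le") == filename)
```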

To get the correct output, cmd has a "/u" switch (this switch has probably existed forever - at least since Windows NT 4.0, by my internet search). The output can then be decoded using encoding='utf-16-le', like any native Windows string.

Currently, Popen constructs the command line in this hardcoded format: {comspec} /c "{args}", so you can't get the /u in there with the shell=True shortcut, and have to write your own wrapping code.
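
A minimal sketch of such wrapping code, assuming the same {comspec} /c "{args}" format with /u added. The helper names build_unicode_shell_cmdline and check_output_unicode are hypothetical, not part of subprocess, and the decoding step only makes sense on Windows for output produced by CMD's internal commands.

```python
import os
import subprocess

def build_unicode_shell_cmdline(command, comspec=None):
    """Mirror the shell=True command-line format, with /u inserted before /c."""
    if comspec is None:
        comspec = os.environ.get("COMSPEC", "cmd.exe")
    return '{} /u /c "{}"'.format(comspec, command)

def check_output_unicode(command):
    """Windows only: run a CMD command and decode its UTF-16-LE output."""
    # On Windows a string command line is passed to the process unchanged.
    raw = subprocess.check_output(build_unicode_shell_cmdline(command))
    return raw.decode("utf-16-le")

print(build_unicode_shell_cmdline("dir", comspec="cmd.exe"))
```

On Windows, check_output_unicode('dir') would then stand in for subprocess.check_output('dir', shell=True).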

I suggest adding a feature to Popen where /u may be inserted before the /c within the shell=True shortcut. I've thought of several ways to implement this:

1. A new argument to Popen, which indicates that we want Unicode shell output; if True, add the /u. Note that we already have a couple of Windows-only arguments to Popen, so this would not be a precedent.

2. If the encoding argument is 'utf-16-le' or one of its aliases, then add the /u.

3. If the encoding argument is not None, then add the /u.
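
For option 2, one possible way to recognize 'utf-16-le' and its aliases is to compare the normalized codec name. This is only a sketch; wants_unicode_shell is a hypothetical helper, not an existing subprocess function.

```python
import codecs

def wants_unicode_shell(encoding):
    """Return True if `encoding` names the UTF-16-LE codec or one of its aliases."""
    if encoding is None:
        return False
    try:
        # codecs.lookup() resolves aliases and reports the canonical codec name.
        return codecs.lookup(encoding).name == "utf-16-le"
    except LookupError:
        # Unknown encodings never trigger /u.
        return False

for name in ("utf-16-le", "UTF-16LE", "utf-16", "oem-like-unknown", None):
    print(name, wants_unicode_shell(name))
```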
msg318868 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2018-06-07 00:51
> To get the correct output, cmd has a "/u" switch (this switch has 
> probably existed forever - at least since Windows NT 4.0, by my 
> internet search). The output can then be decoded using 
> encoding='utf-16-le', like any native Windows string.

However, the /u switch doesn't affect how CMD reads from stdin when it's a disk file or pipe. For example, `set /p` will stop at the first NUL byte. In general this is mismatched with subprocess, which provides a single `encoding` value for all 3 standard I/O streams. For example:

    >>> import subprocess
    >>> r = subprocess.run('cmd /d /v /u /c "set /p spam= & echo !spam!"',
    ...     capture_output=True, input='spam', encoding='oem')
    >>> r.stdout

With UTF-16 input, CMD only reads up to "s" instead of reading the entire "s\x00p\x00a\x00m\x00" string that was written to stdin:

    >>> r = subprocess.run('cmd /d /v /u /c "set /p spam= & echo !spam!"',
    ...     capture_output=True, input='spam', encoding='utf-16le')
    >>> r.stdout
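
The truncation can be modelled without running CMD. The sketch below only imitates the "stop at the first NUL byte" behaviour described above; it is not CMD's actual implementation.

```python
# Bytes that subprocess writes to stdin when encoding='utf-16le'.
data = "spam".encode("utf-16-le")
print(data)

# A reader that treats the input as a NUL-terminated byte string stops
# right after the first character.
seen = data.split(b"\x00", 1)[0]
print(seen)
```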

> 1. A new argument to Popen

This may lead to confusion and false bug reports by people who expect the setting to also affect external programs run via the shell (e.g. tasklist.exe). It's also not consistent with how CMD reads from stdin, as shown above. 

I can see the use of adding a cross-platform get_shell_path() function that returns the fully-qualified path to the shell that's used by shell=True. This way programs don't have to figure it out on their own if they need custom shell options. 

Common CMD shell options in Windows include /d (skip AutoRun commands), /v (enable delayed expansion of environment variables via "!"), /e (enable command extensions), /k (remain running after the command), and /u. I'd prefer subprocess to use /d by default. It's strange that the CRT's system() command doesn't use it.

Currently the shell path can be "/bin/sh" or "/system/bin/sh" in POSIX and os.environ.get("COMSPEC", "cmd.exe") in Windows. I'd prefer that Windows instead used:

    shell_path = os.path.abspath(os.environ.get('ComSpec',
                    os.path.join(_winapi.GetSystemDirectory(), 'cmd.exe')))

i.e. never use an unqualified, relative path such as "cmd.exe". 

Instead of the single-use GetSystemDirectory function, it could instead use _winapi.SHGetKnownFolderPath(_winapi.FOLDERID_System), or _winapi.SHGetKnownFolderPath('{1AC14E77-02E7-4E5D-B744-2EB1AE5198B7}') if the GUID constants aren't added.
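
A rough sketch of what such a get_shell_path() helper could look like. The function is hypothetical, and calling GetSystemDirectoryW through ctypes is an assumption that stands in for the _winapi function proposed above.

```python
import os
import sys

def get_shell_path():
    """Hypothetical helper: absolute path of the shell that shell=True would run."""
    if sys.platform == "win32":
        import ctypes
        # Ask Windows for the System32 directory instead of relying on a
        # relative "cmd.exe" lookup.
        buf = ctypes.create_unicode_buffer(260)
        if not ctypes.windll.kernel32.GetSystemDirectoryW(buf, len(buf)):
            raise ctypes.WinError()
        default = os.path.join(buf.value, "cmd.exe")
        return os.path.abspath(os.environ.get("ComSpec", default))
    # POSIX: the same two candidates mentioned above.
    return "/system/bin/sh" if hasattr(sys, "getandroidapilevel") else "/bin/sh"

print(get_shell_path())
```

A caller needing custom options could then build its own command line, e.g. [get_shell_path(), '/d', '/u', '/c', ...] on Windows.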
msg318870 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2018-06-07 00:58
> By default, the output of cmd is encoded with the "active" 
> codepage. In Python 3.6, you can decode this using 
> encoding='oem'.

FYI, the actual encoding is not necessarily "oem".

The console codepage may have been changed from the initial value by a SetConsoleCP call in the current process or another process. For example, a batch script can switch to codepage 65001 to allow CMD to read a UTF-8 encoded batch file; or read UTF-8 from an external command in a `for /f` loop; or write UTF-8 to a disk file or pipe.

(Only switch to codepage 65001 temporarily. Using UTF-8 for legacy console I/O is buggy. CMD, PowerShell, and Python 3.6+ aren't affected since they use the wide-character API for console I/O. But a legacy console application that uses the codepage implicitly with ReadFile and WriteFile for byte-based I/O may get invalid results such as reading a non-ASCII character as NUL, or the entire read failing, or writing garbage to the console after output that contains non-ASCII characters.)

To accommodate applications that use the current console codepage for standard I/O, Python could add two encodings that correspond to the current value of GetConsoleCP and GetConsoleOutputCP (e.g. named "conin" and "conout"). 
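
Until something like that exists, a program can approximate it by querying the console itself. The sketch below is an assumption-laden illustration: codepage_to_encoding and console_encodings are hypothetical names, the calls go through ctypes, and not every code page number has a matching Python codec.

```python
import sys

def codepage_to_encoding(codepage):
    """Map a Windows code page number to a Python codec name (None for 0)."""
    if codepage == 0:
        # GetConsoleCP/GetConsoleOutputCP return 0 when no console is attached.
        return None
    if codepage == 65001:
        return "utf-8"
    return "cp{}".format(codepage)

def console_encodings():
    """Return (input, output) codec names on Windows, or None elsewhere."""
    if sys.platform != "win32":
        return None
    import ctypes
    kernel32 = ctypes.windll.kernel32
    return (codepage_to_encoding(kernel32.GetConsoleCP()),
            codepage_to_encoding(kernel32.GetConsoleOutputCP()))

print(codepage_to_encoding(437), codepage_to_encoding(65001))
```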

Additionally, we can't assume the console codepage is initially OEM. It depends on settings in the registry or the shell shortcut for the application that allocated the console. In particular, if a new console window is allocated by a process (either explicitly via AllocConsole or implicitly for a console app that either hasn't inherited a console or was created with the CREATE_NEW_CONSOLE or CREATE_NO_WINDOW creation flag), then the console loads custom settings from either the registry key "HKCU\Console\<window title>" or the shell shortcut (LNK file) that started the application. 

If the console uses the window-title registry key, it looks for a "CodePage" DWORD value. The key name is the normalized window title, which comes from the WindowTitle field of the process parameters. This can be set explicitly using the STARTUPINFO lpTitle field that's passed to CreateProcess. Otherwise the system uses the executable path as the default window title. The console normalizes the title string to create a valid key name by replacing backslash with underscore, and it also substitutes "%SystemRoot%" for the Windows directory, e.g. the default configuration key for CMD is "HKCU\Console\%SystemRoot%_system32_cmd.exe". 
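
The title normalization described above can be sketched as follows. The function name console_registry_subkey is hypothetical and the rules are limited to the two mentioned here (the %SystemRoot% substitution and backslash replacement); the console may apply others.

```python
def console_registry_subkey(title, system_root=r"C:\Windows"):
    """Approximate the key name under HKCU\\Console for a window title."""
    root = system_root.rstrip("\\")
    # Substitute "%SystemRoot%" for a leading Windows directory (case-insensitive).
    if title.lower().startswith(root.lower() + "\\"):
        title = "%SystemRoot%" + title[len(root):]
    # Backslashes are not valid in a key name, so they become underscores.
    return title.replace("\\", "_")

print(console_registry_subkey(r"C:\Windows\system32\cmd.exe"))
```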

The codepage can also be set in a shell shortcut (LNK file) [1]. When an application is started from a shell shortcut, the shell sets the STARTUPINFO flag STARTF_TITLEISLINKNAME and the lpTitle string to the fully-qualified path of the LNK file. In this case, the console reads the LNK file to load its settings, rather than using the window-title subkey in the registry. But the "HKCU\Console" root key is still used for the default settings.

Finally, if CMD is run without a console (i.e. using the DETACHED_PROCESS creation flag), the default codepage is ANSI, not OEM. This isn't hard-coded in CMD. It happens that GetConsoleCP returns 0 (i.e. CP_ACP) in this case.

msg319810 - (view) Author: Yoni Rozenshein (Yoni Rozenshein) Date: 2018-06-17 10:09
After reading your messages, I admit I have been convinced that this is much more complicated than I thought, and maybe more of a Windows bug than a Python bug :)
msg388805 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2021-03-16 03:46
The complexity of mixing standard I/O from the shell and external programs is a limitation of the Windows command line. Each program could choose to use the system (or process) ANSI or OEM code page, the console session's input or output code page, UTF-8, or UTF-16. There's no uniform way to enforce one, consistent choice. So I'm closing this issue as a third party limitation that cannot be addressed in general. The problem has to be handled on a case by case basis.
Date                 User              Action  Args
2021-03-16 03:46:03  eryksun           set     status: open -> closed; resolution: third party; messages: + msg388805; stage: resolved
2018-06-17 10:09:33  Yoni Rozenshein   set     messages: + msg319810
2018-06-07 00:58:29  eryksun           set     messages: + msg318870
2018-06-07 00:51:50  eryksun           set     nosy: + eryksun; messages: + msg318868
2018-06-06 18:23:53  serhiy.storchaka  set     nosy: + paul.moore, tim.golden, vstinner, giampaolo.rodola, zach.ware, steve.dower; components: + Windows
2018-06-06 10:01:53  Yoni Rozenshein   create