classification
Title: subprocess uses wrong encoding on Windows
Type: behavior Stage: resolved
Components: IO, Library (Lib), Unicode, Windows Versions: Python 3.6, Python 3.5
process
Status: closed Resolution: duplicate
Dependencies: Superseder: subprocess seems to use local encoding and give no choice (issue 6135)
Assigned To: steve.dower Nosy List: davispuh, eryksun, ezio.melotti, haypo, martin.panter, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal Keywords: patch

Created on 2016-06-02 00:52 by davispuh, last changed 2016-09-05 23:24 by steve.dower. This issue is now closed.

Files
File name Uploaded Description
subprocess_fix_encoding_v2_a.patch davispuh, 2016-06-02 17:08 Patch A to fix console's encoding for subprocess
subprocess_fix_encoding_v2_b.patch davispuh, 2016-06-02 17:08 Patch B to fix console's encoding for subprocess
subprocess_fix_encoding_v3.patch davispuh, 2016-06-04 01:53 Patch v3 to fix console's encoding for subprocess
subprocess_fix_encoding_v4fixed.patch davispuh, 2016-06-09 01:36 Patch to fix console's encoding for subprocess
Messages (27)
msg266852 - (view) Author: Dāvis (davispuh) * Date: 2016-06-02 00:52
subprocess uses wrong encoding on Windows.


On Windows 10 with Python 3.5.1
from Command Prompt (cmd.exe)
> chcp 65001
> python -c "import subprocess; subprocess.getstatusoutput('ā')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "P:\Python35\lib\subprocess.py", line 808, in getstatusoutput
    data = check_output(cmd, shell=True, universal_newlines=True, stderr=STDOUT)
  File "P:\Python35\lib\subprocess.py", line 629, in check_output
    **kwargs).stdout
  File "P:\Python35\lib\subprocess.py", line 698, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "P:\Python35\lib\subprocess.py", line 1055, in communicate
    stdout = self.stdout.read()
  File "P:\Python35\lib\encodings\cp1257.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>


from PowerShell
> [Console]::OutputEncoding = [System.Text.Encoding]::UTF8
> python -c "import subprocess; subprocess.getstatusoutput('ā')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "P:\Python35\lib\subprocess.py", line 808, in getstatusoutput
    data = check_output(cmd, shell=True, universal_newlines=True, stderr=STDOUT)
  File "P:\Python35\lib\subprocess.py", line 629, in check_output
    **kwargs).stdout
  File "P:\Python35\lib\subprocess.py", line 698, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "P:\Python35\lib\subprocess.py", line 1055, in communicate
    stdout = self.stdout.read()
  File "P:\Python35\lib\encodings\cp1257.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2: character maps to <undefined>



As you can see, even though the console's encoding is UTF-8, Python still uses the Windows ANSI code page 1257.
This happens because io.TextIOWrapper is used with its default encoding, which is locale.getpreferredencoding(False),
but that's wrong because it is not the console's encoding.
I've attached a patch that fixes this by using the correct console encoding, taken from sys.stdout.encoding.
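The fallback described here can be observed directly: when io.TextIOWrapper gets no explicit encoding and the stream isn't a console, it uses locale.getpreferredencoding(False). A minimal sketch of the mechanism (not part of the original report; the actual codepage name differs per machine):

```python
import io
import locale

# subprocess wraps the child's pipe like this when universal_newlines=True,
# passing no encoding, so the locale default (the ANSI codepage on Windows,
# e.g. cp1257) wins -- regardless of what chcp says the console uses.
raw = io.BytesIO(b"child output")   # stands in for the pipe
wrapped = io.TextIOWrapper(raw)     # no encoding argument
print(wrapped.encoding)             # same as locale.getpreferredencoding(False)
```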

Note that there's a different bug: when Python is executed inside a PowerShell grouping expression, sys.stdout.encoding will be wrong.

> [Console]::OutputEncoding.EncodingName
Unicode (UTF-8)
> ([Console]::OutputEncoding.EncodingName)
Unicode (UTF-8)
> python -c "import sys; print(sys.stdout.encoding)"
cp65001
> (python -c "import sys; print(sys.stdout.encoding)")
cp1257

It should still be cp65001, which is why subprocess will still fail in this case even with my patch; but that's a separate bug.
msg266853 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-06-02 00:57
What is the ā command? Where does it come from? What is its output? Do you know the encoding of its output? It reminds me of the "set" issue in distutils: issue #27048.
msg266854 - (view) Author: Dāvis (davispuh) * Date: 2016-06-02 01:03
There's no such "ā" command; it's just used to produce non-ASCII output.

cmd will return:
'ā' is not recognized as an internal or external command,
operable program or batch file.


This output will be encoded in the console's encoding (UTF-8 in my example, or whatever chcp is set to), which Python will fail to read because it uses locale.getpreferredencoding(False) instead of sys.stdout.encoding.


See the attached patch; it fixes this problem. You can try to reproduce it yourself.
msg266860 - (view) Author: Dāvis (davispuh) * Date: 2016-06-02 01:42
I looked at #27048, and indeed it's affected by this bug; it happens to me too (I have non-ASCII symbols in %PATH%), and my patch fixes that.


on my system without patch

> python -c "from distutils import _msvccompiler; _msvccompiler._get_vc_env('')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "P:\Python35\lib\distutils\_msvccompiler.py", line 92, in _get_vc_env
    universal_newlines=True,
  File "P:\Python35\lib\subprocess.py", line 629, in check_output
    **kwargs).stdout
  File "P:\Python35\lib\subprocess.py", line 698, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "P:\Python35\lib\subprocess.py", line 1055, in communicate
    stdout = self.stdout.read()
  File "P:\Python35\lib\encodings\cp1257.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x83 in position 50: character maps to <undefined>

with my patch

> python -c "from distutils import _msvccompiler; _msvccompiler._get_vc_env('')"
>
no error
msg266864 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-02 02:38
I don’t know much about the conventions for stdout etc encoding on Windows. But in general, the patch does not seem robust. Does it work if sys.stdout is a pipe or file (not a console)? I doubt it will work when sys.stdout has been replaced by e.g. StringIO, and sys.stdout.encoding is None. Maybe you could use sys.__stdout__. But then, what happens when you run Python without any stdout at all, say in a GUI like Idle?

On Linux, the patch may have no effect in common cases. But again, it will break if sys.stdout has been replaced, or is set to None.

Looking at _Py_device_encoding() in Python/fileutils.c, perhaps you need a Windows-specific interface to GetConsoleCP() and GetConsoleOutputCP() that subprocess can use.
msg266872 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-06-02 04:08
Even sys.__stdout__ can be missing. In this context, falling back on the default encoding is probably fine, but for 3.6 I'd like to make everything default to UTF-8 on Windows, and force the console mode on startup (restore on finalize) - apart from the input() implementation it's fairly straightforward.

But don't hold up an immediate fix on that change, just be aware that I'm tinkering with a good long-term fix.
msg266873 - (view) Author: Eryk Sun (eryksun) * Date: 2016-06-02 04:12
There is no right encoding as far as I can see. 

If it's attached to a console (i.e. conhost.exe), then cmd.exe uses the console's output codepage when writing to a pipe or file, which is the scenario that your patch attempts to address. But if you pass creationflags=CREATE_NO_WINDOW, then the new console (created without a window) uses the OEM codepage, CP_OEMCP. And if you pass creationflags=DETACHED_PROCESS (i.e. no console), cmd uses the ANSI codepage, CP_ACP. There's also a "/u" option to force cmd to use the native Unicode encoding on Windows, UTF-16LE.

Note that the above only considers cmd.exe. Its child processes can write output using any encoding. You may end up with several different encodings present in the same stream. Many, if not most, programs don't use the console's current codepage when writing to a pipe or file. Commonly they default to OEM, ANSI, UTF-8, or UTF-16LE. For example, Windows Python uses ANSI for standard I/O that's not a console, unless you set PYTHONIOENCODING. 

Even if a called program cares about the console output codepage, your patch doesn't implement this robustly. It uses sys.stdout and sys.stderr, but those can be reassigned. Even sys.__stdout__ and sys.__stderr__ may be irrelevant. Python could be run via pythonw.exe for which the latter are None (unless it's started with non-NULL standard handles). Or python.exe could be run with standard I/O redirected to pipes or files, defaulting to ANSI. Also, the current program or called program could change the console encoding via chcp.com, which is just an indirect way of calling the WinAPI functions SetConsoleCP and SetConsoleOutputCP. 

There's no common default encoding for standard I/O on Windows, especially not a common UTF encoding, so universal_newlines=True, getoutput, and getstatusoutput may be of limited use. Preferably a calling program can set an option like cmd's "/u" or Python's PYTHONIOENCODING to force using a Unicode encoding, and then manually decode the output by wrapping stdout/stderr in instances of io.TextIOWrapper. It would help if subprocess.Popen had parameters for encoding and errors.
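The manual-decoding workaround described in the last paragraph can be sketched as follows (a hypothetical helper, not code from the thread; UTF-8 stands in for whatever encoding the two processes have negotiated, here guaranteed by the child writing UTF-8 bytes itself):

```python
import io
import subprocess
import sys

def run_decoded(args, encoding):
    """Run a command with a binary stdout pipe and decode it with an
    explicit, caller-chosen encoding instead of the locale default."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE)
    # Wrap the raw byte pipe ourselves so the encoding is under our control.
    out = io.TextIOWrapper(proc.stdout, encoding=encoding).read()
    proc.wait()
    return proc.returncode, out

# The child (Python itself) writes known UTF-8 bytes to its stdout.
rc, out = run_decoded(
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write('ā'.encode('utf-8'))"],
    encoding="utf-8")
print(rc, ascii(out))  # 0 '\u0101'
```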
msg266875 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-02 05:25
I think Issue 6135 has a bit of discussion on adding encoding and error parameters to subprocess.Popen etc.
msg266890 - (view) Author: Dāvis (davispuh) * Date: 2016-06-02 17:08
There is a right encoding: the encoding that's actually used. Here we're inside subprocess.Popen, which makes the actual winapi.CreateProcess call, so we can check for any creationflags and adjust the encoding logic accordingly. I would say almost all Windows console programs use the console's encoding for input/output, because otherwise the user wouldn't be able to read it. For programs that use a different encoding, it would be the caller's responsibility to set the encoding, because only the caller can know which encoding is right for that program.

So I think Popen should accept an encoding parameter that would then be passed to TextIOWrapper, preferably with a way to set a different encoding for each of stdin/stdout/stderr.

If no encoding is specified, we use our own logic to determine the default encoding via _Py_device_encoding(fd), which would be right in almost all, if not all, cases. And if some program changes the console's encoding after we've read it, we could query the encoding again after the program's execution and use the newly set console encoding.


Anyway, while looking into this further, I found why we get the wrong encoding.

Looking at subprocess.check_output, we can see:

return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           **kwargs).stdout

that stdout is set to PIPE

and then in subprocess.Popen.__init__:

if c2pread != -1:
    self.stdout = io.open(c2pread, 'rb', bufsize)
    if universal_newlines:
        self.stdout = io.TextIOWrapper(self.stdout)


There, c2pread will be the fd of the pipe (3).

Looking inside _io_TextIOWrapper___init___impl:

fileno = _PyObject_CallMethodId(buffer, &PyId_fileno, NULL);
[...]
int fd = _PyLong_AsInt(fileno);
[...]
self->encoding = _Py_device_encoding(fd);
[...]


we'll set the encoding with _Py_device_encoding(3),
but in that function:

    if (fd == 0)
        cp = GetConsoleCP();
    else if (fd == 1 || fd == 2)
        cp = GetConsoleOutputCP();
    else
        cp = 0;


So the encoding is correct for stdin/stdout/stderr but not for a pipe, and that's the cause of this issue.

I see two ways to fix this and have attached patches for both options.
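The same fd check is reachable from Python as os.device_encoding; for a pipe it returns None, which is exactly why TextIOWrapper falls back to the locale default. A small cross-platform sketch (on POSIX the non-tty case behaves analogously):

```python
import os

# device_encoding reports an encoding only for fds attached to a
# console/terminal. For a pipe it returns None, and TextIOWrapper then
# falls back to locale.getpreferredencoding(False).
r, w = os.pipe()
try:
    enc = os.device_encoding(r)
    print(enc)  # None
finally:
    os.close(r)
    os.close(w)
```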
msg266977 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-06-02 22:46
Patch B changes _Py_device_encoding() to accept a file descriptor of 3, which seems wrong to me.

Patch A is like the earlier patch, but calls os.device_encoding(1) instead of relying on sys.stdout, etc. I think this will still fail when the Python parent's stdout was never opened (then fd 1 will be invalid, or used as something else).
msg267024 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-06-03 02:31
> There is right encoding, it's encoding that's actually used.

This is true, but it puts the decision entirely in the hands of the developer(s) of the two processes involved.

All IPC on Windows uses bytes, and encodings _always_ need to be negotiated by the processes involved. You can't reliably assume or infer anything. The closest you get is to assume that both processes are using the same MSVCRT version and have not changed the defaults (except Python changes the defaults, from text to binary, so that assumption is easily broken).

Using "cmd /u" is one way to negotiate with that process for the shell=True case, but all others basically just require an explicit encoding parameter so that it can be specified. IMHO, if we make Python default to UTF-8 and subprocess use utf_8:errors (mojibake is not acceptable by default) and "cmd /u", we cover enough common cases to minimise the need to explicitly specify. (A close second best is to default to the console CP if available and default locale otherwise.)
msg267025 - (view) Author: Dāvis (davispuh) * Date: 2016-06-03 02:32
If there's no console then os.device_encoding won't fail; it will just return None, which means the ANSI codepage will be used as it currently is, so in that case nothing changes and the current behavior stays.
Also, as I showed, TextIOWrapper already calls device_encoding even if there's no console. And device_encoding doesn't actually use the fd; it just checks that it's a valid fd and then calls GetConsoleCP/GetConsoleOutputCP to get the encoding.
msg267091 - (view) Author: Eryk Sun (eryksun) * Date: 2016-06-03 11:28
> I would say almost all Windows console programs does use 
> console's encoding for input/output because otherwise 
> user wouldn't be able to read it.

While some programs do use the console codepage, even when writing to a pipe or disk file -- such as more.com, reg.exe and tasklist.exe -- it's nowhere near "all Windows console programs". As a counterexample, here's a list of Microsoft utilities that always use the OEM codepage (CP_OEMCP) when writing to a pipe or disk file:

    attrib.exe
    cacls.exe
    doskey.exe (e.g. /history)
    fc.exe
    findstr.exe (calls SetFileApisToOEM)
    hostname.exe
    icacls.exe
    net.exe
    qprocess.exe (also to console)
    quser.exe (also to console)
    sc.exe
    tree.com

To further ensure that we're on the same page, the following demonstrates what happens for creation flags DETACHED_PROCESS, CREATE_NEW_CONSOLE, and CREATE_NO_WINDOW in Windows 10:

    from subprocess import *

    DETACHED_PROCESS = 0x00000008
    CREATE_NEW_CONSOLE = 0x00000010
    CREATE_NO_WINDOW = 0x08000000

    cmd = ('python -c "import ctypes;'
           "kernel32 = ctypes.WinDLL('kernel32');"
           'print(kernel32.GetConsoleCP())"')

    >>> call('chcp.com 65001')
    Active code page: 65001
    0
    >>> check_output(cmd, creationflags=0)
    b'65001\r\n'
    >>> check_output(cmd, creationflags=DETACHED_PROCESS)
    b'0\r\n'
    >>> check_output(cmd, creationflags=CREATE_NEW_CONSOLE)
    b'437\r\n'
    >>> check_output(cmd, creationflags=CREATE_NO_WINDOW)
    b'437\r\n'

The test was run with a U.S. locale, so the OEM and ANSI codepages are 437 and 1252. With DETACHED_PROCESS there's no console, so GetConsoleCP() returns 0. That's the value of CP_ACP, so ANSI is the natural default for a detached process. CREATE_NEW_CONSOLE and CREATE_NO_WINDOW cause Windows to load a new instance of conhost.exe, which is initially set to the OEM codepage.
msg267221 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-06-04 01:49
> so ANSI is the natural default for a detached process

To clarify - ANSI is the natural default *for programs that don't support Unicode*.

Unfortunately, since "Unicode" on Windows is an incompatible data type (wchar_t rather than char), targeting Unicode rather than a code page requires completely different API calls. This would make Python's implementation much more complicated, as well as breaking some scripts and existing packages. Forcing the use of UTF-8 as the code page is the easiest way for us to support it.

I think Eryk clearly proved that we can't reliably assume or infer the right encoding for a subprocess. (When you use the ANSI APIs to print to the console, the console converts to Unicode for rendering. If you use the Unicode APIs there is no conversion, and so any codepage can be used internally without affecting what is displayed to the user.)

In short: the best available fix is to expose encoding arguments in subprocess and to fix any calls within the stdlib that need to specify them. (When we decide to separate Python's API from the C Runtime API we can break file descriptors which will let us use Unicode APIs throughout, but that's a little way off.)
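For context, this is the direction the superseding issue (#6135) took: Python 3.6 added encoding and errors parameters to Popen and the convenience wrappers. A sketch of the resulting API, using Python itself as a child whose output encoding is known:

```python
import subprocess
import sys

# The caller states the child's encoding explicitly instead of relying on
# the locale's ANSI codepage (requires Python 3.6+).
out = subprocess.check_output(
    [sys.executable, "-c",
     "import sys; sys.stdout.buffer.write('ā'.encode('utf-8'))"],
    encoding="utf-8")
print(ascii(out))  # '\u0101'
```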
msg267222 - (view) Author: Dāvis (davispuh) * Date: 2016-06-04 01:53
> qprocess.exe (also to console)
> quser.exe (also to console)

these are broken (http://i.imgur.com/0zIhHrv.png)

    >chcp 1257
    >quser
     USERNAME              SESSIONNAME
     dƒvis                 console

    > chcp 775
    > quser
     USERNAME              SESSIONNAME
     dāvis                 console


We have to decide which codepage to use as the default, and it should cover most cases, not some minority of programs. So I would say using the console's code page when it's available makes the most sense, and falling back to the ANSI codepage when it isn't.

For the special cases where our guess is wrong, only the user can know which encoding is right, so they must specify it.


I also checked that cmd's /u flag is practically useless, because it applies only to cmd itself, not to any other programs. To use it, we would need to check whether the returned output is actually UTF-16 or some other encoding, which might even pass as valid UTF-16.

for example:
    cmd /u /c "echo ā"
will return
ā in UTF-16

but
    cmd /u /c "sc query"

the result will be encoded in the OEM codepage (775 for me), with no sign of UTF-16.


I looked for a function to get the encoding used by a child process, but there isn't one; I would have expected something like GetConsoleOutputCP(hThread).
So the only way to get it is by calling GetConsoleOutputCP inside the child process with CreateRemoteThread. It's not pretty and quite hacky, but it does work; I tested it.

Anyway, even with that, we would need to change something about TextIOWrapper, because we create it before the process is even started, and the encoding isn't changeable later.




I updated the patch: it fixes the issues with creationflags and also adds an option to change the encoding, based on subprocess3.patch (from #6135).

So now, with my patch, it really works for most cases.

    >python -c "import subprocess; subprocess.getstatusoutput('ā')"

This works correctly for me, with the correct encoding, when the console's code page is set to any of 775 (OEM), 1257 (ANSI) and 65001 (UTF-8).

It also works correctly with any of DETACHED_PROCESS, CREATE_NEW_CONSOLE and CREATE_NO_WINDOW:

    >python -c "import subprocess; subprocess.getstatusoutput('ā', creationflags=0x00000008)"


This also works correctly with console encodings 775, 1257 and 65001:

    >python -c "from distutils import _msvccompiler; _msvccompiler._get_vc_env('')"



and finally 

    > chcp 1257
    > python -c "import subprocess; print(subprocess.check_output('quser', encoding='cp775'))"
     USERNAME              SESSIONNAME
     dāvis                 console

It also works correctly with any console encoding, even though quser didn't show the correct encoding inside cmd itself.
msg267246 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-06-04 08:29
"To clarify - ANSI is the natural default *for programs that don't support
Unicode*."

Exactly. For this reason, I am strongly opposed to changing the default
encoding.

I'm ok to add helper functions or new flags.

It looks like it took more than five years to support Unicode well in
Python 3 on UNIX, and now we have to enhance "Unicode support" (use
the right encoding, use a real Unicode API) on Windows :-)
msg267258 - (view) Author: Dāvis (davispuh) * Date: 2016-06-04 15:53
It makes no sense not to use a better default encoding in the cases my patch covers. Most programs use the console's encoding, not the ANSI codepage, so by limiting the default to the ANSI codepage we force basically everyone to always specify the encoding. The current behavior is that the ANSI codepage is used, which is why this issue and #27048 exist. If we keep it this way, specifying the encoding will be required in something like 99% of cases, which makes it a useless default. Actually, I don't know of any Windows program that does input/output (not talking about files, which are a different matter) in the ANSI codepage, because it would display incorrectly in a console that by default uses the OEM codepage. Anyway, my patch doesn't really change the default; it just uses the console encoding in most cases and then falls back to the same current default, the ANSI codepage.
msg267260 - (view) Author: Eryk Sun (eryksun) * Date: 2016-06-04 16:07
>> so ANSI is the natural default for a detached process
>
> To clarify - ANSI is the natural default *for programs that 
> don't support Unicode*.

By natural, I meant in the context of using GetConsoleOutputCP(), since WideCharToMultiByte(0, ...) encodes text as ANSI. Clearly UTF-16LE is preferred for IPC on Windows. It's the native Unicode format down to the lowest levels of the kernel. But we're talking about old-school IPC using standard I/O pipelines, for which I think UTF-8 is a better fit.

> Forcing the use of UTF-8 as the code page is the easiest way 
> for us to support it.

The console's behavior for codepage 65001 is too buggy. The show stopper is that it limits input to ASCII. The console allocates a temporary buffer for the encoded text that's sized assuming 1 ANSI/OEM byte per UTF-16 code. So if you enter non-ASCII characters, WideCharToMultiByte fails in conhost.exe. But the console returns that the operation has successfully read 0 bytes. Python's REPL and input() see this as EOF.

For example:

    import sys, ctypes, msvcrt
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    conin = open(r'\\.\CONIN$', 'r+')
    h = msvcrt.get_osfhandle(conin.fileno())
    buf = (ctypes.c_char * 15)()
    n = (ctypes.c_ulong * 1)()

    >>> sys.stdin.encoding
    'cp65001'

ReadFile test in Windows 10:

    >>> kernel32.ReadFile(h, buf, 15, n, None)
    Test!
    1
    >>> n[0], buf[:]
    (7, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')

    >>> kernel32.ReadFile(h, buf, 15, n, None)
    ¡Prueba!
    1
    >>> n[0], buf[:]
    (0, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')

The second call obviously fails, even though it returns 1. The input contains non-ASCII "¡", which in UTF-8 requires 2 bytes, b'\xc2\xa1'. This causes the failure in conhost.exe that I described above.

ReadConsoleA has the same problem:

    >>> kernel32.ReadConsoleA(h, buf, 15, n, None)
    Hello World!
    1
    >>> n[0], buf[:]
    (14, b'Hello World!\r\n\x00')

    >>> kernel32.ReadConsoleA(h, buf, 15, n, None)
    ¡Hola Mundo!
    1
    >>> n[0], buf[:]
    (0, b'Hello World!\r\n\x00')

UTF-8 output is also buggy prior to Windows 8. The problem is that WriteFile returns the number of UTF-16 codes written instead of the number of bytes. For non-ASCII characters in the BMP, 1 UTF-16 code is 2 or 3 UTF-8 bytes, so it looks like a partial write. A buffered writer will loop multiple times to write what appear to be the remaining bytes, producing a trail of junk lines in proportion to the number of non-ASCII characters written.

Python could work around this by decoding the buffer to get the corresponding number of UTF-16 codes written in the console, but child processes may also be subject to this bug. The only general solution on Windows 7 is to use something like ANSICON, which uses DLL injection to hook and wrap WriteFile and WriteConsoleA.

There's also a UTF-8 related bug in ulib.dll. This bug affects programs that do console codepage conversions, such as more.com. This in turn affects Python's interactive help(). I looked at this in issue 19914. The ulib bug is fixed in Windows 10. I don't know whether it's fixed in Windows 8, but it's there in Windows 7 (supported until 2020).

> This would make Python's implementation much more 
> complicated, as well as breaking some scripts and 
> existing packages.

Unless you're talking about major breakage, I think switching to the wide-character API is worth it, as the only viable path to supporting Unicode in the console. The implementation probably should transcode between UTF-16LE and UTF-8, so pure Python never sees UTF-16 byte strings. sys.std*.encoding would be 'utf-8'. os.read and os.write would be implemented as _Py_read and _Py_write (already exists). For console handles these could delegate to _Py_console_read and _Py_console_write, to convert between UTF-8 and UTF-16LE and call ReadConsoleW and WriteConsoleW.
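The transcoding layer proposed here is simple at the codec level. A minimal sketch (the _Py_console_read/_Py_console_write helpers named above are hypothetical; a real implementation would also handle chunk boundaries that split a code point):

```python
# Python-level code would keep seeing UTF-8; only at the console boundary
# would bytes be converted to/from the UTF-16LE that ReadConsoleW and
# WriteConsoleW use.
def to_console(utf8_bytes: bytes) -> bytes:
    """UTF-8 (Python side) -> UTF-16LE (for WriteConsoleW)."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")

def from_console(utf16_bytes: bytes) -> bytes:
    """UTF-16LE (from ReadConsoleW) -> UTF-8 (Python side)."""
    return utf16_bytes.decode("utf-16-le").encode("utf-8")

round_trip = from_console(to_console("¡Hola Mundo!".encode("utf-8")))
print(round_trip.decode("utf-8") == "¡Hola Mundo!")  # True
```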
msg267270 - (view) Author: Eryk Sun (eryksun) * Date: 2016-06-04 16:45
Another set of counterexamples are the utilities in the GnuWin32 collection, which use ANSI in a pipe:

    >>> call('chcp.com')
    Active code page: 437
    0
    >>> '¡'.encode('1252')
    b'\xa1'
    >>> '\xa1'.encode('437')
    b'\xad'

    >>> os.listdir('.')
    ['¡']
    >>> check_output('ls')
    b'\xa1\r\n'
    >>> check_output('echo.exe ¡')
    b'\xa1\r\n'

Writing ANSI to a pipe or disk file is not as uncommon as you seem to think. Microsoft has never dictated a standard. It doesn't even follow a standard for this within its own command-line utilities. IMO, it makes more sense for programs to use UTF-8, or even UTF-16. Codepages are a legacy that we need to move beyond. Internally the console uses UTF-16LE. 

Note that patch 3 requires setting `encoding` for even python.exe as a child process, because sys.std* default to ANSI when isatty(fd) isn't true. (The CRT's isatty is true for any character-mode file, such as NUL or a console. Checking specifically for a console handle requires GetConsoleMode. To check for a pipe or disk file, call GetFileType to check for FILE_TYPE_PIPE or FILE_TYPE_DISK.)

> I also checked that cmd /u flag is totally useless because it applies
> only to cmd itself not to any other programs

Anything else would be magic. Once a child process inherits its standard handles from cmd.exe [1], it can write whatever bytes it wants to them. In issue 27048 I proposed using the "/u" switch for shell=True only to facilitate getting results back from cmd's internal commands such as `set`. But it doesn't change anything if you're using the shell to run other programs.

[1]: Unlike Python's Popen, cmd doesn't use STARTUPINFO for this. It
     temporarily modifies its own standard handles, which works even
     when it falls back on ShellExecuteEx to run files that are 
     neither PE executables nor .BAT/.CMD files.

> I looked if there's some function to get used encoding for 
> child process but there isn't, I would have expected something 
> like GetConsoleOutputCP(hThread). So the only way to get it, 
> is by calling GetConsoleOutputCP inside child process with
> CreateRemoteThread and it's not really pretty and quite hacky, 
> but it does work, I tested.

That's not the only way. You can also start a detached Python process (via pythonw.exe or DETACHED_PROCESS) to run a script that calls AttachConsole and returns the result of calling GetConsoleOutputCP:

    from subprocess import *

    DETACHED_PROCESS   = 0x00000008
    CREATE_NEW_CONSOLE = 0x00000010

    cmd = ('python -c "import ctypes;'
           "kernel32 = ctypes.WinDLL('kernel32');"
           'kernel32.AttachConsole(%d);'
           'print(kernel32.GetConsoleOutputCP())"')

    call('chcp.com 1252')
    p = Popen('python', creationflags=CREATE_NEW_CONSOLE)
    cp = int(check_output(cmd % p.pid, creationflags=DETACHED_PROCESS))

    >>> cp
    437

> anyway even with that would need to change something about 
> TextIOWrapper because we're creating it before process is even 
> started and encoding isn't changeable later.

In this case one can detach() the buffer to wrap it in a new TextIOWrapper.
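That detach-and-rewrap step looks like this (a sketch with an in-memory buffer standing in for the child's pipe, and cp775/cp1257 as the example codepages from this thread):

```python
import io

# A stream wrapped with the wrong encoding at creation time...
raw = io.BytesIO("dāvis".encode("cp775"))
wrong = io.TextIOWrapper(raw, encoding="cp1257")

# TextIOWrapper's encoding can't be changed in place, but detach() hands
# back the underlying binary buffer so it can be wrapped again once the
# real encoding is known.
buf = wrong.detach()
right = io.TextIOWrapper(buf, encoding="cp775")
result = right.read()
print(ascii(result))  # 'd\u0101vis'
```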

>    > python -c "import subprocess; 
> print(subprocess.check_output('quser', encoding='cp775'))"
>     USERNAME              SESSIONNAME
>     dāvis                 console
>
> also works correctly with any of console's encoding even if it 
> didn't showed correct encoding inside cmd itself.

A minor point of clarification: quser.exe doesn't run "inside" cmd.exe; it runs attached to conhost.exe. The cmd shell is just the parent process.
msg267310 - (view) Author: Dāvis (davispuh) * Date: 2016-06-04 19:58
Of course I agree that the proper solution is to use the Unicode/wide API, but that's much more work to implement, and I'd rather have this half-fix now, which works for most cases, than nothing until Unicode support is implemented, which might not be any time soon.


> IMO, it makes more sense for programs to use UTF-8, or even UTF-16. Codepages are a legacy that we need to move beyond. Internally the console uses UTF-16LE. 

Yes, that's true, but we can't do anything about existing programs, so if we default to UTF-8 it will be even worse than defaulting to ANSI, because there aren't many programs on Windows that use UTF-8; in fact it's quite rare, since there isn't even good UTF-8 support in the console itself, as you mentioned. Also, I'm talking here only about ANSI WinAPI programs and console/pipe encoding, not internal or file encoding, which we don't really care about here.


> Note that patch 3 requires setting `encoding` for even python.exe as a child process, because sys.std* default to ANSI when isatty(fd) isn't true.

I think Python is a bit broken here too; IMO it should also use the console's encoding, not ANSI, when writing to a console pipe, and use ANSI only when the output really is a file.


on Windows 10 with Python 3.5.1

    >chcp
    Active code page: 775
    >python -c "print('ā')"
    ā

    >python -c "print('ā')" | echo
    ECHO is on.
    Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='cp1257'>
    OSError: [Errno 22] Invalid argument

    >chcp 1257
    Active code page: 1257
    >python -c "print('ā')" | echo
    ECHO is on.
    Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='cp1257'>
    OSError: [Errno 22] Invalid argument


in PowerShell

    >[Console]::OutputEncoding.CodePage
    775
    >python -c "print('ā')" | Out-String
    Ō
    >[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
    >python -c "print('ā')" | Out-String
    �
    >[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(1257)
    >python -c "print('ā')" | Out-String
    ā


> I proposed using the "/u" switch for shell=True only to facilitate getting results back from cmd's internal commands such as `set`. But it doesn't change anything if you're using the shell to run other programs.

But you can only do that if you know that the command you execute is a cmd internal command. If it's a user-supplied command, there's no reliable way to detect whether it will execute inside cmd or not. For example, "cmd /u /c chcp.exe" will return its result in UTF-16, because no such program exists and it is cmd's error message that gets output. Also, if the user has set.exe in %System32%, then "cmd /u /c set" won't be in UTF-16, because it will execute that program.



>> by calling GetConsoleOutputCP inside child process with CreateRemoteThread

> That's not the only way. You can also start a detached Python process (via pythonw.exe or DETACHED_PROCESS) to run a script that calls AttachConsole and returns the result of calling GetConsoleOutputCP:

While useful to know, it's still messy: you would need to keep the target process from exiting before you've called AttachConsole, and you most likely want the GetConsoleOutputCP value just before the program exits rather than at start (say, with CREATE_SUSPENDED), since the program might have changed the code page somewhere in the middle of its execution. So this route doesn't look worth pursuing.
msg267424 - (view) Author: Eryk Sun (eryksun) * Date: 2016-06-05 16:22
> so if we default to UTF-8 it will be even worse than 
> defaulting to ANSI because there aren't many programs 
> on Windows which would use UTF-8

I didn't say subprocess should default to UTF-8. What I wish is for Python to default to using UTF-8 for its own pipe and disk file I/O. The old behavior could be selected by setting some hypothetical environment variable, such as PYTHONIOUSELOCALE.

If subprocess defaults to the console's current codepage (when available), it would be nice to have a way to conveniently select the OEM or ANSI codepage. The codecs module could define string constants based on GetOEMCP() and GetACP(), such as codecs.CP_OEMCP (e.g. 'cp437') and codecs.CP_ACP (e.g. 'cp1252'). subprocess could import these constants on Windows.
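A minimal sketch of how such constants could be derived (the helper names `ansi_codepage`/`oem_codepage` are hypothetical, and the functions return None off Windows since no ANSI/OEM code page exists there):

```python
import sys

def ansi_codepage():
    """Return the codec name for the ANSI code page, e.g. 'cp1252'.
    Returns None on non-Windows platforms."""
    if sys.platform != "win32":
        return None
    import ctypes
    return "cp%d" % ctypes.windll.kernel32.GetACP()

def oem_codepage():
    """Return the codec name for the OEM code page, e.g. 'cp437'.
    Returns None on non-Windows platforms."""
    if sys.platform != "win32":
        return None
    import ctypes
    return "cp%d" % ctypes.windll.kernel32.GetOEMCP()
```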

> for example "cmd /u /c chcp.exe" will return result 
> in UTF-16 because such program doesn't exist and cmd's 
> error message will be outputted.

Yes, that's a problem. The cases where we want UTF-16 from the shell should be handled specially instead of enabled generally.

> Also if user have set.exe in %System32% then 
> "cmd /u /c set" won't be in UTF-16 because it will 
> execute that program.

This one's not actually a problem because cmd defaults to parsing `set` as its internal command. An external `set` requires quotes (e.g. `"set"` to look for set, set.com, set.exe, ...).

> you want to get GetConsoleOutputCP before program's
> exit and not at start (say with CREATE_SUSPENDED) 
> as it might have changed it somewhere in middle of
> program's execution.

You'd have to pass both the process ID and the thread ID to have the monitor call OpenProcess to get a waitable handle and OpenThread to call ResumeThread. It can be done, but it's not something I'd consider doing in practice. It's too fragile and not worth the trouble for something that's rarely required.
msg267946 - (view) Author: Dāvis (davispuh) * Date: 2016-06-09 01:07
> Note that patch 3 requires setting `encoding` for even python.exe as a child process, because sys.std* default to ANSI when isatty(fd) isn't true.

I've updated my patch so that Python uses the console's encoding for pipes too.

So now in PowerShell

    >[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
    >python -c "print('ā')" | Out-String
    ā
    >python -c "import subprocess; print(subprocess.getoutput('python -c ""print(\'ā\')""'))"
    ā
    >[Console]::OutputEncoding = [System.Text.Encoding]::GetEncoding(775)
    >python -c "print('ā')" | Out-String
    ā
    >python -c "import subprocess; print(subprocess.getoutput('python -c ""print(\'ā\')""'))"
    ā


> What I wish is for Python to default to using UTF-8 for its own pipe and disk file I/O. The old behavior could be selected by setting some hypothetical environment variable, such as PYTHONIOUSELOCALE.

I don't really see the need for this: if you specify PYTHONIOENCODING="UTF-8", it will be used for pipes.
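For example (this works on any platform, not just Windows; substitute `python` for `python3` there):

```shell
# PYTHONIOENCODING overrides the codec of the std* streams, pipes included
PYTHONIOENCODING=utf-8 python3 -c "import sys; print(sys.stdout.encoding)"
```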


> If subprocess defaults to the console's current codepage (when available), it would be nice to have a way to conveniently select the OEM or ANSI codepage. The codecs module could define string constants based on GetOEMCP() and GetACP(), such as codecs.CP_OEMCP (e.g. 'cp437') and codecs.CP_ACP (e.g. 'cp1252'). subprocess could import these constants on Windows.

I've also updated my patch to implement something like this, but IMO simpler: "ansi" and "oem" are valid encodings on Windows and can be used anywhere an encoding can be specified as a parameter. Look at the patch to see how it's implemented.


OK, so does my patch look acceptable now? Are there any issues with it? IMO it greatly improves the current situation (fixes #27048 and solves #6135) and I don't see any problems with it.

Things that are changed:
* "ansi" and "oem" are valid encodings on Windows
* console's code page is used for console and pipe (if there's no console then ANSI is used like now)
* subprocess uses "ansi" for DETACHED_PROCESS and "oem" for CREATE_NEW_CONSOLE, CREATE_NO_WINDOW
* encoding and errors parameters can be specified for Popen
* custom parameters (including encoding and errors) can be specified for subprocess.getstatusoutput and getoutput

Also, if needed, I could easily add support for separate encodings and errors for stdin/stdout/stderr,
for example with

    if isinstance(encoding, str):
        encoding_stdin = encoding_stdout = encoding_stderr = encoding
    elif isinstance(encoding, tuple):
        encoding_stdin, encoding_stdout, encoding_stderr = encoding
    else:
        encoding_stdin = encoding_stdout = encoding_stderr = None

Then one could use

    subprocess.check_output('', encoding='oem')

and

    subprocess.check_output('', encoding=('oem', 'ansi', 'ansi'))
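A usage sketch of the single-encoding form, assuming an `encoding` parameter as the patch proposes (such a parameter later shipped in Python 3.6); UTF-8 and a trivial child command are used here so the example isn't Windows-specific:

```python
import subprocess
import sys

# Decode the child's output with an explicit codec instead of relying on
# the platform default (locale/ANSI); requires Python 3.6+.
out = subprocess.check_output(
    [sys.executable, "-c", "print('hello')"],
    encoding="utf-8")
```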



Known issues (present in both cases with and without my patch):
* when using cmd.exe and Python is writing to a pipe, an error happens for some unknown reason

with cmd.exe

    >python -c "print('\n')" | echo
    ECHO is on.
    Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='cp775'>
    OSError: [Errno 22] Invalid argument

It doesn't matter which code page is set for the console or what is being output.
It happens with both the released 3.5.1 and the repo's default branch, but it doesn't happen when PowerShell is used.

I looked into it but didn't find why it happens, only that

    n = write(fd, buf, (int)count);

in _Py_write_impl (fileutils.c) returns -1 with errno set to EINVAL.
I verified that all the parameters are correct: fd, buf (it isn't NULL) and count (they are the same as when running without a pipe),
so I have no idea what causes it.


* Python corrupts characters when reading from stdin

with PowerShell

    >Out-String -InputObject "ā" | python -c "import sys; print(sys.stdin.encoding,sys.stdin.read())"
    cp1257 ?

It happens with both the released 3.5.1 and the repo's default branch.
With my patch the encoding used is based on the console's code page, but that doesn't matter because the input seems to get corrupted before the encoding is even used. I tested the console encodings oem, ansi and utf-8, and the same ones via PYTHONIOENCODING too, and in all cases the character was corrupted, replaced with "?".

I haven't looked further into this.
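A workaround sketch for the stdin case: read the raw bytes from the buffer layer and decode them explicitly ('utf-8' below is just an assumed choice, and `read_text` is a hypothetical helper, not part of the patch):

```python
import io
import sys

def read_text(binary_stream, encoding="utf-8"):
    """Read all bytes from a binary stream and decode them explicitly,
    bypassing the TextIOWrapper whose codec may not match the console."""
    return binary_stream.read().decode(encoding, errors="replace")

# e.g. read_text(sys.stdin.buffer) instead of sys.stdin.read()
print(read_text(io.BytesIO("ā".encode("utf-8"))))
```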
msg274323 - (view) Author: Dāvis (davispuh) * Date: 2016-09-03 19:49
ping? Could someone review my patch?
msg274325 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2016-09-03 20:25
You should take a look at the recent PEP 529 "Change Windows filesystem encoding to UTF-8":
https://www.python.org/dev/peps/pep-0529/
msg274328 - (view) Author: Dāvis (davispuh) * Date: 2016-09-03 21:21
That is a great PEP, but it will take a lot of time to be implemented, and it doesn't really solve this issue.

This is a different issue than filename/path encoding. Here we need to decode binary output from other applications, which for a lot of applications will be in the console's code page, but could also be anything else. This isn't about Unicode paths: an application located at an ASCII path, when run as a subprocess, can still return text output in the console's code page, OEM, ANSI or some other encoding.

My proposed subprocess_fix_encoding_v4fixed.patch fixes this for the majority of cases, and for the remaining cases an encoding can be specified explicitly.
msg274331 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-09-04 02:00
I'll take a look during the week. I like parts of the patch, though not all of it, and since we're inevitably discussing my PEPs it's sure to come up.
msg274466 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-09-05 23:24
Chatting about this with Victor we've decided to close this as a duplicate of issue6135 and continue the discussion there, and also focus mainly on exposing the parameter rather than trying to guess the correct encoding. I'll post more details on issue6135.
History
Date User Action Args
2016-09-05 23:24:48steve.dowersetstatus: open -> closed
superseder: subprocess seems to use local encoding and give no choice
messages: + msg274466

resolution: duplicate
stage: patch review -> resolved
2016-09-04 02:00:54steve.dowersetassignee: steve.dower
messages: + msg274331
2016-09-03 21:21:57davispuhsetmessages: + msg274328
2016-09-03 20:25:34hayposetmessages: + msg274325
2016-09-03 19:49:09davispuhsetmessages: + msg274323
2016-06-09 01:36:56davispuhsetfiles: + subprocess_fix_encoding_v4fixed.patch
2016-06-09 01:36:37davispuhsetfiles: - subprocess_fix_encoding_v4.patch
2016-06-09 01:07:04davispuhsetfiles: + subprocess_fix_encoding_v4.patch

messages: + msg267946
2016-06-05 16:22:30eryksunsetmessages: + msg267424
2016-06-04 19:58:46davispuhsetmessages: + msg267310
2016-06-04 16:45:38eryksunsetmessages: + msg267270
2016-06-04 16:07:16eryksunsetmessages: + msg267260
2016-06-04 15:53:24davispuhsetmessages: + msg267258
2016-06-04 08:29:01hayposetmessages: + msg267246
2016-06-04 01:53:29davispuhsetfiles: + subprocess_fix_encoding_v3.patch

messages: + msg267222
2016-06-04 01:49:20steve.dowersetmessages: + msg267221
2016-06-03 11:28:43eryksunsetmessages: + msg267091
2016-06-03 11:28:23eryksunsetmessages: - msg267090
2016-06-03 11:16:25eryksunsetmessages: + msg267090
2016-06-03 02:32:18davispuhsetmessages: + msg267025
2016-06-03 02:31:39steve.dowersetmessages: + msg267024
2016-06-02 22:46:17martin.pantersetmessages: + msg266977
2016-06-02 17:08:25davispuhsetfiles: + subprocess_fix_encoding_v2_b.patch
2016-06-02 17:08:06davispuhsetfiles: + subprocess_fix_encoding_v2_a.patch

messages: + msg266890
2016-06-02 15:31:08davispuhsetfiles: - subprocess_fix_encoding.patch
2016-06-02 05:25:33martin.pantersetmessages: + msg266875
2016-06-02 04:12:30eryksunsetnosy: + eryksun
messages: + msg266873
2016-06-02 04:08:34steve.dowersetmessages: + msg266872
2016-06-02 02:38:26martin.pantersetnosy: + martin.panter

messages: + msg266864
stage: patch review
2016-06-02 01:42:03davispuhsetmessages: + msg266860
2016-06-02 01:03:12davispuhsetmessages: + msg266854
2016-06-02 00:57:02hayposetmessages: + msg266853
2016-06-02 00:52:46davispuhcreate