classification
Title: Python uses ANSI CP for stdio on Windows console instead of using console or OEM CP
Type: behavior Stage:
Components: Windows Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, izbyshev, methane, paul.moore, steve.dower, tim.golden, u36959, vstinner, zach.ware
Priority: normal Keywords:

Created on 2020-12-21 18:59 by u36959, last changed 2020-12-23 02:50 by eryksun.

Messages (7)
msg383550 - (view) Author: Alexandre (u36959) Date: 2020-12-21 18:59
Hello, 

First of all, I hope this has not already been discussed (I searched the bug tracker, but it might have been discussed elsewhere) and that it really is a bug.

I've been struggling today to understand why a simple file redirection couldn't work properly (encoding issues) and I think I finally understand the whole thing.

There's an IO codepage set on Windows consoles (`chcp` for cmd, `[Console]::InputEncoding; [Console]::OutputEncoding` for PowerShell; `chcp` does not work in PowerShell, even though it reports that it set the CP), 850 for my locale.
When there's no redirection / piping, PyWindowsConsoleIO takes care of the encoding (utf-8, it seems), but when there's redirection or piping, encoding falls back to ANSI CP (from config_get_locale_encoding).

This behavior seems incorrect and breaks things. An example:
* testcp.py (file encoded as utf-8)
```
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

print('é')
```

* using cmd:
```
# Test condition
L:\Cop>chcp
Page de codes active : 850

# We're fine here
L:\Cop>py testcp.py
é
L:\Cop>py -c "import sys; print(sys.stdout.encoding)"
utf-8

# Now with piping
L:\Cop>py -c "import sys; print(sys.stdout.encoding)" | more
cp1252

L:\Cop>py testcp.py | more
Ú
L:\Cop>py testcp.py > lol && type lol
Ú

# If we adjust cmd CP, it's fine too:
L:\Cop>chcp 1252
Page de codes active : 1252
L:\Cop>py testcp.py | more
é
```

* with pwsh:
```
PS L:\Cop> ([Console]::InputEncoding, [Console]::OutputEncoding) | select CodePage

CodePage
--------
     850
     850

# Fine without redirection
PS L:\Cop> py .\testcp.py
é

# Here, Write-Output expects cp850
PS L:\Cop> py .\testcp.py | write-output
Ú
# Same with Out-file (used by ">")
PS L:\Cop> py .\testcp.py > lol; Get-Content lol
Ú

# Same when piping to more
PS L:\Cop> py .\testcp.py | more
Ú
```
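
The `Ú` in the sessions above is consistent with a plain encode/decode mismatch, which can be checked without any console. A minimal sketch, assuming cp1252 as the ANSI code page and cp850 as the console code page as in the sessions above; `roundtrip` is a hypothetical helper name used only for illustration:

```python
def roundtrip(text, writer_encoding, reader_encoding):
    """Encode text as the writer would, then decode it as the reader would."""
    return text.encode(writer_encoding).decode(reader_encoding)


E_ACUTE = "\u00e9"  # é, written as an escape to keep this file pure ASCII

# Python writes ANSI bytes (cp1252); the reader assumes cp850.
mismatch = roundtrip(E_ACUTE, "cp1252", "cp850")

# After `chcp 1252`, writer and reader agree.
match = roundtrip(E_ACUTE, "cp1252", "cp1252")

assert mismatch == "\u00da"  # Ú, the character seen in the piped output
assert match == E_ACUTE
```

The byte written is 0xE9 in both cases; only the reader's interpretation of it changes.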

By reading some sources today to solve my issue, I found many solutions:
* in PS `[Console]::OutputEncoding = [Text.Utf8Encoding]::new($false); $env:PYTHONIOENCODING="utf8"` or `[Console]::OutputEncoding = [Text.Encoding]::GetEncoding(1252)`
* in CMD `chcp 65001 && set PYTHONIOENCODING=utf8` (but this seems to break more) or `chcp 1252`

But reading (and trusting) https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os (https://docs.microsoft.com/en-us/windows/win32/intl/locale-idefault-constants), I understand Python should be reading the current CP (from GetConsoleOutputCP, like https://github.com/python/cpython/blob/3.9/Python/fileutils.c:) or using the default OEM CP, and not assuming ANSI CP for stdio:
> * the OEM code page for use by legacy console applications,
> * the ANSI code page for use by legacy GUI applications.

The init path I could trace : 
> https://github.com/python/cpython/blob/3.9/Python/pylifecycle.c
> init_sys_streams
>> create_stdio (https://github.com/python/cpython/blob/3.9/Python/pylifecycle.c#L1774)
>>> open.raw : https://github.com/python/cpython/blob/3.9/Modules/_io/_iomodule.c#L374
>>>> https://github.com/python/cpython/blob/3.9/Modules/_io/winconsoleio.c
>> fallback to init_sys_streams encoding
> https://github.com/python/cpython/blob/3.9/Python/initconfig.c
> config_init_stdio_encoding
> config_get_locale_encoding
> GetACP()

Some tests with GetConsoleCP:
```
L:\Cop>py -c "import os; print(os.device_encoding(0), os.device_encoding(1))" | more
cp850 None

L:\Cop>type nul | py -c "import os; print(os.device_encoding(0), os.device_encoding(1))"
None cp850

L:\Cop>type nul | py -c "import ctypes; print(ctypes.windll.kernel32.GetConsoleCP(), ctypes.windll.kernel32.GetConsoleOutputCP())"
850 850

L:\Cop>py -c "import ctypes; print(ctypes.windll.kernel32.GetConsoleCP(), ctypes.windll.kernel32.GetConsoleOutputCP())" | more
850 850
```
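
The same checks can be gathered in one small script. This is only a sketch: `describe_stdio` is a hypothetical helper, and it relies solely on `os.device_encoding()`, `os.isatty()` and the `encoding` attribute of the `sys.std*` objects. The report goes to stderr so that it stays visible when stdout is piped:

```python
import os
import sys


def describe_stdio():
    """Report, per standard stream, whether it is a console and which encodings apply."""
    report = {}
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name)
        try:
            fd = stream.fileno()
        except (AttributeError, OSError, ValueError):
            fd = None  # stream is missing or has no real file descriptor
        report[name] = {
            "isatty": os.isatty(fd) if fd is not None else False,
            # None when the descriptor is not attached to a console/terminal
            "device_encoding": os.device_encoding(fd) if fd is not None else None,
            "text_encoding": getattr(stream, "encoding", None),
        }
    return report


if __name__ == "__main__":
    for name, info in describe_stdio().items():
        print(name, info, file=sys.stderr)
```

Running it once directly and once with `| more` should show `device_encoding` turning to `None` for the redirected stream, as in the tests above.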

Some links / documentations, if useful:
* https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os
* https://docs.microsoft.com/en-us/windows/win32/intl/locale-idefault-constants
* https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getoemcp
* https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
* https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp
* https://stackoverflow.com/questions/56944301/why-does-powershell-redirection-change-the-formatting-of-the-text-content
* https://stackoverflow.com/questions/19122755/output-echo-a-variable-to-a-text-file
* https://stackoverflow.com/questions/40098771/changing-powershells-default-output-encoding-to-utf-8
* Maybe related: https://github.com/PowerShell/PowerShell/issues/10907
* https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window (will probably break things :) )
* https://stackoverflow.com/questions/49476326/displaying-unicode-in-powershell/49481797#49481797
* https://stackoverflow.com/questions/25642746/how-do-i-pipe-unicode-into-a-native-application-in-powershell

Please note I took the time to write this issue as best I could; I hope it won't be closed without an explanation of why the current behavior is normal (not that I suppose this will happen, I just don't know how people react here :) ).

Thanks a lot for Python, I really enjoy using it, 
Best, 
Alexandre
msg383566 - (view) Author: Alexey Izbyshev (izbyshev) * (Python triager) Date: 2020-12-22 02:02
> I've been struggling today to understand why a simple file redirection couldn't work properly (encoding issues)

The core issue is that "working properly" is not defined in general when we're talking about piping/redirection, as opposed to the console. Different programs that consume Python's output (or produce its input) can have different expectations wrt. data encoding, and there is no way for Python to know them in advance. In your examples, you use programs like "more" and "type" to print Python's output back to the console, so in this case using the OEM code page would produce the result that you expect. But if, for example, Python's output were consumed by a C program that uses simple `fopen()/wscanf()/wprintf()` to work with text files, the ANSI code page would be appropriate because that's what the Microsoft C runtime library defaults to for wide character operations.

Python has traditionally used the ANSI code page as the default IO encoding for non-console cases (note that Python makes no distinction between non-console `sys.std*` and the builtin `open()` wrt. encoding), and this behavior can't be changed. You can use `PYTHONIOENCODING` or enable the UTF-8 mode[1] to change the default encoding.

Note that in your example you could simply use `PYTHONIOENCODING=cp850`, which would remove the need to use `chcp`.

[1] https://docs.python.org/3/using/cmdline.html#envvar-PYTHONUTF8
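
A script can also switch the encoding itself with `reconfigure()`, available on text streams since Python 3.7. The sketch below uses an in-memory buffer as a stand-in for a redirected stdout, and assumes cp1252 and cp850 as the ANSI and console code pages from the report; on the real stream the call would be `sys.stdout.reconfigure(encoding="cp850")`:

```python
import io

# Stand-in for a redirected stdout: a byte buffer behind a text layer that
# starts out with the ANSI code page.
raw = io.BytesIO()
stream = io.TextIOWrapper(raw, encoding="cp1252", newline="")

# Similar in effect to running with PYTHONIOENCODING=cp850.
stream.reconfigure(encoding="cp850")
stream.write("\u00e9")  # é
stream.flush()

written = raw.getvalue()
assert written == b"\x82"  # é in cp850, which a cp850 reader decodes correctly
```

Unlike `PYTHONIOENCODING`, this hard-codes the choice in the script, so it only helps when the script knows what its consumer expects.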
msg383588 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-12-22 13:45
> I understand Python should be reading the current CP (from
> GetConsoleOutputCP) or using the default OEM CP, and not assuming
> ANSI CP for stdio

A while ago I analyzed text encodings used by many of the legacy CLI programs in Windows. Some programs hard code using either the ANSI or OEM code page, and others use either the console's current input code page or its current output code page. In light of the inconsistencies, I think defaulting to ANSI for non-console standard I/O is fine.

> There's an IO codepage set on Windows consoles (`chcp` for cmd, 
> `[Console]::InputEncoding; [Console]::OutputEncoding` for PowerShell ;

The CMD shell is a Unicode (UTF-16) application, i.e. it calls wide-character system and console I/O functions such as ReadConsoleW() and WriteConsoleW(). It still uses the console output code page, but as a kind of locale encoding. For example, CMD uses the *output* code page when reading a batch file as well as when reading output from an external command in a `FOR /F` loop. If Python were only concerned with satisfying a `FOR /F` loop in CMD, then it would be reasonable to make stdout default to the console output code page. But "more.com" and "find.exe" are commonly used as well, and they decode piped input using the console *input* code page. Other commands such as "findstr.exe" use OEM.

PowerShell adds a spin to this problem. In CMD, piping bytes between two processes doesn't actively involve the shell. It just creates an anonymous pipe, with each process connected to either end. In contrast, PowerShell injects itself as a middle man. For example, piping between "python.exe" and "more.com" is implemented as a pipe from "python.exe" to PowerShell and a separate pipe from PowerShell to "more.com". In between, PowerShell decodes the output from "python.exe" using its current output encoding and then re-encodes it using its current input encoding before writing to "more.com".
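
A rough model of the PowerShell case, as a sketch only. `powershell_pipe` and its parameter names are hypothetical, and the last step assumes a consumer such as "more.com" that decodes with the console input code page; other consumers may behave differently:

```python
def powershell_pipe(text, producer_enc, ps_output_enc, ps_input_enc, consumer_enc):
    """Model a PowerShell pipeline between two native programs."""
    produced = text.encode(producer_enc)
    # PowerShell decodes the producer's bytes with its current output encoding ...
    as_string = produced.decode(ps_output_enc, errors="replace")
    # ... and re-encodes the string with its current input encoding.
    forwarded = as_string.encode(ps_input_enc, errors="replace")
    # The consumer finally decodes what it receives.
    return forwarded.decode(consumer_enc, errors="replace")


# Python writes ANSI (cp1252); everything else is cp850, as in the report.
garbled = powershell_pipe("\u00e9", "cp1252", "cp850", "cp850", "cp850")

# Only PowerShell's output encoding is switched to cp1252.
intact = powershell_pipe("\u00e9", "cp1252", "cp1252", "cp850", "cp850")

assert garbled == "\u00da"  # Ú
assert intact == "\u00e9"   # é
```

In this model the damage happens at PowerShell's decoding step, which is why setting `[Console]::OutputEncoding` to match Python's output is enough to repair the pipeline.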

> # If we adjust cmd CP, it's fine too:
> L:\Cop>chcp 1252
> Page de codes active : 1252
> L:\Cop>py testcp.py | more
> é

In this case, the ANSI code-page encoded output from Python is written to a pipe that's read directly by "more.com". In turn, "more.com" decodes the input bytes using the console input code page before writing UTF-16 text to the console via WriteConsoleW(). 

To make Python use the console input code page for standard I/O, query the code page via "chcp.com", and set PYTHONIOENCODING. For example:

    C:\>chcp
    Active code page: 437
    C:\>set PYTHONIOENCODING=cp437
    C:\>py -c "print('é')" | more
    é

It would be convenient to support encodings that are based on the current console code pages, maybe named "conin" and "conout", based on GetConsoleCP() and GetConsoleOutputCP(). For example:

    C:\>set PYTHONIOENCODING=conin

They could default to the process active code page from GetACP() when there's no console. "ansi" and "oem" are already supported, so all four of the common encoding abstractions would be supported.
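
To make the idea concrete, here is a sketch of how such names could be resolved. None of this exists in CPython: "conin" and "conout" are only the names suggested above, and `resolve_encoding` and `_console_code_page` are hypothetical helpers. The only real APIs used are GetConsoleCP() and GetConsoleOutputCP() via ctypes, and `locale.getpreferredencoding()` for the fallback:

```python
import ctypes
import locale
import sys


def _console_code_page(kind):
    """Return the console input ("conin") or output ("conout") code page, or 0."""
    if sys.platform != "win32":
        return 0  # no Windows console API available
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    getter = kernel32.GetConsoleCP if kind == "conin" else kernel32.GetConsoleOutputCP
    return getter()  # 0 signals failure, for example when no console is attached


def resolve_encoding(name, get_code_page=_console_code_page):
    """Map the hypothetical names "conin"/"conout" to a codec name."""
    if name not in ("conin", "conout"):
        return name  # ordinary codec names pass through unchanged
    cp = get_code_page(name)
    if cp == 0:
        # No console: fall back to the locale encoding, which on Windows is
        # the process active code page by default.
        return locale.getpreferredencoding(False)
    return "utf-8" if cp == 65001 else f"cp{cp}"
```

The `get_code_page` parameter exists only so the mapping can be exercised without a console, for example `resolve_encoding("conin", lambda kind: 850)`.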

> when there's redirection or piping, encoding falls back to ANSI CP 
> (from config_get_locale_encoding).

The default encoding for files is locale.getpreferredencoding(), unless UTF-8 mode is enabled. In Windows, this is the process active code page, as returned by WinAPI GetACP(). By default, this is the system ANSI code page.

Standard I/O isn't excepted from this, unless either PYTHONIOENCODING is set or it's a console device file. The default, non-legacy behavior for console files is to use UTF-8 at the buffer and raw I/O level. Internally, Python uses the wide-character console I/O functions ReadConsoleW() and WriteConsoleW(), with UTF-16 encoded text.

Windows 10 allows setting the system ANSI code page to UTF-8. It also allows an application to override its active code page to UTF-8, but that's not easy to change. It requires adding an "activeCodePage" setting to the manifest that's embedded in the executable, which can be done using the manifest tool, "mt.exe".
msg383623 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-12-22 23:32
I think using the console code page for stdio is better, but I am afraid of breaking existing code.
How about special-casing only UTF-8 and leaving legacy environments as-is?

* When GetConsoleCP() returns CP_UTF8, use UTF-8 for stdin. Otherwise, use ANSI.
* When GetConsoleOutputCP() returns CP_UTF8, use UTF-8 for stdout. Otherwise, use ANSI.

This will work nicely with PowerShell or cmd with `chcp 65001` in most simple cases.
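
The rule in the two bullets above, written out as a sketch. `proposed_stdio_encoding` is a hypothetical name, and the code page would come from GetConsoleCP() for stdin and GetConsoleOutputCP() for stdout:

```python
CP_UTF8 = 65001  # Windows code page identifier for UTF-8


def proposed_stdio_encoding(console_cp, ansi_encoding):
    """Use UTF-8 only when the console code page is CP_UTF8; otherwise keep ANSI."""
    return "utf-8" if console_cp == CP_UTF8 else ansi_encoding


# After `chcp 65001`, redirected stdio would switch to UTF-8.
assert proposed_stdio_encoding(65001, "cp1252") == "utf-8"

# A legacy code page such as 850 keeps today's behavior.
assert proposed_stdio_encoding(850, "cp1252") == "cp1252"
```

Every legacy code page maps to the ANSI encoding, so existing setups would see no change.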
msg383625 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-12-23 00:50
> How about special-casing only UTF-8 and leaving legacy environments as-is?
> * When GetConsoleCP() returns CP_UTF8, use UTF-8 for stdin. 
> Otherwise, use ANSI.

Okay, and also when GetConsoleCP() fails because there's no console (e.g. python.exe w/ DETACHED_PROCESS creation flag, or pythonw.exe). 

However, using UTF-8 for the input code page is currently broken in many cases, so it should not be promoted as a recommended solution until Microsoft fixes their broken code (which should have been fixed 20 years ago; it's ridiculous). Legacy console applications rely on ReadFile and ReadConsoleA. Setting the input code page to UTF-8 is limited to reading 7-bit ASCII (ordinals 0-127). Other characters get converted to null bytes. For example:

    >>> import ctypes, os
    >>> kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    >>> kernel32.SetConsoleCP(65001)
    1
    >>> os.read(0, 10)
    ab¡¢£¤cd
    b'ab\x00\x00\x00\x00cd\r\n'
msg383626 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-12-23 01:22
> Okay, and also when GetConsoleCP() fails because there's no console (e.g. python.exe w/ DETACHED_PROCESS creation flag, or pythonw.exe). 

When there is no console, stdio should use the default textio encoding that is ANSI for now.

> However, using UTF-8 for the input code page is currently broken in many cases, so it should not be promoted as a recommended solution until Microsoft fixes their broken code (which should have been fixed 20 years ago; it's ridiculous). Legacy console applications rely on ReadFile and ReadConsoleA. Setting the input code page to UTF-8 is limited to reading 7-bit ASCII (ordinals 0-127). Other characters get converted to null bytes.

Regardless of when we promote it, people use `chcp 65001` in cmd and `[Console]::OutputEncoding = [Text.Encoding]::UTF8` in PowerShell.
In such situations, UTF-8 is the best encoding for pipes and redirected files.
msg383630 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-12-23 02:50
> When there is no console, stdio should use the default textio 
> encoding that is ANSI for now.

stdin, stdout, and stderr are special and can be special cased because they're used implicitly for IPC. They've always been acknowledged as special by the existence of PYTHONIOENCODING. I think if Python is going to change its policy for standard I/O, along the lines of what I think you've been arguing in favor of for months now, it should commit to (almost) consistently using the console input and output code pages for the standard I/O encoding in Windows, with UTF-8 as the default when there is no console session, and with the exception that UTF-8 is used for console files. To get legacy behavior, set PYTHONLEGACYWINDOWSSTDIO, which will use the console code pages for console standard I/O and otherwise use the process active code page for standard I/O.

The default encoding for open() would still be the process active code page from GetACP(), and the recommendation should be for scripts to use an explicit `encoding`.
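
For scripts, the explicit form looks like the sketch below. The temporary path and the choice of UTF-8 are only for illustration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "out.txt")

# An explicit encoding makes the bytes on disk independent of GetACP().
with open(path, "w", encoding="utf-8") as f:
    f.write("\u00e9")  # é

with open(path, "rb") as f:
    data = f.read()

assert data == b"\xc3\xa9"  # UTF-8 bytes for é, whatever the active code page

# The reader needs the same explicit encoding to get the text back.
with open(path, encoding="utf-8") as f:
    assert f.read() == "\u00e9"
```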
History
Date                 User      Action  Args
2020-12-23 02:50:45  eryksun   set     messages: + msg383630
2020-12-23 01:22:55  methane   set     messages: + msg383626
2020-12-23 00:50:37  eryksun   set     messages: + msg383625
2020-12-22 23:32:20  methane   set     nosy: + methane; messages: + msg383623
2020-12-22 13:45:18  eryksun   set     nosy: + eryksun; messages: + msg383588
2020-12-22 02:02:17  izbyshev  set     nosy: + izbyshev, vstinner; messages: + msg383566; versions: + Python 3.8, Python 3.9, Python 3.10
2020-12-21 18:59:31  u36959    create