Author eryksun
Recipients eryksun, izbyshev, paul.moore, steve.dower, tim.golden, u36959, vstinner, zach.ware
Date 2020-12-22.13:45:17
Message-id <1608644718.35.0.827277601212.issue42707@roundup.psfhosted.org>
In-reply-to
Content
> I understand Python should be using reading the current CP (from 
> GetConsoleOutputCP
> or using the default OEM CP, and not assuming ANSI CP for stdio

A while ago I analyzed the text encodings used by many of the legacy CLI programs in Windows. Some programs hard-code either the ANSI or OEM code page, and others use either the console's current input code page or its current output code page. Given these inconsistencies, I think defaulting to ANSI for non-console standard I/O is fine.

> There's an IO codepage set on Windows consoles (`chcp` for cmd, 
> `[Console]::InputEncoding; [Console]::OutputEncoding` for PowerShell ;

The CMD shell is a Unicode (UTF-16) application, i.e. it calls wide-character system and console I/O functions such as ReadConsoleW() and WriteConsoleW(). It still uses the console output code page, but as a kind of locale encoding. For example, CMD uses the *output* code page when reading a batch file as well as when reading output from an external command in a `FOR /F` loop. If Python were only concerned with satisfying a `FOR /F` loop in CMD, then it would be reasonable to make stdout default to the console output code page. But "more.com" and "find.exe" are commonly used as well, and they decode piped input using the console *input* code page. Other commands such as "findstr.exe" use OEM.

PowerShell adds a spin to this problem. In CMD, piping bytes between two processes doesn't actively involve the shell. It just creates an anonymous pipe, with each process connected to either end. In contrast, PowerShell injects itself as a middle man. For example, piping between "python.exe" and "more.com" is implemented as a pipe from "python.exe" to PowerShell and a separate pipe from PowerShell to "more.com". In between, PowerShell decodes the output from "python.exe" using its current output encoding and then re-encodes it using its current input encoding before writing to "more.com".
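As a rough illustration of this middle-man effect, the following Python sketch models the relay as a decode followed by a re-encode. The function name `relay`, the chosen code pages, and the use of replacement characters for bytes that can't be decoded are assumptions for illustration, not a description of how PowerShell is actually implemented.

```python
def relay(data: bytes, decode_with: str, encode_with: str) -> bytes:
    """Model a shell that decodes piped bytes and re-encodes them.

    Hypothetical helper: decode_with stands in for the shell's output
    encoding, encode_with for its input encoding.
    """
    text = data.decode(decode_with, errors="replace")
    return text.encode(encode_with, errors="replace")


e_acute = "\u00e9"  # é

# Producer and both shell encodings agree on cp437: the byte survives.
same = relay(e_acute.encode("cp437"), "cp437", "cp437")

# Producer writes ANSI cp1252, but the shell decodes as UTF-8 and
# re-encodes as cp437: the lone byte 0xE9 is invalid UTF-8 and is lost.
lossy = relay(e_acute.encode("cp1252"), "utf-8", "cp437")

print(same, lossy)
```

The point of the sketch is only that a mismatch at either stage changes the bytes before the downstream program ever sees them, which can't happen with a direct pipe in CMD.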

> # If we adjust cmd CP, it's fine too:
> L:\Cop>chcp 1252
> Page de codes active : 1252
> L:\Cop>py testcp.py | more
> é

In this case, the ANSI code-page encoded output from Python is written to a pipe that's read directly by "more.com". In turn, "more.com" decodes the input bytes using the console input code page before writing UTF-16 text to the console via WriteConsoleW(). 
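A minimal sketch of why the code page has to match, assuming the ANSI code page is 1252 and comparing a reader that decodes with code page 437 against one that decodes with 1252. It only simulates the byte-level decoding step; it doesn't involve the console or "more.com".

```python
text = "\u00e9"  # é

# What Python writes to the pipe when stdout defaults to ANSI (assumed cp1252).
data = text.encode("cp1252")

# What a reader gets when it decodes those bytes with each code page.
as_cp437 = data.decode("cp437")    # mismatched console input code page
as_cp1252 = data.decode("cp1252")  # matching code page, as after `chcp 1252`

print(ascii(data), ascii(as_cp437), ascii(as_cp1252))
```

With code page 437, the byte 0xE9 decodes to "Θ" (U+0398) instead of "é", which is the kind of mojibake seen before running `chcp 1252`.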

To make Python use the console input code page for standard I/O, query the code page via "chcp.com", and set PYTHONIOENCODING to the matching encoding name. For example:

    C:\>chcp
    Active code page: 437
    C:\>set PYTHONIOENCODING=cp437
    C:\>py -c "print('é')" | more
    é

It would be convenient to support encodings that are based on the current console code pages, maybe named "conin" and "conout", based on GetConsoleCP() and GetConsoleOutputCP(). For example:

    C:\>set PYTHONIOENCODING=conin

They could default to the process active code page from GetACP() when there's no console. "ansi" and "oem" are already supported, so all four of the common encoding abstractions would be supported.
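The idea can be approximated in user code today. The following is a sketch only: "conin" and "conout" are the hypothetical names proposed above, `resolve_console_encoding` is a made-up helper, and it assumes GetConsoleCP() and GetConsoleOutputCP() return 0 when the process has no console. On non-Windows platforms it simply falls back to the locale's preferred encoding.

```python
import codecs
import locale
import sys


def resolve_console_encoding(name):
    """Map the hypothetical names 'conin'/'conout' to a real codec name."""
    if name not in ("conin", "conout"):
        return name
    if sys.platform == "win32":
        import ctypes
        kernel32 = ctypes.WinDLL("kernel32")
        get_cp = (kernel32.GetConsoleCP if name == "conin"
                  else kernel32.GetConsoleOutputCP)
        # Assumption: the console call yields 0 without a console, in
        # which case fall back to the process active code page.
        code_page = get_cp() or kernel32.GetACP()
        return f"cp{code_page}"
    # Not Windows: there are no console code pages to query.
    return locale.getpreferredencoding(False)


def _search(name):
    """Codec search function for the two hypothetical names."""
    if name in ("conin", "conout"):
        return codecs.lookup(resolve_console_encoding(name))
    return None


codecs.register(_search)
```

Note that this doesn't make `PYTHONIOENCODING=conin` work, since standard I/O is configured before user code runs, and the codec registry caches lookups, so a later `chcp` wouldn't be picked up. A script could still call `sys.stdout.reconfigure(encoding="conin")` after registering the search function.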

> when there's redirection or piping, encoding falls back to ANSI CP 
> (from config_get_locale_encoding).

The default encoding for files is locale.getpreferredencoding(), unless UTF-8 mode is enabled. In Windows, this is the process active code page, as returned by WinAPI GetACP(). By default, this is the system ANSI code page.

Standard I/O is no exception to this, unless either PYTHONIOENCODING is set or it's a console device file. The default, non-legacy behavior for console files is to use UTF-8 at the buffer and raw I/O level. Internally, Python uses the wide-character console I/O functions ReadConsoleW() and WriteConsoleW(), with UTF-16 encoded text.
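To see which of these rules applies in a given session, a small diagnostic along the following lines can help. It's a sketch: the function name `describe_stdio` and the dictionary layout are made up for illustration, and it reports only what Python itself exposes.

```python
import locale
import os
import sys


def describe_stdio():
    """Collect the settings that determine the standard I/O encodings."""
    info = {
        "preferred_encoding": locale.getpreferredencoding(False),
        "utf8_mode": bool(sys.flags.utf8_mode),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
    }
    for name in ("stdin", "stdout", "stderr"):
        stream = getattr(sys, name)
        try:
            is_tty = bool(stream.isatty())
        except (AttributeError, ValueError):
            # Stream is missing, replaced, or closed.
            is_tty = False
        info[name] = {
            "encoding": getattr(stream, "encoding", None),
            "isatty": is_tty,
        }
    return info


if __name__ == "__main__":
    # Report on stderr so the result stays visible when stdout is piped.
    for key, value in describe_stdio().items():
        print(f"{key}: {value}", file=sys.stderr)
```

Running it once directly in the console and once piped to "more.com" should show the stdout encoding switching from UTF-8 to the preferred encoding, provided PYTHONIOENCODING isn't set and UTF-8 mode is off.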

Windows 10 allows setting the system ANSI code page to UTF-8. It also allows an application to override its active code page to UTF-8, but this override isn't easy to change after the fact: it requires adding an "activeCodePage" setting to the manifest that's embedded in the executable, which can be done using the manifest tool, "mt.exe".