Author u36959
Recipients paul.moore, steve.dower, tim.golden, u36959, zach.ware
Date 2020-12-21.18:59:30
Message-id <1608577171.81.0.55180678487.issue42707@roundup.psfhosted.org>
In-reply-to
Content
Hello, 

First of all, I hope this has not already been discussed (I searched the bug tracker, but it may have been discussed elsewhere) and that it really is a bug.

I struggled today to understand why a simple file redirection would not work properly (encoding issues), and I think I finally understand the whole thing.

Windows consoles have an I/O code page (shown by `chcp` in cmd and by `[Console]::InputEncoding; [Console]::OutputEncoding` in PowerShell; `chcp` does not work in PowerShell even though it reports that it set the CP). It is 850 for my locale.
When there is no redirection or piping, PyWindowsConsoleIO takes care of the encoding (utf-8, it seems), but when there is redirection or piping, the encoding falls back to the ANSI CP (from config_get_locale_encoding).
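
To check which of the two cases applies to a given invocation, a small diagnostic script can help. This is only a sketch: `describe_stream` is a helper name made up for this report, not something from CPython, and it only uses `os.isatty`, `os.device_encoding` and `locale.getpreferredencoding` from the standard library.

```python
import locale
import os
import sys


def describe_stream(stream):
    """Collect the encoding-related facts about one text stream."""
    try:
        fd = stream.fileno()
    except (AttributeError, OSError, ValueError):
        # In-memory streams (e.g. io.StringIO) have no file descriptor.
        fd = None
    return {
        "name": getattr(stream, "name", None),
        "isatty": bool(fd is not None and os.isatty(fd)),
        "encoding": getattr(stream, "encoding", None),
        # os.device_encoding returns None when fd is not a terminal.
        "device_encoding": os.device_encoding(fd) if fd is not None else None,
    }


if __name__ == "__main__":
    info = describe_stream(sys.stdout)
    info["locale_encoding"] = locale.getpreferredencoding(False)
    # Write to stderr so the report stays visible when stdout is redirected.
    print(info, file=sys.stderr)
```

Running it once directly and once with `| more` should show `encoding` switching from the console encoding to the locale (ANSI) encoding, if my understanding above is right.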

This behavior seems incorrect and breaks things. An example:
* testcp.py (file encoded as utf-8)
```
#!/usr/bin/env python3
# -*- coding: utf-8

print('é')
```

* using cmd:
```
# Test condition
L:\Cop>chcp
Page de codes active : 850

# We're fine here
L:\Cop>py testcp.py
é
L:\Cop>py -c "import sys; print(sys.stdout.encoding)"
utf-8

# Now with piping
L:\Cop>py -c "import sys; print(sys.stdout.encoding)" | more
cp1252

L:\Cop>py testcp.py | more
Ú
L:\Cop>py testcp.py > lol && type lol
Ú

# If we adjust cmd CP, it's fine too:
L:\Cop>chcp 1252
Page de codes active : 1252
L:\Cop>py testcp.py | more
é
```

* with pwsh:
```
PS L:\Cop> ([Console]::InputEncoding, [Console]::OutputEncoding) | select CodePage

CodePage
--------
     850
     850

# Fine without redirection
PS L:\Cop> py .\testcp.py
é

# Here, Write-Output expects cp850
PS L:\Cop> py .\testcp.py | write-output
Ú
# Same with Out-file (used by ">")
PS L:\Cop> py .\testcp.py > lol; Get-Content lol
Ú

# 
PS L:\Cop> py .\testcp.py | more
Ú
```
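
The `Ú` in these transcripts is consistent with the byte being written in cp1252 and read back as cp850. A minimal sketch that reproduces the mismatch in pure Python (`misread` is a hypothetical helper, used only for illustration):

```python
def misread(text, written_as="cp1252", read_as="cp850"):
    """Encode text with one code page and decode the bytes with another."""
    return text.encode(written_as).decode(read_as)


if __name__ == "__main__":
    result = misread("é")
    # Print the code point as well, so the result is readable on any console.
    print(f"U+{ord(result):04X}", ascii(result))
```

Here `"é"` becomes the single byte 0xE9 in cp1252, and 0xE9 is `Ú` (U+00DA) in cp850, which matches what `more`, `type` and `Get-Content` display.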

While reading around today to solve my issue, I found several workarounds:
* in PS `[Console]::OutputEncoding = [Text.Utf8Encoding]::new($false); $env:PYTHONIOENCODING="utf8"` or `[Console]::OutputEncoding = [Text.Encoding]::GetEncoding(1252)`
* in CMD `chcp 65001 && set PYTHONIOENCODING=utf8` (but this seems to break more things) or `chcp 1252`
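
A script-side workaround also seems possible. The following is only a sketch under my own assumptions, not the fix I am proposing for CPython: `console_output_encoding` and `match_console` are hypothetical names, a return value of 0 from `GetConsoleOutputCP` is treated as "no console", and `reconfigure` requires Python 3.7 or later.

```python
import sys


def console_output_encoding():
    """Return e.g. 'cp850' for the attached console, or None if unavailable."""
    if sys.platform != "win32":
        return None
    import ctypes
    codepage = ctypes.windll.kernel32.GetConsoleOutputCP()
    # Assumption: 0 means the call failed or no console is attached.
    return f"cp{codepage}" if codepage else None


def match_console(stream):
    """Re-encode a redirected text stream with the console code page.

    Returns True if the stream was reconfigured, False otherwise.
    """
    encoding = console_output_encoding()
    if encoding is None or stream.isatty():
        return False
    stream.reconfigure(encoding=encoding)
    return True


if __name__ == "__main__":
    match_console(sys.stdout)
```

Calling `match_console(sys.stdout)` at the top of testcp.py should make the piped output use cp850 on my machine, without touching the shell configuration.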

But after reading (and trusting) https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os (https://docs.microsoft.com/en-us/windows/win32/intl/locale-idefault-constants), my understanding is that Python should be reading the current CP (from GetConsoleOutputCP, like https://github.com/python/cpython/blob/3.9/Python/fileutils.c:) or using the default OEM CP, rather than assuming the ANSI CP for stdio:
> * the OEM code page for use by legacy console applications,
> * the ANSI code page for use by legacy GUI applications.
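
To compare the code pages involved on a given machine, they can be queried through ctypes. This is a sketch: `windows_code_pages` is a name made up here, and it returns None on non-Windows platforms where `ctypes.windll` does not exist.

```python
import sys


def windows_code_pages():
    """Return the ANSI, OEM and console code pages, or None off Windows."""
    if sys.platform != "win32":
        return None
    import ctypes
    kernel32 = ctypes.windll.kernel32
    return {
        "ansi (GetACP)": kernel32.GetACP(),
        "oem (GetOEMCP)": kernel32.GetOEMCP(),
        "console input (GetConsoleCP)": kernel32.GetConsoleCP(),
        "console output (GetConsoleOutputCP)": kernel32.GetConsoleOutputCP(),
    }


if __name__ == "__main__":
    print(windows_code_pages())
```

On my setup I would expect the ANSI value to be 1252 and the console values to be 850, which is exactly the mismatch described above.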

The init path I could trace : 
> https://github.com/python/cpython/blob/3.9/Python/pylifecycle.c
> init_sys_streams
>> create_stdio (https://github.com/python/cpython/blob/3.9/Python/pylifecycle.c#L1774)
>>> open.raw : https://github.com/python/cpython/blob/3.9/Modules/_io/_iomodule.c#L374
>>>> https://github.com/python/cpython/blob/3.9/Modules/_io/winconsoleio.c
>> fall back to init_sys_streams encoding
> https://github.com/python/cpython/blob/3.9/Python/initconfig.c
> config_init_stdio_encoding
> config_get_locale_encoding
> GetACP()

Some tests with GetConsoleCP:
```
L:\Cop>py -c "import os; print(os.device_encoding(0), os.device_encoding(1))" | more
cp850 None

L:\Cop>type nul | py -c "import os; print(os.device_encoding(0), os.device_encoding(1))"
None cp850

L:\Cop>type nul | py -c "import ctypes; print(ctypes.windll.kernel32.GetConsoleCP(), ctypes.windll.kernel32.GetConsoleOutputCP())"
850 850

L:\Cop>py -c "import ctypes; print(ctypes.windll.kernel32.GetConsoleCP(), ctypes.windll.kernel32.GetConsoleOutputCP())" | more
850 850
```

Some links and documentation, in case they are useful:
* https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os
* https://docs.microsoft.com/en-us/windows/win32/intl/locale-idefault-constants
* https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getoemcp
* https://docs.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
* https://docs.microsoft.com/en-us/windows/console/getconsoleoutputcp
* https://stackoverflow.com/questions/56944301/why-does-powershell-redirection-change-the-formatting-of-the-text-content
* https://stackoverflow.com/questions/19122755/output-echo-a-variable-to-a-text-file
* https://stackoverflow.com/questions/40098771/changing-powershells-default-output-encoding-to-utf-8
* Maybe related: https://github.com/PowerShell/PowerShell/issues/10907
* https://stackoverflow.com/questions/57131654/using-utf-8-encoding-chcp-65001-in-command-prompt-windows-powershell-window (will probably break things :) )
* https://stackoverflow.com/questions/49476326/displaying-unicode-in-powershell/49481797#49481797
* https://stackoverflow.com/questions/25642746/how-do-i-pipe-unicode-into-a-native-application-in-powershell

Please note that I took the time to write this issue as best I could. I hope it won't be closed without an explanation of why the current behavior is normal (not that I assume this will happen, I just don't know how people react here :) ).

Thanks a lot for Python, I really enjoy using it, 
Best, 
Alexandre