
classification
Title: Force console stdout to use UTF8 on Windows
Type:
Stage: resolved
Components: Windows
Versions: Python 3.5
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: windows console doesn't print or input Unicode (issue 1602)
Assigned To: paul.moore
Nosy List: martin.panter, paul.moore, piotr.dobrogost, r.david.murray, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal
Keywords:

Created on 2015-04-09 17:22 by paul.moore, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (8)
msg240354 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2015-04-09 17:22
Console code page issues are a consistent source of problems on Windows. It would be nice, given that the Windows console has Unicode support, if Python could write the full range of Unicode to the console by default.

The MSVC runtime appears to have a flag that can be set via _setmode(), _O_U8TEXT, which "enables Unicode mode" (see https://msdn.microsoft.com/en-us/library/tw4k6df8%28v=vs.100%29.aspx?f=255&MSPPError=-2147217396, in particular the second example). It seems as if Python could set U8TEXT mode on sys.stdout on startup (assuming it's a console) and set the encoding to UTF8, and then Unicode output would "just work".

I don't have code that implements this yet, but if I can get my head round the IO stack and the Python startup code, I'll give it a go.

Steve - any comments on whether this might work? I've never seen any real-world code using U8TEXT which makes me wonder if it's reliable (doing msvcrt.setmode(sys.stdout.fileno(), 0x40000) in Python 3.4 causes Python to crash, which is worrying, but it works in 3.5...).
msg240375 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-04-09 20:10
There are a lot of issues in this tracker (for some definition of a lot) that indicate that the console does *not* support unicode.  So if you are writing utf-8 I wouldn't expect this to work.  (If it were an API taking unicode directly, that might be a different story.)  But the amount I know about Windows is pretty small, so I sure hope you are right.
msg240378 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-09 20:21
> There are a lot of issues in this tracker (for some definition of a lot) that indicate that the console does *not* support unicode.

The main issue is the issue #1602.
msg240386 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2015-04-09 20:44
Generally, my understanding is that the console does pretty badly at supporting Unicode via the normal WriteFile APIs and the "code page" support (mainly because support for the UTF8 code page is rubbish). But the WriteConsole API does, I believe, have pretty solid Unicode support (it's what PowerShell uses, for example). Typically, attempts to support Unicode for Python console output (e.g., win_unicode_console on PyPI) deal with this by making a file-like object that calls WriteConsole under the hood, and replacing sys.stdout with it. The problem with this approach is that it isn't a "normal" text stream object (there's no underlying raw bytes buffer), so the result isn't seamless (although win_unicode_console is pretty good).

What I noticed is that the C runtime supports an _O_U8TEXT mode for console file descriptors, at the (bytes) write() level. So that could be seamlessly integrated into the bytes IO layer of the Python IO stack.

As far as I can tell from the description, the way it works is to treat a block of bytes written via write() as a UTF8 string, decode it to Unicode and write it to the console via WriteConsole(). (I haven't checked the CRT source, but that seems like the most likely implementation).

Code speaks louder than words, obviously, and I do intend to produce a trial implementation. But that'll take a bit of time because I need to understand how the IO stack hangs together first.

An alternative approach would be a RawIOBase implementation that wrote bytes to the console by (re-)decoding them from UTF8 and using WriteConsole, then wrapping that in the usual buffered IO and text IO layers (with the text IO layer using UTF8 encoding). That may well be implementable in pure Python, and make a good prototype implementation. Hmm...
msg240388 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2015-04-09 20:51
Doh. That latter approach (a RawIOBase implementation) is *precisely* what win_unicode_console does for stdout (using utf16le rather than utf8 as that's the native Windows encoding used by WriteConsole). So (a) yes it would work, and (b) it has already been demonstrated in the wild that the approach is viable.

(Actually, a C implementation of this approach might be a better way of implementing this anyway, rather than relying on a relatively obscure C runtime feature).
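
The layering described in the last two messages (a RawIOBase at the bottom, then the usual buffered and text layers using UTF8) can be sketched roughly as follows. This is an illustration only, not the code of win_unicode_console or of any eventual CPython change. The names ConsoleRawWriter, make_console_stdout and write_text are hypothetical; write_text stands in for whatever actually calls WriteConsole, which is left out here so the sketch runs on any platform.

```python
import codecs
import io


class ConsoleRawWriter(io.RawIOBase):
    """Raw byte stream that decodes UTF-8 and hands the text to a console writer."""

    def __init__(self, write_text):
        # write_text: hypothetical callable taking a str. On Windows it would
        # wrap a WriteConsole call; here it is injected so the class is testable.
        super().__init__()
        self._write_text = write_text
        # An incremental decoder copes with a multi-byte sequence that is
        # split across two write() calls by the buffering layer.
        self._decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    def writable(self):
        return True

    def write(self, b):
        data = bytes(b)  # the buffered layer may pass a memoryview
        text = self._decoder.decode(data)
        if text:
            self._write_text(text)
        return len(data)  # report every byte as consumed


def make_console_stdout(write_text):
    """Stack the standard buffered and text layers on top of the raw writer."""
    raw = ConsoleRawWriter(write_text)
    buffered = io.BufferedWriter(raw)
    return io.TextIOWrapper(buffered, encoding="utf-8", line_buffering=True)
```

Because the result is an ordinary TextIOWrapper, it keeps a .buffer attribute for code that writes bytes directly, which is the "seamless" property the file-like-object approach lacks.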
msg240694 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2015-04-13 18:17
My proof-of-concept attempt to use _O_U8TEXT resulted in some very bizarre behaviour - odd buffering of the interactive interpreter output and what appear to be Chinese characters being displayed for normal (ASCII) interactions.

I suspect there is some oddity around how _O_U8TEXT works. The approach looks too fragile to pursue. I'll look further into the RawIOBase option.
msg241465 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-04-18 23:03
If sys.stdout is modified, it must be carefully tested in various scenarios:

- Windows console, default config
- Windows console, TrueType font
- PowerShell => see #21927, it looks like PowerShell has its own set of Unicode issues
- Redirect output into a file
- etc.

Very good articles by Michael S. Kaplan on Windows stdout/console:
- "Conventional wisdom is retarded, aka What the @#%&* is _O_U16TEXT?"
  http://www.siao2.com/2008/03/18/8306597.aspx
- "Myth busting in the console"
  http://www.siao2.com/2010/10/07/10072032.aspx
- "Cunningly conquering communicated console caveats. Comprende, mon Capitán?"
  http://www.siao2.com/2010/05/07/10008232.aspx

See also fwide() function.

Good luck...
msg290507 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-03-26 06:19
This seems to be discussing the same sort of stuff that ended up with the Issue 1602 implementation.
History
Date                 User             Action  Args
2022-04-11 14:58:15  admin            set     github: 68089
2017-03-26 06:19:27  martin.panter    set     status: open -> closed
                                              superseder: windows console doesn't print or input Unicode
                                              nosy: + martin.panter
                                              messages: + msg290507
                                              resolution: duplicate
                                              stage: resolved
2015-04-18 23:03:01  vstinner         set     messages: + msg241465
2015-04-13 18:17:49  paul.moore       set     messages: + msg240694
2015-04-10 08:33:51  piotr.dobrogost  set     nosy: + piotr.dobrogost
2015-04-09 20:51:15  paul.moore       set     messages: + msg240388
2015-04-09 20:44:32  paul.moore       set     messages: + msg240386
2015-04-09 20:21:54  vstinner         set     nosy: + vstinner
                                              messages: + msg240378
2015-04-09 20:10:51  r.david.murray   set     nosy: + r.david.murray
                                              messages: + msg240375
2015-04-09 17:22:37  paul.moore       create