windows console doesn't print or input Unicode #45943

mark-summerfield · 2007-12-12T09:56:31Z

BPO	1602
Nosy	@malemburg, @mhammond, @terryjreedy, @pfmoore, @amauryfa, @ncoghlan, @pitrou, @giampaolo, @tjguk, @mark-summerfield, @ned-deily, @ezio-melotti, @florentx, @4kir4, @lilydjwg, @berkerpeksag, @vadmium, @eryksun, @zooba, @davispuh
Superseder	bpo-28217: Add interactive console tests
Files	sys_write_stdout.patch unicode2.py doc-patch.diff: Proposed changes to user-visible documentation unicode3.py win_console.patch test_win_console.py streams.py wincontest.py: Example io.TextIOWrapper sublcass using WideCharToMultiByte winconsoleio.diff 1602_2.patch 1602_3.patch 1602_4.patch 1602_5.patch 1602_6.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = 'https://github.com/zooba'
closed_at = <Date 2016-09-09.16:42:53.594>
created_at = <Date 2007-12-12.09:56:30.846>
labels = ['type-bug', 'expert-unicode', 'OS-windows']
title = "windows console doesn't print or input Unicode"
updated_at = <Date 2016-10-22.10:46:13.515>
user = 'https://github.com/mark-summerfield'

bugs.python.org fields:

activity = <Date 2016-10-22.10:46:13.515>
actor = 'THRlWiTi'
assignee = 'steve.dower'
closed = True
closed_date = <Date 2016-09-09.16:42:53.594>
closer = 'steve.dower'
components = ['Unicode', 'Windows']
creation = <Date 2007-12-12.09:56:30.846>
creator = 'mark'
dependencies = []
files = ['19493', '20320', '20363', '23461', '23470', '23471', '36120', '40990', '44094', '44290', '44379', '44409', '44449', '44452']
hgrepos = []
issue_num = 1602
keywords = ['patch']
message_count = 148.0
messages = ['58487', '58621', '58651', '87086', '88059', '88077', '92854', '94445', '94480', '94483', '94496', '108173', '108228', '116801', '120414', '120415', '120416', '120700', '125823', '125824', '125826', '125833', '125852', '125877', '125889', '125890', '125898', '125899', '125938', '125942', '125947', '125956', '126286', '126288', '126303', '126304', '126308', '126319', '127782', '131657', '131854', '132060', '132061', '132062', '132064', '132065', '132067', '132184', '132191', '132208', '132266', '132268', '145898', '145899', '145963', '145964', '146471', '148990', '157569', '160812', '160813', '160897', '161151', '161153', '161308', '161651', '164572', '164578', '164580', '164618', '164619', '170899', '170915', '170999', '185135', '197700', '197751', '197752', '197773', '221175', '221178', '223403', '223404', '223507', '223509', '223945', '223946', '223947', '223948', '223949', '223951', '223952', '224019', '224086', '224095', '224596', '224605', '224690', '227329', '227330', '227332', '227333', '227337', '227338', '227347', '227354', '227373', '227374', '227441', '227450', '228191', '228208', '228210', '233347', '233350', '233916', '233937', '234019', '234020', '234096', '234371', '242884', '254405', '254407', '272596', '272605', '272645', '272662', '272675', '272716', '272718', '272720', '273999', '274449', '274673', '274884', '274906', '274912', '274939', '275003', '275004', '275005', '275157', '275362', '275510', '277047', '277048', '277050']
nosy_count = 38.0
nosy_names = ['lemburg', 'mhammond', 'terry.reedy', 'paul.moore', 'tzot', 'amaury.forgeotdarc', 'ncoghlan', 'pitrou', 'giampaolo.rodola', 'tim.golden', 'mark', 'ned.deily', 'christoph', 'ezio.melotti', 'v+python', 'hippietrail', 'flox', 'THRlWiTi', 'davidsarah', 'santoso.wijaya', 'akira', 'David.Sankel', 'python-dev', 'smerlin', 'lilydjwg', 'berker.peksag', 'martin.panter', 'piotr.dobrogost', 'eryksun', 'Drekin', 'steve.dower', 'wiz21', 'stijn', 'Jonitis', 'gurnec', 'escapewindow', 'dead1ne', 'davispuh']
pr_nums = []
priority = 'high'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = '28217'
type = 'behavior'
url = 'https://bugs.python.org/issue1602'
versions = ['Python 3.6']

mark-summerfield · 2007-12-12T09:56:30Z

I am not sure if this is a Python bug or simply a limitation of cmd.exe.

I am using Windows XP Home.
I run cmd.exe with the /u option and I have set my console font to
"Lucida Console" (the only TrueType font offered), and I run chcp 65001
to set the utf8 code page.
When I run the following program:

for x in range(32, 2000):
    print("{0:5X} {0:c}".format(x))

one blank line is output.

But if I do chcp 1252 the program prints up to 7F before hitting a
unicode encoding error.

This is different behaviour from Python 2.5.1 which (with a suitably
modified print line) after chcp 65001 prints up to 7F and then fails
with "IOError: [Errno 0] Error".
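The cp1252 failure right after 7F can be reproduced without a console. The following is a minimal sketch that uses only the standard codecs; the helper name `first_unencodable` is made up for illustration. U+0080 is a C1 control character with no byte in cp1252, so encoding stops there, while UTF-8 can encode the whole range.

```python
def first_unencodable(code_points, encoding):
    """Return the first code point that `encoding` cannot encode, or None."""
    for cp in code_points:
        try:
            chr(cp).encode(encoding)
        except UnicodeEncodeError:
            return cp
    return None


# Same range as the test program in this report.
cp1252_stop = first_unencodable(range(32, 2000), "cp1252")
utf8_stop = first_unencodable(range(32, 2000), "utf-8")
print(hex(cp1252_stop), utf8_stop)
```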

mark-summerfield · 2007-12-14T11:31:28Z

I've looked into this a bit more, and from what I can see, code page
65001 just doesn't work---so it is a Windows problem not a Python problem.
A possible solution might be to read/write UTF16 which "managed" Windows
applications can do.

tiran · 2007-12-15T02:08:14Z

We are aware of multiple Windows related problems. We are planning to
rewrite parts of the Windows specific API to use the widechar variants.
Maybe that will help.

pitrou · 2009-05-03T23:57:03Z

Yes, it is a Windows problem. There simply doesn't seem to be a true
Unicode codepage for command-line apps. Recommend closing.

tzot · 2009-05-19T00:08:57Z

Just in case it helps, this behaviour is on Win XP Pro, Python 2.5.1:

First, I added an alias for 'cp65001' to 'utf_8' in
Lib/encodings/aliases.py .

Then, I opened a command prompt with a bitmap font.

c:\windows\system32>python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)] on
win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print u"\N{EM DASH}"
â€”

I switched the font to Lucida Console, and retried (without exiting the
python interpreter, although the behaviour is the same when exiting and
entering again: )

>>> print u"\N{EM DASH}"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 13] Permission denied

Then I tried (by pressing Alt+0233 for é, which is invalid in my normal
cp1253 codepage):

>>> print u"née"

and the interpreter exits without any information. So it does for:

>>> a=u"née"

Then I created a UTF-8 text file named 'test65001.py':

# -*- coding: utf_8 -*-
a=u"néeα"
print a

and tried to run it directly from the command line:

c:\windows\system32>python d:\src\PYTHON\test65001.py
néeαTraceback (most recent call last):
  File "d:\src\PYTHON\test65001.py", line 4, in <module>
    print a
IOError: [Errno 2] No such file or directory

You see? It printed all the characters before failing.

Also the following works:

c:\windows\system32>echo heéε
heéε

and

c:\windows\system32>echo heéε >D:\src\PYTHON\dummy.txt

creates successfully a UTF-8 file (without any UTF-8 BOM marks at the
beginning).

So it's possible that it is a python bug, or at least something can be
done about it.

amauryfa · 2009-05-19T09:46:12Z

an immediate thing to do is to declare cp65001 as an encoding:

Index: Lib/encodings/aliases.py
===================================================================

--- Lib/encodings/aliases.py    (revision 72757)
+++ Lib/encodings/aliases.py    (working copy)
@@ -511,6 +511,7 @@
     'utf8'               : 'utf_8',
     'utf8_ucs2'          : 'utf_8',
     'utf8_ucs4'          : 'utf_8',
+    'cp65001'            : 'utf_8',

 ## uu_codec codec
 #'uu'                 : 'uu_codec',

This is not enough unfortunately, because the win32 API function
WriteFile() returns the number of characters written, not the number of
(utf8) bytes:

>>> print("\u0124\u0102" + 'abc')
ĤĂabc
c
[44420 refs]
>>>

Additionally, there is a bug in the ReadFile, which returns an empty
string (and no error) when a non-ascii character is entered, which is
the behavior of an EOF condition...

Maybe the solution is to use the win32 console API directly...
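The duplicated trailing "c" is consistent with a write call that reports characters where the caller expects bytes. The sketch below only models that reported behaviour in pure Python; `buggy_console_write` and `write_all` are hypothetical stand-ins, not the real WriteFile() or CPython's buffered writer.

```python
def buggy_console_write(data, shown):
    """Hypothetical stand-in for the misbehaving write call.

    It displays every byte it receives but reports the number of
    characters written instead of the number of bytes.
    """
    text = data.decode("utf-8", errors="replace")
    shown.append(text)
    return len(text)


def write_all(data, shown):
    """Retry loop of the kind a buffered writer uses after a short write."""
    while data:
        reported = buggy_console_write(data, shown)
        data = data[reported:]  # the caller believes these bytes are still pending


shown = []
# print() appends a newline, so 6 characters are encoded as 8 UTF-8 bytes.
write_all("\u0124\u0102abc\n".encode("utf-8"), shown)
print("".join(shown), end="")
```

Because 6 is reported for 8 bytes, the last two bytes (`c` and the newline) are written a second time, which matches the transcript above.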

tzot · 2009-09-19T00:38:48Z

Another note:
if one creates a dummy Stream object (having a softspace attribute and a
write method that writes using os.write, as in
http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/1432462#1432462
) to replace sys.stdout and sys.stderr, then writes occur correctly,
without issues. Pre-requisites:
chcp 65001, Lucida Console font and cp65001 as an alias for UTF-8 in
encodings/aliases.py
This is Python 2.5.4 on Windows.
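A Python 3 rendering of that idea might look like the sketch below. The class name `FdWriter` is an assumption for illustration and this is not the code from the linked answer; it simply encodes text as UTF-8 and hands the bytes to os.write().

```python
import os


class FdWriter:
    """File-like object that encodes text as UTF-8 and writes it with os.write()."""

    def __init__(self, fd, encoding="utf-8"):
        self.fd = fd
        self.encoding = encoding
        self.softspace = 0  # only consulted by the Python 2 print statement

    def write(self, text):
        data = text.encode(self.encoding)
        while data:
            written = os.write(self.fd, data)  # may be a short write
            data = data[written:]
        return len(text)

    def flush(self):
        pass  # os.write() is unbuffered, so there is nothing to flush
```

It could then be installed with `sys.stdout = FdWriter(1)`; whether the console renders the bytes correctly still depends on the code page and font discussed in this issue.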

vpython · 2009-10-25T00:06:49Z

With Python 3.1.1, the following batch file seems to be necessary to use
UTF-8 successfully from an XP console:

set PYTHONIOENCODING=UTF-8
cmd /u /k chcp 65001
set PYTHONIOENCODING=
exit

the cmd line seems to be necessary because of Windows having
compatibility issues, but it seems that Python should notice the cp65001
and not need the PYTHONIOENCODING stuff.
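The effect of PYTHONIOENCODING can be checked from a parent process. This is a small sketch; the helper name `child_stdout_encoding` is made up, and it only reports what a child interpreter sees on a piped stdout, which says nothing about how the console itself renders the bytes.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Start a child interpreter with PYTHONIOENCODING set and return the
    normalized name of the encoding it reports for sys.stdout."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env, capture_output=True, check=True,
    )
    reported = result.stdout.decode("ascii").strip()
    return codecs.lookup(reported).name  # normalize spelling such as UTF-8 vs utf-8
```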

mark-summerfield · 2009-10-26T09:07:28Z

Glenn Linderman's fix pretty well works for me on XP Home. I can print
every Unicode character up to and including U+D7FF (although most just
come out as rectangles, at least I don't get encoding errors).

It fails at U+D800 with message:

UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
position 17: surrogates not allowed

I also tried U+D801 and got the same error.

Nonetheless, this is *much* better than before.

malemburg · 2009-10-26T09:19:55Z

Mark Summerfield wrote:

> Glenn Linderman's fix pretty well works for me on XP Home. I can print
> every Unicode character up to and including U+D7FF (although most just
> come out as rectangles, at least I don't get encoding errors).
>
> It fails at U+D800 with message:
>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 17: surrogates not allowed
>
> I also tried U+D801 and got the same error.

That's normal and expected: D800 is the start of the surrogate
ranges which are only allowed in pairs in UTF-8.
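The boundary at U+D800 is easy to confirm in Python 3. The sketch below (helper name `encodes_as_utf8` is made up) shows that the strict UTF-8 codec rejects a lone surrogate, and that in UTF-16 a surrogate pair is how a code point above U+FFFF is stored.

```python
def encodes_as_utf8(ch):
    """Return True if the strict UTF-8 codec accepts the character."""
    try:
        ch.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


last_ok = encodes_as_utf8("\ud7ff")         # last code point before the surrogates
lone_surrogate = encodes_as_utf8("\ud800")  # lone high surrogate
# U+1F600 is stored in UTF-16 as the surrogate pair D83D DE00.
pair = "\U0001F600".encode("utf-16-le")
print(last_ok, lone_surrogate, pair.hex())
```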

vpython · 2009-10-26T17:06:21Z

The choice of the Lucida Console or the Consolas font cures most of the
rectangle problems. Those are just a limitation of the selected font
for the console window.

christoph · 2010-06-19T12:04:59Z

Will this bug be tackled for Python 2.7?

And is there a way to get hold of the access denied error?

Here are my steps to reproduce:

I started the console with "cmd /u /k chcp 65001"

Aktive Codepage: 65001.

C:\Dokumente und Einstellungen\root>set PYTHONIOENCODING=UTF-8

C:\Dokumente und Einstellungen\root>d:

D:\>cd Python31

D:\Python31>python
Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print("\u573a")
场
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 13] Permission denied
>>>
_______________________________________________________________________

I see a rectangle on screen but obviously c&p works.

vstinner · 2010-06-20T09:00:57Z

> Maybe the solution is to use the win32 console API directly...

Yes, it is the best solution because it avoids the horrible mbcs encoding.

About cp65001: it is not *exactly* the same encoding as utf-8 and so it cannot be used as an alias to utf-8: see issue bpo-6058.

BreamoreBoy · 2010-09-18T15:39:36Z

@Brian/Tim what's your take on this?

vstinner · 2010-11-04T15:09:59Z

I wrote a small function to call WriteConsoleOutputA() and WriteConsoleOutputW() in Python to do some tests. It works correctly, except if I change the code page using the chcp command. It looks like the problem is that the chcp command changes the console code page and the ANSI code page, but it should only change the ANSI code page (and not the console code page).

chcp command
============

The chcp command changes the console code page, but in practice, the console still expects the OEM code page (eg. cp850 on my french setup). Example:

C:\...> python.exe -c "import sys; print(sys.stdout.encoding)"
cp850
C:\...> chcp 65001
C:\...> python.exe
Fatal Python error: Py_Initialize: can't initialize sys standard streams
LookupError: unknown encoding: cp65001
C:\...> SET PYTHONIOENCODING=utf-8
C:\...> python.exe
>>> import sys
>>> sys.stdout.write("\xe9\n")
Ã©
2
>>> sys.stdout.buffer.write("\xe9\n".encode("utf8"))
Ã©
3
>>> sys.stdout.buffer.write("\xe9\n".encode("cp850"))
é
2

os.device_encoding(1) uses GetConsoleOutputCP() which gives 65001. It should maybe use GetOEMCP() instead? Or chcp command should be fixed?

Set the console code page looks to be a bad idea, because if I type "é" using my keyboard, a random character (eg. U+0002) is displayed instead...
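The two-character output for "é" in the session above is what results when UTF-8 bytes are interpreted with a single-byte code page. The sketch below reproduces that with the standard codecs only; the helper name `misread` is made up, and using cp1252 for the display side is an assumption chosen because it reproduces the characters seen in the transcript.

```python
def misread(text, written_as, shown_as):
    """Encode text with one codec and decode the bytes with another."""
    return text.encode(written_as).decode(shown_as, errors="replace")


mojibake = misread("\xe9", "utf-8", "cp1252")   # two characters instead of one
round_trip = misread("\xe9", "cp850", "cp850")  # matching codecs: unchanged
print(mojibake, round_trip, "\xe9".encode("cp850"))
```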

WriteConsoleOutputA() and WriteConsoleOutputW()
===============================================

Without touching the code page
------------------------------

If the character can be rendered by the current font (eg. U+00E9): WriteConsoleOutputA() and WriteConsoleOutputW() work correctly.

If the character cannot be rendered by the current font, but there is a replacement character (eg. U+0141 replaced by U+0041): WriteConsoleOutputA() cannot be used (U+0141 cannot be encoded to the code page), WriteConsoleOutputW() writes U+0141 but the console contains U+0041 (I checked using ReadConsoleOutputW()) and U+0041 is displayed. It works like the mbcs encoding, the behaviour looks correct.

If the character cannot be rendered by the current font, and there is no replacement character (eg. U+042D): WriteConsoleOutputA() cannot be used (U+042D cannot be encoded to the code page), WriteConsoleOutputW() writes U+042D but U+003d (?) is displayed instead. The behaviour looks correct.

chcp 65001
----------

Using "chcp 65001" command (+ "set PYTHONIOENCODING=utf-8" to avoid the fatal error), it becomes worse: the result depends on the font...

Using raster font:

(ANSI) write "\xe9".encode("cp850") using WriteConsoleOutputA() displays U+00e9 (é), whereas the console output code page is cp65001 (I checked using GetConsoleOutputCP())
(ANSI) write "\xe9".encode("utf-8") using WriteConsoleOutputA() displays Ã© (mojibake!)
(UNICODE) write "\xe9" using WriteConsoleOutputW() displays... a random character (U+0002, U+0008, U+0069, U+00b0, ...)

Using Lucida (TrueType font):

(ANSI) write "\xe9".encode("cp850") using WriteConsoleOutputA() displays U+0000 !?
(UNICODE) write "\xe9" using WriteConsoleOutputW() works correctly (display U+00e9), even with "\u0141", it works correctly (display U+0141)
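The code pages involved in these tests can be inspected from Python with ctypes. This is a sketch under the assumption that it runs on Windows; elsewhere the hypothetical helper `console_code_pages` simply returns None. GetConsoleOutputCP, GetConsoleCP, GetOEMCP and GetACP are kernel32 functions that take no arguments.

```python
import sys


def console_code_pages():
    """Return the active code pages on Windows, or None on other platforms."""
    if sys.platform != "win32":
        return None
    import ctypes

    kernel32 = ctypes.windll.kernel32
    return {
        "console_output": kernel32.GetConsoleOutputCP(),  # changed by chcp
        "console_input": kernel32.GetConsoleCP(),
        "oem": kernel32.GetOEMCP(),
        "ansi": kernel32.GetACP(),
    }


print(console_code_pages())
```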

vstinner · 2010-11-04T15:15:00Z

sys_write_stdout.patch: Create sys.write_stdout() test function to call WriteConsoleOutputA() or WriteConsoleOutputW() depending on the input types (bytes or str).

tzot · 2010-11-04T15:22:03Z

http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx

If you want any kind of Unicode output in the console, the font must be an “official” MS console TTF (“official” as defined by the Windows version); I believe only Lucida Console and Consolas are the ones with all MS private settings turned on inside the font file.

vstinner · 2010-11-08T01:26:27Z

I don't understand exactly the goal of this issue. Different people described various bugs of the Windows console, but I don't see any problem with Python here. It looks like it's just not possible to correctly display Unicode with the Windows console (the whole Unicode charset, not the current code page subset).

• 65001 code page: it's not the same encoding as utf-8 and so it cannot be set as an alias to utf-8 (see bpo-6058) => nothing to do, or maybe document the PYTHONIOENCODING=utf-8 workaround... But if you do that, you may get strange errors when writing to stdout or stderr like "IOError: [Errno 13] Permission denied" or "IOError: [Errno 2] No such file or directory" ...
• chcp command sets the console encoding, which is stupid because the console still expects text encoded to the previous code page => Windows (chcp command) bug, chcp command should not be used (it doesn't solve any problem, it just makes the situation worse)
• use the console API instead of read()/write() to fix this issue: it doesn't work, the console is completely buggy (msg120414) => Windows (console) bug
• using the "Lucida Console" font avoids some issues => I don't think that the Python interpreter should configure the console (using SetCurrentConsoleFontEx?), it's not the role of Python

To me, there is nothing to do, and so I close the bug.

If you would like to fix a particular Python bug, open a new specific issue. If you consider that I'm wrong, Python should fix this issue and you know how, please reopen it.

davidsarah · 2011-01-09T05:32:01Z

It is certainly possible to write Unicode to the console successfully using WriteConsoleW. This works regardless of the console code page, including 65001. The code here (http://tahoe-lafs.org/trac/tahoe-lafs/browser/src/allmydata/windows/fixups.py) does so (it's for Python 2.x, but you'd be calling WriteConsoleW from C anyway).

WriteConsoleW has one bug that I know of, which is that it fails when writing more than 26608 characters at once (http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1232). That's easy to work around by limiting the amount of data passed in a single call.

Fonts are not Python's problem, but encoding is. It doesn't make sense to fail to output the right characters just because some users might not have selected fonts that can display those characters. This bug should be reopened.

(For completeness, it is possible to display Unicode on the console using fonts other than Lucida Console and Consolas, but it requires a registry hack: http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/3259271#3259271.)
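A rough ctypes sketch of the chunked WriteConsoleW approach follows. It is not the tahoe-lafs code: the helper names and the 10000-character cap are assumptions chosen to stay well under the limit mentioned above, the Windows branch assumes each call writes its whole chunk, and on other platforms `write_console` just returns False.

```python
import sys

MAX_CHARS_PER_CALL = 10000  # illustrative cap, well under the reported limit


def chunks(text, size=MAX_CHARS_PER_CALL):
    """Yield successive slices of text, each at most `size` code points long."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def write_console(text):
    """Write text with WriteConsoleW; return False if that is not possible."""
    if sys.platform != "win32":
        return False
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID,
    ]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    handle = kernel32.GetStdHandle(wintypes.DWORD(-11))  # STD_OUTPUT_HANDLE
    for chunk in chunks(text):
        # WriteConsoleW counts UTF-16 code units, not code points.
        units = len(chunk.encode("utf-16-le")) // 2
        written = wintypes.DWORD(0)
        if not kernel32.WriteConsoleW(handle, chunk, units,
                                      ctypes.byref(written), None):
            return False  # e.g. stdout is redirected to a file or pipe
    return True
```

Slicing a Python 3 str by code points never splits a surrogate pair, and 10000 code points are at most 20000 UTF-16 code units, so each call stays below the reported threshold.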

vpython · 2011-01-09T06:52:49Z

Interesting!

I was able to tweak David-Sarah's code to work with Python 3.x, mostly doing things that 2to3 would probably do: changing unicode() to str(), dropping u from u'...', etc.

I skipped the unmangling of command-line arguments, because it produced an error I didn't understand, about needing a buffer protocol. But I'll attach David-Sarah's code + tweaks + a test case showing output of the Cyrillic alphabet to a console with code page 437 (at least, on my Win7-64 box, that is what it is).

Nice work, David-Sarah. I'm quite sure this is not in a form usable inside Python 3, but it shows exactly what could be done inside Python 3 to make things work... and gives us a workaround if Python 3 is not fixed.

davidsarah · 2011-01-09T07:28:46Z

Glenn Linderman wrote:

I skipped the unmangling of command-line arguments, because it produced an error I didn't understand, about needing a buffer protocol.

If I understand correctly, that part isn't needed on Python 3 because bpo-2128 is already fixed there.

vstinner · 2011-01-09T09:03:08Z

It is certainly possible to write Unicode to the console
successfully using WriteConsoleW

Did you try with characters not encodable to the code page and with characters that cannot be rendered by the font?

See msg120414 for my tests with WriteConsoleOutputW.

davidsarah · 2011-01-09T19:23:56Z

haypo wrote:

davidsarah wrote:
> It is certainly possible to write Unicode to the console
> successfully using WriteConsoleW

Did you try with characters not encodable to the code page and with characters that cannot be rendered by the font?

Yes, characters not encodable to the code page do work (as confirmed by Glenn Linderman, since code page 437 does not include Cyrillic).

Characters that cannot be rendered by the font print as missing-glyph boxes, as expected. They don't cause any other problem, and they can be cut-and-pasted to other Unicode-aware applications, showing up as the original characters.

See msg120414 for my tests with WriteConsoleOutputW

Even if it handled encoding correctly, WriteConsoleOutputW (http://msdn.microsoft.com/en-us/library/ms687404%28v=vs.85%29.aspx) would not be the right API to use in any case, because it prints to a rectangle of characters without scrolling. WriteConsoleW does scroll in the same way that printing to a console output stream normally would. (Redirection to a non-console stream can be detected and handled differently, as the code in unicode2.py does.)
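Detecting redirection can be sketched as follows. The helper name `is_console` is made up and this is not the unicode2.py code; on Windows it relies on GetConsoleMode succeeding only for console handles, and elsewhere it falls back to os.isatty().

```python
import os
import sys


def is_console(fd):
    """Return True if the file descriptor refers to a console or terminal."""
    if sys.platform != "win32":
        return os.isatty(fd)
    import ctypes
    import msvcrt
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetConsoleMode.argtypes = [wintypes.HANDLE,
                                        ctypes.POINTER(wintypes.DWORD)]
    kernel32.GetConsoleMode.restype = wintypes.BOOL
    try:
        handle = msvcrt.get_osfhandle(fd)
    except OSError:
        return False  # invalid descriptor
    mode = wintypes.DWORD()
    # GetConsoleMode fails for files and pipes, which signals redirection.
    return bool(kernel32.GetConsoleMode(handle, ctypes.byref(mode)))
```

A writer could call `is_console(1)` once and choose between the console API and ordinary byte output accordingly.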

Drekin · 2016-08-13T18:40:42Z

Hello Steve, that's great you are working on this!

I've run through your patch and I have the following remarks:

• Since wide chars have two bytes, there may be a problem when someone wants to read or write an odd number of bytes. If the number is > 1, it's ok since the code may read or write fewer bytes, but when the number is exactly 1, the code should maybe raise some exception.

• WriteConsoleW always fails with ERROR_NOT_ENOUGH_MEMORY (8) if we try to write more than a certain number of bytes. For me, the number is something like 41000. Unfortunately, it depends on actual heap usage of the console process. I do len = min(len, 32767) in write. The value chosen comes from bpo-11395.

• If someone types something like ^Zfoo, the standard sys.stdin returns '' -- it ignores everything after EOF if it is the first byte read. I reproduce the behaviour in win_unicode_console to be compatible.

• There may be some issue when someone hits Ctrl-C on input. It seems that in that case, ReadConsoleW fails with ERROR_OPERATION_ABORTED (995) and some signal is asynchronously fired. It may happen that the corresponding KeyboardInterrupt exception occurs later than it should. In my Python/ctypes situation I do an ugly hack – I detect ERROR_OPERATION_ABORTED and in that case I sleep for 0.1 seconds to wait for the exception. I understand that the situation may be different in C.
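The first three remarks can be expressed as small pure-Python rules. This is only a sketch; the function names are hypothetical and do not come from the patch or from win_unicode_console.

```python
MAX_WRITE = 32767  # cap quoted in the remark about WriteConsoleW


def usable_byte_count(requested):
    """Round a byte count down to whole 2-byte UTF-16 code units.

    A request for exactly one byte cannot make progress, so it is rejected.
    """
    if requested == 1:
        raise ValueError("cannot transfer half of a UTF-16 code unit")
    return requested - (requested % 2)


def capped_length(requested):
    """Limit the amount passed to a single console write call."""
    return min(requested, MAX_WRITE)


def apply_eof_rule(line):
    """Return '' when the line starts with Ctrl-Z (U+001A), else the line."""
    return "" if line.startswith("\x1a") else line
```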

vadmium · 2016-08-14T05:26:53Z

For compatibility, I think it may be good to add custom implementations of the buffer attribute and detach() method to stdin/out. They should be able to at least read and write ASCII bytes. It might be easiest to keep them as the current BufferedReader/Writer objects. Probably also make stdin/out.fileno() defer to the buffer attribute.

With the current patch that only allows reading and writing in UTF-16 pairs, I foresee a few problems:

I assume stdin.buffer.raw.readline() will try to read one byte at a time, and will therefore always indicate EOF.
Incompatibility with using stdin/out.buffer for ASCII character input and output. I suggest testing the patch with “python -m base64”, a use case mentioned earlier in this thread.

Drekin · 2016-08-14T10:31:17Z

There is also the following consequence of (not) having the standard filenos: input() either considers the streams interactive or not. To consider them interactive, standard filenos and isatty are needed on sys.stdin and sys.stdout.

If the streams are considered interactive, input() goes via readlinehook machinery, otherwise it just writes and reads an ordinary file.

The latter means we don't have to touch readline machinery now, the downside is that custom rlcompleters like pyreadline won't work on input().
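The interactivity condition described here can be approximated in Python. This is a sketch of the idea only, not CPython's actual check; `looks_interactive` is a hypothetical name.

```python
import io
import sys


def looks_interactive(stdin, stdout):
    """Approximate the condition described above: the streams must expose
    the standard file descriptors 0 and 1 and both must be terminals."""
    try:
        return (stdin.fileno() == 0 and stdout.fileno() == 1
                and stdin.isatty() and stdout.isatty())
    except (AttributeError, OSError, ValueError):
        # Replacement streams often have no usable fileno().
        return False


print(looks_interactive(sys.stdin, sys.stdout))
```

A replacement stream without the standard fileno() therefore takes the ordinary file path, which is the trade-off discussed above.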

zooba · 2016-08-14T15:25:08Z

The current patch actually only affects the raw IO, so the concern would be one of the wrappers trying to work in bytes when it should be dealing in characters. This should be no different from reading a UTF16 file, so either both work or both are broken.

The readline API is most annoying because it assumes strlen is valid for any encoded text (and at so many places it's near unfixable), but there's another issue for this part.

Also, I don't have answers for most of the questions in the review on the patch because I copied all of those bits from fileio.c. Can certainly clean parts of them up for the console API, but I count compatibility with the FileIO class a useful goal where possible.

zooba · 2016-08-15T04:46:35Z

I'm fairly happy with where my current patch is at (not posted right now - too many different machines involved) and only one test is failing - test_cgi.

The problem seems to be that urllib.parse.unquote() takes an encoding parameter to decode utf-8 encoded bytes with, and cgi infers this parameter from sys.stdin. I don't have the slightest idea why unquote/unquote_to_bytes unconditionally encodes with utf-8 and then allows decoding with an arbitrary encoding, but I guess it works okay for ASCII-compatible encodings?

Unfortunately, utf-16-le is not ASCII compatible, and so this doesn't work. I'm not familiar enough with cgi or urllib.parse to know what to fix - any suggestions?

zooba · 2016-08-15T04:52:28Z

For more info here, cgi.parse has code like this:

def parse(fp, ...):
    if fp is None:
        fp = sys.stdin

    encoding = getattr(fp, 'encoding', 'latin-1')

# later on...

return urllib.parse.parse_qs(a_str, encoding=encoding, ...)

As an easy hack, I added this after assigning encoding:

    if len(' '.encode(encoding, errors='replace')) > 1:
        encoding = 'latin-1'

I have no idea if this is a good idea or not. The current behaviour of mojibake in the parsed result is certainly worse, since the choice of utf-16-le is entirely contained within the parse() function.
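As a standalone illustration of that hack, the check can be wrapped in a helper. The name `safe_query_encoding` is made up; the test itself is the one quoted above. A space encodes to a single byte in ASCII-compatible encodings and to two bytes in utf-16-le.

```python
def safe_query_encoding(encoding, fallback="latin-1"):
    """Return `encoding` unless a space takes more than one byte in it."""
    if len(" ".encode(encoding, errors="replace")) > 1:
        return fallback
    return encoding


print(safe_query_encoding("utf-16-le"), safe_query_encoding("utf-8"))
```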

vadmium · 2016-08-15T06:32:00Z

I think this CGI thing is a separate bug, just exacerbated by the stdin.encoding problem. :) The urllib.parse.parse_qs() function takes an encoding parameter to figure out what to do with percent-encoded values: "%A9" → b"\xA9".decode(...). This is different from the lower-level encoding: b"%A9".decode("ascii").

Maybe the best solution is just to remove the encoding argument, and let it revert to UTF-8, as it did before r87998. Or maybe it really should use the locale encoding. (Is that ASCII-compatible on Windows?) It really depends on where the query string was generated (in a browser, pre-computed URL, etc).
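The role of the encoding parameter can be seen directly with urllib.parse.parse_qs(). In this short sketch the query strings and the helper name `decode_query` are made up for illustration; parse_qs() defaults to UTF-8 with errors='replace'.

```python
from urllib.parse import parse_qs


def decode_query(query, encoding="utf-8"):
    """Parse a query string, decoding percent-escapes with `encoding`."""
    return parse_qs(query, encoding=encoding)


print(decode_query("name=%A9", encoding="latin-1"))  # byte A9 read as latin-1
print(decode_query("name=%C3%A9"))                   # bytes C3 A9 read as UTF-8
print(decode_query("name=%A9"))                      # lone A9 is invalid UTF-8
```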

zooba · 2016-08-31T04:28:28Z

New patch attached (1602_2.patch - hopefully the review will work this time too).

I discovered while researching for the PEP that a decent amount of code expects to be able to write ASCII to sys.stdout.buffer (or sys.stdout.buffer.raw). As my first patch required utf-16-le at this point, it was going to cause havoc.

Rather than break that compatibility, I decided that exposing utf-8 and doing the reencoding at the latest possible stage was better. This is also more consistent with how other encoding issues are likely to be resolved, and shouldn't be any less performant, given that previously we were decoding to utf-16 anyway.

The downside of this is that read(n) now can only read up to n/4 characters, and write(n) has a much more complicated time dealing with large buffers (as we need to cap the number of utf-16-le bytes but return the number of utf-8 bytes - it's not a direct relationship, so there's more work and a little bit of guessing in some cases).

On the upside, the readline handling is simpler as utf-8 is compatible with the existing interface and now sys.stdin.encoding is accurate. I've rolled that fix into this patch (just the myreadline.c change) as they really ought to go in together.
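The bookkeeping on the write side can be illustrated in pure Python. This is a sketch of the idea, not the C implementation in the patch; `utf8_bytes_consumed` is a hypothetical helper that assumes its input is valid, complete UTF-8.

```python
def utf8_bytes_consumed(data, max_utf16_units):
    """Return how many UTF-8 bytes of `data` fit within a cap that is
    expressed in UTF-16 code units."""
    units = 0
    consumed = 0
    for ch in data.decode("utf-8"):
        width = 2 if ord(ch) > 0xFFFF else 1  # non-BMP needs a surrogate pair
        if units + width > max_utf16_units:
            break
        units += width
        consumed += len(ch.encode("utf-8"))
    return consumed


sample = "a\xe9\U0001F600b".encode("utf-8")  # 1 + 2 + 4 + 1 = 8 bytes
print([utf8_bytes_consumed(sample, cap) for cap in (2, 3, 4, 100)])
```

A single character can take up to four UTF-8 bytes, which is why a read into an n-byte buffer can only promise n/4 characters.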

zooba · 2016-09-05T22:24:11Z

Updated patch. This implements everything we've been discussing on python-dev

zooba · 2016-09-06T23:50:32Z

Latest patch is attached.

PEP acceptance is sounding likely, so feel free to critically review.

zooba · 2016-09-07T20:49:15Z

Updated patch based on some suggestions from Eryk.

The PEP has been accepted, so now I just need to land it in the next two days.

Currently "normal" usage here is fine, and some edge cases match the Python 3.5 behaviour. I'm going to go through now and bulk out the tests to try and catch more problems, but modulo that I hope the implementation is nearly ready.

zooba · 2016-09-07T23:08:17Z

I can't actually come up with many useful tests for this... so far I can validate that we can open the console IO object from 0, 1, 2, "CON", "CONIN$" and "CONOUT$", get fileno(), check readable()/writable() and close (multiple times without crashing).

Anything else requires a real console with a real person with a real keyboard.

But I fixed a couple of issues in fd handling as a result of the tests, so it's not a complete waste.

berkerpeksag · 2016-09-07T23:35:06Z

I left some minor comments for Doc/whatsnew/3.6.rst on Rietveld.

In Lib/test/test_winconsoleio.py:

self.assert_() (deprecated) can be replaced by self.assertTrue()
We can add

  if __name__ == '__main__':
      unittest.main()

zooba · 2016-09-08T01:03:35Z

Thanks! I've made the changes you suggested.

vadmium · 2016-09-08T12:10:06Z

+++ b/Lib/test/test_winconsoleio.py
+to real people with real keyborads.
Should be keyboards
There are still assert_() calls in this file (1602_6.patch). Did you miss them?

+++ b/Lib/io.py
+from _io import WindowsConsoleIO
+__all__.append('WindowsConsoleIO')
I think you should either document this class, or remove it from __all__ to clarify it is just an implementation detail.

+++ b/Modules/_io/winconsoleio.c
+_io_WindowsConsoleIO___init___impl
+ PyObject *decodedname = Py_None;
+ Py_INCREF(decodedname);
+ int d = PyUnicode_FSDecoder(nameobj, (void*)&decodedname);
Won’t this leak a reference to Py_None?
(Also, I think needless casting like in the last line can mask mistakes that the compiler would otherwise pick up. Imagine if you got the parameters around the wrong way.)

+read_console_w(HANDLE handle, DWORD maxlen, DWORD *readlen) {
+ /* If we didn't read a full buffer that time, don't try
+ again or we will block a second time. */
I’m not familiar with the Windows APIs involved, but this doesn’t seem robust. What if there were exactly one full buffer waiting, would the next call block without returning anything?

vadmium · 2016-09-08T12:15:14Z

Ah sorry I see Berker’s assert_() comment was _after_ you posted 1602_6.patch, so ignore that bit :)

vadmium · 2016-09-08T12:18:05Z

Also as I understand it, the open() function can return this new class, so the documentation at <https://docs.python.org/3.6/library/functions.html#open> needs updating.

python-dev · 2016-09-08T21:15:29Z

New changeset 6142d2d3c471 by Steve Dower in branch 'default':
Issue bpo-1602: Windows console doesn't input or print Unicode (PEP-528)
https://hg.python.org/cpython/rev/6142d2d3c471

eryksun · 2016-09-09T18:00:28Z

Martin, the console should be in line-input mode, in which case ReadConsole will block if there isn't at least one line in the input buffer. It reads up to the lesser of a complete line or the number of UTF-16 codes requested. If the previous call read the entire request size but didn't stop on '\n', then we know the next call shouldn't block because the input buffer has at least one '\n' in it.

I can validate that we can open the console IO object from
0, 1, 2, "CON", "CONIN$" and "CONOUT$", get fileno(), check
readable()/writable() and close (multiple times without
crashing).

I like the idea to have fileno() lazily get a file descriptor on demand, but _open_osfhandle is a low I/O function that uses _open flags -- not 'rb' (int 0x7262) or 'wb' (int 0x7762). ;-)

You can use _O_RDONLY | _O_BINARY or _O_WRONLY | _O_BINARY. But really those values would be ignored anyway. It's not actually opening the file, so it only cares about a few flags. Specifically, in lowio\osfinfo.cpp I see that it looks for _O_APPEND, _O_TEXT, and _O_NOINHERIT.

On line 329, the following assignment

    if (self->writable)
        access |= GENERIC_WRITE;

should be access = GENERIC_WRITE. Requesting both read and write access is an invalid parameter when opening "CON", as can be seen here:

    >>> f = open('CON', 'wb', buffering=0)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OSError: [WinError 87] The parameter is incorrect: 'CON'

CONOUT$ works, of course:

    >>> f = open('CONOUT$', 'wb', buffering=0)
    >>> f
    <_io._WindowsConsoleIO mode='wb' closefd=True>

Lastly, for a readall that starts with ^Z, you're still breaking out of the loop before incrementing len, which is thus 0 when subsequently checked. It ends up calling WideCharToMultiByte with len == 0, which fails.

    >>> sys.stdin.buffer.raw.read()
    ^Z
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    OSError: [WinError 87] The parameter is incorrect

I can't actually come up with many useful tests for this...

ctypes can be used to write to the input buffer and read from a screen buffer. For the latter it helps to first create and activate a scratch screen buffer, initialized to NULs to make it easy to read back everything that was written up to the current cursor position. I have existing ctypes code for this, written to solve the problem of a subprocess that stubbornly writes directly to the console instead of writing to stdout/stderr pipes.
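
As a rough sketch of the input side (this is not the existing ctypes code mentioned above), the records for a piece of text can be built as follows. Only the key-event member of the `INPUT_RECORD` union is modelled, `key_records` is a hypothetical helper, and the actual Windows call that would consume the array is deliberately left out so that the snippet runs on any platform.

```python
import ctypes

KEY_EVENT = 0x0001  # INPUT_RECORD.EventType value for keyboard events


class KEY_EVENT_RECORD(ctypes.Structure):
    _fields_ = [
        ("bKeyDown", ctypes.c_int),              # BOOL
        ("wRepeatCount", ctypes.c_ushort),       # WORD
        ("wVirtualKeyCode", ctypes.c_ushort),    # WORD
        ("wVirtualScanCode", ctypes.c_ushort),   # WORD
        ("UnicodeChar", ctypes.c_ushort),        # one UTF-16 code unit
        ("dwControlKeyState", ctypes.c_uint32),  # DWORD
    ]


class INPUT_RECORD(ctypes.Structure):
    # Only the key-event member of the Event union is modelled; the other
    # members are no larger, so the overall size is unchanged.
    _fields_ = [("EventType", ctypes.c_ushort), ("KeyEvent", KEY_EVENT_RECORD)]


def key_records(text: str):
    """Build a key-down and key-up record for each UTF-16 code unit."""
    data = text.encode("utf-16-le")
    units = [int.from_bytes(data[i:i + 2], "little")
             for i in range(0, len(data), 2)]
    records = (INPUT_RECORD * (2 * len(units)))()
    for i, unit in enumerate(units):
        for j, down in enumerate((1, 0)):
            rec = records[2 * i + j]   # refers to the array's own memory
            rec.EventType = KEY_EVENT
            rec.KeyEvent.bKeyDown = down
            rec.KeyEvent.wRepeatCount = 1
            rec.KeyEvent.UnicodeChar = unit
    return records
```

On Windows, an array like this is what would be handed to the console API that writes to the input buffer. Characters outside the BMP become two code units and therefore four records.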

vadmium · 2016-09-10T00:30:01Z

Okay so regarding blocking reads with a full buffer, what you are saying is the second check to break the read loop should be sufficient:

+/* If the buffer ended with a newline, break out */
+if (buf[*readlen - 1] == '\n')
+ break;
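
To make the reasoning concrete, here is a toy Python model of line-input behaviour as described earlier in the thread. `FakeLineConsole` and `read_line` are hypothetical names, and the model raises where a real console read would block.

```python
class FakeLineConsole:
    """Toy model of a console input buffer in line-input mode."""

    def __init__(self, pending: str):
        self.pending = pending

    def read(self, n: int) -> str:
        # A real console read would block here; the model raises instead.
        if "\n" not in self.pending:
            raise BlockingIOError("no complete line in the input buffer")
        line_len = self.pending.index("\n") + 1
        take = min(n, line_len)       # lesser of a full line or the request
        chunk, self.pending = self.pending[:take], self.pending[take:]
        return chunk


def read_line(console: FakeLineConsole, chunk_size: int) -> str:
    """Read fixed-size chunks until one ends with a newline."""
    parts = []
    while True:
        chunk = console.read(chunk_size)
        parts.append(chunk)
        if chunk.endswith("\n"):      # the check quoted above
            break
    return "".join(parts)
```

When a chunk fills the whole request without ending in a newline, the rest of that line is still buffered, so the next read cannot block. When the chunk ends with the newline, the loop stops before another read is issued.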

davispuh · 2016-09-20T17:11:07Z

Steve Dower (steve.dower)

[...]
Anything else requires a real console with a real person with a real keyboard.

FYI, not really. It is possible to test the console's input and output fully automatically using WinAPI functions such as WriteConsoleInput, GetConsoleScreenBufferInfo and ReadConsoleOutputCharacter.

I wrote such a test very recently; you can look at it as an example: http://review.source.kitware.com/gitweb?p=KWSys.git;a=blob;f=testConsoleBuf.cxx;hb=HEAD

It tests all three cases: when the output is an actual console, a redirected pipe, and a file.

zooba · 2016-09-20T17:41:29Z

Oh nice, I like that. We should definitely add some tests using that (though it seems like quite a big task... maybe I'll open a new issue for it).

zooba · 2016-09-20T17:45:07Z

Created bpo-28217 for adding these tests.

The fix_encoding module within depot_tools was included back in the python2[1] days as a be-all encoding fix boilerplate that is called across depot_tools scripts. However, now that depot_tools has officially deprecated support for py2 and supports >= 3.8[2], the boilerplate is not needed anymore.

* `fix_win_codec()`[3]: the 'cp65001' codec issue this fixes is fixed in python 3.3[4].
* `fix_default_encoding()`[5]: python3 defaults to utf8.
* `fix_win_sys_argv()`[6]: the sys.argv unicode issue is fixed in python3[7].
* `fix_win_console()`[8]: fixed[9].

Benchmarking on windows:
* Baseline (http://gpaste/6701096112750592): ~1min 41sec.

[1] https://codereview.chromium.org/6721029
[2] https://crrev.com/371aa997c04791d21e222ed43a1a0d55b450dd53/README.md
[3] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=123-132;drc=cfa826c9845122d445dce4f51f556381865dbed3
[4] python/cpython#57425 (comment)
[5] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=29-66;drc=cfa826c9845122d445dce4f51f556381865dbed3
[6] https://crsrc.org/d/fix_encoding.py;l=73-120;drc=cfa826c9845122d445dce4f51f556381865dbed3
[7] python/cpython#46381 (comment)
[8] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=315-344;drc=cfa826c9845122d445dce4f51f556381865dbed3
[9] python/cpython#45943 (comment)

Bug: 1501984
Change-Id: I1d512a4b1bfe14e680ac0aa08027849b999cc638

mark-summerfield mannequin added the OS-windows, topic-unicode and type-bug labels Dec 12, 2007

vstinner added OS-windows and removed OS-windows labels May 3, 2009

vstinner closed this as completed Nov 8, 2010

vstinner added the invalid label Nov 8, 2010

zooba closed this as completed Sep 9, 2016

ezio-melotti transferred this issue from another repository Apr 10, 2022

vszakats mentioned this issue Nov 2, 2022

broken UTF-8 encoded content terminal output on Windows curl/curl#9841

Closed
