classification
Title: Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB
Type: behavior
Stage: needs patch
Components: IO, Windows
Versions: Python 3.10, Python 3.9, Python 3.8

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: anhans, eryksun, gregory.p.smith, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal
Keywords:

Created on 2019-08-16 00:36 by anhans, last changed 2022-04-11 14:59 by admin.

Messages (5)
msg349837 - Author: ANdy (anhans) Date: 2019-08-16 00:36
# To reproduce:
# Put this text in a file `a.py` and run `py a.py`.
# Or just run: py -c "print(('é' * 40 + '\n') * 473)"
# Scroll up for a while. One of the lines will be:
# éééééééééééééééééééé��ééééééééééééééééééé
# (You can spot this because it's slightly longer than the other lines.)
# The error is consistently on line 237, column 21 (1-indexed).

# The error reproduces on Windows but not Linux. Tested in both powershell and CMD.
# (Failed to reproduce on either a real Linux machine or on Ubuntu with WSL.)
# On Windows, the error reproduces every time consistently.

# There is no error if N = 472 or 474.
N = 473
# There is no error if W = 39 or 41.
# (I tested with console windows of varying sizes, all well over 40 characters.)
W = 40
# There is no error if ch = "e" with no accent.
# There is still an error for other unicode characters like "Ö" or "ü".
ch = "é"
# There is no error without newlines.
s = (ch * W + "\n") * N
# Assert the string itself is correct.
assert all(c in (ch, "\n") for c in s)
print(s)

# There is no error if we use N separate print statements
# instead of printing a single string with N newlines.

# Similar scripts written in Groovy, JS and Ruby have no error.
# Groovy: System.out.println(("é" * 40 + "\n") * 473)
# JS: console.log(("é".repeat(40) + "\n").repeat(473))
# Ruby: puts(("é" * 40 + "\n") * 473)
msg349844 - Author: Eryk Sun (eryksun) (Python triager) Date: 2019-08-16 04:05
To be compatible with Windows 7, _io__WindowsConsoleIO_write_impl in Modules/_io/winconsoleio.c is forced to write to the console in chunks that do not exceed 32 KiB. It does so by repeatedly dividing the length to decode by 2 until the decoded buffer size is small enough. 

    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }

With `('é' * 40 + '\n') * 473`, encoded as UTF-8, we have 473 82-byte lines (note that "\n" has been translated to "\r\n"). This is 38,786 bytes, which is too much for a single write, so it splits it in two.

    >>> 38786 // 2
    19393
    >>> 19393 // 82
    236
    >>> 19393 % 82
    41

This means line 237 ends up with 20 'é' characters (UTF-8 b'\xc3\xa9') and one partial character sequence, b'\xc3'. When this buffer is passed to MultiByteToWideChar to decode from UTF-8 to UTF-16, the partial sequence gets decoded as the replacement character U+FFFD. For the next write, the remaining b'\xa9' byte also gets decoded as U+FFFD.
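
The arithmetic above can be checked off-Windows with a rough simulation. This is only a sketch: Python's `errors="replace"` decoding stands in for MultiByteToWideChar, the names `wide_len` and `MAX_WCHARS` are made up for illustration, and `sizeof(wchar_t)` is assumed to be 2 as on Windows.

```python
# Rough simulation of the halving loop on the reproducer string.
# Assumption: errors="replace" approximates MultiByteToWideChar, which
# substitutes U+FFFD for invalid or truncated UTF-8 sequences.
text = ("é" * 40 + "\n") * 473
data = text.replace("\n", "\r\n").encode("utf-8")  # "\n" is written as "\r\n"

MAX_WCHARS = 32766 // 2  # 32766 / sizeof(wchar_t), with sizeof(wchar_t) == 2


def wide_len(chunk: bytes) -> int:
    """Number of UTF-16 code units the chunk decodes to."""
    decoded = chunk.decode("utf-8", errors="replace")
    return len(decoded.encode("utf-16-le")) // 2


length = len(data)
while wide_len(data[:length]) > MAX_WCHARS:
    length //= 2  # same blind halving as the C loop

first = data[:length].decode("utf-8", errors="replace")
rest = data[length:].decode("utf-8", errors="replace")

print(len(data), length)
print(repr(first[-3:]), repr(rest[:3]))
```

The first chunk ends with a U+FFFD and the remainder starts with one, which matches the two stray characters seen on line 237.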

To avoid this, _io__WindowsConsoleIO_write_impl could decode the whole buffer in one pass, and slice that up into writes that are less than 32 KiB. Or it could ensure that its UTF-8 slices are always at character boundaries.
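
As an illustration of the first option (decode once, then slice), here is a minimal Python sketch. It is not the code in winconsoleio.c; `utf16_chunks` and its `max_units` parameter are hypothetical names, and the only care taken is to count a non-BMP character as two UTF-16 code units so that a surrogate pair is never split across writes.

```python
def utf16_chunks(text: str, max_units: int = 16383):
    """Yield pieces of text that are each at most max_units UTF-16 code units.

    Assumes max_units >= 2 so that a single non-BMP character always fits.
    """
    start = 0
    units = 0
    for i, ch in enumerate(text):
        width = 2 if ord(ch) > 0xFFFF else 1  # surrogate pair needs 2 units
        if units + width > max_units:
            yield text[start:i]
            start, units = i, 0
        units += width
    if start < len(text):
        yield text[start:]


sample = ("é" * 40 + "\r\n") * 473
pieces = list(utf16_chunks(sample))
print(len(pieces), [len(p) for p in pieces])
```

The cost of this approach is that the whole string has to be decoded before anything is written.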
msg349872 - Author: Steve Dower (steve.dower) (Python committer) Date: 2019-08-16 15:39
I'd rather keep encoding incrementally, and reduce the length of each attempt until the last UTF-8 character does not have its top bit set (i.e. is the final character in a multi-byte sequence).

Otherwise the people who like to print >2GB worth of data to the console will complain about the memory error :)
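
A minimal sketch of trimming a candidate length back to a character boundary, written in Python for readability rather than as a patch. `utf8_boundary` is a hypothetical helper, and the buffer is assumed to hold valid UTF-8. Because every byte of a multi-byte UTF-8 sequence has its top bit set, the sketch tests the first excluded byte for the continuation pattern 0b10xxxxxx instead of testing the top bit of the last included byte.

```python
def utf8_boundary(buf: bytes, length: int) -> int:
    """Return the largest n <= length such that buf[:n] does not end
    in the middle of a UTF-8 sequence (buf is assumed to be valid UTF-8)."""
    if length >= len(buf):
        return len(buf)
    # buf[length] is the first byte that would be left for the next write.
    # If it is a continuation byte (0b10xxxxxx), the cut is mid-character.
    while length > 0 and (buf[length] & 0xC0) == 0x80:
        length -= 1
    return length


data = (("é" * 40 + "\r\n") * 473).encode("utf-8")
cut = utf8_boundary(data, len(data) // 2)
print(cut)
data[:cut].decode("utf-8")  # strict decoding succeeds on the trimmed chunk
```

Since a UTF-8 sequence is at most 4 bytes long, the loop backs up by at most 3 bytes, so the incremental approach and its bounded memory use are preserved.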
msg389191 - Author: Gregory P. Smith (gregory.p.smith) (Python committer) Date: 2021-03-20 22:26
Steve's approach makes sense and should be robust.

side note: do we need to care about Windows 7 anymore in 3.10 given that microsoft no longer supports it?
msg389199 - Author: Eryk Sun (eryksun) (Python triager) Date: 2021-03-20 23:35
> side note: do we need to care about Windows 7 anymore in 
> 3.10 given that microsoft no longer supports it?

If the fix comes in time for Python 3.8, then it needs to support Windows 7. For Python 3.9+, the 32 KiB limit can be removed. 

The console documentation still includes the misleading disclaimer about "available heap". This refers to a relatively small block of shared memory (64 KiB IIRC) that's overlaid by a heap, not the default process heap. Shared memory is used by system LPC ports to efficiently pass large messages between a system server (e.g. csrss.exe, conhost.exe) and a client process. The console API used to use an LPC port, but in Windows 8.1+ it uses a driver instead, so none of the "available heap" warnings apply anymore. Microsoft should clarify the docs to stress that the warning is for Windows 7 and earlier.

History
Date                 User             Action  Args
2022-04-11 14:59:19  admin            set     github: 82052
2021-03-20 23:35:16  eryksun          set     messages: + msg389199
2021-03-20 22:26:35  gregory.p.smith  set     nosy: + gregory.p.smith
                                              messages: + msg389191
2021-03-20 14:33:48  vstinner         set     nosy: - vstinner
2021-03-20 07:49:05  eryksun          set     stage: needs patch
                                              versions: + Python 3.8, Python 3.9, Python 3.10, - Python 3.7
2019-08-21 11:09:59  vstinner         set     title: 40 * 473 grid of "é" has a single wrong character on Windows -> Windows: WindowsConsoleIO produces mojibake for strings longer than 32 KiB
2019-08-21 11:09:28  vstinner         set     nosy: + vstinner
2019-08-16 15:39:55  steve.dower      set     messages: + msg349872
2019-08-16 04:05:43  eryksun          set     nosy: + eryksun
                                              messages: + msg349844
                                              components: + IO
2019-08-16 00:36:04  anhans           create