Message 349844 - Python tracker


Author	eryksun
Recipients	anhans, eryksun, paul.moore, steve.dower, tim.golden, zach.ware
Date	2019-08-16.04:05:42
Message-id	<1565928343.08.0.0123288794152.issue37871@roundup.psfhosted.org>
In-reply-to

Content
To be compatible with Windows 7, _io__WindowsConsoleIO_write_impl in Modules/_io/winconsoleio.c is forced to write to the console in chunks that do not exceed 32 KiB. It does so by repeatedly dividing the length to decode by 2 until the decoded buffer size is small enough. 

    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }
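The loop can be modeled in Python to see where it stops for a given buffer. This is only a rough sketch, not the C implementation: `wide_len` and `halved_length` are hypothetical names, `wide_len` stands in for the sizing call to MultiByteToWideChar, and it assumes each malformed byte sequence counts as one replacement character and that `wchar_t` is 2 bytes, as on Windows.

```python
MAX_WCHARS = 32766 // 2  # 32766 / sizeof(wchar_t), assuming a 2-byte wchar_t


def wide_len(buf: bytes) -> int:
    """Hypothetical stand-in for the sizing call: count UTF-16 code units."""
    text = buf.decode("utf-8", errors="replace")
    return len(text.encode("utf-16-le")) // 2


def halved_length(buf: bytes) -> int:
    """Mirror the C loop: halve the byte length until the decoded size fits."""
    length = len(buf)
    wlen = wide_len(buf[:length])
    while wlen > MAX_WCHARS:
        length //= 2
        wlen = wide_len(buf[:length])
    return length


# The payload from this report, with "\n" already translated to "\r\n".
payload = (("é" * 40 + "\r\n") * 473).encode("utf-8")
first_len = halved_length(payload)
last_byte = payload[first_len - 1:first_len]  # final byte of the first write
```

Nothing in the loop looks at which byte the halved length lands on, so `last_byte` can be the lead byte of a multi-byte sequence.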

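With `('é' * 40 + '\n') * 473`, encoded as UTF-8, we have 473 lines of 82 bytes each: 40 two-byte characters plus "\r\n" (note that "\n" has been translated to "\r\n"). This is 38,786 bytes, which decodes to 19,866 UTF-16 code units. That exceeds the 16,383 code units allowed in a single write, so the loop halves the length once.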

    >>> 38786 // 2
    19393
    >>> 19393 // 82
    236
    >>> 19393 % 82
    41

This means the first write ends partway through line 237, with 20 'é' characters (UTF-8 b'\xc3\xa9') and one partial character sequence, b'\xc3'. When this buffer is passed to MultiByteToWideChar to decode from UTF-8 to UTF-16, the partial sequence gets decoded as the replacement character U+FFFD. For the next write, the remaining b'\xa9' byte also gets decoded as U+FFFD.
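The effect of the split can be reproduced outside the console. The sketch below uses Python's UTF-8 decoder with `errors="replace"` as a stand-in for MultiByteToWideChar. That substitution is an assumption, made on the basis that both produce U+FFFD for a truncated sequence and for a lone continuation byte.

```python
payload = (("é" * 40 + "\r\n") * 473).encode("utf-8")
split = len(payload) // 2  # 19393, the length after one halving

first, rest = payload[:split], payload[split:]

# The split falls between the two bytes of a single 'é'.
tail_byte = first[-1:]   # lead byte left at the end of the first write
head_byte = rest[:1]     # continuation byte that starts the second write

# errors="replace" substitutes U+FFFD for each malformed sequence.
first_text = first.decode("utf-8", errors="replace")
rest_text = rest.decode("utf-8", errors="replace")

# One character has turned into two replacement characters.
damaged = first_text[-1] + rest_text[0]
```

Decoding the whole payload in one pass produces no replacement characters, so the damage comes only from where the buffer was cut.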

To avoid this, _io__WindowsConsoleIO_write_impl could decode the whole buffer in one pass, and slice that up into writes that are less than 32 KiB. Or it could ensure that its UTF-8 slices are always at character boundaries.
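The second option can be sketched as follows. This is an illustration under stated assumptions, not a proposed patch: `utf8_boundary` and `chunks` are hypothetical helpers, and the limit is expressed in bytes rather than wide characters. A byte limit is conservative here, because well-formed UTF-8 never decodes to more UTF-16 code units than it has bytes.

```python
def utf8_boundary(buf: bytes, limit: int) -> int:
    """Largest cut point <= limit that does not split a UTF-8 sequence."""
    cut = min(limit, len(buf))
    # Continuation bytes look like 0b10xxxxxx. If the byte just after the cut
    # is one, back up until the cut sits in front of a lead byte.
    while 0 < cut < len(buf) and (buf[cut] & 0xC0) == 0x80:
        cut -= 1
    return cut


def chunks(buf: bytes, limit: int):
    """Yield slices of at most `limit` bytes that end on character boundaries."""
    start = 0
    while start < len(buf):
        remaining = buf[start:]
        cut = utf8_boundary(remaining, limit)
        if cut == 0:
            # Malformed input (a long run of continuation bytes): cut at the
            # limit anyway so the loop always makes progress.
            cut = min(limit, len(remaining))
        yield remaining[:cut]
        start += cut


payload = (("é" * 40 + "\r\n") * 473).encode("utf-8")
parts = list(chunks(payload, 19393))
```

With the payload from this report and the same 19,393-byte limit, the first slice is one byte shorter than before, so every slice decodes without errors.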
