
Author eryksun
Recipients davispuh, eryksun, ezio.melotti, martin.panter, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Date 2016-06-04.16:07:15
Content
>> so ANSI is the natural default for a detached process
>
> To clarify - ANSI is the natural default *for programs that 
> don't support Unicode*.

By natural, I meant in the context of using GetConsoleOutputCP(), since WideCharToMultiByte(0, ...) encodes text as ANSI (codepage 0 is CP_ACP). Clearly UTF-16LE is preferred for IPC on Windows; it's the native Unicode format down to the lowest levels of the kernel. But we're talking about old-school IPC over standard I/O pipelines, for which I think UTF-8 is a better fit.
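
To illustrate that point, here's a minimal ctypes sketch (only standard Win32 calls; GetACP is included just for comparison, and the exact values are assumptions that depend on the system):

    import ctypes
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    # With no attached console, GetConsoleOutputCP typically returns 0, and
    # encoding via WideCharToMultiByte(0, ...) means CP_ACP, i.e. ANSI.
    print(kernel32.GetConsoleOutputCP())  # e.g. 0 for a detached process
    print(kernel32.GetACP())              # the ANSI codepage, e.g. 1252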

> Forcing the use of UTF-8 as the code page is the easiest way 
> for us to support it.

The console's behavior for codepage 65001 is too buggy. The show stopper is that it limits input to ASCII. The console allocates a temporary buffer for the encoded text that's sized assuming one ANSI/OEM byte per UTF-16 code unit, so if you enter non-ASCII characters, WideCharToMultiByte fails in conhost.exe. The console nonetheless reports that the operation succeeded and read 0 bytes, which Python's REPL and input() see as EOF.

For example:

    import sys, ctypes, msvcrt
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    # Open the console input buffer directly and get its Windows handle.
    conin = open(r'\\.\CONIN$', 'r+')
    h = msvcrt.get_osfhandle(conin.fileno())
    buf = (ctypes.c_char * 15)()  # 15-byte read buffer
    n = (ctypes.c_ulong * 1)()    # receives the number of bytes read

    >>> sys.stdin.encoding
    'cp65001'

ReadFile test in Windows 10:

    >>> kernel32.ReadFile(h, buf, 15, n, None)
    Test!
    1
    >>> n[0], buf[:]
    (7, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')

    >>> kernel32.ReadFile(h, buf, 15, n, None)
    ¡Prueba!
    1
    >>> n[0], buf[:]
    (0, b'Test!\r\n\x00\x00\x00\x00\x00\x00\x00\x00')

The second call clearly fails, even though it returns 1. The input contains the non-ASCII character "¡", which requires 2 bytes in UTF-8 (b'\xc2\xa1'). This triggers the WideCharToMultiByte failure in conhost.exe described above.

ReadConsoleA has the same problem:

    >>> kernel32.ReadConsoleA(h, buf, 15, n, None)
    Hello World!
    1
    >>> n[0], buf[:]
    (14, b'Hello World!\r\n\x00')

    >>> kernel32.ReadConsoleA(h, buf, 15, n, None)
    ¡Hola Mundo!
    1
    >>> n[0], buf[:]
    (0, b'Hello World!\r\n\x00')

UTF-8 output is also buggy prior to Windows 8. The problem is that WriteFile returns the number of UTF-16 code units written instead of the number of bytes. For non-ASCII characters in the BMP, one UTF-16 code unit is 2 or 3 UTF-8 bytes, so the call looks like a partial write. A buffered writer then loops to write what it thinks are the remaining bytes, leaving a trail of junk lines in proportion to the number of non-ASCII characters written.

Python could work around this by decoding the buffer to determine the corresponding number of UTF-16 code units written to the console, but child processes may also be subject to this bug. The only general solution on Windows 7 is something like ANSICON, which uses DLL injection to hook and wrap WriteFile and WriteConsoleA.
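
For what it's worth, a rough sketch of that workaround (illustrative names only, not an actual implementation):

    # When the console output codepage is 65001 on Windows 7, WriteFile
    # reports the number of UTF-16 code units written instead of bytes.
    # Decode the UTF-8 buffer to compute that code-unit count; if it matches
    # the reported value, the whole buffer actually reached the console.
    def interpret_console_write(data, reported):
        text = data.decode('utf-8', 'replace')
        # Non-BMP characters take two UTF-16 code units (a surrogate pair).
        utf16_units = sum(2 if ord(ch) > 0xFFFF else 1 for ch in text)
        if reported == utf16_units:
            return len(data)   # the entire buffer was written
        return reported        # fall back to the raw (possibly bogus) count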

There's also a UTF-8-related bug in ulib.dll. It affects programs that do console codepage conversions, such as more.com, which in turn affects Python's interactive help(). I looked at this in issue 19914. The ulib bug is fixed in Windows 10; I don't know whether it's fixed in Windows 8, but it's present in Windows 7 (supported until 2020).

> This would make Python's implementation much more 
> complicated, as well as breaking some scripts and 
> existing packages.

Unless you're talking about major breakage, I think switching to the wide-character API is worth it, as it's the only viable path to supporting Unicode in the console. The implementation should probably transcode between UTF-16LE and UTF-8, so pure Python never sees UTF-16 byte strings and sys.std*.encoding would be 'utf-8'. os.read and os.write would be implemented via _Py_read and _Py_write (which already exist). For console handles, these could delegate to _Py_console_read and _Py_console_write to convert between UTF-8 and UTF-16LE and call ReadConsoleW and WriteConsoleW.
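
As a very rough sketch of the read/write halves (illustrative names, not the actual _Py_console_* functions, and simplified to BMP-only text):

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    def console_write(handle, data):
        # Transcode UTF-8 -> UTF-16LE and write with WriteConsoleW.
        text = data.decode('utf-8')
        written = wintypes.DWORD()
        if not kernel32.WriteConsoleW(handle, text, len(text),
                                      ctypes.byref(written), None):
            raise ctypes.WinError(ctypes.get_last_error())
        # Report progress in bytes of the caller's UTF-8 data (BMP-only
        # simplification: one character per UTF-16 code unit).
        return len(text[:written.value].encode('utf-8'))

    def console_read(handle, size):
        # Read with ReadConsoleW and hand back UTF-8 bytes.
        nchars = max(size // 4, 1)  # UTF-8 needs at most 4 bytes per character
        buf = ctypes.create_unicode_buffer(nchars)
        nread = wintypes.DWORD()
        if not kernel32.ReadConsoleW(handle, buf, nchars,
                                     ctypes.byref(nread), None):
            raise ctypes.WinError(ctypes.get_last_error())
        return buf[:nread.value].encode('utf-8')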