Message 260173 - Python tracker

Author	eryksun
Recipients	Egor Tensin, eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Date	2016-02-12.10:42:02
Message-id	<1455273723.78.0.312972739479.issue26345@psf.upfronthosting.co.za>

Content

This is a third-party problem caused by bugs in the console's support for codepage 65001. For the general problem of Unicode in the console, see issue 1602. The best way to resolve this problem is to use the wide-character APIs, WriteConsoleW and ReadConsoleW. I suggest that you try the win_unicode_console package.
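
As a rough illustration of what the wide-character route looks like, here is a minimal ctypes sketch that calls WriteConsoleW directly. The helper name write_console_w is made up for this example, and this is not how win_unicode_console is implemented; it only shows the shape of the call. It returns None when not on Windows or when stdout is not a console.

```python
import sys

STD_OUTPUT_HANDLE = -11  # documented value for the standard output handle


def write_console_w(text):
    """Write text via WriteConsoleW; return UTF-16 code units written, or None."""
    if sys.platform != 'win32':
        return None  # the console API only exists on Windows
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    # Mask the constant so it is passed as an unsigned DWORD.
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE & 0xFFFFFFFF)
    # The length argument counts UTF-16 code units, not bytes or code points.
    units = len(text.encode('utf-16-le')) // 2
    written = wintypes.DWORD(0)
    ok = kernel32.WriteConsoleW(handle, text, units,
                                ctypes.byref(written), None)
    if not ok:
        return None  # e.g. stdout is redirected to a file or pipe
    return written.value


result = write_console_w('\u0391\n')
```

Because the text never passes through a codepage, the console's codepage 65001 bugs don't come into play on this path.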

> But if I try to print something a little less common
> (GREEK CAPITAL LETTER ALPHA), something weird happens:
>
>    >python -c "print(chr(0x391))"
>    Α
>
>
>    >

In versions of Windows that use the legacy console, WriteFile to a console screen buffer mistakenly returns the number of UTF-16 code units written instead of the number of bytes written.

For example, '\u0391\r\n' gets encoded as a four-byte buffer, b'\xce\x91\r\n'. Here's the result of writing this buffer to the legacy console, using codepage 65001:

    >>> sys.stdout.buffer.raw.write(b'\xce\x91\r\n')
    Α
    3

Four bytes were written, but the console reports that it wrote three UTF-16 code units. Python's BufferedWriter (i.e. sys.stdout.buffer) sees this as an incomplete write, so it writes the last byte again. That's why you see an extra newline. The problem can be far worse if the UTF-8 buffer contains many non-ASCII characters, especially if it includes code points above U+07FF, which are encoded as three or four bytes.
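
The retry can be reproduced without a legacy console. The sketch below uses a made-up raw stream, LegacyConsoleRaw, that accepts every byte but reports the UTF-16 code unit count, which is my stand-in for the buggy WriteFile behaviour rather than anything taken from the console itself.

```python
import io


class LegacyConsoleRaw(io.RawIOBase):
    """Hypothetical stand-in for the legacy console under codepage 65001."""

    def __init__(self):
        super().__init__()
        self.received = []  # every buffer handed to write(), in order

    def writable(self):
        return True

    def write(self, b):
        data = bytes(b)  # copy, since the caller's buffer is released later
        self.received.append(data)
        # The bug: all bytes are consumed, but the returned count is the
        # number of UTF-16 code units rather than the number of bytes.
        text = data.decode('utf-8', errors='replace')
        return len(text.encode('utf-16-le')) // 2


raw = LegacyConsoleRaw()
buffered = io.BufferedWriter(raw)
buffered.write('\u0391\r\n'.encode('utf-8'))  # b'\xce\x91\r\n', four bytes
buffered.flush()

print(raw.received)  # [b'\xce\x91\r\n', b'\n']
```

The second entry is the trailing byte that BufferedWriter believes was never written, which is where the extra newline comes from.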

This particular problem is fixed in the new version of the console that comes with Windows 10. For the legacy console, you can work around the problem by hooking WriteConsoleA and WriteFile via DLL injection. For example, ANSICON and ConEmu do this.

That said, there's a far worse problem with using codepage 65001 in the console, which still exists in Windows 10. Due to this bug, Python's interactive REPL will quit whenever you try to enter non-ASCII characters, and built-in input() will raise EOFError. For example:

    >>> input()
    Ü
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    EOFError

To read the console's wide-character (UTF-16) input buffer via ReadFile, the console first has to encode it to the current codepage. It does the conversion via WideCharToMultiByte with a buffer size that assumes each UTF-16 code unit will be encoded as a single byte. But that's wrong for UTF-8, in which one UTF-16 code unit can map to as many as three bytes. So WideCharToMultiByte fails, but does the console try to increase the buffer size? No. Does it fail the call? No. It actually reports that it 'successfully' read 0 bytes. To the REPL and built-in input() that signals EOF (end of file).
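
The size mismatch is easy to check in plain Python. The helper names below are made up for illustration, and the check only models the one-byte-per-code-unit assumption described above, not the console's actual code.

```python
import io


def utf16_units(text):
    """Number of UTF-16 code units needed for text."""
    return len(text.encode('utf-16-le')) // 2


def fits_one_byte_per_unit(text, encoding):
    """True if the encoded text fits a buffer of one byte per code unit."""
    return len(text.encode(encoding)) <= utf16_units(text)


line = '\u00dc\r\n'  # the 'Ü' line from the example, three code units

print(fits_one_byte_per_unit(line, 'cp1252'))  # True: 3 bytes
print(fits_one_byte_per_unit(line, 'utf-8'))   # False: 4 bytes

# A zero-byte read is indistinguishable from end of file for a reader.
empty_read = io.TextIOWrapper(io.BytesIO(b''), encoding='utf-8').readline()
print(repr(empty_read))  # ''
```

An empty readline() result is what the REPL and input() treat as EOF.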

If you only need to input text in your system locale, you can try to have the best of both worlds. Use chcp.com to set the command prompt to the codepage you need for input. Then in your Python script (e.g. in sitecustomize.py) you can use ctypes to change just the output codepage and rebind sys.stdout. For example:

    >>> import os, sys, ctypes
    >>> ctypes.WinDLL('kernel32').SetConsoleOutputCP(65001)
    1
    >>> sys.stdout = open(os.dup(sys.__stdout__.fileno()), 'w', encoding='cp65001')

    >>> sys.stdin.encoding
    'cp1252'
    >>> input()
    Ü
    'Ü'
    >>> print('\u0391')
    Α
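
If you put this in sitecustomize.py, it may be worth guarding it so that it does nothing on other platforms or when there is no stdout. The following is one possible sketch of the same steps; the function name use_utf8_console_output is hypothetical.

```python
import os
import sys


def use_utf8_console_output():
    """Switch only the console output codepage to 65001 and rebind stdout.

    Returns True if sys.stdout was rebound, False otherwise.
    """
    if sys.platform != 'win32' or sys.__stdout__ is None:
        return False  # not Windows, or no stdout (e.g. pythonw.exe)
    import ctypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    if not kernel32.SetConsoleOutputCP(65001):
        return False  # the call failed, e.g. no console is attached
    # Leave sys.stdin alone so input keeps using the codepage set by chcp.com.
    sys.stdout = open(os.dup(sys.__stdout__.fileno()), 'w',
                      encoding='cp65001')
    return True


rebound = use_utf8_console_output()
```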

Another minor bug is that the console doesn't keep an overlapping window in case a UTF-8 sequence gets split across multiple writes (typically due to buffering). For example:

    >>> exec(r'''
    ... sys.stdout.buffer.raw.write(b'\xce')
    ... sys.stdout.buffer.raw.write(b'\x91')
    ... ''')
    ��>>>

Since UTF-8 uses up to four bytes per code point, the console would have to keep a three-byte buffer to handle the case of a split write.
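
For comparison, Python's incremental UTF-8 decoder keeps exactly this kind of pending state between calls. The sketch below contrasts it with decoding each write on its own, which is roughly what the console appears to do.

```python
import codecs

chunks = [b'\xce', b'\x91']  # U+0391 split across two writes

# Stateless: each chunk is decoded in isolation, so neither is valid UTF-8.
stateless = [chunk.decode('utf-8', errors='replace') for chunk in chunks]
print(stateless)  # ['\ufffd', '\ufffd']

# Stateful: the decoder holds the incomplete lead byte until the rest arrives.
decoder = codecs.getincrementaldecoder('utf-8')()
stateful = [decoder.decode(chunk) for chunk in chunks]
print(stateful)  # ['', '\u0391']
```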

> Look, guys, I know what a mess Unicode handling on Windows is,
> and I'm not even sure it's Python's fault 

Unicode handling is only a mess in the Windows API if you think Unicode is synonymous with UTF-8. Windows NT is Unicode down to the lowest levels of the kernel, but it's UTF-16 using 16-bit wide characters. Part of the problem is that the C and POSIX APIs that are preferred by cross-platform applications are byte-oriented (e.g. null-terminated char strings), so Unicode support becomes synonymous with UTF-8. On Windows this leaves you stuck using the ANSI codepage, which unfortunately cannot be set to codepage 65001. Microsoft would have to rewrite a lot of code to support UTF-8 in the ANSI API, and they have no incentive to pay for that given that they're heavily invested in UTF-16.
