classification
Title: Extra newline appended to UTF-8 strings on Windows
Type: behavior
Stage: resolved
Components: Unicode, Windows
Versions: Python 3.5
process
Status: closed
Resolution: third party
Dependencies:
Superseder:
Assigned To:
Nosy List: Egor Tensin, eryksun, ezio.melotti, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal
Keywords:

Created on 2016-02-12 01:06 by Egor Tensin, last changed 2016-02-12 10:42 by eryksun. This issue is now closed.

Messages (3)
msg260153 - Author: Egor Tensin (Egor Tensin) Date: 2016-02-12 01:06
I've come across an issue of Python 3.5.1 appending an extra newline when print()ing non-ASCII strings on Windows.

This only happens when the active "code page" is set to UTF-8 in cmd.exe:

    >chcp
    Active code page: 65001

Now, if I try to print an ASCII character (e.g. LATIN CAPITAL LETTER A), everything works fine:

    >python -c "print(chr(0x41))"
    A

    >

But if I try to print something a little less common (GREEK CAPITAL LETTER ALPHA), something weird happens:

    >python -c "print(chr(0x391))"
    Α


    >

For another example, let's try to print CYRILLIC CAPITAL LETTER A:

    >python -c "print(chr(0x410))"
    А


    >

This only happens if the current code page is UTF-8 though.
If I change it to something that can represent those characters, everything seems to be working fine.
For example, the Greek letter:

    >chcp 1253
    Active code page: 1253

    >python -c "print(chr(0x391))"
    Α

    >

And the Cyrillic letter:

    >chcp 1251
    Active code page: 1251

    >python -c "print(chr(0x410))"
    А

    >

This also happens if one tries to print a string with a funny character somewhere in it. Sometimes it's even worse:

    >python -c "print('Привет!')"
    Привет!
    �т!


    >

Look, guys, I know what a mess Unicode handling on Windows is, and I'm not even sure it's Python's fault, I just wanted to make sure I'm not delusional and not making stuff up.
Can somebody at least confirm this? Thank you.

I'm using the x86-64 version of Python 3.5.1 on Windows 8.1.
msg260170 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2016-02-12 10:27
I guess that it's yet another example of the bug #1602: "windows console doesn't print or input Unicode".

Don't use the Windows console; use a console with better Unicode support instead. For example, you can play with IDLE :-) (Maybe PowerShell or ConEmu?)
https://conemu.github.io/
msg260173 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-02-12 10:42
This is a third-party problem due to bugs in the console's support for codepage 65001. For the general problem of Unicode in the console, see issue 1602. The best way to resolve this problem is by using the wide-character APIs, WriteConsoleW and ReadConsoleW. I suggest that you try the win_unicode_console package.

> But if I try to print something a little less common
> (GREEK CAPITAL LETTER ALPHA), something weird happens:
>
>    >python -c "print(chr(0x391))"
>    Α
>
>
>    >

In versions of Windows that use the legacy console, WriteFile to a console screen mistakenly returns the number of UTF-16 codes written instead of the number of bytes written. 

For example, '\u0391\r\n' gets encoded as a four-byte buffer, b'\xce\x91\r\n'. Here's the result of writing this buffer to the legacy console, using codepage 65001:

    >>> sys.stdout.buffer.raw.write(b'\xce\x91\r\n')
    Α
    3

Four bytes were written, but the console returns that it wrote three UTF-16 codes. Python's BufferedWriter (i.e. sys.stdout.buffer) sees this as an incomplete write. So it writes the last byte again. That's why you see an extra newline. The problem can be far worse if the UTF-8 buffer contains many non-ASCII characters, especially if it includes codes greater than U+07FF that get encoded as three bytes. 
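
To make the retry concrete, here is a minimal sketch that models the faulty return value with a stand-in raw stream. The class `LegacyConsoleRaw` and the helper `chunks_seen` are hypothetical names that only imitate the behavior described above; this is not how the console itself is implemented, and it runs on any platform.

```python
import io

class LegacyConsoleRaw(io.RawIOBase):
    """Hypothetical stand-in for the legacy console under codepage 65001."""

    def __init__(self):
        super().__init__()
        self.received = []  # every chunk handed to write(), in order

    def writable(self):
        return True

    def write(self, b):
        data = bytes(b)
        self.received.append(data)
        # Modeled bug: report UTF-16 code units written, not bytes written.
        text = data.decode('utf-8', errors='replace')
        return len(text.encode('utf-16-le')) // 2

def chunks_seen(text):
    """Return the chunks the fake console receives for one print()-style write."""
    raw = LegacyConsoleRaw()
    out = io.BufferedWriter(raw)
    out.write((text + '\r\n').encode('utf-8'))
    out.flush()  # BufferedWriter retries whatever it believes was not written
    return raw.received

print(chunks_seen('\u0391'))
print([c.decode('utf-8', errors='replace') for c in chunks_seen('Привет!')])
```

For the Greek alpha the stand-in receives the four-byte buffer followed by a lone b'\n'. For 'Привет!' it receives the full line, then the six-byte tail that decodes to a replacement character plus 'т!', then a final newline, which matches the output in the original report.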

This particular problem is fixed in the new version of the console that comes with Windows 10. For the legacy console, you can work around the problem by hooking WriteConsoleA and WriteFile via DLL injection. For example, ANSICON and ConEmu do this.

That said, there's a far worse problem with using codepage 65001 in the console, which still exists in Windows 10. Due to this bug Python's interactive REPL will quit whenever you try to enter non-ASCII characters, and built-in input() will raise EOFError. For example:

    >>> input()
    Ü
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    EOFError

To read the console's wide-character (UTF-16) input buffer via ReadFile, it has to first get encoded to the current codepage. The console does the conversion via WideCharToMultiByte with a buffer size that assumes each UTF-16 value will be encoded as a single byte. But that's wrong for UTF-8, in which one UTF-16 code can map to as many as three bytes. So WideCharToMultiByte fails, but does the console try to increase the buffer size? No. Does it fail the call? No. It actually returns back that it 'successfully' read 0 bytes. To the REPL and built-in input() that signals EOF (end of file).
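
The effect on Python's text layer can be sketched the same way. The stand-in below is an assumption-laden model, not the console's real code: `Cp65001ConsoleInput` and `read_line` are hypothetical names, and the "one byte per UTF-16 code" sizing rule is taken from the description above.

```python
import io

class Cp65001ConsoleInput(io.RawIOBase):
    """Hypothetical stand-in for console input read via ReadFile under codepage 65001."""

    def __init__(self, typed):
        super().__init__()
        self.typed = typed  # the line sitting in the console's UTF-16 input buffer

    def readable(self):
        return True

    def readinto(self, buf):
        text, self.typed = self.typed, ''
        encoded = text.encode('utf-8')
        utf16_units = len(text.encode('utf-16-le')) // 2
        # Modeled bug: the conversion buffer holds one byte per UTF-16 code,
        # so any non-ASCII input overflows it and the read "succeeds" with 0 bytes.
        if len(encoded) > utf16_units:
            return 0
        buf[:len(encoded)] = encoded
        return len(encoded)

def read_line(typed):
    """Read one line through the same buffered/text layers a file-based stdin uses."""
    raw = Cp65001ConsoleInput(typed)
    stdin = io.TextIOWrapper(io.BufferedReader(raw), encoding='utf-8')
    return stdin.readline()

print(repr(read_line('A\r\n')))
print(repr(read_line('\u00dc\r\n')))
```

The ASCII line comes back intact, while the line containing Ü comes back as an empty string. An empty result from readline() is what a reader takes to mean end of file.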

If you only need to input text in your system locale, you can try to have the best of both worlds. Use chcp.com to set the command prompt to the codepage you need for input. Then in your Python script (e.g. in sitecustomize.py) you can use ctypes to change just the output codepage and rebind sys.stdout. For example:

    >>> import os, sys, ctypes
    >>> ctypes.WinDLL('kernel32').SetConsoleOutputCP(65001)
    1
    >>> sys.stdout = open(os.dup(sys.__stdout__.fileno()), 'w', encoding='cp65001')

    >>> sys.stdin.encoding
    'cp1252'
    >>> input()
    Ü
    'Ü'
    >>> print('\u0391')
    Α

Another minor bug is that the console doesn't keep an overlapping window in case a UTF-8 sequence gets split across multiple writes (typically due to buffering). For example:

    >>> exec(r'''
    ... sys.stdout.buffer.raw.write(b'\xce')
    ... sys.stdout.buffer.raw.write(b'\x91')
    ... ''')
    ��>>>

Since UTF-8 uses up to four bytes per code, the console would have to keep a three-byte buffer to handle the case of a split write.
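
As an illustration of what keeping that state buys you, here is a small sketch using Python's own incremental decoder from the codecs module. It only demonstrates the idea of carrying an incomplete sequence across writes; it says nothing about how the console would implement it.

```python
import codecs

chunks = [b'\xce', b'\x91']  # the two halves of UTF-8 encoded U+0391

# Stateless decoding, one write at a time: each half is invalid on its own.
stateless = ''.join(c.decode('utf-8', errors='replace') for c in chunks)

# Stateful decoding: the incremental decoder holds back the incomplete prefix
# until the continuation byte arrives in the next write.
decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
stateful = ''.join(decoder.decode(c) for c in chunks)

print(ascii(stateless))
print(ascii(stateful))
```

The stateless path yields two replacement characters, like the console output above, while the stateful path yields the single character U+0391.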

> Look, guys, I know what a mess Unicode handling on Windows is,
> and I'm not even sure it's Python's fault 

Unicode handling is only a mess in the Windows API if you think Unicode is synonymous with UTF-8. Windows NT is Unicode down to the lowest levels of the kernel, but it's UTF-16 using 16-bit wide characters. Part of the problem is that the C and POSIX APIs that are preferred by cross-platform applications are byte oriented (e.g. null-terminated char strings), so Unicode support becomes synonymous with UTF-8. On Windows this leaves you stuck using the ANSI codepage, which unfortunately cannot be set to codepage 65001. Microsoft would have to rewrite a lot of code to support UTF-8 in the ANSI API, and they have no incentive to pay for that given that they're heavily invested in UTF-16.
History
Date                 User         Action  Args
2016-02-12 10:42:03  eryksun      set     status: open -> closed; nosy: + eryksun; messages: + msg260173; resolution: third party; stage: resolved
2016-02-12 10:27:39  vstinner     set     messages: + msg260170
2016-02-12 01:06:05  Egor Tensin  create