Unicode character ends interactive session #67612
Entering some Unicode characters (like 'łąśćńó...') causes the interactive session to abort. When the console session is set to use the UTF-8 code page (65001), the session abruptly ends as soon as a character with a diacritic appears in a string. The debug output suggests that some cleanup is performed, but there are no error messages indicating what caused the problem. The problem was spotted on Windows 10 (technical preview), but I may try to replicate it on a released operating system.
---
C:\>python -i
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 6 2014, 22:15:05) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'ł'
'ł'
>>> exit()
C:\>chcp 65001
C:\>python -i
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 6 2014, 22:15:05) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'ł'
C:\>
This issue looks like a duplicate of bpo-1602: windows console doesn't print or input Unicode. It's a limitation of Windows, not of Python itself. Python supports any Unicode character if the output is written to a file (encoded in UTF-8). Workaround: use IDLE or another Python "REPL" (interactive interpreter) that has better Unicode support.
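To illustrate the point about files, here is a minimal sketch showing that the same characters survive a round trip through a UTF-8 encoded file. The helper name `roundtrip_through_file` is made up for this example; it uses only the standard library and does not touch the console.

```python
import os
import tempfile


def roundtrip_through_file(text):
    """Write text to a temporary UTF-8 file and read it back (hypothetical helper)."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    try:
        # The explicit encoding avoids depending on the console code page.
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
        with open(path, "r", encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


sample = "łąśćńó"
# ascii() keeps the printed output safe on consoles that cannot show these characters.
print(ascii(roundtrip_through_file(sample)))
```

The result is printed with `ascii()` on purpose, so the example itself does not depend on the console being able to display the characters.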
This isn't a Python bug. The Windows console doesn't properly support UTF-8. See bpo-1602 and Drekin's win-unicode-console, an alternative REPL based on the wide-character (UCS-2) console API.

FWIW, I attached a debugger to conhost.exe under Windows 7 to inspect what's happening here. In the client, the CRT's read() function calls WinAPI ReadFile. For a console handle this calls either ReadConsoleA or (in Windows 8+) NtReadFile. Either way, most of the action happens in the server process, conhost.exe. The server's input buffer is Unicode, which gets encoded to CP 65001 (UTF-8) by calling WideCharToMultiByte. However, the server incorrectly assumes the current codepage is a Windows ANSI codepage with a one-to-one mapping, i.e. that each 16-bit wchar_t maps to an 8-bit char in the current codepage. Since 'ł' gets UTF-8 encoded as the two-byte string b'\xc5\x82', the allocated buffer is too small by a byte. The server doesn't recover from this failure by allocating a larger buffer. It just reports back to the client process that it read 0 bytes. The CRT in turn sets the end-of-file (EOF) flag on the stdin FILE stream, which causes Python to exit 'normally'.
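The size mismatch described above can be reproduced in plain Python. This is only a rough simulation, not conhost.exe's actual code: the function names are hypothetical, and `simulated_console_read` simply assumes a buffer sized at one byte per UTF-16 code unit and returns zero bytes when the UTF-8 encoding does not fit.

```python
def utf16_units(text):
    """Number of 16-bit code units the text occupies as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


def utf8_bytes(text):
    """Number of bytes the text occupies as UTF-8."""
    return len(text.encode("utf-8"))


def simulated_console_read(text):
    """Hypothetical model of the failing read path.

    The buffer is sized as if every UTF-16 code unit mapped to one byte.
    If the UTF-8 encoding is larger, the read yields no bytes at all,
    which the caller cannot tell apart from end-of-file.
    """
    buffer_size = utf16_units(text)
    encoded = text.encode("utf-8")
    if len(encoded) > buffer_size:
        return b""  # reported as "0 bytes read"
    return encoded


for line in ("'a'\r\n", "'\u0142'\r\n"):
    data = simulated_console_read(line)
    status = "EOF (session ends)" if data == b"" else "ok"
    print(ascii(line), utf16_units(line), utf8_bytes(line), status)
```

In this model an ASCII-only line fits exactly, while any line containing 'ł' needs one more byte than the buffer holds, so the reader sees an empty result and treats it as EOF.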