
classification
Title: Unicode character ends interactive session
Type: crash
Stage: resolved
Components: Unicode, Windows
Versions: Python 3.4
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: windows console doesn't print or input Unicode (issue 1602)
Assigned To:
Nosy List: AGrzes, eryksun, ezio.melotti, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal
Keywords:

Created on 2015-02-09 20:37 by AGrzes, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name: -v.txt
Uploaded: AGrzes, 2015-02-09 20:37
Messages (3)
msg235629 - Author: Grzegorz Abramczyk (AGrzes) Date: 2015-02-09 20:37
Inputting some Unicode characters (like 'łąśćńó...') causes the interactive session to abort.

When the console session is set to use the UTF-8 code page (65001), the session abruptly ends as soon as a character with a diacritic appears in a string. The debug output suggests that some cleanup is performed, but there are no error messages indicating what caused the problem.

The problem was spotted on Windows 10 (Technical Preview), but I may try to reproduce it on a released operating system.

---
C:\>chcp 1250
Active code page: 1250

C:\>python -i
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:15:05) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'ł'
'ł'
>>> exit()

C:\>chcp 65001
Active code page: 65001

C:\>python -i
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  6 2014, 22:15:05) [MSC v.1600 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> 'ł'


C:\
msg235644 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-02-09 23:05
This issue looks like a duplicate of issue #1602: windows console doesn't print or input Unicode. It's a limitation of Windows, not of Python itself. Python supports any Unicode character if the output is written to a file (encoded in UTF-8).

Workaround: use IDLE or another Python "REPL" (interactive interpreter) that has better Unicode support.
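
To illustrate the point about file output, here is a minimal sketch that writes the characters from the report to a UTF-8 encoded file and reads them back. The file name diacritics.txt and the temporary directory are assumptions made for the example. The script prints only ASCII, so it does not depend on the console code page.

```python
import os
import tempfile

# Characters from the original report.
text = "łąśćńó"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this illustration.
    path = os.path.join(tmp, "diacritics.txt")

    # Write the text explicitly encoded as UTF-8.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read the raw bytes to see what actually landed on disk.
    with open(path, "rb") as f:
        raw = f.read()

    # Read the text back, decoding as UTF-8.
    with open(path, encoding="utf-8") as f:
        round_trip = f.read()

# Print only ASCII so the result is visible on any console code page.
print("characters:", len(text), "bytes on disk:", len(raw))
print("round trip intact:", round_trip == text)
```

The byte count is larger than the character count because each of these characters needs more than one byte in UTF-8.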
msg235655 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2015-02-10 01:51
This isn't a Python bug. The Windows console doesn't properly support UTF-8. See issue 1602 and Drekin's win-unicode-console, an alternative REPL based on the wide-character (UCS-2) console API.

FWIW, I attached a debugger to conhost.exe under Windows 7 to inspect what's happening here. In the client, the CRT's read() function calls WinAPI ReadFile. For a console handle this calls either ReadConsoleA or (in Windows 8+) NtReadFile. Either way, most of the action happens in the server process, conhost.exe. 

The server's input buffer is Unicode, which gets encoded to CP 65001 (UTF-8) by calling WideCharToMultiByte. However, the server incorrectly assumes the current codepage is a Windows ANSI codepage with a one-to-one mapping, i.e. that each 16-bit wchar_t maps to an 8-bit char in the current codepage. Since 'ł' gets UTF-8 encoded as the two-byte string b'\xc5\x82', the allocated buffer is too small by a byte. The server doesn't recover from this failure by allocating a larger buffer. It just reports back to the client process that it read 0 bytes. The CRT in turn sets the end-of-file (EOF) flag on the stdin FILE stream, which causes Python to exit 'normally'.
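
The buffer-size mismatch described above can be sketched in Python. This is only a toy model of the behaviour, not the actual conhost.exe code: the helper names utf16_units and simulated_read are hypothetical, and the "buffer" is just a byte count derived from the one-byte-per-wchar_t assumption.

```python
def utf16_units(text: str) -> int:
    """Number of 16-bit code units (wchar_t on Windows) in the text."""
    return len(text.encode("utf-16-le")) // 2


def simulated_read(text: str, codec: str) -> int:
    """Toy model of the failing read.

    The buffer is sized for one byte per UTF-16 code unit. If the encoded
    text does not fit, report 0 bytes read, which the caller would
    interpret as end-of-file.
    """
    buffer_size = utf16_units(text)  # assumes a one-to-one mapping
    encoded = text.encode(codec)
    if len(encoded) > buffer_size:
        return 0  # no retry with a larger buffer
    return len(encoded)


# Compare a single-byte ANSI code page with UTF-8 for the same character.
for codec in ("cp1250", "utf-8"):
    encoded = "ł".encode(codec)
    print(codec, len(encoded), "byte(s) needed,", simulated_read("ł", codec), "read")
```

Under code page 1250 the character fits in the one-byte buffer, while under UTF-8 it needs two bytes and the modelled read reports 0, matching the EOF behaviour seen in the transcript.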
History
Date User Action Args
2022-04-11 14:58:12 admin set github: 67612
2015-02-13 21:42:39 terry.reedy set status: open -> closed; superseder: windows console doesn't print or input Unicode; resolution: duplicate; stage: resolved
2015-02-10 01:51:45 eryksun set nosy: + eryksun; messages: + msg235655
2015-02-09 23:05:33 vstinner set messages: + msg235644
2015-02-09 20:37:19 AGrzes create