classification
Title: Unicode encoding failure
Type: behavior
Stage: resolved
Components: Unicode, Windows
Versions: Python 2.7
process
Status: closed
Resolution: out of date
Dependencies:
Superseder: windows console doesn't print or input Unicode (Issue 1602)
Assigned To:
Nosy List: Robert Baker, eryksun, ezio.melotti, martin.panter, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal
Keywords:

Created on 2017-03-26 02:26 by Robert Baker, last changed 2017-03-26 13:40 by eryksun. This issue is now closed.

Messages (5)
msg290503 - (view) Author: Robert Baker (Robert Baker) Date: 2017-03-26 02:26
Using Python 2.7 (not IDLE) on Windows 10.

I have tried to use a Python 2.7 program to print the name of Czech composer Antonín Dvořák. I remembered to add the "u" before the string, but regardless of whether I encode the caron-r as a literal character (pasted from Windows Character Map) or as \u0159, it gives the error that character 0159 is undefined. This is incorrect; that character has been defined as "lower case r with caron above" for several years now. (The interpreter has no problem with the ANSI characters in the string.)
msg290506 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2017-03-26 06:06
I presume you are trying to print to the normal Windows console. I understand the console was not well supported until Python 3.6 (see Issue 1602). Have you tried that version?

I’ll leave this open for someone more experienced to confirm, but I suspect what you want may not be possible with 2.7.
msg290513 - (view) Author: Paul Moore (paul.moore) * (Python committer) Date: 2017-03-26 07:41
Also, you need to:

1. Ensure you are using characters that are available in the encoding that sys.stdout uses - in Python prior to 3.6, this would be your Windows *console* code page, and in 3.6+ would be UTF-8.
2. Declare the encoding of your source code if you are not using the default (which is ASCII in Python 2, and UTF-8 in Python 3).

Specifically, if you write your source in UTF-8, or use an encoding declaration or \u escapes, and you use Python 3.6, this problem will likely have gone away.
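
As a rough illustration of the two points above, the following sketch reports the encoding that sys.stdout uses and prints the name with any unsupported characters replaced by backslash escapes rather than raising an error. The helper names stdout_encoding and printable are made up for this example, and the 'ascii' fallback for a missing encoding is an assumption, not something stated in this thread.

```python
# -*- coding: utf-8 -*-
# The line above is the source encoding declaration mentioned in point 2.
# This file happens to be pure ASCII because it uses \u escapes throughout.
from __future__ import print_function
import sys

def stdout_encoding():
    """Return the encoding reported by sys.stdout, or 'ascii' if none is set."""
    # Assumption: fall back to ASCII when stdout has no encoding (e.g. a pipe in 2.7).
    return getattr(sys.stdout, 'encoding', None) or 'ascii'

def printable(text, encoding):
    """Return text with characters missing from encoding shown as escapes."""
    return text.encode(encoding, 'backslashreplace').decode(encoding)

name = u'Anton\u00edn Dvo\u0159\u00e1k'
enc = stdout_encoding()
print('stdout encoding:', enc)
# Safe to print: every remaining character is known to exist in enc.
print(printable(name, enc))
```

This only avoids the exception. It does not make the console display a character that its code page lacks.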
msg290517 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-03-26 08:24
For Python 2, there is https://pypi.python.org/pypi/win_unicode_console
msg290528 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2017-03-26 13:40
I'm closing this issue since Python's encodings in this case -- 852 (OEM) and 1250 (ANSI) -- both correctly map U+0159:

    >>> u'\u0159'.encode('852')
    '\xfd'
    >>> u'\u0159'.encode('1250')
    '\xf8'

You must be using an encoding that doesn't map U+0159. If you're using the console's default codepage (i.e. you haven't run chcp.com, mode.com, or called SetConsoleOutputCP), then Python started with stdout.encoding set to your locale's OEM codepage encoding. For example, if you're using a U.S. locale, it's cp437, and if you're using a Western Europe locale, it's cp850. Neither of these includes U+0159.
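
To see which of the code pages mentioned above can represent the whole name, a small check along these lines may help. The can_encode helper is a name invented for this sketch.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

name = u'Anton\u00edn Dvo\u0159\u00e1k'

# cp437 and cp850 are the OEM defaults named above; cp852 and cp1250 are the
# Central European OEM and ANSI code pages.
for codepage in ('cp437', 'cp850', 'cp852', 'cp1250'):
    print('%s: %s' % (codepage, can_encode(name, codepage)))
```

If the console's active code page is one of those that cannot encode the name, printing it in 2.7 raises UnicodeEncodeError.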

We're presented with this codepage hell because the WriteFile and WriteConsoleA functions write a stream of bytes to the console, and it needs to be told how to decode these bytes to get Unicode text. It would be nice if the console's UTF-8 implementation (codepage 65001) wasn't buggy, but Microsoft has never cared enough to fix it (at least not completely; it's still broken for input in Windows 10). 

That leaves the wide-character UTF-16 function, WriteConsoleW, as the best alternative. Using this function requires bypassing Python's normal standard I/O implementation. This has been implemented as of 3.6. But for older versions you'll need to install and enable win_unicode_console.
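
For illustration only, here is a minimal ctypes sketch of calling WriteConsoleW directly. It is not how Python 3.6 or win_unicode_console implements this; the write_console name and its return-False-when-unavailable behaviour are assumptions made for the example. It returns False on non-Windows platforms and when stdout is not a console (for example, when redirected to a file).

```python
import sys

def write_console(text):
    """Write text to the Windows console via WriteConsoleW.

    Returns True on success, False if no console is available.
    """
    if sys.platform != 'win32':
        return False
    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.GetConsoleMode.argtypes = [wintypes.HANDLE,
                                        ctypes.POINTER(wintypes.DWORD)]
    kernel32.GetConsoleMode.restype = wintypes.BOOL
    kernel32.WriteConsoleW.argtypes = [wintypes.HANDLE, wintypes.LPCWSTR,
                                       wintypes.DWORD,
                                       ctypes.POINTER(wintypes.DWORD),
                                       wintypes.LPVOID]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    STD_OUTPUT_HANDLE = -11
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE & 0xFFFFFFFF)

    # GetConsoleMode fails when the handle is a file or pipe, not a console.
    mode = wintypes.DWORD()
    if not kernel32.GetConsoleMode(handle, ctypes.byref(mode)):
        return False

    # WriteConsoleW counts UTF-16 code units, not code points.
    units = len(text.encode('utf-16-le')) // 2
    written = wintypes.DWORD()
    ok = kernel32.WriteConsoleW(handle, text, units,
                                ctypes.byref(written), None)
    return bool(ok)

name = u'Anton\u00edn Dvo\u0159\u00e1k'
if not write_console(name + u'\n'):
    # Fallback that cannot raise: show non-ASCII characters as escapes.
    print(name.encode('ascii', 'backslashreplace').decode('ascii'))
```

Whether the glyph actually appears still depends on the console font. A real implementation would also need to handle input, partial writes, and very long strings.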
History
Date                 User           Action  Args
2017-03-26 13:40:02  eryksun        set     status: open -> closed; nosy: + eryksun; messages: + msg290528; stage: resolved
2017-03-26 08:24:40  vstinner       set     messages: + msg290517
2017-03-26 07:41:15  paul.moore     set     messages: + msg290513
2017-03-26 06:06:32  martin.panter  set     nosy: + ezio.melotti, paul.moore, tim.golden, vstinner, martin.panter, zach.ware, steve.dower; messages: + msg290506; superseder: windows console doesn't print or input Unicode; components: + Unicode, Windows; resolution: out of date
2017-03-26 02:26:08  Robert Baker   create