classification
Title: input() with Unicode prompt produces mojibake on Windows
Type: behavior
Stage: resolved
Components: Unicode, Windows
Versions: Python 3.7, Python 3.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: steve.dower
Nosy List: Drekin, eryksun, ezio.melotti, ned.deily, paul.moore, python-dev, steve.dower, terry.reedy, tim.golden, vstinner, zach.ware
Priority: normal
Keywords: 3.5regression, patch

Created on 2016-10-01 15:20 by Drekin, last changed 2017-03-31 16:36 by dstufft. This issue is now closed.

Files
File name Uploaded Description
issue_28333_01.patch eryksun, 2016-10-08 01:01 review
Pull Requests
URL Status Linked
PR 552 closed dstufft, 2017-03-31 16:36
Messages (13)
msg277821 - Author: Adam Bartoš (Drekin) * Date: 2016-10-01 15:20
In my setting (Python 3.6b1 on Windows), trying to prompt a non-ASCII character via input() results in mojibake. This is related to the recent fix of #1602 and so is Windows-specific.

>>> input("α")
╬▒

The result corresponds to print("α".encode("utf-8").decode("cp852")). cp852 is the default terminal encoding in my locale.
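
For illustration, the mismatch described above can be reproduced on any platform with a short Python snippet. This is only a sketch of the symptom (UTF-8 bytes reinterpreted as an OEM code page), not of the console code path itself; the variable names are made up for the example, and cp437 is included because the same output is reported for it later in this thread.

```python
# Reproduce the mojibake: the prompt's UTF-8 bytes are misread as an OEM code page.
prompt = "α"
utf8_bytes = prompt.encode("utf-8")  # b'\xce\xb1'

for codepage in ("cp852", "cp437"):
    # Each UTF-8 byte is decoded as a separate code page character.
    garbled = utf8_bytes.decode(codepage)
    print(codepage, repr(garbled))
```

Both code pages map the two bytes to box-drawing and shade characters, which is why the single-character prompt shows up as two unrelated glyphs.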
msg278274 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-10-07 21:51
Same output with cp437.
msg278275 - Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-10-07 21:52
This is a regression from 3.5.2, where input("α") displays "α".
msg278277 - Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-10-07 22:08
This may force issue17620 into 3.6 - we really ought to be getting and using sys.stdin and sys.stderr in PyOS_StdioReadline() rather than going directly to the raw streams.

The problem here is that we're still using fprintf to output the prompt, even though we know (assume) the input is utf-8.

I haven't looked closely at how safely we can use Python objects from this code, except to see that it's not obviously safe, but we should really figure out how to deal in Python str rather than C char* for the default readline implementation (and then only fall back on the GNU protocol when someone asks for it).

The faster fix here would be to decode the prompt from utf-8 to utf-16-le in PyOS_StdioReadline and then write it using a wide-char output function.
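
As a rough illustration of that conversion step, here is a minimal Python sketch. It is not the C code in PyOS_StdioReadline; the helper name prompt_to_utf16le is hypothetical, and the sketch only shows the re-encoding, not the wide-char console write.

```python
def prompt_to_utf16le(prompt_utf8: bytes) -> bytes:
    """Hypothetical helper: decode a UTF-8 prompt and re-encode it as UTF-16-LE."""
    return prompt_utf8.decode("utf-8").encode("utf-16-le")


raw = "αβψδ: ".encode("utf-8")   # the bytes the readline hook receives
wide = prompt_to_utf16le(raw)    # what a wide-char output function would be given

# Byte count of the UTF-8 prompt versus UTF-16 code units (2 bytes each).
print(len(raw), len(wide) // 2)
```

The point of the sketch is that the wide-char write receives code units rather than code page bytes, so the console's active code page no longer affects how the prompt is displayed.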
msg278281 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-10-07 22:32
When I pointed this issue out in code reviews, I assumed you would add the relatively simple fix to decode the prompt and call WriteConsoleW. The long-term fix in issue 17620 has to be worked out with cross-platform support, and ISTM that it can wait for 3.7.

Off topic: I just noticed that you're not calling PyOS_InputHook in the new PyOS_StdioReadline code. Tkinter registers this function pointer to call its EventHook. Do you want a separate issue for this, or is there a reason it was omitted?
msg278284 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-10-08 01:01
I'm sure Steve already has this covered, but FWIW here's a patch to call WriteConsoleW. Here's the result with the patch applied:

    >>> sys.ps1 = '»»» '
    »»» input("αβψδ: ")
    αβψδ: spam
    'spam'

and with interactive stdin and stdout/stderr redirected to a file:

    >set PYTHONIOENCODING=utf-8
    >amd64\python_d.exe >out.txt 2>&1
    input("αβψδ: ")
    spam
    ^Z

    >chcp 65001
    Active code page: 65001

    >type out.txt
    Python 3.6.0b1+ (default, Oct  7 2016, 23:47:58)
    [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> αβψδ: 'spam'
    >>>

If it can't write the prompt for some reason (e.g. out of memory, decoding fails, WriteConsole fails), it doesn't fall back on fprintf to write the prompt. Should it? 

This should also get a test that calls ReadConsoleOutputCharacter to verify that the correct prompt is written.
msg278317 - Author: Roundup Robot (python-dev) (Python triager) Date: 2016-10-08 19:19
New changeset faf5493e6f61 by Steve Dower in branch '3.6':
Issue #28333: Enables Unicode for ps1/ps2 and input() prompts. (Patch by Eryk Sun)
https://hg.python.org/cpython/rev/faf5493e6f61

New changeset cb62e921bd06 by Steve Dower in branch 'default':
Issue #28333: Enables Unicode for ps1/ps2 and input() prompts. (Patch by Eryk Sun)
https://hg.python.org/cpython/rev/cb62e921bd06
msg278318 - Author: Roundup Robot (python-dev) (Python triager) Date: 2016-10-08 19:21
New changeset 63ceadf8410f by Steve Dower in branch '3.6':
Issue #28333: Remove unnecessary increment.
https://hg.python.org/cpython/rev/63ceadf8410f

New changeset d76c8f9ea787 by Steve Dower in branch 'default':
Issue #28333: Remove unnecessary increment.
https://hg.python.org/cpython/rev/d76c8f9ea787
msg278319 - Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-10-08 19:23
I made some minor tweaks to the patch (no need for strlen() - passing -1 works equivalently), but otherwise it's exactly what I would have done so I committed it.

We currently have no tests to check which characters are written to a console output buffer. Issue28217 was tracking those, but considering how little code we have on top of output I don't think it's worth blocking anything on automating those tests.
msg278624 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-10-13 23:04
MultiByteToWideChar includes the trailing NUL when it gets the string length, so the WriteConsoleW call needs to use (wlen - 1).
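
The off-by-one can be modelled in plain Python. This is a sketch under stated assumptions: wide_length_with_nul is a hypothetical helper that mimics the count returned when the input length is given as -1 (NUL-terminated), and byte slicing stands in for the number of characters passed to WriteConsoleW. It does not call the Windows API.

```python
def wide_length_with_nul(prompt_utf8: bytes) -> int:
    """Hypothetical helper: UTF-16 code units in the prompt, counting the
    terminating NUL, as a NUL-terminated (-1) input length would."""
    text = prompt_utf8.decode("utf-8") + "\0"
    return len(text.encode("utf-16-le")) // 2


prompt = b">>> "
wlen = wide_length_with_nul(prompt)
wide = (prompt.decode("utf-8") + "\0").encode("utf-16-le")

# Writing wlen code units also emits the NUL; writing wlen - 1 emits only the prompt.
written_buggy = wide[: wlen * 2].decode("utf-16-le")
written_fixed = wide[: (wlen - 1) * 2].decode("utf-16-le")
print(repr(written_buggy), repr(written_fixed))
```

In this model the extra code unit is the NUL terminator, which matches the stray character after the prompt reported in the next message.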
msg279427 - Author: Steve Dower (steve.dower) * (Python committer) Date: 2016-10-25 17:50
Not sure how I missed it originally, but that extra 1 char is actually very important:

Python 3.6.0b2 (v3.6.0b2:b9fadc7d1c3f, Oct 10 2016, 20:36:51) [MSC v.1900 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>  import sys
>>>  sys.ps1='> '
>  sys

The extra space is because of that. Really ought to fix this before the next beta.
msg279435 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-10-25 18:08
I forgot to include the link to the python-list thread where this came up:

https://mail.python.org/pipermail/python-list/2016-October/715428.html
msg279445 - Author: Roundup Robot (python-dev) (Python triager) Date: 2016-10-25 18:52
New changeset 6b46c3deea2c by Steve Dower in branch '3.6':
Issue #28333: Fixes off-by-one error that was adding an extra space.
https://hg.python.org/cpython/rev/6b46c3deea2c

New changeset 44d15ba67d2e by Steve Dower in branch 'default':
Issue #28333: Fixes off-by-one error that was adding an extra space.
https://hg.python.org/cpython/rev/44d15ba67d2e
History
Date User Action Args
2017-03-31 16:36:19 dstufft set pull_requests: + pull_request928
2016-10-25 18:52:52 steve.dower set status: open -> closed
    resolution: fixed
    stage: needs patch -> resolved
2016-10-25 18:52:32 python-dev set messages: + msg279445
2016-10-25 18:08:15 eryksun set messages: + msg279435
2016-10-25 17:50:08 steve.dower set assignee: steve.dower
    messages: + msg279427
    nosy: + ned.deily
2016-10-13 23:04:07 eryksun set messages: + msg278624
2016-10-08 19:23:23 steve.dower set messages: + msg278319
2016-10-08 19:21:35 python-dev set messages: + msg278318
2016-10-08 19:19:01 python-dev set nosy: + python-dev
    messages: + msg278317
2016-10-08 01:01:17 eryksun set files: + issue_28333_01.patch
    keywords: + patch
    messages: + msg278284
2016-10-07 22:32:54 eryksun set nosy: + eryksun
    messages: + msg278281
2016-10-07 22:08:01 steve.dower set keywords: + 3.5regression
    stage: needs patch
    messages: + msg278277
    versions: + Python 3.7
2016-10-07 21:52:48 terry.reedy set messages: + msg278275
2016-10-07 21:51:43 terry.reedy set nosy: + terry.reedy
    messages: + msg278274
2016-10-01 15:20:25 Drekin create