classification
Title: Incorrectly displayed non ascii characters in prompt using "input()" - Python 3.0a2
Type: Stage:
Components: Unicode, Windows Versions: Python 3.0
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, benjamin.peterson, christian.heimes, gvanrossum, loewis, vbr
Priority: release blocker Keywords: patch

Created on 2007-12-22 21:09 by vbr, last changed 2008-09-21 22:11 by amaury.forgeotdarc. This issue is now closed.

Files
File name Uploaded Description Edit
inputprompt.patch amaury.forgeotdarc, 2008-09-20 20:20
Messages (16)
msg58965 - (view) Author: Vlastimil Brom (vbr) Date: 2007-12-22 21:09
While testing the 3.0a2 build (on Win XP Home SP2, Czech), I found a 
possible bug in the input() function: 
if the prompt text contains non-ASCII characters (even those present in 
the default charset of the system locale, Czech in this case), the 
prompt is displayed incorrectly; however, the entered value is handled 
as expected.

The print() function deals with these characters correctly. 
This bug occurs only in the system console (cmd.exe); in IDLE 
everything works fine.


============ a minimal snapshot of the session follows ==========

Python 3.0a2 (r30a2:59397:59399, Dec  6 2007, 22:34:52) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> input("ěšč: ")
─Ť┼í─Ź: 7
'7'
>>> print("ěšč: ")
ěšč:
>>>

==================================


Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> input(u"ěšč: ")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-2: ordinal not in range(128)
>>> print u"ěšč: "
ěšč:
>>>
msg58969 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-12-23 10:11
Would you like to work on a patch?
msg59039 - (view) Author: Vlastimil Brom (vbr) Date: 2007-12-29 19:53
First, sorry about the delayed response. I fear that preparing a 
patch would be far beyond my programming competence; sorry about that.
msg59125 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-01-03 06:14
I think I understand what's going on.  The trail leads from the last "if
(tty) {" block in builtin_input() to PyOS_Readline() which in turn ends
up calling PyOS_StdioReadline() (because that's the most likely
initialization of PyOS_ReadlineFunctionPointer).  And this, finally,
uses fprintf() to stderr to print the prompt.  That apparently doesn't
use the same encoding, or perhaps by now the string has been encoded as
UTF-8.

This is clearly a problem.  But what to do about it...
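
The garbled prompt in the original report is consistent with the second guess above: UTF-8 bytes shown by a console that uses a different code page. A minimal sketch, assuming the Czech cmd.exe console used OEM code page 852 (the thread never states the code page, so this is an assumption):

```python
# Assumption: the Czech cmd.exe console uses OEM code page 852.
prompt = "ěšč: "

# The bytes the C-level prompt printing would receive if the prompt
# had already been encoded as UTF-8.
utf8_bytes = prompt.encode("utf-8")

# What a cp852 console would display when handed those bytes.
displayed = utf8_bytes.decode("cp852")

# Compare with the garbled prompt quoted in the report.
matches_report = displayed == "─Ť┼í─Ź: "
print(matches_report)
```

Under that assumption, each two-byte UTF-8 sequence is rendered as two unrelated cp852 characters, which is the pattern seen in the session transcript.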
msg59140 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-03 18:19
Windows needs its own PyOS_StdioReadline() function in order to support
wide chars. We can either use the low-level functions _putwch() and
_getwche(), or we could probably use the higher-level functions
_cwprintf_s() (secure console wide char print format, oh I love MS'
naming scheme) and _cgetws_s().
msg59141 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-01-03 18:40
Cool.

I suspect Unix will also require a customized version to be used in case
GNU readline isn't present.

And I wouldn't be surprised if GNU readline itself doesn't handle UTF-8
properly either!
msg59142 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-01-03 18:51
Guido van Rossum wrote:
> I suspect Unix will also require a customized version to be used in case
> GNU readline isn't present.
> 
> And I wouldn't be surprised if GNU readline itself doesn't handle UTF-8
> properly either!

GNU readline can handle UTF-8 chars fine on my system:

äßé: ä
ä

My locales are set to de_DE.UTF-8

Christian
msg59144 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-01-03 19:18
If possible, I would like to see the C library phased out of Python on
Windows, for file I/O. In this case, it would mean that ReadConsoleW is
used directly for character input. Notice that _cgetws does not take a
file handle as a parameter, but implicitly uses _coninpfh.

As a consequence, PyOS_StdioReadline probably should change its
parameter from FILE* to "file handle", and consequently rename it to,
say, PyOS_Readline.
msg59685 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-01-11 00:19
Isn't it enough to encode the prompt with the console encoding, instead
of relying on the default UTF-8 conversion? This patch corrects the issue
on Windows:

Index: ../Python/bltinmodule.c
===================================================================
--- ../Python/bltinmodule.c     (revision 59843)
+++ ../Python/bltinmodule.c     (working copy)
@@ -1358,12 +1358,19 @@
                else
                        Py_DECREF(tmp);
                if (promptarg != NULL) {
-                       po = PyObject_Str(promptarg);
+                       PyObject *stringpo = PyObject_Str(promptarg);
+                       if (stringpo == NULL) {
+                               Py_DECREF(stdin_encoding);
+                               return NULL;
+                       }
+                       po = PyUnicode_AsEncodedString(stringpo,
+                               PyUnicode_AsString(stdin_encoding), NULL);
+                       Py_DECREF(stringpo);
                        if (po == NULL) {
                                Py_DECREF(stdin_encoding);
                                return NULL;
                        }
-                       prompt = PyUnicode_AsString(po);
+                       prompt = PyString_AsString(po);
                        if (prompt == NULL) {
                                Py_DECREF(stdin_encoding);
                                Py_DECREF(po);
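
The idea behind the patch can be sketched in Python: convert the prompt to text, then encode it with the console's encoding rather than UTF-8 before the bytes reach the C-level readline function. The helper name `encode_prompt` and the `cp852` code page are assumptions for illustration only; they are not part of the patch.

```python
def encode_prompt(prompt, console_encoding):
    """Return the prompt as bytes in the console's encoding.

    Roughly mirrors the patched C logic: take str() of the argument,
    then encode it with the stream encoding instead of relying on the
    default UTF-8 conversion.
    """
    text = str(prompt)                    # corresponds to PyObject_Str(promptarg)
    return text.encode(console_encoding)  # corresponds to PyUnicode_AsEncodedString(...)


# Assumption: a Czech console using code page 852.
raw = encode_prompt("ěšč: ", "cp852")

# A cp852 console decoding these bytes recovers the intended prompt.
print(raw.decode("cp852") == "ěšč: ")
```

Unlike the C code, this sketch lets an encoding failure surface as a UnicodeEncodeError instead of returning NULL.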
msg59695 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-01-11 08:36
> Isn't it enough to encode the prompt with the console encoding, instead
> of relying on the default UTF-8 conversion? This patch corrects the issue
> on Windows:

Sounds right. Technically, you should be using the stdout encoding, but
I don't think it should ever differ from the stdin_encoding.
msg73458 - (view) Author: Vlastimil Brom (vbr) Date: 2008-09-20 07:38
While I am not sure about the status of this somewhat older issue, I 
just wanted to mention that the behaviour remains the same in Python 
3.0rc1 (Win XP Home SP3, Czech).

Python 3.0rc1 (r30rc1:66507, Sep 18 2008, 14:47:08) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> input("ěšč: ")
─Ť┼í─Ź: řžý
'řžý'
>>> print("ěšč: ")
ěšč:
>>>

Is the patch above supposed to have been committed, or are there still 
other difficulties?
(Not that it is a huge problem for me, as applications dealing with 
non-ASCII text would probably use a GUI rather than rely on a 
console, but it is somewhat surprising.)
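
Since print() renders these characters correctly in the same console, one possible stopgap is to write the prompt through the text stream and then call input() with no prompt. This is only a sketch: `prompt_input` is a hypothetical helper name, and the `out` and `read` parameters exist just to make it easy to exercise without a real console.

```python
import sys


def prompt_input(prompt, out=None, read=input):
    """Write the prompt via a text stream, then read a line without a prompt."""
    out = sys.stdout if out is None else out
    out.write(str(prompt))
    out.flush()  # make sure the prompt is visible before blocking on input
    return read()


# Intended interactive use (not executed here, since it waits for input):
#     value = prompt_input("ěšč: ")
```

The prompt then takes the same path as print() output, so it should be encoded the same way; whether that holds on a given console has not been verified here.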
msg73462 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-20 10:46
Amaury, what further review of the patch do you desire? I had already
commented that I consider the patch correct, except that it might use
stdout_encoding instead.

Also, I wouldn't consider this a release blocker. It is somewhat
annoying that input() produces mojibake in certain cases (i.e. non-ASCII
characters in the prompt, and a non-UTF-8 terminal), but if the patch
doesn't make it into 3.0, we can still fix it in 3.0.1.
msg73464 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-09-20 15:04
Given MvL's review, assuming it fixes the Czech problem, I'm all for
applying it.
msg73471 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-20 20:20
Here is a new version of the patch: the PyString* functions were renamed
to PyBytes*, and it now uses stdout_encoding.

About the "release blocker" status: I agree it is not so important, I
just wanted to express my "it's been here for long, it's almost ready,
it would be a pity not to have it in the final 3.0" feelings.
msg73527 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-09-21 20:32
I'm ok with this patch.
msg73536 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-21 22:11
Committed r66545.
History
Date User Action Args
2008-09-21 22:11:31  amaury.forgeotdarc  set  status: open -> closed
    resolution: fixed
    messages: + msg73536
2008-09-21 20:32:00  benjamin.peterson  set  keywords: - needs review
    nosy: + benjamin.peterson
    messages: + msg73527
2008-09-20 20:20:22  amaury.forgeotdarc  set  files: + inputprompt.patch
    keywords: + patch
    messages: + msg73471
2008-09-20 15:04:33  gvanrossum  set  messages: + msg73464
2008-09-20 10:46:40  loewis  set  messages: + msg73462
2008-09-20 08:59:17  amaury.forgeotdarc  set  priority: normal -> release blocker
    keywords: + needs review
2008-09-20 07:38:10  vbr  set  messages: + msg73458
2008-01-11 08:36:32  loewis  set  messages: + msg59695
2008-01-11 00:19:28  amaury.forgeotdarc  set  messages: + msg59685
2008-01-06 22:29:44  admin  set  keywords: - py3k
    versions: Python 3.0
2008-01-03 19:18:31  loewis  set  messages: + msg59144
2008-01-03 18:51:15  christian.heimes  set  messages: + msg59142
2008-01-03 18:40:34  gvanrossum  set  messages: + msg59141
2008-01-03 18:19:44  christian.heimes  set  messages: + msg59140
2008-01-03 06:15:00  gvanrossum  set  priority: normal
    nosy: + gvanrossum, christian.heimes
    messages: + msg59125
    keywords: + py3k
2007-12-30 22:24:01  amaury.forgeotdarc  set  nosy: + amaury.forgeotdarc
2007-12-29 19:53:16  vbr  set  messages: + msg59039
2007-12-23 10:11:16  loewis  set  nosy: + loewis
    messages: + msg58969
2007-12-22 21:09:32  vbr  create