classification
Title: windows console doesn't print utf8 (Py30a2)
Type: behavior
Stage: test needed
Components: Unicode, Windows
Versions: Python 3.1, Python 3.0

process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, christian.heimes, ezio.melotti, haypo, mark, pitrou, tzot (7)
Priority: low
Keywords:

Created on 2007-12-12 09:56 by mark, last changed 2009-05-19 09:46 by amaury.forgeotdarc.

Messages (6)
msg58487 - (view) Author: Mark Summerfield (mark) Date: 2007-12-12 09:56
I am not sure if this is a Python bug or simply a limitation of cmd.exe.

I am using Windows XP Home.
I run cmd.exe with the /u option and I have set my console font to
"Lucida Console" (the only TrueType font offered), and I run chcp 65001
to set the utf8 code page.
When I run the following program:

for x in range(32, 2000):
    print("{0:5X} {0:c}".format(x))

one blank line is output.

But if I do chcp 1252, the program prints up to 7F before hitting a
Unicode encoding error.

This is different behaviour from Python 2.5.1 which (with a suitably
modified print line) after chcp 65001 prints up to 7F and then fails
with "IOError: [Errno 0] Error".
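
The 7F cutoff under chcp 1252 can be reproduced without a console. The following is a minimal sketch, assuming the console encoding corresponds to Python's "cp1252" codec; first_unencodable is a hypothetical helper name and is not part of the original report.

```python
def first_unencodable(encoding, start=32, stop=2000):
    """Return the first code point in [start, stop) that `encoding`
    cannot encode, or None if every code point is encodable."""
    for x in range(start, stop):
        try:
            chr(x).encode(encoding)
        except UnicodeEncodeError:
            return x
    return None


# Same range as the test program in the report.
print("{0:X}".format(first_unencodable("cp1252")))  # prints 80
print(first_unencodable("utf-8"))                   # prints None
```

U+0080 is the first code point cp1252 cannot represent, which is consistent with output stopping after 7F. A UTF-8 console encoding would have no such gap in this range.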
msg58621 - (view) Author: Mark Summerfield (mark) Date: 2007-12-14 11:31
I've looked into this a bit more, and from what I can see, code page
65001 just doesn't work, so it is a Windows problem, not a Python problem.
A possible solution might be to read/write UTF-16, which "managed" Windows
applications can do.
msg58651 - (view) Author: Christian Heimes (christian.heimes) Date: 2007-12-15 02:08
We are aware of multiple Windows-related problems. We are planning to
rewrite parts of the Windows specific API to use the widechar variants.
Maybe that will help.
msg87086 - (view) Author: Antoine Pitrou (pitrou) Date: 2009-05-03 23:57
Yes, it is a Windows problem. There simply doesn't seem to be a true
Unicode codepage for command-line apps. Recommend closing.
msg88059 - (view) Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) Date: 2009-05-19 00:08
Just in case it helps, this behaviour is on Win XP Pro, Python 2.5.1:

First, I added an alias for 'cp65001' to 'utf_8' in
Lib/encodings/aliases.py .

Then, I opened a command prompt with a bitmap font.

c:\windows\system32>python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print u"\N{EM DASH}"
—

I switched the font to Lucida Console and retried without exiting the
Python interpreter (the behaviour is the same after exiting and
restarting it):

>>> print u"\N{EM DASH}"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 13] Permission denied

Then I tried (by pressing Alt+0233 for é, which is invalid in my normal
cp1253 codepage):

>>> print u"née"

and the interpreter exits without any information. The same happens for:

>>> a=u"née"

Then I created a UTF-8 text file named 'test65001.py':

# -*- coding: utf_8 -*-
a=u"néeα"
print a

and tried to run it directly from the command line:

c:\windows\system32>python d:\src\PYTHON\test65001.py
néeαTraceback (most recent call last):
  File "d:\src\PYTHON\test65001.py", line 4, in <module>
    print a
IOError: [Errno 2] No such file or directory

You see? It printed all the characters before failing.

Also the following works:

c:\windows\system32>echo heéε
heéε

and

c:\windows\system32>echo heéε >D:\src\PYTHON\dummy.txt

successfully creates a UTF-8 file (without a UTF-8 BOM at the
beginning).

So it's possible that it is a Python bug, or at least something can be
done about it.
msg88077 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) Date: 2009-05-19 09:46
An immediate thing to do is to declare cp65001 as an encoding:

Index: Lib/encodings/aliases.py
===================================================================
--- Lib/encodings/aliases.py    (revision 72757)
+++ Lib/encodings/aliases.py    (working copy)
@@ -511,6 +511,7 @@
     'utf8'               : 'utf_8',
     'utf8_ucs2'          : 'utf_8',
     'utf8_ucs4'          : 'utf_8',
+    'cp65001'            : 'utf_8',

     ## uu_codec codec
     #'uu'                 : 'uu_codec',
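
The effect of such an alias can be tried at runtime without patching Lib/encodings/aliases.py. This is only a sketch: it registers a codec search function through codecs.register, and it uses the hypothetical name "demo_cp65001" so that it does not depend on whether the running Python already knows "cp65001".

```python
import codecs


def _demo_search(name):
    # codecs passes the normalized (lower-cased) encoding name here.
    # "demo_cp65001" is a made-up alias for illustration only.
    if name == "demo_cp65001":
        return codecs.lookup("utf_8")
    return None


codecs.register(_demo_search)

# The alias now encodes and decodes exactly like UTF-8.
encoded = "n\u00e9e".encode("demo_cp65001")
decoded = encoded.decode("demo_cp65001")
print(encoded)  # prints b'n\xc3\xa9e'
```

A search function only affects the current process, so it is useful for experiments; the aliases.py change above is what would make the name available everywhere.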

Unfortunately this is not enough, because the Win32 API function
WriteFile() returns the number of characters written, not the number of
(UTF-8) bytes:

>>> print("\u0124\u0102" + 'abc')
ĤĂabc
c
[44420 refs]
>>>
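
The duplicated "c" is what a retry loop would produce if the count really is in characters. The sketch below simulates that; fake_write is a hypothetical stand-in for the reported behaviour, not a call into the Win32 API, and write_all is an assumed caller that treats the return value as a byte count.

```python
def fake_write(sink, data):
    """Hypothetical stand-in for the reported behaviour: write every
    byte, but report the number of *characters* written."""
    sink.extend(data)
    return len(data.decode("utf-8"))


def write_all(sink, data):
    """Retry loop that trusts the return value as a byte count."""
    while data:
        written = fake_write(sink, data)
        # Too few bytes are skipped whenever the data contains
        # multi-byte characters, so the tail is sent a second time.
        data = data[written:]


sink = bytearray()
write_all(sink, ("\u0124\u0102" + "abc" + "\n").encode("utf-8"))
print(sink.decode("utf-8"), end="")
```

The string is 8 bytes but 6 characters, so the loop resends the last 2 bytes ("c" and the newline), matching the transcript above. With other inputs the resent tail could start in the middle of a multi-byte sequence, which this sketch does not handle.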

Additionally, there is a bug in ReadFile(), which returns an empty
string (and no error) when a non-ASCII character is entered; that is
indistinguishable from an EOF condition...

Maybe the solution is to use the win32 console API directly...
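
As a rough illustration of that idea, the sketch below calls the Win32 function WriteConsoleW through ctypes. It is an assumption about how one might experiment with the approach, not what Python does; write_console and utf16_units are hypothetical helper names, and on non-Windows platforms the function simply falls back to sys.stdout.

```python
import sys


def utf16_units(text):
    """Number of UTF-16 code units, the unit WriteConsoleW counts in."""
    return len(text.encode("utf-16-le")) // 2


def write_console(text):
    """Write text with WriteConsoleW on Windows; elsewhere fall back to
    sys.stdout. Returns the number of UTF-16 code units written."""
    if sys.platform != "win32":
        sys.stdout.write(text)
        return utf16_units(text)

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), wintypes.LPVOID,
    ]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    STD_OUTPUT_HANDLE = -11 & 0xFFFFFFFF  # (DWORD)-11
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = wintypes.DWORD(0)
    # This call fails when stdout is redirected to a file or pipe,
    # because the handle is then not a console handle.
    ok = kernel32.WriteConsoleW(handle, text, utf16_units(text),
                                ctypes.byref(written), None)
    if not ok:
        raise ctypes.WinError(ctypes.get_last_error())
    return written.value
```

A call such as write_console("n\u00e9e\n") would bypass the console code page entirely, since the text is handed over as UTF-16. Whether a glyph is actually displayed still depends on the console font.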
History
Date                 User                Action  Args
2009-05-19 09:46:13  amaury.forgeotdarc  set     messages: + msg88077
2009-05-19 07:54:22  pitrou              set     nosy: + amaury.forgeotdarc
2009-05-19 00:09:03  tzot                set     nosy: + tzot; messages: + msg88059
2009-05-03 23:57:04  pitrou              set     nosy: + pitrou; messages: + msg87086
2009-05-03 23:51:10  haypo               set     nosy: haypo, christian.heimes, mark, ezio.melotti; components: + Windows
2009-05-03 23:50:37  haypo               set     nosy: haypo, christian.heimes, mark, ezio.melotti; components: - Windows
2009-04-27 23:38:12  ajaksu2             set     nosy: + haypo, ezio.melotti; versions: + Python 3.1; stage: test needed
2008-01-06 22:29:44  admin               set     keywords: - py3k; versions: Python 3.0
2007-12-15 02:08:14  christian.heimes    set     priority: low; keywords: + py3k; messages: + msg58651; nosy: + christian.heimes
2007-12-14 11:31:28  mark                set     messages: + msg58621
2007-12-12 09:56:30  mark                create