Issue1602
Created on 2007-12-12 09:56 by mark, last changed 2009-10-26 17:06 by v+python.
| Messages (11) | |||
|---|---|---|---|
| msg58487 - (view) | Author: Mark Summerfield (mark) | Date: 2007-12-12 09:56 | |
I am not sure if this is a Python bug or simply a limitation of cmd.exe.
I am using Windows XP Home.
I run cmd.exe with the /u option and I have set my console font to
"Lucida Console" (the only TrueType font offered), and I run chcp 65001
to set the utf8 code page.
When I run the following program:
for x in range(32, 2000):
    print("{0:5X} {0:c}".format(x))
one blank line is output.
But if I do chcp 1252 the program prints up to 7F before hitting a
Unicode encoding error.
This is different behaviour from Python 2.5.1 which (with a suitably
modified print line) after chcp 65001 prints up to 7F and then fails
with "IOError: [Errno 0] Error".
|
|||
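The cp1252 half of this report can be checked without a console, since it depends only on the codec. The following is a minimal sketch, assuming Python 3; `first_unencodable` is a hypothetical helper introduced for illustration and is not part of the original report. It looks for the first code point in the same range that a given codec refuses to encode.

```python
def first_unencodable(encoding, start=32, stop=2000):
    """Return the first code point in [start, stop) that `encoding`
    cannot encode, or None if every code point encodes cleanly."""
    for x in range(start, stop):
        try:
            chr(x).encode(encoding)
        except UnicodeEncodeError:
            return x
    return None


# cp1252 has no mapping for U+0080, which matches "prints up to 7F".
print(first_unencodable("cp1252"))
# UTF-8 can encode every code point in this range.
print(first_unencodable("utf-8"))
```

This only exercises the encoder. The blank-line behaviour under chcp 65001 involves the console itself and is not reproduced here.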
| msg58621 - (view) | Author: Mark Summerfield (mark) | Date: 2007-12-14 11:31 | |
I've looked into this a bit more, and from what I can see, code page 65001 just doesn't work, so it is a Windows problem, not a Python problem. A possible solution might be to read/write UTF-16, which "managed" Windows applications can do. |
|||
| msg58651 - (view) | Author: Christian Heimes (christian.heimes) | Date: 2007-12-15 02:08 | |
We are aware of multiple Windows-related problems. We are planning to rewrite parts of the Windows-specific API to use the widechar variants. Maybe that will help. |
|||
| msg87086 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2009-05-03 23:57 | |
Yes, it is a Windows problem. There simply doesn't seem to be a true Unicode codepage for command-line apps. Recommend closing. |
|||
| msg88059 - (view) | Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) | Date: 2009-05-19 00:08 | |
Just in case it helps, this behaviour is on Win XP Pro, Python 2.5.1:
First, I added an alias for 'cp65001' to 'utf_8' in Lib/encodings/aliases.py.
Then, I opened a command prompt with a bitmap font.
c:\windows\system32>python
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> print u"\N{EM DASH}"
—
I switched the font to Lucida Console, and retried (without exiting the python interpreter, although the behaviour is the same when exiting and entering again):
>>> print u"\N{EM DASH}"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 13] Permission denied
Then I tried (by pressing Alt+0233 for é, which is invalid in my normal cp1253 codepage):
>>> print u"née"
and the interpreter exits without any information. So it does for:
>>> a=u"née"
Then I created a UTF-8 text file named 'test65001.py':
# -*- coding: utf_8 -*-
a=u"néeα"
print a
and tried to run it directly from the command line:
c:\windows\system32>python d:\src\PYTHON\test65001.py
néeαTraceback (most recent call last):
  File "d:\src\PYTHON\test65001.py", line 4, in <module>
    print a
IOError: [Errno 2] No such file or directory
You see? It printed all the characters before failing.
Also the following works:
c:\windows\system32>echo heéε
heéε
and
c:\windows\system32>echo heéε >D:\src\PYTHON\dummy.txt
creates successfully a UTF-8 file (without any UTF-8 BOM marks at the beginning).
So it's possible that it is a python bug, or at least something can be done about it. |
|||
| msg88077 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) | Date: 2009-05-19 09:46 | |
an immediate thing to do is to declare cp65001 as an encoding:
Index: Lib/encodings/aliases.py
===================================================================
--- Lib/encodings/aliases.py (revision 72757)
+++ Lib/encodings/aliases.py (working copy)
@@ -511,6 +511,7 @@
     'utf8' : 'utf_8',
     'utf8_ucs2' : 'utf_8',
     'utf8_ucs4' : 'utf_8',
+    'cp65001' : 'utf_8',
     ## uu_codec codec
     #'uu' : 'uu_codec',
This is not enough unfortunately, because the win32 API function WriteFile() returns the number of characters written, not the number of (utf8) bytes:
>>> print("\u0124\u0102" + 'abc')
ĤĂabc
c
[44420 refs]
>>>
Additionally, there is a bug in the ReadFile, which returns an empty string (and no error) when a non-ascii character is entered, which is the behavior of an EOF condition...
Maybe the solution is to use the win32 console API directly... |
|||
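The stray `c` in the session above is consistent with a write loop that trusts a character count as if it were a byte count. The following is a minimal simulation, assuming the caller retries until every byte is reported as written; `fake_console_write` and `write_all` are hypothetical stand-ins introduced for illustration, not the real WriteFile() or Python's I/O layer.

```python
def fake_console_write(sink, data):
    """Hypothetical console write: emits all bytes but reports the
    number of *characters* written instead of the number of bytes."""
    sink.append(data)
    return len(data.decode("utf-8"))


def write_all(sink, data):
    """Retry loop that assumes the return value is a byte count."""
    while data:
        written = fake_console_write(sink, data)
        data = data[written:]  # wrongly treats the tail as unwritten


sink = []
# print() sends the text plus a newline: 6 characters, 8 UTF-8 bytes.
write_all(sink, ("\u0124\u0102" + "abc" + "\n").encode("utf-8"))
output = b"".join(sink).decode("utf-8")
print(repr(output))
```

The first call reports 6 for 8 bytes, so the loop believes the last two bytes (`c` and the newline) are still pending and writes them again.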
| msg92854 - (view) | Author: Χρήστος Γεωργίου (Christos Georgiou) (tzot) | Date: 2009-09-19 00:38 | |
Another note: if one creates a dummy Stream object (having a softspace attribute and a write method that writes using os.write, as in http://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/1432462#1432462 ) to replace sys.stdout and sys.stderr, then writes occur correctly, without issues. Prerequisites: chcp 65001, Lucida Console font and cp65001 as an alias for UTF-8 in encodings/aliases.py. This is Python 2.5.4 on Windows. |
|||
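A stream object of the kind described in the message above might look like the following. This is a rough sketch written for Python 3, not the code from the linked answer; the class name `RawUtf8Writer` is hypothetical, and the `softspace` attribute is kept only because the Python 2 print statement expects it.

```python
import os


class RawUtf8Writer:
    """Minimal stdout replacement: encodes text as UTF-8 and hands the
    bytes straight to os.write(), bypassing the buffered file object."""

    def __init__(self, fd):
        self.fd = fd
        self.softspace = 0  # only used by the Python 2 print statement

    def write(self, text):
        os.write(self.fd, text.encode("utf-8"))

    def flush(self):
        pass  # nothing is buffered at this level


# Demonstration on a pipe rather than the real console.
read_fd, write_fd = os.pipe()
RawUtf8Writer(write_fd).write("n\u00e9e\u03b1")
os.close(write_fd)
received = os.read(read_fd, 100)
os.close(read_fd)
print(received)
```

To use it as described, one would assign `sys.stdout = RawUtf8Writer(1)` and `sys.stderr = RawUtf8Writer(2)`. Whether that avoids the console errors depends on the Windows and Python versions involved.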
| msg94445 - (view) | Author: Glenn Linderman (v+python) | Date: 2009-10-25 00:06 | |
With Python 3.1.1, the following batch file seems to be necessary to use UTF-8 successfully from an XP console:
set PYTHONIOENCODING=UTF-8
cmd /u /k chcp 65001
set PYTHONIOENCODING=
exit
the cmd line seems to be necessary because of Windows having compatibility issues, but it seems that Python should notice the cp65001 and not need the PYTHONIOENCODING stuff. |
|||
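The effect of PYTHONIOENCODING in the batch file above can be observed from Python itself. The following is a small sketch, assuming Python 3 and that spawning a child interpreter is permitted; `child_stdout_encoding` is a hypothetical helper introduced for illustration. It starts a child process with the variable set and reports the encoding the child chose for its standard output.

```python
import codecs
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set and return the
    normalized name of the codec it picked for sys.stdout."""
    env = dict(os.environ)
    env["PYTHONIOENCODING"] = io_encoding
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    # codecs.lookup() normalizes spellings such as "UTF-8" and "utf8".
    return codecs.lookup(out.decode("ascii").strip()).name


print(child_stdout_encoding("UTF-8"))
```

This shows only that the variable overrides the encoding Python would otherwise pick. It does not by itself make the console render the output correctly.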
| msg94480 - (view) | Author: Mark Summerfield (mark) | Date: 2009-10-26 09:07 | |
Glenn Linderman's fix pretty well works for me on XP Home. I can print every Unicode character up to and including U+D7FF (although most just come out as rectangles, at least I don't get encoding errors).
It fails at U+D800 with message:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 17: surrogates not allowed
I also tried U+D801 and got the same error.
Nonetheless, this is *much* better than before. |
|||
| msg94483 - (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2009-10-26 09:19 | |
Mark Summerfield wrote:
>
> Mark Summerfield <mark@qtrac.eu> added the comment:
>
> Glenn Linderman's fix pretty well works for me on XP Home. I can print
> every Unicode character up to and including U+D7FF (although most just
> come out as rectangles, at least I don't get encoding errors).
>
> It fails at U+D800 with message:
>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in
> position 17: surrogates not allowed
>
> I also tried U+D801 and got the same error.
That's normal and expected: D800 is the start of the surrogate range. Surrogates are only valid in pairs (in UTF-16) and cannot be encoded on their own in UTF-8. |
|||
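The boundary discussed above can be reproduced with the codecs alone. The following is a minimal sketch, assuming Python 3; `utf8_or_error` is a hypothetical helper introduced for illustration. It encodes the last code point before the surrogate range, then a lone surrogate, and finally shows where surrogates do legitimately appear: as a pair in the UTF-16 form of a code point above U+FFFF.

```python
def utf8_or_error(code_point):
    """Return the UTF-8 bytes for a code point, or the codec's error
    reason if the code point cannot be encoded."""
    try:
        return chr(code_point).encode("utf-8")
    except UnicodeEncodeError as exc:
        return exc.reason


print(utf8_or_error(0xD7FF))   # last code point before the surrogates
print(utf8_or_error(0xD800))   # lone surrogate: rejected by the codec

# U+10000 is stored as the surrogate pair D800 DC00 in UTF-16,
# but as an ordinary four-byte sequence in UTF-8.
print(chr(0x10000).encode("utf-16-be"))
print(chr(0x10000).encode("utf-8"))
```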
| msg94496 - (view) | Author: Glenn Linderman (v+python) | Date: 2009-10-26 17:06 | |
The choice of the Lucida Console or the Consolas font cures most of the rectangle problems. Those are just a limitation of the selected font for the console window. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2009-10-26 17:06:21 | v+python | set | messages: + msg94496 |
| 2009-10-26 09:19:55 | lemburg | set | nosy: + lemburg messages: + msg94483 |
| 2009-10-26 09:07:28 | mark | set | messages: + msg94480 |
| 2009-10-25 00:06:49 | v+python | set | nosy: + v+python messages: + msg94445 |
| 2009-09-19 00:38:48 | tzot | set | messages: + msg92854 |
| 2009-05-19 09:46:13 | amaury.forgeotdarc | set | messages: + msg88077 |
| 2009-05-19 07:54:22 | pitrou | set | nosy: + amaury.forgeotdarc |
| 2009-05-19 00:09:03 | tzot | set | nosy: + tzot messages: + msg88059 |
| 2009-05-03 23:57:04 | pitrou | set | nosy: + pitrou messages: + msg87086 |
| 2009-05-03 23:51:10 | haypo | set | nosy: haypo, christian.heimes, mark, ezio.melotti components: + Windows |
| 2009-05-03 23:50:37 | haypo | set | nosy: haypo, christian.heimes, mark, ezio.melotti components: - Windows |
| 2009-04-27 23:38:12 | ajaksu2 | set | nosy: + haypo, ezio.melotti versions: + Python 3.1 stage: test needed |
| 2008-01-06 22:29:44 | admin | set | keywords: - py3k versions: Python 3.0 |
| 2007-12-15 02:08:14 | christian.heimes | set | priority: low keywords: + py3k messages: + msg58651 nosy: + christian.heimes |
| 2007-12-14 11:31:28 | mark | set | messages: + msg58621 |
| 2007-12-12 09:56:30 | mark | create | |