Message 120414 - Python tracker


Author	vstinner
Recipients	BreamoreBoy, David.Sankel, amaury.forgeotdarc, brian.curtin, christian.heimes, christoph, ezio.melotti, lemburg, mark, pitrou, ssbarnea, tim.golden, tzot, v+python, vstinner
Date	2010-11-04.15:09:58
Message-id	<1288883402.85.0.253841988342.issue1602@psf.upfronthosting.co.za>
In-reply-to

Content

I wrote a small function to call WriteConsoleOutputA() and WriteConsoleOutputW() from Python to do some tests. It works correctly, except when I change the code page using the chcp command. It looks like the problem is that the chcp command changes both the console code page and the ANSI code page, whereas it should only change the ANSI code page (and not the console code page).


chcp command
============

The chcp command changes the console code page, but in practice the console still expects the OEM code page (e.g. cp850 on my French setup). Example:

C:\...> python.exe -c "import sys; print(sys.stdout.encoding)"
cp850
C:\...> chcp 65001
C:\...> python.exe
Fatal Python error: Py_Initialize: can't initialize sys standard streams
LookupError: unknown encoding: cp65001
C:\...> SET PYTHONIOENCODING=utf-8
C:\...> python.exe
>>> import sys
>>> sys.stdout.write("\xe9\n")
Ã©
2
>>> sys.stdout.buffer.write("\xe9\n".encode("utf8"))
Ã©
3
>>> sys.stdout.buffer.write("\xe9\n".encode("cp850"))
é
2
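
The numbers returned in the session above (2, 3, 2) are simply the count of characters or bytes written. A minimal sketch of the byte sequences involved, using only standard codecs. That decoding the UTF-8 bytes as cp1252 reproduces the "Ã©" seen above is an observation about the bytes, not a claim about what the console does internally:

```python
text = "\xe9\n"  # "é" followed by a newline

utf8_bytes = text.encode("utf-8")    # two bytes for "é", one for "\n"
cp850_bytes = text.encode("cp850")   # one byte for "é", one for "\n"

# Lengths match the values returned by the write() calls in the session.
print(len(text), len(utf8_bytes), len(cp850_bytes))

# Reading the UTF-8 bytes with a single-byte Western code page
# turns "é" into two characters (the mojibake).
print(repr(utf8_bytes.decode("cp1252")))
```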

os.device_encoding(1) uses GetConsoleOutputCP(), which returns 65001. Maybe it should use GetOEMCP() instead? Or should the chcp command be fixed?
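
To compare the code pages discussed here, something like the following sketch can be used. It calls the kernel32 functions GetConsoleOutputCP(), GetConsoleCP(), GetOEMCP() and GetACP() through ctypes; the helper name console_code_pages() is hypothetical, and the function returns None on non-Windows platforms:

```python
import ctypes
import os
import sys


def console_code_pages():
    """Return the code pages reported by Windows, or None elsewhere."""
    if sys.platform != "win32":
        return None
    kernel32 = ctypes.WinDLL("kernel32")
    return {
        "console_output": kernel32.GetConsoleOutputCP(),
        "console_input": kernel32.GetConsoleCP(),
        "oem": kernel32.GetOEMCP(),
        "ansi": kernel32.GetACP(),
    }


pages = console_code_pages()
if pages is not None:
    print(pages)
    # Encoding Python associates with file descriptor 1 (stdout), if any.
    print(os.device_encoding(1))
```

Running it before and after "chcp 65001" should show which of the values the command changes.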

Setting the console code page looks like a bad idea, because if I type "é" on my keyboard, a random character (e.g. U+0002) is displayed instead...


WriteConsoleOutputA() and WriteConsoleOutputW()
===============================================

Without touching the code page
------------------------------

If the character can be rendered by the current font (e.g. U+00E9): WriteConsoleOutputA() and WriteConsoleOutputW() work correctly.

If the character cannot be rendered by the current font, but there is a replacement character (e.g. U+0141 replaced by U+0041): WriteConsoleOutputA() cannot be used (U+0141 cannot be encoded to the code page), WriteConsoleOutputW() writes U+0141 but the console contains U+0041 (I checked using ReadConsoleOutputW()) and U+0041 is displayed. It works like the mbcs encoding; the behaviour looks correct.

If the character cannot be rendered by the current font, and there is no replacement character (e.g. U+042D): WriteConsoleOutputA() cannot be used (U+042D cannot be encoded to the code page), WriteConsoleOutputW() writes U+042D but U+003d (?) is displayed instead. The behaviour looks correct.
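
The "cannot be encoded to the code page" part of the three cases above can be checked from Python alone. A minimal sketch, assuming cp850 as the code page (as on the setup described earlier); the helper encodable() is hypothetical. Note that Python's cp850 codec does no best-fit mapping, so it will not reproduce the console's U+0141 to U+0041 replacement:

```python
def encodable(char, codec="cp850"):
    """Return True if char can be encoded with the given codec."""
    try:
        char.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


for char in ("\xe9", "\u0141", "\u042d"):
    # With errors="replace", unencodable characters become b"?".
    print("U+%04X" % ord(char), encodable(char),
          char.encode("cp850", "replace"))
```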

chcp 65001
----------

With the "chcp 65001" command (plus "set PYTHONIOENCODING=utf-8" to avoid the fatal error), it gets worse: the result depends on the font...

Using a raster font:
 - (ANSI) writing "\xe9".encode("cp850") using WriteConsoleOutputA() displays U+00e9 (é), even though the console output code page is cp65001 (I checked using GetConsoleOutputCP())
 - (ANSI) writing "\xe9".encode("utf-8") using WriteConsoleOutputA() displays Ã© (mojibake!)
 - (UNICODE) writing "\xe9" using WriteConsoleOutputW() displays... a random character (U+0002, U+0008, U+0069, U+00b0, ...)

Using Lucida (TrueType font):
 - (ANSI) writing "\xe9".encode("cp850") using WriteConsoleOutputA() displays U+0000 !?
 - (UNICODE) writing "\xe9" using WriteConsoleOutputW() works correctly (displays U+00e9); even "\u0141" works correctly (displays U+0141)
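
For reference, a test like the UNICODE ones above could be written with ctypes roughly as follows. This is only a sketch, not the function used for the tests in this message: the helper names make_cells() and write_console_output_w() are hypothetical, the structures are declared by hand from the Windows console API, only BMP characters are handled, and the call is skipped (returning None) on non-Windows platforms.

```python
import ctypes
import sys

STD_OUTPUT_HANDLE = 0xFFFFFFF5  # (DWORD)-11


class COORD(ctypes.Structure):
    _fields_ = [("X", ctypes.c_short), ("Y", ctypes.c_short)]


class SMALL_RECT(ctypes.Structure):
    _fields_ = [("Left", ctypes.c_short), ("Top", ctypes.c_short),
                ("Right", ctypes.c_short), ("Bottom", ctypes.c_short)]


class _CharUnion(ctypes.Union):
    # UnicodeChar is a 16-bit WCHAR on Windows; stored here as a code unit.
    _fields_ = [("UnicodeChar", ctypes.c_ushort),
                ("AsciiChar", ctypes.c_char)]


class CHAR_INFO(ctypes.Structure):
    _fields_ = [("Char", _CharUnion), ("Attributes", ctypes.c_ushort)]


def make_cells(text, attributes=0x07):
    """Build a CHAR_INFO array for text (BMP characters only)."""
    cells = (CHAR_INFO * len(text))()
    for i, ch in enumerate(text):
        cells[i].Char.UnicodeChar = ord(ch)
        cells[i].Attributes = attributes  # 0x07: grey on black
    return cells


def write_console_output_w(text, x=0, y=0):
    """Write text at column x, row y of the screen buffer.

    Returns True/False on Windows, None on other platforms.
    """
    if sys.platform != "win32" or not text:
        return None
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetStdHandle.argtypes = [ctypes.c_ulong]
    kernel32.GetStdHandle.restype = ctypes.c_void_p
    kernel32.WriteConsoleOutputW.argtypes = [
        ctypes.c_void_p, ctypes.POINTER(CHAR_INFO),
        COORD, COORD, ctypes.POINTER(SMALL_RECT)]
    kernel32.WriteConsoleOutputW.restype = ctypes.c_int

    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    cells = make_cells(text)
    # Target rectangle: one row, len(text) columns, inclusive bounds.
    region = SMALL_RECT(x, y, x + len(text) - 1, y)
    ok = kernel32.WriteConsoleOutputW(
        handle, cells, COORD(len(text), 1), COORD(0, 0),
        ctypes.byref(region))
    return bool(ok)


if __name__ == "__main__":
    print(write_console_output_w("\xe9\u0141"))
```

The ANSI variant would call WriteConsoleOutputA() and fill AsciiChar with a single encoded byte per cell instead.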

History
Date	User	Action	Args
2010-11-04 15:10:03	vstinner	set	recipients: + vstinner, lemburg, tzot, amaury.forgeotdarc, pitrou, christian.heimes, tim.golden, mark, christoph, ezio.melotti, v+python, ssbarnea, brian.curtin, BreamoreBoy, David.Sankel
2010-11-04 15:10:02	vstinner	set	messageid: <1288883402.85.0.253841988342.issue1602@psf.upfronthosting.co.za>
2010-11-04 15:09:59	vstinner	link	issue1602 messages
2010-11-04 15:09:58	vstinner	create