Author haypo
Recipients David.Sankel, amaury.forgeotdarc, brian.curtin, christian.heimes, christoph, davidsarah, ezio.melotti, haypo, lemburg, mark, pitrou, sorin, terry.reedy, tim.golden, tzot, v+python
Date 2011-01-14.23:31:44
Message-id <1295047909.67.0.279399668122.issue1602@psf.upfronthosting.co.za>
Content
Here are some results of my test of unicode2.py. I'm testing py3k on Windows XP, OEM: cp850, ANSI: cp1252.

Raster fonts
------------

With a fresh console, unicode2.py displays "?????????????????". input() accepts characters encodable to the OEM code page.

If I set the code page to 65001 (either with the chcp program plus "set PYTHONIOENCODING=utf-8", or with SetConsoleCP() + SetConsoleOutputCP()), it displays weird characters. input() accepts ASCII characters, but non-ASCII characters (encodable to the console and OEM code pages) are displayed as weird characters (smileys! control characters?).

Lucida console
--------------

With my system code page (OEM: cp850), characters not encodable to the code pages are displayed correctly. I can type some non-ASCII characters (those encodable to the code page). If I copy/paste characters not encodable to the code page, they are replaced by a similar glyph (e.g. Ł => L) or by ? (€ => ?).

If I set the code page to 65001, all characters are still correctly displayed. But I cannot type non-ASCII characters anymore: input() fails with EOFError (I suppose that Python gets control characters).

Redirect output to a pipe
-------------------------

I patched unicode2.py to use sys.stdout.buffer instead of sys.stdout for the UnicodeOutput stream. I also patched UnicodeOutput to replace \n by \r\n.

It works correctly with any character. No UTF-8 BOM is written. But "Here 1" is written at the end. I suppose that sys.stdout should be flushed before the creation of UnicodeOutput.
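The patch described above can be sketched roughly as follows. This is not the actual unicode2.py code: PipeUnicodeOutput and demo() are hypothetical names, and the sketch only covers the redirected (pipe) case, assuming the stream is a binary one such as sys.stdout.buffer.

```python
import io


class PipeUnicodeOutput:
    """Hypothetical stand-in for the patched UnicodeOutput (pipe case only)."""

    def __init__(self, raw):
        # raw is a binary stream, e.g. sys.stdout.buffer
        self.raw = raw

    def write(self, text):
        written = len(text)
        # Normalize first so an existing "\r\n" does not become "\r\r\n".
        text = text.replace("\r\n", "\n").replace("\n", "\r\n")
        # The plain "utf-8" codec never emits a BOM ("utf-8-sig" would).
        self.raw.write(text.encode("utf-8"))
        return written

    def flush(self):
        self.raw.flush()


def demo():
    """Write the sample string into an in-memory sink and return the bytes."""
    sink = io.BytesIO()
    out = PipeUnicodeOutput(sink)
    out.write("€-Ł\n")
    out.flush()
    return sink.getvalue()
```

To avoid the misplaced "Here 1" line, one would presumably call sys.stdout.flush() before replacing sys.stdout with PipeUnicodeOutput(sys.stdout.buffer), so that text still buffered in the old text stream is written first.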

But it always uses UTF-8. I don't know if UTF-8 is well supported by applications on Windows.

Without unicode2.py, only characters encodable to OEM code page are supported, and \n is used as end of line string.

Let's try to summarize
----------------------

Tests:
 d1) Display characters encodable to the console code page
 t1) Type characters encodable to the console code page
 d2) Display characters not encodable to any code page
 t2) Type characters not encodable to any code page

I'm using Windows with OEM=cp850 and ANSI=cp1252. For test (t2), I copy €-Ł and paste it to the console (right click on the window title > Edit > Paste).
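To check which characters fall into which test category, a small helper like the one below can be used. It is only a sketch: encodable() and classify() are names invented here, and the check relies on Python's own cp850 and cp1252 codecs, which may not match the console's behaviour exactly.

```python
def encodable(text, codec):
    """Return True if text can be encoded with codec without errors."""
    try:
        text.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def classify(text, codecs=("cp850", "cp1252")):
    """Map each character of text to the list of codecs that can encode it."""
    return {ch: [c for c in codecs if encodable(ch, c)] for ch in text}


# Sample used for (t2): neither € nor Ł is in cp850; Ł is not in cp1252 either.
SAMPLE = "€-Ł"
STRICT_FALLBACK = SAMPLE.encode("cp850", errors="replace")
```

Note that Python's "replace" error handler turns every unencodable character into "?", whereas the console results in this message show Ł becoming L, which suggests that Windows applies some best-fit mapping of its own.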

Raster fonts, console=cp850:

d1) ok
t1) ok
d2) FAIL: €-Ł is displayed ?-L
t2) FAIL: €-Ł is read as ?-L

Raster fonts, console=cp65001:

d1) FAIL: é is displayed as 2 strange glyphs
t1) FAIL: EOFError
d2) FAIL: only display unreadable glyphs
t2) FAIL: EOFError
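
One possible explanation for the "2 strange glyphs" in (d1), which I did not verify: é is two bytes in UTF-8, and a raster font may render each byte as a separate OEM glyph. The sketch below only simulates that assumption in Python; misread_as() is a hypothetical helper, not something the console exposes.

```python
def misread_as(text, actual="utf-8", assumed="cp850"):
    """Encode text with one codec and decode the bytes with another.

    This simulates a display that receives UTF-8 bytes but interprets
    each byte as a character of the OEM code page.
    """
    return text.encode(actual).decode(assumed, errors="replace")


# é is a single character, but its UTF-8 form is two bytes,
# so the misread result has two characters.
GLYPHS = misread_as("é")
```

ASCII text is unaffected by such a misreading, since ASCII bytes are the same in both encodings.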

Lucida console, console=cp850:

d1) ok
t1) ok
d2) ok
t2) FAIL: €-Ł is read as ?-L

Lucida console, console=cp65001:

d1) ok
t1) FAIL: EOFError
d2) ok
t2) FAIL: EOFError

So, setting the console code page to 65001 doesn't solve any issue, but it breaks the input (input with the keyboard or pasting text).

With Raster fonts or Lucida console, it's possible to display characters encodable to the code page. But that is not new: it's already possible with Python 3. For characters not encodable to the code page, however, it works with unicode2.py and Lucida console, which is something new :-)

For the input, I suppose that we also need to use a Windows console function to support unencodable characters.
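
A minimal sketch of what such an input function might look like, assuming ReadConsoleW() is the console function to use (this message does not name one). It is untested here, read_console_line() and strip_eol() are hypothetical names, and it only works when stdin is a real console, not a pipe or a file.

```python
import ctypes
import sys

STD_INPUT_HANDLE = -10  # Win32 constant for the standard input handle


def strip_eol(text):
    """Remove one trailing line ending (CRLF or LF), if present."""
    if text.endswith("\r\n"):
        return text[:-2]
    if text.endswith("\n"):
        return text[:-1]
    return text


def read_console_line(max_chars=1024):
    """Read one line from the console as wide characters (Windows only)."""
    if sys.platform != "win32":
        raise OSError("ReadConsoleW is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    # Handles are pointer-sized, so do not let ctypes truncate them to int.
    kernel32.GetStdHandle.restype = ctypes.c_void_p
    handle = kernel32.GetStdHandle(
        ctypes.c_ulong(STD_INPUT_HANDLE & 0xFFFFFFFF))
    buf = ctypes.create_unicode_buffer(max_chars)
    nread = ctypes.c_ulong(0)
    ok = kernel32.ReadConsoleW(ctypes.c_void_p(handle), buf,
                               ctypes.c_ulong(max_chars),
                               ctypes.byref(nread), None)
    if not ok:
        # Fails, for example, when stdin is redirected and is not a console.
        raise ctypes.WinError(ctypes.get_last_error())
    return strip_eol(buf[:nread.value])
```

Because the text arrives as wide characters, it never passes through the console code page, which is why this route might avoid the ?-L replacement seen in (t2).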