Author ezio.melotti
Recipients Rhamphoryncus, amaury.forgeotdarc, ezio.melotti, lemburg, loewis, vstinner
Date 2010-07-09.09:49:04
Message-id <1278668949.82.0.0997720973536.issue9198@psf.upfronthosting.co.za>
Content
Here is a patch to "fix" sys_displayhook. Note that the patch is just a proof of concept: it seems to work fine, but I still have to clean it up, add comments, rename and reorganize some variables, and add tests.
This is example output when using iso-8859-1 as the IO encoding:

wolf@linuxvm:~/dev/py3k$ PYTHONIOENCODING=iso-8859-1 ./python
Python 3.2a0 (py3k:82643:82644M, Jul  9 2010, 11:39:25)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; sys.stdout.encoding, sys.stdin.encoding
('iso-8859-1', 'iso-8859-1')
>>> 'ascii string'
'ascii string'  # works fine
>>> 'some accented chars: öäå'
'some accented chars: öäå'  # works fine - these chars are encodable
>>> 'a snowman: \u2603'
'a snowman: \u2603'  # non-encodable - the char is escaped instead of raising an error
>>> 'snowman: \u2603, and accented öäå'
'snowman: \u2603, and accented öäå' # only non-encodable chars are escaped
>>> # the behavior of print is still the same:
>>> print('some accented chars: öäå') 
some accented chars: öäå
>>> print('a snowman: \u2603')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2603' in position 11: ordinal not in range(256)
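
The behavior shown in the session above can be approximated in pure Python. This is only a minimal sketch, not the patch itself (the patch changes the C implementation of sys_displayhook): the name escaping_displayhook and its optional stream parameter are made up for illustration, and unlike the real hook it does not store the value in builtins._.

```python
import io
import sys


def escaping_displayhook(value, stream=None):
    """Write repr(value), escaping only what the stream cannot encode.

    Hypothetical helper: name and signature are illustrative assumptions.
    """
    if value is None:
        return
    if stream is None:
        stream = sys.stdout
    encoding = getattr(stream, "encoding", None) or "utf-8"
    text = repr(value)
    # Encodable characters round-trip unchanged; the others are replaced
    # by backslash escapes such as \u2603 instead of raising an error.
    safe = text.encode(encoding, "backslashreplace").decode(encoding)
    stream.write(safe + "\n")


# Demo against an in-memory stream that uses iso-8859-1, as in the session.
raw = io.BytesIO()
latin1_stream = io.TextIOWrapper(raw, encoding="iso-8859-1", newline="\n")
escaping_displayhook("snowman: \u2603, and accented öäå", latin1_stream)
latin1_stream.flush()
demo_output = raw.getvalue()
```

Assigning the function to sys.displayhook would apply it at the interactive prompt. print() does not go through the display hook, so it would still raise UnicodeEncodeError for the snowman.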

-------------------------------------

While testing the patch with PYTHONIOENCODING=iso-8859-1 I also found a weird issue. It is *not* related to the patch, since I managed to reproduce it on a clean py3k build using the same PYTHONIOENCODING=iso-8859-1 setting:
>>> 'òàùèì  óáúéí  öäüëï'
'ò�\xa0ùèì  óáúé�\xad  öäüëï'
>>> 'òàùèì  óáúéí  öäüëï'.encode('iso-8859-1')
b'\xc3\xb2\xc3\xa0\xc3\xb9\xc3\xa8\xc3\xac  \xc3\xb3\xc3\xa1\xc3\xba\xc3\xa9\xc3\xad  \xc3\xb6\xc3\xa4\xc3\xbc\xc3\xab\xc3\xaf'
>>> 'òàùèì'.encode('utf-8')
b'\xc3\x83\xc2\xb2\xc3\x83\xc2\xa0\xc3\x83\xc2\xb9\xc3\x83\xc2\xa8\xc3\x83\xc2\xac'

I think there might be some conflict between the IO encoding that I specified and the one that my terminal actually uses, but I couldn't figure out exactly what's going on (it's also weird that only 'à' and 'í' are not displayed correctly). Unless this behavior is expected, I'll open another issue about it.
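
One possible explanation, sketched under the assumption that the terminal sends and renders UTF-8 while Python uses iso-8859-1 for stdin and stdout (the terminal's real encoding is not stated above, so this is a guess). The variable names are illustrative only.

```python
# Assumption: the terminal speaks UTF-8, but PYTHONIOENCODING=iso-8859-1.
typed = 'òàùèì  óáúéí  öäüëï'

# stdin: the terminal's UTF-8 bytes are decoded as iso-8859-1, so every
# accented letter turns into two characters ('Ã' plus a second one).
seen = typed.encode('utf-8').decode('iso-8859-1')

# repr() escapes non-printable characters. Among the second characters,
# only U+00A0 (from 'à') and U+00AD (from 'í') are non-printable.
shown = repr(seen)

# stdout: the repr is encoded as iso-8859-1 and the terminal decodes the
# bytes as UTF-8. A lone 0xC3 byte followed by a backslash is invalid
# UTF-8, so the terminal would show a replacement character there.
rendered = shown.encode('iso-8859-1').decode('utf-8', errors='replace')

# The two encode() calls from the session, applied to what Python saw.
as_latin1 = seen.encode('iso-8859-1')
as_utf8 = seen[:10].encode('utf-8')  # the first five typed letters
```

Under that assumption, rendered matches the odd repr above, as_latin1 equals the UTF-8 bytes of the typed text, and as_utf8 is the doubly encoded byte string, which would also explain why only 'à' and 'í' look broken.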