classification
Title: Should repr() print unicode characters outside the BMP?
Type: behavior
Stage:
Components: Unicode
Versions: Python 3.2
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To: ezio.melotti
Nosy List: Rhamphoryncus, amaury.forgeotdarc, belopolsky, eric.araujo, ezio.melotti, lemburg, loewis, vstinner
Priority: normal
Keywords: patch

Created on 2010-07-08 08:53 by amaury.forgeotdarc, last changed 2010-12-02 01:51 by vstinner. This issue is now closed.

Files
File name | Uploaded | Description
issue9198.diff | ezio.melotti, 2010-07-09 09:49 | Incomplete (but working) patch to "fix" displayhook.
Messages (15)
msg109520 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-07-08 08:53
On wide unicode builds, '\U00010000'.isprintable() returns True, and repr() returns the character unmodified.
Is this good behavior, given that very few fonts can display this character?

Marc-Andre Lemburg wrote:
> The "printable" property is a Python invention, not a Unicode property,
> so we do have some freedom in deciding what is printable and what
> is not.

The current implementation considers printable """all the characters except those defined in the Unicode character database under the following categories:
  * Cc (Other, Control)
  * Cf (Other, Format)
  * Cs (Other, Surrogate)
  * Co (Other, Private Use)
  * Cn (Other, Not Assigned)
  * Zl (Separator, Line) ('\u2028', LINE SEPARATOR)
  * Zp (Separator, Paragraph) ('\u2029', PARAGRAPH SEPARATOR)
  * Zs (Separator, Space) other than ASCII space ('\x20').
"""

We could also arbitrarily exclude all the non-BMP chars.
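The category rule quoted above can be checked directly. This is a minimal sketch using only the standard unicodedata module; the sample characters are chosen here for illustration and do not come from the report.

```python
import unicodedata

# Sample characters chosen for illustration, one per category of interest.
samples = ["a", "\x20", "\xa0", "\u2028", "\u200b", "\ue000", "\U00010000"]

for ch in samples:
    category = unicodedata.category(ch)
    # For these samples, repr() leaves the character unescaped
    # when isprintable() is true and escapes it otherwise.
    unescaped = repr(ch) == "'" + ch + "'"
    print(f"U+{ord(ch):04X} {category} "
          f"isprintable={ch.isprintable()} unescaped={unescaped}")
```

The last sample is the non-BMP code point from the report; it is treated like any other assigned letter.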
msg109528 - Author: Marc-Andre Lemburg (lemburg) (Python committer) Date: 2010-07-08 09:34
[Adding some bits from the discussion on #5127 for better context]

"""
Ezio Melotti wrote:
> >
> > Ezio Melotti <ezio.melotti@gmail.com> added the comment:
> >
> > [This should probably be discussed on python-dev or in another issue, so feel free to move the conversation there.]
> >
> > The current implementation considers printable """all the characters except those defined in the Unicode character database under the following categories:
> >   * Cc (Other, Control)
> >   * Cf (Other, Format)
> >   * Cs (Other, Surrogate)
> >   * Co (Other, Private Use)
> >   * Cn (Other, Not Assigned)
> >   * Zl (Separator, Line) ('\u2028', LINE SEPARATOR)
> >   * Zp (Separator, Paragraph) ('\u2029', PARAGRAPH SEPARATOR)
> >   * Zs (Separator, Space) other than ASCII space ('\x20')."""
> >
> > We could also arbitrarily exclude all the non-BMP chars, but that shouldn't be based on the availability of the fonts IMHO.

Without fonts, you can't print the code points, even if the Unicode
database defines the code point as not having one of the above
classes. And that's probably also the reason why the Unicode
database doesn't define a printable property :-)

I also find the use of Zl, Zp and Zs in the definition somewhat
arbitrary: whitespace is certainly printable. This also doesn't
match the isprint() C lib API:

http://www.cplusplus.com/reference/clibrary/cctype/isprint/

"A printable character is any character that is not a control character."
"""

There are two aspects:

 * What to call a printable code point ?

   I'd suggest to follow the C lib approach: all non-control
   characters.

 * Which criteria to use for Unicode repr() ?

   Given the original intent of the extension to allow printable
   code points to pass through unescaped, it may be better to
   define "printable" based on the sys.stdout/sys.stderr encoding:

   A code point may pass through unescaped if it is
   printable per the above definition and does not cause problems
   with the sys.stdout/sys.stderr encoding.

   Since we can't apply this check based on a per character basis,
   I think we should only allow non-ASCII code points to pass through
   if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.
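
A rough sketch of the rule proposed above, for illustration only. The helper name repr_for_encoding and the UTF_ENCODINGS set are hypothetical; nothing like this exists in the standard library.

```python
import codecs

# The three encodings named in the proposal, in normalized form.
UTF_ENCODINGS = {"utf-8", "utf-16", "utf-32"}

def repr_for_encoding(obj, encoding):
    """Hypothetical helper: escape all non-ASCII unless the stream is UTF-*."""
    # codecs.lookup() normalizes aliases such as "UTF8" to "utf-8".
    if codecs.lookup(encoding).name in UTF_ENCODINGS:
        return repr(obj)
    # ascii() is repr() with every non-ASCII character escaped.
    return ascii(obj)

print(repr_for_encoding("caf\xe9", "UTF8"))
print(repr_for_encoding("caf\xe9", "cp1252"))
```

Under this rule the accented character is escaped for cp1252 even though that code page can encode it.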
msg109531 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2010-07-08 10:24
Regarding the fonts, I think that those who actually use or need to use characters outside the BMP might have (now or in a few months/years) a font able to display them.
I also tried to print the printable chars from U+FFFF to U+1FFFF on my Linux terminal and about half of them were rendered correctly (the default font is DejaVu Sans Mono).

The question is then whether we do more harm by hiding these chars behind escape sequences from the people who use them, or by showing boxes instead of escape sequences to the people who don't use them and don't have the right font.


Regarding the categories that should be considered printable, I agree that the Zx categories could be considered printable, so the non-printable chars could be limited to the Cx categories.
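
To make the alternative definition concrete, here is a small sketch that treats only the Cx categories as non-printable. The function name is_printable_cx_only is made up for this example; str.isprintable() is the existing behavior it is compared against.

```python
import unicodedata

def is_printable_cx_only(text):
    """Hypothetical variant: only the C* (Other) categories are non-printable."""
    return all(not unicodedata.category(ch).startswith("C") for ch in text)

# NO-BREAK SPACE (Zs), LINE SEPARATOR (Zl), BEL (Cc), private use (Co)
for ch in ["\xa0", "\u2028", "\x07", "\ue000"]:
    print(f"U+{ord(ch):04X} current={ch.isprintable()} "
          f"cx_only={is_printable_cx_only(ch)}")
```

The two definitions differ only for the Zs, Zl and Zp separators.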

> Since we can't apply this check based on a per character basis,
> I think we should only allow non-ASCII code points to pass through
> if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.

If I understood correctly, you are suggesting to look at the sys.stdout/sys.stderr encoding and:
 * if it's a UTF-* encoding: allow all the non-ASCII (printable) codepoints (because they are the only encodings that can represent all the Unicode characters);
 * if it's not a UTF-* encoding: allow only ASCII (printable) codepoints.

This would however introduce a regression. For example on Windows (where the encoding is usually not a UTF-* one) I would expect accented characters (at least the ones in the codepage I'm using -- and usually it matches the native language of the user) to be displayed correctly.
A more accurate approach would be to actually try to encode the string and escape only the chars that can't be encoded (and also the ones that are not printable, of course), but this can't be done in repr() because repr() returns a Unicode string (in #5110 I did it in sys.displayhook), and encoding the string there would mean doing it twice.

Also note that I might want to use repr() to get a representation of the object without necessarily sending it through sys.stdout. For example I could write it to a file or send it via mail (Roundup reports errors via mail showing a repr of the variables), and in both cases I might use/want UTF-8 even if sys.stdout is ASCII.
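
The "encode and escape only what can't be encoded" idea can be sketched as an encode/decode round trip. The helper name repr_escaped_for is hypothetical, and the target encoding is passed in explicitly because, as noted above, it need not be that of sys.stdout.

```python
def repr_escaped_for(obj, encoding):
    """Hypothetical helper: repr() with only non-encodable characters escaped."""
    text = repr(obj)
    # First pass: unencodable characters become \xNN, \uNNNN or \UNNNNNNNN.
    data = text.encode(encoding, "backslashreplace")
    # Second pass: back to str, since a repr has to be text.
    return data.decode(encoding)

sample = "snowman: \u2603, and accented \xf6\xe4\xe5"
print(ascii(repr_escaped_for(sample, "iso-8859-1")))
print(ascii(repr_escaped_for(sample, "utf-8")))
```

The string is encoded once here and once more when it is finally written, which is the double work mentioned above.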
msg109533 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-07-08 11:02
> A more accurate approach would be to actually try to encode the string
> and escape only the chars that can't be encoded

This is already the case with sys.stderr, it uses the "backslashreplace" error handler. Do you suggest the same for sys.stdout?
msg109534 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-07-08 11:03
Yes, repr() should not depend on the user's terminal.
msg109535 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2010-07-08 11:10
> This is already the case with sys.stderr, it uses the "backslashreplace"
> error handler. Do you suggest the same for sys.stdout?

See http://bugs.python.org/issue5110#msg84965
msg109536 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) (Python committer) Date: 2010-07-08 11:11
The "Rationale" chapter of PEP 3138 explains why sys.stdout uses the "strict" error handler while sys.stderr uses "backslashreplace".

It would be possible to use "backslashreplace" for stdout as well for interactive sessions, but the PEP also rejected this because it '''may add confusion of the kind "it works in interactive mode but not when redirecting to a file".'''
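
The difference between the two error handlers can be reproduced without touching the real streams. In this sketch, write_to is a made-up helper that wraps an in-memory buffer in an io.TextIOWrapper, standing in for stdout ("strict") and stderr ("backslashreplace").

```python
import io

def write_to(encoding, errors, text):
    """Write text through a TextIOWrapper and return the bytes produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()
    return raw.getvalue()

# Like sys.stderr: the unencodable character is escaped.
print(write_to("ascii", "backslashreplace", "a snowman: \u2603"))

# Like sys.stdout: the same write fails.
try:
    write_to("ascii", "strict", "a snowman: \u2603")
except UnicodeEncodeError as exc:
    print("strict handler raised:", exc.reason)
```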
msg109550 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2010-07-08 15:35
Yes, but as I said in the message I linked, that's *not* what I want to do.
I want to change only the behavior of the interactive interpreter and only when the string sent to stdout is not encodable (so only when the encoding is not UTF-*).

This can be done by changing sys.displayhook, but I haven't figured out yet how to do it. The default displayhook (Python/sysmodule.c:71) calls PyFile_WriteObject (Objects/fileobject.c:139), passing the object as is along with stdout. PyFile_WriteObject then does the equivalent of sys.stdout.write(repr(obj)).
This is all done by passing around unicode strings. Ideally we should try to encode the repr of the objects in displayhook using sys.stdout.encoding and 'backslashreplace', but then:
  1) we would have to decode the resulting byte string again before passing it to PyFile_WriteObject;
  2) we would have to find a way to write a bytestring to sys.stdout, but I don't think that's possible (keep in mind that sys.stdout could also be some other object).

OTOH even if the intermediate step of encoding/decoding looks redundant, it shouldn't affect performance too much, because it's not that common to print a lot of text in the interactive interpreter, and even in those cases performance is probably not so important. It would anyway be better to find another way to do it.
msg109555 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2010-07-08 17:17
I think if you change it to stop considering non-BMP characters as printable, somebody will complain. If you change it in any other way, somebody will complain. Somebody will always complain - so you might as well leave things the way they are. Or you change it in a way that you think most users would consider useful - but don't expect consensus.
msg109617 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-07-08 22:04
amaury> Should repr() print unicode characters outside the BMP?

Yes. I don't understand why characters outside the BMP should be treated differently from other characters. Is it a workaround for bogus operating systems? My Linux terminal (Konsole on KDE) is able to display Ugaritic characters (range starting at U+10380).

lemburg> it may be better to define "printable" based on the
lemburg> sys.stdout/sys.stderr encoding

You cannot do that: the stdout and stderr encodings might be different, the stdout and stderr error handlers are different, and repr() output is not always written to stdout or stderr. If you write repr() output to a file, you have to know the encoding of the file. How can I get that encoding? If sys.stdout is replaced by an io.StringIO() object (e.g. in doctests), there is no "encoding" at all (StringIO only manipulates unicode objects, not bytes objects).
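
The StringIO case is easy to verify. A short sketch, with an illustrative sample string:

```python
import io

buffer = io.StringIO()

# StringIO stores str objects directly, so no codec is involved
# and the encoding attribute carries no usable value.
print(buffer.encoding)

# Any character can be written, whatever the terminal supports.
buffer.write(repr("a snowman: \u2603"))
print(ascii(buffer.getvalue()))
```

Code that derives escaping rules from sys.stdout.encoding would have nothing to work with here.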

ezio> I want to change only the behavior of the interactive interpreter

This idea was rejected by the PEP.

I agree with the PEP that it "may add confusion of the kind 'it works in interactive mode but not when redirecting to a file'".

I already ran into such a problem: the interactive interpreter adds '' to sys.path, so import behaves differently in the interpreter than in a script. It's annoying because it took me hours to understand why it was different.

ezio> and only when the string sent to stdout is not encodable

Which means setting sys.stdout.errors to something other than strict. I prefer to detect unicode problems earlier.

E.g. if you set errors to 'replace', write will never fail. If the output is used as input for another program (UNIX pipe), you will send "?" to the reader process. I'm not sure that it is the expected behaviour. And if stdout is not a TTY, stdout uses ASCII encoding (which will raise a unicode error at the first non-ASCII character!).
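
The data loss caused by 'replace' shows up with a plain str.encode() call. In this sketch the sample text is illustrative.

```python
text = "a snowman: \u2603"

replaced = text.encode("ascii", "replace")
escaped = text.encode("ascii", "backslashreplace")

print(replaced)  # the original character cannot be recovered from this
print(escaped)   # the code point is still visible to the reader process
```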
msg109620 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2010-07-08 22:15
I suggest that we:
  1) keep the current behavior for non-BMP chars (i.e. print them normally);
  2) change isprintable to consider the Zx categories printable (this will affect repr() too);
  3) change displayhook (*NOT* sys.stdout.encoding) to use backslashreplace when the string contains chars that are not encodable with sys.stdout.encoding *.

* note that this will affect only the objects that are converted with repr() in the interpreter e.g. ">>> x = 'foo'; x" and *NOT* ">>> x = 'foo'; print(x)". Since the first behavior exists *only* in the interactive interpreter it won't be inconsistent with normal programs ("x = 'foo'; x" in a program doesn't display anything).
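
Point 3 can be prototyped in pure Python. This is only a sketch under stated assumptions, not the attached C patch: the name escaping_displayhook and its optional stream parameter (added to make it testable) are invented here.

```python
import builtins
import sys

def escaping_displayhook(value, stream=None):
    """Hypothetical displayhook that escapes what the stream cannot encode."""
    if value is None:
        return
    stream = sys.stdout if stream is None else stream
    builtins._ = None  # same bookkeeping as the default hook
    text = repr(value)
    try:
        stream.write(text)
    except UnicodeEncodeError:
        # Only reached when the stream's encoding cannot represent the repr.
        data = text.encode(stream.encoding, "backslashreplace")
        stream.write(data.decode(stream.encoding))
    stream.write("\n")
    builtins._ = value

# sys.displayhook = escaping_displayhook  # opt in for interactive sessions
```

print() never goes through the hook, so its behavior is unchanged.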
msg109702 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2010-07-09 09:49
Here is a patch to "fix" sys_displayhook (note: the patch is just a proof of concept -- it seems to work fine but I still have to clean it up, add comments, rename and reorganize some vars and add tests).
This is an example output while using iso-8859-1 as IO encoding:

wolf@linuxvm:~/dev/py3k$ PYTHONIOENCODING=iso-8859-1 ./python
Python 3.2a0 (py3k:82643:82644M, Jul  9 2010, 11:39:25)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; sys.stdout.encoding, sys.stdin.encoding
('iso-8859-1', 'iso-8859-1')
>>> 'ascii string'
'ascii string'  # works fine
>>> 'some accented chars: öäå'
'some accented chars: öäå'  # works fine - these chars are encodable
>>> 'a snowman: \u2603'
'a snowman: \u2603'  # non-encodable - the char is escaped instead of raising an error
>>> 'snowman: \u2603, and accented öäå'
'snowman: \u2603, and accented öäå' # only non-encodable chars are escaped
>>> # the behavior of print is still the same:
>>> print('some accented chars: öäå') 
some accented chars: öäå
>>> print('a snowman: \u2603')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2603' in position 11: ordinal not in range(256)

-------------------------------------

While testing the patch with PYTHONIOENCODING=iso-8859-1 I also found this weird issue, which however is *not* related to the patch, since I managed to reproduce it on a clean py3k using PYTHONIOENCODING=iso-8859-1:
>>> 'òàùèì  óáúéí  öäüëï'
'ò�\xa0ùèì  óáúé�\xad  öäüëï'
>>> 'òàùèì  óáúéí  öäüëï'.encode('iso-8859-1')
b'\xc3\xb2\xc3\xa0\xc3\xb9\xc3\xa8\xc3\xac  \xc3\xb3\xc3\xa1\xc3\xba\xc3\xa9\xc3\xad  \xc3\xb6\xc3\xa4\xc3\xbc\xc3\xab\xc3\xaf'
>>> 'òàùèì'.encode('utf-8')
b'\xc3\x83\xc2\xb2\xc3\x83\xc2\xa0\xc3\x83\xc2\xb9\xc3\x83\xc2\xa8\xc3\x83\xc2\xac'

I think there might be some conflict between the IO encoding that I specified and the one that my terminal actually uses, but I couldn't figure out what's going on exactly (it is also weird that only 'à' and 'í' are not displayed correctly). Unless this behavior is expected I'll open another issue about it.
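
One reading that is consistent with the bytes shown above, offered as an assumption rather than a confirmed diagnosis: the terminal sends UTF-8 while Python decodes stdin as ISO-8859-1. The sketch below replays that scenario; the variable names are illustrative.

```python
import unicodedata

typed = "òàùèì  óáúéí  öäüëï"

# Assumption: the terminal sends UTF-8 bytes, Python decodes them as ISO-8859-1.
seen_by_python = typed.encode("utf-8").decode("iso-8859-1")

print(seen_by_python.encode("iso-8859-1"))  # the original UTF-8 bytes, unchanged
print(seen_by_python.encode("utf-8"))       # every character encoded twice

# The only characters that repr() escapes in the misdecoded string:
for ch in sorted(set(seen_by_python)):
    if not ch.isprintable():
        print(ascii(ch), unicodedata.category(ch))
```

Under that assumption, the second byte of 'à' (0xA0, NO-BREAK SPACE) and of 'í' (0xAD, SOFT HYPHEN) decodes to a non-printable character, so repr() escapes it and the lone 0xC3 byte left in front is no longer valid UTF-8 for the terminal.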
msg113149 - Author: Ezio Melotti (ezio.melotti) (Python committer) Date: 2010-08-07 03:41
Assigning to myself so that I'll remember to finish and commit the patch.
msg113734 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-08-13 01:29
About issue9198.diff:
 - exit directly if !PyErr_ExceptionMatches(PyExc_UnicodeEncodeError) to avoid a useless level of indentation
 - why do you clear the exception before calling PyObject_Repr()? if you cannot execute code while an exception is active, you should maybe save/restore the original exception?
 - the code is long: it can maybe be moved to a subfunction
 - PyObject_CallMethod(buffer, "write", "(O)", encoded) to call write method
 - you can maybe just use strict error handler when you decode the encoded variable: it doesn't work anyway


>>> b"a\xff".decode("ascii", "backslashreplace")
...
TypeError: don't know how to handle UnicodeDecodeError in error callback
>>> b"a\xff".decode("utf-8", "backslashreplace")
...
TypeError: don't know how to handle UnicodeDecodeError in error callback

Note: some encodings don't support backslashreplace, especially mbcs encoding. But on Windows, sys.stdout.encoding is not mbcs but cpXXXX (eg. cp850).
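
The tracebacks above reflect the Python version under discussion; to my knowledge, Python 3.5 and later also accept 'backslashreplace' when decoding. Either way, strict decoding is enough for the round trip, as this sketch suggests for three ASCII-compatible codecs picked for illustration.

```python
text = "a snowman: \u2603"

for encoding in ("ascii", "iso-8859-1", "cp850"):
    data = text.encode(encoding, "backslashreplace")
    # Strict decoding succeeds: the escape consists of plain ASCII characters.
    print(encoding, ascii(data.decode(encoding, "strict")))
```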
msg123033 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-12-02 01:49
I created a new issue for Ezio's proposal to patch sys.displayhook: issue #10601.

To answer the initial question, "Should repr() print unicode characters outside the BMP?", Marc-Andre, Ezio and I agree that we should keep the current behavior for non-BMP chars (i.e. print them normally) and so I close this issue.
History
Date | User | Action | Args
2010-12-02 01:51:06 | vstinner | set | status: open -> closed; resolution: not a bug
2010-12-02 01:49:08 | vstinner | set | messages: + msg123033
2010-11-20 00:22:04 | ezio.melotti | set | nosy: + belopolsky
2010-08-13 01:29:54 | vstinner | set | messages: + msg113734
2010-08-07 11:43:36 | eric.araujo | set | nosy: + eric.araujo
2010-08-07 03:41:51 | ezio.melotti | set | assignee: ezio.melotti; messages: + msg113149
2010-07-09 09:49:08 | ezio.melotti | set | files: + issue9198.diff; keywords: + patch; messages: + msg109702
2010-07-09 03:53:37 | Rhamphoryncus | set | nosy: + Rhamphoryncus
2010-07-08 22:15:58 | ezio.melotti | set | messages: + msg109620
2010-07-08 22:04:42 | vstinner | set | messages: + msg109617
2010-07-08 17:17:05 | loewis | set | nosy: + loewis; messages: + msg109555
2010-07-08 15:35:52 | ezio.melotti | set | messages: + msg109550
2010-07-08 13:38:09 | vstinner | set | nosy: + vstinner
2010-07-08 11:11:46 | amaury.forgeotdarc | set | messages: + msg109536
2010-07-08 11:10:13 | ezio.melotti | set | messages: + msg109535
2010-07-08 11:03:57 | amaury.forgeotdarc | set | messages: + msg109534
2010-07-08 11:02:22 | amaury.forgeotdarc | set | messages: + msg109533
2010-07-08 10:24:24 | ezio.melotti | set | messages: + msg109531
2010-07-08 09:34:50 | lemburg | set | messages: + msg109528
2010-07-08 08:53:04 | amaury.forgeotdarc | create |