
Should repr() print unicode characters outside the BMP? #53444

Closed
amauryfa opened this issue Jul 8, 2010 · 15 comments
Assignees
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@amauryfa
Member

amauryfa commented Jul 8, 2010

BPO 9198
Nosy @malemburg, @loewis, @amauryfa, @abalkin, @vstinner, @ezio-melotti, @merwok
Files
  • issue9198.diff: Incomplete (but working) patch to "fix" displayhook.
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = 'https://github.com/ezio-melotti'
    closed_at = <Date 2010-12-02.01:51:06.081>
    created_at = <Date 2010-07-08.08:53:04.028>
    labels = ['type-bug', 'invalid', 'expert-unicode']
    title = 'Should repr() print unicode characters outside the BMP?'
    updated_at = <Date 2010-12-02.01:51:06.080>
    user = 'https://github.com/amauryfa'

    bugs.python.org fields:

    activity = <Date 2010-12-02.01:51:06.080>
    actor = 'vstinner'
    assignee = 'ezio.melotti'
    closed = True
    closed_date = <Date 2010-12-02.01:51:06.081>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2010-07-08.08:53:04.028>
    creator = 'amaury.forgeotdarc'
    dependencies = []
    files = ['17915']
    hgrepos = []
    issue_num = 9198
    keywords = ['patch']
    message_count = 15.0
    messages = ['109520', '109528', '109531', '109533', '109534', '109535', '109536', '109550', '109555', '109617', '109620', '109702', '113149', '113734', '123033']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'loewis', 'amaury.forgeotdarc', 'belopolsky', 'Rhamphoryncus', 'vstinner', 'ezio.melotti', 'eric.araujo']
    pr_nums = []
    priority = 'normal'
    resolution = 'not a bug'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue9198'
    versions = ['Python 3.2']

    @amauryfa
    Member Author

    amauryfa commented Jul 8, 2010

    On wide unicode builds, '\U00010000'.isprintable() returns True, and repr() returns the character unmodified.
    Is this good behavior, given that very few fonts can display this character?

    Marc-Andre Lemburg wrote:

    The "printable" property is a Python invention, not a Unicode property,
    so we do have some freedom in deciding what is printable and what
    is not.

    The current implementation considers printable """all the characters except those characters defined in the Unicode character database as belonging to the following categories:

    • Cc (Other, Control)
    • Cf (Other, Format)
    • Cs (Other, Surrogate)
    • Co (Other, Private Use)
    • Cn (Other, Not Assigned)
    • Zl Separator, Line ('\u2028', LINE SEPARATOR)
    • Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
    • Zs (Separator, Space) other than ASCII space('\x20').
      """

    We could also arbitrarily exclude all the non-BMP chars.
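
    For reference, a rough Python-level sketch of the quoted definition (the real check lives in the C implementation of str.isprintable(); the helper below is only an approximation):

    import unicodedata

    # Categories excluded by the quoted definition: the "Other" and
    # "Separator" categories, except the ASCII space.
    NONPRINTABLE_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}

    def is_printable(ch):
        # Approximation of str.isprintable() for a single character.
        if ch == " ":          # ASCII space is explicitly allowed
            return True
        return unicodedata.category(ch) not in NONPRINTABLE_CATEGORIES

    print(is_printable("\U00010000"))   # True: U+10000 has category "Lo"
    print("\U00010000".isprintable())   # True on a wide build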

    @amauryfa amauryfa added topic-unicode type-bug An unexpected behavior, bug, or error labels Jul 8, 2010
    @malemburg
    Member

    [Adding some bits from the discussion on bpo-5127 for better context]

    """
    Ezio Melotti wrote:

    >
    > Ezio Melotti <ezio.melotti@gmail.com> added the comment:
    >
    > [This should probably be discussed on python-dev or in another issue, so feel free to move the
    conversation there.]
    >
    > The current implementation considers printable """all the characters except those characters
    defined in the Unicode character database as following categories are considered printable.
    > * Cc (Other, Control)
    > * Cf (Other, Format)
    > * Cs (Other, Surrogate)
    > * Co (Other, Private Use)
    > * Cn (Other, Not Assigned)
    > * Zl Separator, Line ('\u2028', LINE SEPARATOR)
    > * Zp Separator, Paragraph ('\u2029', PARAGRAPH SEPARATOR)
    > * Zs (Separator, Space) other than ASCII space('\x20')."""
    >
    > We could also arbitrary exclude all the non-BMP chars, but that shouldn't be based on the
    availability of the fonts IMHO.

    Without fonts, you can't print the code points, even if the Unicode
    database defines the code point as not having one of the above
    classes. And that's probably also the reason why the Unicode
    database doesn't define a printable property :-)

    I also find the use of Zl, Zp and Zs in the definition somewhat
    arbitrary: whitespace is certainly printable. This also doesn't
    match the isprint() C lib API:

    http://www.cplusplus.com/reference/clibrary/cctype/isprint/

    "A printable character is any character that is not a control character."
    """

    There are two aspects:

    • What to call a printable code point?

      I'd suggest following the C lib approach: all non-control
      characters.

    • Which criteria to use for Unicode repr()?

      Given the original intent of the extension to allow printable
      code points to pass through unescaped, it may be better to
      define "printable" based on the sys.stdout/sys.stderr encoding:

      A code point may pass through unescaped if it is
      printable per the above definition and does not cause problems
      with the sys.stdout/sys.stderr encoding.

      Since we can't apply this check on a per-character basis,
      I think we should only allow non-ASCII code points to pass through
      if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.
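
    A minimal sketch of that criterion; the helper name and the exact set of accepted codecs are illustrative, not an agreed-on API:

    import codecs
    import sys

    UTF_CODECS = {"utf-8", "utf-16", "utf-16-le", "utf-16-be",
                  "utf-32", "utf-32-le", "utf-32-be"}

    def allow_non_ascii_passthrough():
        # Hypothetical check: let non-ASCII code points through unescaped only
        # when both stdout and stderr use a codec that can represent any code point.
        for stream in (sys.stdout, sys.stderr):
            enc = getattr(stream, "encoding", None)
            if enc is None or codecs.lookup(enc).name not in UTF_CODECS:
                return False
        return True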

    @ezio-melotti
    Member

    Regarding the fonts, I think that anyone who actually uses or needs to use characters outside the BMP probably has (now, or will in a few months/years) a font able to display them.
    I also tried to print the printable chars from U+FFFF to U+1FFFF on my linux terminal and about half of them were rendered correctly (the default font is DejaVu Sans Mono).

    The question is then whether we do more harm by hiding these chars behind escape sequences for the people who use them, or by showing them as boxes to the people who don't use them and don't have the right font.

    Regarding the categories that should be considered printable, I agree that the Zx categories could be considered printable, so the non-printable chars could be limited to the Cx categories.

    Since we can't apply this check based on a per character basis,
    I think we should only allow non-ASCII code points to pass through
    if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.

    If I understood correctly, you are suggesting to look at the sys.stdout/sys.stderr encoding and:

    • if it's a UTF-* encoding: allow all the non-ASCII (printable) codepoints (because they are the only encodings that can represent all the Unicode characters);
    • if it's not a UTF-* encoding: allow only ASCII (printable) codepoints.

    This would however introduce a regression. For example on Windows (where the encoding is usually not a UTF-* one) I would expect accented characters (at least the ones in the codepage I'm using -- and usually it matches the native language of the user) to be displayed correctly.
    A more accurate approach would be to actually try to encode the string and escape only the chars that can't be encoded (and also the ones that are not printable, of course), but this can't be done in repr() because repr() returns a Unicode string (in bpo-5110 I did it in sys.displayhook), and encoding the string there would mean doing it twice.
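
    A minimal string-level sketch of that approach (the encoding name is just an example); it escapes only the characters the target encoding cannot represent:

    def escape_unencodable(text, encoding="iso-8859-1"):
        data = text.encode(encoding, "backslashreplace")
        # The escapes produced by backslashreplace are plain ASCII, so decoding
        # with the same codec round-trips without errors.
        return data.decode(encoding)

    print(escape_unencodable("snowman: \u2603, accented: öäå"))
    # -> snowman: \u2603, accented: öäå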

    Also note that I might want to use repr() to get a representation of the object without necessarily sending it through sys.stdout. For example I could write it to a file or send it via mail (Roundup reports errors via mail showing a repr of the variables), and in both cases I might use/want UTF-8 even if sys.stdout is ASCII.

    @amauryfa
    Member Author

    amauryfa commented Jul 8, 2010

    A more accurate approach would be to actually try to encode the string
    and escape only the chars that can't be encoded

    This is already the case with sys.stderr, it uses the "backslashreplace" error handler. Do you suggest the same for sys.stdout?

    @amauryfa
    Member Author

    amauryfa commented Jul 8, 2010

    Yes, repr() should not depend on the user's terminal.

    @ezio-melotti
    Member

    This is already the case with sys.stderr, it uses the "backslashreplace"
    error handler. Do you suggest the same for sys.stdout?

    See http://bugs.python.org/issue5110#msg84965

    @amauryfa
    Member Author

    amauryfa commented Jul 8, 2010

    The chapter "Rationale" in PEP-3138 explains why sys.stdout uses "strict" encoding, when sys.stderr uses "backslashreplace".

    It would be possible to use "backslashreplace" for stdout as well for interactive sessions, but the PEP also rejected this because it '''may add confusion of the kind "it works in interactive mode but not when redirecting to a file".'''
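
    For reference, the asymmetry described in the PEP is visible directly on the stream objects; the snippet below assumes a stdout/stderr encoding that cannot represent U+2603:

    import sys

    print(sys.stdout.errors, sys.stderr.errors)   # typically: strict backslashreplace

    snowman = "a snowman: \u2603"
    print(snowman, file=sys.stderr)   # never fails: the char is written as \u2603
    print(snowman)                    # raises UnicodeEncodeError under e.g. iso-8859-1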

    @ezio-melotti
    Member

    Yes, but as I said in the message I linked, that's *not* what I want to do.
    I want to change only the behavior of the interactive interpreter and only when the string sent to stdout is not encodable (so only when the encoding is not UTF-*).

    This can be done by changing sys.displayhook, but I haven't figured out yet how to do it. The default displayhook (Python/sysmodule.c:71) calls PyFile_WriteObject (Objects/fileobject.c:139), passing the object as is and stdout. PyFile_WriteObject then does the equivalent of sys.stdout.write(repr(obj)).
    This is all done by passing around unicode strings. Ideally we should try to encode the repr of the objects in displayhook using sys.stdout.encoding and 'backslashreplace', but then:

    1. we would have to decode the resulting byte string again before passing it to PyFile_WriteObject;
    2. we would have to find a way to write a bytestring to sys.stdout, but I don't think that's possible (keep in mind that sys.stdout could also be some other object).

    OTOH, even if the intermediate encoding/decoding step looks redundant, it shouldn't affect performance too much, because it's not that common to print a lot of text in the interactive interpreter, and even in those cases performance probably isn't so important. It would anyway be better to find another way to do it.

    @loewis
    Mannequin

    loewis mannequin commented Jul 8, 2010

    I think if you change it to stop considering non-BMP characters as printable, somebody will complain. If you change it in any other way, somebody will complain. Somebody will always complain - so you might as well leave things the way they are. Or you change it in a way that you think most users would consider useful - but don't expect consensus.

    @vstinner
    Member

    vstinner commented Jul 8, 2010

    amaury> Should repr() print unicode characters outside the BMP?

    Yes. I don't understand why characters outside the BMP should be treated differently from other characters. Is it a workaround for bogus operating systems? My Linux terminal (Konsole on KDE) is able to display Ugaritic characters (range starting at U+10383).

    amaury> it may be better to define "printable" based on the
    amaury> sys.stdout/sys.stderr encoding

    You cannot do that: the stdout and stderr encodings might be different, their error handlers are different, and repr() output is not always written to stdout or stderr. If you write repr() output to a file, you have to know the encoding of the file. How can I get the encoding? If sys.stdout is replaced by an io.StringIO() object (e.g. doctests), you don't have any "encoding" (StringIO only manipulates unicode objects, not bytes objects).

    ezio> I want to change only the behavior of the interactive interpreter

    This idea was rejected by the PEP.

    I agree that it "may add confusion of the kind: it works in interactive mode but not when redirecting to a file".

    I already noticed such a problem: the interactive interpreter adds '' to sys.path, so import behaves differently in the interpreter than in a script. It's annoying because it took me hours to understand why it was different.

    ezio> and only when the string sent to stdout is not encodable

    Which means setting sys.stdout.errors to something other than strict. I prefer to detect unicode problems earlier.

    E.g. if you set errors to 'replace', write() will never fail. If the output is used as input for another program (UNIX pipe), you will send "?" to the reader process; I'm not sure that is the expected behaviour. And if stdout is not a TTY, stdout uses the ASCII encoding (which will raise a Unicode error at the first non-ASCII character!).
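
    A small illustration of that difference at the codec level (a pipe reader would simply receive these bytes):

    text = "café \u2603"
    print(text.encode("ascii", "replace"))            # b'caf? ?'   -- data silently degraded
    print(text.encode("ascii", "backslashreplace"))   # b'caf\\xe9 \\u2603'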

    @ezio-melotti
    Member

    I suggest to:

    1. keep the current behavior for non-BMP chars (i.e. print them normally);
    2. change isprintable to consider the Zx categories printable (this will affect repr() too);
    3. change displayhook (NOT sys.stdout.encoding) to use backslashreplace when the string contains chars that are not encodable with sys.stdout.encoding *.

    * Note that this will affect only the objects that are converted with repr() in the interpreter, e.g. ">>> x = 'foo'; x", and *NOT* ">>> x = 'foo'; print(x)". Since the first behavior exists *only* in the interactive interpreter, it won't be inconsistent with normal programs ("x = 'foo'; x" in a program doesn't display anything).
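
    A rough pure-Python sketch of point 3, assuming the hook only needs to handle the UnicodeEncodeError case (the real change would live in Python/sysmodule.c):

    import builtins
    import sys

    def escaping_displayhook(value):
        # Behave like the default displayhook, but backslash-escape only the
        # characters that stdout's encoding cannot represent, instead of failing.
        if value is None:
            return
        builtins._ = None
        text = repr(value)
        encoding = getattr(sys.stdout, "encoding", None)
        if encoding is not None:
            try:
                text.encode(encoding)
            except UnicodeEncodeError:
                text = text.encode(encoding, "backslashreplace").decode(encoding)
        print(text)
        builtins._ = value

    # sys.displayhook = escaping_displayhook   # affects only the interactive prompt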

    @ezio-melotti
    Member

    Here is a patch to "fix" sys_displayhook (note: the patch is just a proof of concept -- it seems to work fine but I still have to clean it up, add comments, rename and reorganize some vars and add tests).
    This is an example output while using iso-8859-1 as IO encoding:

    wolf@linuxvm:~/dev/py3k$ PYTHONIOENCODING=iso-8859-1 ./python
    Python 3.2a0 (py3k:82643:82644M, Jul  9 2010, 11:39:25)
    [GCC 4.4.1] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import sys; sys.stdout.encoding, sys.stdin.encoding
    ('iso-8859-1', 'iso-8859-1')
    >>> 'ascii string'
    'ascii string'  # works fine
    >>> 'some accented chars: öäå'
    'some accented chars: öäå'  # works fine - these chars are encodable
    >>> 'a snowman: \u2603'
    'a snowman: \u2603'  # non-encodable - the char is escaped instead of raising an error
    >>> 'snowman: \u2603, and accented öäå'
    'snowman: \u2603, and accented öäå' # only non-encodable chars are escaped
    >>> # the behavior of print is still the same:
    >>> print('some accented chars: öäå') 
    some accented chars: öäå
    >>> print('a snowman: \u2603')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'latin-1' codec can't encode character '\u2603' in position 11: ordinal not in range(256)
    
    -------------------------------------
    
    While testing the patch with PYTHONIOENCODING=iso-8859-1 I also found this weird issue, which however is *not* related to the patch, since I managed to reproduce it on a clean py3k using PYTHONIOENCODING=iso-8859-1:
    >>> 'òàùèì  óáúéí  öäüëï'
    'ò�\xa0ùèì  óáúé�\xad  öäüëï'
    >>> 'òàùèì  óáúéí  öäüëï'.encode('iso-8859-1')
    b'\xc3\xb2\xc3\xa0\xc3\xb9\xc3\xa8\xc3\xac  \xc3\xb3\xc3\xa1\xc3\xba\xc3\xa9\xc3\xad  \xc3\xb6\xc3\xa4\xc3\xbc\xc3\xab\xc3\xaf'
    >>> 'òàùèì'.encode('utf-8')
    b'\xc3\x83\xc2\xb2\xc3\x83\xc2\xa0\xc3\x83\xc2\xb9\xc3\x83\xc2\xa8\xc3\x83\xc2\xac'

    I think there might be some conflict between the IO encoding that I specified and the one that my terminal actually uses, but I couldn't figure out what's going on exactly (it's also weird that only 'à' and 'í' are not displayed correctly). Unless this behavior is expected I'll open another issue about it.
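
    A plausible reconstruction of what happens, assuming the terminal itself uses UTF-8 while PYTHONIOENCODING forces iso-8859-1 on stdin:

    # Each character typed at the prompt arrives as UTF-8 bytes but is decoded
    # as iso-8859-1, so it turns into two latin-1 characters:
    "à".encode("utf-8").decode("iso-8859-1")   # 'Ã\xa0' -- U+00A0 NO-BREAK SPACE (Zs)
    "í".encode("utf-8").decode("iso-8859-1")   # 'Ã\xad' -- U+00AD SOFT HYPHEN (Cf)
    # The second byte of 'à' and 'í' lands in a non-printable category, so repr()
    # escapes exactly those two, while e.g. 'ò' becomes 'Ã²', which is printable.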

    @ezio-melotti
    Member

    Assigning to myself so that I'll remember to finish and commit the patch.

    @ezio-melotti ezio-melotti self-assigned this Aug 7, 2010
    @vstinner
    Member

    About bpo-9198.diff:

    • exit directly if !PyErr_ExceptionMatches(PyExc_UnicodeEncodeError) to avoid a useless level of indentation
    • why do you clear the exception before calling PyObject_Repr()? If you cannot execute code while an exception is active, you should maybe save/restore the original exception?
    • the code is long: it can maybe be moved to a subfunction
    • PyObject_CallMethod(buffer, "write", "(O)", encoded) to call write method
    • you can maybe just use the strict error handler when you decode the encoded variable: backslashreplace doesn't work for decoding anyway
    >>> b"a\xff".decode("ascii", "backslashreplace")
    ...
    TypeError: don't know how to handle UnicodeDecodeError in error callback
    >>> b"a\xff".decode("utf-8", "backslashreplace")
    ...
    TypeError: don't know how to handle UnicodeDecodeError in error callback

    Note: some encodings don't support backslashreplace, especially the mbcs encoding. But on Windows, sys.stdout.encoding is not mbcs but cpXXXX (e.g. cp850).

    @vstinner
    Member

    vstinner commented Dec 2, 2010

    I created a new issue for Ezio's proposal to patch sys.displayhook: bpo-10601.

    To answer the initial question, "Should repr() print unicode characters outside the BMP?": Marc-Andre, Ezio and I agree that we should keep the current behavior for non-BMP chars (i.e. print them normally), so I'm closing this issue.

    @vstinner vstinner closed this as completed Dec 2, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022