Should repr() print unicode characters outside the BMP? #53444
On wide unicode builds, '\U00010000'.isprintable() returns True, and repr() returns the character unmodified. Marc-Andre Lemburg wrote:
The current implementation considers printable """all the characters except those characters defined in the Unicode character database as following categories are considered printable.
We could also arbitrarily exclude all the non-BMP chars. |
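The behavior described in the opening report can be checked with a few lines. This is a minimal sketch, assuming Python 3.3 or later, where the narrow/wide build distinction no longer exists; the variable name `ch` is only illustrative.

```python
import unicodedata

# U+10000 is the first code point outside the BMP.
ch = '\U00010000'

# General category of the character (a letter, not a control character).
category = unicodedata.category(ch)

# str.isprintable() decides whether repr() escapes the character.
printable = ch.isprintable()

# repr() keeps printable characters as they are; ascii() always escapes
# anything outside ASCII.
as_repr = repr(ch)
as_ascii = ascii(ch)

print(category, printable, as_ascii)
```

Here `ascii()` is the built-in alternative for callers who want escapes regardless of what `isprintable()` says.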
[Adding some bits from the discussion on bpo-5127 for better context] """
Without fonts, you can't print the code points, even if the Unicode [...]
I also find the use of Zl, Zp and Zs in the definition somewhat [...]
http://www.cplusplus.com/reference/clibrary/cctype/isprint/
"A printable character is any character that is not a control character."
There are two aspects:
|
Regarding the fonts, I think that those who actually use or need to use characters outside the BMP may already have (now or in a few months/years) a font able to display them. The question is then whether we do more harm by hiding these chars behind escape sequences from the people who use them, or by showing boxes instead of escape sequences to the people who don't use them and don't have the right font. Regarding the categories that should be considered printable, I agree that the Zx categories could be considered printable, so the non-printable chars could be limited to the Cx categories.
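To make the difference between the two definitions concrete, here is a small sketch that compares the existing `str.isprintable()` with the looser rule discussed above (only the Cx categories are non-printable). The helper name `printable_if_not_cx` is hypothetical and is not part of any patch in this issue.

```python
import unicodedata

def printable_if_not_cx(ch):
    """Hypothetical rule: only the C* ("Other") categories are non-printable."""
    return not unicodedata.category(ch).startswith('C')

samples = {
    'letter a': 'a',
    'no-break space (Zs)': '\u00a0',
    'line separator (Zl)': '\u2028',
    'bell (Cc)': '\x07',
}

# Print the category, the current answer, and the answer under the looser rule.
for name, ch in samples.items():
    print(name, unicodedata.category(ch), ch.isprintable(), printable_if_not_cx(ch))
```

The two rules disagree only on the separator categories: `isprintable()` rejects Zs (other than the ASCII space), Zl and Zp, while the looser rule accepts them.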
If I understood correctly, you are suggesting to look at the sys.stdout/sys.stderr encoding and:
This would however introduce a regression. For example on Windows (where the encoding is usually not a UTF-* one) I would expect accented characters (at least the ones in the codepage I'm using -- and usually it matches the native language of the user) to be displayed correctly. Also note that I might want to use repr() to get a representation of the object without necessarily sending it through sys.stdout. For example I could write it to a file or send it via mail (Roundup reports errors via mail showing a repr of the variables) and in both cases I might use/want UTF-8 even if sys.stdout is ASCII. |
This is already the case with sys.stderr, it uses the "backslashreplace" error handler. Do you suggest the same for sys.stdout? |
Yes, repr() should not depend on the user's terminal. |
|
The "Rationale" chapter of PEP 3138 explains why sys.stdout uses the "strict" error handler, while sys.stderr uses "backslashreplace". It would be possible to use "backslashreplace" for stdout as well for interactive sessions, but the PEP also rejected this because it '''may add confusion of the kind "it works in interactive mode but not when redirecting to a file".''' |
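The difference between the two error handlers mentioned above can be seen without touching the real streams. This is only an illustration using `str.encode`; the sample text and the iso-8859-1 encoding are taken from the examples later in this thread.

```python
text = 'a snowman: \u2603'

# "strict" (the handler used by sys.stdout) refuses unencodable characters.
try:
    text.encode('iso-8859-1', 'strict')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# "backslashreplace" (the handler used by sys.stderr) escapes them instead.
escaped = text.encode('iso-8859-1', 'backslashreplace')

print(strict_failed, escaped)
```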
Yes, but as I said in the message I linked, that's *not* what I want to do. This can be done by changing sys.displayhook, but I haven't figured out yet how to do it. The default displayhook (Python/sysmodule.c:71) calls PyFile_WriteObject (Objects/fileobject.c:139), passing it the object as is and the stdout. PyFile_WriteObject then does the equivalent of sys.stdout.write(repr(obj)).
OTOH even if the intermediate step of encoding/decoding looks redundant, it shouldn't affect performance too much, because it's not that common to print a lot of text in the interactive interpreter, and even in those cases performance is probably not so important. It would anyway be better to find another way to do it. |
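The encode/decode round trip mentioned above can be sketched in pure Python. This is not the patch attached to the issue, only an illustration of the idea; `escape_unencodable` and `escaping_displayhook` are hypothetical names, and the fallback to UTF-8 when the stream reports no encoding is an assumption.

```python
import builtins
import sys

def escape_unencodable(text, encoding):
    """Escape only the characters that `encoding` cannot represent."""
    # Encoding with backslashreplace and decoding again leaves encodable
    # characters untouched and turns the others into \x, \u or \U escapes.
    return text.encode(encoding, 'backslashreplace').decode(encoding)

def escaping_displayhook(value):
    """Variant of the default hook that never raises UnicodeEncodeError."""
    if value is None:
        return
    builtins._ = None
    # Assumption: fall back to UTF-8 if stdout has no usable encoding.
    encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    sys.stdout.write(escape_unencodable(repr(value), encoding) + '\n')
    builtins._ = value

# To try it in an interactive session:
# sys.displayhook = escaping_displayhook
```

Because only the display hook changes, `print()` keeps using the strict handler of sys.stdout.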
I think if you change it to stop considering non-BMP characters as printable, somebody will complain. If you change it in any other way, somebody will complain. Somebody will always complain - so you might as well leave things the way they are. Or you change it in a way that you think most users would consider useful - but don't expect consensus. |
amaury> Should repr() print unicode characters outside the BMP?
Yes. I don't understand why characters outside the BMP should be treated differently from other characters. Is it a workaround for bogus operating systems? My Linux terminal (Konsole on KDE) is able to display Ugaritic characters (range starting at U+10383).
amaury> it may be better to define "printable" based on the [...]
You cannot do that: the stdout and stderr encodings might be different, the stdout and stderr error handlers are different, and repr() output is not always written to stdout or stderr. If you write repr() output to a file, you have to know the encoding of the file. How can I get the encoding? If sys.stdout is replaced by an io.StringIO() object (e.g. doctests), you don't have any "encoding" (StringIO only manipulates unicode objects, not bytes objects).
ezio> I want to change only the behavior of the interactive interpreter
This idea was rejected by the PEP. I agree that it "may add confusion of the kind "it works in interactive mode but not when redirecting to a file"". I already noticed such a problem: the interactive interpreter adds '' to sys.path, and so import behaves differently in the interpreter than in a script. It's annoying because it took me hours to understand why it was different.
ezio> and only when the string sent to stdout is not encodable
Which means setting sys.stdout.errors to something other than strict. I prefer to detect unicode problems earlier. E.g. if you set errors to 'replace', write will never fail. If the output is used as input for another program (UNIX pipe), you will send "?" to the reader process. I'm not sure that this is the expected behaviour. And if stdout is not a TTY, stdout uses the ASCII encoding (which will raise a unicode error at the first non-ASCII character!). |
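Two of the points made in this comment are easy to reproduce. The sketch below uses `str.encode` in place of a real pipe, and `getattr` with a default so it does not rely on the attribute being present; the sample text is made up.

```python
import io

text = 'caf\u00e9 \u2603'

# With errors='replace', encoding never fails, but the reader on the other
# end of a pipe only sees question marks.
replaced = text.encode('ascii', 'replace')

# A StringIO object stores str directly, so it has no encoding to inspect.
buffer = io.StringIO()
buffer.write(repr(text))
stringio_encoding = getattr(buffer, 'encoding', None)

print(replaced, stringio_encoding, buffer.getvalue())
```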
I suggest to:
* note that this will affect only the objects that are converted with repr() in the interpreter e.g. ">>> x = 'foo'; x" and *NOT* ">>> x = 'foo'; print(x)". Since the first behavior exists *only* in the interactive interpreter it won't be inconsistent with normal programs ("x = 'foo'; x" in a program doesn't display anything). |
Here is a patch to "fix" sys_displayhook (note: the patch is just a proof of concept -- it seems to work fine but I still have to clean it up, add comments, rename and reorganize some vars and add tests).
wolf@linuxvm:~/dev/py3k$ PYTHONIOENCODING=iso-8859-1 ./python
Python 3.2a0 (py3k:82643:82644M, Jul 9 2010, 11:39:25)
[GCC 4.4.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; sys.stdout.encoding, sys.stdin.encoding
('iso-8859-1', 'iso-8859-1')
>>> 'ascii string'
'ascii string' # works fine
>>> 'some accented chars: öäå'
'some accented chars: öäå' # works fine - these chars are encodable
>>> 'a snowman: \u2603'
'a snowman: \u2603' # non-encodable - the char is escaped instead of raising an error
>>> 'snowman: \u2603, and accented öäå'
'snowman: \u2603, and accented öäå' # only non-encodable chars are escaped
>>> # the behavior of print is still the same:
>>> print('some accented chars: öäå')
some accented chars: öäå
>>> print('a snowman: \u2603')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2603' in position 11: ordinal not in range(256)
-------------------------------------
While testing the patch with PYTHONIOENCODING=iso-8859-1 I also found this weird issue, which however is *not* related to the patch, since I managed to reproduce it on a clean py3k using PYTHONIOENCODING=iso-8859-1:
>>> 'òàùèì óáúéí öäüëï'
'ò�\xa0ùèì óáúé�\xad öäüëï'
>>> 'òàùèì óáúéí öäüëï'.encode('iso-8859-1')
b'\xc3\xb2\xc3\xa0\xc3\xb9\xc3\xa8\xc3\xac \xc3\xb3\xc3\xa1\xc3\xba\xc3\xa9\xc3\xad \xc3\xb6\xc3\xa4\xc3\xbc\xc3\xab\xc3\xaf'
>>> 'òàùèì'.encode('utf-8')
b'\xc3\x83\xc2\xb2\xc3\x83\xc2\xa0\xc3\x83\xc2\xb9\xc3\x83\xc2\xa8\xc3\x83\xc2\xac'
I think there might be some conflict between the IO encoding that I specified and the one that my terminal actually uses, but I couldn't figure out what's going on exactly (it is also weird that only 'à' and 'í' are not displayed correctly). Unless this behavior is expected I'll open another issue about it. |
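One possible explanation for the output above, offered only as a sketch: assume the terminal sends and expects UTF-8 while Python decodes stdin as iso-8859-1 (this assumption is not confirmed anywhere in the thread). Each accented letter then arrives as two Latin-1 characters, and repr() escapes only the ones that are not printable.

```python
import unicodedata

# Normalize so each accented letter is a single precomposed code point.
original = unicodedata.normalize('NFC', 'òàùèì óáúéí öäüëï')

# Assumed scenario: UTF-8 bytes from the terminal, decoded as iso-8859-1.
misread = original.encode('utf-8').decode('iso-8859-1')

# Characters of the misread string that repr() would escape.
escaped = [ch for ch in misread if not ch.isprintable()]

# Encoding the misread string as iso-8859-1 gives back the UTF-8 bytes.
round_trip = misread.encode('iso-8859-1')

print(escaped, round_trip == original.encode('utf-8'))
```

Under that assumption, the second UTF-8 byte of 'à' is 0xA0 (no-break space) and that of 'í' is 0xAD (soft hyphen). Both are non-printable in Latin-1, which would explain why only those two letters come out wrong.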
Assigning to myself so that I'll remember to finish and commit the patch. |
About bpo-9198.diff:
>>> b"a\xff".decode("ascii", "backslashreplace")
...
TypeError: don't know how to handle UnicodeDecodeError in error callback
>>> b"a\xff".decode("utf-8", "backslashreplace")
...
TypeError: don't know how to handle UnicodeDecodeError in error callback
Note: some encodings don't support backslashreplace, especially the mbcs encoding. But on Windows, sys.stdout.encoding is not mbcs but cpXXXX (e.g. cp850). |
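For comparison, the TypeError shown above reflects the interpreter under discussion. From Python 3.5 onwards the "backslashreplace" handler also works when decoding, so the same call returns an escaped string instead. A short check, guarded by a version test:

```python
import sys

data = b"a\xff"

if sys.version_info >= (3, 5):
    # The undecodable byte is turned into a \xNN escape in the result.
    decoded = data.decode("ascii", "backslashreplace")
else:
    # Older interpreters raise TypeError here, as in the transcript above.
    decoded = None

print(decoded)
```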
I created a new issue for Ezio's proposal to patch sys.displayhook: issue bpo-10601. To answer the initial question, "Should repr() print unicode characters outside the BMP?", Marc-Andre, Ezio and I agree that we should keep the current behavior for non-BMP chars (i.e. print them normally), so I am closing this issue. |