Author ezio.melotti
Recipients amaury.forgeotdarc, ezio.melotti, lemburg
Date 2010-07-08.10:24:23
Message-id <1278584665.89.0.440340057357.issue9198@psf.upfronthosting.co.za>
In-reply-to
Content
Regarding the fonts, I think that whoever actually uses or needs to use characters outside the BMP probably has (now, or will have in a few months/years) a font able to display them.
I also tried to print the printable chars from U+FFFF to U+1FFFF on my Linux terminal, and about half of them were rendered correctly (the default font is DejaVu Sans Mono).
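
A minimal sketch of such a test, assuming Python 3 and taking str.isprintable() as the definition of "printable" (both are assumptions; the helper name printable_chars is hypothetical):

```python
def printable_chars(start=0xFFFF, stop=0x1FFFF):
    """Return the printable characters in the inclusive range start..stop.

    "Printable" here means str.isprintable(), which is an assumption
    about what the test in the message above checked.
    """
    return [chr(cp) for cp in range(start, stop + 1) if chr(cp).isprintable()]
```

Running print("".join(printable_chars())) on a terminal with a UTF-8 encoding shows which of these the current font can actually render; the proportion will vary from font to font.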

The question is then whether we do more harm by hiding these chars behind escape sequences for the people who use them, or by hiding the escape sequences behind boxes for the people who don't use them and don't have the right font.


Regarding the categories that should be considered printable, I agree that the Zx categories could be considered printable, so the non-printable chars could be limited to the Cx categories.
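
The rule proposed here can be sketched with the unicodedata module, assuming Python 3. The function name is hypothetical, and this illustrates the proposal only, not what repr() or str.isprintable() actually implement:

```python
import unicodedata

def is_printable_proposed(ch):
    """Proposed rule: only the Cx general categories are non-printable.

    Cx covers Cc (control), Cf (format), Cs (surrogate), Co (private use)
    and Cn (unassigned); everything else, including Zs/Zl/Zp, is printable.
    """
    return not unicodedata.category(ch).startswith("C")
```

Under this rule a space separator such as U+00A0 (category Zs) counts as printable, while a control character such as "\n" (category Cc) does not.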

> Since we can't apply this check based on a per character basis,
> I think we should only allow non-ASCII code points to pass through
> if sys.stdout/sys.stderr is set to utf-8, utf-16 or utf-32.

If I understood correctly, you are suggesting to look at the sys.stdout/sys.stderr encoding and:
 * if it's a UTF-* encoding: allow all the non-ASCII (printable) codepoints (because they are the only encodings that can represent all the Unicode characters);
 * if it's not a UTF-* encoding: allow only ASCII (printable) codepoints.
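
As a rough sketch of that policy, assuming Python 3 (the function names are hypothetical, and the handling of quotes and backslashes that a real repr() needs is left out):

```python
import codecs

def is_utf_encoding(encoding):
    """True for the utf-8, utf-16 and utf-32 families (normalized codec names)."""
    name = codecs.lookup(encoding).name
    return name.startswith(("utf-8", "utf-16", "utf-32"))

def escape_by_encoding(s, encoding):
    """Escape characters according to the encoding-based policy.

    UTF-* encoding: keep every printable character.
    Any other encoding: keep only printable ASCII characters.
    """
    allow_non_ascii = is_utf_encoding(encoding)
    out = []
    for ch in s:
        if ch.isprintable() and (allow_non_ascii or ord(ch) < 128):
            out.append(ch)
        else:
            # unicode_escape yields \n, \xNN, \uNNNN or \UNNNNNNNN as needed
            out.append(ch.encode("unicode_escape").decode("ascii"))
    return "".join(out)
```

With this policy, escape_by_encoding("caf\xe9", "cp1252") escapes the "é" even though cp1252 can encode it.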

This would, however, introduce a regression. For example, on Windows (where the encoding is usually not a UTF-* one) I would expect accented characters (at least the ones in the codepage I'm using -- and usually it matches the native language of the user) to be displayed correctly.
A more accurate approach would be to actually try to encode the string and escape only the chars that can't be encoded (and also the ones that are not printable, of course). This can't be done in repr(), however, because repr() returns a Unicode string (in #5110 I did it in sys.displayhook), and encoding the string there would mean doing it twice.
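
The "try to encode" idea could look roughly like the sketch below, assuming Python 3. The function name is hypothetical, and this is not the code from #5110:

```python
def escape_unencodable(s, encoding):
    """Escape only non-printable chars and chars that `encoding` can't represent."""
    out = []
    for ch in s:
        if ch.isprintable():
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                pass  # not encodable: fall through to the escape below
            else:
                out.append(ch)
                continue
        out.append(ch.encode("unicode_escape").decode("ascii"))
    return "".join(out)
```

For the unencodable characters alone, the built-in "backslashreplace" error handler gives a similar result in one step, e.g. s.encode(encoding, "backslashreplace").decode(encoding); it does not escape non-printable characters that the encoding can represent.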

Also note that I might want to use repr() to get a representation of the object without necessarily sending it through sys.stdout. For example, I could write it to a file or send it via mail (Roundup reports errors via mail, showing a repr of the variables), and in both cases I might use/want UTF-8 even if sys.stdout is ASCII.
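
A small illustration of this point, assuming Python 3 (the helper name is hypothetical): the repr is written to a file with an explicitly chosen encoding, so the encoding of sys.stdout never comes into play.

```python
def write_repr(obj, path, encoding="utf-8"):
    """Write repr(obj) to `path` using `encoding`, independent of sys.stdout."""
    with open(path, "w", encoding=encoding) as f:
        f.write(repr(obj))
```

If repr() escaped non-ASCII characters based on sys.stdout's encoding, the file would contain escape sequences even though UTF-8 could have stored the characters directly.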