Classification
Title: repr() should not escape non-ASCII characters
Type: feature request
Components: None
Versions: Python 3.0
Process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, gvanrossum, ishimoto, lemburg
Priority:
Keywords: patch

Created on 2008-04-14 09:54 by ishimoto, last changed 2008-05-08 17:19 by gvanrossum.

Files
File name Uploaded Description
diff.txt ishimoto, 2008-04-14 09:54
diff2.txt ishimoto, 2008-04-15 12:19
diff3.txt ishimoto, 2008-05-04 15:34
Messages
msg65461 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-04-14 09:54
In py3k, repr() escapes non-ASCII characters in Unicode strings to \uXXXX,
as Python 2 does. This is an unpleasant feature if you are working with
non-Latin characters. This issue was once raised by Hye-Shik Chang[1], but
the proposal was rejected. Here's a new attempt to fix the issue for Python 3.

In this patch, repr() escapes special ASCII characters such as "\t",
"\r", "\n", but doesn't convert non-ASCII characters to \uXXXX form.
Non-ASCII characters are converted by TextIOWrapper on printing. I set the
'errors' attribute of sys.stdout and sys.stderr to 'backslashreplace', so
unprintable characters are converted to '\uXXXX' if your console
cannot print such characters.

This patch breaks five regression tests in my environment.
I'll fix these tests if this patch is acceptable.

[1] http://mail.python.org/pipermail/python-dev/2002-October/029443.html
http://bugs.python.org/issue479898
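
A rough illustration of the mechanism described in this message, not the patch itself: text is written through an io.TextIOWrapper whose error handler is 'backslashreplace', with ASCII standing in for a console that cannot display the characters. The helper name write_with_handler and the sample string are made up for this sketch.

```python
import io


def write_with_handler(text, encoding="ascii", errors="backslashreplace"):
    """Write text through a TextIOWrapper and return the bytes it produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, errors=errors)
    stream.write(text)
    stream.flush()  # push buffered text down into the BytesIO object
    return raw.getvalue()


# Hypothetical sample: "cafe" with an accented e, followed by three kanji.
sample = "caf\u00e9 \u65e5\u672c\u8a9e"
print(write_with_handler(sample))
```

In this sketch, code points below U+0100 come out as \xNN rather than \uXXXX; only larger code points use the \uXXXX form.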
msg65470 (view) Author: Guido van Rossum (gvanrossum) Date: 2008-04-14 18:12
I think this has potential, but it is too liberal. There are many more
characters that cannot be assumed printable, e.g. many of the Latin-1
characters in the range 0x80 through 0x9F.  Isn't there some Unicode
data table that shows code points that are safely printable?

OTOH there are other potential use cases where it would be nice to see
the \u escapes, e.g. when one is concerned about sequences that print
the same but don't have the same content (e.g. pre-normalization).

The backslashreplace trick is nice, I didn't even know about that. :-)
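
Both concerns in this message can be checked with the standard unicodedata module. The sketch below is only an illustration: it looks up the general category of the code points 0x80 through 0x9F, then builds two strings that usually render identically but differ in content. It uses the ascii() builtin available in current Python 3 to make the difference visible.

```python
import unicodedata

# General categories of the code points 0x80..0x9F (the C1 control range).
c1_categories = {unicodedata.category(chr(cp)) for cp in range(0x80, 0xA0)}
print(c1_categories)

# Two strings that normally display the same glyph but differ in content.
composed = "\u00e9"      # precomposed LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
print(composed == decomposed)
print(ascii(composed), ascii(decomposed))

# After NFC normalization the two compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
```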
msg65483 (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) Date: 2008-04-14 21:20
What if we turn on the backslashreplace trick for some operations only?
For example: sys_displayhook and sys_excepthook.
msg65490 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-04-15 01:40
>  I think this has potential, but it is too liberal. There are many more
>  characters that cannot be assumed printable, e.g. many of the Latin-1
>  characters in the range 0x80 through 0x9F.  Isn't there some Unicode
>  data table that shows code points that are safely printable?

As Michael Urman pointed out, we can use Unicode properties. 
Or we can define a set of non-printable characters (e.g.
sys.nonprintablechars).

>  OTOH there are other potential use cases where it would be nice to see
>  the \u escapes, e.g. when one is concerned about sequences that print
>  the same but don't have the same content (e.g. pre-normalization).

For such cases, print(s.encode("ascii", "backslashreplace")) might work.
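
A small sketch of that suggestion, using a made-up decomposed string as input. In Python 3 the result of encode() is a bytes object, so printing it directly shows the bytes repr (with doubled backslashes); decoding it back to str gives the plain escaped text.

```python
s = "e\u0301"  # hypothetical input: "e" plus a combining acute accent

escaped = s.encode("ascii", "backslashreplace")
print(escaped)                  # the bytes repr
print(escaped.decode("ascii"))  # the escaped text as a str
```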
msg65491 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-04-15 01:48
>  What if we turn on the backslashreplace trick for some operations only?
>  For example: sys_displayhook and sys_excepthook.

It would be difficult, since the *_repr() APIs don't know who the caller is.
msg65493 (view) Author: Guido van Rossum (gvanrossum) Date: 2008-04-15 03:10
Atsuo: I missed Michael Urman's comment.  Can you copy it here, or
(better :-) write a patch that uses it?

Amaury: I think it would be okay to use backslashreplace as the default
error handler for sys.stderr.  Probably not for sys.stdout or other
files, since I'm sure many users prefer the errors when their data
cannot be printed rather than silently writing \u escapes that might
cause other code reading their output to choke.  For sys.stderr though I
think not having exceptions raised when attempting to print errors is
very valuable.
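
The trade-off described here can be seen by comparing the 'strict' and 'backslashreplace' error handlers on a stream that cannot encode the text. This is a sketch with an ASCII-encoded in-memory stream standing in for the real sys.stdout or sys.stderr; try_write is a hypothetical helper name.

```python
import io


def try_write(text, errors):
    """Return the bytes written, or the exception name if encoding fails."""
    stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii", errors=errors)
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return type(exc).__name__
    return stream.buffer.getvalue()


print(try_write("\u65e5", "strict"))            # the caller sees an error
print(try_write("\u65e5", "backslashreplace"))  # an escape is written silently
```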
msg65494 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-04-15 03:35
Okay, I'll revise the patch later today.
msg65514 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-04-15 12:19
I revised the patch against Python 3.0a4.

- As per Michael Urman's suggestion, unicode_repr()
  consults the Unicode database to determine which characters
  are hex-encoded.
  
- sys.stdout doesn't use 'backslashreplace'.
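
A pure-Python sketch of the idea behind the revised patch, not the C code in unicode_repr(): each character's general category is looked up in the Unicode database, and only characters in categories treated as non-printable are hex-encoded. The set of categories, the function names, and the fixed single-quote delimiter are assumptions made for illustration.

```python
import unicodedata

# Assumption for this sketch: "Other" and "Separator" categories are escaped.
NON_PRINTABLE_CATEGORIES = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp", "Zs"}
SHORT_ESCAPES = {"\t": "\\t", "\n": "\\n", "\r": "\\r", "\\": "\\\\", "'": "\\'"}


def escape_char(ch):
    """Return ch unchanged if it is considered printable, else an escape."""
    if ch == " ":
        return ch  # the ASCII space is category Zs but stays as-is
    if ch in SHORT_ESCAPES:
        return SHORT_ESCAPES[ch]
    if unicodedata.category(ch) in NON_PRINTABLE_CATEGORIES:
        cp = ord(ch)
        if cp < 0x100:
            return "\\x%02x" % cp
        if cp < 0x10000:
            return "\\u%04x" % cp
        return "\\U%08x" % cp
    return ch


def friendly_repr(s):
    """Simplified repr-like output that always uses single quotes."""
    return "'" + "".join(escape_char(ch) for ch in s) + "'"


print(friendly_repr("a\x85\u65e5\u672c\tb"))
```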
msg65535 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-04-16 00:33
I think sys.stdout needs to have the backslashreplace error handler.
Without backslashreplace, print(listOfJapaneseString) prints nothing
and raises an exception. This is worse than Python 2.
msg65536 (view) Author: Guido van Rossum (gvanrossum) Date: 2008-04-16 00:44
I don't think this is a good idea; I've explained why earlier on this issue.
msg65542 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-04-16 02:37
Sorry, I forgot to write "for interactive sessions".
I agree that sys.stdout and other files should not have backslashreplace
as the default, but for an interactive session, I think sys.stdout can
use the backslashreplace handler to avoid exceptions.
msg65564 (view) Author: Marc-Andre Lemburg (lemburg) Date: 2008-04-16 19:37
While it may be desirable to have repr(unicode) return a non-ASCII
string, the suggested approach is not suitable to solve the problem.

repr() is usually used in logging and applications/users/tools don't
expect to suddenly find non-ASCII or even mixed encodings in a log file.

If you do want to make this more flexible, then make the encoding used
by unicode_repr() adjustable, turn the existing code into a codec (e.g.
"unicode-repr") and leave it set up as the default.

Users who wish to see non-ASCII repr(unicode) data can then adjust the
used encoding to their liking.

This is both more flexible and backwards compatible with 2.x.

Also note that the separation of the Unicode database from the
interpreter core was done to keep the interpreter footprint manageable.
It's not a good idea to just dump the complete table set into
unicodeobject.c via an #include. If you need to reference APIs from
modules in C, the usual approach is to create a PyCObject which is then
exported by the module (see e.g. the datetime module) and imported by
code needing it.

BTW: "printable" is not a defined term in Unicode. What is or is not
printable really depends on the use case, e.g. there are quite a few
code points in Unicode that don't result in any glyph being "printed" to
the screen. A Unicode string could then look as if it had fewer code
points than it actually does - which is not really what you want when
debugging code or sifting through log files.
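
The last point can be illustrated with a zero-width character. In the sketch below, which uses a made-up sample string, the text usually displays as two letters although it contains three code points; an escaped form makes the extra code point visible.

```python
import unicodedata

s = "a\u200bb"  # ZERO WIDTH SPACE between "a" and "b"

print(len(s))                     # number of code points in the string
print(ascii(s))                   # escaped form exposes the hidden character
print(unicodedata.name(s[1]))     # official name of the invisible code point
print(unicodedata.category(s[1])) # its general category
```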
msg65573 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-04-17 05:37
>  If you do want to have this more flexible, then make the encoding used
>  by unicode_repr() adjustable, turn the existing code into a codec (e.g.
>  "unicode-repr") and leave it setup as default.

Turning the code in unicode_repr() into a codec is a good idea. I'll write two
codecs (the existing repr and a new Unicode-friendly codec) and post a revised
patch later.
msg65601 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-04-18 03:35
Is a codec whose encode() returns a Unicode object allowed in Python 3? I have
started to think that a codec is not necessary and that a Python function is enough.
msg65606 (view) Author: Marc-Andre Lemburg (lemburg) Date: 2008-04-18 08:46
On 2008-04-18 05:35, atsuo ishimoto wrote:
> atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
> 
> Is a codec which encode() returns an Unicode allowed in Python3?

Sure, why not ?

I think you have to ask another question: Is repr() allowed to
return a string (instead of Unicode) in Py3k ?

If not, then unicode_repr() will have to check the return value of
the codec and convert it back to Unicode as necessary.

> I started to think codec is not nessesary, but python function is enough.

That's what we currently have with unicode_repr(), but it doesn't
solve the problem.
msg66216 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-05-04 15:34
New patch against the current py3k branch.

All the regression tests that my patch broke are now fixed, as far as I
can run them.
I also modified the doctest module a bit, so that change should be reviewed
by the module owners.
msg66298 (view) Author: Guido van Rossum (gvanrossum) Date: 2008-05-05 22:07
On Fri, Apr 18, 2008 at 1:46 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
> On 2008-04-18 05:35, atsuo ishimoto wrote:
>  > atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
>  >
>  > Is a codec which encode() returns an Unicode allowed in Python3?
>
>  Sure, why not ?

Actually, it is not. In Py3k, x.encode() always requires x to be a str
(i.e. unicode) instance and returns a bytes instance. y.decode()
requires y to be a bytes instance and returns a str (i.e. unicode)
instance.
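
A quick check of the rule stated above, as it behaves in current Python 3 (the sample text is arbitrary): str.encode() yields bytes, bytes.decode() yields str, and neither type offers the opposite method.

```python
text = "\u65e5\u672c"               # a str (Unicode) instance
data = text.encode("utf-8")         # str -> bytes
round_trip = data.decode("utf-8")   # bytes -> str

print(type(data).__name__, type(round_trip).__name__)

# str has no .decode() and bytes has no .encode() in Python 3.
print(hasattr(text, "decode"), hasattr(data, "encode"))
```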

>  I think you have to ask another question: Is repr() allowed to
>  return a string (instead of Unicode) in Py3k ?

In Py3k, "strings" *are* unicode. The str data type is Unicode.

If you're asking about repr() possibly returning a bytes instance,
definitely not.

>  If not, then unicode_repr() will have to check the return value of
>  the codec and convert it back to Unicode as necessary.

What codec?

>  > I started to think codec is not nessesary, but python function is enough.
>
>  That's what we currently have with unicode_repr(), but it doesn't
>  solve the problem.

I'm lost here.

PS. Atsuo's PEP has now been checked in as PEP 3138. Discussion should
start soon on the python-3000 list.
msg66299 (view) Author: Guido van Rossum (gvanrossum) Date: 2008-05-05 22:17
FWIW, I've uploaded diff3.txt to Rietveld:
http://codereview.appspot.com/767

Code review comments should be reflected here.

I had to skip the change to Modules/unicodename_db.h, which was too
large for Rietveld to handle.
msg66302 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-05-06 04:30
I forgot to mention Modules/unicodename_db.h.

The current unicodename_db.h looks as if it was generated
by an old Tools/unicode/makeunicodedata.py. This patch
includes a newly generated unicodename_db.h, but we
can exclude the change if it is not necessary.
msg66303 (view) Author: Guido van Rossum (gvanrossum) Date: 2008-05-06 04:39
No need to change anything, the diff is just too big for the code
review tool (Rietveld), but since it consists only of numbers we don't
need to review it anyway. :)
msg66307 (view) Author: Marc-Andre Lemburg (lemburg) Date: 2008-05-06 08:26
On 2008-05-06 00:07, Guido van Rossum wrote:
> Guido van Rossum <guido@python.org> added the comment:
> 
> On Fri, Apr 18, 2008 at 1:46 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
>> On 2008-04-18 05:35, atsuo ishimoto wrote:
>>  > atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
>>  >
>>  > Is a codec which encode() returns an Unicode allowed in Python3?
>>
>>  Sure, why not ?
> 
> Actually, it is not. In Py3k, x.encode() always requires x to be a str
> (i.e. unicode) instance and return a bytes instance. y.decode()
> requires y to be a bytes instance and returns a str (i.e. unicode)
> instance.

So you've limited the codec design to just doing Unicode<->bytes
conversions ?

The original codec design was to have the codec decide which
types to take on input and to generate on output, e.g. to
escape characters in Unicode (converting Unicode to Unicode),
work on compressed 8-bit strings (converting 8-bit strings to
8-bit strings), etc.

>>  I think you have to ask another question: Is repr() allowed to
>>  return a string (instead of Unicode) in Py3k ?
> 
> In Py3k, "strings" *are* unicode. The str data type is Unicode.

With "strings" I always refer to 8-bit strings, ie. 8-bit data that
is encoded in some encoding.

> If you're asking about repr() possibly returning a bytes instance,
> definitely not.
> 
>>  If not, then unicode_repr() will have to check the return value of
>>  the codec and convert it back to Unicode as necessary.
> 
> What codec?

The idea is to have a codec which takes the Unicode object and
converts it to its repr()-value.

Now, since you apparently cannot
go the direct way anymore (ie. have the codec encode Unicode to
Unicode), you'd have to first use a codec which converts the Unicode
object to its repr()-value represented as bytes object and then
convert the bytes object back to Unicode in unicode_repr().

With the original design, this extra step wouldn't have been
necessary.

>>  > I started to think codec is not nessesary, but python function is enough.
>>
>>  That's what we currently have with unicode_repr(), but it doesn't
>>  solve the problem.
> 
> I'm lost here.

See my previous replies on this ticket.

> PS. Atsuo's PEP has now been checked in as PEP 3138. Discussion should
> start soon on the python-3000 list.
msg66310 (view) Author: atsuo ishimoto (ishimoto) Date: 2008-05-06 11:43
>  No need to change anything, the diff is just too big for the code
>  review tool (Rietveld), but since it consists only of numbers we don't
>  need to review it anyway. :)

I wonder why unicodename_db.h has not been updated since
makeunicodedata.py was modified. If the new makeunicodedata.py
breaks something, I should remove the change to unicodename_db.h
from this patch (my patch works whether or not unicodename_db.h is
updated). I'll post a question to the python-3000 list.
msg66320 (view) Author: Guido van Rossum (gvanrossum) Date: 2008-05-06 17:10
On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:
>  So you've limited the codec design to just doing Unicode<->bytes
>  conversions ?

Yes. This was quite a conscious decision that was not taken lightly,
with lots of community input, quite a while ago.

>  The original codec design was to have the codec decide which
>  types to take on input and to generate on output, e.g. to
>  escape characters in Unicode (converting Unicode to Unicode),
>  work on compressed 8-bit strings (converting 8-bit strings to
>  8-bit strings), etc.

Unfortunately this design made it hard to reason about the correctness
of code, since (especially in Py3k, where bytes and str are more
different than str and unicode were in 2.x) it's hard to write code
that uses .encode() or .decode() unless it knows which codec is being
used.

IOW, when translated to 3.0, the design violates the general design
principle that the *type* of a function's or method's return value
should not depend on the *value* of one of the arguments.

>  >>  I think you have to ask another question: Is repr() allowed to
>  >>  return a string (instead of Unicode) in Py3k ?
>  >
>  > In Py3k, "strings" *are* unicode. The str data type is Unicode.
>
>  With "strings" I always refer to 8-bit strings, ie. 8-bit data that
>  is encoded in some encoding.

You will have to change this habit or you will thoroughly confuse both
users and developers of 3.0. "String" refers to the built-in "str"
type which in Py3k is PyUnicode. For the PyString type we use the
built-in type "bytes".

>  > If you're asking about repr() possibly returning a bytes instance,
>  > definitely not.
>  >
>  >>  If not, then unicode_repr() will have to check the return value of
>  >>  the codec and convert it back to Unicode as necessary.
>  >
>  > What codec?
>
>  The idea is to have a codec which takes the Unicode object and
>  converts it to its repr()-value.
>
>  Now, since you apparently cannot
>  go the direct way anymore (ie. have the codec encode Unicode to
>  Unicode), you'd have to first use a codec which converts the Unicode
>  object to its repr()-value represented as bytes object and then
>  convert the bytes object back to Unicode in unicode_repr().
>
>  With the original design, this extra step wouldn't have been
>  necessary.

Why does everything have to be a codec?
msg66424 (view) Author: Marc-Andre Lemburg (lemburg) Date: 2008-05-08 17:15
On 2008-05-06 19:10, Guido van Rossum wrote:
> Guido van Rossum <guido@python.org> added the comment:
> 
> On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:
>>  So you've limited the codec design to just doing Unicode<->bytes
>>  conversions ?
> 
> Yes. This was quite a conscious decision that was not taken lightly,
> with lots of community input, quite a while ago.
> 
>>  The original codec design was to have the codec decide which
>>  types to take on input and to generate on output, e.g. to
>>  escape characters in Unicode (converting Unicode to Unicode),
>>  work on compressed 8-bit strings (converting 8-bit strings to
>>  8-bit strings), etc.
> 
> Unfortunately this design made it hard to reason about the correctness
> of code, since (especially in Py3k, where bytes and str are more
> different than str and unicode were in 2.x) it's hard to write code
> that uses .encode() or .decode() unless it knows which codec is being
> used.
> 
> IOW, when translated to 3.0, the design violates the general design
> principle that the *type* of a function's or method's return value
> should not depend on the *value* of one of the arguments.

I understand where this concept originates and usually apply this
rule to software design as well, however, in the particular case
of codecs, the codec registry and its helper functions are merely
interfaces to code that is defined elsewhere.

In comparison, the approach is very much like getattr() - you know
what the attribute is called, but know nothing about its type
until you receive it from the function.

The reason codecs were designed like this was to be able to
easily stack them. For this to work, only the interfaces need
to be defined, without restricting the codecs too much in terms
of which types may be used.

I'd suggest to lift the type restrictions from the general
codecs.c access APIs (PyCodec_*), since they don't really belong
there and instead only impose the limitation on PyUnicode and
PyString methods .encode() and .decode().

If you then also allow those methods to return *both*
PyUnicode and PyString, you'd still have strong typing
(only 1 of two possible types is allowed) and stacking
streams or having codecs that work on PyUnicode->PyUnicode
or PyString->PyString would still be accessible via
.encode()/.decode().

>>  >>  I think you have to ask another question: Is repr() allowed to
>>  >>  return a string (instead of Unicode) in Py3k ?
>>  >
>>  > In Py3k, "strings" *are* unicode. The str data type is Unicode.
>>
>>  With "strings" I always refer to 8-bit strings, ie. 8-bit data that
>>  is encoded in some encoding.
> 
> You will have to change this habit or you will thoroughly confuse both
> users and developers of 3.0. "String" refers to the built-in "str"
> type which in Py3k is PyUnicode. For the PyString type we use the
> built-in type "bytes".

Well, I'm confused by the P3k use of terms (esp. because the
C type names don't match the Python ones), which is why I'm
talking about 8-bit strings and Unicode.

Perhaps it's better to use PyString and PyUnicode.

>>  > If you're asking about repr() possibly returning a bytes instance,
>>  > definitely not.
>>  >
>>  >>  If not, then unicode_repr() will have to check the return value of
>>  >>  the codec and convert it back to Unicode as necessary.
>>  >
>>  > What codec?
>>
>>  The idea is to have a codec which takes the Unicode object and
>>  converts it to its repr()-value.
>>
>>  Now, since you apparently cannot
>>  go the direct way anymore (ie. have the codec encode Unicode to
>>  Unicode), you'd have to first use a codec which converts the Unicode
>>  object to its repr()-value represented as bytes object and then
>>  convert the bytes object back to Unicode in unicode_repr().
>>
>>  With the original design, this extra step wouldn't have been
>>  necessary.
> 
> Why does everything have to be a codec?

It doesn't. It's just that codecs are so easy to add, change
and adjust that reusing the existing code is more attractive
than reinventing the wheel every time you need to make
a conversion from one text form to another adjustable in
some way.

In the case addressed by this ticket, I see the usefulness
of having native language being written to the console using
native glyphs, but there are so many drawbacks to this (see the
discussion on the ticket and the mailing list), that
I think there needs to be a way to adjust the mechanism
or at least be able to revert to the existing repr() output.

Furthermore, a codec implementation of what Atsuo has in mind
would also be useful in other contexts, e.g. where you want
to write PyUnicode to a stream without introducing line breaks.
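
For comparison, the existing "unicode_escape" codec already behaves somewhat like this in current Python 3: it turns a str into ASCII-only bytes with no raw line breaks. The sketch below is not the codec proposed in this message, only an illustration of the same kind of use, and the sample text is made up.

```python
text = "line one\nline two \u65e5"

# Escapes newlines and non-ASCII characters; the result is a single line.
escaped = text.encode("unicode_escape")
print(escaped)

# Decoding with the same codec restores the original text.
restored = escaped.decode("unicode_escape")
print(restored == text)
```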
msg66425 (view) Author: Guido van Rossum (gvanrossum) Date: 2008-05-08 17:19
I'd be happy to have a separate more relaxed API for stackable codecs,
however, the API should not be overloaded on the .encode() and .decode()
methods on str and bytes objects.
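
For readers on current Python 3: the module-level functions codecs.encode() and codecs.decode() accept codecs that map str to str or bytes to bytes, while the .encode() method on str rejects such codecs. The sketch below shows that behaviour with the standard "rot_13" and "hex_codec" codecs; it describes how things work today, not what was decided in this thread.

```python
import codecs

# str -> str transform through the module-level API.
rot = codecs.encode("hello", "rot_13")
print(rot)

# bytes -> bytes transform and its inverse.
hexed = codecs.encode(b"hello", "hex_codec")
print(hexed, codecs.decode(hexed, "hex_codec"))

# The str method only accepts text encodings (str -> bytes).
try:
    "hello".encode("rot_13")
    str_encode_rejected = False
except LookupError:
    str_encode_rejected = True
print(str_encode_rejected)
```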
History
Date                 User               Action  Args
2008-05-08 17:19:54  gvanrossum         set     messages: + msg66425
2008-05-08 17:15:35  lemburg            set     messages: + msg66424
2008-05-06 17:10:26  gvanrossum         set     messages: + msg66320
2008-05-06 11:43:44  ishimoto           set     messages: + msg66310
2008-05-06 08:26:35  lemburg            set     messages: + msg66307
2008-05-06 04:39:17  gvanrossum         set     messages: + msg66303
2008-05-06 04:30:29  ishimoto           set     messages: + msg66302
2008-05-05 22:17:50  gvanrossum         set     messages: + msg66299
2008-05-05 22:07:36  gvanrossum         set     messages: + msg66298
2008-05-04 15:35:11  ishimoto           set     files: + diff3.txt; messages: + msg66216
2008-04-18 08:46:11  lemburg            set     messages: + msg65606
2008-04-18 03:35:41  ishimoto           set     messages: + msg65601
2008-04-17 05:37:51  ishimoto           set     messages: + msg65573
2008-04-16 19:37:38  lemburg            set     nosy: + lemburg; messages: + msg65564
2008-04-16 02:37:15  ishimoto           set     messages: + msg65542
2008-04-16 00:44:16  gvanrossum         set     messages: + msg65536
2008-04-16 00:33:31  ishimoto           set     messages: + msg65535
2008-04-15 12:19:56  ishimoto           set     files: + diff2.txt; messages: + msg65514
2008-04-15 03:35:09  ishimoto           set     messages: + msg65494
2008-04-15 03:10:13  gvanrossum         set     messages: + msg65493
2008-04-15 01:48:46  ishimoto           set     messages: + msg65491
2008-04-15 01:40:26  ishimoto           set     messages: + msg65490
2008-04-14 21:20:11  amaury.forgeotdarc set     nosy: + amaury.forgeotdarc; messages: + msg65483
2008-04-14 18:12:23  gvanrossum         set     keywords: + patch; nosy: + gvanrossum; messages: + msg65470
2008-04-14 09:54:22  ishimoto           create