classification
Title: repr() should not escape non-ASCII characters
Type: enhancement
Stage:
Components: None
Versions: Python 3.0
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, eric.smith, georg.brandl, gvanrossum, ishimoto, lemburg, pitrou
Priority: normal
Keywords: patch

Created on 2008-04-14 09:54 by ishimoto, last changed 2008-06-12 02:44 by ishimoto. This issue is now closed.

Files
File name Uploaded Description
diff.txt ishimoto, 2008-04-14 09:54
diff2.txt ishimoto, 2008-04-15 12:19
diff3.txt ishimoto, 2008-05-04 15:34
diff4.txt ishimoto, 2008-05-27 12:55
docdiff1.txt ishimoto, 2008-05-28 07:39
diff5.txt ishimoto, 2008-06-01 12:53
diff6.txt ishimoto, 2008-06-03 10:33
diff7_1.txt ishimoto, 2008-06-03 18:05
diff8.patch ishimoto, 2008-06-04 17:52
Messages (43)
msg65461 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-04-14 09:54
In py3k, repr() escapes non-ASCII characters in Unicode strings to \uXXXX,
as Python 2 does. This is an unpleasant feature if you are working with
non-Latin characters. This issue was once raised by Hye-Shik Chang[1], but
was rejected. Here's a new attempt to fix the issue for Python 3.

In this patch, repr() escapes special ASCII characters such as "\t",
"\r", "\n", but doesn't convert non-ASCII characters to \uXXXX form.
Non-ASCII characters are converted by TextIOWrapper on printing. I set
the 'errors' attribute of sys.stdout and sys.stderr to 'backslashreplace',
so unprintable characters are converted to '\uXXXX' if your console
cannot print such characters.
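
The mechanism described above can be sketched with a standalone io.TextIOWrapper. This is only an illustration, not code from the patch: the helper name write_to_ascii_console is made up, and an in-memory BytesIO stands in for a console that can only display ASCII.

```python
import io

def write_to_ascii_console(text, errors):
    """Hypothetical helper: push text through an ASCII-only text stream."""
    raw = io.BytesIO()  # stands in for the console's byte stream
    console = io.TextIOWrapper(raw, encoding="ascii", errors=errors)
    console.write(text)
    console.flush()
    return raw.getvalue()

japanese = "\u65e5\u672c\u8a9e"

# With 'backslashreplace', unencodable characters become \uXXXX escapes.
escaped = write_to_ascii_console(japanese, "backslashreplace")
print(escaped)

# With the default 'strict' handler, the same write raises instead.
try:
    write_to_ascii_console(japanese, "strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print("strict handler raised:", strict_failed)
```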

This patch breaks five regression tests in my environment.
I'll fix these tests if this patch is acceptable.

[1] http://mail.python.org/pipermail/python-dev/2002-October/029443.html
http://bugs.python.org/issue479898
msg65470 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-14 18:12
I think this has potential, but it is too liberal. There are many more
characters that cannot be assumed printable, e.g. many of the Latin-1
characters in the range 0x80 through 0x9F.  Isn't there some Unicode
data table that shows code points that are safely printable?

OTOH there are other potential use cases where it would be nice to see
the \u escapes, e.g. when one is concerned about sequences that print
the same but don't have the same content (e.g. pre-normalization).

The backslashreplace trick is nice, I didn't even know about that. :-)
msg65483 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-04-14 21:20
What if we turn on the backslashreplace trick for some operations only?
For example: sys_displayhook and sys_excepthook.
msg65490 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-04-15 01:40
>  I think this has potential, but it is too liberal. There are many more
>  characters that cannot be assumed printable, e.g. many of the Latin-1
>  characters in the range 0x80 through 0x9F.  Isn't there some Unicode
>  data table that shows code points that are safely printable?

As Michael Urman pointed out, we can use Unicode properties. 
Or we can define a set of non-printable characters (e.g.
sys.nonprintablechars).

>  OTOH there are other potential use cases where it would be nice to see
>  the \u escapes, e.g. when one is concerned about sequences that print
>  the same but don't have the same content (e.g. pre-normalization).

For such cases, print(s.encode("ascii", "backslashreplace")) might work.
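
A small sketch of that suggestion, applied to the pre-normalization case mentioned in the quote. The helper name escape_non_ascii is hypothetical; the encode() call and the 'backslashreplace' handler are the ones named above.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

def escape_non_ascii(s):
    """Hypothetical helper: show every non-ASCII character as an escape."""
    return s.encode("ascii", "backslashreplace").decode("ascii")

# Both strings usually render the same, but they are different sequences.
print(composed == decomposed)
print(escape_non_ascii(composed))
print(escape_non_ascii(decomposed))

# Normalizing to NFC makes the two forms compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)
```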
msg65491 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-04-15 01:48
>  What if we turn on the backslashreplace trick for some operations only?
>  For example: sys_displayhook and sys_excepthook.

It would be difficult, since the *_repr() APIs don't know who the caller is.
msg65493 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-15 03:10
Atsuo: I missed Michael Urman's comment.  Can you copy it here, or
(better :-) write a patch that uses it?

Amaury: I think it would be okay to use backslashreplace as the default
error handler for sys.stderr.  Probably not for sys.stdout or other
files, since I'm sure many users prefer the errors when their data
cannot be printed rather than silently writing \u escapes that might
cause other code reading their output to choke.  For sys.stderr though I
think not having exceptions raised when attempting to print errors is
very valuable.
msg65494 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-04-15 03:35
Okay, I'll revise the patch later today.
msg65514 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-04-15 12:19
I revised the patch against Python 3.0a4.

- As per Michael Urman's suggestion, unicode_repr()
  consults the Unicode database to determine which characters
  are hex-escaped.
  
- sys.stdout doesn't use 'backslashreplace'.
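
A rough Python sketch of a database-driven repr() follows. It is not the C code from the patch: the names is_printable_char and friendly_repr are invented for illustration, and the rule used (the "Other" and "Separator" categories are unprintable, except ASCII space) is the one quoted from the isprintable() documentation in the review further down this page.

```python
import unicodedata

def is_printable_char(ch):
    """Hypothetical helper: everything is printable except the "Other" (C*)
    and "Separator" (Z*) categories, with ASCII space allowed."""
    if ch == " ":
        return True
    return unicodedata.category(ch)[0] not in ("C", "Z")

# Short escapes that are always used for these characters.
SIMPLE_ESCAPES = {"\\": "\\\\", "'": "\\'", "\t": "\\t", "\n": "\\n", "\r": "\\r"}

def friendly_repr(s):
    """Hypothetical repr(): escape only what the rule above rejects."""
    parts = []
    for ch in s:
        cp = ord(ch)
        if ch in SIMPLE_ESCAPES:
            parts.append(SIMPLE_ESCAPES[ch])
        elif is_printable_char(ch):
            parts.append(ch)                  # keep printable non-ASCII as-is
        elif cp < 0x100:
            parts.append("\\x%02x" % cp)
        elif cp < 0x10000:
            parts.append("\\u%04x" % cp)
        else:
            parts.append("\\U%08x" % cp)
    return "'" + "".join(parts) + "'"

# Japanese text survives; the tab and the C1 control U+0085 are escaped.
print(friendly_repr("\u65e5\u672c\u8a9e\t\x85"))
```

The C1 controls in the range 0x80 through 0x9F raised earlier in the thread all have category Cc, so this rule escapes them.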
msg65535 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-04-16 00:33
I think sys.stdout needs to have the backslashreplace error handler.
Without backslashreplace, print(listOfJapaneseString) prints nothing
and raises an exception. This is worse than Python 2.
msg65536 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-04-16 00:44
I don't think this is a good idea; I've explained why earlier on this issue.
msg65542 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-04-16 02:37
Sorry, I forgot to write "for interactive sessions".
I agree that sys.stdout and other files should not use
backslashreplace by default, but for interactive sessions, I think
sys.stdout can use the backslashreplace handler to avoid exceptions.
msg65564 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-16 19:37
While it may be desirable to have repr(unicode) return a non-ASCII
string, the suggested approach is not suitable to solve the problem.

repr() is usually used in logging and applications/users/tools don't
expect to suddenly find non-ASCII or even mixed encodings in a log file.

If you do want to have this more flexible, then make the encoding used
by unicode_repr() adjustable, turn the existing code into a codec (e.g.
"unicode-repr") and leave it set up as the default.

Users who wish to see non-ASCII repr(unicode) data can then adjust the
used encoding to their liking.

This is both more flexible and backwards compatible with 2.x.

Also note that the separation of the Unicode database from the
interpreter core was done to keep the interpreter footprint manageable.
It's not a good idea to just dump the complete table set into
unicodeobject.c via an #include. If you need to reference APIs from
modules in C, the usual approach is to create a PyCObject which is then
exported by the module (see e.g. the datetime module) and imported by
code needing it.

BTW: "printable" is not a defined term in Unicode. What is or is not
printable really depends on the use case, e.g. there are quite a few
code points in Unicode that don't result in any glyph being "printed" to
the screen. A Unicode string could then look as if it had fewer code
points than it actually does - which is not really what you want when
debugging code or sifting through log files.
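
The concern can be seen with a zero-width character. The sketch below assumes a current Python 3, where str.isprintable() and the repr() behaviour that came out of this issue are available; it shows which invisible code points end up escaped and which do not.

```python
zero_width = "a\u200bb"    # "a", ZERO WIDTH SPACE (category Cf), "b"
combining = "e\u0301"      # "e" plus COMBINING ACUTE ACCENT (category Mn)

# Three code points, although most terminals draw only two glyphs.
print(len(zero_width))

# Format characters count as non-printable, so repr() keeps them visible.
print(zero_width.isprintable())
print(repr(zero_width))

# Combining marks are treated as printable and are not escaped, so the
# code point count can still differ from the number of glyphs shown.
print(len(combining))
print(combining.isprintable())
```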
msg65573 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-04-17 05:37
>  If you do want to have this more flexible, then make the encoding used
>  by unicode_repr() adjustable, turn the existing code into a codec (e.g.
>  "unicode-repr") and leave it setup as default.

Turning the code in unicode_repr() into a codec is a good idea. I'll write
two codecs (the existing repr and a new Unicode-friendly codec) and post a
revised patch later.
msg65601 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-04-18 03:35
Is a codec whose encode() returns Unicode allowed in Python 3? I'm
starting to think a codec is not necessary and a Python function is enough.
msg65606 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-04-18 08:46
On 2008-04-18 05:35, atsuo ishimoto wrote:
> atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
> 
> Is a codec which encode() returns an Unicode allowed in Python3?

Sure, why not ?

I think you have to ask another question: Is repr() allowed to
return a string (instead of Unicode) in Py3k ?

If not, then unicode_repr() will have to check the return value of
the codec and convert it back to Unicode as necessary.

> I started to think codec is not necessary, but python function is enough.

That's what we currently have with unicode_repr(), but it doesn't
solve the problem.
msg66216 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-05-04 15:34
New patch against the current py3k branch.

All the regression tests that failed with my patch are now fixed, as far
as I can run them.
I also modified the doctest module a bit, so it should be reviewed
by the module owners.
msg66298 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-05-05 22:07
On Fri, Apr 18, 2008 at 1:46 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
> On 2008-04-18 05:35, atsuo ishimoto wrote:
>  > atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
>  >
>  > Is a codec which encode() returns an Unicode allowed in Python3?
>
>  Sure, why not ?

Actually, it is not. In Py3k, x.encode() always requires x to be a str
(i.e. unicode) instance and returns a bytes instance. y.decode()
requires y to be a bytes instance and returns a str (i.e. unicode)
instance.
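
The rule can be checked directly in a Python 3 interpreter. This is a minimal sketch; the variable names are only for illustration.

```python
text = "\u65e5\u672c\u8a9e"

# str.encode() always produces bytes ...
data = text.encode("utf-8")
print(type(data).__name__)

# ... and bytes.decode() always produces str.
round_trip = data.decode("utf-8")
print(type(round_trip).__name__)

# The methods only exist in one direction on each type.
print(hasattr(bytes, "encode"), hasattr(str, "decode"))

# repr() returns str, never bytes.
print(type(repr(text)).__name__)
```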

>  I think you have to ask another question: Is repr() allowed to
>  return a string (instead of Unicode) in Py3k ?

In Py3k, "strings" *are* unicode. The str data type is Unicode.

If you're asking about repr() possibly returning a bytes instance,
definitely not.

>  If not, then unicode_repr() will have to check the return value of
>  the codec and convert it back to Unicode as necessary.

What codec?

>  > I started to think codec is not necessary, but python function is enough.
>
>  That's what we currently have with unicode_repr(), but it doesn't
>  solve the problem.

I'm lost here.

PS. Atsuo's PEP has now been checked in as PEP 3138. Discussion should
start soon on the python-3000 list.
msg66299 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-05-05 22:17
FWIW, I've uploaded diff3.txt to Rietveld:
http://codereview.appspot.com/767

Code review comments should be reflected here.

I had to skip the change to Modules/unicodename_db.h, which was too
large for Rietveld to handle.
msg66302 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-05-06 04:30
I forgot to mention Modules/unicodename_db.h.

The current unicodename_db.h looks like it was generated
by an old Tools/unicode/makeunicodedata.py. This patch
includes a newly generated unicodename_db.h, but we
can exclude the change if it is not necessary.
msg66303 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-05-06 04:39
No need to change anything, the diff is just too big for the code
review tool (Rietveld), but since it consists only of numbers we don't
need to review it anyway. :)
msg66307 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-06 08:26
On 2008-05-06 00:07, Guido van Rossum wrote:
> Guido van Rossum <guido@python.org> added the comment:
> 
> On Fri, Apr 18, 2008 at 1:46 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
>> On 2008-04-18 05:35, atsuo ishimoto wrote:
>>  > atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
>>  >
>>  > Is a codec which encode() returns an Unicode allowed in Python3?
>>
>>  Sure, why not ?
> 
> Actually, it is not. In Py3k, x.encode() always requires x to be a str
> (i.e. unicode) instance and return a bytes instance. y.decode()
> requires y to be a bytes instance and returns a str (i.e. unicode)
> instance.

So you've limited the codec design to just doing Unicode<->bytes
conversions ?

The original codec design was to have the codec decide which
types to take on input and to generate on output, e.g. to
escape characters in Unicode (converting Unicode to Unicode),
work on compressed 8-bit strings (converting 8-bit strings to
8-bit strings), etc.

>>  I think you have to ask another question: Is repr() allowed to
>>  return a string (instead of Unicode) in Py3k ?
> 
> In Py3k, "strings" *are* unicode. The str data type is Unicode.

With "strings" I always refer to 8-bit strings, ie. 8-bit data that
is encoded in some encoding.

> If you're asking about repr() possibly returning a bytes instance,
> definitely not.
> 
>>  If not, then unicode_repr() will have to check the return value of
>>  the codec and convert it back to Unicode as necessary.
> 
> What codec?

The idea is to have a codec which takes the Unicode object and
converts it to its repr()-value.

Now, since you apparently cannot
go the direct way anymore (ie. have the codec encode Unicode to
Unicode), you'd have to first use a codec which converts the Unicode
object to its repr()-value represented as bytes object and then
convert the bytes object back to Unicode in unicode_repr().

With the original design, this extra step wouldn't have been
necessary.

>>  > I started to think codec is not necessary, but python function is enough.
>>
>>  That's what we currently have with unicode_repr(), but it doesn't
>>  solve the problem.
> 
> I'm lost here.

See my previous replies on this ticket.

> PS. Atsuo's PEP has now been checked in as PEP 3138. Discussion should
> start soon on the python-3000 list.
msg66310 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-05-06 11:43
>  No need to change anything, the diff is just too big for the code
>  review tool (Rietveld), but since it consists only of numbers we don't
>  need to review it anyway. :)

I wonder why unicodename_db.h has not been updated since
makeunicodedata.py was modified. If the new makeunicodedata.py
breaks something, I should remove the change to unicodename_db.h
from this patch (my patch works whether unicodename_db.h is
updated or not). I'll post a question to the python-3000 list.
msg66320 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-05-06 17:10
On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:
>  So you've limited the codec design to just doing Unicode<->bytes
>  conversions ?

Yes. This was quite a conscious decision that was not taken lightly,
with lots of community input, quite a while ago.

>  The original codec design was to have the codec decide which
>  types to take on input and to generate on output, e.g. to
>  escape characters in Unicode (converting Unicode to Unicode),
>  work on compressed 8-bit strings (converting 8-bit strings to
>  8-bit strings), etc.

Unfortunately this design made it hard to reason about the correctness
of code, since (especially in Py3k, where bytes and str are more
different than str and unicode were in 2.x) it's hard to write code
that uses .encode() or .decode() unless it knows which codec is being
used.

IOW, when translated to 3.0, the design violates the general design
principle that the *type* of a function's or method's return value
should not depend on the *value* of one of the arguments.

>  >>  I think you have to ask another question: Is repr() allowed to
>  >>  return a string (instead of Unicode) in Py3k ?
>  >
>  > In Py3k, "strings" *are* unicode. The str data type is Unicode.
>
>  With "strings" I always refer to 8-bit strings, ie. 8-bit data that
>  is encoded in some encoding.

You will have to change this habit or you will thoroughly confuse both
users and developers of 3.0. "String" refers to the built-in "str"
type which in Py3k is PyUnicode. For the PyString type we use the
built-in type "bytes".

>  > If you're asking about repr() possibly returning a bytes instance,
>  > definitely not.
>  >
>  >>  If not, then unicode_repr() will have to check the return value of
>  >>  the codec and convert it back to Unicode as necessary.
>  >
>  > What codec?
>
>  The idea is to have a codec which takes the Unicode object and
>  converts it to its repr()-value.
>
>  Now, since you apparently cannot
>  go the direct way anymore (ie. have the codec encode Unicode to
>  Unicode), you'd have to first use a codec which converts the Unicode
>  object to its repr()-value represented as bytes object and then
>  convert the bytes object back to Unicode in unicode_repr().
>
>  With the original design, this extra step wouldn't have been
>  necessary.

Why does everything have to be a codec?
msg66424 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-05-08 17:15
On 2008-05-06 19:10, Guido van Rossum wrote:
> Guido van Rossum <guido@python.org> added the comment:
> 
> On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:
>>  So you've limited the codec design to just doing Unicode<->bytes
>>  conversions ?
> 
> Yes. This was quite a conscious decision that was not taken lightly,
> with lots of community input, quite a while ago.
> 
>>  The original codec design was to have the codec decide which
>>  types to take on input and to generate on output, e.g. to
>>  escape characters in Unicode (converting Unicode to Unicode),
>>  work on compressed 8-bit strings (converting 8-bit strings to
>>  8-bit strings), etc.
> 
> Unfortunately this design made it hard to reason about the correctness
> of code, since (especially in Py3k, where bytes and str are more
> different than str and unicode were in 2.x) it's hard to write code
> that uses .encode() or .decode() unless it knows which codec is being
> used.
> 
> IOW, when translated to 3.0, the design violates the general design
> principle that the *type* of a function's or method's return value
> should not depend on the *value* of one of the arguments.

I understand where this concept originates and usually apply this
rule to software design as well, however, in the particular case
of codecs, the codec registry and its helper functions are merely
interfaces to code that is defined elsewhere.

In comparison, the approach is very much like getattr() - you know
what the attribute is called, but know nothing about its type
until you receive it from the function.

The reason codecs were designed like this was to be able to
easily stack them. For this to work, only the interfaces need
to be defined, without restricting the codecs too much in terms
of which types may be used.

I'd suggest to lift the type restrictions from the general
codecs.c access APIs (PyCodec_*), since they don't really belong
there and instead only impose the limitation on PyUnicode and
PyString methods .encode() and .decode().

If you then also allow those methods to return *both*
PyUnicode and PyString, you'd still have strong typing
(only 1 of two possible types is allowed) and stacking
streams or having codecs that work on PyUnicode->PyUnicode
or PyString->PyString would still be accessible via
.encode()/.decode().

>>  >>  I think you have to ask another question: Is repr() allowed to
>>  >>  return a string (instead of Unicode) in Py3k ?
>>  >
>>  > In Py3k, "strings" *are* unicode. The str data type is Unicode.
>>
>>  With "strings" I always refer to 8-bit strings, ie. 8-bit data that
>>  is encoded in some encoding.
> 
> You will have to change this habit or you will thoroughly confuse both
> users and developers of 3.0. "String" refers to the built-in "str"
> type which in Py3k is PyUnicode. For the PyString type we use the
> built-in type "bytes".

Well, I'm confused by the P3k use of terms (esp. because the
C type names don't match the Python ones), which is why I'm
talking about 8-bit strings and Unicode.

Perhaps it's better to use PyString and PyUnicode.

>>  > If you're asking about repr() possibly returning a bytes instance,
>>  > definitely not.
>>  >
>>  >>  If not, then unicode_repr() will have to check the return value of
>>  >>  the codec and convert it back to Unicode as necessary.
>>  >
>>  > What codec?
>>
>>  The idea is to have a codec which takes the Unicode object and
>>  converts it to its repr()-value.
>>
>>  Now, since you apparently cannot
>>  go the direct way anymore (ie. have the codec encode Unicode to
>>  Unicode), you'd have to first use a codec which converts the Unicode
>>  object to its repr()-value represented as bytes object and then
>>  convert the bytes object back to Unicode in unicode_repr().
>>
>>  With the original design, this extra step wouldn't have been
>>  necessary.
> 
> Why does everything have to be a codec?

It doesn't. It's just that codecs are so easy to add, change
and adjust that reusing the existing code is more attractive
than reinventing the wheel every time you need to make
a conversion from one text form to another adjustable in
some way.

In the case addressed by this ticket, I see the usefulness
of having native language being written to the console using
native glyphs, but there are so many drawbacks to this (see the
discussion on the ticket and the mailing list), that
I think there needs to be a way to adjust the mechanism
or at least be able to revert to the existing repr() output.

Furthermore, a codec implementation of what Atsuo has in mind
would also be useful in other contexts, e.g. where you want
to write PyUnicode to a stream without introducing line breaks.
msg66425 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-05-08 17:19
I'd be happy to have a separate more relaxed API for stackable codecs,
however, the API should not be overloaded on the .encode() and .decode()
methods on str and bytes objects.
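
For illustration only: in current Python 3 releases (3.4 and later, as far as I know) the split looks like the sketch below. The codecs module functions accept str-to-str and bytes-to-bytes codecs, while the str.encode() method rejects them. This shows present-day behaviour and is not something decided in this thread.

```python
import codecs

# str -> str codec, reachable through the codecs module functions.
rotated = codecs.encode("abc", "rot_13")
print(rotated)
print(codecs.decode(rotated, "rot_13"))

# bytes -> bytes codec through the same functions.
hexed = codecs.encode(b"abc", "hex_codec")
print(hexed)

# The str.encode() method stays limited to text encodings (str -> bytes).
try:
    "abc".encode("rot_13")
    method_rejected = False
except LookupError:
    method_rejected = True
print("str.encode() rejected rot_13:", method_rejected)
```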
msg67409 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-05-27 12:55
I updated the patch as per the latest PEP.

- io.TextIOWrapper doesn't provide an API to change the error handler
  at this time. I should update this patch after the API is
  provided.

- This patch contains a fix for Tools/unicode/makeunicodedata.py 
  in rev 63378.
msg67439 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-05-28 07:39
docdiff1.txt contains documentation for the functions I added.
msg67591 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-06-01 12:53
diff5.txt contains both the code and documentation patches for PEP 3138.

- In this patch, the default error handler of sys.stdout is always 'strict'.
msg67651 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-06-03 10:13
Review:

* Why is an empty string not printable? In any case, the empty string
should be among the test cases for isprintable().

* Why not use PyUnicode_DecodeASCII instead of
PyUnicode_FromEncodedObject? It should be a bit faster.

* If old-style string formatting gets "%a", .format() must get a "!a"
specifier.

* The ascii() and repr() tests should be expanded so that both test the
same set of objects, and the expected differences. Are there tests for
failing cases?

* This is just "return ascii" (in builtin_ascii):
+	if (ascii == NULL)
+	    return NULL;
+
+	return ascii;

* For PyBool_FromLong(1) and PyBool_FromLong(0) there is Py_RETURN_TRUE
and Py_RETURN_FALSE. (You're not to blame, the rest of unicodeobject.c
seems to use them too, probably a legacy.)

* There appear to be some space indentations in tab-indented files like
bltinmodule.c and vice versa (unicodeobject.c).

* C docs/isprintable() docs: The spec
+   Characters defined in the Unicode character database as "Other"
+   or "Separator" other than ASCII space(0x20) are not considered
+   printable.
is unclear, better say "All characters except those ... are considered
printable".

* ascii() docs: 
+   the non-ASCII
+   characters in the string returned by :func:`ascii`() are hex-escaped
+   to generate a same string as :func:`repr` in Python 2.

should be

"the non-ASCII characters in the string returned by :func:`repr` are
backslash-escaped (with ``\x``, ``\u`` or ``\U``) to generate ...".

* makeunicodedata: len(list(n for n in names if n is not None)) could
better be expressed as sum(1 for n in names if n is not None).

Otherwise, the patch is fine IMO. (I'm surprised that only so few tests
needed adaptation, that's a sign that we're not testing Unicode enough.)
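
The pieces discussed in this review can be tried together in a current Python 3. This is a small sketch, assuming the features landed as described in PEP 3138; the sample string is arbitrary.

```python
sample = "caf\u00e9\n"

# An empty string is printable; a string containing a newline is not.
print("".isprintable())
print(sample.isprintable())

# repr() keeps printable non-ASCII characters, ascii() escapes them.
print(repr(sample))
print(ascii(sample))

# The "!a" conversion and the old-style "%a" format both apply ascii().
new_style = "{!a}".format(sample)
old_style = "%a" % sample
print(new_style == ascii(sample), old_style == ascii(sample))
```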
msg67653 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-06-03 10:31
One more thing: with r63891 the encoding and errors arguments for the
creation of sys.stderr were made configurable; you'll have to adapt the
patch so that it defaults to backslashreplace but can be overridden by
PYTHONIOENCODING.
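
A sketch of how PYTHONIOENCODING interacts with the standard streams, using a child interpreter. The helper name run_child is made up, and the "encoding:errors" value follows the documented format of the variable. The check on sys.stderr reflects current CPython behaviour as I understand it, not necessarily the state of the code at r63891.

```python
import os
import subprocess
import sys

def run_child(code, io_encoding):
    """Hypothetical helper: run code in a child Python with PYTHONIOENCODING set."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.strip(), result.stderr.strip()

# The child prints a non-ASCII string to stdout and reports stderr's handler.
child_code = (
    'import sys; '
    'print("caf\\u00e9"); '
    'print(sys.stderr.errors, file=sys.stderr)'
)
out, err = run_child(child_code, "ascii:backslashreplace")
print(out)   # stdout was forced to ASCII with backslashreplace
print(err)   # error handler in effect for sys.stderr
```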
msg67654 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-06-03 10:33
This patch contains the following changes.

- Added the new C API PyObject_ASCII() for consistency.
- Added the new string formatting operator for str.format() and
PyUnicode_FromFormat.
msg67655 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-06-03 11:00
Thank you for your review!
I filed a new patch just before I saw your comments.

On Tue, Jun 3, 2008 at 7:13 PM, Georg Brandl <report@bugs.python.org> wrote:
>
> Georg Brandl <georg@python.org> added the comment:
>
> Review:
>
> * Why is an empty string not printable? In any case, the empty string
> should be among the test cases for isprintable().

Well, my intuition, which came from str.islower(), was wrong. An empty
string is printable, of course.

> * Why not use PyUnicode_DecodeASCII instead of
> PyUnicode_FromEncodedObject? It should be a bit faster.
>

Okay, thank you.

> * If old-style string formatting gets "%a", .format() must get a "!a"
> specifier.
>
I added the format string in my latest patch.

> * The ascii() and repr() tests should be expanded so that both test the
> same set of objects, and the expected differences. Are there tests for
> failing cases?
>

Okay, thank you.

> * This is just "return ascii" (in builtin_ascii):
> +       if (ascii == NULL)
> +           return NULL;
> +
> +       return ascii;

Fixed in my latest patch.

>
> * For PyBool_FromLong(1) and PyBool_FromLong(0) there is Py_RETURN_TRUE
> and Py_RETURN_FALSE. (You're not to blame, the rest of unicodeobject.c
> seems to use them too, probably a legacy.)

Okay, thank you.

>
> * There appear to be some space indentations in tab-indented files like
> bltinmodule.c and vice versa (unicodeobject.c).
>

I think bltinmodule.c is fixed in the latest patch, but I don't know what
the correct indentation for unicodeobject.c is. I guess the latest patch is
acceptable.

> * C docs/isprintable() docs: The spec
> +   Characters defined in the Unicode character database as "Other"
> +   or "Separator" other than ASCII space(0x20) are not considered
> +   printable.
> is unclear, better say "All character except those ... are considered
> printable".
>
> * ascii() docs:
> +   the non-ASCII
> +   characters in the string returned by :func:`ascii`() are hex-escaped
> +   to generate a same string as :func:`repr` in Python 2.
>
> should be
>
> "the non-ASCII characters in the string returned by :func:`repr` are
> backslash-escaped (with ``\x``, ``\u`` or ``\U``) to generate ...".
>

Okay, thank you.

> * makeunicodedata: len(list(n for n in names if n is not None)) could
> better be expressed as sum(1 for n in names if n is not None).

I don't want to change this, because it is a reversion of rev 63378.

> One more thing: with r63891 the encoding and errors arguments for the
> creation of sys.stderr were made configurable; you'll have to adapt the
> patch so that it defaults to backslashescape but can be overridden by
> PYTHONIOENCODING.

I think sys.stderr should always default to 'backslashreplace'. I'll
post a message to the Py3k list later.

>
> Otherwise, the patch is fine IMO. (I'm surprised that only so few tests
> needed adaptation, that's a sign that we're not testing Unicode enough.)
>

Thank you very much! I'll file a new patch soon.
msg67656 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-06-03 11:06
BTW, should the new C APIs and functions be backported to Python 2.6 for
compatibility, without modifying repr() itself? If so, I'll prepare a
patch for Python 2.6.
msg67657 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-06-03 11:10
ascii() should probably be in future_builtins.

Whether the C API stuff and .isprintable() should be backported to 2.6
is something for Guido to decide.
msg67665 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-06-03 17:50
I updated the patch as per Georg's advice.
msg67667 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-06-03 18:05
I'm sorry, I left a file out of the upload. diff7_1.txt is the correct file.
msg67670 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2008-06-03 18:48
> Whether the C API stuff and .isprintable() should be backported to 2.6
> is something for Guido to decide.

No way -- while all of this makes sense in Py3k, where all strings are
Unicode, it would cause no end of problems in 2.6, and it would break
backward compatibility badly.
msg67692 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-06-04 17:52
stringlib can be compiled for Python 2.6 now, but the '!a' converter is
disabled by #ifdef for now.
msg67702 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-04 21:30
Shall the method be called isprintable() or simply printable()? For the
record, in the io classes, the writable()/readable() convention was chosen.
msg67704 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-06-04 21:34
I would expect "abc".isprintable() to give me a bool and "abc".printable()
to return a printable string, as with "abc".lower() and "abc".islower().
msg67705 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-04 21:36
You are right, I had forgotten about lower()/islower().
msg68008 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-06-11 18:38
Patch committed to Py3k branch in r64138. Thanks all!
msg68047 - (view) Author: Atsuo Ishimoto (ishimoto) * Date: 2008-06-12 02:44
Great, thank you!
History
Date User Action Args
2008-06-12 02:44:45 ishimoto set messages: + msg68047
2008-06-11 18:38:55 georg.brandl set status: open -> closed; resolution: accepted; messages: + msg68008
2008-06-04 21:36:49 pitrou set messages: + msg67705
2008-06-04 21:34:42 georg.brandl set messages: + msg67704
2008-06-04 21:30:58 pitrou set nosy: + pitrou; messages: + msg67702
2008-06-04 17:52:51 ishimoto set files: + diff8.patch; messages: + msg67692
2008-06-03 19:06:26 eric.smith set nosy: + eric.smith
2008-06-03 18:48:04 gvanrossum set messages: + msg67670
2008-06-03 18:05:15 ishimoto set files: + diff7_1.txt; messages: + msg67667
2008-06-03 17:57:35 ishimoto set files: - diff7.txt
2008-06-03 17:50:20 ishimoto set files: + diff7.txt; messages: + msg67665
2008-06-03 11:10:19 georg.brandl set messages: + msg67657
2008-06-03 11:06:49 ishimoto set messages: + msg67656
2008-06-03 11:00:51 ishimoto set messages: + msg67655
2008-06-03 10:33:40 ishimoto set files: + diff6.txt; messages: + msg67654
2008-06-03 10:31:23 georg.brandl set messages: + msg67653
2008-06-03 10:13:53 georg.brandl set nosy: + georg.brandl; messages: + msg67651
2008-06-01 12:53:46 ishimoto set files: + diff5.txt; messages: + msg67591
2008-05-28 07:39:38 ishimoto set files: + docdiff1.txt; messages: + msg67439
2008-05-27 12:55:58 ishimoto set files: + diff4.txt; messages: + msg67409
2008-05-08 17:19:54 gvanrossum set messages: + msg66425
2008-05-08 17:15:35 lemburg set messages: + msg66424
2008-05-06 17:10:26 gvanrossum set messages: + msg66320
2008-05-06 11:43:44 ishimoto set messages: + msg66310
2008-05-06 08:26:35 lemburg set messages: + msg66307
2008-05-06 04:39:17 gvanrossum set messages: + msg66303
2008-05-06 04:30:29 ishimoto set messages: + msg66302
2008-05-05 22:17:50 gvanrossum set messages: + msg66299
2008-05-05 22:07:36 gvanrossum set messages: + msg66298
2008-05-04 15:35:11 ishimoto set files: + diff3.txt; messages: + msg66216
2008-04-18 08:46:11 lemburg set messages: + msg65606
2008-04-18 03:35:41 ishimoto set messages: + msg65601
2008-04-17 05:37:51 ishimoto set messages: + msg65573
2008-04-16 19:37:38 lemburg set nosy: + lemburg; messages: + msg65564
2008-04-16 02:37:15 ishimoto set messages: + msg65542
2008-04-16 00:44:16 gvanrossum set messages: + msg65536
2008-04-16 00:33:31 ishimoto set messages: + msg65535
2008-04-15 12:19:56 ishimoto set files: + diff2.txt; messages: + msg65514
2008-04-15 03:35:09 ishimoto set messages: + msg65494
2008-04-15 03:10:13 gvanrossum set messages: + msg65493
2008-04-15 01:48:46 ishimoto set messages: + msg65491
2008-04-15 01:40:26 ishimoto set messages: + msg65490
2008-04-14 21:20:11 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg65483
2008-04-14 18:12:23 gvanrossum set keywords: + patch; nosy: + gvanrossum; messages: + msg65470
2008-04-14 09:54:22 ishimoto create