Issue2630
Created on 2008-04-14 09:54 by ishimoto, last changed 2008-05-08 17:19 by gvanrossum.
| Files | ||
|---|---|---|
| File name | Uploaded | Description |
| diff.txt | ishimoto, 2008-04-14 09:54 | |
| diff2.txt | ishimoto, 2008-04-15 12:19 | |
| diff3.txt | ishimoto, 2008-05-04 15:34 | |
| Messages | |||
|---|---|---|---|
| msg65461 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-04-14 09:54 | |
In py3k, repr() escapes non-ASCII characters in Unicode to \uXXXX, as in Python 2. This is an unpleasant feature if you are working with non-Latin characters. This issue was once discussed by Hye-Shik Chang[1], but was rejected. Here's a new attempt to fix the issue for Python 3. In this patch, repr() converts special ASCII characters such as "\t", "\r", "\n", but doesn't convert non-ASCII characters to \uXXXX form. Non-ASCII characters are converted by TextIOWrapper on printing. I set the 'errors' attribute of sys.stdout and sys.stderr to 'backslashreplace', so unprintable characters are converted to '\uXXXX' if your console cannot print such characters. This patch breaks five regression tests in my environment. I'll fix these tests if this patch is acceptable. [1] http://mail.python.org/pipermail/python-dev/2002-October/029443.html http://bugs.python.org/issue479898 |
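As a rough illustration of the mechanism described in this message (not the patch itself), the sketch below prints a repr() through a TextIOWrapper that only accepts ASCII and uses the 'backslashreplace' error handler. The in-memory buffer standing in for a limited console, the names `raw`, `console`, and `value`, and the sample string are assumptions made for the example. It relies on repr() leaving printable non-ASCII characters unescaped, which is how current Python 3 behaves.

```python
import io

# An in-memory byte buffer stands in for a console that only accepts ASCII.
raw = io.BytesIO()
console = io.TextIOWrapper(
    raw, encoding="ascii", errors="backslashreplace", newline="\n"
)

value = "\u65e5\u672c\u8a9e"  # three Japanese characters

# repr() keeps the characters; the stream escapes them while encoding.
print(repr(value), file=console)
console.flush()

written = raw.getvalue()
print(written)
```

The escaping happens in the stream layer, so the string returned by repr() itself still contains the original characters.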
|||
| msg65470 (view) | Author: Guido van Rossum (gvanrossum) | Date: 2008-04-14 18:12 | |
I think this has potential, but it is too liberal. There are many more characters that cannot be assumed printable, e.g. many of the Latin-1 characters in the range 0x80 through 0x9F. Isn't there some Unicode data table that shows code points that are safely printable? OTOH there are other potential use cases where it would be nice to see the \u escapes, e.g. when one is concerned about sequences that print the same but don't have the same content (e.g. pre-normalization). The backslashreplace trick is nice, I didn't even know about that. :-) |
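Both concerns can be checked from Python with the unicodedata module. The following is a small sketch, using today's Python 3 rather than anything available in the patch under review: `str.isprintable()` and `ascii()` postdate this message, and the sample characters are chosen only for illustration.

```python
import unicodedata

# U+0085 lies in the 0x80-0x9F range mentioned above; it is a control character.
c1_control = "\x85"
control_category = unicodedata.category(c1_control)
print(control_category, c1_control.isprintable())

# Two spellings of "e with acute accent" that render identically.
composed = "\u00e9"      # single precomposed code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent

print(composed == decomposed)              # different code point sequences
print(ascii(composed), ascii(decomposed))  # escapes make the difference visible
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Escaped output is what distinguishes the two spellings; printing them directly would show the same glyph twice.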
|||
| msg65483 (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) | Date: 2008-04-14 21:20 | |
What if we turn on the backslashreplace trick for some operations only? For example: sys_displayhook and sys_excepthook. |
|||
| msg65490 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-04-15 01:40 | |
> I think this has potential, but it is too liberal. There are many more
> characters that cannot be assumed printable, e.g. many of the Latin-1
> characters in the range 0x80 through 0x9F. Isn't there some Unicode
> data table that shows code points that are safely printable?
As Michael Urman pointed out, we can use Unicode properties.
Or we can define a set of non-printable characters (e.g.
sys.nonprintablechars).
> OTOH there are other potential use cases where it would be nice to see
> the \u escapes, e.g. when one is concerned about sequences that print
> the same but don't have the same content (e.g. pre-normalization).
For such cases, print(s.encode("ascii", "backslashreplace")) might work.
|
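A minimal sketch of the suggestion above, with a made-up sample string: encoding to ASCII with 'backslashreplace' yields a bytes object in which every non-ASCII character appears as an escape, and decoding it again gives an ASCII-only str.

```python
s = "caf\u00e9 \u65e5\u672c"  # Latin-1 and Japanese characters mixed with ASCII

# str -> bytes; unencodable characters become \xNN or \uNNNN escapes.
escaped_bytes = s.encode("ascii", "backslashreplace")
print(escaped_bytes)

# Optional second step if a str is wanted instead of bytes.
escaped_text = escaped_bytes.decode("ascii")
print(escaped_text)
```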
|||
| msg65491 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-04-15 01:48 | |
> What if we turn on the backslashreplace trick for some operations only?
> For example: sys_displayhook and sys_excepthook.
It would be difficult, since the *_repr() APIs don't know who the caller is.
|
|||
| msg65493 (view) | Author: Guido van Rossum (gvanrossum) | Date: 2008-04-15 03:10 | |
Atsuo: I missed Michael Urman's comment. Can you copy it here, or (better :-) write a patch that uses it? Amaury: I think it would be okay to use backslashreplace as the default error handler for sys.stderr. Probably not for sys.stdout or other files, since I'm sure many users prefer the errors when their data cannot be printed rather than silently writing \u escapes that might cause other code reading their output to choke. For sys.stderr though I think not having exceptions raised when attempting to print errors is very valuable. |
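The trade-off between the two error handlers can be sketched as follows. The helper `try_write` and the sample message are hypothetical, and an ASCII-only in-memory stream stands in for a terminal that cannot encode the text.

```python
import io

def try_write(text, errors):
    """Write text to an ASCII-only stream; return the bytes or the error name."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding="ascii", errors=errors, newline="\n")
    try:
        stream.write(text)
        stream.flush()
    except UnicodeEncodeError as exc:
        return type(exc).__name__
    return raw.getvalue()

message = "error in \u65e5\u672c\u8a9e.txt"

strict_result = try_write(message, "strict")             # fails loudly
lenient_result = try_write(message, "backslashreplace")  # always writes something

print(strict_result)
print(lenient_result)
```

With 'strict' the caller learns that the data could not be written; with 'backslashreplace' the message always gets out, which is the property argued for sys.stderr here.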
|||
| msg65494 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-04-15 03:35 | |
Okay, I'll revise the patch later today. |
|||
| msg65514 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-04-15 12:19 | |
I revised the patch against Python 3.0a4.
- As per Michael Urman's suggestion, unicode_repr() consults the Unicode database to determine which characters are hex-encoded.
- sys.stdout doesn't use 'backslashreplace'.
|
|||
| msg65535 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-04-16 00:33 | |
I think sys.stdout needs to have the backslashreplace error handler. Without backslashreplace, print(listOfJapaneseString) prints nothing and raises an exception instead. This is worse than Python 2. |
|||
| msg65536 (view) | Author: Guido van Rossum (gvanrossum) | Date: 2008-04-16 00:44 | |
I don't think this is a good idea; I've explained why earlier on this issue. |
|||
| msg65542 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-04-16 02:37 | |
Sorry, I forgot to write "for interactive sessions". I agree that sys.stdout and other files should not use backslashreplace by default, but for interactive sessions I think sys.stdout can have the backslashreplace handler to avoid exceptions. |
|||
| msg65564 (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-04-16 19:37 | |
While it may be desirable to have repr(unicode) return a non-ASCII string, the suggested approach is not suitable to solve the problem. repr() is usually used in logging and applications/users/tools don't expect to suddenly find non-ASCII or even mixed encodings in a log file. If you do want to have this more flexible, then make the encoding used by unicode_repr() adjustable, turn the existing code into a codec (e.g. "unicode-repr") and leave it set up as the default. Users who wish to see non-ASCII repr(unicode) data can then adjust the used encoding to their liking. This is both more flexible and backwards compatible with 2.x. Also note that the separation of the Unicode database from the interpreter core was done to keep the interpreter footprint manageable. It's not a good idea to just dump the complete table set into unicodeobject.c via an #include. If you need to reference APIs from modules in C, the usual approach is to create a PyCObject which is then exported by the module (see e.g. the datetime module) and imported by code needing it. BTW: "printable" is not a defined term in Unicode. What is or is not printable really depends on the use case, e.g. there are quite a few code points in Unicode that don't result in any glyph being "printed" to the screen. A Unicode string could then look as if it had fewer code points than it actually does - which is not really what you want when debugging code or sifting through log files. |
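The last point can be made concrete with a zero-width character. This is a small sketch with made-up sample strings; the repr() behaviour shown is that of current Python 3, which escapes characters it classifies as non-printable.

```python
import unicodedata

visible = "ab"
hidden = "a\u200bb"  # ZERO WIDTH SPACE between the two letters

# Printed raw, both would typically look like "ab", but the lengths differ.
print(len(visible), len(hidden))

# The general category marks U+200B as a format character, not a letter.
zwsp_category = unicodedata.category("\u200b")
print(zwsp_category)

# An escaped representation makes the extra code point visible.
print(repr(hidden))
```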
|||
| msg65573 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-04-17 05:37 | |
> If you do want to have this more flexible, then make the encoding used
> by unicode_repr() adjustable, turn the existing code into a codec (e.g.
> "unicode-repr") and leave it setup as default.
Turning the code in unicode_repr() into a codec is a good idea. I'll write two codecs (the existing repr and a new Unicode-friendly codec) and post a revised patch later.
|
|||
| msg65601 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-04-18 03:35 | |
Is a codec whose encode() returns Unicode allowed in Python 3? I've started to think a codec is not necessary, and a Python function is enough. |
|||
| msg65606 (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-04-18 08:46 | |
On 2008-04-18 05:35, atsuo ishimoto wrote:
> atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
>
> Is a codec whose encode() returns Unicode allowed in Python 3?
Sure, why not ?
I think you have to ask another question: Is repr() allowed to return a string (instead of Unicode) in Py3k ?
If not, then unicode_repr() will have to check the return value of the codec and convert it back to Unicode as necessary.
> I've started to think a codec is not necessary, and a Python function is enough.
That's what we currently have with unicode_repr(), but it doesn't solve the problem.
|
|||
| msg66216 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-05-04 15:34 | |
New patch against the current py3k branch. All the regression tests broken by my patch are now fixed, as far as I can run them. I also modified the doctest module a bit, so it should be reviewed by the module owners. |
|||
| msg66298 (view) | Author: Guido van Rossum (gvanrossum) | Date: 2008-05-05 22:07 | |
On Fri, Apr 18, 2008 at 1:46 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote:
> On 2008-04-18 05:35, atsuo ishimoto wrote:
> > atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
> >
> > Is a codec whose encode() returns Unicode allowed in Python 3?
>
> Sure, why not ?
Actually, it is not. In Py3k, x.encode() always requires x to be a str (i.e. unicode) instance and returns a bytes instance. y.decode() requires y to be a bytes instance and returns a str (i.e. unicode) instance.
> I think you have to ask another question: Is repr() allowed to
> return a string (instead of Unicode) in Py3k ?
In Py3k, "strings" *are* unicode. The str data type is Unicode. If you're asking about repr() possibly returning a bytes instance, definitely not.
> If not, then unicode_repr() will have to check the return value of
> the codec and convert it back to Unicode as necessary.
What codec?
> > I've started to think a codec is not necessary, and a Python function is enough.
>
> That's what we currently have with unicode_repr(), but it doesn't
> solve the problem.
I'm lost here.
PS. Atsuo's PEP has now been checked in as PEP 3138. Discussion should start soon on the python-3000 list.
|
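The typing rules stated in this message can be checked directly in any Python 3 interpreter. The variable names below are arbitrary.

```python
text = "repr"

data = text.encode("utf-8")       # str -> bytes
round_trip = data.decode("utf-8") # bytes -> str

print(type(data).__name__, type(round_trip).__name__)

# The methods exist only in one direction on each type.
print(hasattr(text, "decode"), hasattr(data, "encode"))

# repr() always returns str, never bytes.
print(type(repr(text)).__name__)
```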
|||
| msg66299 (view) | Author: Guido van Rossum (gvanrossum) | Date: 2008-05-05 22:17 | |
FWIW, I've uploaded diff3.txt to Rietveld: http://codereview.appspot.com/767 Code review comments should be reflected here. I had to skip the change to Modules/unicodename_db.h, which was too large for Rietveld to handle. |
|||
| msg66302 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-05-06 04:30 | |
I forgot to mention Modules/unicodename_db.h. The current unicodename_db.h looks like it was generated by an old Tools/unicode/makeunicodedata.py. This patch includes a newly generated unicodename_db.h, but we can exclude the change if it is not necessary. |
|||
| msg66303 (view) | Author: Guido van Rossum (gvanrossum) | Date: 2008-05-06 04:39 | |
No need to change anything, the diff is just too big for the code review tool (Rietveld), but since it consists only of numbers we don't need to review it anyway. :) |
|||
| msg66307 (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-05-06 08:26 | |
On 2008-05-06 00:07, Guido van Rossum wrote:
> Guido van Rossum <guido@python.org> added the comment:
>
> On Fri, Apr 18, 2008 at 1:46 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
>> On 2008-04-18 05:35, atsuo ishimoto wrote:
>> > atsuo ishimoto <ishimoto@users.sourceforge.net> added the comment:
>> >
>> > Is a codec whose encode() returns Unicode allowed in Python 3?
>>
>> Sure, why not ?
>
> Actually, it is not. In Py3k, x.encode() always requires x to be a str
> (i.e. unicode) instance and returns a bytes instance. y.decode()
> requires y to be a bytes instance and returns a str (i.e. unicode)
> instance.
So you've limited the codec design to just doing Unicode<->bytes conversions ?
The original codec design was to have the codec decide which types to take on input and to generate on output, e.g. to escape characters in Unicode (converting Unicode to Unicode), work on compressed 8-bit strings (converting 8-bit strings to 8-bit strings), etc.
>> I think you have to ask another question: Is repr() allowed to
>> return a string (instead of Unicode) in Py3k ?
>
> In Py3k, "strings" *are* unicode. The str data type is Unicode.
With "strings" I always refer to 8-bit strings, ie. 8-bit data that is encoded in some encoding.
> If you're asking about repr() possibly returning a bytes instance,
> definitely not.
>
>> If not, then unicode_repr() will have to check the return value of
>> the codec and convert it back to Unicode as necessary.
>
> What codec?
The idea is to have a codec which takes the Unicode object and converts it to its repr()-value.
Now, since you apparently cannot go the direct way anymore (ie. have the codec encode Unicode to Unicode), you'd have to first use a codec which converts the Unicode object to its repr()-value represented as bytes object and then convert the bytes object back to Unicode in unicode_repr().
With the original design, this extra step wouldn't have been necessary.
>> > I've started to think a codec is not necessary, and a Python function is enough.
>>
>> That's what we currently have with unicode_repr(), but it doesn't
>> solve the problem.
>
> I'm lost here.
See my previous replies on this ticket.
> PS. Atsuo's PEP has now been checked in as PEP 3138. Discussion should
> start soon on the python-3000 list.
|
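The two-step route described in this message (Unicode to bytes through a codec, then bytes back to Unicode) can be approximated in Python with the existing "unicode_escape" codec. This is only a sketch of the shape of the idea: the helper name `escaped_text` is hypothetical, and "unicode_escape" is not the "unicode-repr" codec proposed earlier in the thread (it adds no quotes, for instance).

```python
def escaped_text(text):
    """Escape text via a str->bytes codec, then convert the bytes back to str."""
    as_bytes = text.encode("unicode_escape")  # step 1: str -> bytes
    return as_bytes.decode("ascii")           # step 2: bytes -> str

sample = "a\n\u65e5"
result = escaped_text(sample)
print(result)
```

The second step is the extra conversion that a direct Unicode-to-Unicode codec would have avoided.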
|||
| msg66310 (view) | Author: atsuo ishimoto (ishimoto) | Date: 2008-05-06 11:43 | |
> No need to change anything, the diff is just too big for the code
> review tool (Rietveld), but since it consists only of numbers we don't
> need to review it anyway. :)
I wonder why unicodename_db.h has not been updated since makeunicodedata.py was modified. If the new makeunicodedata.py breaks something, I should remove the change to unicodename_db.h from this patch (my patch works whether unicodename_db.h is updated or not). I'll post a question to the python-3000 list.
|
|||
| msg66320 (view) | Author: Guido van Rossum (gvanrossum) | Date: 2008-05-06 17:10 | |
On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:
> So you've limited the codec design to just doing Unicode<->bytes
> conversions ?
Yes. This was quite a conscious decision that was not taken lightly, with lots of community input, quite a while ago.
> The original codec design was to have the codec decide which
> types to take on input and to generate on output, e.g. to
> escape characters in Unicode (converting Unicode to Unicode),
> work on compressed 8-bit strings (converting 8-bit strings to
> 8-bit strings), etc.
Unfortunately this design made it hard to reason about the correctness of code, since (especially in Py3k, where bytes and str are more different than str and unicode were in 2.x) it's hard to write code that uses .encode() or .decode() unless it knows which codec is being used.
IOW, when translated to 3.0, the design violates the general design principle that the *type* of a function's or method's return value should not depend on the *value* of one of the arguments.
> >> I think you have to ask another question: Is repr() allowed to
> >> return a string (instead of Unicode) in Py3k ?
> >
> > In Py3k, "strings" *are* unicode. The str data type is Unicode.
>
> With "strings" I always refer to 8-bit strings, ie. 8-bit data that
> is encoded in some encoding.
You will have to change this habit or you will thoroughly confuse both users and developers of 3.0. "String" refers to the built-in "str" type which in Py3k is PyUnicode. For the PyString type we use the built-in type "bytes".
> > If you're asking about repr() possibly returning a bytes instance,
> > definitely not.
> >
> >> If not, then unicode_repr() will have to check the return value of
> >> the codec and convert it back to Unicode as necessary.
> >
> > What codec?
>
> The idea is to have a codec which takes the Unicode object and
> converts it to its repr()-value.
>
> Now, since you apparently cannot
> go the direct way anymore (ie. have the codec encode Unicode to
> Unicode), you'd have to first use a codec which converts the Unicode
> object to its repr()-value represented as bytes object and then
> convert the bytes object back to Unicode in unicode_repr().
>
> With the original design, this extra step wouldn't have been
> necessary.
Why does everything have to be a codec?
|
|||
| msg66424 (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-05-08 17:15 | |
On 2008-05-06 19:10, Guido van Rossum wrote:
> Guido van Rossum <guido@python.org> added the comment:
>
> On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:
>> So you've limited the codec design to just doing Unicode<->bytes
>> conversions ?
>
> Yes. This was quite a conscious decision that was not taken lightly,
> with lots of community input, quite a while ago.
>
>> The original codec design was to have the codec decide which
>> types to take on input and to generate on output, e.g. to
>> escape characters in Unicode (converting Unicode to Unicode),
>> work on compressed 8-bit strings (converting 8-bit strings to
>> 8-bit strings), etc.
>
> Unfortunately this design made it hard to reason about the correctness
> of code, since (especially in Py3k, where bytes and str are more
> different than str and unicode were in 2.x) it's hard to write code
> that uses .encode() or .decode() unless it knows which codec is being
> used.
>
> IOW, when translated to 3.0, the design violates the general design
> principle that the *type* of a function's or method's return value
> should not depend on the *value* of one of the arguments.
I understand where this concept originates and usually apply this rule to software design as well. However, in the particular case of codecs, the codec registry and its helper functions are merely interfaces to code that is defined elsewhere.
In comparison, the approach is very much like getattr() - you know what the attribute is called, but know nothing about its type until you receive it from the function.
The reason codecs were designed like this was to be able to easily stack them. For this to work, only the interfaces need to be defined, without restricting the codecs too much in terms of which types may be used.
I'd suggest lifting the type restrictions from the general codecs.c access APIs (PyCodec_*), since they don't really belong there, and instead only imposing the limitation on the PyUnicode and PyString methods .encode() and .decode(). If you then also allow those methods to return *both* PyUnicode and PyString, you'd still have strong typing (only 1 of two possible types is allowed) and stacking streams or having codecs that work on PyUnicode->PyUnicode or PyString->PyString would still be accessible via .encode()/.decode().
>> >> I think you have to ask another question: Is repr() allowed to
>> >> return a string (instead of Unicode) in Py3k ?
>> >
>> > In Py3k, "strings" *are* unicode. The str data type is Unicode.
>>
>> With "strings" I always refer to 8-bit strings, ie. 8-bit data that
>> is encoded in some encoding.
>
> You will have to change this habit or you will thoroughly confuse both
> users and developers of 3.0. "String" refers to the built-in "str"
> type which in Py3k is PyUnicode. For the PyString type we use the
> built-in type "bytes".
Well, I'm confused by the P3k use of terms (esp. because the C type names don't match the Python ones), which is why I'm talking about 8-bit strings and Unicode. Perhaps it's better to use PyString and PyUnicode.
>> > If you're asking about repr() possibly returning a bytes instance,
>> > definitely not.
>> >
>> >> If not, then unicode_repr() will have to check the return value of
>> >> the codec and convert it back to Unicode as necessary.
>> >
>> > What codec?
>>
>> The idea is to have a codec which takes the Unicode object and
>> converts it to its repr()-value.
>>
>> Now, since you apparently cannot
>> go the direct way anymore (ie. have the codec encode Unicode to
>> Unicode), you'd have to first use a codec which converts the Unicode
>> object to its repr()-value represented as bytes object and then
>> convert the bytes object back to Unicode in unicode_repr().
>>
>> With the original design, this extra step wouldn't have been
>> necessary.
>
> Why does everything have to be a codec?
It doesn't. It's just that codecs are so easy to add, change and adjust that reusing the existing code is more attractive than reinventing the wheel every time you need to make a conversion from one text form to another adjustable in some way.
In the case addressed by this ticket, I see the usefulness of having native language being written to the console using native glyphs, but there are so many drawbacks to this (see the discussion on the ticket and the mailing list), that I think there needs to be a way to adjust the mechanism or at least be able to revert to the existing repr() output.
Furthermore, a codec implementation of what Atsuo has in mind would also be useful in other contexts, e.g. where you want to write PyUnicode to a stream without introducing line breaks.
|
|||
| msg66425 (view) | Author: Guido van Rossum (gvanrossum) | Date: 2008-05-08 17:19 | |
I'd be happy to have a separate more relaxed API for stackable codecs, however, the API should not be overloaded on the .encode() and .decode() methods on str and bytes objects. |
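For comparison, current CPython has this kind of split: the module-level functions codecs.encode() and codecs.decode() accept str-to-str and bytes-to-bytes codecs, while the .encode() method on str refuses them. The sketch below describes today's standard library (Python 3.4 and later), not anything decided in this thread, and the sample data is made up.

```python
import codecs

# str -> str transform through the general codec API.
rotated = codecs.encode("hello", "rot_13")
print(rotated)

# bytes -> bytes transform through the same API.
payload = b"aaaa" * 100
packed = codecs.encode(payload, "zlib_codec")
restored = codecs.decode(packed, "zlib_codec")
print(len(payload), len(packed), restored == payload)

# The str method stays restricted to text encodings (str -> bytes).
try:
    "hello".encode("rot_13")
except LookupError as exc:
    method_error = type(exc).__name__
    print(method_error)
```

Keeping the relaxed behaviour on the module functions means the return type of str.encode() and bytes.decode() never depends on the codec name.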
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2008-05-08 17:19:54 | gvanrossum | set | messages: + msg66425 |
| 2008-05-08 17:15:35 | lemburg | set | messages: + msg66424 |
| 2008-05-06 17:10:26 | gvanrossum | set | messages: + msg66320 |
| 2008-05-06 11:43:44 | ishimoto | set | messages: + msg66310 |
| 2008-05-06 08:26:35 | lemburg | set | messages: + msg66307 |
| 2008-05-06 04:39:17 | gvanrossum | set | messages: + msg66303 |
| 2008-05-06 04:30:29 | ishimoto | set | messages: + msg66302 |
| 2008-05-05 22:17:50 | gvanrossum | set | messages: + msg66299 |
| 2008-05-05 22:07:36 | gvanrossum | set | messages: + msg66298 |
| 2008-05-04 15:35:11 | ishimoto | set | files: + diff3.txt, messages: + msg66216 |
| 2008-04-18 08:46:11 | lemburg | set | messages: + msg65606 |
| 2008-04-18 03:35:41 | ishimoto | set | messages: + msg65601 |
| 2008-04-17 05:37:51 | ishimoto | set | messages: + msg65573 |
| 2008-04-16 19:37:38 | lemburg | set | nosy: + lemburg, messages: + msg65564 |
| 2008-04-16 02:37:15 | ishimoto | set | messages: + msg65542 |
| 2008-04-16 00:44:16 | gvanrossum | set | messages: + msg65536 |
| 2008-04-16 00:33:31 | ishimoto | set | messages: + msg65535 |
| 2008-04-15 12:19:56 | ishimoto | set | files: + diff2.txt, messages: + msg65514 |
| 2008-04-15 03:35:09 | ishimoto | set | messages: + msg65494 |
| 2008-04-15 03:10:13 | gvanrossum | set | messages: + msg65493 |
| 2008-04-15 01:48:46 | ishimoto | set | messages: + msg65491 |
| 2008-04-15 01:40:26 | ishimoto | set | messages: + msg65490 |
| 2008-04-14 21:20:11 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc, messages: + msg65483 |
| 2008-04-14 18:12:23 | gvanrossum | set | keywords: + patch, nosy: + gvanrossum, messages: + msg65470 |
| 2008-04-14 09:54:22 | ishimoto | create | |