Message 66320 - Python tracker


Author	gvanrossum
Recipients	amaury.forgeotdarc, gvanrossum, ishimoto, lemburg
Date	2008-05-06.17:10:20
Message-id	<ca471dc20805061010r1f368942y995e0066fda5441a@mail.gmail.com>
In-reply-to	<48201630.9020602@egenix.com>

Content

On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:
>  So you've limited the codec design to just doing Unicode<->bytes
>  conversions ?

Yes. This was quite a conscious decision that was not taken lightly,
with lots of community input, quite a while ago.

>  The original codec design was to have the codec decide which
>  types to take on input and to generate on output, e.g. to
>  escape characters in Unicode (converting Unicode to Unicode),
>  work on compressed 8-bit strings (converting 8-bit strings to
>  8-bit strings), etc.

Unfortunately this design made it hard to reason about the correctness
of code, since (especially in Py3k, where bytes and str are more
different than str and unicode were in 2.x) it's hard to write code
that uses .encode() or .decode() unless it knows which codec is being
used.

IOW, when translated to 3.0, the design violates the general design
principle that the *type* of a function's or method's return value
should not depend on the *value* of one of the arguments.
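
As a rough illustration of that principle, here is how it looks in current CPython 3 releases (an assumption about later versions, not the 2008 code under discussion): str.encode() always yields bytes and bytes.decode() always yields str, whatever codec name is passed. Same-type transforms such as rot_13 or zlib_codec are reached through codecs.encode() and codecs.decode() instead.

```python
import codecs

# Text encodings: the result type is fixed by the method, not by the codec name.
encoded = "h\u00e9llo".encode("utf-8")   # str -> bytes
decoded = encoded.decode("utf-8")        # bytes -> str
assert isinstance(encoded, bytes)
assert isinstance(decoded, str)

# Same-type transforms go through the codecs module functions instead.
rotated = codecs.encode("hello", "rot_13")           # str -> str
compressed = codecs.encode(b"hello", "zlib_codec")   # bytes -> bytes
restored = codecs.decode(compressed, "zlib_codec")   # bytes -> bytes

# str.encode() refuses a codec that is not a text encoding
# (behaviour of recent CPython 3 versions).
try:
    "hello".encode("rot_13")
    rejected = False
except LookupError:
    rejected = True
```

A caller of .encode() can therefore rely on getting bytes back without knowing which codec name was supplied at run time.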

>  >>  I think you have to ask another question: Is repr() allowed to
>  >>  return a string (instead of Unicode) in Py3k ?
>  >
>  > In Py3k, "strings" *are* unicode. The str data type is Unicode.
>
>  With "strings" I always refer to 8-bit strings, ie. 8-bit data that
>  is encoded in some encoding.

You will have to change this habit or you will thoroughly confuse both
users and developers of 3.0. "String" refers to the built-in "str"
type which in Py3k is PyUnicode. For the PyString type we use the
built-in type "bytes".
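
A small sketch of the terminology as it appears at the Python level in 3.x (the variable names are purely illustrative): "string" means str, a sequence of Unicode code points, while 8-bit data lives in bytes, and the two do not mix implicitly.

```python
text = "na\u00efve"            # str: a sequence of Unicode code points
raw = text.encode("utf-8")     # bytes: the same text as 8-bit data in UTF-8

text_len = len(text)           # counts code points
raw_len = len(raw)             # counts bytes; U+00EF takes two bytes in UTF-8

# str and bytes are distinct types and cannot be concatenated.
try:
    text + raw
    mixed_ok = True
except TypeError:
    mixed_ok = False
```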

>  > If you're asking about repr() possibly returning a bytes instance,
>  > definitely not.
>  >
>  >>  If not, then unicode_repr() will have to check the return value of
>  >>  the codec and convert it back to Unicode as necessary.
>  >
>  > What codec?
>
>  The idea is to have a codec which takes the Unicode object and
>  converts it to its repr()-value.
>
>  Now, since you apparently cannot
>  go the direct way anymore (ie. have the codec encode Unicode to
>  Unicode), you'd have to first use a codec which converts the Unicode
>  object to its repr()-value represented as bytes object and then
>  convert the bytes object back to Unicode in unicode_repr().
>
>  With the original design, this extra step wouldn't have been
>  necessary.

Why does everything have to be a codec?
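
To make the contrast concrete, here is a sketch based on released Python 3 behaviour (an assumption relative to the state of the code when this message was written). repr() and ascii() return str directly, with no codec involved, whereas the codec route described in the quoted text needs the extra bytes-to-str step.

```python
s = "caf\u00e9"

# Direct route: both built-ins return str, no codec involved.
direct = repr(s)     # keeps printable non-ASCII characters as they are
escaped = ascii(s)   # escapes non-ASCII characters, still returns str

# Codec route: encoding always yields bytes, so a second step is
# needed to get back to str. Note that unicode_escape adds no quotes.
via_codec = s.encode("unicode_escape").decode("ascii")
```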

History
Date	User	Action	Args
2008-05-06 17:10:33	gvanrossum	set	spambayes_score: 0.0072527 -> 0.0072527025 recipients: + gvanrossum, lemburg, ishimoto, amaury.forgeotdarc
2008-05-06 17:10:26	gvanrossum	link	issue2630 messages
2008-05-06 17:10:21	gvanrossum	create