
Author lemburg
Recipients amaury.forgeotdarc, gvanrossum, ishimoto, lemburg
Date 2008-05-08.17:15:31
Message-id <48233530.90600@egenix.com>
In-reply-to <ca471dc20805061010r1f368942y995e0066fda5441a@mail.gmail.com>
Content
On 2008-05-06 19:10, Guido van Rossum wrote:
> Guido van Rossum <guido@python.org> added the comment:
> 
> On Tue, May 6, 2008 at 1:26 AM, Marc-Andre Lemburg wrote:
>>  So you've limited the codec design to just doing Unicode<->bytes
>>  conversions ?
> 
> Yes. This was quite a conscious decision that was not taken lightly,
> with lots of community input, quite a while ago.
> 
>>  The original codec design was to have the codec decide which
>>  types to take on input and to generate on output, e.g. to
>>  escape characters in Unicode (converting Unicode to Unicode),
>>  work on compressed 8-bit strings (converting 8-bit strings to
>>  8-bit strings), etc.
> 
> Unfortunately this design made it hard to reason about the correctness
> of code, since (especially in Py3k, where bytes and str are more
> different than str and unicode were in 2.x) it's hard to write code
> that uses .encode() or .decode() unless it knows which codec is being
> used.
> 
> IOW, when translated to 3.0, the design violates the general design
> principle that the *type* of a function's or method's return value
> should not depend on the *value* of one of the arguments.

I understand where this concept originates and usually apply
this rule to software design as well. However, in the particular
case of codecs, the codec registry and its helper functions are
merely interfaces to code that is defined elsewhere.

In comparison, the approach is very much like getattr() - you know
what the attribute is called, but know nothing about its type
until you receive it from the function.
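As a minimal sketch of that analogy (the Config class and its
attributes are made up for the example):

    class Config:                     # hypothetical example object
        retries = 3                   # an int attribute
        mode = "fast"                 # a str attribute

    cfg = Config()
    for name in ("retries", "mode"):
        value = getattr(cfg, name)    # the caller only knows the *name*
        print(name, type(value))      # the type is known once the value arrives

Codec lookup works the same way: you know the codec's name, but
only the returned object tells you which types it produces.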

The reason codecs were designed like this was to make it easy
to stack them. For this to work, only the interfaces need to be
defined, without restricting the codecs too much in terms of
which types may be used.
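Here's a rough sketch of such stacking, assuming a Python 3
where the bytes-to-bytes transform codecs (zlib_codec,
base64_codec) are reachable through the module-level
codecs.encode()/codecs.decode() helpers:

    import codecs

    text = "stackable codecs"
    utf8 = text.encode("utf-8")                      # str  -> bytes
    packed = codecs.encode(utf8, "zlib_codec")       # bytes -> bytes (compress)
    armored = codecs.encode(packed, "base64_codec")  # bytes -> bytes (ASCII armor)

    # Unstacking simply reverses the order of the codecs.
    restored = codecs.decode(codecs.decode(armored, "base64_codec"),
                             "zlib_codec").decode("utf-8")
    assert restored == text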

I'd suggest lifting the type restrictions from the general
codecs.c access APIs (PyCodec_*), since they don't really belong
there, and instead imposing the limitation only on the PyUnicode
and PyString methods .encode() and .decode().
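From the Python side, the split would look roughly like this
(a sketch assuming a Python 3 where the rot_13 text-to-text
codec is available and the module-level helpers are
unrestricted):

    import codecs

    # Module-level access: no type restriction, the codec decides
    # the result type.
    print(codecs.encode("abc", "rot_13"))     # str -> str   ('nop')
    print(codecs.encode("abc", "utf-8"))      # str -> bytes (b'abc')

    # Method access: restricted to str -> bytes, so a str -> str
    # codec would be rejected here.
    try:
        "abc".encode("rot_13")
    except TypeError as exc:
        print(exc)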

If you then also allow those methods to return *either*
PyUnicode or PyString, you'd still have strong typing (only one
of two possible types is allowed), and stacking streams or
codecs that work on PyUnicode->PyUnicode or PyString->PyString
would still be accessible via .encode()/.decode().
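A hypothetical sketch of such a relaxed but still strongly
typed check (encode_relaxed is a made-up helper, not an existing
API):

    import codecs

    def encode_relaxed(text, encoding):
        # Hypothetical .encode() variant: the codec may return either
        # str or bytes; anything else is still rejected.
        result = codecs.encode(text, encoding)
        if not isinstance(result, (str, bytes)):
            raise TypeError("codec %r returned %s"
                            % (encoding, type(result).__name__))
        return result

    print(type(encode_relaxed("abc", "utf-8")))    # <class 'bytes'>
    print(type(encode_relaxed("abc", "rot_13")))   # <class 'str'>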

>>  >>  I think you have to ask another question: Is repr() allowed to
>>  >>  return a string (instead of Unicode) in Py3k ?
>>  >
>>  > In Py3k, "strings" *are* unicode. The str data type is Unicode.
>>
>>  With "strings" I always refer to 8-bit strings, ie. 8-bit data that
>>  is encoded in some encoding.
> 
> You will have to change this habit or you will thoroughly confuse both
> users and developers of 3.0. "String" refers to the built-in "str"
> type which in Py3k is PyUnicode. For the PyString type we use the
> built-in type "bytes".

Well, I'm confused by the Py3k use of terms (esp. because the
C type names don't match the Python ones), which is why I'm
talking about 8-bit strings and Unicode.

Perhaps it's better to use PyString and PyUnicode.

>>  > If you're asking about repr() possibly returning a bytes instance,
>>  > definitely not.
>>  >
>>  >>  If not, then unicode_repr() will have to check the return value of
>>  >>  the codec and convert it back to Unicode as necessary.
>>  >
>>  > What codec?
>>
>>  The idea is to have a codec which takes the Unicode object and
>>  converts it to its repr()-value.
>>
>>  Now, since you apparently cannot
>>  go the direct way anymore (ie. have the codec encode Unicode to
>>  Unicode), you'd have to first use a codec which converts the Unicode
>>  object to its repr()-value represented as bytes object and then
>>  convert the bytes object back to Unicode in unicode_repr().
>>
>>  With the original design, this extra step wouldn't have been
>>  necessary.
> 
> Why does everything have to be a codec?

It doesn't. It's just that codecs are so easy to add, change
and adjust that reusing the existing machinery is more
attractive than reinventing the wheel every time you need a
conversion from one text form to another that has to be
adjustable in some way.
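As an illustration of how little code a new codec needs, here's
a sketch using the standard registration hooks; the upper_ascii
codec name and its behaviour are made up for the example:

    import codecs

    def _encode(text, errors="strict"):
        data = text.upper().encode("ascii", errors)
        return data, len(text)

    def _decode(data, errors="strict"):
        text = bytes(data).decode("ascii", errors).lower()
        return text, len(data)

    def _search(name):
        if name == "upper_ascii":      # made-up codec name
            return codecs.CodecInfo(_encode, _decode, name="upper_ascii")
        return None

    codecs.register(_search)
    print("Hello".encode("upper_ascii"))    # b'HELLO'
    print(b"HELLO".decode("upper_ascii"))   # 'hello'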

In the case addressed by this ticket, I see the usefulness of
having native-language text written to the console using native
glyphs, but there are so many drawbacks to this (see the
discussion on the ticket and the mailing list) that I think
there needs to be a way to adjust the mechanism, or at least to
revert to the existing repr() output.
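For reference, a minimal sketch of what reverting looks like
from Python code, assuming a Python 3 in which repr() shows
native glyphs and the ascii() builtin keeps the old escaped
form:

    s = "Grüße"          # non-ASCII text
    print(repr(s))       # 'Grüße'        -- native glyphs
    print(ascii(s))      # 'Gr\xfc\xdfe'  -- old-style escaped output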

Furthermore, a codec implementation of what Atsuo has in mind
would also be useful in other contexts, e.g. where you want
to write PyUnicode to a stream without introducing line breaks.
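A quick sketch of that kind of use, with the existing
unicode_escape codec standing in for the repr-style codec
described above; in Python 3 it returns bytes, which is exactly
the extra conversion step mentioned earlier:

    import codecs

    text = "first line\nsecond line"
    escaped = codecs.encode(text, "unicode_escape")  # bytes, newline escaped
    print(escaped)                                   # b'first line\\nsecond line'
    # One extra decode step is needed to feed a text-only stream:
    print(escaped.decode("ascii"))                   # first line\nsecond line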