
Author lemburg
Recipients benjamin.peterson, docs@python, eric.araujo, ezio.melotti, gwideman, lemburg, pitrou, tshepang, vstinner
Date 2014-03-20.07:47:55
Message-id <532A9D27.8090702@egenix.com>
In-reply-to <1395273041.74.0.181908712413.issue20906@psf.upfronthosting.co.za>
Content
Just to clarify a few things:

On 20.03.2014 00:50, Graham Wideman wrote:
> 
> I think part of the ambiguity problem here is that there are two subtly but importantly different ideas here:
> 
> 1. Python string (capable of representing any unicode text) --> some full-fidelity and industry recognized unicode byte stream, like utf-8, or utf-32. I think this is legitimately described as an "encoding" of the unicode string.

Right, those are Unicode transformation format (UTF) encodings, which are
capable of representing all Unicode code points.
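
For illustration, a small sketch (plain Python, built-in codecs only;
the sample text is arbitrary) showing that a UTF round-trip is lossless:

    # Round-trip through a UTF codec is lossless for any Unicode text.
    text = "naïve – 😀"   # Latin letters, punctuation, and an astral character

    for codec in ("utf-8", "utf-16", "utf-32"):
        data = text.encode(codec)          # str -> bytes
        assert data.decode(codec) == text  # bytes -> str, no loss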

> versus:
> 
> 2. Python string --> some other code system, such as ASCII, cp1250, etc. The destination code system doesn't necessarily have anything to do with unicode, and whole ranges of unicode's characters either result in an exception, or get translated as escape sequences. Ie: This is more usefully seen as a translation operation, than "merely" encoding.

Those are encodings as well. The operation going from Unicode to one of
these encodings is called "encode" in Python; the operation going the
other way around is called "decode".

> In 1, the encoding process results in data that stays within concepts defined within Unicode. In 2, encoding produces data that would be described by some code system outside of Unicode.
>
> At the moment I think Python muddles these two ideas together, and I'm not sure how to clarify this. 

An encoding is a mapping of characters to ordinals, nothing more or
less. Unicode is such an encoding, but all others are as well. They
just happen to have different ranges of ordinals.
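
To make "different ranges of ordinals" concrete, a small sketch with the
euro sign, whose ordinal differs between Unicode and the Windows code
pages (plain Python, built-in codecs only):

    ch = "€"

    print(ord(ch))                 # 8364 (U+20AC): the Unicode ordinal
    print(ch.encode("cp1252")[0])  # 128 (0x80): the cp1252 ordinal
    print(ch.encode("utf-8"))      # b'\xe2\x82\xac': a UTF-8 byte
                                   # serialization, not an ordinal at all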

You are viewing all this from a Unicode point of view, but please
realize that Unicode is rather new to the business, and many of the
other encodings Python supports have been around for decades.

>> So it should say "16-bit code points" instead, right?
> 
> I don't think Unicode code points should ever be described as having a particular number of bits. I think this is a core concept: Unicode separates the character <--> code point, and code point <--> bits/bytes mappings. 
>
> At most, one might want to distinguish different ranges of unicode code points. Even if there is a need to distinguish code points <= 65535, I don't think this should be described as "16-bit", as it muddies the distinction between Unicode's two mappings.

You have UCS-2 and UCS-4. UCS-2 is representable in 16 bits; UCS-4
code points need 21 bits, but are typically stored in 32-bit units.
Still, you're right: it's better to use the correct terms UCS-2 vs.
UCS-4 rather than refer to the number of bits.
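
A final sketch tying the two points together (plain Python only): code
points are integers, not bit patterns, and only those above U+FFFF fall
outside the UCS-2 range:

    import sys

    ch = "😀"                      # U+1F600, an astral code point

    print(ord(ch))                 # 128512 -- an integer, not a bit pattern
    print(hex(sys.maxunicode))     # 0x10ffff -- top of the code point range

    # UTF-32 (UCS-4): one fixed 32-bit unit per code point
    print(ch.encode("utf-32-be"))  # b'\x00\x01\xf6\x00'

    # UTF-16: code points above U+FFFF take a surrogate pair (two 16-bit
    # units), which is exactly what strict UCS-2 cannot represent
    print(ch.encode("utf-16-be"))  # b'\xd8=\xde\x00'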