Message 214197 - Python tracker

Author	gwideman
Recipients	benjamin.peterson, docs@python, eric.araujo, ezio.melotti, gwideman, lemburg, pitrou, tshepang, vstinner
Date	2014-03-20.10:49:41
Message-id	<1395312581.99.0.0905034238681.issue20906@psf.upfronthosting.co.za>

Content

Marc-Andre:

Thanks for commenting:

> > 2. Python string --> some other code system, such as 
> > ASCII, cp1250, etc. The destination code system doesn't 
> > necessarily have anything to do with unicode, and whole 
> > ranges of unicode's characters either result in an 
> > exception, or get translated as escape sequences. 
> > Ie: This is more usefully seen as a translation 
> > operation, than "merely" encoding.

> Those are encodings as well. The operation going from Unicode to one of
> these encodings is called "encode" in Python.

Yes, I am well aware that in Python parlance these operations are also called "encode" (and are performed with encode()), which, I am arguing, is one reason for the confusion. They do not encode into a recognized Unicode-defined byte stream. Instead, they translate and filter the text into the allowed character set of a different code system, and then encode it into that code system's byte representation.
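
To make the difference concrete, here is a minimal sketch. The sample string and the choice of cp1250 are illustrative assumptions, picked so that one character exists in the target code system and one does not:

```python
# U+0141 (L with stroke) exists in cp1250; U+4E2D (a CJK ideograph) does not.
text = "\u0141\u4e2d"

# A Unicode encoding form can represent every character in the string.
utf8 = text.encode("utf-8")

# A legacy code system only covers its own repertoire, so strict
# encoding raises an exception for the character it cannot represent.
try:
    text.encode("cp1250")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With an error handler, the unrepresentable character is translated
# into an escape sequence instead of raising.
escaped = text.encode("cp1250", errors="backslashreplace")

print(utf8)           # bytes for both characters
print(strict_failed)  # True
print(escaped)        # one cp1250 byte, then a literal \u escape
```

The UTF-8 result round-trips back to the original string, while the cp1250 result with the escape sequence does not decode back to it.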

> > In 1, the encoding process results in data that stays within concepts 
> > defined within Unicode. In 2, encoding produces data that would be 
> > described by some code system outside of Unicode.
> > At the moment I think Python muddles these two ideas together, 
> > and I'm not sure how to clarify this. 

> An encoding is a mapping of characters to ordinals, nothing more or less.

In Unicode, the mapping from characters to ordinals (code points) is not the encoding. The encoding is the mapping from code points to bytes. While I wish this were a distinction reserved for pedants, unfortunately users of Unicode need to understand it in order to make sense of how Unicode works, and of what the literature and the web say (correctly and otherwise).
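
A short sketch of the two mappings, using the euro sign as an arbitrary example character (the variable names are just for illustration):

```python
ch = "\u20ac"  # the euro sign

# Mapping 1: character -> code point. This is just an integer.
code_point = ord(ch)

# Mapping 2: code point -> bytes. The same code point produces
# different byte sequences under different Unicode encodings.
encoded = {
    name: ch.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-32-be")
}

print(hex(code_point))
for name, data in encoded.items():
    print(name, data.hex(" "))
```

The code point is the same in all three cases; only the second mapping changes.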

> You are viewing all this from a Unicode point of view, but please
> realize that Unicode is rather new in the business and the many
> other encodings Python supports have been around for decades.

I'm advocating that the concepts be made clear enough for users to understand that Unicode (UTF-whatever) works differently (two mappings) from non-Unicode systems (a single mapping), so that they have some hope of understanding what happens in moving from one to the other.
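
As an illustration of moving from one system to the other, here is a sketch that starts from a single cp1250 byte (the byte value 0xA3 is an arbitrary choice for the example):

```python
legacy_bytes = b"\xa3"

# Single mapping: in a legacy code page the byte value itself is the
# character's number, and its meaning depends entirely on the table.
as_cp1250 = legacy_bytes.decode("cp1250")
as_latin1 = legacy_bytes.decode("latin-1")

# Once decoded, the character has a Unicode code point, which need not
# equal the legacy byte value.
code_point = ord(as_cp1250)

# Second mapping: the code point is encoded to bytes by a UTF.
utf8_bytes = as_cp1250.encode("utf-8")

print(as_cp1250, as_latin1, hex(code_point), utf8_bytes)
```

The same byte is a different character under each legacy table, and only the decoded character, not the byte, carries over into Unicode.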

> > > So it should say "16-bit code points" instead, right?
 
> > I don't think Unicode code points should ever be described as 
> > having a particular number of bits. I think this is a 
> > core concept: Unicode separates the character <--> code point, 
> > and code point <--> bits/bytes mappings. 

> You have UCS-2 and UCS-4. UCS-2 is representable in 16 bits, UCS-4
> needs 21 bits, but is typically stored in 32-bit. Still,
> you're right: it's better to use the correct terms UCS-2 vs. UCS-4
> rather than refer to the number of bits.

I think mixing in UCS just adds confusion here. The Unicode Consortium has declared "UCS-2" obsolete terminology, and even wants people to stop using the term:
http://www.unicode.org/faq/utf_bom.html
"UCS-2 is obsolete terminology... the term should now be avoided."
(That's a somewhat silly position -- we must still use the term to talk about legacy stuff. But probably not necessary here.)

So my point wasn't about UCS. It was about referring to code points as having a particular bit width. Fundamentally, code points are numbers, independent of any particular computer number format. It is a separate matter that they can be encoded in 8-, 16- or 32-bit encoding schemes (UTF-8, UTF-16, UTF-32), and the width of the scheme is independent of the magnitude of the code point number.

It _is_ the case that some code points are large enough integers that when encoded they _require_, say, 3 bytes in UTF-8, or two 16-bit words in UTF-16, and so on. But the number of bits used in the encoding does not necessarily correspond to the number of bits that would be required to represent the integer code point number in plain binary. (Only in UTF-32 is the encoded value simply the binary version of the code point value.)
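
A small sketch comparing the size of a code point as a plain integer with its encoded size. The four sample characters are arbitrary picks from different ranges, and the dictionary keys are names invented for this example:

```python
# "A", e with acute accent, the euro sign, and an emoji outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

rows = []
for ch in samples:
    cp = ord(ch)
    rows.append({
        "code_point": f"U+{cp:04X}",
        # Bits needed to write the code point as a plain binary integer.
        "bits_as_integer": cp.bit_length(),
        # Bytes actually used by each encoding.
        "utf8_bytes": len(ch.encode("utf-8")),
        "utf16_bytes": len(ch.encode("utf-16-le")),
        # UTF-32 stores the code point value directly as a 4-byte integer.
        "utf32_is_plain_binary": ch.encode("utf-32-be") == cp.to_bytes(4, "big"),
    })

for row in rows:
    print(row)
```

For instance, the euro sign needs 14 bits as an integer but takes 24 bits (3 bytes) in UTF-8 and 16 bits in UTF-16.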
