Message 214205 - Python tracker

Author	lemburg
Recipients	benjamin.peterson, docs@python, eric.araujo, ezio.melotti, gwideman, lemburg, pitrou, tshepang, vstinner
Date	2014-03-20.11:32:51
Message-id	<532AD1DD.8090006@egenix.com>
In-reply-to	<1395312581.99.0.0905034238681.issue20906@psf.upfronthosting.co.za>

Content
On 20.03.2014 11:49, Graham Wideman wrote:
> 
>> An encoding is a mapping of characters to ordinals, nothing more or less.
> 
> In Unicode, the mapping from characters to ordinals (code points) is not the encoding. It's the mapping from code points to bytes that's the encoding. While I wish this were a distinction reserved for pedants, unfortunately it's an aspect that's important for users of Unicode to understand in order to make sense of how it works, and what the literature and the web say (correct and otherwise).

I know that Unicode terminology provides all kinds of ways to name
things, and we can be arbitrarily pedantic about any of them. The
fact that the Unicode Consortium changes its mind every few
years isn't helpful either :-)

We could also have called encodings: "character set", "code page",
"character encoding", "transformation", etc.

In Python, we keep it simple: you have Unicode (code points) and 8-bit
strings or bytes (code units).
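
To make the two types concrete, here is a minimal Python 3 sketch. The
sample string and variable names are only illustrative, and UTF-8 is an
arbitrary choice of encoding:

```python
# A str holds Unicode code points; bytes holds 8-bit code units.
text = "é€"

# ord() gives the code point of each character in the str.
code_points = [ord(ch) for ch in text]

# Encoding turns the code points into bytes under a chosen encoding.
encoded = text.encode("utf-8")

print(code_points)    # two code points: U+00E9 and U+20AC
print(list(encoded))  # five bytes in UTF-8
print(len(text), len(encoded))
```

The str has length 2 while the UTF-8 bytes have length 5, which is why
the two types must not be confused.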

Whenever you go from Unicode to bytes, you encode Unicode into some encoding.
Going back, you decode the bytes back into Unicode. Both operations are
defined by the codec implementing the encoding, and they are *not*
guaranteed to be lossless.
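
A short Python 3 sketch of a lossless and a lossy round trip. The sample
text, the choice of codecs and the errors="replace" handler are
assumptions picked for illustration, not the only way to get a lossy
conversion:

```python
text = "naïve café €"

# UTF-8 can represent every code point, so this round trip is lossless.
round_trip = text.encode("utf-8").decode("utf-8")

# ASCII cannot represent ï, é or €. With errors="replace" the codec
# substitutes "?" for each of them, so information is lost on encoding.
lossy = text.encode("ascii", errors="replace")

# Decoding can lose information too: 0xFF is never valid in UTF-8, so
# with errors="replace" it becomes U+FFFD (REPLACEMENT CHARACTER).
decoded = b"\xff".decode("utf-8", errors="replace")

print(round_trip == text)
print(lossy)
print(repr(decoded))
```

With the default errors="strict", both of the lossy calls would raise
UnicodeEncodeError or UnicodeDecodeError instead of silently dropping
data.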

See here for how we ended up having Unicode support in Python:

http://www.egenix.com/library/presentations/#PythonAndUnicode

History
Date	User	Action	Args
2014-03-20 11:32:52	lemburg	set	recipients: + lemburg, pitrou, vstinner, benjamin.peterson, ezio.melotti, eric.araujo, docs@python, tshepang, gwideman
2014-03-20 11:32:52	lemburg	link	issue20906 messages
2014-03-20 11:32:51	lemburg	create