Message 93617 - Python tracker


Author	Rhamphoryncus
Recipients	Rhamphoryncus, amaury.forgeotdarc, bupjae, ezio.melotti, lemburg, vstinner
Date	2009-10-05.18:48:52
Message-id	<aac2c7cb0910051148y75a87640ye5c0bd674fb5416d@mail.gmail.com>
In-reply-to	<4ACA3686.4060307@egenix.com>

Content
On Mon, Oct 5, 2009 at 12:10, Marc-Andre Lemburg <report@bugs.python.org> wrote:
> All this is just nitpicking, really. UCS2 is a character set,
> UTF-16 an encoding.

UCS is a character set, for most purposes synonymous with the Unicode
character set.  UCS-2 and UTF-16 are both encodings of that character
set.  However, UCS-2 can only represent the BMP, while UTF-16 can
represent the full range.
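
To make the BMP distinction concrete, here is a minimal Python 3 sketch. The helper names `utf16_units` and `to_surrogate_pair` are made up for illustration and are not part of any library. It shows a BMP character taking one 16-bit code unit, while a character beyond the BMP needs a surrogate pair, which UCS-2 has no way to express.

```python
# U+1D11E MUSICAL SYMBOL G CLEF lies outside the BMP.
ch = "\U0001D11E"


def utf16_units(s):
    """Return the UTF-16 code units of s as ints (big-endian, no BOM).

    Hypothetical helper, for illustration only.
    """
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) surrogates by hand.

    Hypothetical helper, for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


bmp_units = utf16_units("\u00e9")         # BMP character: one code unit
astral_units = utf16_units(ch)            # non-BMP character: two code units
manual_pair = to_surrogate_pair(ord(ch))  # same pair, computed by hand

print([hex(u) for u in bmp_units])
print([hex(u) for u in astral_units])
print([hex(u) for u in manual_pair])
```

The pair computed by hand matches what the codec produces, which is the sense in which UTF-16 covers the full range using the same 16-bit units.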

> If we were to implement Unicode using UTF-16 as storage format,
> we would not be able to store single lone surrogates, since these
> are not allowed in UTF-16. Ditto for unassigned ordinals, invalid
> code points, etc.

No.  Internal usage may become temporarily ill-formed, but this is a
compromise, and acceptable so long as we never export them to other
systems.

Not that I wouldn't *prefer* a system that wouldn't store lone
surrogates, but... pragmatics prevail.
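
As a rough illustration of the "ill-formed internally, never exported" idea, here is how current CPython 3 behaves. This is an assumption about recent Python 3 releases, not a description of the interpreter at the time of this message or of how this issue was resolved. The helper `try_export` is hypothetical.

```python
# A lone high surrogate: accepted inside a Python 3 str.
lone = "\ud800"


def try_export(s):
    """Encode s as UTF-16-LE, or return None if the strict codec refuses.

    Hypothetical helper, for illustration only.
    """
    try:
        return s.encode("utf-16-le")
    except UnicodeEncodeError:
        return None


stored_length = len(lone)              # the string holds it internally
exported = try_export(lone)            # strict export rejects the lone surrogate
well_formed = try_export("\U0001D11E")  # a proper pair exports fine

# Opting in explicitly with the "surrogatepass" error handler lets it through.
forced = lone.encode("utf-16-le", "surrogatepass")

print(stored_length, exported, well_formed, forced)
```

The design choice this sketches is that the check lives at the encoding boundary rather than in the string type itself.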

> Note that I wrote the PEP and worked on the implementation at a time
> when Unicode 2.x was still in widespread use (mostly on Windows)
> and 3.0 was just being released:
>
>        http://www.unicode.org/history/publicationdates.html

I think you hit the nail on the head there.  Ten years ago, Unicode
meant something different than it does today.  That's reflected in PEP
100 and in the code.  Now it's time to move on and switch to the modern
terminology, modern usage, and modern specs.

> But all that is off-topic for this ticket, so please let's just
> stop such discussions.

It needs to be discussed somewhere.  It's a distraction from fixing
the bug, but at least it's more private here.  Would you prefer email?

History
Date	User	Action	Args
2009-10-05 18:48:55	Rhamphoryncus	set	recipients: + Rhamphoryncus, lemburg, amaury.forgeotdarc, vstinner, ezio.melotti, bupjae
2009-10-05 18:48:53	Rhamphoryncus	link	issue5127 messages
2009-10-05 18:48:52	Rhamphoryncus	create