Message 142096 - Python tracker

Author	tchrist
Recipients	Arfrever, ezio.melotti, jkloth, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date	2011-08-15.03:31:12
Message-id	<24710.1313379053@chthon>
In-reply-to	<26823.1313374629@chthon>

Content
I wrote:

>> Python's narrow builds are, in a sense, 'between' UCS-2 and UTF-16.

> So I'm finding.  Perhaps that's why I keep getting confused. I do have a pretty firm
> notion of what UCS-2 and UTF-16 are, and so I get sometimes self-contradictory results.
> Can you think of anywhere that Python acts like UCS-2 and not UTF-16?  I'm not sure I
> have found one, although the regex thing might count.

I just thought of one.  The casemapping functions don't work right on
Deseret, which is a non-BMP case-changing script.  That's one I submitted
as a bug, because I figure if the UTF-8 decoder can decode the non-BMP
code points into paired UTF-16 surrogates, then the casing functions had
jolly well better be able to deal with them.  If the UTF-8 decoder knows it
is only going to UCS-2, then it should have raised an exception on my non-BMP
source.  Since it went to UTF-16, the rest of the language should have behaved
accordingly.  Java does this right, BTW, despite its UTF-16ness.
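
A minimal sketch of the check described above, for anyone who wants to try it.
It assumes Python 3.3 or later, where the narrow/wide build distinction no
longer exists, so the casing round trip is expected to succeed there; on a
2011 narrow build the same round trip is what this message reports as failing.
The helper name utf16_units is made up for illustration, and the code points
used are U+10400 DESERET CAPITAL LETTER LONG I and U+10428 DESERET SMALL
LETTER LONG I.

```python
import sys

# UTF-8 bytes for U+10400 DESERET CAPITAL LETTER LONG I (a non-BMP code point).
upper = b"\xf0\x90\x90\x80".decode("utf-8")
# Its lowercase partner, U+10428 DESERET SMALL LETTER LONG I.
lower = "\U00010428"


def utf16_units(s):
    """Return the UTF-16 code units of s (hypothetical helper for this sketch)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# In UTF-16 the character is a surrogate pair, which is what a narrow
# build stores internally after decoding the UTF-8 above.
units = utf16_units(upper)

# sys.maxunicode is 0xFFFF on a narrow build and 0x10FFFF otherwise.
is_narrow = sys.maxunicode == 0xFFFF

# The complaint in this message: casing should work on the whole code point,
# not silently leave the surrogate pair untouched.
casing_ok = upper.lower() == lower and lower.upper() == upper

print([hex(u) for u in units], "narrow" if is_narrow else "wide", casing_ok)
```

If casing_ok comes out False while the decode succeeded, the build is in the
"between UCS-2 and UTF-16" state discussed above: it stores surrogate pairs
but does not treat them as one character when casemapping.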

--tom

History
Date	User	Action	Args
2011-08-15 03:31:13	tchrist	set	recipients: + tchrist, terry.reedy, pitrou, jkloth, ezio.melotti, mrabarnett, Arfrever, r.david.murray
2011-08-15 03:31:13	tchrist	link	issue12729 messages
2011-08-15 03:31:12	tchrist	create