Message 142107 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	lemburg
Recipients	Arfrever, ezio.melotti, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date	2011-08-15.09:04:58
SpamBayes Score	7.6794127e-13
Marked as misclassified	No
Message-id	<4E48E136.2090802@egenix.com>
In-reply-to	<1313384214.92.0.150594455382.issue12729@psf.upfronthosting.co.za>

Content
> Keep in mind that we should be able to access and use lone surrogates too, therefore: > s = '\ud800' # should be valid > len(s) # should this raise an error? (or return 0.5 ;)? > s[0] # error here too? > list(s) # here too? > > p = s + '\udc00' > len(p) # 1? > s[0] # '\U00010000' ? > s[1] # IndexError? > list(p + 'a') # ['\ud800\udc00', 'a']? > > We can still decide that strings with lone surrogates work only with a limited number of methods/functions but: > 1) it's not backward compatible; > 2) it's not very consistent > > Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly": >>>> p > '\ud800\udc00' >>>> len(p) > 2 >>>> p.encode('utf-16').decode('utf-16') > '𐀀' >>>> len(_) > 1 Hi Tom, welcome to Python land :-) Here's some more background information on how Python's Unicode implementation works: You need to differentiate between Unicode code points stored in Unicode objects and ones encoded in transfer formats by codecs. We generally do allow lone surrogates, unassigned code points, lone combining code points, etc. in Unicode objects since Python needs to be able to work on all Unicode code points and build strings with them. The transfer format codecs do try to combine surrogates on decoding data on UCS4 builds. On UCS2 builds they create surrogate pairs as necessary. On output, those pairs will again be joined to get round-trip safety. It helps if you think of Python's Unicode objects using UCS2 and UCS4 instead of UTF-16/32. Python does try to make working with UCS2 easy and in many cases behaves as if it were using UTF-16 internally, but there are, of course, limits to this. In practice, you only rarely get to see any of these special cases, since non-BMP code points are usually not found in everyday use. If they do become a problem for you, you have the option of switching to a UCS4 build of Python. You also have to be aware of the fact that Python started Unicode in 1999/2000 with Unicode 2.0/3.0, so it uses the terminology of those versions, some of which has changed in more recent versions of Unicode. For more background information, you might want take a look at this talk from 2002: http://www.egenix.com/library/presentations/#PythonAndUnicode Related to the other tickets you opened You'll also find that collation and compression was already on the plate back then, but since no one step forward, it wasn't implemented. Cheers, -- Marc-Andre Lemburg eGenix.com ________________________________________________________________________ 2011-10-04: PyCon DE 2011, Leipzig, Germany 50 days to go ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/

> Keep in mind that we should be able to access and use lone surrogates too, therefore:
> s = '\ud800'  # should be valid
> len(s)  # should this raise an error? (or return 0.5 ;)?
> s[0]  # error here too?
> list(s)  # here too?
> 
> p = s + '\udc00'
> len(p)  # 1?
> s[0]  # '\U00010000' ?
> s[1]  # IndexError?
> list(p + 'a')  # ['\ud800\udc00', 'a']?
> 
> We can still decide that strings with lone surrogates work only with a limited number of methods/functions but:
> 1) it's not backward compatible;
> 2) it's not very consistent
> 
> Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":
>>>> p
> '\ud800\udc00'
>>>> len(p)
> 2
>>>> p.encode('utf-16').decode('utf-16')
> '𐀀'
>>>> len(_)
> 1

Hi Tom,

welcome to Python land :-) Here's some more background information
on how Python's Unicode implementation works:

You need to differentiate between Unicode code points stored in
Unicode objects and ones encoded in transfer formats by codecs.

We generally do allow lone surrogates, unassigned code
points, lone combining code points, etc. in Unicode objects
since Python needs to be able to work on all Unicode code points
and build strings with them.

The transfer format codecs do try to combine surrogates
on decoding data on UCS4 builds. On UCS2 builds they create
surrogate pairs as necessary. On output, those pairs will again
be joined to get round-trip safety.

It helps if you think of Python's Unicode objects using UCS2
and UCS4 instead of UTF-16/32. Python does try to make working
with UCS2 easy and in many cases behaves as if it were using
UTF-16 internally, but there are, of course, limits to this. In
practice, you only rarely get to see any of these special cases,
since non-BMP code points are usually not found in everyday
use. If they do become a problem for you, you have the option
of switching to a UCS4 build of Python.

You also have to be aware of the fact that Python started
Unicode in 1999/2000 with Unicode 2.0/3.0, so it uses the
terminology of those versions, some of which has changed in
more recent versions of Unicode.

For more background information, you might want take a look
at this talk from 2002:

http://www.egenix.com/library/presentations/#PythonAndUnicode

Related to the other tickets you opened You'll also find that
collation and compression was already on the plate back then,
but since no one step forward, it wasn't implemented.

Cheers,
-- 
Marc-Andre Lemburg
eGenix.com

________________________________________________________________________
2011-10-04: PyCon DE 2011, Leipzig, Germany                50 days to go

::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
           Registered at Amtsgericht Duesseldorf: HRB 46611
               http://www.egenix.com/company/contact/

History
Date	User	Action	Args
2011-08-15 09:04:59	lemburg	set	recipients: + lemburg, terry.reedy, pitrou, jkloth, ezio.melotti, mrabarnett, Arfrever, r.david.murray, tchrist
2011-08-15 09:04:58	lemburg	link	issue12729 messages
2011-08-15 09:04:58	lemburg	create