Message142098
Keep in mind that we should be able to access and use lone surrogates too, therefore:
s = '\ud800' # should be valid
len(s) # should this raise an error? (or return 0.5 ;)?
s[0] # error here too?
list(s) # here too?
p = s + '\udc00'
len(p) # 1?
p[0] # '\U00010000' ?
p[1] # IndexError?
list(p + 'a') # ['\ud800\udc00', 'a']?
We can still decide that strings with lone surrogates work only with a limited number of methods/functions, but:
1) it's not backward compatible;
2) it's not very consistent.
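The questions in the snippet above can be checked empirically on whichever build is at hand. The following is a minimal sketch, not part of the original report: `probe` is a hypothetical helper name, and the results noted in the comments assume CPython 3.3 or later, where each surrogate is stored as its own code point.

```python
def probe(s):
    """Return the length of s and the code points it indexes to, as hex strings."""
    return len(s), [hex(ord(ch)) for ch in s]

s = '\ud800'          # lone high surrogate
p = s + '\udc00'      # high surrogate followed by a low surrogate

print(probe(s))        # assumed CPython 3.3+: (1, ['0xd800'])
print(probe(p))        # assumed CPython 3.3+: (2, ['0xd800', '0xdc00'])
print(probe(p + 'a'))  # assumed CPython 3.3+: (3, ['0xd800', '0xdc00', '0x61'])
```

Printing hex values rather than the strings themselves avoids encoding errors when the output stream cannot represent a lone surrogate.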
Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":
>>> p
'\ud800\udc00'
>>> len(p)
2
>>> p.encode('utf-16').decode('utf-16')
'𐀀'
>>> len(_)
1
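Since the pair is not joined automatically, code that wants a single code point has to combine it explicitly. The following is a rough sketch of one way to do that by hand, without going through a codec: `join_surrogates` is a hypothetical name, lone surrogates are left untouched, and the result is assumed to be one code point only on builds that can store non-BMP characters directly.

```python
def join_surrogates(text):
    """Combine each high/low surrogate pair in text into one code point.

    Lone surrogates are copied through unchanged.
    """
    out = []
    i = 0
    while i < len(text):
        hi = ord(text[i])
        # A pair is a high surrogate (D800-DBFF) followed by a low one (DC00-DFFF).
        if 0xD800 <= hi <= 0xDBFF and i + 1 < len(text):
            lo = ord(text[i + 1])
            if 0xDC00 <= lo <= 0xDFFF:
                # Standard UTF-16 pair arithmetic.
                out.append(chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)))
                i += 2
                continue
        out.append(text[i])
        i += 1
    return ''.join(out)

p = '\ud800' + '\udc00'
joined = join_surrogates(p)
print(len(p), len(joined))
```

A manual loop is used here instead of an encode/decode round trip so that the handling of lone surrogates is explicit and does not depend on codec error handlers.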
Date | User | Action | Args
2011-08-15 04:56:55 | ezio.melotti | set | recipients: + ezio.melotti, terry.reedy, pitrou, jkloth, mrabarnett, Arfrever, r.david.murray, tchrist
2011-08-15 04:56:54 | ezio.melotti | set | messageid: <1313384214.92.0.150594455382.issue12729@psf.upfronthosting.co.za>
2011-08-15 04:56:54 | ezio.melotti | link | issue12729 messages
2011-08-15 04:56:54 | ezio.melotti | create |