Author ezio.melotti
Recipients Arfrever, ezio.melotti, jkloth, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date 2011-08-15.04:56:54
Keep in mind that we should be able to access and use lone surrogates too, therefore:
s = '\ud800'  # should be valid
len(s)  # should this raise an error? (or return 0.5 ;)?
s[0]  # error here too?
list(s)  # here too?

p = s + '\udc00'
len(p)  # 1?
p[0]  # '\U00010000' ?
p[1]  # IndexError?
list(p + 'a')  # ['\ud800\udc00', 'a']?
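
For comparison, here is a minimal sketch of how the same expressions behave when a string is treated as a plain sequence of code points. It assumes CPython 3.3 or later, which is an assumption and not the build discussed in this message; the variable `strict_encode_ok` is only illustrative.

```python
# Assumes CPython 3.3+ (an assumption, not the build discussed above).
s = '\ud800'                 # a lone high surrogate is an ordinary 1-element string
len_s = len(s)               # 1
first = s[0]                 # '\ud800'
items = list(s)              # ['\ud800']

p = s + '\udc00'             # concatenation does not join the pair
len_p = len(p)               # 2
items_p = list(p + 'a')      # ['\ud800', '\udc00', 'a']

# Strict UTF-8 encoding is one operation that rejects lone surrogates.
try:
    s.encode('utf-8')
except UnicodeEncodeError:
    strict_encode_ok = False
else:
    strict_encode_ok = True
```

Under that assumption none of the indexing or length questions raise an error, and only the encoding step refuses the lone surrogate.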

We can still decide that strings with lone surrogates work only with a limited number of methods/functions but:
1) it's not backward compatible;
2) it's not very consistent.

Another thing I noticed is that (at least on wide builds) surrogate pairs are not joined "on the fly":
>>> p
>>> len(p)
>>> p.encode('utf-16').decode('utf-16')
>>> len(_)
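
To make "joined" concrete, here is a rough sketch of combining surrogate pairs explicitly while leaving lone surrogates alone. The helper name `join_surrogates` is hypothetical (it is not an existing API), and the sketch assumes Python 3.3 or later, where `chr()` accepts any code point up to U+10FFFF.

```python
def join_surrogates(text):
    """Combine each high/low surrogate pair into one code point.

    Hypothetical helper for illustration; lone surrogates pass through unchanged.
    """
    out = []
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        is_high = '\ud800' <= ch <= '\udbff'
        if is_high and i + 1 < n and '\udc00' <= text[i + 1] <= '\udfff':
            # Standard UTF-16 arithmetic: 10 bits from each half, offset by 0x10000.
            hi = ord(ch) - 0xD800
            lo = ord(text[i + 1]) - 0xDC00
            out.append(chr(0x10000 + (hi << 10) + lo))
            i += 2
        else:
            out.append(ch)
            i += 1
    return ''.join(out)


joined = join_surrogates('\ud800' + '\udc00')
lone = join_surrogates('\ud800')
```

With this helper the pair collapses to a single character, while a lone surrogate stays valid and accessible, which is the behaviour the questions above are probing.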
Date User Action Args
2011-08-15 04:56:55 ezio.melotti set recipients: + ezio.melotti, terry.reedy, pitrou, jkloth, mrabarnett, Arfrever, r.david.murray, tchrist
2011-08-15 04:56:54 ezio.melotti set messageid: <>
2011-08-15 04:56:54 ezio.melotti link issue12729 messages
2011-08-15 04:56:54 ezio.melotti create