Author terry.reedy
Recipients Arfrever, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date 2011-08-27.03:12:32
My proposal is better than log(N) in two respects.

1) There need only be a time penalty when there are non-BMP chars — that is, exactly when indexing currently gives the 'wrong' answer, and therefore when a time penalty should be acceptable. Lookup for normal all-BMP strings could remain the same.

2) The penalty is log(K), where K is the number of non-BMP chars. In theory, O(log K) is as 'bad' as O(log N) for any fixed ratio K/N. In practice, the difference should be noticeable when there are just a few (say .01%) extended-range chars, since K is then much smaller than N.
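The idea above can be sketched in plain Python. This is a hypothetical illustration, not the proposed CPython implementation: it keeps a sorted list of the code-point indices of the non-BMP characters in a UTF-16 code-unit sequence, then translates a code-point index into a code-unit offset with a binary search over that list — O(log K) in the number K of non-BMP chars, and O(1) (an empty list) for all-BMP strings.

```python
from bisect import bisect_left

def build_index(units):
    """Scan a sequence of UTF-16 code units and record the code-point
    index of each supplementary (non-BMP) character.  For an all-BMP
    string the result is empty and lookup stays O(1)."""
    starts = []   # code-point indices of non-BMP characters
    cp = 0        # current code-point index
    u = 0         # current code-unit index
    while u < len(units):
        if 0xD800 <= units[u] <= 0xDBFF:   # high surrogate: 2-unit char
            starts.append(cp)
            u += 2
        else:
            u += 1
        cp += 1
    return starts

def unit_offset(starts, i):
    """Translate code-point index i into a code-unit offset.
    Each non-BMP char before i adds one extra code unit, and
    bisect_left counts them in O(log K)."""
    return i + bisect_left(starts, i)
```

For example, for 'a\U0001043c b' encoded as the units [0x0061, 0xD801, 0xDC3C, 0x0062], `build_index` records one non-BMP char at code-point index 1, and code point 2 ('b') correctly maps to code-unit offset 3.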

I am aware that this is an idea for the future, not now.

Fixing string iteration on narrow builds to produce the same code points as wide builds is easy and costs O(1) per code point (character), the same as the current cost. Then

>>> from unicodedata import name
>>> name('\U0001043c')
>>> for c in 'a\U0001043c': name(c)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    for c in 'a\U0001043c': name(c)
ValueError: no such name

would work like it does on wide builds instead of failing.

I admit that it would be strange to have default iteration produce different items than default indexing (and indeed, str currently iterates by sequential indexing). But keeping them in sync means that buggy iteration is another cost of O(1) indexing.
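The iteration fix described above can be sketched as a generator that joins the surrogate pairs a narrow build exposes as two separate items. This is an illustrative helper, not the actual patch; the surrogate arithmetic is the standard UTF-16 decoding formula:

```python
def iter_code_points(s):
    """Yield full code points from s, combining any UTF-16 surrogate
    pair (as stored on a narrow build) into one supplementary
    character.  Lone surrogates are yielded unchanged."""
    i, n = 0, len(s)
    while i < n:
        c = s[i]
        if ('\ud800' <= c <= '\udbff' and i + 1 < n
                and '\udc00' <= s[i + 1] <= '\udfff'):
            # high + low surrogate -> one code point above U+FFFF
            cp = (0x10000
                  + ((ord(c) - 0xD800) << 10)
                  + (ord(s[i + 1]) - 0xDC00))
            yield chr(cp)
            i += 2
        else:
            yield c
            i += 1
```

With this, iterating a narrow-build representation of 'a\U0001043c' (stored as 'a' plus the pair '\ud801', '\udc3c') yields two items, 'a' and U+1043C, so name(c) succeeds for each, as it does on a wide build.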