Message 143054 - Python tracker



Author	terry.reedy
Recipients	Arfrever, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date	2011-08-27.03:12:32
Message-id	<1314414753.58.0.189637808515.issue12729@psf.upfronthosting.co.za>
In-reply-to

Content

My proposal is better than log(N) in 2 respects.

1) A time penalty is needed only when the string contains non-BMP chars, which is exactly when indexing currently gives the 'wrong' answer and when a time penalty should therefore be acceptable. Lookup for normal all-BMP strings could remain the same.

2) The penalty is O(log K), where K is the number of non-BMP chars. In theory, O(log K) is as 'bad' as O(log N) for any fixed ratio K/N. In practice, the difference should be noticeable when there are just a few (say .01%) extended-range chars.

I am aware that this is an idea for the future, not now.
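
To make the two points concrete, here is a minimal sketch of one way such a lookup could work. It is not CPython code: the class name IndexedUTF16, the helper to_utf16_units, and the side table `astral` are all hypothetical, and a list of ints stands in for narrow-build storage. The sketch assumes the string keeps a sorted table of the code-point indexes of its non-BMP chars and bisects that table to translate a code-point index into a code-unit position.

```python
from bisect import bisect_left


def to_utf16_units(text):
    """Simulate narrow-build storage: a list of 16-bit code units."""
    units = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))    # high (lead) surrogate
            units.append(0xDC00 + (cp & 0x3FF))  # low (trail) surrogate
        else:
            units.append(cp)
    return units


class IndexedUTF16:
    """Hypothetical wrapper that indexes by code point, not code unit."""

    def __init__(self, text):
        self.units = to_utf16_units(text)
        # Sorted code-point indexes of the K non-BMP chars (assumed side table).
        self.astral = [i for i, ch in enumerate(text) if ord(ch) > 0xFFFF]

    def __len__(self):
        # Each non-BMP char occupies two code units but counts once.
        return len(self.units) - len(self.astral)

    def __getitem__(self, i):
        if not 0 <= i < len(self):
            raise IndexError("string index out of range")
        if not self.astral:
            return chr(self.units[i])  # all-BMP string: plain O(1) lookup
        # Number of non-BMP chars strictly before code point i: O(log K).
        k = bisect_left(self.astral, i)
        pos = i + k  # each earlier non-BMP char shifts the position by one unit
        if k < len(self.astral) and self.astral[k] == i:
            high, low = self.units[pos], self.units[pos + 1]
            return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
        return chr(self.units[pos])


s = IndexedUTF16('a\U0001043cb')
print(len(s), [hex(ord(s[i])) for i in range(len(s))])
```

Under these assumptions an all-BMP string never touches the table, and a string of 1,000,000 chars with .01% non-BMP chars bisects over only 100 entries, roughly 7 comparisons instead of roughly 20 for a search over all N.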
---

Fixing string iteration on narrow builds to produce code points the same as with wide builds is easy and costs O(1) per code point (character), which is the same as the current cost. Then

>>> from unicodedata import name
>>> name('\U0001043c')
'DESERET SMALL LETTER DEE'
>>> for c in 'a\U0001043c': name(c)
'LATIN SMALL LETTER A'
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    for c in 'a\U0001043c': name(c)
ValueError: no such name

would work like it does on wide builds instead of failing.
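
As a rough illustration of why the per-code-point cost stays O(1), the following sketch iterates over UTF-16 code units and joins each surrogate pair into one code point. The function name iter_code_points and the use of a list of ints to stand in for narrow-build storage are assumptions for illustration only; the real change would live in the C implementation of str iteration.

```python
from unicodedata import name


def iter_code_points(units):
    """Yield one str per code point from a sequence of UTF-16 code units."""
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        # A high surrogate followed by a low surrogate encodes one non-BMP char.
        if (0xD800 <= unit <= 0xDBFF and i + 1 < n
                and 0xDC00 <= units[i + 1] <= 0xDFFF):
            low = units[i + 1]
            yield chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # BMP char, or a lone surrogate passed through unchanged.
            yield chr(unit)
            i += 1


# 'a\U0001043c' as a narrow build would store it: three code units.
units = [0x0061, 0xD801, 0xDC3C]
names = [name(c) for c in iter_code_points(units)]
print(names)
```

Each step looks at one or two code units, so the work per code point is bounded by a constant, and name() receives a whole code point rather than half of a surrogate pair.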

I admit that it would be strange to have default iteration produce different items than default indexing (and indeed, str currently iterates by sequential indexing). But keeping them in sync means that buggy iteration is another cost of O(1) indexing.
