Author tchrist
Recipients Arfrever, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date 2011-08-14.01:11:51
Message-id <10002.1313284306@chthon>
In-reply-to <1313269670.3553.18.camel@localhost.localdomain>
Content
Antoine Pitrou <report@bugs.python.org> wrote
   on Sat, 13 Aug 2011 21:09:52 -0000: 

> And/or a lookup table giving the byte offset of, say, every 16th
> character. It gives you a O(1) lookup with a relatively reasonable
> constant cost (you have to scan for less than 16 characters after the
> lookup).

> On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
> table would be 1/16. It could also be constructed lazily whenever more
> than 2 positions are cached.
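
To make the quoted proposal concrete, here is a minimal Python sketch of a table holding the byte offset of every 16th character. It is only an illustration and not anything from CPython; the names STRIDE, build_offset_table, and char_at are hypothetical, and it assumes the input bytes are valid UTF-8.

```python
STRIDE = 16  # one table entry per 16 characters, as in the quoted proposal


def build_offset_table(data, stride=STRIDE):
    """Return byte offsets of characters 0, stride, 2*stride, ... in UTF-8 data."""
    table = []
    char_index = 0
    for byte_offset, byte in enumerate(data):
        # Any byte that is not a continuation byte (10xxxxxx) starts a character.
        if byte & 0xC0 != 0x80:
            if char_index % stride == 0:
                table.append(byte_offset)
            char_index += 1
    return table


def char_at(data, table, index, stride=STRIDE):
    """Return the character at `index`, scanning fewer than `stride` characters."""
    if index < 0:
        raise IndexError(index)
    slot = index // stride
    if slot >= len(table):
        raise IndexError(index)
    pos = table[slot]  # jump straight to the nearest recorded character
    for _ in range(index % stride):
        # Step over one character: its lead byte plus any continuation bytes.
        pos += 1
        while pos < len(data) and data[pos] & 0xC0 == 0x80:
            pos += 1
    if pos >= len(data):
        raise IndexError(index)
    end = pos + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[pos:end].decode("utf-8")
```

A lookup here is one table index followed by a scan over at most 15 characters, which is where the constant cost in the quoted text comes from. How much space the table takes depends on the width chosen for each stored offset.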

You really should talk to the Perl 6 people to see whether their current
strategy for caching offset maps for grapheme positions might be of use to
you.  Larry explained it to me once but I no longer recall any details.

I notice though that they don't seem to think it worth doing for UTF-8 
or UTF-16, just for their synthetic "NFG" (Grapheme Normalization Form)
strings, where it would be needed even if they used UTF-32 underneath.

--tom
History
Date User Action Args
2011-08-14 01:11:52 tchrist set recipients: + tchrist, terry.reedy, pitrou, mrabarnett, Arfrever, r.david.murray
2011-08-14 01:11:52 tchrist link issue12729 messages
2011-08-14 01:11:51 tchrist create