Message142042
Antoine Pitrou <report@bugs.python.org> wrote
on Sat, 13 Aug 2011 21:09:52 -0000:
> And/or a lookup table giving the byte offset of, say, every 16th
> character. It gives you a O(1) lookup with a relatively reasonable
> constant cost (you have to scan for less than 16 characters after the
> lookup).
> On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
> table would be 1/16. It could also be constructed lazily whenever more
> than 2 positions are cached.
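
[Illustration] A minimal sketch of the lookup table described in the quoted text, assuming valid UTF-8 input and an in-range index. The names `build_offset_table`, `byte_offset`, and `char_at` are hypothetical and are not taken from CPython or from the issue. A Python list stands in for the compact one-byte-per-entry table the quote has in mind, so the 1/16 space figure does not apply to this sketch.

```python
STRIDE = 16  # one table entry per 16 code points, as in the quoted proposal


def build_offset_table(data: bytes, stride: int = STRIDE) -> list:
    """Record the byte offset of every `stride`-th code point of valid UTF-8."""
    table = []
    count = 0
    for pos, byte in enumerate(data):
        # Bytes of the form 0b10xxxxxx are continuation bytes, not starts.
        if byte & 0xC0 != 0x80:
            if count % stride == 0:
                table.append(pos)
            count += 1
    return table


def byte_offset(data: bytes, table: list, index: int, stride: int = STRIDE) -> int:
    """Return the byte offset of code point `index` (assumed to be in range)."""
    pos = table[index // stride]   # O(1) jump to the nearest earlier checkpoint
    remaining = index % stride     # then scan forward fewer than `stride` characters
    while remaining:
        pos += 1
        while data[pos] & 0xC0 == 0x80:
            pos += 1               # skip continuation bytes of the current character
        remaining -= 1
    return pos


def char_at(data: bytes, table: list, index: int, stride: int = STRIDE) -> str:
    """Decode the single code point at position `index`."""
    start = byte_offset(data, table, index, stride)
    end = start + 1
    while end < len(data) and data[end] & 0xC0 == 0x80:
        end += 1
    return data[start:end].decode("utf-8")
```

The table could be built eagerly, as here, or lazily once a few positions have been requested, as the quote suggests; the sketch leaves that policy out.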
You really should talk to the Perl 6 people to see whether their current
strategy for caching offset maps for grapheme positions might be of use to
you. Larry explained it to me once but I no longer recall any details.
I notice though that they don't seem to think it worth doing for UTF-8
or UTF-16, just for their synthetic "NFG" (Grapheme Normalization Form)
strings, where it would be needed even if they used UTF-32 underneath.
--tom

Date | User | Action | Args
2011-08-14 01:11:52 | tchrist | set | recipients: + tchrist, terry.reedy, pitrou, mrabarnett, Arfrever, r.david.murray
2011-08-14 01:11:52 | tchrist | link | issue12729 messages
2011-08-14 01:11:51 | tchrist | create |