Message 142042 - Python tracker


Author	tchrist
Recipients	Arfrever, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy
Date	2011-08-14.01:11:51
Message-id	<10002.1313284306@chthon>
In-reply-to	<1313269670.3553.18.camel@localhost.localdomain>

Content
Antoine Pitrou <report@bugs.python.org> wrote
   on Sat, 13 Aug 2011 21:09:52 -0000: 

> And/or a lookup table giving the byte offset of, say, every 16th
> character. It gives you a O(1) lookup with a relatively reasonable
> constant cost (you have to scan for less than 16 characters after the
> lookup).

> On small strings (< 256 UTF-8 bytes) the space overhead for the lookup
> table would be 1/16. It could also be constructed lazily whenever more
> than 2 positions are cached.

You really should talk to the Perl 6 people to see whether their current
strategy for caching offset maps for grapheme positions might be of use to
you.  Larry explained it to me once but I no longer recall any details.

I notice though that they don't seem to think it worth doing for UTF-8 
or UTF-16, just for their synthetic "NFG" (Grapheme Normalization Form)
strings, where it would be needed even if they used UTF-32 underneath.

--tom
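
For illustration only, the lookup table described in the quoted text (a byte
offset recorded for every 16th character of a UTF-8 string) might be sketched
roughly as below. This is not code from CPython or Perl 6; the class name
IndexedUTF8, its methods, and the STRIDE constant are hypothetical, and the
table is built eagerly here rather than lazily as the quoted text suggests
it could be.

```python
STRIDE = 16  # assumed checkpoint spacing, taken from "every 16th character"


class IndexedUTF8:
    """Hypothetical UTF-8 buffer with a sparse character-to-byte offset table."""

    def __init__(self, text: str, stride: int = STRIDE):
        self.data = text.encode("utf-8")
        self.stride = stride
        self.offsets = []  # byte offset of characters 0, stride, 2*stride, ...
        self.length = 0    # number of code points seen
        for pos, byte in enumerate(self.data):
            # Any byte that is not a continuation byte (10xxxxxx) starts a character.
            if byte & 0xC0 != 0x80:
                if self.length % stride == 0:
                    self.offsets.append(pos)
                self.length += 1

    def byte_offset(self, index: int) -> int:
        """Return the byte offset of the character at code point index."""
        if not 0 <= index < self.length:
            raise IndexError("character index out of range")
        # Constant-time jump to the nearest checkpoint at or before index.
        pos = self.offsets[index // self.stride]
        # Bounded scan: fewer than `stride` characters remain to be skipped.
        for _ in range(index % self.stride):
            pos += 1
            while self.data[pos] & 0xC0 == 0x80:
                pos += 1
        return pos

    def char_at(self, index: int) -> str:
        """Decode and return the single character at code point index."""
        start = self.byte_offset(index)
        end = start + 1
        while end < len(self.data) and self.data[end] & 0xC0 == 0x80:
            end += 1
        return self.data[start:end].decode("utf-8")
```

The scan after the table lookup never skips more than stride - 1 characters,
which is the "relatively reasonable constant cost" mentioned in the quote. For
strings under 256 UTF-8 bytes each table entry would fit in a single byte,
which is where the 1/16 space overhead figure comes from; the list of Python
ints used here does not attempt to reproduce that compactness.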

History
Date	User	Action	Args
2011-08-14 01:11:52	tchrist	set	recipients: + tchrist, terry.reedy, pitrou, mrabarnett, Arfrever, r.david.murray
2011-08-14 01:11:52	tchrist	link	issue12729 messages
2011-08-14 01:11:51	tchrist	create