Message143088
Python makes it easy to transform a sequence with a generator as long as no look-ahead is needed. utf16.UTF16.__iter__ is a typical example. Whenever a surrogate is found, grab the matching one.
However, grapheme clustering does require look-ahead, which is a bit trickier. Assume s is a sanitized sequence of code points that all have Unicode database entries. Ignoring line endings, the following should work (I tested it with a toy definition of mark()):
    def graphemes(s):
        sit = iter(s)
        try: graph = [next(sit)]
        except StopIteration: graph = []
        for cp in sit:
            if mark(cp):
                graph.append(cp)
            else:
                yield combine(graph)
                graph = [cp]
        yield combine(graph)
I tested this with several inputs, using these toy definitions:
    def mark(cp): return cp == '.'
    def combine(l): return ''.join(l)
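For anyone who wants to run this, here is a self-contained sketch of the same generator with the toy definitions filled in. It differs from the version above in two ways that are assumptions on my part: mark and combine are passed as parameters so that more than one definition can be tried, and an empty input yields nothing rather than one empty cluster. The combining_mark() helper is also hypothetical; it treats any code point with a nonzero canonical combining class as a mark, which is only a rough approximation of real grapheme cluster boundaries.

```python
import unicodedata


def graphemes(s, mark, combine=''.join):
    """Yield clusters: one leading code point plus the marks that follow it."""
    sit = iter(s)
    try:
        graph = [next(sit)]
    except StopIteration:
        return  # empty input: yield nothing at all
    for cp in sit:
        if mark(cp):
            graph.append(cp)      # cp extends the current cluster
        else:
            yield combine(graph)  # cp starts a new cluster; emit the old one
            graph = [cp]
    yield combine(graph)          # emit the final cluster


def toy_mark(cp):
    # Toy definition from the message: '.' plays the role of a mark.
    return cp == '.'


def combining_mark(cp):
    # Assumption: "mark" means nonzero canonical combining class.
    return unicodedata.combining(cp) != 0


print(list(graphemes('a.b..c', toy_mark)))          # ['a.', 'b..', 'c']
print(len(list(graphemes('e\u0301x', combining_mark))))  # 2
```

The only look-ahead needed is one code point: a cluster is emitted as soon as the next non-mark arrives, and the last cluster is flushed after the loop.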
Python's object orientation makes formatting easy for the user. Assume someone does the hard work of writing (once ;-) a GCString class with a .__format__ method that interprets the format mini-language in terms of graphemes, using a generalized version of your 'simply horrible' code. This might be done by adapting str.__format__ to use the grapheme iterator above. Then users should be able to write
>>> '{:6.6}'.format(GCString("a̠ˈne̞ɣ̞ð̞o̞t̪a̠"))
"a̠ˈne̞ɣ̞ð̞"
(Note: Thunderbird properly displays characters with the marks beneath even though Firefox does not do so above or in its display of your message.)
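To make the idea concrete, here is a minimal sketch of what such a class might look like. GCString does not exist anywhere; the name comes from the example above, and everything else (the _clusters helper, the _SPEC pattern, the subset of the format mini-language handled) is an assumption made for illustration. It supports only fill, alignment (<, >, ^), width, and precision, and it again uses unicodedata.combining() as a stand-in for real grapheme cluster rules.

```python
import re
import unicodedata

# Hypothetical subset of the format mini-language: [[fill]align][width][.precision][s]
_SPEC = re.compile(
    r'(?:(?P<fill>.)?(?P<align>[<>^]))?(?P<width>\d+)?(?:\.(?P<prec>\d+))?s?'
)


def _clusters(text):
    """Group each code point with the combining marks that follow it."""
    cluster = ''
    for cp in text:
        if cluster and unicodedata.combining(cp):
            cluster += cp
        else:
            if cluster:
                yield cluster
            cluster = cp
    if cluster:
        yield cluster


class GCString:
    """Illustrative only: a string whose width and precision count clusters."""

    def __init__(self, text):
        self.clusters = list(_clusters(text))

    def __str__(self):
        return ''.join(self.clusters)

    def __format__(self, spec):
        m = _SPEC.fullmatch(spec)
        if m is None:
            raise ValueError('unsupported format spec: %r' % spec)
        clusters = self.clusters
        if m.group('prec') is not None:
            clusters = clusters[:int(m.group('prec'))]  # truncate by cluster
        pad = max(int(m.group('width') or 0) - len(clusters), 0)
        fill = m.group('fill') or ' '
        align = m.group('align') or '<'
        body = ''.join(clusters)
        if align == '<':
            return body + fill * pad
        if align == '>':
            return fill * pad + body
        left = pad // 2  # '^': any odd padding goes on the right, as with str
        return fill * left + body + fill * (pad - left)


word = GCString('e\u0301a\u0320x')        # three clusters, five code points
print(len(word.clusters))                  # 3
print(len('{:>4.2}'.format(word)))         # 6: two pad spaces + four code points
```

Because padding and truncation are computed from the cluster count rather than from len() of the underlying str, a base character and its marks are never split and never counted as more than one column.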
Date | User | Action | Args
2011-08-27 20:23:39 | terry.reedy | set | recipients: + terry.reedy, lemburg, gvanrossum, pitrou, vstinner, jkloth, ezio.melotti, mrabarnett, Arfrever, v+python, r.david.murray, tchrist
2011-08-27 20:23:39 | terry.reedy | set | messageid: <1314476619.85.0.242545829832.issue12729@psf.upfronthosting.co.za>
2011-08-27 20:23:39 | terry.reedy | link | issue12729 messages
2011-08-27 20:23:38 | terry.reedy | create |