Message 143088 - Python tracker



Author	terry.reedy
Recipients	Arfrever, ezio.melotti, gvanrossum, jkloth, lemburg, mrabarnett, pitrou, r.david.murray, tchrist, terry.reedy, v+python, vstinner
Date	2011-08-27.20:23:38
Message-id	<1314476619.85.0.242545829832.issue12729@psf.upfronthosting.co.za>
In-reply-to

Content

Python makes it easy to transform a sequence with a generator as long as no look-ahead is needed. utf16.UTF16.__iter__ is a typical example. Whenever a surrogate is found, grab the matching one.
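As a rough illustration of the no-look-ahead case, here is a minimal sketch of a generator that pairs surrogates while walking UTF-16 code units. It is not the utf16.UTF16.__iter__ code mentioned above; the name code_points, the integer input, and the error handling are assumptions made for this example.

```python
def code_points(units):
    """Yield code points from an iterable of UTF-16 code units (ints)."""
    it = iter(units)
    for unit in it:
        if 0xD800 <= unit <= 0xDBFF:
            # High (lead) surrogate: consume the next unit as its partner.
            low = next(it, None)
            if low is None or not 0xDC00 <= low <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low (trail) surrogate with no preceding high surrogate.
            raise ValueError("unpaired low surrogate")
        else:
            yield unit
```

The generator only ever consumes forward from the same iterator, so nothing has to be pushed back.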

However, grapheme clustering does require look-ahead, which is a bit trickier. Assume s is a sanitized sequence of code points with Unicode database entries. Ignoring line endings, the following should work (I tested it with a toy definition of mark()):

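def graphemes(s):
  sit = iter(s)
  try: graph = [next(sit)]
  except StopIteration: return  # empty input: no graphemes to yield

  for cp in sit:
    if mark(cp):
      graph.append(cp)  # combining mark: extend the current cluster
    else:
      yield combine(graph)  # cp starts a new cluster; emit the previous one
      graph = [cp]

  yield combine(graph)  # flush the final cluster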

I tested this with several inputs, using
def mark(cp): return cp == '.'
def combine(l): return ''.join(l)
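For something closer to real text, mark() could consult the Unicode database. The sketch below repeats the generator so that it runs on its own and assumes that "mark" means any code point whose general category starts with 'M' (Mn, Mc, Me). That is only an approximation of full grapheme cluster segmentation, which has further rules this sketch ignores.

```python
import unicodedata


def mark(cp):
    # Assumption: treat every code point in a Mark category as combining.
    return unicodedata.category(cp).startswith("M")


def combine(parts):
    return "".join(parts)


def graphemes(s):
    sit = iter(s)
    try:
        graph = [next(sit)]
    except StopIteration:
        return  # empty input yields nothing
    for cp in sit:
        if mark(cp):
            graph.append(cp)      # attach the mark to the current cluster
        else:
            yield combine(graph)  # a new base character closes the cluster
            graph = [cp]
    yield combine(graph)          # flush the last cluster


clusters = list(graphemes("e\u0301a"))  # 'e' + combining acute, then 'a'
```

The one-item delay (holding a cluster until the next base character arrives) is what stands in for look-ahead here.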

Python's object orientation makes formatting easy for the user. Assume someone does the hard work of writing (once ;-) a GCString class with a .__format__ method that interprets the format mini-language for graphemes, using a generalized version of your 'simply horrible' code. This might be done by adapting str.__format__ to use the grapheme iterator above. Then users should be able to write

>>> '{:6.6}'.format(GCString("a̠ˈne̞ɣ̞ð̞o̞t̪a̠"))
"a̠ˈne̞ɣ̞ð̞"
(Note: Thunderbird properly displays characters with the marks beneath even though Firefox does not do so above or in its display of your message.)
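A minimal sketch of what such a GCString might look like follows. Everything in it is an assumption made for illustration: the class layout, the category-based mark() test, and the fact that only the fill, align, width, and precision parts of the format mini-language are handled. It counts width and precision in clusters rather than code points, which is the behaviour the example above relies on.

```python
import re
import unicodedata

# Subset of the format mini-language: [[fill]align][width][.precision][s]
_SPEC = re.compile(
    r"(?:(?P<fill>.)?(?P<align>[<>^]))?(?P<width>\d+)?(?:\.(?P<prec>\d+))?s?"
)


def graphemes(s):
    """Group each base character with the marks that follow it."""
    graph = []
    for cp in s:
        if graph and unicodedata.category(cp).startswith("M"):
            graph.append(cp)
        else:
            if graph:
                yield "".join(graph)
            graph = [cp]
    if graph:
        yield "".join(graph)


class GCString:
    """Hypothetical string wrapper that formats by grapheme cluster."""

    def __init__(self, text):
        self.text = text
        self.clusters = list(graphemes(text))

    def __len__(self):
        return len(self.clusters)

    def __format__(self, spec):
        m = _SPEC.fullmatch(spec)
        if m is None:
            raise ValueError(f"unsupported format spec: {spec!r}")
        clusters = self.clusters
        if m["prec"] is not None:
            clusters = clusters[: int(m["prec"])]  # truncate by cluster
        pad = max(int(m["width"] or 0) - len(clusters), 0)
        fill = m["fill"] or " "
        align = m["align"] or "<"                  # strings default to left
        body = "".join(clusters)
        if align == "<":
            return body + fill * pad
        if align == ">":
            return fill * pad + body
        left = pad // 2
        return fill * left + body + fill * (pad - left)
```

Padding is computed from the cluster count, so a result can contain more code points than the requested width while still occupying the intended number of display cells (assuming each cluster renders as one cell).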

History
Date	User	Action	Args
2011-08-27 20:23:39	terry.reedy	set	recipients: + terry.reedy, lemburg, gvanrossum, pitrou, vstinner, jkloth, ezio.melotti, mrabarnett, Arfrever, v+python, r.david.murray, tchrist
2011-08-27 20:23:39	terry.reedy	set	messageid: <1314476619.85.0.242545829832.issue12729@psf.upfronthosting.co.za>
2011-08-27 20:23:39	terry.reedy	link	issue12729 messages
2011-08-27 20:23:38	terry.reedy	create