
Author Guillaume Sanchez
Recipients Arfrever, Guillaume Sanchez, Nicholas.Cole, benjamin.peterson, eric.araujo, ezio.melotti, inigoserna, lemburg, loewis, poq, r.david.murray, serhiy.storchaka, tchrist, terry.reedy, vstinner, zeha
Date 2017-07-13.23:47:06
Message-id <1499989626.46.0.535744055477.issue12568@psf.upfronthosting.co.za>
In-reply-to
Content
Hello,

I'm coming from bugs.python.org/issue30717. I have a pending PR awaiting review (https://github.com/python/cpython/pull/2673) that adds a function to break Unicode strings into grapheme clusters (what one would intuitively call "a character"). It is based on the grapheme cluster breaking algorithm from TR29.

Let me know if this is of any relevance.

Quick demo:
>>> import unicodedata
>>> a = unicodedata.break_graphemes("lol")
>>> list(a)
['l', 'o', 'l']
>>> list(unicodedata.break_graphemes("lo\u0309l"))
['l', 'ỏ', 'l']
>>> list(unicodedata.break_graphemes("lo\u0309\u0301l"))
['l', 'ỏ́', 'l']
>>> list(unicodedata.break_graphemes("lo\u0301l"))
['l', 'ó', 'l']
>>> list(unicodedata.break_graphemes(""))
[]
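
The demo above uses the function proposed in the PR. For comparison, the same inputs can be roughly approximated with the existing unicodedata.combining() call. This is only a minimal sketch, not the PR's implementation: the helper name approx_graphemes is hypothetical, and it looks only at the canonical combining class, so it ignores most of the TR29 rules (Hangul syllables, emoji ZWJ sequences, CRLF and so on).

```python
import unicodedata


def approx_graphemes(text):
    """Yield each base character together with the combining marks after it.

    Hypothetical helper for illustration only. It is NOT the TR29 algorithm:
    it groups characters purely by canonical combining class.
    """
    cluster = ""
    for ch in text:
        # A non-zero combining class means the character attaches to the
        # preceding base character, so keep it in the current cluster.
        if cluster and unicodedata.combining(ch) != 0:
            cluster += ch
        else:
            # A new base character starts a new cluster.
            if cluster:
                yield cluster
            cluster = ch
    if cluster:
        yield cluster


print(list(approx_graphemes("lol")))
print(list(approx_graphemes("lo\u0309\u0301l")))
print(list(approx_graphemes("")))
```

On the demo strings this groups "o" with the combining marks that follow it, as in the session above. Inputs that depend on the other TR29 rules would be split differently, which is the gap the proposed function is meant to close.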