Author Guillaume Sanchez
Recipients Guillaume Sanchez, steven.daprano
Date 2017-06-21.00:55:40
Message-id <1498006541.96.0.125467711843.issue30717@psf.upfronthosting.co.za>
In-reply-to
Content
Thanks for all those interesting cases you brought here! I didn't think of that at all!

I'm using the word "grapheme" as per the definition given in UAX #29 (TR29), which is *not* language/locale dependent [1].

This annex is very specific and precise about where to break "grapheme clusters", i.e. where a user-perceived character starts and ends. Sadly, it's a bit more complex than just accumulating code points based on the Combining property. The annex gives a set of rules to implement, based on the Grapheme_Cluster_Break property. It is tempting to implement those rules by comparing adjacent pairs of code points, but that is wrong, because some rules depend on more context than the immediately preceding code point. They can, however, be implemented correctly and efficiently as an automaton. My code [2] passes all tests from GraphemeBreakTest.txt (provided by Unicode).
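To illustrate the point about pairwise comparison, here is a minimal Python sketch. It is not the code from [2] and not a complete UAX #29 implementation. The function names `toy_gcb` and `graphemes` are made up for this example, and `toy_gcb` is only a rough approximation of Grapheme_Cluster_Break, since `unicodedata` does not expose that property. It covers only rules GB3, GB4, GB5, GB9, GB12/GB13 and GB999, and the regional indicator case shows why one extra piece of state is needed beyond the previous code point.

```python
import unicodedata


def toy_gcb(ch):
    """Rough, hypothetical stand-in for the Grapheme_Cluster_Break property.

    Assumption: only a few classes are approximated from what the
    unicodedata module exposes; the real property has more values.
    """
    cp = ord(ch)
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if cp == 0x200D:
        return "ZWJ"
    if 0x1F1E6 <= cp <= 0x1F1FF:
        return "RI"  # Regional_Indicator
    category = unicodedata.category(ch)
    if category in ("Mn", "Me"):
        return "Extend"
    if category in ("Cc", "Zl", "Zp"):
        return "Control"
    return "Other"


def graphemes(text):
    """Yield approximate grapheme clusters of text (simplified rules only)."""
    cluster = ""
    prev = None
    ri_run = 0  # consecutive Regional_Indicators seen just before this one
    for ch in text:
        cur = toy_gcb(ch)
        if not cluster:
            attach = True  # first code point opens the first cluster
        elif prev == "CR" and cur == "LF":
            attach = True  # GB3: never break inside CR LF
        elif prev in ("CR", "LF", "Control") or cur in ("CR", "LF", "Control"):
            attach = False  # GB4, GB5: break around controls
        elif cur in ("Extend", "ZWJ"):
            attach = True  # GB9: never break before Extend or ZWJ
        elif prev == "RI" and cur == "RI":
            # GB12/GB13: flags are pairs, so the answer depends on the
            # parity of the whole run, not just on the adjacent pair.
            attach = ri_run % 2 == 1
        else:
            attach = False  # GB999: break everywhere else
        if not attach:
            yield cluster
            cluster = ""
        cluster += ch
        ri_run = ri_run + 1 if cur == "RI" else 0
        prev = cur
    if cluster:
        yield cluster
```

For example, four regional indicators in a row form two flags, which no decision based only on the adjacent pair (RI, RI) can get right:

```python
flags = "\U0001F1EB\U0001F1F7\U0001F1E9\U0001F1EA"  # FR followed by DE
print([len(g) for g in graphemes(flags)])  # [2, 2]
print(list(graphemes("e\u0301")) == ["e\u0301"])  # True
```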

We can definitely do a generator like you propose, or rather do it in the C layer to gain efficiency and consistency, since the other string / Unicode operations (upper, lower, casefold, etc.) are implemented there.

Let me know what you think, and whether (and what) I should contribute :)

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
[2] https://github.com/Vermeille/batriz/blob/master/src/str/grapheme_iterator.h#L31