Message296504
Thanks for all those interesting cases you brought here! I didn't think of that at all!
I'm using the word "grapheme" as per the definition given in UAX #29 (TR29), which is *not* language/locale dependent [1].
This annex is very specific and precise about where to break between grapheme clusters, i.e. where a user-perceived character starts and ends. Sadly, it's a bit more complex than just accumulating code points based on the Combining property. The annex gives a set of rules based on the Grapheme_Cluster_Break property. It is tempting to implement those rules naively by comparing adjacent pairs of code points, but that is wrong, because some rules depend on more context than just the previous code point. They can be implemented correctly and efficiently as an automaton. My code [2] passes all the tests in GraphemeBreakTest.txt (provided by Unicode).
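To illustrate why looking at adjacent pairs is not enough, here is a minimal Python sketch. It is not the code from [2] and not a full UAX #29 implementation: the names `iter_graphemes_sketch`, `_is_extend` and `_is_regional_indicator` are hypothetical, the Extend/SpacingMark classes are only approximated through `unicodedata.category`, and Hangul syllables, emoji ZWJ sequences, Prepend and general control characters are left out. What it does show is the extra state (the length of the current run of regional indicators) needed to split flag sequences correctly.

```python
import unicodedata

ZWJ = "\u200d"


def _is_regional_indicator(ch):
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def _is_extend(ch):
    # Assumption: rough stand-in for Grapheme_Cluster_Break in
    # {Extend, SpacingMark, ZWJ}, which the stdlib does not expose.
    return ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me", "Mc")


def iter_graphemes_sketch(text):
    """Yield approximate grapheme clusters of `text` (simplified rules)."""
    cluster = ""
    ri_run = 0  # regional indicators at the end of the current cluster
    for ch in text:
        if cluster:
            prev = cluster[-1]
            if prev == "\r" and ch == "\n":
                join = True   # GB3: never break inside CR LF
            elif prev in "\r\n" or ch in "\r\n":
                join = False  # GB4/GB5, restricted to CR and LF here
            elif _is_extend(ch):
                join = True   # GB9/GB9a: never break before a mark
            elif _is_regional_indicator(ch) and ri_run % 2 == 1:
                join = True   # GB12/GB13: pair up regional indicators
            else:
                join = False  # GB999: break everywhere else
            if not join:
                yield cluster
                cluster = ""
                ri_run = 0
        cluster += ch
        # This counter is the state a pairwise comparison cannot see.
        ri_run = ri_run + 1 if _is_regional_indicator(ch) else 0
    if cluster:
        yield cluster
```

With four regional indicators in a row, a pairwise rule "regional indicator followed by regional indicator: do not break" would glue all four together, while the counter above splits them into two flags. A real implementation would read Grapheme_Cluster_Break from the Unicode Character Database instead of guessing from the general category.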
We can definitely do a generator like you propose, or rather do it in the C layer to gain efficiency and stay consistent with the other string / Unicode operations that already live in the C layer (upper, lower, casefold, etc.).
Let me know what you guys think, what (and if) I should contribute :)
[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
[2] https://github.com/Vermeille/batriz/blob/master/src/str/grapheme_iterator.h#L31
Date | User | Action | Args
2017-06-21 00:55:42 | Guillaume Sanchez | set | recipients: + Guillaume Sanchez, steven.daprano
2017-06-21 00:55:41 | Guillaume Sanchez | set | messageid: <1498006541.96.0.125467711843.issue30717@psf.upfronthosting.co.za>
2017-06-21 00:55:41 | Guillaume Sanchez | link | issue30717 messages
2017-06-21 00:55:40 | Guillaume Sanchez | create |