msg192684 - (view) Author: David P. Kendal (dpk) Date: 2013-07-08 18:25
On python-ideas I proposed the addition of a way to iterate over the graphemes of a string, either as part of the unicodedata library or as a method on the built-in str type.

I provided a sample implementation, but "MRAB" pointed out that my definition of a grapheme is slightly wrong; it's a little more complex than just "character followed by combiners".
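
For illustration only, here is a minimal sketch of the "character followed by combiners" definition. It is not the sample implementation posted to python-ideas; `naive_graphemes` is a hypothetical name, and the sketch assumes that "combiner" means a character with a non-zero canonical combining class.

```python
import unicodedata


def naive_graphemes(string):
    """Yield runs of one base character plus any following combining marks.

    Hypothetical helper: illustrates the simplified definition only.
    """
    cluster = ""
    for char in string:
        # unicodedata.combining() returns the canonical combining class;
        # 0 means the character does not combine with its predecessor.
        if cluster and unicodedata.combining(char) == 0:
            yield cluster
            cluster = ""
        cluster += char
    if cluster:
        yield cluster
```

Under this definition "e\u0301" stays together, but "\r\n" and the conjoining Hangul jamo pair "\u1100\u1161" are each split in two, which is where the simplified definition falls short of the Unicode grapheme cluster rules.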

M.-A. Lemburg asked me to open this issue.
msg192724 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2013-07-09 07:42
It may be useful to also add the start position of the grapheme to the iterator output.

Related to this, please also see this pre-PEP I once wrote for a Unicode indexing module:
msg192769 - (view) Author: Matthew Barnett (mrabarnett) * (Python triager) Date: 2013-07-09 17:25
This is basically what the regex module does, written in Python:

    def get_grapheme_cluster_break(codepoint):
        """Gets the "Grapheme Cluster Break" property of a codepoint.

        The properties are defined in GraphemeBreakProperty.txt in the
        Unicode Character Database.
        """
        # The return value is one of:
        #     "Other"
        #     "CR"
        #     "LF"
        #     "Control"
        #     "Extend"
        #     "Prepend"
        #     "Regional_Indicator"
        #     "SpacingMark"
        #     "L"
        #     "V"
        #     "T"
        #     "LV"
        #     "LVT"
        # The property lookup itself is not shown.
        raise NotImplementedError

    def at_grapheme_boundary(string, index):
        """Checks whether the codepoint at 'index' is on a grapheme boundary.

        The rules are defined in Unicode Standard Annex #29, "Unicode Text
        Segmentation".
        """
        # Break at the start and end of the text.
        if index <= 0 or index >= len(string):
            return True
        prop = get_grapheme_cluster_break(string[index])
        prop_m1 = get_grapheme_cluster_break(string[index - 1])
        # Don't break within CRLF.
        if prop_m1 == "CR" and prop == "LF":
            return False
        # Otherwise break before and after controls (including CR and LF).
        if prop_m1 in ("Control", "CR", "LF") or prop in ("Control", "CR", "LF"):
            return True
        # Don't break Hangul syllable sequences.
        if prop_m1 == "L" and prop in ("L", "V", "LV", "LVT"):
            return False
        if prop_m1 in ("LV", "V") and prop in ("V", "T"):
            return False
        if prop_m1 in ("LVT", "T") and prop == "T":
            return False
        # Don't break between regional indicator symbols.
        if prop_m1 == "Regional_Indicator" and prop == "Regional_Indicator":
            return False
        # Don't break just before Extend characters.
        if prop == "Extend":
            return False
        # Don't break before SpacingMarks, or after Prepend characters.
        if prop == "SpacingMark":
            return False
        if prop_m1 == "Prepend":
            return False
        # Otherwise, break everywhere.
        return True
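
A boundary test of this shape can drive an iterator that also reports each cluster's start position, as suggested earlier in this issue. The following is only a sketch under stated assumptions: `iter_graphemes` and `simple_boundary` are hypothetical names, and `simple_boundary` is a deliberately reduced stand-in (CRLF and combining marks only) so that the example runs without a Grapheme_Cluster_Break lookup table.

```python
import unicodedata


def iter_graphemes(string, at_boundary):
    """Yield (start, cluster) pairs, splitting wherever at_boundary() is true.

    'at_boundary' is any callable taking (string, index) and returning a
    bool, with the same contract as at_grapheme_boundary above.
    """
    start = 0
    for index in range(1, len(string) + 1):
        if at_boundary(string, index):
            yield start, string[start:index]
            start = index


def simple_boundary(string, index):
    """Reduced stand-in predicate: handles only CRLF and combining marks."""
    # Break at the start and end of the text.
    if index <= 0 or index >= len(string):
        return True
    # Don't break within CRLF.
    if string[index - 1] == "\r" and string[index] == "\n":
        return False
    # Don't break before a character with a non-zero combining class.
    return unicodedata.combining(string[index]) == 0


clusters = list(iter_graphemes("e\u0301x\r\n", simple_boundary))
# clusters == [(0, "e\u0301"), (2, "x"), (3, "\r\n")]
```

Passing the predicate in as a parameter keeps the iterator independent of how the break property is looked up, so a full implementation of the rules can be substituted for the stand-in.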
msg299697 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-08-03 11:07
Issue30717 has a patch.
