unicodedata.itergraphemes / str.itergraphemes / str.graphemes #62606

dpk · 2013-07-08T18:25:45Z

BPO	18406
Nosy	@malemburg, @loewis, @benjaminp, @ezio-melotti, @serhiy-storchaka
Superseder	bpo-30717: Add unicode grapheme cluster break algorithm

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

GitHub fields:

assignee = None
closed_at = <Date 2017-08-03.11:07:26.423>
created_at = <Date 2013-07-08.18:25:45.309>
labels = ['type-feature', '3.7', 'expert-unicode']
title = 'unicodedata.itergraphemes / str.itergraphemes / str.graphemes'
updated_at = <Date 2017-08-03.11:07:26.422>
user = 'https://bugs.python.org/dpk'

bugs.python.org fields:

activity = <Date 2017-08-03.11:07:26.422>
actor = 'serhiy.storchaka'
assignee = 'none'
closed = True
closed_date = <Date 2017-08-03.11:07:26.423>
closer = 'serhiy.storchaka'
components = ['Unicode']
creation = <Date 2013-07-08.18:25:45.309>
creator = 'dpk'
dependencies = []
files = []
hgrepos = []
issue_num = 18406
keywords = []
message_count = 4.0
messages = ['192684', '192724', '192769', '299697']
nosy_count = 9.0
nosy_names = ['lemburg', 'loewis', 'benjamin.peterson', 'ezio.melotti', 'mrabarnett', 'cvrebert', 'serhiy.storchaka', 'dpk', 'Socob']
pr_nums = []
priority = 'normal'
resolution = 'duplicate'
stage = 'resolved'
status = 'closed'
superseder = '30717'
type = 'enhancement'
url = 'https://bugs.python.org/issue18406'
versions = ['Python 3.7']

dpk · 2013-07-08T18:25:45Z

On python-ideas I proposed adding a way to iterate over the graphemes of a string, either as part of the unicodedata library or as a method on the built-in str type. <http://mail.python.org/pipermail/python-ideas/2013-July/021916.html>

I provided a sample implementation, but "MRAB" pointed out that my definition of a grapheme is slightly wrong; it's a little more complex than just "character followed by combiners". <http://mail.python.org/pipermail/python-ideas/2013-July/021917.html>

M.-A. Lemburg asked me to open this issue. <http://mail.python.org/pipermail/python-ideas/2013-July/021929.html>

malemburg · 2013-07-09T07:42:49Z

It may be useful to also add the start position of the grapheme to the iterator output.

Related to this, please also see this pre-PEP I once wrote for a Unicode indexing module:

http://mail.python.org/pipermail/python-dev/2001-July/015938.html

mrabarnett · 2013-07-09T17:25:36Z

This is basically what the regex module does, written in Python:

    def get_grapheme_cluster_break(codepoint):
        """Gets the "Grapheme Cluster Break" property of a codepoint.
    
        The properties defined here:
    
        http://www.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
        """
        # The return value is one of:
        #
        #     "Other"
        #     "CR"
        #     "LF"
        #     "Control"
        #     "Extend"
        #     "Prepend"
        #     "Regional_Indicator"
        #     "SpacingMark"
        #     "L"
        #     "V"
        #     "T"
        #     "LV"
        #     "LVT"
        ...
    
    def at_grapheme_boundary(string, index):
        """Checks whether the codepoint at 'index' is on a grapheme boundary.
    
        The rules are defined here:
    
        http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
        """
        # Break at the start and end of the text.
        if index <= 0 or index >= len(string):
            return True
    
        prop = get_grapheme_cluster_break(string[index])
        prop_m1 = get_grapheme_cluster_break(string[index - 1])
    
        # Don't break within CRLF.
        if prop_m1 == "CR" and prop == "LF":
            return False
    
        # Otherwise break before and after controls (including CR and LF).
        if prop_m1 in ("Control", "CR", "LF") or prop in ("Control", "CR", "LF"):
            return True
    
        # Don't break Hangul syllable sequences.
        if prop_m1 == "L" and prop in ("L", "V", "LV", "LVT"):
            return False
        if prop_m1 in ("LV", "V") and prop in ("V", "T"):
            return False
        if prop_m1 in ("LVT", "T") and prop == "T":
            return False
    
        # Don't break between regional indicator symbols.
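        # The property name must match the "Regional_Indicator" value
        # returned by get_grapheme_cluster_break().
        if prop_m1 == "Regional_Indicator" and prop == "Regional_Indicator":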
            return False
    
        # Don't break just before Extend characters.
        if prop == "Extend":
            return False
    
        # Don't break before SpacingMarks, or after Prepend characters.
        if prop == "SpacingMark":
            return False
    
        if prop_m1 == "Prepend":
            return False
    
        # Otherwise, break everywhere.
        return True
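
A boundary test like the one above can drive the iterator requested in this issue, including the start position suggested earlier in the thread. The following is only a minimal sketch under stated assumptions: the name `itergraphemes` comes from the issue title, but its signature and the `(start, grapheme)` output are hypothetical, and `simple_boundary` is a stand-in that implements only the "character followed by combiners" approximation so that the example runs on its own. Passing a full UAX #29 predicate such as `at_grapheme_boundary` as `at_boundary` would replace it.

```python
import unicodedata


def simple_boundary(string, index):
    """Stand-in boundary test (an assumption for this sketch).

    Breaks everywhere except just before a combining mark. This is NOT the
    full UAX #29 rule set; it only keeps the example self-contained.
    """
    # Break at the start and end of the text.
    if index <= 0 or index >= len(string):
        return True
    # A non-zero canonical combining class means "attach to the previous char".
    return unicodedata.combining(string[index]) == 0


def itergraphemes(string, at_boundary=simple_boundary):
    """Yield (start, grapheme) pairs for each cluster in 'string'.

    'at_boundary(string, index)' must return True when a cluster may end
    just before 'index', and True for index >= len(string).
    """
    start = 0
    for index in range(1, len(string) + 1):
        if at_boundary(string, index):
            yield start, string[start:index]
            start = index


# "e" + COMBINING ACUTE ACCENT stays together; "a" is its own cluster.
clusters = list(itergraphemes("e\u0301a"))
print(clusters)
```

Taking the predicate as a parameter keeps the iteration logic independent of which boundary rules are in use, so the same loop works with either the approximation or the full rules.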

serhiy-storchaka · 2017-08-03T11:07:26Z

bpo-30717 has a patch.

dpk mannequin added the type-feature label Jul 8, 2013

malemburg added the topic-unicode label Jul 9, 2013

serhiy-storchaka added the 3.7 (EOL) label Jul 24, 2017

serhiy-storchaka closed this as completed Aug 3, 2017

ezio-melotti transferred this issue from another repository Apr 10, 2022
