unicodedata.itergraphemes / str.itergraphemes / str.graphemes #62606

Closed
dpk mannequin opened this issue Jul 8, 2013 · 4 comments
Labels
3.7 (EOL) end of life topic-unicode type-feature A feature request or enhancement

Comments

@dpk
Mannequin

dpk mannequin commented Jul 8, 2013

BPO 18406
Nosy @malemburg, @loewis, @benjaminp, @ezio-melotti, @serhiy-storchaka
Superseder
  • bpo-30717: Add unicode grapheme cluster break algorithm
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2017-08-03.11:07:26.423>
    created_at = <Date 2013-07-08.18:25:45.309>
    labels = ['type-feature', '3.7', 'expert-unicode']
    title = 'unicodedata.itergraphemes / str.itergraphemes / str.graphemes'
    updated_at = <Date 2017-08-03.11:07:26.422>
    user = 'https://bugs.python.org/dpk'

    bugs.python.org fields:

    activity = <Date 2017-08-03.11:07:26.422>
    actor = 'serhiy.storchaka'
    assignee = 'none'
    closed = True
    closed_date = <Date 2017-08-03.11:07:26.423>
    closer = 'serhiy.storchaka'
    components = ['Unicode']
    creation = <Date 2013-07-08.18:25:45.309>
    creator = 'dpk'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 18406
    keywords = []
    message_count = 4.0
    messages = ['192684', '192724', '192769', '299697']
    nosy_count = 9.0
    nosy_names = ['lemburg', 'loewis', 'benjamin.peterson', 'ezio.melotti', 'mrabarnett', 'cvrebert', 'serhiy.storchaka', 'dpk', 'Socob']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '30717'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue18406'
    versions = ['Python 3.7']

    @dpk
    Mannequin Author

    dpk mannequin commented Jul 8, 2013

    On python-ideas I proposed the addition of a way to iterate over the graphemes of a string, either as part of the unicodedata library or as a method on the built-in str type. <http://mail.python.org/pipermail/python-ideas/2013-July/021916.html>

    I provided a sample implementation, but "MRAB" pointed out that my definition of a grapheme is slightly wrong; it's a little more complex than just "character followed by combiners". <http://mail.python.org/pipermail/python-ideas/2013-July/021917.html>

    M.-A. Lemburg asked me to open this issue. <http://mail.python.org/pipermail/python-ideas/2013-July/021929.html>
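
For illustration only, here is a minimal sketch of the "character followed by combiners" definition, assuming a combiner is any character with a non-zero canonical combining class. It is not the sample implementation posted to python-ideas, and the name `naive_graphemes` is hypothetical. It also shows why the definition falls short: CR LF and Hangul jamo sequences are split even though UAX #29 keeps them together.

```python
import unicodedata


def naive_graphemes(string):
    """Yield runs of one base character plus any following combiners.

    Assumption: "combiner" means a non-zero canonical combining class.
    This is only an approximation of a grapheme cluster.
    """
    cluster = ""
    for char in string:
        # A character with combining class 0 starts a new cluster.
        if cluster and unicodedata.combining(char) == 0:
            yield cluster
            cluster = ""
        cluster += char
    if cluster:
        yield cluster


# "e" + COMBINING ACUTE ACCENT stays together, as intended.
print(list(naive_graphemes("e\u0301x")))
# CR LF is split in two, although UAX #29 treats it as one cluster.
print(list(naive_graphemes("\r\n")))
```
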

    @dpk dpk mannequin added the type-feature A feature request or enhancement label Jul 8, 2013
    @malemburg
    Member

    It may be useful to also add the start position of the grapheme to the iterator output.

    Related to this, please also see this pre-PEP I once wrote for a Unicode indexing module:

    http://mail.python.org/pipermail/python-dev/2001-July/015938.html

    @mrabarnett
    Mannequin

    mrabarnett mannequin commented Jul 9, 2013

    This is basically what the regex module does, written in Python:

        def get_grapheme_cluster_break(codepoint):
            """Gets the "Grapheme Cluster Break" property of a codepoint.
        
            The properties defined here:
        
            http://www.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
            """
            # The return value is one of:
            #
            #     "Other"
            #     "CR"
            #     "LF"
            #     "Control"
            #     "Extend"
            #     "Prepend"
            #     "Regional_Indicator"
            #     "SpacingMark"
            #     "L"
            #     "V"
            #     "T"
            #     "LV"
            #     "LVT"
            ...
        
        def at_grapheme_boundary(string, index):
            """Checks whether the codepoint at 'index' is on a grapheme boundary.
        
            The rules are defined here:
        
            http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
            """
            # Break at the start and end of the text.
            if index <= 0 or index >= len(string):
                return True
        
            prop = get_grapheme_cluster_break(string[index])
            prop_m1 = get_grapheme_cluster_break(string[index - 1])
        
            # Don't break within CRLF.
            if prop_m1 == "CR" and prop == "LF":
                return False
        
            # Otherwise break before and after controls (including CR and LF).
            if prop_m1 in ("Control", "CR", "LF") or prop in ("Control", "CR", "LF"):
                return True
        
            # Don't break Hangul syllable sequences.
            if prop_m1 == "L" and prop in ("L", "V", "LV", "LVT"):
                return False
            if prop_m1 in ("LV", "V") and prop in ("V", "T"):
                return False
            if prop_m1 in ("LVT", "T") and prop == "T":
                return False
        
            # Don't break between regional indicator symbols.
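            # (The property value is spelled "Regional_Indicator", matching the
            # list of return values above.)
            if prop_m1 == "Regional_Indicator" and prop == "Regional_Indicator":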
                return False
        
            # Don't break just before Extend characters.
            if prop == "Extend":
                return False
        
            # Don't break before SpacingMarks, or after Prepend characters.
            if prop == "SpacingMark":
                return False
        
            if prop_m1 == "Prepend":
                return False
        
            # Otherwise, break everywhere.
            return True
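
A boundary predicate of this shape can be turned into the requested iterator, including the start position suggested earlier in the thread. The following is a rough sketch under stated assumptions: `iter_graphemes` and `toy_boundary` are hypothetical names, and `toy_boundary` is a deliberately reduced stand-in that only knows about CR LF and combining marks (general categories Mn and Me), so that the example runs without the full Grapheme_Cluster_Break property table. A function like `at_grapheme_boundary` above could be passed in its place.

```python
import unicodedata


def iter_graphemes(string, is_boundary):
    """Yield (start, cluster) pairs for each cluster in 'string'.

    'is_boundary(string, index)' must return True when a break is allowed
    before 'index', and must return True for index == len(string).
    """
    start = 0
    for index in range(1, len(string) + 1):
        if is_boundary(string, index):
            yield start, string[start:index]
            start = index


def toy_boundary(string, index):
    """Reduced stand-in predicate: handles only CR LF and combining marks."""
    if index <= 0 or index >= len(string):
        return True
    prev, cur = string[index - 1], string[index]
    # Don't break within CRLF.
    if prev == "\r" and cur == "\n":
        return False
    # Otherwise break before and after CR and LF.
    if prev in "\r\n" or cur in "\r\n":
        return True
    # Don't break before nonspacing or enclosing marks (a rough
    # approximation of the "Extend" property, not the real table).
    return unicodedata.category(cur) not in ("Mn", "Me")


for start, cluster in iter_graphemes("e\u0301a\r\nb", toy_boundary):
    print(start, repr(cluster))
```

Passing the predicate as an argument keeps the iteration logic independent of how complete the boundary rules are.
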

    @serhiy-storchaka serhiy-storchaka added the 3.7 (EOL) end of life label Jul 24, 2017
    @serhiy-storchaka
    Member

    bpo-30717 has a patch.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022