unicodedata.itergraphemes / str.itergraphemes / str.graphemes #62606
Comments
On python-ideas I proposed the addition of a way to iterate over the graphemes of a string, either as part of the unicodedata library or as a method on the built-in str type. <http://mail.python.org/pipermail/python-ideas/2013-July/021916.html> I provided a sample implementation, but "MRAB" pointed out that my definition of a grapheme is slightly wrong; it's a little more complex than just "character followed by combiners". <http://mail.python.org/pipermail/python-ideas/2013-July/021917.html> M.-A. Lemburg asked me to open this issue. <http://mail.python.org/pipermail/python-ideas/2013-July/021929.html>
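For illustration, the simplified "character followed by combiners" definition could be sketched roughly as below. This is not the sample implementation from the python-ideas thread; the name `naive_graphemes` is hypothetical, and the sketch assumes that "combiner" means any character for which `unicodedata.combining()` returns a non-zero class.

```python
import unicodedata


def naive_graphemes(text):
    """Yield each base character together with the combining marks after it.

    Hypothetical helper: this implements only the simplified definition,
    not the full Unicode grapheme cluster rules.
    """
    cluster = ""
    for ch in text:
        # unicodedata.combining() returns 0 for non-combining characters,
        # so a zero class starts a new cluster.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster
```

The gap MRAB describes shows up quickly: this sketch splits "\r\n" into two pieces and separates Hangul jamo sequences, both of which the Unicode rules treat as a single grapheme cluster.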
It may be useful to also add the start position of the grapheme to the iterator output. Related to this, please also see this pre-PEP I once wrote for a Unicode indexing module: http://mail.python.org/pipermail/python-dev/2001-July/015938.html
This is basically what the regex module does, written in Python:

def get_grapheme_cluster_break(codepoint):
    """Gets the "Grapheme Cluster Break" property of a codepoint.

    The properties are defined here:

    http://www.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
    """
    # The return value is one of:
    #
    # "Other"
    # "CR"
    # "LF"
    # "Control"
    # "Extend"
    # "Prepend"
    # "Regional_Indicator"
    # "SpacingMark"
    # "L"
    # "V"
    # "T"
    # "LV"
    # "LVT"
    ...


def at_grapheme_boundary(string, index):
    """Checks whether the codepoint at 'index' is on a grapheme boundary.

    The rules are defined here:

    http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
    """
    # Break at the start and end of the text.
    if index <= 0 or index >= len(string):
        return True

    prop = get_grapheme_cluster_break(string[index])
    prop_m1 = get_grapheme_cluster_break(string[index - 1])

    # Don't break within CRLF.
    if prop_m1 == "CR" and prop == "LF":
        return False

    # Otherwise break before and after controls (including CR and LF).
    if prop_m1 in ("Control", "CR", "LF") or prop in ("Control", "CR", "LF"):
        return True

    # Don't break Hangul syllable sequences.
    if prop_m1 == "L" and prop in ("L", "V", "LV", "LVT"):
        return False
    if prop_m1 in ("LV", "V") and prop in ("V", "T"):
        return False
    if prop_m1 in ("LVT", "T") and prop == "T":
        return False

    # Don't break between regional indicator symbols.
    if prop_m1 == "Regional_Indicator" and prop == "Regional_Indicator":
        return False

    # Don't break just before Extend characters.
    if prop == "Extend":
        return False

    # Don't break before SpacingMarks, or after Prepend characters.
    if prop == "SpacingMark":
        return False
    if prop_m1 == "Prepend":
        return False

    # Otherwise, break everywhere.
    return True
bpo-30717 has a patch.