Issue 18406: unicodedata.itergraphemes / str.itergraphemes / str.graphemes



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/62606

classification

Title:	unicodedata.itergraphemes / str.itergraphemes / str.graphemes
Type:	enhancement	Stage:	resolved
Components:	Unicode	Versions:	Python 3.7

process

Status:	closed	Resolution:	duplicate
Dependencies:		Superseder:	Add unicode grapheme cluster break algorithm View: 30717
Assigned To:		Nosy List:	Socob, benjamin.peterson, cvrebert, dpk, ezio.melotti, lemburg, loewis, mrabarnett, serhiy.storchaka
Priority:	normal	Keywords:

Created on 2013-07-08 18:25 by dpk, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (4)
msg192684 - (view)	Author: David P. Kendal (dpk)	Date: 2013-07-08 18:25
On python-ideas I proposed the addition of a way to iterate over the graphemes of a string, either as part of the unicodedata library or as a method on the built-in str type. <http://mail.python.org/pipermail/python-ideas/2013-July/021916.html> I provided a sample implementation, but "MRAB" pointed out that my definition of a grapheme is slightly wrong; it's a little more complex than just "character followed by combiners". <http://mail.python.org/pipermail/python-ideas/2013-July/021917.html> M.-A. Lemburg asked me to open this issue. <http://mail.python.org/pipermail/python-ideas/2013-July/021929.html>
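For illustration, the simplified "character followed by combiners" definition mentioned above could be sketched roughly as follows. This is not the sample implementation from the mailing list post; the name `naive_graphemes` is hypothetical, and the sketch deliberately shares the limitation MRAB pointed out (it knows nothing about CRLF, Hangul syllables, regional indicators, and so on).

```python
import unicodedata


def naive_graphemes(text):
    """Yield each character together with the combining marks that follow it.

    Hypothetical helper: an approximation only, not the Unicode
    grapheme cluster algorithm from UAX #29.
    """
    cluster = ""
    for ch in text:
        # unicodedata.combining() returns 0 for non-combining characters,
        # so a 0 here means a new cluster starts at this character.
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ch
        else:
            cluster += ch
    if cluster:
        yield cluster
```

With this sketch, "e" followed by U+0301 (combining acute accent) comes out as a single item, while plain ASCII text is split into individual characters.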
msg192724 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2013-07-09 07:42
It may be useful to also add the start position of the grapheme to the iterator output. Related to this, please also see this pre-PEP I once wrote for a Unicode indexing module: http://mail.python.org/pipermail/python-dev/2001-July/015938.html
msg192769 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2013-07-09 17:25
This is basically what the regex module does, written in Python:

def get_grapheme_cluster_break(codepoint):
    """Gets the "Grapheme Cluster Break" property of a codepoint.

    The properties are defined here:

    http://www.unicode.org/Public/UNIDATA/auxiliary/GraphemeBreakProperty.txt
    """
    # The return value is one of:
    #
    # "Other"
    # "CR"
    # "LF"
    # "Control"
    # "Extend"
    # "Prepend"
    # "Regional_Indicator"
    # "SpacingMark"
    # "L"
    # "V"
    # "T"
    # "LV"
    # "LVT"
    ...

def at_grapheme_boundary(string, index):
    """Checks whether the codepoint at 'index' is on a grapheme boundary.

    The rules are defined here:

    http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
    """
    # Break at the start and end of the text.
    if index <= 0 or index >= len(string):
        return True

    prop = get_grapheme_cluster_break(string[index])
    prop_m1 = get_grapheme_cluster_break(string[index - 1])

    # Don't break within CRLF.
    if prop_m1 == "CR" and prop == "LF":
        return False

    # Otherwise break before and after controls (including CR and LF).
    if prop_m1 in ("Control", "CR", "LF") or prop in ("Control", "CR", "LF"):
        return True

    # Don't break Hangul syllable sequences.
    if prop_m1 == "L" and prop in ("L", "V", "LV", "LVT"):
        return False
    if prop_m1 in ("LV", "V") and prop in ("V", "T"):
        return False
    if prop_m1 in ("LVT", "T") and prop == "T":
        return False

    # Don't break between regional indicator symbols.
    # (The name must match the "Regional_Indicator" value listed above.)
    if prop_m1 == "Regional_Indicator" and prop == "Regional_Indicator":
        return False

    # Don't break just before Extend characters.
    if prop == "Extend":
        return False

    # Don't break before SpacingMarks, or after Prepend characters.
    if prop == "SpacingMark":
        return False
    if prop_m1 == "Prepend":
        return False

    # Otherwise, break everywhere.
    return True
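A boundary predicate with the same `(string, index)` signature as `at_grapheme_boundary` above could drive an iterator that also reports the start position of each cluster, as suggested earlier in this thread. The following is only a sketch under that assumption: `iter_graphemes` and `combining_boundary` are hypothetical names, and `combining_boundary` is a crude stand-in predicate used so the example runs on its own, not the real TR29 rules.

```python
import unicodedata


def iter_graphemes(string, at_boundary):
    """Yield (start, cluster) pairs for 'string'.

    'at_boundary' is any callable taking (string, index) and returning
    True when a break is allowed before string[index].
    """
    start = 0
    for index in range(1, len(string) + 1):
        # The end of the text is always a boundary; only ask the
        # predicate about positions strictly inside the string.
        if index == len(string) or at_boundary(string, index):
            yield start, string[start:index]
            start = index


def combining_boundary(string, index):
    """Stand-in predicate (assumption): break before any non-combining character."""
    return unicodedata.combining(string[index]) == 0
```

Passing a full implementation of `at_grapheme_boundary` instead of the stand-in would leave the iterator itself unchanged, since the iterator only relies on the predicate's signature.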
msg299697 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-08-03 11:07
Issue30717 has a patch.

History
Date	User	Action	Args
2022-04-11 14:57:47	admin	set	github: 62606
2017-08-03 11:07:26	serhiy.storchaka	set	status: open -> closed superseder: Add unicode grapheme cluster break algorithm messages: + msg299697 resolution: duplicate stage: needs patch -> resolved
2017-07-24 04:21:00	serhiy.storchaka	set	nosy: + serhiy.storchaka versions: + Python 3.7, - Python 3.4, Python 3.5
2017-07-24 02:20:01	Socob	set	nosy: + Socob
2013-07-09 17:25:36	mrabarnett	set	nosy: + mrabarnett messages: + msg192769
2013-07-09 07:42:49	lemburg	set	nosy: + lemburg messages: + msg192724 components: + Unicode
2013-07-08 18:34:13	ezio.melotti	set	nosy: + loewis, benjamin.peterson, ezio.melotti stage: needs patch
2013-07-08 18:32:32	cvrebert	set	nosy: + cvrebert
2013-07-08 18:25:45	dpk	create