Message 359408 - Python tracker


Author	steven.daprano
Recipients	Bert JW Regeer, Guillaume Sanchez, Manishearth, Socob, _savage, benjamin.peterson, bianjp, ezio.melotti, lemburg, loewis, mcepl, methane, mrabarnett, p-ganssle, r.david.murray, scoder, serhiy.storchaka, steven.daprano, terry.reedy, vstinner, xiang.zhang
Date	2020-01-06.08:44:46
Message-id	<20200106084438.GA839@ando.pearwood.info>
In-reply-to	<1578280890.0.0.889384912462.issue30717@roundup.psfhosted.org>

Content

> I think it would be a mistake to make the stdlib use this for most 
> notions of what a "character" is, as I said this notion is also 
> inaccurate. Having an iterator library somewhere that you can use and 
> compose is great, changing the internal workings of string operations 
> would be a major change, and not entirely productive.

Agreed. 

I won't pretend to be able to predict what Python 5.0 will bring *wink* 
but there's too much history around the "code point = character" notion 
for the language to change now.

If the language can expose a grapheme iterator, then people can 
experiment with grapheme-based APIs in libraries.

(By grapheme I mean "extended grapheme cluster", but that's a mouthful. 
Sorry linguists.)

What do you think of these as a set of grapheme primitives?

(1) is_grapheme_break(string, i)

Return True if a grapheme break would occur *before* string[i].

(2) graphemes(string, start=0, end=len(string))

Iterate over graphemes in string[start:end].

(3) graphemes_reversed(string, start=0, end=len(string))

Iterate over graphemes in reverse order.

I *think* is_grapheme_break would be enough for people to implement 
their own versions of graphemes and graphemes_reversed. Here's an 
untested version:

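    def graphemes(string, start=0, end=None):
        """Yield the graphemes of string[start:end] in order."""
        if end is None:
            # a default can't refer to another argument, so resolve it here
            end = len(string)
        cluster = []
        for i in range(start, end):
            c = string[i]
            if is_grapheme_break(string, i):
                if i != start:
                    # don't yield the empty cluster at the start of the slice
                    yield ''.join(cluster)
                cluster = [c]
            else:
                cluster.append(c)
        if cluster:
            # flush the final cluster
            yield ''.join(cluster)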
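
For completeness, here is a rough sketch of how graphemes_reversed might be built on the same primitive. The break test is passed in as a parameter, and toy_is_grapheme_break is purely a hypothetical stand-in so the example runs on its own: it treats only combining marks as non-breaking, which is far short of the real rules.

```python
import unicodedata


def toy_is_grapheme_break(string, i):
    """Hypothetical stand-in, NOT the real rules: break before
    string[i] unless it is a combining mark."""
    return i == 0 or unicodedata.combining(string[i]) == 0


def graphemes_reversed(string, start=0, end=None,
                       is_break=toy_is_grapheme_break):
    """Yield the clusters of string[start:end] from last to first."""
    if end is None:
        end = len(string)
    stop = end  # exclusive end of the cluster being collected
    for i in range(end - 1, start - 1, -1):
        # A cluster begins at a break, or at `start` when the slice
        # happens to begin in the middle of a cluster.
        if i == start or is_break(string, i):
            yield string[i:stop]
            stop = i


# 'e' + COMBINING ACUTE, 'a' + COMBINING GRAVE, then a plain 'b'
sample = "e\u0301a\u0300b"
clusters = list(graphemes_reversed(sample))
```

Walking backwards and slicing avoids building a list per cluster, but the forward version could be written the same way; it's a matter of taste.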

Regarding is_grapheme_break, if I understand the note here:

https://www.unicode.org/reports/tr29/#Testing

one never needs to look at more than two adjacent code points to tell 
whether or not a grapheme break will occur between them, so this ought 
to be pretty efficient. At worst, it needs to look at string[i-1] and 
string[i], if they exist.
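
To make the "look at string[i-1] and string[i]" idea concrete, here is a deliberately simplified sketch. It is an assumption-laden approximation, not the TR29 algorithm: it only handles start of text, CR LF, and a crude stand-in for the Extend/SpacingMark classes based on the general categories Mn, Me and Mc. As far as I can tell, some rules in the full specification (emoji ZWJ sequences, regional-indicator pairs) depend on more context than the adjacent pair, and this sketch ignores them entirely.

```python
import unicodedata

# Rough stand-in for the Extend and SpacingMark classes (an assumption;
# the real classes come from the Grapheme_Cluster_Break property).
_MARK_CATEGORIES = ("Mn", "Me", "Mc")


def is_grapheme_break(string, i):
    """Return True if a break would occur *before* string[i].

    Simplified sketch: inspects only string[i-1] and string[i].
    """
    if i == 0:
        return True  # always break at the start of text
    prev, cur = string[i - 1], string[i]
    if prev == "\r" and cur == "\n":
        return False  # never split CR LF
    if prev in "\r\n" or cur in "\r\n":
        return True  # otherwise break around CR and LF
    if cur == "\u200d" or unicodedata.category(cur) in _MARK_CATEGORIES:
        return False  # don't break before ZWJ or a combining mark
    return True  # default: break everywhere else
```

A real implementation would look up the Grapheme_Cluster_Break property for each code point, which the unicodedata module does not currently expose.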

History
Date	User	Action	Args
2020-01-06 08:44:47	steven.daprano	set	recipients: + steven.daprano, lemburg, loewis, terry.reedy, scoder, vstinner, benjamin.peterson, mcepl, ezio.melotti, mrabarnett, r.david.murray, methane, serhiy.storchaka, _savage, xiang.zhang, p-ganssle, Socob, Guillaume Sanchez, Bert JW Regeer, bianjp, Manishearth
2020-01-06 08:44:47	steven.daprano	link	issue30717 messages
2020-01-06 08:44:46	steven.daprano	create