Message 296503 - Python tracker


Author	steven.daprano
Recipients	Guillaume Sanchez, steven.daprano
Date	2017-06-21.00:06:48
Message-id	<1498003610.36.0.176389885423.issue30717@psf.upfronthosting.co.za>
In-reply-to

Content

I don't think "graphemes" is the right term here. Graphemes are language dependent; for instance, "ǆ" may be considered a grapheme in Croatian.

https://en.wikipedia.org/wiki/D%C5%BE
http://www.unicode.org/glossary/#grapheme

I believe you are referring to combining characters:

http://www.unicode.org/faq/char_combmark.html

It is unfortunate that Python's string methods are naive about combining characters and just count code points, but I'm not sure what the alternative is. For example, a human reader may be surprised that these give two different results:

py> len("naïve")
5
py> len("naïve")
6

I'm not sure if the effect will survive copying and pasting, but the first string uses 

U+00EF LATIN SMALL LETTER I WITH DIAERESIS

while the second uses:

U+0069 LATIN SMALL LETTER I + U+0308 COMBINING DIAERESIS
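
In case the two spellings did not survive copying and pasting, here is a small sketch that writes them with explicit escapes instead. The variable names `composed` and `decomposed` are only for illustration. `unicodedata.normalize` converts between the two forms, though composing only helps when a precomposed code point exists for the combination.

```python
import unicodedata

# Precomposed: U+00EF LATIN SMALL LETTER I WITH DIAERESIS
composed = "na\u00efve"
# Decomposed: U+0069 LATIN SMALL LETTER I + U+0308 COMBINING DIAERESIS
decomposed = "nai\u0308ve"

print(len(composed), len(decomposed))   # 5 6
print(composed == decomposed)           # False

# NFC composes where a precomposed code point exists; NFD decomposes.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```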

And check out this surprising result:

py> "xïoz"[::-1]
'zöix'
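
To see why the diaeresis moves, it may help to list the code points of the reversed string. This sketch assumes the original string uses the decomposed ï (i followed by U+0308), written here with an escape so it does not depend on copy and paste.

```python
import unicodedata

s = "xi\u0308oz"        # "xïoz" with a decomposed ï
reversed_s = s[::-1]    # reverses code points, not visible characters

for c in reversed_s:
    print(f"U+{ord(c):04X} {unicodedata.name(c)}")
# U+007A LATIN SMALL LETTER Z
# U+006F LATIN SMALL LETTER O
# U+0308 COMBINING DIAERESIS
# U+0069 LATIN SMALL LETTER I
# U+0078 LATIN SMALL LETTER X
```

After the reversal, the combining mark follows the "o" rather than the "i", so it is rendered on the "o".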


It seems to me that it would be great if Python was fully aware of combining characters, it's not so great if it is naïve, but it would be simply terrible if only a few methods were aware and the rest naïve.

I don't have a good solution to this, but perhaps an iterator over (base character + combining marks) would be a good first step. Something like this?

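import unicodedata

def chars(string):
    """Yield lists of [base character, combining marks...] from string."""
    accum = []
    for c in string:
        cat = unicodedata.category(c)
        if cat == 'Mn':
            # Nonspacing mark: attach it to the current base character.
            accum.append(c)
        else:
            # A new base character: emit the previous group, if any.
            if accum:
                yield accum
                accum = []
            accum.append(c)
    # Emit the final group.
    if accum:
        yield accum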
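
As a rough illustration of how such an iterator might be used, here is a variant that yields strings and a reversal built on top of it. The names `clusters` and `reverse_clusters` are hypothetical, and treating every category starting with "M" (Mn, Mc, Me) as a combining mark is my assumption, wider than the 'Mn' check above. It is a sketch, not a full grapheme cluster segmentation.

```python
import unicodedata

def clusters(string):
    """Yield each base character joined with its trailing combining marks."""
    accum = ""
    for c in string:
        # Assumption: any mark category (Mn, Mc, Me) attaches to the previous base.
        if unicodedata.category(c).startswith("M") and accum:
            accum += c
        else:
            if accum:
                yield accum
            accum = c
    if accum:
        yield accum

def reverse_clusters(string):
    """Reverse a string while keeping marks attached to their base character."""
    return "".join(reversed(list(clusters(string))))

s = "xi\u0308oz"                          # "xïoz" with a decomposed ï
print(list(clusters(s)))                  # four groups: x, i + U+0308, o, z
print(reverse_clusters(s) == "zoi\u0308x")            # True
print(len(list(clusters("nai\u0308ve"))))             # 5
```

With this, the diaeresis stays on the "i" after reversal, and counting clusters gives 5 for both spellings of "naïve".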

History
Date	User	Action	Args
2017-06-21 00:06:50	steven.daprano	set	recipients: + steven.daprano, Guillaume Sanchez
2017-06-21 00:06:50	steven.daprano	set	messageid: <1498003610.36.0.176389885423.issue30717@psf.upfronthosting.co.za>
2017-06-21 00:06:50	steven.daprano	link	issue30717 messages
2017-06-21 00:06:48	steven.daprano	create