classification
Title: str.center() is not unicode aware
Type: enhancement
Stage: needs patch
Components: Interpreter Core
Versions: Python 3.7
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: Guillaume Sanchez, haypo, r.david.murray, serhiy.storchaka, steven.daprano, terry.reedy
Priority: normal
Keywords:

Created on 2017-06-20 19:15 by Guillaume Sanchez, last changed 2017-07-15 11:06 by christian.heimes.

Pull Requests
URL Status Linked Edit
PR 2673 open python-dev, 2017-07-11 23:14
Messages (11)
msg296478 - (view) Author: Guillaume Sanchez (Guillaume Sanchez) * Date: 2017-06-20 19:15
"a⃑".center(5, ".")
produces
'..a⃑.' instead of '..a⃑..'

The reason is that "a⃑" is composed of two code points (2 UCS4 chars): one 'a' and one combining code point "above arrow". str.center() takes the length of the string and pads it on both sides with `fillchar` until the length reaches `width`. However, that length is surely intended to be the number of characters, not the number of code points.
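
To make the mismatch concrete, here is a minimal sketch. It assumes the mark is U+20D7 COMBINING RIGHT ARROW ABOVE (the exact code point in the report may differ, but any nonspacing mark behaves the same way) and passes the arguments positionally.

```python
import unicodedata

# Assumption: U+20D7 stands in for the combining arrow from the report.
s = "a\u20d7"

# len() counts code points, so the single visible character has length 2.
code_points = len(s)

# Category 'Mn' identifies the second code point as a nonspacing mark.
categories = [unicodedata.category(c) for c in s]

# str.center() pads up to 5 code points, so only 3 fill characters are added
# even though 4 would be needed to reach 5 visible characters.
centered = s.center(5, ".")
fill_count = centered.count(".")

print(code_points, categories, fill_count)
```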

The correct way to count characters is to use the grapheme clustering algorithm from UAX TR29.

As it turns out, I have already implemented this myself and could open a PR if asked, with a little help on the C <-> Python glue.

Thanks for your time.
msg296479 - (view) Author: Guillaume Sanchez (Guillaume Sanchez) * Date: 2017-06-20 19:27
Obviously, I'm talking about str.center() here, but every function that relies on a count of graphemes is incorrect in the same way.

I can fix that and add the corresponding function, or an iterator over graphemes, or whatever seems right :)
msg296503 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-06-21 00:06
I don't think graphemes is the right term here. Graphemes are language dependent, for instance "dž" may be considered a grapheme in Croatian.

https://en.wikipedia.org/wiki/D%C5%BE
http://www.unicode.org/glossary/#grapheme

I believe you are referring to combining characters:

http://www.unicode.org/faq/char_combmark.html

It is unfortunate that Python's string methods are naive about combining characters, and just count code points, but I'm not sure what the alternative is. For example the human reader may be surprised that these give two different results:

py> len("naïve")
5
py> len("naïve")
6

I'm not sure if the effect will survive copying and pasting, but the first string uses 

U+00EF LATIN SMALL LETTER I WITH DIAERESIS

while the second uses:

U+0069 LATIN SMALL LETTER I + U+0308 COMBINING DIAERESIS
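
Since copy and paste may not preserve the difference, the two spellings can be written with escapes. The sketch below also shows that unicodedata.normalize() maps one form to the other. That only helps when a precomposed character exists; the "a" plus combining arrow from the original report (U+20D7 is assumed here) has none and stays two code points.

```python
import unicodedata

composed = "na\u00efve"     # U+00EF LATIN SMALL LETTER I WITH DIAERESIS
decomposed = "nai\u0308ve"  # U+0069 followed by U+0308 COMBINING DIAERESIS

# Same rendering, different code point counts.
lengths = (len(composed), len(decomposed))

# NFC composes base + mark where possible; NFD splits it back apart.
nfc_equal = unicodedata.normalize("NFC", decomposed) == composed
nfd_equal = unicodedata.normalize("NFD", composed) == decomposed

# No precomposed "a with arrow above" exists, so NFC leaves two code points.
arrow_length = len(unicodedata.normalize("NFC", "a\u20d7"))

print(lengths, nfc_equal, nfd_equal, arrow_length)
```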

And check out this surprising result:

py> "xïoz"[::-1]
'zöix'


It seems to me that it would be great if Python was fully aware of combining characters, it's not so great if it is naïve, but it would be simply terrible if only a few methods were aware and the rest naïve.

I don't have a good solution to this, but perhaps an iterator over (base character + combining marks) would be a good first step. Something like this?

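import unicodedata

def chars(string):
    """Yield each base character together with its trailing combining marks."""
    accum = []
    for c in string:
        cat = unicodedata.category(c)
        if cat == 'Mn':
            # Nonspacing mark: attach it to the current base character.
            accum.append(c)
        else:
            # A new base character starts: flush the previous group first.
            if accum:
                yield ''.join(accum)
                accum = []
            accum.append(c)
    # Flush the final group.
    if accum:
        yield ''.join(accum)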
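
As a rough illustration of what such an iterator would enable, the sketch below builds a cluster-aware center and reverse on top of it. The names clusters, center_clusters and reverse_clusters are hypothetical, the grouping rule (any category starting with 'M') is an assumption that is cruder than the Unicode rules, and the rounding for odd padding may not match str.center().

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    current = ""
    for ch in text:
        # Assumption: categories Mn, Mc and Me all attach to the previous
        # character. This is an approximation, not the UAX #29 algorithm.
        if current and unicodedata.category(ch).startswith("M"):
            current += ch
        else:
            if current:
                yield current
            current = ch
    if current:
        yield current

def center_clusters(text, width, fillchar=" "):
    """Center text, measuring its length in clusters, not code points."""
    pad = width - sum(1 for _ in clusters(text))
    if pad <= 0:
        return text
    left = pad // 2  # any odd leftover goes to the right-hand side
    return fillchar * left + text + fillchar * (pad - left)

def reverse_clusters(text):
    """Reverse text while keeping marks attached to their base character."""
    return "".join(reversed(list(clusters(text))))

print(center_clusters("a\u20d7", 5, "."))
print(reverse_clusters("xi\u0308oz"))
```

With this approach the diaeresis stays on the "i" after reversal instead of moving to the "o".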
msg296504 - (view) Author: Guillaume Sanchez (Guillaume Sanchez) * Date: 2017-06-21 00:55
Thanks for all those interesting cases you brought here! I didn't think of that at all!

I'm using the word "grapheme" as per the definition given in UAX TR29, which is *not* language/locale dependent [1].

This annex is very specific and precise about where to break "grapheme clusters", i.e. where a character starts and ends. Sadly, it's a bit more complex than just accumulating based on the Combining property. The annex gives a set of rules to implement, based on the Grapheme_Cluster_Break property. Those rules may naively be implemented by comparing adjacent pairs of code points, but that is wrong; they can be implemented correctly and efficiently as an automaton. My code [2] passes all tests from GraphemeBreakTest.txt (provided by Unicode).
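
One case where looking at a single adjacent pair is not enough is a run of regional indicator symbols (flag emoji): whether to break between two of them depends on how many came before. The toy sketch below handles only that one rule and treats every other code point as its own cluster; split_flags is a hypothetical name and this is not the code from [2].

```python
def is_regional_indicator(ch):
    # Regional indicator symbols occupy U+1F1E6..U+1F1FF.
    return 0x1F1E6 <= ord(ch) <= 0x1F1FF

def split_flags(text):
    """Pair up regional indicators; all other code points stand alone."""
    result = []
    run_length = 0  # state: regional indicators seen in the current run
    for ch in text:
        if is_regional_indicator(ch):
            if run_length % 2 == 1:
                # Odd count so far: this one completes the open pair.
                result[-1] += ch
            else:
                # Even count so far: this one starts a new pair.
                result.append(ch)
            run_length += 1
        else:
            result.append(ch)
            run_length = 0
    return result

FR = "\U0001F1EB\U0001F1F7"  # regional indicators F + R
DE = "\U0001F1E9\U0001F1EA"  # regional indicators D + E
print(split_flags(FR + DE))
```

The pair (R, D) in the middle looks exactly like (F, R) to a pairwise comparison, yet only the first is kept together. The run_length counter is the minimal piece of state that tells them apart.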

We can definitely do a generator like you propose, or rather do it in the C layer to gain more efficiency and coherence since the other string / Unicode operations are in the C layer (upper, lower, casefold, etc)

Let me know what you guys think, what (and if) I should contribute :)

[1] http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries
[2] https://github.com/Vermeille/batriz/blob/master/src/str/grapheme_iterator.h#L31
msg296505 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-06-21 01:34
http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundaries

talks about *grapheme clusters*, not "graphemes" alone, and it seems clear to me that they are language dependent. For example, it says:

The Unicode Standard provides default algorithms for determining grapheme cluster boundaries, with two variants: legacy grapheme clusters and extended grapheme clusters. The most appropriate variant depends on the language and operation involved. ... These algorithms can be adapted to produce tailored grapheme clusters for specific locales...


Nevertheless, even just a basic API to either the *legacy grapheme cluster* or the *extended grapheme cluster* algorithms would be a good start.

Can I suggest that the unicodedata module might be the right place for it?

And thank you for volunteering to do the work on this!
msg297488 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2017-07-01 17:48
See also issue 12568.
msg298190 - (view) Author: Guillaume Sanchez (Guillaume Sanchez) * Date: 2017-07-11 23:43
Hello to all of you, sorry for the delay. Been busy.

I added the base code needed to build the grapheme cluster break algorithm. We now have the GraphemeBreakProperty available via unicodedata.grapheme_cluster_break()

Can you check that the implementation correctly fits the design? I was not sure about adding that prop to unicodedata_db or unicodectype_db, tbh.

If it's all correct, I'll move forward with the automaton and the grapheme cluster breaking algorithm.

Thanks!
msg298321 - (view) Author: Guillaume Sanchez (Guillaume Sanchez) * Date: 2017-07-13 23:42
Hello,

I implemented unicodedata.break_graphemes(), which returns an iterator that yields consecutive graphemes.

This is a "test" implementation meant to see what doesn't fit Python's style and design, and to discuss naming and implementation details.

https://github.com/python/cpython/pull/2673

Thanks for your time and interest
msg298325 - (view) Author: Steven D'Aprano (steven.daprano) * (Python committer) Date: 2017-07-14 00:36
Thank you, but I cannot review your C code.

Can you start by telling us what the two functions:

unicodedata.grapheme_cluster_break()
unicodedata.break_graphemes()

take as arguments, and what they return? If we were to call 
help(function), what would we see?
msg298326 - (view) Author: Guillaume Sanchez (Guillaume Sanchez) * Date: 2017-07-14 00:43
Hello Steven!

Thanks for the quick reply!

unicodedata.grapheme_cluster_break() takes a unicode code point as an argument and returns its GraphemeBreakProperty as a string. Possible values are listed here: http://www.unicode.org/reports/tr29/#CR

help(unicodedata.grapheme_cluster_break) says:
grapheme_cluster_break(chr, /)
    Returns the GraphemeBreakProperty assigned to the character chr as string.
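
For readers without the patch applied, here is a rough pure-Python stand-in for the kind of value such a lookup returns. approx_grapheme_cluster_break is a hypothetical name; it derives a few property values from general categories, which is only an approximation of the real Grapheme_Cluster_Break data, and it is not the function from the PR.

```python
import unicodedata

def approx_grapheme_cluster_break(ch):
    """Approximate the Grapheme_Cluster_Break value of one code point."""
    if ch == "\r":
        return "CR"
    if ch == "\n":
        return "LF"
    if ch == "\u200d":
        return "ZWJ"
    if 0x1F1E6 <= ord(ch) <= 0x1F1FF:
        return "Regional_Indicator"
    category = unicodedata.category(ch)
    if category == "Cc":
        return "Control"
    # Assumption: Mn/Me map to Extend and Mc to SpacingMark. The real
    # property has exceptions, and format characters (Cf) and Hangul
    # syllable types are not handled here at all.
    if category in ("Mn", "Me"):
        return "Extend"
    if category == "Mc":
        return "SpacingMark"
    return "Other"

for sample in ("\r", "\u0308", "a"):
    print(repr(sample), approx_grapheme_cluster_break(sample))
```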

====

unicodedata.break_graphemes() takes a unicode string as an argument and returns a GraphemeClusterIterator that yields consecutive grapheme clusters.

help(unicodedata.break_graphemes) says:

break_graphemes(unistr, /)
    Returns an iterator to iterate over grapheme clusters in unistr.
    
    It uses extended grapheme cluster rules from TR29.


Is there anything else you would like to know? Don't hesitate to ask :)

Thank you for your time!
msg298336 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2017-07-14 04:25
I think it at least plausible that we should add implementations of some of the unicode standard's algorithms.  Victor and Serhiy, as two of the active core devs most involved with unicode issues, what do you think?
History
Date User Action Args
2017-07-15 11:06:10 christian.heimes set assignee: christian.heimes ->
    components: + Interpreter Core, - Tests, Tkinter, SSL
    nosy: - christian.heimes
2017-07-14 04:25:14 terry.reedy set nosy: + terry.reedy, haypo, serhiy.storchaka
    messages: + msg298336
2017-07-14 00:43:36 Guillaume Sanchez set nosy: + christian.heimes
    messages: + msg298326
    assignee: christian.heimes
    components: + Tests, Tkinter, SSL, - Library (Lib)
2017-07-14 00:36:45 steven.daprano set messages: + msg298325
2017-07-13 23:42:08 Guillaume Sanchez set messages: + msg298321
2017-07-11 23:43:00 Guillaume Sanchez set messages: + msg298190
2017-07-11 23:14:06 python-dev set pull_requests: + pull_request2741
2017-07-01 17:48:23 r.david.murray set nosy: + r.david.murray
    messages: + msg297488
2017-06-21 03:53:00 Mariatta set type: enhancement
    stage: needs patch
2017-06-21 01:34:08 steven.daprano set messages: + msg296505
2017-06-21 00:55:41 Guillaume Sanchez set messages: + msg296504
2017-06-21 00:06:50 steven.daprano set nosy: + steven.daprano
    messages: + msg296503
2017-06-20 19:27:45 Guillaume Sanchez set messages: + msg296479
2017-06-20 19:15:22 Guillaume Sanchez create