Message 359397 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	Manishearth
Recipients	Bert JW Regeer, Guillaume Sanchez, Manishearth, Socob, _savage, benjamin.peterson, bianjp, ezio.melotti, lemburg, loewis, mcepl, methane, mrabarnett, p-ganssle, r.david.murray, scoder, serhiy.storchaka, steven.daprano, terry.reedy, vstinner, xiang.zhang
Date	2020-01-06.03:21:29
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1578280890.0.0.889384912462.issue30717@roundup.psfhosted.org>
In-reply-to

Content
Hi, Unicodey person here, I'm involved in Unicode itself and also maintain an implementation of this particular spec[1]. So, firstly, > "a⃑".center(width=5, fillchar=".") If you're trying to do terminal width stuff, extended grapheme clusters will not solve the problem for you. There is no algorithm specified in Unicode that does this, because this is font dependent. Extended grapheme clusters are better than code points for this, however, and will not ever produce worse results. It's fine to expose this, but it's worth adding caveats. Also, yes, please do not expose a direct indexing function. Aside from almost all Unicode algorithms being streaming algorithms and thus inefficient to index directly, needing to directly index a grapheme cluster is almost always a sign that you are making a mistake. > Yes. I clearly don't want this PR to be interpreted as "we're needing ICU". ICU provides much much more than what I'm willing to provide. My goal here is just to "fix" how the python's standard library iterates over characters. Noting more, nothing less. I think it would be a mistake to make the stdlib use this for most notions of what a "character" is, as I said this notion is also inaccurate. Having an iterator library somewhere that you can use and compose is great, changing the internal workings of string operations would be a major change, and not entirely productive. There's only one language I can think of that uses extended grapheme clusters as its default notion of "character": Swift. Swift is largely designed for UI stuff, and it makes sense in this context. This is also baked in very deeply to the language (e.g. their Character class is a thin wrapper around String, since grapheme clusters can be arbitrarily large). You'd need a pretty major paradigm shift for python to make a similar change, and it doesn't make as much sense for python in the first place. Starting off with a library published to pypi makes sense to me. [1]: https://github.com/unicode-rs/unicode-segmentation/

Hi,

Unicodey person here, I'm involved in Unicode itself and also maintain an implementation of this particular spec[1].

So, firstly,

> "a⃑".center(width=5, fillchar=".")

If you're trying to do terminal width stuff, extended grapheme clusters *will not* solve the problem for you. There is no algorithm specified in Unicode that does this, because this is font dependent. Extended grapheme clusters are better than code points for this, however, and will not ever produce *worse* results.

It's fine to expose this, but it's worth adding caveats.

Also, yes, please do not expose a direct indexing function. Aside from almost all Unicode algorithms being streaming algorithms and thus inefficient to index directly, needing to directly index a grapheme cluster is almost always a sign that you are making a mistake.

> Yes. I clearly don't want this PR to be interpreted as "we're needing ICU". ICU provides much much more than what I'm willing to provide. My goal here is just to "fix" how the python's standard library iterates over characters. Noting more, nothing less.

I think it would be a mistake to make the stdlib use this for most notions of what a "character" is, as I said this notion is also inaccurate. Having an iterator library somewhere that you can use and compose is great, changing the internal workings of string operations would be a major change, and not entirely productive.

There's only one language I can think of that uses extended grapheme clusters as its default notion of "character": Swift. Swift is largely designed for UI stuff, and it makes sense in this context. This is also baked in very deeply to the language (e.g. their Character class is a thin wrapper around String, since grapheme clusters can be arbitrarily large). You'd need a pretty major paradigm shift for python to make a similar change, and it doesn't make as much sense for python in the first place.

Starting off with a library published to pypi makes sense to me.

[1]: https://github.com/unicode-rs/unicode-segmentation/

History
Date	User	Action	Args
2020-01-06 03:21:30	Manishearth	set	recipients: + Manishearth, lemburg, loewis, terry.reedy, scoder, vstinner, benjamin.peterson, mcepl, ezio.melotti, mrabarnett, steven.daprano, r.david.murray, methane, serhiy.storchaka, _savage, xiang.zhang, p-ganssle, Socob, Guillaume Sanchez, Bert JW Regeer, bianjp
2020-01-06 03:21:30	Manishearth	set	messageid: <1578280890.0.0.889384912462.issue30717@roundup.psfhosted.org>
2020-01-06 03:21:29	Manishearth	link	issue30717 messages
2020-01-06 03:21:29	Manishearth	create