classification
Title: request full Unicode collation support in std python library
Type: enhancement Stage:
Components: Library (Lib) Versions:
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Ahmad Azizi, Arfrever, belopolsky, daniel.urban, eric.araujo, ezio.melotti, gvanrossum, mcepl, mrabarnett, rhettinger, tchrist, wumpus
Priority: normal Keywords:

Created on 2011-08-11 20:18 by tchrist, last changed 2022-04-11 14:57 by admin.

Messages (11)
msg141926 - Author: Tom Christiansen (tchrist) Date: 2011-08-11 20:18
Python has no standard support for the Unicode Collation Algorithm (UCA) as specified in UTS #10.  This is a request that a UCA library be added to the standard Python distribution.

Collation underlies virtually everything we do with text, not just sorting but any sort of comparison. Furthermore, the UCA is tailorable for locales in a portable way that does not require dodgy vendor support. It is a very important step in making Python suitable for full Unicode text processing.
msg143035 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-26 21:06
Sounds like a fair feature request for Python 3.3, as long as the intention is that users must import some module from the standard library and use functions defined in that module.  The operations and methods defined for str instances (e.g. ==, <, etc.) should not change their behavior.

Is there an existing 3rd party library that we could adopt (even if it isn't perfect yet)?
msg143044 - Author: Tom Christiansen (tchrist) Date: 2011-08-26 21:51
> Sounds like a fair feature request for Python 3.3, as long as the
> intention is that users must import some module from the standard
> library and use functions defined in that module.  The operations and
> methods defined for str instances (e.g. ==, <, etc.) should not change
> their behavior.

> Is there an existing 3rd party library that we could adopt (even if it isn't perfect yet)?

I *think* you could use ICU's.  

I'm pretty sure the Parrot people use ICU libraries.

--tom
msg143045 - Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2011-08-26 21:55
I know I sound like NIH, but I'm always reluctant to add a big 3rd party lib like ICU to the permanent dependencies of all future Python distros.  If people want to use ICU they already can.  OTOH I don't have a better idea. :-(
msg143047 - Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2011-08-26 22:00
I would like to be involved in the design of the API for a UCA module and its routines for loading Unicode Collation Element Tables (not making the mistake of using global state like the locale module does).
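
For readers unfamiliar with the global-state problem mentioned here, the following is a small sketch using only the standard locale module. It assumes nothing beyond the "C" locale, which is always available. The point is that the collation setting is process-wide: setlocale() changes it for every caller, so code that needs a particular collation has to save and restore it.

```python
import locale

# Query the current collation locale without changing it.
previous = locale.setlocale(locale.LC_COLLATE)

# Changing the collation locale affects the whole process, not one object.
locale.setlocale(locale.LC_COLLATE, "C")
try:
    current = locale.setlocale(locale.LC_COLLATE)
    # locale.strxfrm() consults the process-wide setting when building keys.
    words = sorted(["pear", "Apple", "fig"], key=locale.strxfrm)
finally:
    # Restore the earlier setting so other code is not surprised.
    locale.setlocale(locale.LC_COLLATE, previous)
```

Any other thread or library that sorts with locale.strxfrm() between the two setlocale() calls would see the temporary setting, which is the hazard a per-object design avoids.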
msg143048 - Author: Tom Christiansen (tchrist) Date: 2011-08-26 22:21
Raymond Hettinger <raymond.hettinger@gmail.com> added the comment:

> I would like to be involved in the design of the API for a UCA module
> and its routines for loading Unicode Collation Element Tables (not
> making the mistake of using global state like the locale module does).

Is this the problem where a locale is global to a process (or thread)?

The way I'm used to using the UCA module in Perl, that's never a problem,
because it's completely object-oriented.  There's no global state.  You 
instantiate a collator object with all the state it needs, like

    collation_level
    upper_before_lower
    backwards_levels
    normalization
    override_CJK
    override_Hangul
    katakana_before_hiragana
    variable
    locale
    preprocess

And then you use that object for all your collation needs, including
not just sorting but also string comparison and even searches.

For example, you could instantiate a first collator object with its level
set to one, meaning just compare base alphanumerics not diacritics or case
or nonletters, and a second with the defaults so that it uses all four
levels or a different normalization.  I have on occasion had more than one
collator object around at once each with its own locale, like if I want to
compare different locales' comparisons.

--tom
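
To make the object-oriented design described above concrete, here is a minimal sketch of a collator that carries all of its own state. It is not the UCA and not the API of Perl's module: the class name ToyCollator, its level and upper_before_lower parameters, and the three-level key are assumptions made for illustration, approximated with the standard unicodedata module rather than real collation element tables.

```python
import unicodedata


class ToyCollator:
    """Hypothetical collator: every setting lives on the instance."""

    def __init__(self, level=3, upper_before_lower=False):
        if level not in (1, 2, 3):
            raise ValueError("level must be 1, 2 or 3")
        self.level = level
        self.upper_before_lower = upper_before_lower

    def sort_key(self, text):
        # Decompose so that accents become separate combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        base = [c for c in decomposed if not unicodedata.combining(c)]
        marks = [c for c in decomposed if unicodedata.combining(c)]

        # Level 1: base characters only, ignoring case and accents.
        key = [tuple(c.casefold() for c in base)]
        if self.level >= 2:
            # Level 2: accents break ties between equal base strings.
            key.append(tuple(marks))
        if self.level >= 3:
            # Level 3: case breaks the remaining ties.
            upper_weight = 0 if self.upper_before_lower else 1
            key.append(tuple(
                upper_weight if c.isupper() else 1 - upper_weight
                for c in base
            ))
        return tuple(key)

    def compare(self, a, b):
        # Return -1, 0 or 1, like a classic comparison function.
        ka, kb = self.sort_key(a), self.sort_key(b)
        return (ka > kb) - (ka < kb)


# Two independent collators can coexist in the same process.
loose = ToyCollator(level=1)
strict = ToyCollator(level=3)
```

With these two objects, loose.compare("résumé", "Resume") reports equality while strict.compare() does not, and sorted(words, key=strict.sort_key) can be used wherever a sort key is accepted.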
msg143049 - Author: Tom Christiansen (tchrist) Date: 2011-08-26 22:28
I should probably mention the importance in the design of a UCA module of
being able to specify which UCA version number you want it to behave like
in case you plan to override some of the DUCET entries.  That way if you
run under a later UCA with different DUCET weights, your own tailorings will
still make sense.  If you don't do this, your collation tailorings can break 
in a new release of the UCA.

--tom
msg143050 - Author: Tom Christiansen (tchrist) Date: 2011-08-26 22:31
Guido van Rossum <report@bugs.python.org> wrote
   on Fri, 26 Aug 2011 21:55:03 -0000: 

> I know I sound like NIH, but I'm always reluctant to add a big 3rd
> party lib like ICU to the permanent dependencies of all future Python
> distros.  If people want to use ICU they already can.  OTOH I don't
> have a better idea. :-(

I know exactly what you mean.  I would not want to push that on anyone,
being dependent on a gigantic 3rd-party module.  I just tried to answer
the question.  The only two full UCA implementations I know of are ICU's
and Perl's, which does not use ICU (since we're UTF-8, etc).

I just wish Python had Unicode collation, is all.

--tom

PS: (I haven't had good luck with the ICU bindings in 3.2.)
msg365433 - Author: Ahmad Azizi (Ahmad Azizi) Date: 2020-03-31 22:50
Note that sorting standard Persian (Farsi) characters does not work properly with the current implementation of Python 3.x.
As a result, Python is probably unable to sort properly in some other languages as well.

Here is the correct order of the alphabet in Persian:

 "آ",    "ا",    "ب",    "پ",    "ت",    "ث",    "ج",    "چ",    "ح",    "خ",    "د",    "ذ",    "ص",    "ض",    "ط",    "ظ",    "ع",    "غ",    "ف",    "ق",    "ک",    "گ",    "ك",    "ل",    "م",    "ن",    "و",    "ه",    "ی",    "ي",

After sorting using sorted():

آ, ا, ب, ت, ث, ج, ح, خ, د, ذ, ص, ض, ط, ظ, ع, غ, ف, ق, ك, ل, م, ن, ه, و, ي, پ, چ, ک, گ, ی,
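
The behaviour reported above can be reproduced with a small sketch. It uses a subset of the letters from the post, written as escapes to avoid look-alike characters, and a hand-built rank table. The names PERSIAN_ORDER, RANK and persian_key are hypothetical, and a rank table is only a stand-in for real collation, which would normally come from a collation library.

```python
# A subset of the alphabet order given in the post above.
PERSIAN_ORDER = [
    "\u0628",  # beh
    "\u067E",  # peh
    "\u062A",  # teh
    "\u062C",  # jeem
    "\u0686",  # tcheh
    "\u06A9",  # keheh (Persian kaf)
    "\u06AF",  # gaf
    "\u0646",  # noon
    "\u0648",  # waw
    "\u0647",  # heh
    "\u06CC",  # Farsi yeh
]
RANK = {ch: i for i, ch in enumerate(PERSIAN_ORDER)}


def persian_key(word):
    # Characters missing from the table sort after known ones, by code point.
    return [(RANK.get(ch, len(RANK)), ch) for ch in word]


letters = PERSIAN_ORDER[::-1]          # start from a scrambled order

by_code_point = sorted(letters)        # what sorted() does by default
by_alphabet = sorted(letters, key=persian_key)
```

Here by_alphabet reproduces PERSIAN_ORDER, while by_code_point pushes peh, tcheh, keheh, gaf and Farsi yeh to the end because their code points are higher than those of the letters shared with Arabic.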
msg365480 - Author: Matej Cepl (mcepl) * Date: 2020-04-01 15:06
Isn’t this done by the system? It feels like barking up the wrong tree.
msg365501 - Author: Ahmad Azizi (Ahmad Azizi) Date: 2020-04-01 18:21
No, this is not an OS-dependent issue. Python does not use Unicode collation for sorting; it compares strings by code point.
History
Date                 User          Action  Args
2022-04-11 14:57:20  admin         set     github: 56944
2020-04-01 18:21:46  Ahmad Azizi   set     messages: + msg365501; versions: - Python 3.4
2020-04-01 15:06:16  mcepl         set     messages: + msg365480
2020-03-31 22:50:18  Ahmad Azizi   set     nosy: + Ahmad Azizi; messages: + msg365433
2018-04-13 16:41:53  mcepl         set     nosy: + mcepl
2018-01-22 06:44:45  wumpus        set     nosy: + wumpus
2013-07-10 19:10:35  terry.reedy   set     versions: + Python 3.4, - Python 3.3
2011-08-26 22:31:43  tchrist       set     messages: + msg143050
2011-08-26 22:28:43  tchrist       set     messages: + msg143049
2011-08-26 22:21:08  tchrist       set     messages: + msg143048
2011-08-26 22:00:41  rhettinger    set     nosy: + rhettinger; messages: + msg143047
2011-08-26 21:55:02  gvanrossum    set     messages: + msg143045
2011-08-26 21:51:47  tchrist       set     messages: + msg143044
2011-08-26 21:06:48  gvanrossum    set     nosy: + gvanrossum; messages: + msg143035
2011-08-13 00:58:04  mrabarnett    set     nosy: + mrabarnett
2011-08-12 18:06:31  eric.araujo   set     nosy: + eric.araujo; versions: + Python 3.3, - Python 3.2
2011-08-12 18:05:23  Arfrever      set     nosy: + Arfrever
2011-08-12 16:43:11  daniel.urban  set     nosy: + daniel.urban
2011-08-12 00:18:00  ezio.melotti  set     nosy: + belopolsky, ezio.melotti
2011-08-11 20:18:16  tchrist       create