Issue 12735: request full Unicode collation support in std python library

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/56944

classification

Title:	request full Unicode collation support in std python library
Type:	enhancement	Stage:
Components:	Library (Lib)	Versions:

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Ahmad Azizi, Arfrever, belopolsky, daniel.urban, eric.araujo, ezio.melotti, gvanrossum, mcepl, mrabarnett, rhettinger, tchrist, wumpus
Priority:	normal	Keywords:

Created on 2011-08-11 20:18 by tchrist, last changed 2022-04-11 14:57 by admin.

Messages (11)
msg141926 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-11 20:18
Python has no standard support for the Unicode Collation Algorithm (UCA) as specified in UTS #10. This is a request that a UCA library be added to the standard Python distribution. Collation underlies virtually everything we do with text: not just sorting, but any sort of comparison. Furthermore, the UCA is tailorable for locales in a portable way that does not require dodgy vendor support. Adding it would be a very important step in making Python suitable for full Unicode text processing.
msg143035 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2011-08-26 21:06
Sounds like a fair feature request for Python 3.3, as long as the intention is that users must import some module from the standard library and use functions defined in that module. The operations and methods defined for str instances (e.g. ==, <, etc.) should not change their behavior. Is there an existing 3rd party library that we could adopt (even if it isn't perfect yet)?
msg143044 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-26 21:51
> Sounds like a fair feature request for Python 3.3, as long as the
> intention is that users must import some module from the standard
> library and use functions defined in that module. The operations and
> methods defined for str instances (e.g. ==, <, etc.) should not change
> their behavior.

> Is there an existing 3rd party library that we could adopt (even if it isn't perfect yet)?

I think you could use ICU's. I'm pretty sure the Parrot people use ICU libraries.

--tom
msg143045 - (view)	Author: Guido van Rossum (gvanrossum) *	Date: 2011-08-26 21:55
I know I sound like NIH, but I'm always reluctant to add a big 3rd party lib like ICU to the permanent dependencies of all future Python distros. If people want to use ICU they already can. OTOH I don't have a better idea. :-(
msg143047 - (view)	Author: Raymond Hettinger (rhettinger) *	Date: 2011-08-26 22:00
I would like to be involved in the design of the API for a UCA module and its routines for loading Unicode Collation Element Tables (not making the mistake of using global state like the locale module does).
msg143048 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-26 22:21
Raymond Hettinger <raymond.hettinger@gmail.com> added the comment:

> I would like to be involved in the design of the API for a UCA module
> and its routines for loading Unicode Collation Element Tables (not
> making the mistake of using global state like the locale module does).

Is this the problem where a locale is global to a process (or thread)?

The way I'm used to using the UCA module in Perl, that's never a problem, because it's completely object-oriented. There's no global state. You instantiate a collator object with all the state it needs, like

    collation_level
    upper_before_lower
    backwards_levels
    normalization
    override_CJK
    override_Hangul
    katakana_before_hiragana
    variable
    locale
    preprocess

And then you use that object for all your collation needs, including not just sorting but also string comparison and even searches.

For example, you could instantiate a first collator object with its level set to one, meaning just compare base alphanumerics not diacritics or case or nonletters, and a second with the defaults so that it uses all four levels or a different normalization.

I have on occasion had more than one collator object around at once each with its own locale, like if I want to compare different locales' comparisons.

--tom
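The object-oriented design described in this message can be illustrated with a small sketch. `ToyCollator` and its `level` parameter are hypothetical names invented for this example; the class only approximates a "level one" comparison by stripping combining marks and case with `unicodedata`, and it is not an implementation of the UCA or of Perl's collation module. The point is only that each collator carries its own settings, so several can coexist without touching process-wide state.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyCollator:
    """Hypothetical stand-in for a collator object; NOT a UCA implementation.

    level=1 compares base characters only (combining marks and case ignored).
    Any other level compares the NFD-normalized text by code point.
    """

    level: int = 3

    def sort_key(self, text: str) -> str:
        decomposed = unicodedata.normalize("NFD", text)
        if self.level == 1:
            # Drop combining marks, then ignore case distinctions.
            base = "".join(
                ch for ch in decomposed if not unicodedata.combining(ch)
            )
            return base.casefold()
        return decomposed

    def compare(self, a: str, b: str) -> int:
        """Return -1, 0, or 1, in the style of a classic cmp() function."""
        ka, kb = self.sort_key(a), self.sort_key(b)
        return (ka > kb) - (ka < kb)


# Two independent collators, each holding its own configuration.
primary = ToyCollator(level=1)
default = ToyCollator()

words = ["zebra", "Éclair", "apple"]
primary_sorted = sorted(words, key=primary.sort_key)
default_sorted = sorted(words, key=default.sort_key)
```

Because the settings live on the instance, `primary` and `default` can be used side by side for sorting and for direct comparison, which is the property the message contrasts with a process-global locale.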
msg143049 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-26 22:28
I should probably mention the importance in the design of a UCA module of being able to specify which UCA version number you want it to behave like in case you plan to override some of the DUCET entries. That way if you run under a later UCA with different DUCET weights, your own tailorings will still make sense. If you don't do this, your collation tailorings can break in a new release of the UCA. --tom
msg143050 - (view)	Author: Tom Christiansen (tchrist)	Date: 2011-08-26 22:31
Guido van Rossum <report@bugs.python.org> wrote on Fri, 26 Aug 2011 21:55:03 -0000:

> I know I sound like NIH, but I'm always reluctant to add a big 3rd
> party lib like ICU to the permanent dependencies of all future Python
> distros. If people want to use ICU they already can. OTOH I don't
> have a better idea. :-(

I know exactly what you mean. I would not want to push that on anyone, being dependent on a gigantic 3rd-party module. I just tried to answer the question.

The only two full UCA implementations I know of are ICU's and Perl's, which does not use ICU (since we're UTF-8, etc).

I just wish Python had Unicode collation, is all.

--tom

PS: (I haven't had good luck with the ICU bindings in 3.2.)
msg365433 - (view)	Author: Ahmad Azizi (Ahmad Azizi)	Date: 2020-03-31 22:50
Remember, sorting standard Persian (Farsi) characters does not work properly with the current implementation of Python 3.x. As a result, Python is probably unable to sort properly in some other languages as well.

Here is the correct order of the alphabet in Persian:
"آ", "ا", "ب", "پ", "ت", "ث", "ج", "چ", "ح", "خ", "د", "ذ", "ص", "ض", "ط", "ظ", "ع", "غ", "ف", "ق", "ک", "گ", "ك", "ل", "م", "ن", "و", "ه", "ی", "ي",

After sorting using sorted():
آ, ا, ب, ت, ث, ج, ح, خ, د, ذ, ص, ض, ط, ظ, ع, غ, ف, ق, ك, ل, م, ن, ه, و, ي, پ, چ, ک, گ, ی,
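The behaviour reported in this message can be reproduced with a short sketch, and one possible workaround is an explicit rank table used as a sort key. `PERSIAN_ORDER` copies the letters exactly as quoted in the message (it is not presented here as a complete alphabet), and `persian_key` is a hypothetical helper written for this example, not a standard-library function or a substitute for real collation.

```python
# Letters in the order quoted in the message above.
PERSIAN_ORDER = [
    "آ", "ا", "ب", "پ", "ت", "ث", "ج", "چ", "ح", "خ",
    "د", "ذ", "ص", "ض", "ط", "ظ", "ع", "غ", "ف", "ق",
    "ک", "گ", "ك", "ل", "م", "ن", "و", "ه", "ی", "ي",
]

# Map each letter to its position in the desired order.
RANK = {ch: i for i, ch in enumerate(PERSIAN_ORDER)}


def persian_key(word):
    """Hypothetical sort key: rank listed letters by the table above.

    Characters missing from the table sort after all listed letters,
    ordered among themselves by code point.
    """
    return [(RANK.get(ch, len(RANK)), ord(ch)) for ch in word]


# Default sorting compares code points, so letters encoded later in the
# Arabic block (such as "پ" and "چ") end up at the end.
by_code_point = sorted(PERSIAN_ORDER)

# Sorting with the explicit key restores the quoted order.
by_alphabet = sorted(PERSIAN_ORDER, key=persian_key)
```

A rank table like this only handles single letters from one alphabet; it does not cover the tailoring, normalization, or multi-level comparison that the UCA defines.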
msg365480 - (view)	Author: Matej Cepl (mcepl) *	Date: 2020-04-01 15:06
Isn’t this done by the system? It feels like barking up the wrong tree.
msg365501 - (view)	Author: Ahmad Azizi (Ahmad Azizi)	Date: 2020-04-01 18:21
No, this is not an OS-dependent issue. Python does not use Unicode collation for sorting; it compares strings by code point.
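As a small check of the point made in this exchange, the sketch below compares three orderings of a few letters from the earlier message: the default `sorted()`, an explicit sort by `ord()`, and a sort by UTF-8 bytes. The variable names are chosen for this example only. All three agree, because UTF-8 byte order preserves code point order, and none of them consults the operating system's locale.

```python
import unicodedata

# A few letters from the earlier message, written as escapes for clarity.
letters = ["\u067e", "\u064a", "\u0628", "\u06cc"]  # peh, Arabic yeh, beh, Farsi yeh

# Show where each letter sits in the Unicode code space.
for ch in letters:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# Default str ordering.
default_order = sorted(letters)

# Explicit ordering by code point.
code_point_order = sorted(letters, key=ord)

# Ordering by UTF-8 encoded bytes.
utf8_order = sorted(letters, key=lambda s: s.encode("utf-8"))
```

The standard library does offer `locale.strxfrm` and `locale.strcoll` for locale-aware ordering, but their results depend on the process-wide locale setting and on which locales the system provides.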

History
Date	User	Action	Args
2022-04-11 14:57:20	admin	set	github: 56944
2020-04-01 18:21:46	Ahmad Azizi	set	messages: + msg365501 versions: - Python 3.4
2020-04-01 15:06:16	mcepl	set	messages: + msg365480
2020-03-31 22:50:18	Ahmad Azizi	set	nosy: + Ahmad Azizi messages: + msg365433
2018-04-13 16:41:53	mcepl	set	nosy: + mcepl
2018-01-22 06:44:45	wumpus	set	nosy: + wumpus
2013-07-10 19:10:35	terry.reedy	set	versions: + Python 3.4, - Python 3.3
2011-08-26 22:31:43	tchrist	set	messages: + msg143050
2011-08-26 22:28:43	tchrist	set	messages: + msg143049
2011-08-26 22:21:08	tchrist	set	messages: + msg143048
2011-08-26 22:00:41	rhettinger	set	nosy: + rhettinger messages: + msg143047
2011-08-26 21:55:02	gvanrossum	set	messages: + msg143045
2011-08-26 21:51:47	tchrist	set	messages: + msg143044
2011-08-26 21:06:48	gvanrossum	set	nosy: + gvanrossum messages: + msg143035
2011-08-13 00:58:04	mrabarnett	set	nosy: + mrabarnett
2011-08-12 18:06:31	eric.araujo	set	nosy: + eric.araujo versions: + Python 3.3, - Python 3.2
2011-08-12 18:05:23	Arfrever	set	nosy: + Arfrever
2011-08-12 16:43:11	daniel.urban	set	nosy: + daniel.urban
2011-08-12 00:18:00	ezio.melotti	set	nosy: + belopolsky, ezio.melotti
2011-08-11 20:18:16	tchrist	create