Issue23195
Created on 2015-01-08 20:30 by pnugues, last changed 2015-01-08 22:37 by vstinner.
Messages (4) | |||
---|---|---|---|
msg233685 - (view) | Author: Pierre Nugues (pnugues) | Date: 2015-01-08 20:30 | |
Sorting with sorted() and key=locale.strxfrm does not work properly on Mac OS X. Here is a script to reproduce the issue:

    import locale
    locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
    a = ["A", "E", "Z", "a", "e", "é", "z"]
    sorted(a)
    sorted(a, key=locale.strxfrm)

The execution on Mac OS X produces:

    pierre:Flaubert pierre$ sw_vers -productVersion
    10.10.1
    pierre:Flaubert pierre$ python3
    Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22)
    [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
    'fr_FR.UTF-8'
    >>> a = ["A", "E", "Z", "a", "e", "é", "z"]
    >>> sorted(a)
    ['A', 'E', 'Z', 'a', 'e', 'z', 'é']
    >>> sorted(a, key=locale.strxfrm)
    ['A', 'E', 'Z', 'a', 'e', 'z', 'é']

while it produces this on your interactive shell (python.org):

    In [10]: import locale
    In [11]: locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
    Out[11]: 'fr_FR.UTF-8'
    In [12]: a = ["A", "E", "Z", "a", "e", "é", "z"]
    In [13]: sorted(a)
    Out[13]: ['A', 'E', 'Z', 'a', 'e', 'z', 'é']
    In [14]: sorted(a, key=locale.strxfrm)
    Out[14]: ['a', 'A', 'e', 'E', 'é', 'z', 'Z']

which is correct.
|||
msg233687 - (view) | Author: STINNER Victor (vstinner) | Date: 2015-01-08 21:27 | |
locale.strxfrm() has a different implementation in Python 2 and in Python 3:

- Python 2 uses strxfrm(), so it works on byte strings
- Python 3 uses wcsxfrm(), so it works on wide-character strings ("unicode" strings)

It looks like Python 2 and 3 have the same behaviour on Mac OS X: the list is not sorted as expected.

Test on Mac OS X 10.9.2.

    Imac-Photo:~ haypo$ cat collate2.py
    #coding:utf8
    import locale, random
    locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
    print("LC_COLLATE = %s" % locale.setlocale(locale.LC_COLLATE, None))
    a = ["A", "E", "Z", "\xc9", "a", "e", "\xe9", "z"]
    random.shuffle(a)
    print(sorted(a))
    print(sorted(a, key=locale.strxfrm))

    Imac-Photo:~ haypo$ cat collate3.py
    #coding:utf8
    import locale, random
    locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
    print("LC_COLLATE = %s" % locale.setlocale(locale.LC_COLLATE, None))
    a = ["A", "E", "Z", "\xc9", "a", "e", "\xe9", "z"]
    random.shuffle(a)
    print(ascii(sorted(a)))
    print(ascii(sorted(a, key=locale.strxfrm)))

    Imac-Photo:~ haypo$ LC_ALL=fr_FR.utf8 python collate2.py
    LC_COLLATE = fr_FR.UTF-8
    ['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
    ['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']

    Imac-Photo:~ haypo$ LC_ALL=fr_FR.utf8 ~/prog/python/default/python.exe ~/collate3.py
    LC_COLLATE = fr_FR.UTF-8
    ['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
    ['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']

On Linux, I get the expected order with Python 3:

    LC_COLLATE = fr_FR.UTF-8
    ['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
    ['a', 'A', 'e', 'E', '\xe9', '\xc9', 'z', 'Z']

On Linux, Python 2 gives me a strange order. It may be an issue in my program:

    haypo@selma$ python x.py
    LC_COLLATE = fr_FR.UTF-8
    ['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
    ['\xe9', '\xc9', 'a', 'A', 'e', 'E', 'z', 'Z']
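The platform difference reported above can be probed from Python itself. The following is a minimal sketch, not part of the original report: the helper name `collation_is_locale_aware` is hypothetical, and it assumes that the pair "é"/"z" is a sufficient probe for whether the system's collation table for the locale is more than plain code-point order. It returns None when the locale is not installed.

```python
import locale


def collation_is_locale_aware(locale_name="fr_FR.UTF-8"):
    """Return True if strxfrm sorts 'é' before 'z' under locale_name,
    False if it falls back to code-point order, None if the locale
    cannot be set on this system. (Hypothetical helper.)"""
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            return None
        # Code-point order puts 'z' (U+007A) before 'é' (U+00E9).
        return sorted(["z", "é"], key=locale.strxfrm) == ["é", "z"]
    finally:
        # The locale is process-wide, so always restore it.
        locale.setlocale(locale.LC_COLLATE, saved)


print(collation_is_locale_aware())
```

Going by the outputs in this report, such a probe would be expected to return False on the OS X systems tested and True on the Linux system, but that depends on the locale data installed.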
|||
msg233690 - (view) | Author: Ned Deily (ned.deily) | Date: 2015-01-08 22:26 | |
The initial difference appears to be a long-standing BSD (including OS X) versus GNU/Linux platform difference. See, for example:

http://www.postgresql.org/message-id/18C8A481-33A6-4483-8C24-B8CE70DB7F27@eggerapps.at

Why there is no difference between en and fr UTF-8 is obvious when you look under the covers at the system locale definitions. This is on FreeBSD 10; OS X 10.10 is the same:

    $ cd /usr/share/locale/fr_FR.UTF-8/
    $ ls -l
    total 8
    lrwxr-xr-x 1 root wheel  28 Jan 16 2014 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
    lrwxr-xr-x 1 root wheel  17 Jan 16 2014 LC_CTYPE -> ../UTF-8/LC_CTYPE
    lrwxr-xr-x 1 root wheel  30 Jan 16 2014 LC_MESSAGES -> ../fr_FR.ISO8859-1/LC_MESSAGES
    -r--r--r-- 1 root wheel  36 Jan 16 2014 LC_MONETARY
    lrwxr-xr-x 1 root wheel  29 Jan 16 2014 LC_NUMERIC -> ../fr_FR.ISO8859-1/LC_NUMERIC
    -r--r--r-- 1 root wheel 364 Jan 16 2014 LC_TIME

For some reason US-ASCII is used for UTF-8 collation; this is also true for en_US.UTF-8 and de_DE.UTF-8, the only other ones I checked.

The PostgreSQL discussion and some earlier Python issues suggest using ICU to properly implement Unicode functions like collation across all platforms. But that has never been implemented in Python. Adding Marc-Andre to the nosy list.
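The symlink check shown in the listing can also be done from Python. This is a minimal sketch under stated assumptions: the helper name `collate_source` is hypothetical, and the default root `/usr/share/locale` with one `LC_COLLATE` entry per locale directory is taken from the FreeBSD/OS X listing above. Systems that store locale data differently (glibc, for example) will simply yield None.

```python
import os


def collate_source(locale_name, root="/usr/share/locale"):
    """Return the resolved path of the LC_COLLATE definition for
    locale_name, or None if no such entry exists under root.
    (Hypothetical helper; layout assumed from the listing above.)"""
    path = os.path.join(root, locale_name, "LC_COLLATE")
    if not os.path.lexists(path):
        return None
    # realpath() follows symlinks such as ../la_LN.US-ASCII/LC_COLLATE
    return os.path.realpath(path)


for name in ("fr_FR.UTF-8", "en_US.UTF-8", "de_DE.UTF-8"):
    print(name, "->", collate_source(name))
```

If several UTF-8 locales resolve to the same file, they necessarily share one collation order.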
|||
msg233691 - (view) | Author: STINNER Victor (vstinner) | Date: 2015-01-08 22:37 | |
> The PostgreSQL discussion and some earlier Python issues suggest using ICU to properly implement Unicode functions like collation across all platforms.

In my experience, the locale module is error-prone and not reliable, especially if you want portability. It just uses functions provided by the OS. And the locales (LC_CTYPE, LC_MESSAGES, etc.) are process-wide, which becomes a major issue if you want to serve different clients using different locales... Windows supports a different locale per thread, if I remember correctly.

It would be more reliable to use a good library like ICU. You may try:
https://pypi.python.org/pypi/PyICU

Link showing how to use PyICU to sort a Python sequence:
https://stackoverflow.com/questions/11121636/sorting-list-of-string-with-specific-locale-in-python
=> strings.sort(key=lambda x: collator[loc].getCollationKey(x).getByteArray())
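A self-contained variant of the PyICU suggestion might look like the sketch below. Assumptions: PyICU is a third-party package and may not be installed, so the import is guarded; `sort_for_locale` and `fallback_key` are hypothetical names. The fallback is a crude approximation built on `unicodedata` (base letter, then accents, then lowercase before uppercase). It is not the Unicode Collation Algorithm and is included only so the example runs without ICU.

```python
import unicodedata

try:
    import icu  # PyICU (third-party); assumed installed for the ICU path
except ImportError:
    icu = None


def fallback_key(s):
    """Crude stand-in for a collation key; NOT a real collation algorithm."""
    decomposed = unicodedata.normalize("NFD", s)
    # Primary level: base letters only, case-insensitive.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Secondary level: accents; tertiary level: lowercase before uppercase.
    return (base.casefold(), decomposed.casefold(), s.swapcase())


def sort_for_locale(strings, locale_name="fr_FR"):
    """Sort with ICU collation when PyICU is available, else approximate."""
    if icu is not None:
        collator = icu.Collator.createInstance(icu.Locale(locale_name))
        return sorted(strings, key=collator.getSortKey)
    return sorted(strings, key=fallback_key)


words = ["A", "E", "Z", "\xc9", "a", "e", "\xe9", "z"]
print(sort_for_locale(words))
```

Unlike locale.setlocale(), a collator object is not process-wide state, so different requests can use different locales side by side.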
History | |||
---|---|---|---|
Date | User | Action | Args |
2015-01-08 22:37:48 | vstinner | set | messages: + msg233691 |
2015-01-08 22:27:21 | ned.deily | set | title: Sorting with locale (strxfrm) does not work properly with Python3 on Macos -> Sorting with locale (strxfrm) does not work properly with Python3 on BSD or OS X |
2015-01-08 22:26:41 | ned.deily | set | nosy: + lemburg messages: + msg233690 |
2015-01-08 21:48:54 | r.david.murray | set | nosy: + r.david.murray |
2015-01-08 21:48:27 | r.david.murray | link | issue23196 superseder |
2015-01-08 21:46:17 | r.david.murray | set | title: Sorting with locale does not work properly with Python3 on Macos -> Sorting with locale (strxfrm) does not work properly with Python3 on Macos |
2015-01-08 21:27:27 | vstinner | set | messages: + msg233687 |
2015-01-08 20:33:56 | ned.deily | set | nosy: + ned.deily |
2015-01-08 20:30:56 | pnugues | create |