classification
Title: Sorting with locale (strxfrm) does not work properly with Python3 on BSD or OS X
Type: behavior Stage:
Components: Unicode Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, lemburg, ned.deily, pnugues, r.david.murray, vstinner
Priority: normal Keywords:

Created on 2015-01-08 20:30 by pnugues, last changed 2015-01-08 22:37 by vstinner.

Messages (4)
msg233685 - Author: Pierre Nugues (pnugues) Date: 2015-01-08 20:30
Sorting with sorted() and key=locale.strxfrm does not give the expected locale-aware order on Mac OS X.
Here is a script to reproduce the issue:

import locale
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
a = ["A", "E", "Z", "a", "e", "é", "z"]
sorted(a)
sorted(a, key=locale.strxfrm)


Running it on Mac OS X produces:
pierre:Flaubert pierre$ sw_vers -productVersion
10.10.1
pierre:Flaubert pierre$ python3
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct  5 2014, 20:42:22) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
'fr_FR.UTF-8'
>>> a = ["A", "E", "Z", "a", "e", "é", "z"]
>>> sorted(a)
['A', 'E', 'Z', 'a', 'e', 'z', 'é']
>>> sorted(a, key=locale.strxfrm)
['A', 'E', 'Z', 'a', 'e', 'z', 'é']


while the interactive shell on python.org produces:
In [10]: import locale
In [11]: locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
Out[11]: 'fr_FR.UTF-8'
In [12]: a = ["A", "E", "Z", "a", "e", "é", "z"]
In [13]: sorted(a)
Out[13]: ['A', 'E', 'Z', 'a', 'e', 'z', 'é']
In [14]: sorted(a, key=locale.strxfrm)
Out[14]: ['a', 'A', 'e', 'E', 'é', 'z', 'Z']

which is correct.
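
A rough way to check whether a given system shows this behaviour is to compare the strxfrm ordering with plain code-point ordering. This is only a sketch: the helper names (try_setlocale, strxfrm_is_codepoint_order) are made up for illustration, and identical orderings are a hint rather than proof, since the "C" locale legitimately collates by code point.

```python
import locale

# Probe list taken from the report above.
PROBE = ["A", "E", "Z", "a", "e", "é", "z"]


def try_setlocale(name, category=locale.LC_COLLATE):
    """Set the locale and return its name, or None if it is unavailable."""
    try:
        return locale.setlocale(category, name)
    except locale.Error:
        return None


def strxfrm_is_codepoint_order(words=PROBE):
    """Return True if sorting by locale.strxfrm equals plain str ordering."""
    return sorted(words, key=locale.strxfrm) == sorted(words)
```

After a successful try_setlocale("fr_FR.UTF-8"), a True result from strxfrm_is_codepoint_order() would suggest that the platform's collation data for that locale does not reorder accented or mixed-case letters.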
msg233687 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-08 21:27
locale.strxfrm() has a different implementation in Python 2 and in Python 3:
- Python 2 uses strxfrm(), so it works on byte strings
- Python 3 uses wcsxfrm(), so it works on wide-character strings ("unicode" strings)
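
Whichever C function is used underneath, the contract is the same: comparing two transformed keys should order the strings the same way as locale.strcoll() does on the originals. A small sketch to check that on a given platform (the helper name sort_both_ways is hypothetical):

```python
import functools
import locale


def sort_both_ways(words):
    """Sort with strxfrm keys and with strcoll comparisons.

    Returns a (by_key, by_cmp) pair; the two lists are expected to be equal
    when the platform's transform and compare functions agree.
    """
    by_key = sorted(words, key=locale.strxfrm)
    by_cmp = sorted(words, key=functools.cmp_to_key(locale.strcoll))
    return by_key, by_cmp
```

If the two lists differ after setting a locale, the problem is more likely in the transform function than in the locale's collation data.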

It looks like Python 2 and 3 have the same behaviour on Mac OS X: the list is not sorted as expected. Tested on Mac OS X 10.9.2.

Imac-Photo:~ haypo$ cat collate2.py 
#coding:utf8
import locale, random
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
print("LC_COLLATE = %s" % locale.setlocale(locale.LC_COLLATE, None))
a = ["A", "E", "Z", "\xc9", "a", "e", "\xe9", "z"]
random.shuffle(a)
print(sorted(a))
print(sorted(a, key=locale.strxfrm))

Imac-Photo:~ haypo$ cat collate3.py 
#coding:utf8
import locale, random
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
print("LC_COLLATE = %s" % locale.setlocale(locale.LC_COLLATE, None))
a = ["A", "E", "Z", "\xc9", "a", "e", "\xe9", "z"]
random.shuffle(a)
print(ascii(sorted(a)))
print(ascii(sorted(a, key=locale.strxfrm)))

Imac-Photo:~ haypo$ LC_ALL=fr_FR.utf8 python collate2.py 
LC_COLLATE = fr_FR.UTF-8
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']

Imac-Photo:~ haypo$ LC_ALL=fr_FR.utf8 ~/prog/python/default/python.exe ~/collate3.py 
LC_COLLATE = fr_FR.UTF-8
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']

On Linux, I get the expected order with Python 3:

LC_COLLATE = fr_FR.UTF-8
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
['a', 'A', 'e', 'E', '\xe9', '\xc9', 'z', 'Z']

On Linux, Python 2 gives me a strange order. It may be an issue in my program:

haypo@selma$ python x.py 
LC_COLLATE = fr_FR.UTF-8
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
['\xe9', '\xc9', 'a', 'A', 'e', 'E', 'z', 'Z']
msg233690 - Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-01-08 22:26
The initial difference appears to be a long-standing BSD (including OS X) versus GNU/Linux platform difference.  See, for example:
http://www.postgresql.org/message-id/18C8A481-33A6-4483-8C24-B8CE70DB7F27@eggerapps.at

Why there is no difference between en and fr UTF-8 is obvious when you look under the covers at the system locale definitions.  This is on FreeBSD 10; OS X 10.10 is the same:

$ cd /usr/share/locale/fr_FR.UTF-8/
$ ls -l
total 8
lrwxr-xr-x  1 root  wheel   28 Jan 16  2014 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x  1 root  wheel   17 Jan 16  2014 LC_CTYPE -> ../UTF-8/LC_CTYPE
lrwxr-xr-x  1 root  wheel   30 Jan 16  2014 LC_MESSAGES -> ../fr_FR.ISO8859-1/LC_MESSAGES
-r--r--r--  1 root  wheel   36 Jan 16  2014 LC_MONETARY
lrwxr-xr-x  1 root  wheel   29 Jan 16  2014 LC_NUMERIC -> ../fr_FR.ISO8859-1/LC_NUMERIC
-r--r--r--  1 root  wheel  364 Jan 16  2014 LC_TIME

For some reason US-ASCII is used for UTF-8 collation; this is also true for en_US.UTF-8 and de_DE.UTF-8, the only other ones I checked.
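
The same check can be scripted. This is a minimal sketch that assumes the directory layout shown in the listing above (one directory per locale under /usr/share/locale, each with an LC_COLLATE entry); the function name collate_source is made up here, and on systems that store locales differently it simply returns None.

```python
import os


def collate_source(locale_name, root="/usr/share/locale"):
    """Return the resolved path of a locale's LC_COLLATE file.

    Symlinks are followed, so a result ending in
    la_LN.US-ASCII/LC_COLLATE means the locale reuses the ASCII
    collation table. Returns None if the entry does not exist.
    """
    path = os.path.join(root, locale_name, "LC_COLLATE")
    if not os.path.lexists(path):
        return None
    return os.path.realpath(path)
```

Calling collate_source("fr_FR.UTF-8") and collate_source("en_US.UTF-8") and comparing the results shows whether two locales share one collation table.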

The PostgreSQL discussion and some earlier Python issues suggest using ICU to properly implement Unicode functions like collation across all platforms.  But that has never been implemented in Python.  Adding Marc-Andre to the nosy list.
msg233691 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-08 22:37
> The PostgreSQL discussion and some earlier Python issues suggest using ICU to properly implement Unicode functions like collation across all platforms.

In my experience, the locale module is error-prone and unreliable, especially if you want portability. It just uses functions provided by the OS. And the locale settings (LC_CTYPE, LC_MESSAGES, etc.) are process-wide, which becomes a major issue if you want to serve different clients using different locales... Windows supports a different locale per thread, if I remember correctly.
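
To illustrate the process-wide problem: a common workaround is to switch the locale only for the duration of one operation and restore it afterwards. The sketch below uses hypothetical helper names (temporary_locale, sorted_in_locale). It is not thread-safe, because the setting still applies to the whole process while the block runs.

```python
import contextlib
import locale


@contextlib.contextmanager
def temporary_locale(name, category=locale.LC_COLLATE):
    """Switch one locale category, then restore the previous setting."""
    saved = locale.setlocale(category)  # no second argument: query only
    try:
        yield locale.setlocale(category, name)
    finally:
        locale.setlocale(category, saved)


def sorted_in_locale(words, name):
    """Sort words using the collation rules of the named locale."""
    with temporary_locale(name):
        return sorted(words, key=locale.strxfrm)
```

locale.setlocale() raises locale.Error if the requested locale is not installed, and the previous setting is still restored in that case.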

It would be more reliable to use a good library like ICU. You may try:
https://pypi.python.org/pypi/PyICU

Link showing how to use PyICU to sort a Python sequence:
https://stackoverflow.com/questions/11121636/sorting-list-of-string-with-specific-locale-in-python
=> strings.sort(key=lambda x: collator[loc].getCollationKey(x).getByteArray())
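
A slightly fuller sketch of the same idea, assuming PyICU is installed and importable as icu. It uses Collator.getSortKey() rather than the getCollationKey() call from the linked answer. The helper names make_sort_key and locale_sorted are made up for this example, and the fallback to locale.strxfrm is only there so that the snippet still runs without PyICU; it keeps the platform limitations discussed above.

```python
import locale

try:
    import icu  # provided by the PyICU package
except ImportError:
    icu = None


def make_sort_key(locale_name="fr_FR"):
    """Return a key function for sorted() or list.sort()."""
    if icu is not None:
        collator = icu.Collator.createInstance(icu.Locale(locale_name))
        return collator.getSortKey  # maps a str to a comparable byte key
    # Fallback: OS collation, which is unreliable on BSD and OS X.
    return locale.strxfrm


def locale_sorted(words, locale_name="fr_FR"):
    """Return words sorted with the collation rules for locale_name."""
    return sorted(words, key=make_sort_key(locale_name))
```

Since the collator is created per call and no process-wide setting is changed, this approach also avoids the setlocale() problem for programs that serve several locales at once.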
History
Date User Action Args
2015-01-08 22:37:48 vstinner set messages: + msg233691
2015-01-08 22:27:21 ned.deily set title: Sorting with locale (strxfrm) does not work properly with Python3 on Macos -> Sorting with locale (strxfrm) does not work properly with Python3 on BSD or OS X
2015-01-08 22:26:41 ned.deily set nosy: + lemburg; messages: + msg233690
2015-01-08 21:48:54 r.david.murray set nosy: + r.david.murray
2015-01-08 21:48:27 r.david.murray link issue23196 superseder
2015-01-08 21:46:17 r.david.murray set title: Sorting with locale does not work properly with Python3 on Macos -> Sorting with locale (strxfrm) does not work properly with Python3 on Macos
2015-01-08 21:27:27 vstinner set messages: + msg233687
2015-01-08 20:33:56 ned.deily set nosy: + ned.deily
2015-01-08 20:30:56 pnugues create