classification
Title: locale.strxfrm can't handle non-ascii strings
Type: behavior
Stage:
Components:
Versions: Python 3.0
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: christian.heimes, filip, loewis
Priority: normal
Keywords:

Created on 2007-12-13 21:41 by filip, last changed 2008-03-08 10:55 by loewis. This issue is now closed.

Files
File name Uploaded Description
strxfrm-unicode.diff filip, 2007-12-13 21:41
Messages (5)
msg58592 - Author: Filip Salomonsson (filip) Date: 2007-12-13 21:41
locale.strxfrm currently does not handle non-ascii strings:

$ ./python
Python 3.0a2 (py3k:59482, Dec 13 2007, 21:27:14) 
[GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "en_US.utf8")
'en_US.utf8'
>>> locale.strxfrm("a")
'\x0c\x01\x08\x01\x02'
>>> locale.strxfrm("\N{LATIN SMALL LETTER A WITH DIAERESIS}")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: strxfrm() argument 1 must be string without null bytes, not str

The attached patch tries to fix this:

$ ./python
Python 3.0a2 (py3k:59482M, Dec 13 2007, 21:58:09) 
[GCC 4.1.2 20070626 (Red Hat 4.1.2-14)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_COLLATE, "en_US.utf8")
'en_US.utf8'
>>> locale.strxfrm("a")
'.\x01\x10\x01\x02'
>>> locale.strxfrm("\N{LATIN SMALL LETTER A WITH DIAERESIS}")
'.\x01\x19\x01\x02'
>>> alist = list("aboåäöABOÅÄÖñÑ")
>>> sorted(alist, cmp=locale.strcoll) == sorted(alist, key=locale.strxfrm)
True


The patch does not include what's needed to define HAVE_WCSXFRM, since I
really don't know how to do that properly (I edited 'configure' and
'pyconfig.h.in' manually to compile it).
msg58596 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2007-12-13 22:18
locale.strxfrm needs to be removed in Python 3, probably along with the
entire locale module. We can't support it anymore.
msg58599 - Author: Christian Heimes (christian.heimes) (Python committer) Date: 2007-12-14 00:29
What's wrong with the locale module?
msg58615 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2007-12-14 06:49
It operates on char*, not Unicode strings.
msg63396 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2008-03-08 10:55
I found a way to fix this, using wchar_t functions. Fixed in r61307.
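
For anyone re-running the check from the original report on a released Python 3: sorted() no longer accepts cmp=, so the strcoll side has to go through functools.cmp_to_key. The following is only a sketch of that comparison, not the code from r61307. The helper name sort_both_ways and the default "C" locale are assumptions made here so the example runs where "en_US.utf8" is not installed; pass the locale from the report if you have it.

```python
import functools
import locale


def sort_both_ways(items, locale_name="C"):
    """Sort items once with a strxfrm key and once with strcoll comparisons.

    sort_both_ways and the "C" default are illustrative assumptions; the
    report used "en_US.utf8", which may not be installed on every system.
    """
    # Calling setlocale with only a category queries the current setting.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        by_key = sorted(items, key=locale.strxfrm)
        by_cmp = sorted(items, key=functools.cmp_to_key(locale.strcoll))
    finally:
        # Restore the collation locale that was active before the call.
        locale.setlocale(locale.LC_COLLATE, previous)
    return by_key, by_cmp


alist = list("aboåäöABOÅÄÖñÑ")
by_key, by_cmp = sort_both_ways(alist)
print(by_key == by_cmp)
```

The two orderings should agree whenever the platform's strxfrm and strcoll are consistent with each other. The actual order depends on the locale chosen.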
History
Date User Action Args
2008-03-08 10:55:18 loewis set status: open -> closed; resolution: fixed; messages: + msg63396
2007-12-14 06:49:53 loewis set messages: + msg58615
2007-12-14 00:29:50 christian.heimes set nosy: + christian.heimes; messages: + msg58599
2007-12-13 22:18:57 loewis set nosy: + loewis; messages: + msg58596
2007-12-13 21:41:48 filip create