classification
Title: test_locale.TestEnUSCollation failures on Solaris 10
Type: behavior Stage:
Components: Extension Modules Versions: Python 3.7, Python 3.6, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: jcea, loewis, petriborg, serhiy.storchaka, trent, vstinner
Priority: normal Keywords: 3.3regression

Created on 2012-10-17 02:19 by trent, last changed 2017-06-20 14:54 by serhiy.storchaka.

Messages (17)
msg173124 - (view) Author: Trent Nelson (trent) * (Python committer) Date: 2012-10-17 02:19
======================================================================
ERROR: test_strxfrm (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 346, in test_strxfrm
    self.assertLess(locale.strxfrm('a'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]

======================================================================
ERROR: test_strxfrm_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
    self.assertLess(locale.strxfrm('à'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]

----------------------------------------------------------------------

Haven't investigated yet.
msg173164 - (view) Author: Trent Nelson (trent) * (Python committer) Date: 2012-10-17 12:56
With the caveat that I know absolutely nothing about locales, here's what I've been able to reduce the problem down to:

zinc (alias s11, Solaris 11 x64):
    >>> locale.setlocale(locale.LC_ALL, 'C')
    'C'
    >>> locale.strxfrm('a')
    'a'
    >>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
    'en_US.UTF-8'
    >>> locale.strxfrm('a')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: character U+10105a3 is not in range [U+0000; U+10ffff]
    >>> 

nitrogen (alias s10, Solaris 10 SPARC):

    >>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
    'en_US.UTF-8'
    >>> locale.strxfrm('a')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: character U+101010e is not in range [U+0000; U+10ffff]

Not sure how relevant it is, but on both those Solaris boxes, locale.LC_ALL returns 6, whereas on BSD and OS X it always seems to return 0.
msg173166 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-10-17 13:02
I can reproduce this on my x86 Solaris 10 update 10.
msg173167 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-17 13:03
With the system Python on s10:

Python 2.6.8 (unknown, Apr 13 2012, 17:08:12) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.strxfrm('a')
'a'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'
>>> locale.strxfrm('a').decode('utf-8')
u'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'

The difference between Python 2 and Python 3 is that Python 3 uses wcsxfrm, not strxfrm. Apparently Solaris' wcsxfrm is some broken thing that returns the same thing as strxfrm, cast to a wchar_t *, hence the character U+101010e (corresponding to the '\x01\x01\x01\x0e' bytestring above).
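The link between the byte string and the rejected code point can be checked with a short sketch. It assumes the bytes are read as big-endian 32-bit units, which matches the U+101010e seen on the SPARC box. The names `raw`, `units` and `MAX_CODE_POINT` are illustrative only.

```python
import struct

# Byte string returned by strxfrm('a') in the Python 2.6 session above.
raw = b'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'

MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode range

# Assumption: reinterpret the bytes as big-endian unsigned 32-bit integers,
# i.e. what a wchar_t * view of the same buffer would yield on SPARC.
units = struct.unpack('>%dI' % (len(raw) // 4), raw)
first = units[0]

# Collect every unit that cannot be represented as a Python 3.3+ str character.
out_of_range = [u for u in units if u > MAX_CODE_POINT]

# chr() enforces the same range check that produces the ValueError in the tests.
try:
    chr(first)
    chr_failed = False
except ValueError:
    chr_failed = True
```

Under that reading, `first` is 0x101010e and all five units fall outside the valid range, so any strict conversion to `str` has to fail.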
msg173168 - (view) Author: Jesús Cea Avión (jcea) * (Python committer) Date: 2012-10-17 13:05
BTW, this works in python 3.2:

x86, 32 bit python, Solaris 10 update 10:

"""
Python 3.2.3 (default, Apr 12 2012, 13:29:13) 
[GCC 4.7.0] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'���\U00010f69�'
"""
msg173171 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-17 13:34
It only works on Python 3.2 because PyUnicode_FromWideChar is more permissive, it seems. The first character in the wchar_t string returned by Solaris is still 0x101010e.
msg173172 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-10-17 13:44
(by the way, I also tried a memset() before calling wcsxfrm(): no change)
msg173199 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-10-17 19:28
Python 3.2 rejects characters outside the range U+0000-U+10ffff in
some operations, but not everywhere. I fixed Python 3.3 to be more
strict and always reject characters outside this range. I noticed the
Solaris issue with mbstowcs() on locale encodings different than
UTF-8: #13441. I asked if it's more important to be strict on Unicode,
or if we need to handle the wcsxfrm() issue on python-dev:
http://mail.python.org/pipermail/python-dev/2011-December/114759.html

Stefan Krah answered: "Yes, if the cause is a broken mbstowcs() that
sounds good."
http://mail.python.org/pipermail/python-dev/2011-December/114781.html

I asked for help on OpenIndiana IRC channel, but nobody had a locale
encoding different than UTF-8. I didn't have access to a Solaris box,
so I chose to skip failing tests on Solaris.

My commit 2a2d0872d993 (and 7ffe3d304487) skips many locales to
work around this issue in test__locale.
msg289382 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-03-10 15:51
Maybe issue15954 is related to this issue. Is this issue still reproducible?
msg296414 - (view) Author: Peter (petriborg) Date: 2017-06-20 12:16
I'm getting the same 2 errors in Python 3.4.6 on Solaris 11.

Comes up when you run 'gmake test' or

./python -W default -bb -E -W error::BytesWarning -m test -r -w -j 0 -v test_locale.py
msg296415 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-20 12:23
A solution for that would be to return the raw byte string or a list of integers, rather than a Unicode string.

I don't think that locale.strxfrm() result is supposed to be displayed in a terminal, it should only be used to sort two strings, or to be used as a key function for list.sort() for example.
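A minimal sketch of that intended use, as a sort key. It assumes the "C" locale so that it runs anywhere; the report itself concerns 'en_US.UTF-8', which may not be installed on a given machine. The word list is made up for illustration.

```python
import locale

# Assumption: the "C" locale is used here because it is always available.
# Substitute 'en_US.UTF-8' to exercise the locale discussed in this issue.
locale.setlocale(locale.LC_COLLATE, 'C')

words = ['pear', 'apple', 'fig']

# The transformed strings are only compared with each other, never displayed.
ordered = sorted(words, key=locale.strxfrm)
```

Since the key is only ever compared, its concrete type (str, bytes or tuple) matters to callers only if they inspect or store it.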
msg296416 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-20 12:26
Currently, the function is documented to return a string:
https://docs.python.org/dev/library/locale.html#locale.strxfrm
"Transforms a string to one that can be used in locale-aware comparisons."

The problem is that we don't have enough developers who care about Solaris/Illumos to fix these issues (propose patches).

test_locale is just *one* example. The curses module has been broken for years on Solaris, if I recall correctly...
msg296418 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-20 12:47
It is possible to use the special "encoding" for transformed strings on platforms with broken wcsxfrm().

All codes < 0x10000 are not changed. Codes >= 0x10000 are encoded as a pair: 0x10000 + (code >> 16), code & 0xffff.
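The pair encoding described above can be sketched as follows. `encode_wchars` is a hypothetical helper, not an existing CPython function, and the input is assumed to be a sequence of unsigned 32-bit values such as wcsxfrm() would produce.

```python
def encode_wchars(wchars):
    """Map 32-bit wchar_t values to a str made only of valid code points."""
    out = []
    for code in wchars:
        if code < 0x10000:
            # Small codes are kept unchanged.
            out.append(chr(code))
        else:
            # Large codes become a pair: a marker carrying the high 16 bits,
            # followed by the low 16 bits.
            out.append(chr(0x10000 + (code >> 16)))
            out.append(chr(code & 0xFFFF))
    return ''.join(out)

# Two of the values seen in the Solaris output earlier in this thread.
key_a = encode_wchars([0x0101010E])
key_b = encode_wchars([0x01010F69])
```

For 32-bit input the marker is at most 0x1ffff, so every resulting character stays within U+10ffff. The first element of a pair is always at least 0x10001, which keeps encoded large codes sorting after unchanged small ones.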
msg296435 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-20 14:20
> It is possible to use the special "encoding" for transformed strings on platforms with broken wcsxfrm().

I wouldn't say that the function is wrong. wchar_t is 32 bits long, the
function is free to use numbers > 0x10ffff. It's more a Python
limitation, no?
msg296440 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-20 14:36
Agree, it's more a Python limitation.
msg296441 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-20 14:38
> Agree, it's more a Python limitation.

What do you think of changing locale.strxfrm() from str to bytes or tuple? I prefer a tuple.

But again, I'm not super motivated by this change. IMHO there are more severe issues that should be fixed in Solaris.
msg296445 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-20 14:54
This will change the documented behavior. Even if we allow this change in a new feature release, it can't be made in maintenance releases.

A tuple of integers uses excessive memory and is slow. A bytes object is more compact (but may be less compact than a string) and faster. But on little-endian platforms every wchar_t would have to be converted to big-endian to support comparison of bytes objects.
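The byte-order point can be illustrated with a small sketch. `to_sort_key` is a hypothetical helper, and the values are assumed to be unsigned 32-bit integers.

```python
import struct

def to_sort_key(wchars, byteorder='>'):
    """Pack 32-bit values into bytes; '>' is big-endian, '<' is little-endian."""
    return struct.pack('%s%dI' % (byteorder, len(wchars)), *wchars)

small, large = [0x00000002], [0x00000100]

# Big-endian packing puts the most significant byte first, so comparing the
# bytes objects agrees with comparing the integers.
big_endian_ok = to_sort_key(small) < to_sort_key(large)

# Little-endian packing puts the least significant byte first, so the same
# comparison gives the wrong answer for these values.
little_endian_ok = to_sort_key(small, '<') < to_sort_key(large, '<')
```

With these inputs only the big-endian keys sort in numeric order, which is why a bytes-based return value would need an explicit conversion on little-endian platforms.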
History
Date User Action Args
2017-06-20 14:54:41  serhiy.storchaka  set  messages: + msg296445
2017-06-20 14:38:53  vstinner  set  messages: + msg296441
2017-06-20 14:36:13  serhiy.storchaka  set  messages: + msg296440
2017-06-20 14:20:32  vstinner  set  messages: + msg296435
2017-06-20 12:48:36  serhiy.storchaka  set  components: + Extension Modules, - Interpreter Core
2017-06-20 12:47:47  serhiy.storchaka  set  type: behavior; messages: + msg296418; components: + Interpreter Core; versions: + Python 3.5, Python 3.6, Python 3.7, - Python 3.3, Python 3.4
2017-06-20 12:26:30  pitrou  set  nosy: - pitrou
2017-06-20 12:26:12  vstinner  set  messages: + msg296416
2017-06-20 12:23:29  vstinner  set  messages: + msg296415
2017-06-20 12:16:35  petriborg  set  nosy: + petriborg; messages: + msg296414
2017-03-10 15:51:28  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg289382
2012-10-17 19:28:28  vstinner  set  messages: + msg173199
2012-10-17 14:36:26  jcea  set  nosy: + vstinner
2012-10-17 14:35:41  jcea  link  issue13441 superseder
2012-10-17 13:44:36  pitrou  set  messages: + msg173172
2012-10-17 13:34:00  pitrou  set  messages: + msg173171
2012-10-17 13:05:36  jcea  set  keywords: + 3.3regression; messages: + msg173168
2012-10-17 13:03:20  pitrou  set  nosy: + loewis, pitrou; messages: + msg173167
2012-10-17 13:02:59  jcea  set  messages: + msg173166
2012-10-17 12:56:34  trent  set  messages: + msg173164
2012-10-17 03:08:51  jcea  set  nosy: + jcea
2012-10-17 02:19:55  trent  create