msg173124 - (view) |
Author: Trent Nelson (trent) *  |
Date: 2012-10-17 02:19 |
======================================================================
ERROR: test_strxfrm (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 346, in test_strxfrm
self.assertLess(locale.strxfrm('a'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]
======================================================================
ERROR: test_strxfrm_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
self.assertLess(locale.strxfrm('à'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]
----------------------------------------------------------------------
Haven't investigated yet.
|
msg173164 - (view) |
Author: Trent Nelson (trent) *  |
Date: 2012-10-17 12:56 |
With the caveat that I know absolutely nothing about locales, here's what I've been able to reduce the problem down to:
zinc (alias s11, Solaris 11 x64):
>>> locale.setlocale(locale.LC_ALL, 'C')
'C'
>>> locale.strxfrm('a')
'a'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: character U+10105a3 is not in range [U+0000; U+10ffff]
>>>
nitrogen (alias s10, Solaris 10 SPARC):
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: character U+101010e is not in range [U+0000; U+10ffff]
Not sure how relevant it is, but on both those Solaris boxes, locale.LC_ALL returns 6, whereas on BSD and OS X it always seems to return 0.
|
msg173166 - (view) |
Author: Jesús Cea Avión (jcea) *  |
Date: 2012-10-17 13:02 |
I can reproduce this on my x86 Solaris 10 update 10.
|
msg173167 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-10-17 13:03 |
With the system Python on s10:
Python 2.6.8 (unknown, Apr 13 2012, 17:08:12) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.strxfrm('a')
'a'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'
>>> locale.strxfrm('a').decode('utf-8')
u'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'
The difference between Python 2 and Python 3 is that Python 3 uses wcsxfrm, not strxfrm. Apparently Solaris' wcsxfrm is some broken thing that returns the same thing as strxfrm, cast to a wchar_t *, hence the character U+101010e (corresponding to the '\x01\x01\x01\x0e' bytestring above).
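A quick way to check that reading is to reinterpret the byte string from the Python 2.6 session as 32-bit units. This is only a sketch: it assumes a 4-byte, big-endian wchar_t (as on SPARC), and the helper name is made up for illustration.

```python
import struct

# Bytes printed by Python 2.6 for locale.strxfrm('a') in the session above.
raw = (b'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02'
       b'\x01\x01\x0fi\x01\x01\x01\x01')

# Assumption: wchar_t is 4 bytes wide and stored big-endian (SPARC).
units = struct.unpack('>%dI' % (len(raw) // 4), raw)
print([hex(u) for u in units])


def is_valid_code_point(value):
    """Hypothetical helper: True if chr() accepts the value."""
    try:
        chr(value)
    except ValueError:
        return False
    return True


# The first unit is the value reported in the ValueError.
print(hex(units[0]), is_valid_code_point(units[0]))
```

Every unit lands above U+10ffff, which would explain why Python 3.3 refuses to build a str from the result.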
|
msg173168 - (view) |
Author: Jesús Cea Avión (jcea) *  |
Date: 2012-10-17 13:05 |
BTW, this works in python 3.2:
x86, 32 bit python, Solaris 10 update 10:
"""
Python 3.2.3 (default, Apr 12 2012, 13:29:13)
[GCC 4.7.0] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'���\U00010f69�'
"""
|
msg173171 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-10-17 13:34 |
It only works on Python 3.2 because PyUnicode_FromWideChar is more permissive, it seems. The first character in the wchar_t string returned by Solaris is still 0x101010e.
|
msg173172 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2012-10-17 13:44 |
(by the way, I also tried a memset() before calling wcsxfrm(): no change)
|
msg173199 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2012-10-17 19:28 |
Python 3.2 rejects characters outside the range U+0000-U+10ffff in
some operations, but not everywhere. I fixed Python 3.3 to be more
strict and always reject characters outside this range. I noticed the
Solaris issue with mbstowcs() on locale encodings different than
UTF-8: #13441. I asked if it's more important to be strict on Unicode,
or if we need to handle the wcsxfrm() issue on python-dev:
http://mail.python.org/pipermail/python-dev/2011-December/114759.html
Stefan Krah answered: "Yes, if the cause is a broken mbstowcs() that
sounds good."
http://mail.python.org/pipermail/python-dev/2011-December/114781.html
I asked for help on OpenIndiana IRC channel, but nobody had a locale
encoding different than UTF-8. I didn't have access to a Solaris box,
so I chose to skip failing tests on Solaris.
My commits 2a2d0872d993 and 7ffe3d304487 skip many locales to
work around this issue in test__locale.
|
msg289382 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-03-10 15:51 |
Maybe issue15954 is related to this issue. Can this issue still be reproduced?
|
msg296414 - (view) |
Author: Peter (petriborg) |
Date: 2017-06-20 12:16 |
I'm getting the same 2 errors in Python 3.4.6 on Solaris 11.
Comes up when you run 'gmake test' or
./python -W default -bb -E -W error::BytesWarning -m test -r -w -j 0 -v test_locale.py
|
msg296415 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-06-20 12:23 |
A solution for that would be to return the raw byte string or a list of integers, rather than a Unicode string.
I don't think that locale.strxfrm() result is supposed to be displayed in a terminal, it should only be used to sort two strings, or to be used as a key function for list.sort() for example.
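For reference, the intended usage looks roughly like this. The sketch sets the 'C' locale so it runs anywhere; the word list is arbitrary.

```python
import locale

# Assumption: the 'C' locale, chosen only because it exists on every platform.
locale.setlocale(locale.LC_ALL, 'C')

words = ['pear', 'apple', 'fig']

# The transformed strings are used only as sort keys and are never displayed.
ordered = sorted(words, key=locale.strxfrm)
print(ordered)
```

Nothing here depends on the key being a str; any type with a consistent ordering would serve the same purpose.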
|
msg296416 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-06-20 12:26 |
Currently, the function is documented to return a string:
https://docs.python.org/dev/library/locale.html#locale.strxfrm
"Transforms a string to one that can be used in locale-aware comparisons."
The problem is that we don't have enough developers who care about Solaris/Illumos to fix these issues (propose patches).
test_locale is just *one* example. The curses module has been broken for years on Solaris, if I recall correctly...
|
msg296418 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-06-20 12:47 |
It is possible to use a special "encoding" for transformed strings on platforms with broken wcsxfrm().
All codes < 0x10000 are not changed. Codes >= 0x10000 are encoded as a pair: 0x10000 + (code >> 16), code & 0xffff.
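A minimal sketch of that scheme in pure Python, assuming the input is a sequence of unsigned 32-bit wchar_t values; the function name is hypothetical, and a real fix would live in the C code of the locale module.

```python
def encode_transformed(codes):
    """Build a str from raw wchar_t values using the pair scheme above."""
    chars = []
    for code in codes:
        if code < 0x10000:
            # Small values are kept as-is.
            chars.append(chr(code))
        else:
            # Large values become two characters: high part, then low part.
            chars.append(chr(0x10000 + (code >> 16)))
            chars.append(chr(code & 0xffff))
    return ''.join(chars)


# The out-of-range value from the traceback now fits in a str.
print(ascii(encode_transformed([0x101010e])))
```

Since the first character of a pair is always >= 0x10000, pairs sort after every unchanged code, so comparing the resulting strings should give the same order as comparing the original values.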
|
msg296435 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-06-20 14:20 |
> It is possible to use the special "encoding" for transformed strings on platforms with broken wcsxfrm().
I wouldn't say that the function is wrong. wchar_t is 32 bits wide, so the
function is free to use numbers > 0x10ffff. It's more a Python
limitation, no?
|
msg296440 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-06-20 14:36 |
Agree, it's more a Python limitation.
|
msg296441 - (view) |
Author: STINNER Victor (vstinner) *  |
Date: 2017-06-20 14:38 |
> Agree, it's more a Python limitation.
What do you think of changing locale.strxfrm() from str to bytes or tuple? I prefer a tuple.
But again, I'm not super motivated by this change. IMHO there are more severe issues that should be fixed in Solaris.
|
msg296445 - (view) |
Author: Serhiy Storchaka (serhiy.storchaka) *  |
Date: 2017-06-20 14:54 |
This will change the documented behavior. Even if we allow this change in a new feature release, it can't be made in maintained releases.
A tuple of integers uses excessive memory and is slow. A bytes object is more compact (but may be less compact than a string) and faster. But on little-endian platforms every wchar_t would have to be converted to big-endian for comparison of bytes objects to work.
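The byte-order point can be illustrated with struct. This is only a sketch, assuming 32-bit unsigned wchar_t values; pack_key is a hypothetical helper.

```python
import struct


def pack_key(codes, byteorder='>'):
    """Pack wchar_t values into bytes; '>' is big-endian, '<' little-endian."""
    return struct.pack('%s%dI' % (byteorder, len(codes)), *codes)


low, high = [0xff], [0x100]

# Big-endian bytes compare in the same order as the integers.
print(pack_key(low) < pack_key(high))

# Little-endian bytes compare the least significant byte first,
# so the order comes out reversed for this pair.
print(pack_key(low, '<') < pack_key(high, '<'))
```

With big-endian packing each value takes a fixed four bytes, so comparing two bytes objects matches comparing the corresponding tuples of integers.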
|
|
Date | User | Action | Args |
2022-04-11 14:57:37 | admin | set | github: 60462 |
2017-06-20 14:54:41 | serhiy.storchaka | set | messages: + msg296445 |
2017-06-20 14:38:53 | vstinner | set | messages: + msg296441 |
2017-06-20 14:36:13 | serhiy.storchaka | set | messages: + msg296440 |
2017-06-20 14:20:32 | vstinner | set | messages: + msg296435 |
2017-06-20 12:48:36 | serhiy.storchaka | set | components: + Extension Modules, - Interpreter Core |
2017-06-20 12:47:47 | serhiy.storchaka | set | type: behavior; messages: + msg296418; components: + Interpreter Core; versions: + Python 3.5, Python 3.6, Python 3.7, - Python 3.3, Python 3.4 |
2017-06-20 12:26:30 | pitrou | set | nosy: - pitrou |
2017-06-20 12:26:12 | vstinner | set | messages: + msg296416 |
2017-06-20 12:23:29 | vstinner | set | messages: + msg296415 |
2017-06-20 12:16:35 | petriborg | set | nosy: + petriborg; messages: + msg296414 |
2017-03-10 15:51:28 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg289382 |
2012-10-17 19:28:28 | vstinner | set | messages: + msg173199 |
2012-10-17 14:36:26 | jcea | set | nosy: + vstinner |
2012-10-17 14:35:41 | jcea | link | issue13441 superseder |
2012-10-17 13:44:36 | pitrou | set | messages: + msg173172 |
2012-10-17 13:34:00 | pitrou | set | messages: + msg173171 |
2012-10-17 13:05:36 | jcea | set | keywords: + 3.3regression; messages: + msg173168 |
2012-10-17 13:03:20 | pitrou | set | nosy: + loewis, pitrou; messages: + msg173167 |
2012-10-17 13:02:59 | jcea | set | messages: + msg173166 |
2012-10-17 12:56:34 | trent | set | messages: + msg173164 |
2012-10-17 03:08:51 | jcea | set | nosy: + jcea |
2012-10-17 02:19:55 | trent | create | |