Issue 16258: test_local.TestEnUSCollection failures on Solaris 10

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/60462

classification

Title:	test_local.TestEnUSCollection failures on Solaris 10
Type:	behavior	Stage:
Components:	Extension Modules	Versions:	Python 3.7, Python 3.6, Python 3.5

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	jcea, loewis, petriborg, serhiy.storchaka, trent, vstinner
Priority:	normal	Keywords:	3.3regression

Created on 2012-10-17 02:19 by trent, last changed 2022-04-11 14:57 by admin.

Messages (17)
msg173124 - (view)	Author: Trent Nelson (trent) *	Date: 2012-10-17 02:19
======================================================================
ERROR: test_strxfrm (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 346, in test_strxfrm
    self.assertLess(locale.strxfrm('a'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]

======================================================================
ERROR: test_strxfrm_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
    self.assertLess(locale.strxfrm('à'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]

----------------------------------------------------------------------

Haven't investigated yet.
msg173164 - (view)	Author: Trent Nelson (trent) *	Date: 2012-10-17 12:56
With the caveat that I know absolutely nothing about locales, here's what I've been able to reduce the problem down to:

zinc (alias s11, Solaris 11 x64):

>>> locale.setlocale(locale.LC_ALL, 'C')
'C'
>>> locale.strxfrm('a')
'a'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+10105a3 is not in range [U+0000; U+10ffff]
>>>

nitrogen (alias s10, Solaris 10 SPARC):

>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+101010e is not in range [U+0000; U+10ffff]

Not sure how relevant it is, but on both those Solaris boxes, locale.LC_ALL returns 6, whereas on BSD and OS X it always seems to return 0.
msg173166 - (view)	Author: Jesús Cea Avión (jcea) *	Date: 2012-10-17 13:02
I can reproduce this on my x86 Solaris 10 update 10.
msg173167 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-10-17 13:03
With the system Python on s10:

Python 2.6.8 (unknown, Apr 13 2012, 17:08:12) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.strxfrm('a')
'a'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'
>>> locale.strxfrm('a').decode('utf-8')
u'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'

The difference between Python 2 and Python 3 is that Python 3 uses wcsxfrm, not strxfrm. Apparently Solaris' wcsxfrm is some broken thing that returns the same thing as strxfrm, cast to a wchar_t *, hence the character U+101010e (corresponding to the '\x01\x01\x01\x0e' bytestring above).
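The correspondence between the byte string and the rejected character can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes the four bytes are read most significant byte first (as on the big-endian SPARC box), and the names `prefix` and `MAX_CODE_POINT` are illustrative.

```python
# Leading four bytes of the Python 2.6 strxfrm('a') result on s10.
prefix = b"\x01\x01\x01\x0e"

# Read them as a single 32-bit value, most significant byte first.
code = int.from_bytes(prefix, "big")
print(hex(code))  # 0x101010e

# Python 3.3 and later accept code points only up to U+10FFFF.
MAX_CODE_POINT = 0x10FFFF
print(code > MAX_CODE_POINT)  # True

# Building a str from that value therefore fails.
try:
    chr(code)
except ValueError as exc:
    print("rejected:", exc)
```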
msg173168 - (view)	Author: Jesús Cea Avión (jcea) *	Date: 2012-10-17 13:05
BTW, this works in python 3.2:

x86, 32 bit python, Solaris 10 update 10:

"""
Python 3.2.3 (default, Apr 12 2012, 13:29:13)
[GCC 4.7.0] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'��\U00010f69�'
"""
msg173171 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-10-17 13:34
It only works on Python 3.2 because PyUnicode_FromWideChar is more permissive, it seems. The first character in the wchar_t string returned by Solaris is still 0x101010e.
msg173172 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2012-10-17 13:44
(by the way, I also tried a memset() before calling wcsxfrm(): no change)
msg173199 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-17 19:28
Python 3.2 rejects characters outside the range U+0000-U+10ffff in some operations, but not everywhere. I fixed Python 3.3 to be more strict and always reject characters outside this range.

I noticed the Solaris issue with mbstowcs() on locale encodings other than UTF-8: #13441.

I asked on python-dev whether it's more important to be strict on Unicode, or whether we need to handle the wcsxfrm() issue:
http://mail.python.org/pipermail/python-dev/2011-December/114759.html

Stefan Krah answered: "Yes, if the cause is a broken mbstowcs() that sounds good."
http://mail.python.org/pipermail/python-dev/2011-December/114781.html

I asked for help on the OpenIndiana IRC channel, but nobody had a locale encoding other than UTF-8. I didn't have access to a Solaris box, so I chose to skip the failing tests on Solaris. My commit 2a2d0872d993 (and 7ffe3d304487) skips many locales to work around this issue in test__locale.
msg289382 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-03-10 15:51
Maybe issue15954 is related to this issue. Is this issue still reproducible?
msg296414 - (view)	Author: Peter (petriborg)	Date: 2017-06-20 12:16
I'm getting the same 2 errors in Python 3.4.6 on Solaris 11. They come up when you run 'gmake test' or:

./python -W default -bb -E -W error::BytesWarning -m test -r -w -j 0 -v test_locale.py
msg296415 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-20 12:23
One solution would be to return the raw byte string or a list of integers, rather than a Unicode string. I don't think the result of locale.strxfrm() is supposed to be displayed in a terminal; it should only be used to compare two strings, or as a key function for list.sort(), for example.
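For reference, a minimal sketch of the key-function usage mentioned above. The "C" locale is chosen here only because it is available on every platform (an assumption made for portability); the word list is illustrative. The transformed values are never printed, only compared.

```python
import locale

# Use the portable "C" locale for collation; a real program would pick
# the user's locale instead, e.g. locale.setlocale(locale.LC_COLLATE, "").
locale.setlocale(locale.LC_COLLATE, "C")

words = ["pear", "apple", "fig"]

# strxfrm() results serve purely as opaque sort keys.
words.sort(key=locale.strxfrm)
print(words)  # ['apple', 'fig', 'pear']
```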
msg296416 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-20 12:26
Currently, the function is documented to return a string:

https://docs.python.org/dev/library/locale.html#locale.strxfrm
"Transforms a string to one that can be used in locale-aware comparisons."

The problem is that we don't have enough developers who care about Solaris/Illumos to fix these issues (propose patches). test_locale is just one example. The curses module has been broken for years on Solaris, if I recall correctly...
msg296418 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-06-20 12:47
It is possible to use the special "encoding" for transformed strings on platforms with broken wcsxfrm(). All codes < 0x10000 are not changed. Codes >= 0x10000 are encoded as a pair: 0x10000 + (code >> 16), code & 0xffff.
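A minimal sketch of that encoding in Python, assuming the wcsxfrm() output is available as a sequence of 32-bit integer values. The function name `encode_wide_codes` is hypothetical; this is an illustration of the scheme, not code from CPython.

```python
def encode_wide_codes(codes):
    """Map 32-bit wchar_t values to a str whose ordering matches the input.

    Values below 0x10000 are kept as they are. Larger values are split
    into a pair: 0x10000 + (code >> 16), then code & 0xffff.
    """
    units = []
    for code in codes:
        if code < 0x10000:
            units.append(code)
        else:
            units.append(0x10000 + (code >> 16))  # at most 0x1ffff
            units.append(code & 0xFFFF)
    return "".join(chr(unit) for unit in units)


# The value rejected on Solaris 10 becomes two valid code points.
key = encode_wide_codes([0x101010E])
print([hex(ord(ch)) for ch in key])  # ['0x10101', '0x10e']
```

Because the first element of a pair is always above 0xffff, an encoded large value still sorts after every unchanged small value, so comparisons between encoded strings follow the order of the original integers.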
msg296435 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-20 14:20
> It is possible to use the special "encoding" for transformed strings on platforms with broken wcsxfrm().

I wouldn't say that the function is wrong. wchar_t is 32 bits long, so the function is free to use numbers > 0x10ffff. It's more a Python limitation, no?
msg296440 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-06-20 14:36
Agree, it's more a Python limitation.
msg296441 - (view)	Author: STINNER Victor (vstinner) *	Date: 2017-06-20 14:38
> Agree, it's more a Python limitation.

What do you think of changing locale.strxfrm() from str to bytes or tuple? I prefer a tuple. But again, I'm not super motivated by this change. IMHO there are more severe issues that should be fixed in Solaris.
msg296445 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2017-06-20 14:54
This would change the documented behavior. Even if we allow this change in a new feature release, it can't be made in maintenance releases.

A tuple of integers is memory-hungry and slow. A bytes object is more compact (but may be less compact than a string) and faster. But on little-endian platforms every wchar_t would have to be converted to big-endian for comparisons of bytes objects to give the right order.
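The byte-order point can be illustrated with a small sketch. It assumes 32-bit wchar_t values given as Python integers; the helper name `codes_to_bytes_key` and its `byteorder` parameter are hypothetical and exist only for this demonstration.

```python
import struct


def codes_to_bytes_key(codes, byteorder=">"):
    """Pack each 32-bit value into 4 bytes.

    byteorder is a struct prefix: ">" for big-endian, "<" for little-endian.
    """
    return b"".join(struct.pack(byteorder + "I", code) for code in codes)


low, high = [0x00000100], [0x00010000]

# Big-endian packing keeps the numeric order when bytes are compared.
print(codes_to_bytes_key(low) < codes_to_bytes_key(high))            # True

# Little-endian packing (the raw memory layout on x86) does not.
print(codes_to_bytes_key(low, "<") < codes_to_bytes_key(high, "<"))  # False
```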

History
Date	User	Action	Args
2022-04-11 14:57:37	admin	set	github: 60462
2017-06-20 14:54:41	serhiy.storchaka	set	messages: + msg296445
2017-06-20 14:38:53	vstinner	set	messages: + msg296441
2017-06-20 14:36:13	serhiy.storchaka	set	messages: + msg296440
2017-06-20 14:20:32	vstinner	set	messages: + msg296435
2017-06-20 12:48:36	serhiy.storchaka	set	components: + Extension Modules, - Interpreter Core
2017-06-20 12:47:47	serhiy.storchaka	set	type: behavior messages: + msg296418 components: + Interpreter Core versions: + Python 3.5, Python 3.6, Python 3.7, - Python 3.3, Python 3.4
2017-06-20 12:26:30	pitrou	set	nosy: - pitrou
2017-06-20 12:26:12	vstinner	set	messages: + msg296416
2017-06-20 12:23:29	vstinner	set	messages: + msg296415
2017-06-20 12:16:35	petriborg	set	nosy: + petriborg messages: + msg296414
2017-03-10 15:51:28	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg289382
2012-10-17 19:28:28	vstinner	set	messages: + msg173199
2012-10-17 14:36:26	jcea	set	nosy: + vstinner
2012-10-17 14:35:41	jcea	link	issue13441 superseder
2012-10-17 13:44:36	pitrou	set	messages: + msg173172
2012-10-17 13:34:00	pitrou	set	messages: + msg173171
2012-10-17 13:05:36	jcea	set	keywords: + 3.3regression messages: + msg173168
2012-10-17 13:03:20	pitrou	set	nosy: + loewis, pitrou messages: + msg173167
2012-10-17 13:02:59	jcea	set	messages: + msg173166
2012-10-17 12:56:34	trent	set	messages: + msg173164
2012-10-17 03:08:51	jcea	set	nosy: + jcea
2012-10-17 02:19:55	trent	create