classification
Title: TestEnUSCollation.test_strxfrm() fails on Solaris
Type:
Stage:
Components: Unicode
Versions: Python 3.3
process
Status: closed
Resolution: fixed
Dependencies:
Superseder: test_local.TestEnUSCollection failures on Solaris 10
View: 16258
Assigned To:
Nosy List: ezio.melotti, jcea, loewis, python-dev, skrah, vstinner
Priority: normal
Keywords:

Created on 2011-11-20 23:58 by vstinner, last changed 2012-10-17 14:35 by jcea. This issue is now closed.

Files
File name           Uploaded                      Description
strxfrm.c           vstinner, 2011-11-21 02:09
localeconv_wchar.c  vstinner, 2011-12-08 01:23
Messages (24)
msg148017 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-11-20 23:58
I added a test in _PyUnicode_CheckConsistency() (in debug mode) to ensure that all characters of a string are in the range U+0000-U+10FFFF. Locale tests are now failing on Solaris:

-----------------------------------
[ 28/361] test__locale
Assertion failed: maxchar <= 0x10FFFF, file Objects/unicodeobject.c, line 408
Fatal Python error: Aborted

Current thread 0x00000001:
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test__locale.py", line 134 in test_float_parsing
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/case.py", line 385 in _executeTestPart
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/case.py", line 440 in run
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/case.py", line 492 in __call__
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/suite.py", line 105 in run
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/suite.py", line 67 in __call__
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/suite.py", line 105 in run
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/suite.py", line 67 in __call__
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/runner.py", line 168 in run
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/support.py", line 1368 in _run_suite
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/support.py", line 1402 in run_unittest
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test__locale.py", line 139 in test_main
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/regrtest.py", line 1203 in runtest_inner
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/regrtest.py", line 906 in runtest
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/regrtest.py", line 709 in main
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/__main__.py", line 13 in <module>
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/runpy.py", line 73 in _run_code
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/runpy.py", line 160 in _run_module_as_main
*** Error code 134
-----------------------------------

The problem is that strxfrm() and wcsxfrm() return strange results for the string "a" in an English locale (e.g. en_US.UTF-8).

strxfrm(buffer, "a\0", 100) returns 21 (bytes) but only 2 bytes are written ("\x01\x00"). The next bytes are unchanged.

wcsxfrm(buffer, L"a\0", 100) returns 7 (characters). All 7 characters are written, but they are in the range U+1010101..U+1010163, whereas the maximum code point of Unicode 6.0 is U+10FFFF (U+101xxxx vs U+10xxxx).

Output of the attached program, strxfrm.c, on OpenSolaris:
-----------------------------------
strxfrm: len=21
0x01
0x00
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff
0xff

wcsxfrm: len=7
U+1010163
U+1010101
U+1010103
U+1010101
U+1010103
U+1010101
U+1010101
-----------------------------------

I don't know if it's normal that wcsxfrm() writes characters in the range U+1010101..U+1010163.

Is Python supposed to support characters outside U+0000-U+10FFFF range? chr(0x10FFFF+1) raises a ValueError.
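
The chr() behaviour mentioned above can be checked directly. This is only a minimal sketch: the names MAX_UNICODE and is_valid_code_point are made up for the illustration and are not CPython names.

```python
# Highest code point defined by Unicode.
MAX_UNICODE = 0x10FFFF


def is_valid_code_point(value):
    """Return True if chr() accepts the value, False if it raises ValueError."""
    try:
        chr(value)
    except ValueError:
        return False
    return True


# Distinct values copied from the wcsxfrm() dump above.
wcsxfrm_values = [0x1010163, 0x1010101, 0x1010103]

print(is_valid_code_point(MAX_UNICODE))        # largest accepted value
print(is_valid_code_point(MAX_UNICODE + 1))    # rejected by chr()
print([is_valid_code_point(v) for v in wcsxfrm_values])
```

None of the values written by wcsxfrm() on Solaris pass this check, which is why the new assertion fires.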
msg148019 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-21 00:11
New changeset 31baf1363ba1 by Victor Stinner in branch 'default':
Issue #13441: Disable temporary strxfrm() tests on Solaris
http://hg.python.org/cpython/rev/31baf1363ba1
msg148026 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-11-21 02:05
> Is Python supposed to support characters outside U+0000-U+10FFFF range?

If not, PyUnicode_FromUnicode(), PyUnicode_FromWideChar() and PyUnicode_FromKindAndData() should be patched to raise an error if a bigger character is encountered.
msg148027 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-11-21 02:09
> strxfrm(buffer, "a\0", 100) returns 21 (bytes) but only 2 bytes
> are written ("\x01\x00"). The next bytes are unchanged.

Woops, it was a bug in my program. I attached the fixed version. The correct program writes:
----
strxfrm: len=21
0x01
0x01
0x63
0x01
0x01
0x01
0x01
0x01
0x03
0x01
0x01
0x01
0x01
0x01
0x03
0x01
0x01
0x01
0x01
0x01
0x01

wcsxfrm: len=7
U+1010163
U+1010101
U+1010103
U+1010101
U+1010103
U+1010101
U+1010101
----
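
One way to read the two dumps side by side (an observation about the numbers only; nothing in this issue says how Solaris actually derives one from the other): grouping the 21 strxfrm() bytes three at a time and putting 0x01 in the top byte gives exactly the 7 wcsxfrm() values. A small sketch to check the arithmetic:

```python
# Bytes written by strxfrm(), copied from the corrected dump above.
strxfrm_bytes = bytes([
    0x01, 0x01, 0x63, 0x01, 0x01, 0x01, 0x01, 0x01, 0x03,
    0x01, 0x01, 0x01, 0x01, 0x01, 0x03, 0x01, 0x01, 0x01,
    0x01, 0x01, 0x01,
])

# Wide characters written by wcsxfrm(), copied from the same dump.
wcsxfrm_chars = [0x1010163, 0x1010101, 0x1010103, 0x1010101,
                 0x1010103, 0x1010101, 0x1010101]

# Take the bytes three at a time (big-endian) and set the top byte to 0x01.
packed = [
    0x01000000 | int.from_bytes(strxfrm_bytes[i:i + 3], "big")
    for i in range(0, len(strxfrm_bytes), 3)
]

print(len(strxfrm_bytes), len(packed))
print(packed == wcsxfrm_chars)
```

If this reading is right, the wide "characters" are opaque collation weights rather than code points, which would explain why they fall outside the Unicode range.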
msg148028 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-21 02:17
New changeset 78123afb3ea4 by Victor Stinner in branch 'default':
Issue #13441: Disable temporary localeconv() tests on Solaris
http://hg.python.org/cpython/rev/78123afb3ea4
msg148034 - (view) Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2011-11-21 12:14
> Is Python supposed to support characters outside U+0000-U+10FFFF range?

No, they should be rejected.  Allowing them in some specific places might cause them to leak somewhere else and cause problems, so I'd rather stick with that range and reject all the chars >U+10FFFF everywhere.
msg148038 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-21 13:32
New changeset a19dad38d4e8 by Victor Stinner in branch 'default':
Issue #13441: _PyUnicode_CheckConsistency() dumps the string if the maximum
http://hg.python.org/cpython/rev/a19dad38d4e8
msg148039 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-11-21 13:32
> No, they should be rejected. Allowing them in some specific
> places might cause them to leak somewhere else and cause problems,
> so I'd rather stick with that range and reject all the chars
> >U+10FFFF everywhere.

That's why I added a (debug) check to reject them. I don't think that your UTF-8 encoder supports such characters, for example. All functions assume that the maximum character is U+10FFFF.

If they should be rejected, a solution is to modify strxfrm() to return a list of integers (code points) instead of a string.
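
A rough sketch of why a list of integers would still work as a sort key. key_a is copied from the wcsxfrm() dump in this issue; key_b is a hypothetical key invented for the illustration, not real Solaris output.

```python
# Transformed key for "a", copied from the wcsxfrm() dump in this issue.
key_a = [0x1010163, 0x1010101, 0x1010103, 0x1010101,
         0x1010103, 0x1010101, 0x1010101]

# Hypothetical key for "b" (invented here, not taken from Solaris).
key_b = [0x1010164, 0x1010101, 0x1010103, 0x1010101,
         0x1010103, 0x1010101, 0x1010101]

keys = {"a": key_a, "b": key_b}

# Lists of integers compare element by element, so they can be used
# directly as sort keys, whatever the magnitude of the values.
ordered = sorted(["b", "a"], key=keys.__getitem__)
print(ordered)

# The same values cannot be stored in a str: chr() rejects them.
try:
    "".join(chr(c) for c in key_a)
    fits_in_str = True
except ValueError:
    fits_in_str = False
print(fits_in_str)
```

The drawback is that callers which expect locale.strxfrm() to return a str would have to change.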
msg148046 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-21 14:41
New changeset d1b3b1d00811 by Victor Stinner in branch 'default':
Another temporary hack to debug the issue #13441
http://hg.python.org/cpython/rev/d1b3b1d00811
msg148048 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-11-21 14:44
I dumped some values to try to debug this issue. Last failure in test__locale.test_lc_numeric_basic() on localeconv():
----------------------------
[ 25/361] test_float
Decode localeconv() decimal_point: {0x2c} (len=1)
Decode localeconv() thousands_sep: {0x2e} (len=1)
Decode localeconv() int_curr_symbol: {} (len=0)
Decode localeconv() currency_symbol: {} (len=0)
Decode localeconv() mon_decimal_point: {} (len=0)
Decode localeconv() mon_thousands_sep: {} (len=0)
Decode localeconv() positive_sign: {} (len=0)
Decode localeconv() negative_sign: {} (len=0)
...
[100/361] test__locale
Decode localeconv() decimal_point: {0x2c} (len=1)
Decode localeconv() thousands_sep: {0xa0} (len=1)
Invalid Unicode string! {U+30000020} (len=1)
Fatal Python error: Aborted
----------------------------
msg148051 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-21 15:01
New changeset acda16de630c by Victor Stinner in branch 'default':
Remove temporary hacks for the issue #13441
http://hg.python.org/cpython/rev/acda16de630c
msg148054 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-11-21 15:10
Here is a more complete output. localeconv() fails in the hu_HU locale for the "thousands_sep" field: localeconv() returns b'\xa0', which is decoded as the wchar_t* string {U+30000020} (len=1). This is an invalid character.
-----------------------------------
[ 54/361/3] test__locale
Decode wchar_t {U+0043} (len=1)
SET LOCALE "es_UY"
SET LOCALE "fr_FR"
SET LOCALE "fi_FI"
SET LOCALE "es_CO"
SET LOCALE "pt_PT"
SET LOCALE "it_IT"
SET LOCALE "et_EE"
SET LOCALE "es_PY"
SET LOCALE "no_NO"
SET LOCALE "nl_NL"
SET LOCALE "lv_LV"
SET LOCALE "el_GR"
SET LOCALE "be_BY"
SET LOCALE "fr_BE"
SET LOCALE "ro_RO"
SET LOCALE "ru_UA"
SET LOCALE "ru_RU"
SET LOCALE "es_VE"
SET LOCALE "ca_ES"
SET LOCALE "se_NO"
SET LOCALE "es_EC"
SET LOCALE "id_ID"
SET LOCALE "ka_GE"
SET LOCALE "es_CL"
SET LOCALE "hu_HU"
SET LOCALE -> hu_HU
Decode wchar_t {U+0068 U+0075 U+005f U+0048 U+0055} (len=5)
SET LOCALE "hu_HU"
SET LOCALE -> hu_HU
Decode wchar_t {U+0068 U+0075 U+005f U+0048 U+0055} (len=5)
Decode wchar_t {U+002c} (len=1)
Decode localeconv() decimal_point: {0x2c} (len=1)
Decode wchar_t {U+002c} (len=1)
Decode localeconv() thousands_sep: {0xa0} (len=1)
Decode wchar_t {U+30000020} (len=1)
Invalid Unicode string! {U+30000020} (len=1)
Fatal Python error: Aborted

Current thread 0x00000001:
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test__locale.py", line 105 in test_lc_numeric_basic
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/case.py", line 385 in _executeTestPart
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/case.py", line 440 in run
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/case.py", line 492 in __call__
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/suite.py", line 105 in run
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/suite.py", line 67 in __call__
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/suite.py", line 105 in run
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/suite.py", line 67 in __call__
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/unittest/runner.py", line 168 in run
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/support.py", line 1368 in _run_suite
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/support.py", line 1402 in run_unittest
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test__locale.py", line 141 in test_main
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/regrtest.py", line 1203 in runtest_inner
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/regrtest.py", line 906 in runtest
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/regrtest.py", line 709 in main
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/__main__.py", line 13 in <module>
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/runpy.py", line 73 in _run_code
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/runpy.py", line 160 in _run_module_as_main
*** Error code 134
-----------------------------------
msg148061 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-21 17:04
New changeset d6d15fcf5eb6 by Victor Stinner in branch 'default':
Issue #13441: Reenable strxfrm() tests on Solaris
http://hg.python.org/cpython/rev/d6d15fcf5eb6
msg148106 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-11-22 02:25
New changeset 6f9af4e3c1db by Victor Stinner in branch 'default':
Issue #13441: Disable temporary the check on the maximum character until
http://hg.python.org/cpython/rev/6f9af4e3c1db
msg149014 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-08 01:23
localeconv_wchar.c: test program to dump the thousands separator for a locale specified on the command line. I wrote this program to try to reproduce the hu_HU issue, but I cannot reproduce it on OpenIndiana. I only have UTF-8 locales on my OpenIndiana VM, whereas the issue seems to be specific to an ISO-8859-?? encoding (b'\xA0' is not decodable from UTF-8).
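
For comparison, this is what the byte looks like when decoded from Python. A small sketch, assuming the encoding is ISO-8859-2 (which is what the locale list later in this issue reports for hu_HU):

```python
raw = b"\xa0"  # thousands_sep bytes returned by localeconv() for hu_HU

# In ISO-8859-2 (and ISO-8859-1), byte 0xA0 is the no-break space U+00A0.
expected = raw.decode("iso8859-2")
print("U+%04X" % ord(expected))

# The same byte on its own is not valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)

# Value produced by the decoding on the Solaris buildbot, from the log above.
observed = 0x30000020
print(observed > 0x10FFFF)
```

So the expected result is a single U+00A0 character, not U+30000020.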
msg149022 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-08 12:16
See also the issue #7442.
msg149033 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-12-08 13:31
localeconv_wchar.c runs fine on Ubuntu with hu_HU and fi_FI.

I tried on OpenSolaris, but I only have UTF-8 locales. The package
with ISO locales seems to be SUNWlang-cs-extra, but Oracle took down
http://pkg.opensolaris.org/release/ .
msg149058 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-12-08 22:41
New changeset 93bab8400ca5 by Victor Stinner in branch 'default':
Issue #13441: Log the locale when localeconv() fails
http://hg.python.org/cpython/rev/93bab8400ca5
msg149059 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-08 22:42
Changeset 489ea02ed351 changed PyUnicode_FromWideChar() and PyUnicode_FromUnicode(): they now raise a ValueError if a character is not in the range [U+0000; U+10ffff].
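
The range check can be modelled in pure Python. This is only a sketch of the idea, not the C code from the changeset: the function name str_from_code_points is hypothetical, and the message wording is copied from the ValueError shown in this issue.

```python
MAX_UNICODE = 0x10FFFF


def str_from_code_points(code_points):
    """Build a str, raising ValueError on the first out-of-range value."""
    code_points = list(code_points)
    for cp in code_points:
        if not 0 <= cp <= MAX_UNICODE:
            raise ValueError(
                "character U+%04x is not in range [U+0000; U+10ffff]" % cp)
    return "".join(chr(cp) for cp in code_points)


# Valid input: the "hu_HU" locale name as decoded in the debug output.
print(str_from_code_points([0x68, 0x75, 0x5F, 0x48, 0x55]))

# Invalid input: the value seen for thousands_sep on Solaris.
try:
    str_from_code_points([0x30000020])
except ValueError as exc:
    message = str(exc)
    print(message)
```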

test__locale errors:

======================================================================
ERROR: test_float_parsing (test.test__locale._LocaleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test__locale.py", line 134, in test_float_parsing
    if localeconv()['decimal_point'] != '.':
ValueError: character U+30000020 is not in range [U+0000; U+10ffff]

======================================================================
ERROR: test_lc_numeric_basic (test.test__locale._LocaleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test__locale.py", line 105, in test_lc_numeric_basic
    li_radixchar = localeconv()[lc]
ValueError: character U+30000020 is not in range [U+0000; U+10ffff]

======================================================================
ERROR: test_lc_numeric_localeconv (test.test__locale._LocaleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test__locale.py", line 91, in test_lc_numeric_localeconv
    self.numeric_tester('localeconv', localeconv()[lc], lc, loc)
ValueError: character U+30000020 is not in range [U+0000; U+10ffff]

======================================================================
ERROR: test_lc_numeric_nl_langinfo (test.test__locale._LocaleTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home2/buildbot/slave/3.x.loewis-sun/build/Lib/test/test__locale.py", line 79, in test_lc_numeric_nl_langinfo
    self.numeric_tester('nl_langinfo', nl_langinfo(li), lc, loc)
ValueError: character U+30000020 is not in range [U+0000; U+10ffff]

----------------------------------------------------------------------

If the issue is specific to the hu_HU locale, a possible workaround is to skip this locale on Solaris. I changed the test to display the locale on failure.
msg149064 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-12-09 00:18
New changeset 87c6be1e393a by Victor Stinner in branch 'default':
Issue #13441: Don't test the hu_HU locale on Solaris to workaround a mbstowcs()
http://hg.python.org/cpython/rev/87c6be1e393a
msg149081 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-12-09 09:28
New changeset 2a2d0872d993 by Victor Stinner in branch 'default':
Issue #13441: Skip some locales (e.g. cs_CZ and hu_HU) on Solaris to workaround
http://hg.python.org/cpython/rev/2a2d0872d993
msg149085 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2011-12-09 10:29
New changeset 7ffe3d304487 by Victor Stinner in branch 'default':
Issue #13441: Enable the workaround for Solaris locale bug
http://hg.python.org/cpython/rev/7ffe3d304487
msg149086 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-09 10:34
Thanks to my previous commit, I collected the list of locales triggering the mbstowcs() bug:

 * hu_HU (ISO8859-2): character U+30000020
 * de_AT (ISO8859-1): character U+30000076
 * cs_CZ (ISO8859-2): character U+30000020
 * sk_SK (ISO8859-2): character U+30000020
 * pl_PL (ISO8859-2): character U+30000020
 * fr_CA (ISO8859-1): character U+30000020

Hmm, maybe the bug occurs on all locales... I suppose that all "xx_XX" locales use an encoding other than UTF-8 and that the bug is specific to non-UTF-8 encodings.

I don't understand why locale.strxfrm('à') doesn't crash anymore.
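
A skip list built from the locales above could look roughly like this. It is a sketch only, not the code of the workaround that was committed: the names SOLARIS_BROKEN_LOCALES and should_skip_locale are hypothetical, and the list may be incomplete since the bug may affect every non-UTF-8 locale.

```python
import sys

# Locales reported in this issue as triggering the mbstowcs() bug on Solaris.
SOLARIS_BROKEN_LOCALES = frozenset(
    ["hu_HU", "de_AT", "cs_CZ", "sk_SK", "pl_PL", "fr_CA"])


def should_skip_locale(name, platform=sys.platform):
    """Return True if the locale should be skipped on the given platform."""
    # sys.platform starts with "sunos" on Solaris (e.g. "sunos5").
    return platform.startswith("sunos") and name in SOLARIS_BROKEN_LOCALES


print(should_skip_locale("hu_HU", platform="sunos5"))
print(should_skip_locale("hu_HU", platform="linux"))
print(should_skip_locale("en_US", platform="sunos5"))
```

Passing the platform as a parameter keeps the helper testable on machines that are not running Solaris.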
msg149091 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-12-09 13:21
The Solaris buildbot is green, let's close it. I didn't report the bug upstream. Feel free to report it to Oracle!
History
Date                 User          Action  Args
2012-10-17 14:35:41  jcea          set     nosy: + jcea; superseder: test_local.TestEnUSCollection failures on Solaris 10
2011-12-09 13:21:17  vstinner      set     status: open -> closed; resolution: fixed; messages: + msg149091
2011-12-09 10:34:18  vstinner      set     messages: + msg149086
2011-12-09 10:29:16  python-dev    set     messages: + msg149085
2011-12-09 09:28:41  python-dev    set     messages: + msg149081
2011-12-09 00:18:20  python-dev    set     messages: + msg149064
2011-12-08 22:42:10  vstinner      set     messages: + msg149059
2011-12-08 22:41:00  python-dev    set     messages: + msg149058
2011-12-08 13:31:31  skrah         set     messages: + msg149033
2011-12-08 12:16:17  vstinner      set     messages: + msg149022
2011-12-08 10:20:00  skrah         set     nosy: + skrah
2011-12-08 01:23:28  vstinner      set     files: + localeconv_wchar.c; messages: + msg149014
2011-11-22 02:25:52  python-dev    set     messages: + msg148106
2011-11-21 17:04:19  python-dev    set     messages: + msg148061
2011-11-21 15:12:05  pitrou        set     nosy: - pitrou
2011-11-21 15:10:15  vstinner      set     messages: + msg148054
2011-11-21 15:01:15  python-dev    set     messages: + msg148051
2011-11-21 14:44:12  vstinner      set     messages: + msg148048
2011-11-21 14:41:04  python-dev    set     messages: + msg148046
2011-11-21 13:32:32  vstinner      set     messages: + msg148039
2011-11-21 13:32:04  python-dev    set     messages: + msg148038
2011-11-21 12:14:39  ezio.melotti  set     messages: + msg148034
2011-11-21 02:17:06  python-dev    set     messages: + msg148028
2011-11-21 02:09:27  vstinner      set     files: - strxfrm.c
2011-11-21 02:09:16  vstinner      set     files: + strxfrm.c; messages: + msg148027
2011-11-21 02:05:09  vstinner      set     messages: + msg148026
2011-11-21 00:11:53  python-dev    set     nosy: + python-dev; messages: + msg148019
2011-11-20 23:58:15  vstinner      create