classification
Title: test test_codecs failed
Type: behavior
Stage: test needed
Components: Extension Modules, Unicode
Versions: Python 2.6
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder: remove --with-wctype-functions configure option (issue 9210)
Assigned To:
Nosy List: BreamoreBoy, amaury.forgeotdarc, ezio.melotti, lemburg, nijel, pierre42, pitrou
Priority: normal
Keywords: patch

Created on 2004-12-01 14:41 by nijel, last changed 2010-07-13 09:59 by lemburg. This issue is now closed.

Files
File name Uploaded Description Edit
Python-2.4-wctype.patch nijel, 2004-12-03 11:46
Messages (31)
msg23431 - (view) Author: Michal Čihař (nijel) Date: 2004-12-01 14:41
test test_codecs failed -- Traceback (most recent call last):
  File "/usr/src/packages/BUILD/Python-2.4/Lib/test/test_codecs.py", line 446, in test_nameprep
    raise test_support.TestFailed("Test 3.%d: %s" % (pos+1, str(e)))
TestFailed: Test 3.5: u'\u0143 \u03b9' != u'\u0144 \u03b9'
msg23432 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-01 14:53
Logged In: YES 
user_id=38388

Please make sure that Python is picking up the correct modules.
You can do so by running Python in verbose mode (python -vv).
msg23433 - (view) Author: Michal Čihař (nijel) Date: 2004-12-01 14:59
Logged In: YES 
user_id=192186

It's a clean build root with no other Python, so it has no
chance to pick up bad modules.
msg23434 - (view) Author: Michal Čihař (nijel) Date: 2004-12-01 15:26
System information:
i386
kernel 2.6.8
glibc 2.3.3
msg23435 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-01 16:15
The tests pass just fine on my machine. 

Is it possible that your compiler is broken ? 
gcc 2.3.3 is *very* old !
msg23436 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-01 16:16
Sorry: I misread glibc as gcc. Still, this sounds a lot like
a broken compiler.

BTW, are you building a UCS4 version ?
msg23437 - (view) Author: Michal Čihař (nijel) Date: 2004-12-01 16:21
gcc (GCC) 3.3.4 (pre 3.3.5 20040809)

Yes, I'm building UCS4 version.
msg23438 - (view) Author: Michal Čihař (nijel) Date: 2004-12-01 16:32
The problem seems to be in glibc: when I remove
--with-wctype-functions, the test passes. Or could it be in
Python's interface to the wctype functions?
msg23439 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-01 17:20
Ah, now I understand: it is quite possible that the Unicode
database versions differ. Python uses version 3.2.

Do you know which version glibc 2.3.3 uses ?

Note that for portability it is usually better not to use wctype
functions.
msg23440 - (view) Author: Michal Čihař (nijel) Date: 2004-12-01 17:29
I'm not sure what "uses" means here, but I found several
mentions of Unicode 3.2 in the code and in the changelogs.
msg23441 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-01 18:33
The wctype functions must have been built using tables from 
the Unicode code point database. Python's own APIs for this
were built using the Unicode DB 3.2. My question is whether
you know which version the glibc was built from.

It is not surprising that the two tests fail if the underlying 
Unicode DB versions differ.
msg23442 - (view) Author: Michal Čihař (nijel) Date: 2004-12-01 18:37
I understand the question, but I have no idea how to find
this information inside glibc.
msg23443 - (view) Author: Pierre (pierre42) Date: 2004-12-01 21:30
Logged In: YES 
user_id=512388

I have the same problem
msg23444 - (view) Author: Michal Čihař (nijel) Date: 2004-12-02 11:07
Well, glibc 2.3.3 is reportedly using Unicode DB 3.2, so
there must be a bug either in it or in Python; I can't tell
which. Any idea how to find out?
msg23445 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-02 15:40
Do you get the same error when compiling without
--with-wctype-functions ?

If not, then we'll just have to close this report as "won't
fix" - the reason is that we as Python developers don't have
control over what glibc does or does not do.

Unfortunately, there's no way to disable the failing tests
since the configure option is not available to the Python
program.
msg23446 - (view) Author: Michal Čihař (nijel) Date: 2004-12-02 16:03
Compiling without --with-wctype-functions "fixes" this problem.

I still don't see what the wctype functions have to do with
this. They are used for operations like "is this numeric,
alphanumeric, upper, ...". I'd like to trace whether this bug
is in Python or in glibc, but I still don't know which glibc
functions influence this test.
msg23447 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-02 16:23
The punycode codec uses the .upper() method on Unicode objects.

Since this method uses Py_UNICODE_TOUPPER(), any difference
in case mapping between the Unicode DB used in Python and the
one used in glibc will become noticeable when building with
--with-wctype-functions.
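
To illustrate the point about case mapping, here is a minimal Python 3 sketch (the thread itself uses Python 2) that reports the mappings the interpreter's built-in Unicode database gives for the two code points in the failing test. It assumes a build that uses Python's own database rather than the wctype functions; `show_case_mapping` is a hypothetical helper name, not part of any Python API.

```python
import unicodedata


def show_case_mapping(ch):
    """Return (name, upper, lower) for one character, using Python's built-in tables."""
    return unicodedata.name(ch), ch.upper(), ch.lower()


if __name__ == "__main__":
    # Version of the Unicode database bundled with this interpreter.
    print("Unicode DB version:", unicodedata.unidata_version)
    for ch in ("\u0143", "\u0144"):
        name, upper, lower = show_case_mapping(ch)
        print("U+%04X %s: upper=U+%04X lower=U+%04X"
              % (ord(ch), name, ord(upper), ord(lower)))
```

If a build that delegates to the C library reports different values for the same code points, the difference comes from the C library side rather than from Python's tables.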
msg23448 - (view) Author: Michal Čihař (nijel) Date: 2004-12-02 16:38
I tried the towlower and towupper functions for all characters
in the failed test and I can see no difference compared to the
Python ones...
msg23449 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-03 10:43
Maybe you should add some hooks to the Py_UNICODE_* macros and
recompile (or run the script in a C debugger).

The difference in output is minimal (\u0143 vs. \u0144) which I
believe hints at a change in the used Unicode DB:

0143;LATIN CAPITAL LETTER N WITH ACUTE;Lu;0;L;004E 0301;;;;N;LATIN CAPITAL LETTER N ACUTE;;;0144;
0144;LATIN SMALL LETTER N WITH ACUTE;Ll;0;L;006E 0301;;;;N;LATIN SMALL LETTER N ACUTE;;0143;;0143

The only difference here is the case.
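
As a rough illustration of how to read the two database records quoted above, the following Python 3 sketch splits each record on `;` and pulls out the simple uppercase, lowercase and titlecase mappings. It assumes the standard UnicodeData.txt field order (fields 12, 13 and 14, counting from 0); `parse_case_fields` is a hypothetical helper name.

```python
# The two UnicodeData.txt records quoted in the message above.
RECORDS = [
    "0143;LATIN CAPITAL LETTER N WITH ACUTE;Lu;0;L;004E 0301;;;;N;"
    "LATIN CAPITAL LETTER N ACUTE;;;0144;",
    "0144;LATIN SMALL LETTER N WITH ACUTE;Ll;0;L;006E 0301;;;;N;"
    "LATIN SMALL LETTER N ACUTE;;0143;;0143",
]


def parse_case_fields(record):
    """Extract the code point, category and simple case mappings from one record."""
    fields = record.split(";")

    def code_point(text):
        # An empty field means "no mapping"; otherwise it holds a hex code point.
        return int(text, 16) if text else None

    return {
        "code": code_point(fields[0]),
        "category": fields[2],
        "upper": code_point(fields[12]),
        "lower": code_point(fields[13]),
        "title": code_point(fields[14]),
    }


if __name__ == "__main__":
    for record in RECORDS:
        print(parse_case_fields(record))
```

Read this way, U+0143 lowercases to U+0144 and U+0144 uppercases to U+0143, which are exactly the two values on either side of the failing comparison.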
msg23450 - (view) Author: Michal Čihař (nijel) Date: 2004-12-03 11:03
However, when I write a simple C program containing:

    s = 0x143;
    printf("%lc %lc %lc\n", s, towupper(s), towlower(s));
    s = 0x144;
    printf("%lc %lc %lc\n", s, towupper(s), towlower(s));

I get the expected results, and they're the same as from this Python code:

s = u'\u0143'
print '%s %s %s' % (s, s.upper(), s.lower())
s = u'\u0144'
print '%s %s %s' % (s, s.upper(), s.lower())

I'm starting to think that it might be something with
locales, I'll investigate it more.
msg23451 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-03 11:19
Could you run this test (comparing lower and upper) for all
code points in the range(sys.maxunicode) ?!

The origin of the problem could be a different code point.

I don't think that it has to do with locale (but you never
know...), since Unicode is all about unifying locales. The C
functions should not be locale aware (even though the man
page says it depends on LC_CTYPE).
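
A scan like the one requested here could be sketched as follows in Python 3. This is only an outline under stated assumptions: `find_case_mismatches` and `libc_mappers` are hypothetical names, `libc_mappers` assumes a Unix-like system whose C library can be loaded through `ctypes` and whose `wint_t` is a 32-bit unsigned integer, and the results will depend on the active locale and on the Unicode versions on each side.

```python
import ctypes
import ctypes.util
import sys


def find_case_mismatches(other_upper, other_lower, code_points=None):
    """Yield (code_point, kind, python_result, other_result) where the mappings differ.

    other_upper and other_lower take and return integer code points.
    """
    if code_points is None:
        code_points = range(sys.maxunicode + 1)
    for cp in code_points:
        ch = chr(cp)
        candidates = (
            ("upper", ch.upper(), other_upper(cp)),
            ("lower", ch.lower(), other_lower(cp)),
        )
        for kind, py_result, other_result in candidates:
            # str.upper()/str.lower() can expand to several characters;
            # a one-to-one C mapping cannot express that, so skip those.
            if len(py_result) != 1:
                continue
            if ord(py_result) != other_result:
                yield cp, kind, ord(py_result), other_result


def libc_mappers():
    """Return (towupper, towlower) from the C library.

    Assumption: Unix-like platform, wint_t compatible with unsigned int.
    """
    libc = ctypes.CDLL(ctypes.util.find_library("c"))
    for fn in (libc.towupper, libc.towlower):
        fn.argtypes = [ctypes.c_uint]
        fn.restype = ctypes.c_uint
    return libc.towupper, libc.towlower
```

On a suitable system, `list(find_case_mismatches(*libc_mappers()))` would list every code point where the two sides disagree. Passing any other pair of mapping functions makes it possible to try the comparison without touching the C library.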
msg23452 - (view) Author: Michal Čihař (nijel) Date: 2004-12-03 11:46
Okay, it IS a locale problem. You should trust the man page :-):
calling towupper/towlower without a locale set (or with the
POSIX locale) gives the wrong result. After applying the
attached patch, all problems in the tests are gone.
msg23453 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-03 12:16
Thanks for the patch. I see a few problems with this
approach, though:

* We break binary compatibility depending on the configure
settings used for building Python; if this is really
necessary we should place the changes into the
_PyUnicode_ToLowerCase() et al. APIs defined in unicodectype.c

* I'm not sure whether there is any performance or memory
usage win in using the wctype functions from glibc: the
Unicode type mapping DB table has to be included anyway (due
to the title case mapping), so the only win I could see is a
performance one and given that towlower et al. do seem to be
locale aware I have strong doubts that these functions are
actually faster than the lookup in our own database.

Could you check whether using the wctype functions from
glibc does have any effect on size of the interpreter and
performance of e.g. .lower() and .upper() ?

If not, I'm inclined to remove the wctype function support
altogether.
msg23454 - (view) Author: Michal Čihař (nijel) Date: 2004-12-03 13:13
After talking to a glibc developer: towlower/towupper will never
work as expected with the POSIX/C locales (because anything
besides a-z is not an alpha character for these).

I can give some performance results, but even without tests,
it looks to me like a good idea to drop support for this.
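
The behaviour described by the glibc developer can be modelled in a few lines of Python 3. This is a simplified model for illustration, not glibc's code: `posix_towlower` is a hypothetical stand-in that maps only ASCII letters, and `unicode_towlower` stands in for a lookup in Python's own database.

```python
def posix_towlower(cp):
    """Model of towlower() in the POSIX/C locale: only A-Z is mapped."""
    return cp + 32 if 0x41 <= cp <= 0x5A else cp


def unicode_towlower(cp):
    """One-to-one lowercase mapping taken from Python's built-in database."""
    lowered = chr(cp).lower()
    # Keep the code point unchanged if the mapping is not one-to-one.
    return ord(lowered) if len(lowered) == 1 else cp


def lower_with(mapper, text):
    """Lowercase text one code point at a time using the given mapper."""
    return "".join(chr(mapper(ord(ch))) for ch in text)


if __name__ == "__main__":
    text = "\u0143 \u03b9"
    print(ascii(lower_with(posix_towlower, text)))
    print(ascii(lower_with(unicode_towlower, text)))
```

Under this model the ASCII-only mapper leaves U+0143 untouched, while the database lookup turns it into U+0144, which matches the two sides of the `u'\u0143 \u03b9' != u'\u0144 \u03b9'` comparison in the original traceback.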

msg23455 - (view) Author: Michal Čihař (nijel) Date: 2004-12-03 13:26
without wctype: 100x test_codecs: 10.209s, libpython size: 1140098
with wctype: 100x test_codecs: 10.120s (removed one failing test), libpython size: 1140314
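
To put the two measurements side by side, here is a small Python 3 calculation using only the figures reported above. The dictionary and function names are hypothetical, and the timing comparison is only indicative because the wctype run had one failing test removed.

```python
# Figures copied from the report above.
without_wctype = {"seconds": 10.209, "libpython_bytes": 1140098}
with_wctype = {"seconds": 10.120, "libpython_bytes": 1140314}


def compare(baseline, other):
    """Return (size delta in bytes, time delta in seconds, time delta in percent)."""
    size_delta = other["libpython_bytes"] - baseline["libpython_bytes"]
    time_delta = other["seconds"] - baseline["seconds"]
    time_pct = 100.0 * time_delta / baseline["seconds"]
    return size_delta, time_delta, time_pct


if __name__ == "__main__":
    size_delta, time_delta, time_pct = compare(without_wctype, with_wctype)
    print("size delta: %+d bytes" % size_delta)
    print("time delta: %+.3f s (%+.2f%%)" % (time_delta, time_pct))
```

The wctype build is 216 bytes larger, and the timing difference is under one percent of the total run time.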
msg23456 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2004-12-03 14:50
Thanks for the tests. 

Looks to me as if the trouble of keeping the wctype support
and working around quirks with the locales is not worth it.

I think it's better to remove the support altogether and
stick with the builtin type database.
msg23457 - (view) Author: Michal Čihař (nijel) Date: 2004-12-03 14:55
I agree with removing it; however, I'm not the one who can
decide :-)
msg110029 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2010-07-11 17:36
Is there any point in leaving this open as the patch is against Python 2.4?  How many times has test_codecs been successfully run against various Python versions since the patch was produced 6 1/2 years ago?
msg110050 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-07-11 21:26
The OP compiled python with --with-wctype-functions, and the libc wctype functions work differently depending on the locale.
I suggest closing this issue as "won't fix", and favor the removal of the "--with-wctype-functions" option proposed in issue9210.
msg110102 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-07-12 15:51
Amaury's suggestion sounds good to me.
msg110167 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-07-13 09:59
Amaury Forgeot d'Arc wrote:
> 
> Amaury Forgeot d'Arc <amauryfa@gmail.com> added the comment:
> 
> The OP compiled python with --with-wctype-functions, and the libc wctype functions work differently depending on the locale.
> I suggest closing this issue as "won't fix", and favor the removal of the "--with-wctype-functions" option proposed in issue9210.

+1
History
Date User Action Args
2010-07-13 09:59:14 lemburg set messages: + msg110167
2010-07-12 15:51:03 pitrou set status: pending -> closed; nosy: + pitrou; messages: + msg110102; superseder: remove --with-wctype-functions configure option
2010-07-11 21:26:17 amaury.forgeotdarc set status: open -> pending; nosy: + amaury.forgeotdarc; messages: + msg110050; resolution: wont fix
2010-07-11 17:36:26 BreamoreBoy set nosy: + BreamoreBoy; messages: + msg110029
2010-07-11 17:19:12 ezio.melotti set nosy: + ezio.melotti
2009-02-15 22:34:33 ajaksu2 set stage: test needed; type: behavior; components: + Extension Modules, Unicode, - Library (Lib); versions: + Python 2.6, - Python 2.4
2007-09-20 04:55:22 brett.cannon set keywords: + patch
2004-12-01 14:41:24 nijel create