classification
Title: PyUnicode_DecodeUTF16(..., byteorder=0) gets it wrong on Mac OS X/PowerPC
Type: behavior
Stage:
Components: Unicode
Versions: Python 2.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: benjamin.peterson, lemburg, loewis, ronaldoussoren, trentm
Priority: normal
Keywords: patch

Created on 2008-10-06 22:27 by trentm, last changed 2008-12-28 15:38 by benjamin.peterson. This issue is now closed.

Files
File name Uploaded Description
issue4060_macosx_endian.patch trentm, 2008-10-06 22:53 patch to fix (as described in original comment)
pymacconfig.h.patch ronaldoussoren, 2008-10-07 06:13 Moving detection of WORDS_BIGENDIAN to pymacconfig.h
pymacconfig.h.patch2 ronaldoussoren, 2008-10-07 12:31
Messages (16)
msg74398 - Author: Trent Mick (trentm) * (Python committer) Date: 2008-10-06 22:27
Revision 63955 removed a block from configure.in (and effectively from
pyconfig.h.in) having to do with endianness that results in an incorrect
setting for "WORDS_BIGENDIAN" in Universal builds on Mac OS X.

The removed part was this:

> AH_VERBATIM([WORDS_BIGENDIAN],
> [
>  /* Define to 1 if your processor stores words with the most significant byte
>     first (like Motorola and SPARC, unlike Intel and VAX).
>
>     The block below does compile-time checking for endianness on platforms
>     that use GCC and therefore allows compiling fat binaries on OSX by using
>     '-arch ppc -arch i386' as the compile flags. The phrasing was choosen
>     such that the configure-result is used on systems that don't use GCC.
>   */
> #ifdef __BIG_ENDIAN__
> #define WORDS_BIGENDIAN 1
> #else
> #ifndef __LITTLE_ENDIAN__
> #undef WORDS_BIGENDIAN
> #endif
> #endif])

This used to allow "WORDS_BIGENDIAN" to be correct for all parts of a
universal Python build done via `gcc -arch i386 -arch ppc ...`.

This was originally added for issue 1471883 (see msg50040 for a
discussion of this particular bit).


The result of this bug is that Python extensions using either of the
following to get native byte ordering for UTF-16 decoding:

   PyUnicode_DecodeUTF16(..., byteorder=0);
   PyUnicode_DecodeUTF16Stateful(..., byteorder=0, ...);

on Mac OS X/PowerPC with a universal build built on Intel hardware (most
such builds) will get the wrong byte-ordering.
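
To make the symptom concrete, here is a minimal sketch of what a wrong byte-order guess does to BOM-less UTF-16 data. It assumes Python 3 rather than the Python 2.6 C API discussed in this report, and the helper name decode_both_orders is made up for illustration. It also relies on the generic "utf-16" codec falling back to the interpreter's native byte order when no BOM is present.

```python
import sys


def decode_both_orders(raw):
    """Decode BOM-less UTF-16 bytes as little-endian and as big-endian."""
    return raw.decode("utf-16-le"), raw.decode("utf-16-be")


# "hi" encoded as little-endian UTF-16, without a BOM.
raw = "hi".encode("utf-16-le")
as_le, as_be = decode_both_orders(raw)

# With no BOM, the generic codec uses whatever byte order the interpreter
# believes is native. A build with a wrong compile-time endianness setting
# would pick the wrong branch here.
native_guess = raw.decode("utf-16")
expected_native = as_le if sys.byteorder == "little" else as_be

print(repr(as_le), repr(as_be), native_guess == expected_native)
```

Decoding the little-endian bytes as big-endian turns "hi" into the code points U+6800 and U+6900, which is the kind of mix-up an extension would see when the native order is misjudged.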


The fix is to restore that section to configure.in and re-run autoconf
and autoheader.


Ronald,

Was there a particular reason that this block was removed from
configure.in (and pyconfig.h.in)?

I'd like to hear comments from either Ronald or Martin, and then I can
commit the fix.
msg74399 - Author: Trent Mick (trentm) * (Python committer) Date: 2008-10-06 22:31
This also shows up in the byte ordering that Python uses to encode utf-16:

$ uname -a
Darwin sphinx 8.11.0 Darwin Kernel Version 8.11.0: Wed Oct 10 18:26:00 PDT 2007; root:xnu-792.24.17~1/RELEASE_PPC Power Macintosh powerpc

$ python2.6 -c "import codecs; codecs.open('26.txt', 'w', 'utf-16').write('hi')"
$ od -cx 26.txt
0000000  377 376   h  \0   i  \0
             fffe    6800    6900
0000006

$ /usr/bin/python -c "import codecs; codecs.open('system.txt', 'w', 'utf-16').write('hi')"
$ od -cx system.txt
0000000  376 377  \0   h  \0   i
             feff    0068    0069
0000006


The BOM here ensures, of course, that this is still valid UTF-16
content, but the difference in behaviour between Python versions might
not be intended.
msg74400 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-10-06 22:40
Does this also affect sys.byteorder and the struct module ?

I think those would be more important to get right than the UTF-16
codec, since this only uses the native byte ordering for increased
performance and compatibility with other OS tools. Since UTF-16 is not
wide-spread on Mac OS X, it's not so much an issue... it would be on
Windows.
msg74402 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-10-06 22:47
BTW: Does this simplified approach really work for Python on Mac OS X:

On 2008-10-07 00:27, Trent Mick wrote:
>>     The block below does compile-time checking for endianness on platforms
>>     that use GCC and therefore allows compiling fat binaries on OSX by using
>>     '-arch ppc -arch i386' as the compile flags. The phrasing was choosen
>>     such that the configure-result is used on systems that don't use GCC.

For most other tools that require configure tests regarding endianness
on Mac OS X, the process of building a universal binary goes something
like this:

http://developer.apple.com/opensource/buildingopensourceuniversal.html

ie. you run the whole process twice and then combine the results using
lipo.
msg74404 - Author: Trent Mick (trentm) * (Python committer) Date: 2008-10-06 22:49
> Does this also affect sys.byteorder and the struct module ?

Doesn't seem to affect sys.byteorder:

$ /usr/bin/python -c "import sys; print sys.byteorder"
big
$ python2.6 -c "import sys; print sys.byteorder"
big


> I think those would be more important to get right than the UTF-16
> codec, since this only uses the native byte ordering for increased
> performance and compatibility with other OS tools. Since UTF-16 is not
> wide-spread on Mac OS X, it's not so much an issue...

It is an issue for Python extensions that use that API. For example, it
is the cause of recent Komodo builds not starting on Mac OS X/PowerPC
(http://bugs.activestate.com/show_bug.cgi?id=79366) because the PyXPCOM
extension and embedded Python 2.6 build were getting UTF-16 data mixed up
when talking with Mozilla APIs.

> ... it would be on Windows.
msg74406 - Author: Trent Mick (trentm) * (Python committer) Date: 2008-10-06 22:52
> BTW: Does this simplified approach really work for Python on Mac OS X

It works for Python 2.5:

http://svn.python.org/view/*checkout*/python/branches/release25-maint/configure.in?rev=66299

search for "BIGENDIAN".
msg74407 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-10-06 22:59
On 2008-10-07 00:52, Trent Mick wrote:
> Trent Mick <trentm@gmail.com> added the comment:
> 
>> BTW: Does this simplified approach really work for Python on Mac OS X
> 
> It works for Python 2.5:
> 
> http://svn.python.org/view/*checkout*/python/branches/release25-maint/configure.in?rev=66299
> 
> search for "BIGENDIAN".

Thanks... didn't see that the setting enables a compile-time check.
msg74424 - Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2008-10-07 06:13
The issue was introduced while moving universal-binary specific trickery 
from pyconfig.h.in to a separate header file. Obviously I must have been 
drunk at the time, because I didn't move the WORDS_BIGENDIAN bits 
correctly.

The attached patch in "pymacconfig.h.patch" adds detection of 
WORDS_BIGENDIAN to pymacconfig.h, the header that also holds the other 
pyconfig.h overrides for universal builds.

Background: this work was done while adding support for 4-way universal 
builds, that is x86, x86_64, ppc and ppc64. This required many more 
updates to pyconfig.h, most of which couldn't be done in a clean 
platform-independent way. That's why I (tried to) move the setting of 
pyconfig.h values that are affected by the current architecture to 
Include/pymacconfig.h.

NOTE: I haven't tested my patch yet; I'll do a full test round later 
today.
msg74425 - Author: Trent Mick (trentm) * (Python committer) Date: 2008-10-07 07:06
> Added file: http://bugs.python.org/file11723/pymacconfig.h.patch

I'll test that on my end tomorrow -- though it looks like it will work fine.
Thanks.
msg74442 - Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2008-10-07 12:31
Annoyingly enough, my patch isn't good enough: it turns out that ctypes 
has introduced a SIZEOF__BOOL definition in configure.in, and that needs 
special-casing as well.

pymacconfig.h.patch2 fixes that issue as well. Do you have access to a 
PPC G5 system? I've determined the correct value of SIZEOF__BOOL for 
that platform by reading the assembly code for a small test program, and 
hence am not 100% sure that sizeof(_Bool) actually is 1 on that 
architecture.

One other annoying issue cropped up: regrtest.py consistently hangs in 
test_signal (with 100% CPU usage) when I run it in Rosetta (the PPC 
emulator). I'll test this on an actual PPC machine as well; this might 
well be an issue with the PPC emulator.
msg74448 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-10-07 13:37
I agree with Trent that this is a bug, and I agree with the second patch
(pymacconfig.h.patch2).

Marc-Andre, sys.byteorder is not affected because it is detected
at run-time, not at compile-time. Likewise, in the struct module,
several code paths rely on dynamic determination of the endianness, such
as _PyLong_FromByteArray, the float packing, and the whichtable function.
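
As an illustration of run-time detection, here is a minimal Python 3 sketch. It is not how CPython itself computes sys.byteorder, and the function name runtime_byteorder is hypothetical. It inspects how the running process lays out the integer 1 in memory, so the answer comes from the architecture actually executing the code rather than from a value fixed at configure time.

```python
import struct
import sys


def runtime_byteorder():
    """Detect byte order at run time from the layout of the integer 1."""
    # "=H" packs an unsigned 16-bit value in native byte order, standard size.
    first_byte = struct.pack("=H", 1)[0]
    return "little" if first_byte == 1 else "big"


print(runtime_byteorder(), sys.byteorder)
```

Because nothing here depends on a preprocessor macro, a check of this kind gives the right answer in each slice of a universal binary, which matches the observation that sys.byteorder was unaffected.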
msg74459 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-10-07 15:23
On 2008-10-07 14:33, Ronald Oussoren wrote:
> Ronald Oussoren <ronaldoussoren@mac.com> added the comment:
> 
> Annoyingly enough my patch isn't good enough, it turns out that ctypes 
> has introduced a SIZEOF__BOOL definition in configure.in and that needs 
> special caseing as well.
> 
> pymacconfig.h.patch2 fixes that issue as well. Do you have access to a 
> PPC G5 system? I've determined the correct value of SIZEOF__BOOL for 
> that platform by reading the assembly code for a small test program and 
> hence am not 100% sure that sizeof(_Bool) actually is 1 on that 
> architecture.

Using this helper:

#include <stdio.h>
int main(void) {
    /* sizeof yields size_t; cast so the argument matches %d */
    printf("sizeof(_Bool)=%d bytes\n", (int)sizeof(_Bool));
    return 0;
}

I get:

sizeof(_Bool)=4 bytes

on a G4 PPC.

Seems strange to me, but reasonable since it is defined like this
in stdbool.h:

#if __STDC_VERSION__ < 199901L && __GNUC__ < 3
typedef int     _Bool;
#endif
msg74463 - Author: Trent Mick (trentm) * (Python committer) Date: 2008-10-07 16:29
> I get:
>
> sizeof(_Bool)=4 bytes
>
> on a G4 PPC.

Same thing on a G5 PPC:

$ cat main.c
#include <stdio.h>

int main(void) {
    /* sizeof yields size_t; cast so the argument matches %d */
    printf("sizeof(_Bool) is %d\n", (int)sizeof(_Bool));
    return 0;
}
$ gcc main.c
$ ./a.out
sizeof(_Bool) is 4
msg74474 - Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2008-10-07 19:54
On 7 Oct, 2008, at 18:29, Trent Mick wrote:

>
> Trent Mick <trentm@gmail.com> added the comment:
>
>> I get:
>>
>> sizeof(_Bool)=4 bytes
>>
>> on a G4 PPC.
>
> Same thing on a G5 PPC:
>
> $ cat main.c
> #include <stdio.h>
>
> int main(void) {
>    printf("sizeof(_Bool) is %d\n", sizeof(_Bool));
> }
> $ gcc main.c

What if you compile using 'gcc -arch ppc64 main.c'?

Ronald
msg74494 - Author: Trent Mick (trentm) * (Python committer) Date: 2008-10-07 23:06
> What if you compile using 'gcc -arch ppc64 main.c'?

$ gcc -arch ppc64 main.c
$ ./a.out
sizeof(_Bool) is 1

As you figured out.
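
The same per-architecture difference can probably be observed from Python itself. The sketch below assumes Python 3 with ctypes available, and the function name bool_size is hypothetical. ctypes.c_bool represents the C99 _Bool type, so its reported size should reflect whichever architecture slice of a universal build is currently running.

```python
import ctypes
import platform


def bool_size():
    """Size in bytes of C _Bool as seen by the running interpreter."""
    return ctypes.sizeof(ctypes.c_bool)


print(platform.machine(), bool_size())
```

No expected output is given because the value depends on the compiler and architecture, as the G4, G5 and ppc64 results above show.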
msg78412 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-12-28 15:38
Applied the patch in r67982.
History
Date User Action Args
2008-12-28 15:38:00 benjamin.peterson set status: open -> closed; nosy: + benjamin.peterson; resolution: fixed; messages: + msg78412
2008-10-07 23:06:10 trentm set messages: + msg74494
2008-10-07 19:54:17 ronaldoussoren set messages: + msg74474
2008-10-07 16:29:01 trentm set messages: + msg74463
2008-10-07 15:23:49 lemburg set messages: + msg74459
2008-10-07 13:37:39 loewis set messages: + msg74448
2008-10-07 12:31:54 ronaldoussoren set files: + pymacconfig.h.patch2; messages: + msg74442
2008-10-07 07:06:41 trentm set messages: + msg74425
2008-10-07 06:13:04 ronaldoussoren set files: + pymacconfig.h.patch; messages: + msg74424
2008-10-06 22:59:05 lemburg set messages: + msg74407
2008-10-06 22:53:22 trentm set keywords: + patch; files: + issue4060_macosx_endian.patch
2008-10-06 22:52:49 trentm set messages: + msg74406
2008-10-06 22:49:39 trentm set messages: + msg74404
2008-10-06 22:47:14 lemburg set messages: + msg74402
2008-10-06 22:40:34 lemburg set nosy: + lemburg; messages: + msg74400
2008-10-06 22:31:28 trentm set messages: + msg74399
2008-10-06 22:27:09 trentm create