classification
Title: sys.sizeof test fails with wide unicode
Type: behavior
Stage:
Components: Interpreter Core
Versions: Python 3.0, Python 2.6
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: schuppenies
Nosy List: amaury.forgeotdarc, benjamin.peterson, georg.brandl, lemburg, pitrou, schuppenies
Priority: critical
Keywords: patch

Created on 2008-06-12 22:13 by benjamin.peterson, last changed 2008-06-17 10:34 by schuppenies. This issue is now closed.

Files
File name Uploaded Description
maxunicode.patch schuppenies, 2008-06-13 19:41 Patch against 2.6 trunk, revision 64230
Py_UNICODE.patch schuppenies, 2008-06-13 19:42 Patch against 2.6 trunk, revision 64230
Py_UNICODE_SIZEOF.patch schuppenies, 2008-06-15 16:45 Patch against 2.6 trunk, revision 64296
Messages (23)
msg68102 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-12 22:13
test test_sys failed -- Traceback (most recent call last):
  File "/temp/python/trunk/Lib/test/test_sys.py", line 549, in test_specialtypes
    size2=basicsize + sys.getsizeof(str(s)))
  File "/temp/python/trunk/Lib/test/test_sys.py", line 429, in check_sizeof
    self.assertEqual(result, size2, msg + str(size2))
AssertionError: wrong size for <type 'unicode'>: got 28, expected 50.5109328552
msg68104 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-12 22:19
It was recommended by Georg that you expose Py_UNICODE_SIZE in the
_testcapi, since the size is not consistent across all platforms.
msg68138 - (view) Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-13 09:04
Are there any buildbots running with the "--enable-unicode=ucs4" option?
Just curious.
msg68141 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-06-13 09:21
I'm sure there weren't any a few months ago.
msg68159 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-13 13:59
Do you really need to expose Py_UNICODE_SIZE? There is already
sys.maxunicode, unless I'm missing something.
msg68160 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-06-13 14:09
It is true that sys.maxunicode reflects whether the build is using UCS-2
or UCS-4; however, the size of Py_UNICODE is not fixed by that (see
unicodeobject.h).

(Though I don't think we have platforms that actually *do* use sizes
other than 2 or 4, so we can of course be sloppy.)
msg68177 - (view) Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-13 19:42
sys.maxunicode is well defined to be either 0xFFFF for UCS-2
or 0x10FFFF for UCS-4 (see PyUnicode_GetMax).

Py_UNICODE_SIZE is set in pyconfig.h to be either 2 or 4 during
configuration. When >= 4, Py_UNICODE_WIDE is set which again influences
sys.maxunicode.

Thus, it currently is possible to derive Py_UNICODE_SIZE from
sys.maxunicode. But it takes some indirections.

So here are 2 possible patches, one which exposes Py_UNICODE_SIZE via
_testcapi and one which assumes that sys.maxunicode reflects UCS-X
settings. Since I am a fairly new Python developer, and given the new
4-eyes-per-commit policy for the beta phase, please decide which patch
should be applied.
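
The indirection described above can be sketched in Python. This is only an illustration, not the content of either attached patch: the helper name unicode_size_from_maxunicode is hypothetical, and it assumes, as this message argues, that the code unit is exactly 2 or 4 bytes wide.

```python
import sys


def unicode_size_from_maxunicode(maxunicode=sys.maxunicode):
    """Guess the width of Py_UNICODE in bytes from sys.maxunicode.

    Assumption: a narrow (UCS-2) build reports 0xFFFF, and any larger
    value means a wide (UCS-4) build.
    """
    return 2 if maxunicode <= 0xFFFF else 4


print(unicode_size_from_maxunicode())
```

Comparing against 0xFFFF rather than the exact UCS-4 maximum keeps the check independent of the precise upper bound. On Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the helper reports 4 there regardless of the internal string representation.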
msg68178 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-13 19:50
Personally, I prefer the one with _testcapi.Py_UNICODE_SIZE because it
is safe against future changes, but wait for someone else's opinion.
msg68179 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-13 19:51
It's actually very easy:

Py_UNICODE is a 2-byte value for UCS-2 builds and a 4-byte value for
UCS-4 builds of Python.

import sys
print ((sys.maxunicode < 66000) and 'UCS2' or 'UCS4')

tells you which one you have.

Note that you should *not* use the exact value of 0x10FFFF for UCS-4 -
it's possible that the Unicode consortium decides to add more planes to
the Universal Character Set... (though not likely).

The above comparison is good enough to detect the number of bytes in a
single code point, though.
msg68180 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-13 19:54
BTW: Here's another trick you can use:

print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))

(for Py2.x)
msg68181 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-13 19:56
Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
up being more than 4 if the native int type is itself larger than 32
bits; although the latter is probably quite rare (64-bit platforms are
usually either LP64 or LLP64).

However, Py_UNICODE.patch is wrong in that it uses Py_UNICODE_SIZE
rather than sizeof(Py_UNICODE). Py_UNICODE_SIZE itself is always either
2 or 4.
msg68182 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-13 20:18
On 2008-06-13 21:56, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
> up being more than 4 if the native int type is itself larger than 32
> bits; although the latter is probably quite rare (64-bit platforms are
> usually either LP64 or LLP64).

AFAIK, only Crays have this problem, but apart from that: I'd consider
it a bug if sizeof(Py_UCS4) != 4.
msg68183 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-13 20:32
On Friday 13 June 2008 at 20:18 +0000, Marc-Andre Lemburg wrote:
> AFAIK, only Crays have this problem, but apart from that: I'd consider
> it a bug if sizeof(Py_UCS4) != 4.

Perhaps a #error can be added to that effect?
Something like (untested):

#if SIZEOF_INT == 4 
typedef unsigned int Py_UCS4; 
#elif SIZEOF_LONG == 4
typedef unsigned long Py_UCS4; 
#else
#error Could not find a 4-byte integer type for Py_UCS4, aborting
#endif

(of course we could also try harder to find an appropriate type, but I'm
no specialist in C integer variations)
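
The selection logic of the preprocessor chain above can be mirrored in Python with ctypes, purely as an illustration of the idea. The function name find_four_byte_unsigned is hypothetical, and ctypes reports the sizes of the C compiler that built the running interpreter, so this says nothing about other platforms.

```python
import ctypes


def find_four_byte_unsigned():
    """Return the first native unsigned type that is exactly 4 bytes wide.

    Mirrors the #if / #elif / #error chain: try unsigned int, then
    unsigned long, and fail loudly if neither fits.
    """
    for candidate in (ctypes.c_uint, ctypes.c_ulong):
        if ctypes.sizeof(candidate) == 4:
            return candidate
    raise RuntimeError("Could not find a 4-byte integer type")


chosen = find_four_byte_unsigned()
print(chosen.__name__, ctypes.sizeof(chosen))
```

Failing with an error, instead of silently settling on a wider type, is the point of the proposed #error: a platform where neither type is 4 bytes wide would be reported at build time.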
msg68184 - (view) Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-13 21:01
I think you're right that sizeof(Py_UNICODE) is the correct value to
use. But could you please explain to me how PY_UNICODE_TYPE is set? I
cannot find it.

Also, len(u'\0'.encode('unicode-internal')) does not work for Py3.0.
Any suggestions on how this information can be retrieved in py3k?
msg68185 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-06-13 21:21
I believe PY_UNICODE_TYPE is set by configure in pyconfig.h.
msg68186 - (view) Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-13 21:59
Found it, thanks. Wrong use of grep :|
msg68231 - (view) Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-15 13:18
If I understand configure correctly, PY_UNICODE_TYPE is only set when
a type matching the size of $unicode_size is found. And this is set to
either 2 or 4. Thus, sizeof(Py_UNICODE) should always return 2 or 4.
If you agree, I would suggest using the method proposed by Marc in
msg68179.
msg68234 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-06-15 13:39
On Sunday 15 June 2008 at 13:18 +0000, Robert Schuppenies wrote:
> If I understand configure correctly, PY_UNICODE_TYPE is only set when
> a type matching the size of $unicode_size is found. And this is set to
> either 2 or 4.

But if PY_UNICODE_TYPE is not set in configure, unicodeobject.h tries to
settle on a default value. Which turns out to be Py_UCS4 in UCS4 builds:
http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l86

And Py_UCS4 itself will be larger than 4 bytes if the platform's int
size is larger than that:
http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l119

So if you want to be 100% correct, you should use
sizeof(PY_UNICODE_TYPE) (or sizeof(Py_UNICODE), which is the same). If
you don't want to, sys.maxunicode is sufficient :-)
msg68242 - (view) Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-15 16:45
Correct is good, so here is a patch which exposes the size of
Py_UNICODE via _testcapi.
msg68251 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2008-06-15 20:49
Looks good to me.
msg68265 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-16 09:57
On 2008-06-13 22:32, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> On Friday 13 June 2008 at 20:18 +0000, Marc-Andre Lemburg wrote:
>> AFAIK, only Crays have this problem, but apart from that: I'd consider
>> it a bug if sizeof(Py_UCS4) != 4.
> 
> Perhaps a #error can be added to that effect?
> Something like (untested):
> 
> #if SIZEOF_INT == 4 
> typedef unsigned int Py_UCS4; 
> #elif SIZEOF_LONG == 4
> typedef unsigned long Py_UCS4; 
> #else
> #error Could not find a 4-byte integer type for Py_UCS4, aborting
> #endif

Sounds good !

> (of course we could also try harder to find an appropriate type, but I'm
> no specialist in C integer variations)

Python should really try to use uint32_t as fallback solution for
UCS4 where available (and uint16_t for UCS2).

We'd have to add an AC_TYPE_INT32_T and AC_TYPE_INT16_T check to
configure:

http://www.gnu.org/software/autoconf/manual/html_node/Particular-Types.html#Particular-Types

and could then use

typedef uint32_t Py_UCS4;

and

typedef uint16_t Py_UCS2;

Note that the code for supporting UCS2/UCS4 is not really all that
clean. It was a quick sprint between Martin and Fredrik and appears
to be only half-done... e.g. there currently is no Py_UCS2.
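
The appeal of fixed-width types can be seen from Python through ctypes, whose c_uint16 and c_uint32 play a role comparable to uint16_t and uint32_t. This is only an analogy for the idea, not the configure check suggested above, and the variable names are illustrative.

```python
import ctypes

# Fixed-width types: the size is part of the name on every platform.
ucs2_size = ctypes.sizeof(ctypes.c_uint16)
ucs4_size = ctypes.sizeof(ctypes.c_uint32)
print("c_uint16:", ucs2_size, "c_uint32:", ucs4_size)

# Native types: the size depends on the platform's data model,
# which is why a typedef to unsigned int or unsigned long needs a check.
print("c_uint:", ctypes.sizeof(ctypes.c_uint))
print("c_ulong:", ctypes.sizeof(ctypes.c_ulong))
```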
msg68271 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-06-16 16:21
On 2008-06-13 21:54, Marc-Andre Lemburg wrote:
> BTW: Here's another trick you can use:
> 
> print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))
> 
> (for Py2.x)

... and for Py3.x:

print(len('\0'.encode('unicode-internal')))

There's really no need to drop to C to get at sizeof(Py_UNICODE).
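
A hedged sketch of how the codec trick could be wrapped for Python 3 follows. The function name py_unicode_size is hypothetical. It assumes the 'unicode-internal' codec may be missing (it was removed in Python 3.8), in which case it falls back to the sys.maxunicode comparison, which is only a guess at the width rather than a measurement.

```python
import sys


def py_unicode_size():
    """Return the apparent size of one internal code unit in bytes."""
    try:
        # One NUL character encodes to exactly one internal code unit.
        return len('\0'.encode('unicode-internal'))
    except LookupError:
        # Codec not available: fall back to the maxunicode heuristic.
        return 2 if sys.maxunicode <= 0xFFFF else 4


print('sizeof(Py_UNICODE) =', py_unicode_size())
```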
msg68312 - (view) Author: Robert Schuppenies (schuppenies) * (Python committer) Date: 2008-06-17 10:34
I followed Marc's advice and checked in a corrected test.

Besides, I opened a new issue to address the fallback solution for
UCS4 and UCS2 (see issue3130).
History
Date User Action Args
2009-04-27 01:10:42 ajaksu2 link issue3130 dependencies
2008-06-17 10:34:08 schuppenies set status: open -> closed; resolution: fixed; messages: + msg68312
2008-06-16 16:21:42 lemburg set messages: + msg68271
2008-06-16 09:57:19 lemburg set messages: + msg68265
2008-06-15 20:49:59 georg.brandl set messages: + msg68251
2008-06-15 16:45:56 schuppenies set files: + Py_UNICODE_SIZEOF.patch; messages: + msg68242
2008-06-15 13:39:09 pitrou set messages: + msg68234
2008-06-15 13:18:53 schuppenies set messages: + msg68231
2008-06-13 21:59:36 schuppenies set messages: + msg68186
2008-06-13 21:21:37 benjamin.peterson set messages: + msg68185
2008-06-13 21:01:08 schuppenies set messages: + msg68184
2008-06-13 20:32:41 pitrou set messages: + msg68183
2008-06-13 20:18:22 lemburg set messages: + msg68182
2008-06-13 19:56:47 pitrou set messages: + msg68181
2008-06-13 19:54:53 lemburg set messages: + msg68180
2008-06-13 19:51:41 lemburg set nosy: + lemburg; messages: + msg68179
2008-06-13 19:50:41 benjamin.peterson set messages: + msg68178
2008-06-13 19:42:05 schuppenies set files: + Py_UNICODE.patch; messages: + msg68177
2008-06-13 19:41:27 schuppenies set files: + maxunicode.patch; keywords: + patch
2008-06-13 14:09:09 georg.brandl set nosy: + georg.brandl; messages: + msg68160
2008-06-13 13:59:54 pitrou set nosy: + pitrou; messages: + msg68159
2008-06-13 09:21:18 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg68141
2008-06-13 09:04:52 schuppenies set messages: + msg68138
2008-06-12 22:19:43 benjamin.peterson set messages: + msg68104
2008-06-12 22:13:16 benjamin.peterson create