msg68102 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2008-06-12 22:13 |
test test_sys failed -- Traceback (most recent call last):
  File "/temp/python/trunk/Lib/test/test_sys.py", line 549, in test_specialtypes
    size2=basicsize + sys.getsizeof(str(s)))
  File "/temp/python/trunk/Lib/test/test_sys.py", line 429, in check_sizeof
    self.assertEqual(result, size2, msg + str(size2))
AssertionError: wrong size for <type 'unicode'>: got 28, expected 50.5109328552
|
msg68104 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2008-06-12 22:19 |
It was recommended by Georg that you expose Py_UNICODE_SIZE in the
_testcapi, since the size is not consistent across all platforms.
|
msg68138 - (view) |
Author: Robert Schuppenies (schuppenies) *  |
Date: 2008-06-13 09:04 |
Are there any buildbots running with the "--enable-unicode=ucs4" option?
Just curious.
|
msg68141 - (view) |
Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *  |
Date: 2008-06-13 09:21 |
I'm sure there wasn't any a few months ago.
|
msg68159 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-13 13:59 |
Do you really need to expose Py_UNICODE_SIZE? There is already
sys.maxunicode, unless I'm missing something.
|
msg68160 - (view) |
Author: Georg Brandl (georg.brandl) *  |
Date: 2008-06-13 14:09 |
It is true that sys.maxunicode reflects whether the build is using UCS-2
or UCS-4; however, the size of Py_UNICODE is not fixed by that, look at
unicodeobject.h.
(Though I don't think we have platforms that actually *do* use sizes
other than 2 or 4, so we can of course be sloppy.)
|
msg68177 - (view) |
Author: Robert Schuppenies (schuppenies) *  |
Date: 2008-06-13 19:42 |
sys.maxunicode is well defined to be either 0xFFFF for UCS-2
or 0x10FFFF for UCS-4 (see PyUnicode_GetMax).
Py_UNICODE_SIZE is set in pyconfig.h to be either 2 or 4 during
configuration. When it is >= 4, Py_UNICODE_WIDE is set, which in turn
determines sys.maxunicode.
Thus, it is currently possible to derive Py_UNICODE_SIZE from
sys.maxunicode, but it takes some indirection.
So here are 2 possible patches, one which exposes Py_UNICODE_SIZE via
_testcapi and one which assumes that sys.maxunicode reflects UCS-X
settings. Since I am a fairly new Python developer, and given the new
4-eyes-per-commit policy for the beta phase, please decide which patch
should be applied.
|
msg68178 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2008-06-13 19:50 |
Personally, I prefer the one with _testcapi.Py_UNICODE_SIZE because it
is safe against future changes, but wait for someone else's opinion.
|
msg68179 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2008-06-13 19:51 |
It's actually very easy:
Py_UNICODE is a 2-byte value for UCS-2 builds and a 4-byte value for UCS-4
builds of Python.
print ((sys.maxunicode < 66000) and 'UCS2' or 'UCS4')
tells you which one you have.
Note that you should *not* use the exact value of 0x10FFFF for UCS-4 -
it's possible that the Unicode consortium decides to add more planes to
the Universal Character Set... (though not likely).
The above comparison is good enough to detect the number of bytes in a
single code point, though.
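A minimal sketch of the comparison above, wrapped in a function so that it can be tried against both kinds of sys.maxunicode value. The name `guess_code_unit_width` is made up for illustration and is not part of any patch attached to this issue. Note that on Python 3.3 and later sys.maxunicode is always 0x10FFFF, so the narrow branch only matters for older builds.

```python
import sys


def guess_code_unit_width(maxunicode=sys.maxunicode):
    """Guess the Py_UNICODE width in bytes from a sys.maxunicode value.

    Hypothetical helper: it assumes that a narrow build means 2 bytes and
    a wide build means 4 bytes, which is the "sloppy" assumption discussed
    in this thread.
    """
    # Compare against a threshold rather than the exact value 0x10FFFF.
    return 2 if maxunicode < 66000 else 4


print(guess_code_unit_width(0xFFFF))    # value reported by a UCS-2 build
print(guess_code_unit_width(0x10FFFF))  # value reported by a UCS-4 build
```

The threshold only separates the two build types; it says nothing about platforms where sizeof(Py_UNICODE) differs from 2 or 4.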
|
msg68180 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2008-06-13 19:54 |
BTW: Here's another trick you can use:
print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))
(for Py2.x)
|
msg68181 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-13 19:56 |
Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
up being more than 4 if the native int type is itself larger than 32
bits; although the latter is probably quite rare (64-bit platforms are
usually either LP64 or LLP64).
However, Py_UNICODE.patch is wrong in that it uses Py_UNICODE_SIZE
rather than sizeof(Py_UNICODE). Py_UNICODE_SIZE itself is always either
2 or 4.
|
msg68182 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2008-06-13 20:18 |
On 2008-06-13 21:56, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
> up being more than 4 if the native int type is itself larger than 32
> bits; although the latter is probably quite rare (64-bit platforms are
> usually either LP64 or LLP64).
AFAIK, only Crays have this problem, but apart from that: I'd consider
it a bug if sizeof(Py_UCS4) != 4.
|
msg68183 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-13 20:32 |
On Friday 13 June 2008 at 20:18 +0000, Marc-Andre Lemburg wrote:
> AFAIK, only Crays have this problem, but apart from that: I'd consider
> it a bug if sizeof(Py_UCS4) != 4.
Perhaps a #error can be added to that effect?
Something like (untested):
#if SIZEOF_INT == 4
typedef unsigned int Py_UCS4;
#elif SIZEOF_LONG == 4
typedef unsigned long Py_UCS4;
#else
#error Could not find a 4-byte integer type for Py_UCS4, aborting
#endif
(of course we could also try harder to find an appropriate type, but I'm
no specialist in C integer variations)
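The selection logic can also be prototyped from Python with ctypes before touching the header. This is only a sketch: `pick_ucs4_type` is a hypothetical helper, and it inspects type sizes at run time rather than at compile time, so it illustrates the order of the checks rather than replacing the preprocessor version.

```python
import ctypes


def pick_ucs4_type():
    """Return the first unsigned ctypes integer type that is exactly 4 bytes."""
    # Same order as the preprocessor chain: unsigned int, then unsigned long.
    for candidate in (ctypes.c_uint, ctypes.c_ulong):
        if ctypes.sizeof(candidate) == 4:
            return candidate
    # Run-time counterpart of the #error branch.
    raise RuntimeError("Could not find a 4-byte integer type for Py_UCS4")


print(pick_ucs4_type().__name__, ctypes.sizeof(pick_ucs4_type()))
```

On a platform where neither type is 4 bytes wide the helper raises, which is the behaviour the #error is meant to enforce at build time.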
|
msg68184 - (view) |
Author: Robert Schuppenies (schuppenies) *  |
Date: 2008-06-13 21:01 |
I think you're right that sizeof(Py_UNICODE) is the correct value to
use. But could you please explain to me how PY_UNICODE_TYPE is set, I
cannot find it.
Also, len(u'\0'.encode('unicode-internal')) does not work for Py3.0.
Any suggestions on how this information can be retrieved in py3k?
|
msg68185 - (view) |
Author: Benjamin Peterson (benjamin.peterson) *  |
Date: 2008-06-13 21:21 |
I believe PY_UNICODE_TYPE is set by configure in pyconfig.h.
|
msg68186 - (view) |
Author: Robert Schuppenies (schuppenies) *  |
Date: 2008-06-13 21:59 |
Found it, thanks. Wrong use of grep :|
|
msg68231 - (view) |
Author: Robert Schuppenies (schuppenies) *  |
Date: 2008-06-15 13:18 |
If I understand configure correctly, PY_UNICODE_TYPE is only set when
a type matching the size of $unicode_size is found. And this is set to
either 2 or 4. Thus, sizeof(Py_UNICODE) should always return 2 or 4.
If you agree, I would suggest using the method proposed by Marc in
msg68179.
|
msg68234 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2008-06-15 13:39 |
On Sunday 15 June 2008 at 13:18 +0000, Robert Schuppenies wrote:
> If I understand configure correctly, PY_UNICODE_TYPE is only set when
> a type matching the size of $unicode_size is found. And this is set to
> either 2 or 4.
But if PY_UNICODE_TYPE is not set in configure, unicodeobject.h tries to
settle on a default value. Which turns out to be Py_UCS4 in UCS4 builds:
http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l86
And Py_UCS4 itself will be larger than 4 bytes if the platform's int
size is larger than that:
http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l119
So if you want to be 100% correct, you should use
sizeof(PY_UNICODE_TYPE) (or sizeof(Py_UNICODE), which is the same). If
you don't want to, sys.maxunicode is sufficient :-)
|
msg68242 - (view) |
Author: Robert Schuppenies (schuppenies) *  |
Date: 2008-06-15 16:45 |
Correct is good, so here is a patch which exposes the size of
Py_UNICODE via _testcapi.
|
msg68251 - (view) |
Author: Georg Brandl (georg.brandl) *  |
Date: 2008-06-15 20:49 |
Looks good to me.
|
msg68265 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2008-06-16 09:57 |
On 2008-06-13 22:32, Antoine Pitrou wrote:
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> On Friday 13 June 2008 at 20:18 +0000, Marc-Andre Lemburg wrote:
>> AFAIK, only Crays have this problem, but apart from that: I'd consider
>> it a bug if sizeof(Py_UCS4) != 4.
>
> Perhaps a #error can be added to that effect?
> Something like (untested):
>
> #if SIZEOF_INT == 4
> typedef unsigned int Py_UCS4;
> #elif SIZEOF_LONG == 4
> typedef unsigned long Py_UCS4;
> #else
> #error Could not find a 4-byte integer type for Py_UCS4, aborting
> #endif
Sounds good !
> (of course we could also try harder to find an appropriate type, but I'm
> no specialist in C integer variations)
Python should really try to use uint32_t as fallback solution for
UCS4 where available (and uint16_t for UCS2).
We'd have to add an AC_TYPE_UINT32_T and AC_TYPE_UINT16_T check to
configure:
http://www.gnu.org/software/autoconf/manual/html_node/Particular-Types.html#Particular-Types
and could then use
typedef uint32_t Py_UCS4;
and
typedef uint16_t Py_UCS2;
Note that the code for supporting UCS2/UCS4 is not really all that
clean. It was a quick sprint between Martin and Fredrik and appears
to be only half-done... e.g. there currently is no Py_UCS2.
|
msg68271 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2008-06-16 16:21 |
On 2008-06-13 21:54, Marc-Andre Lemburg wrote:
> BTW: Here's another trick you can use:
>
> print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))
>
> (for Py2.x)
... and for Py3.x:
print(len('\0'.encode('unicode-internal')))
There's really no need to drop to C to get at sizeof(Py_UNICODE).
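For later interpreters the trick needs a fallback: the 'unicode-internal' codec was deprecated in Python 3.3 and is no longer available from Python 3.8 on. A hedged sketch follows, assuming that on those versions Py_UNICODE is a typedef for wchar_t, so that the size of ctypes.c_wchar gives the same answer. The helper name `py_unicode_size` is invented for illustration.

```python
import ctypes


def py_unicode_size():
    """Best-effort guess at sizeof(Py_UNICODE) without a C extension."""
    try:
        # Works where the codec still exists: one code unit is encoded as
        # exactly sizeof(Py_UNICODE) bytes.
        return len('\0'.encode('unicode-internal'))
    except LookupError:
        # Codec unknown on newer interpreters. Assumption: Py_UNICODE is
        # wchar_t there, so report the platform's wchar_t size instead.
        return ctypes.sizeof(ctypes.c_wchar)


print('sizeof(Py_UNICODE) =', py_unicode_size())
```

The fallback reports the platform's wchar_t width, which is typically 4 on Linux and 2 on Windows, and it no longer describes how str objects are stored internally.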
|
msg68312 - (view) |
Author: Robert Schuppenies (schuppenies) *  |
Date: 2008-06-17 10:34 |
I followed Marc's advice and checked in a corrected test.
Besides, I opened a new issue to address the fallback solution for
UCS4 and UCS2 (see issue3130).
|
|
Date | User | Action | Args |
2022-04-11 14:56:35 | admin | set | github: 47348 |
2009-04-27 01:10:42 | ajaksu2 | link | issue3130 dependencies |
2008-06-17 10:34:08 | schuppenies | set | status: open -> closed; resolution: fixed; messages: + msg68312 |
2008-06-16 16:21:42 | lemburg | set | messages: + msg68271 |
2008-06-16 09:57:19 | lemburg | set | messages: + msg68265 |
2008-06-15 20:49:59 | georg.brandl | set | messages: + msg68251 |
2008-06-15 16:45:56 | schuppenies | set | files: + Py_UNICODE_SIZEOF.patch; messages: + msg68242 |
2008-06-15 13:39:09 | pitrou | set | messages: + msg68234 |
2008-06-15 13:18:53 | schuppenies | set | messages: + msg68231 |
2008-06-13 21:59:36 | schuppenies | set | messages: + msg68186 |
2008-06-13 21:21:37 | benjamin.peterson | set | messages: + msg68185 |
2008-06-13 21:01:08 | schuppenies | set | messages: + msg68184 |
2008-06-13 20:32:41 | pitrou | set | messages: + msg68183 |
2008-06-13 20:18:22 | lemburg | set | messages: + msg68182 |
2008-06-13 19:56:47 | pitrou | set | messages: + msg68181 |
2008-06-13 19:54:53 | lemburg | set | messages: + msg68180 |
2008-06-13 19:51:41 | lemburg | set | nosy: + lemburg; messages: + msg68179 |
2008-06-13 19:50:41 | benjamin.peterson | set | messages: + msg68178 |
2008-06-13 19:42:05 | schuppenies | set | files: + Py_UNICODE.patch; messages: + msg68177 |
2008-06-13 19:41:27 | schuppenies | set | files: + maxunicode.patch; keywords: + patch |
2008-06-13 14:09:09 | georg.brandl | set | nosy: + georg.brandl; messages: + msg68160 |
2008-06-13 13:59:54 | pitrou | set | nosy: + pitrou; messages: + msg68159 |
2008-06-13 09:21:18 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc; messages: + msg68141 |
2008-06-13 09:04:52 | schuppenies | set | messages: + msg68138 |
2008-06-12 22:19:43 | benjamin.peterson | set | messages: + msg68104 |
2008-06-12 22:13:16 | benjamin.peterson | create | |