Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sys.sizeof test fails with wide unicode #47348

Closed
benjaminp opened this issue Jun 12, 2008 · 23 comments
Closed

sys.sizeof test fails with wide unicode #47348

benjaminp opened this issue Jun 12, 2008 · 23 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@benjaminp
Copy link
Contributor

BPO 3098
Nosy @malemburg, @birkenfeld, @amauryfa, @pitrou, @benjaminp
Files
  • maxunicode.patch: Patch against 2.6 trunk, revision 64230
  • Py_UNICODE.patch: Patch against 2.6 trunk, revision 64230
  • Py_UNICODE_SIZEOF.patch: Patch against 2.6 trunk, revision 64296
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2008-06-17.10:34:08.437>
    created_at = <Date 2008-06-12.22:13:16.311>
    labels = ['interpreter-core', 'type-bug']
    title = 'sys.sizeof test fails with wide unicode'
    updated_at = <Date 2008-06-17.10:34:08.401>
    user = 'https://github.com/benjaminp'

    bugs.python.org fields:

    activity = <Date 2008-06-17.10:34:08.401>
    actor = 'schuppenies'
    assignee = 'schuppenies'
    closed = True
    closed_date = <Date 2008-06-17.10:34:08.437>
    closer = 'schuppenies'
    components = ['Interpreter Core']
    creation = <Date 2008-06-12.22:13:16.311>
    creator = 'benjamin.peterson'
    dependencies = []
    files = ['10623', '10624', '10635']
    hgrepos = []
    issue_num = 3098
    keywords = ['patch']
    message_count = 23.0
    messages = ['68102', '68104', '68138', '68141', '68159', '68160', '68177', '68178', '68179', '68180', '68181', '68182', '68183', '68184', '68185', '68186', '68231', '68234', '68242', '68251', '68265', '68271', '68312']
    nosy_count = 6.0
    nosy_names = ['lemburg', 'georg.brandl', 'amaury.forgeotdarc', 'pitrou', 'benjamin.peterson', 'schuppenies']
    pr_nums = []
    priority = 'critical'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue3098'
    versions = ['Python 2.6', 'Python 3.0']

    @benjaminp
    Copy link
    Contributor Author

    test test_sys failed -- Traceback (most recent call last):
      File "/temp/python/trunk/Lib/test/test_sys.py", line 549, in
    test_specialtypes
        size2=basicsize + sys.getsizeof(str(s)))
      File "/temp/python/trunk/Lib/test/test_sys.py", line 429, in check_sizeof
        self.assertEqual(result, size2, msg + str(size2))
    AssertionError: wrong size for <type 'unicode'>: got 28, expected
    50.5109328552

    @benjaminp benjaminp added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Jun 12, 2008
    @benjaminp
    Copy link
    Contributor Author

    It was recommended by Georg that you expose Py_UNICODE_SIZE in the
    _testcapi, since the size is not consistent across all platforms.

    @schuppenies
    Copy link
    Mannequin

    schuppenies mannequin commented Jun 13, 2008

    Are they any buildbots running with the "--enable-unicode=ucs4" option?
    Just curious.

    @amauryfa
    Copy link
    Member

    I'm sure there wasn't any a few months ago.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 13, 2008

    Do you really need to expose Py_UNICODE_SIZE? There is already
    sys.maxunicode, unless I'm missing something.

    @birkenfeld
    Copy link
    Member

    It is true that sys.maxunicode reflects whether the build is using UCS-2
    or UCS-4; however, the size of Py_UNICODE is not fixed by that, look at
    unicodeobject.h.

    (Though I don't think we have platforms that actually *do* use sizes
    other than 2 or 4, so we can of course be sloppy.)

    @schuppenies
    Copy link
    Mannequin

    schuppenies mannequin commented Jun 13, 2008

    sys.maxunicode is well defined to be either 0xFFFF for UCS-2
    or 0x10FFFF for UCS-4 (see PyUnicode_GetMax).

    Py_UNICODE_SIZE is set in pyconfig.h to be either 2 or 4 during
    configuration. When >= 4, Py_UNICODE_WIDE is set which again influences
    sys.maxunicode.

    Thus, it currently is possible to derive Py_UNICODE_SIZE from
    sys.maxunicode. But it takes some indirections.

    So here are 2 possible patches, one which exposes Py_UNICODE_SIZE via
    _testcapi and one which assumes that sys.maxunicode reflects UCS-X
    settings. Since I am a fairly new Python developer and the new
    4-eyes-per-commit policy for the beta phase, please decide which patch
    should be applied.

    @benjaminp
    Copy link
    Contributor Author

    Personally, I prefer the one with _testcapi.Py_UNICODE_SIZE because it
    is safe against future changes, but wait for someone else's opinion.

    @malemburg
    Copy link
    Member

    It's actually very easy:

    Py_UNICODE is a 2-byte value for UCS-2 builds and 4 byte value for UCS-4
    builds of Python.

    print ((sys.maxunicode < 66000) and 'UCS2' or 'UCS4')

    tells you which one you have.

    Note that you should *not* use the exact value of 0x10FFFF for UCS-4 -
    it's possible that the Unicode consortium decides to add more planes to
    the Universal Character Set... (though not likely).

    The above comparison is good enough to detect the number of bytes in a
    single code point, though.

    @malemburg
    Copy link
    Member

    BTW: Here's another trick you can use:

    print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))

    (for Py2.x)

    @pitrou
    Copy link
    Member

    pitrou commented Jun 13, 2008

    Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
    up being more than 4 if the native int type is itself larger than 32
    bits; although the latter is probably quite rare (64-bit platforms are
    usually either LP64 or LLP64).

    However, Py_UNICODE.patch is wrong in that it uses Py_UNICODE_SIZE
    rather than sizeof(Py_UNICODE). Py_UNICODE_SIZE itself is always either
    2 or 4.

    @malemburg
    Copy link
    Member

    On 2008-06-13 21:56, Antoine Pitrou wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    Hmm, so it seems that in some UCS4 builds, sizeof(Py_UNICODE) could end
    up being more than 4 if the native int type is itself larger than 32
    bits; although the latter is probably quite rare (64-bit platforms are
    usually either LP64 or LLP64).

    AFAIK, only Crays have this problem, but apart from that: I'd consider
    it a bug if sizeof(Py_UCS4) != 4.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 13, 2008

    Le vendredi 13 juin 2008 à 20:18 +0000, Marc-Andre Lemburg a écrit :

    AFAIK, only Crays have this problem, but apart from that: I'd consider
    it a bug if sizeof(Py_UCS4) != 4.

    Perhaps a #error can be added to that effect?
    Something like (untested):

    #if SIZEOF_INT == 4 
    typedef unsigned int Py_UCS4; 
    #elif SIZEOF_LONG == 4
    typedef unsigned long Py_UCS4; 
    #else
    #error Could not find a 4-byte integer type for Py_UCS4, aborting
    #endif

    (of course we could also try harder to find an appropriate type, but I'm
    no specialist in C integer variations)

    @schuppenies
    Copy link
    Mannequin

    schuppenies mannequin commented Jun 13, 2008

    I think you're right that sizeof(Py_UNICODE) is the correct value to
    use. But could you please explain to me how PY_UNICODE_TYPE is set, I
    cannot find it.

    Also, len(u'\0'.encode('unicode-internal')) does not work for Py3.0.
    Any suggestion how could this information can be retrieved in py3k?

    @benjaminp
    Copy link
    Contributor Author

    I believe Py_UNICODE_TYPE is set be configure in pyconfig.h.

    @schuppenies
    Copy link
    Mannequin

    schuppenies mannequin commented Jun 13, 2008

    Found it, thanks. Wrong use of grep :|

    @schuppenies
    Copy link
    Mannequin

    schuppenies mannequin commented Jun 15, 2008

    If I understand configure correctly, PY_UNICODE_TYPE is only set when
    a type matching the size of $unicode_size is found. And this is set to
    either 2 or 4. Thus, sizeof(Py_UNICODE) should always return 2 or 4.
    If you agree, I would suggest using the method proposed by Marc in
    msg68179.

    @pitrou
    Copy link
    Member

    pitrou commented Jun 15, 2008

    Le dimanche 15 juin 2008 à 13:18 +0000, Robert Schuppenies a écrit :

    If I understand configure correctly, PY_UNICODE_TYPE is only set when
    a type matching the size of $unicode_size is found. And this is set to
    either 2 or 4.

    Buf if PY_UNICODE_TYPE is not set in configure, unicodeobject.h tries to
    settle on a default value. Which turns out to be Py_UCS4 in UCS4 builds:
    http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l86

    And Py_UCS4 itself will be larger than 4 bytes if the platform's int
    size is larger than that:
    http://hg.pitrou.net/public/py3k/py3k/file/da93fc81b086/Include/unicodeobject.h#l119

    So if you want to be 100% correct, you should use
    sizeof(PY_UNICODE_TYPE) (or sizeof(Py_UNICODE), which is the same). If
    you don't want to, sys.maxunicode is sufficient :-)

    @schuppenies
    Copy link
    Mannequin

    schuppenies mannequin commented Jun 15, 2008

    Correct is good, so here is a patch which exposes the size of
    Py_UNICODE via _testcapi.

    @birkenfeld
    Copy link
    Member

    Looks good to me.

    @malemburg
    Copy link
    Member

    On 2008-06-13 22:32, Antoine Pitrou wrote:

    Antoine Pitrou <pitrou@free.fr> added the comment:

    Le vendredi 13 juin 2008 à 20:18 +0000, Marc-Andre Lemburg a écrit :
    > AFAIK, only Crays have this problem, but apart from that: I'd consider
    > it a bug if sizeof(Py_UCS4) != 4.

    Perhaps a #error can be added to that effect?
    Something like (untested):

    #if SIZEOF_INT == 4
    typedef unsigned int Py_UCS4;
    #elif SIZEOF_LONG == 4
    typedef unsigned long Py_UCS4;
    #else
    #error Could not find a 4-byte integer type for Py_UCS4, aborting
    #endif

    Sounds good !

    (of course we could also try harder to find an appropriate type, but I'm
    no specialist in C integer variations)

    Python should really try to use uint32_t as fallback solution for
    UCS4 where available (and uint16_t for UCS2).

    We'd have to add an AC_TYPE_INT32_T and AC_TYPE_INT16_T check to
    configure:

    http://www.gnu.org/software/autoconf/manual/html_node/Particular-Types.html#Particular-Types

    and could then use

    typedef uint32_t Py_UCS4

    and

    typedef uint16_t Py_UCS2

    Note that the code for supporting UCS2/UCS4 is not really all that
    clean. It was a quick sprint between Martin and Fredrik and appears
    to be only half-done... e.g. there currently is no Py_UCS2.

    @malemburg
    Copy link
    Member

    On 2008-06-13 21:54, Marc-Andre Lemburg wrote:

    BTW: Here's another trick you can use:

    print 'sizeof(Py_UNICODE) =', len(u'\0'.encode('unicode-internal'))

    (for Py2.x)

    ... and for Py3.x:

    print(len(u'\0'.encode('unicode-internal')))

    There's really no need to drop to C to get at sizeof(Py_UNICODE).

    @schuppenies
    Copy link
    Mannequin

    schuppenies mannequin commented Jun 17, 2008

    I followed Marc's advise and checked-in a corrected test.

    Besides, I opened a new issue to address the fallback solution for
    UCS4 and UCS2 (see bpo-3130).

    @schuppenies schuppenies mannequin closed this as completed Jun 17, 2008
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    5 participants