Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only) #48724

Closed
mdickinson opened this issue Nov 30, 2008 · 33 comments
Closed
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error

Comments

@mdickinson
Copy link
Member

BPO 4474
Nosy @malemburg, @mdickinson, @vstinner
Files
  • unicode_fromwidechar_surrogate-7.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2009-03-18.16:10:55.243>
    created_at = <Date 2008-11-30.18:54:07.896>
    labels = ['interpreter-core', 'type-bug']
    title = 'PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only)'
    updated_at = <Date 2009-03-18.16:10:55.242>
    user = 'https://github.com/mdickinson'

    bugs.python.org fields:

    activity = <Date 2009-03-18.16:10:55.242>
    actor = 'mark.dickinson'
    assignee = 'none'
    closed = True
    closed_date = <Date 2009-03-18.16:10:55.243>
    closer = 'mark.dickinson'
    components = ['Interpreter Core']
    creation = <Date 2008-11-30.18:54:07.896>
    creator = 'mark.dickinson'
    dependencies = []
    files = ['13210']
    hgrepos = []
    issue_num = 4474
    keywords = ['patch']
    message_count = 33.0
    messages = ['76651', '76653', '76665', '76666', '76670', '76684', '76686', '76708', '80001', '80002', '80008', '80012', '80013', '80017', '80018', '80021', '80023', '80024', '80132', '80151', '80311', '80578', '80636', '80795', '82674', '82676', '82677', '82678', '82679', '82789', '82919', '83751', '83754']
    nosy_count = 4.0
    nosy_names = ['lemburg', 'mark.dickinson', 'vstinner', 'rpetrov']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue4474'
    versions = ['Python 2.7']

    @mdickinson
    Copy link
    Member Author

    On systems (Linux, OS X) where sizeof(wchar_t) is 4 and wchar_t arrays are
    usually encoded as UTF-32, it looks as though PyUnicode_FromWideChar
    simply truncates the 32-bit characters to 16-bits, thus giving incorrect
    results for characters outside the BMP. I expected it to convert the UTF-
    32 encoding to UTF-16.

    Note that PyUnicode_FromWideChar is used to process command-line
    arguments, so strange things can happen when passing filenames with non-
    BMP characters to a Python script.

    Here's an OS X 10.5 Terminal session (current directory is the root of the
    py3k tree).

    dickinsm$ cat test𐅭.py
    from sys import argv
    print("My arguments are: ",argv)
    dickinsm$ ./python.exe test𐅭.py
    My arguments are: ['testŭ.py']
    dickinsm$ ./python.exe Lib/tabnanny.py test𐅭.py
    'testŭ.py': I/O Error: [Errno 2] No such file or directory: 'testŭ.py'

    (In case the character after 'test' and before '.py' isn't showing up
    correctly, it's chr(65901), 'GREEK ACROPHONIC TROEZENIAN FIVE HUNDRED'.)

    @mdickinson mdickinson added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Nov 30, 2008
    @mdickinson
    Copy link
    Member Author

    Comments from MvL in bpo-4388:

    I think you are right. However, conversion to/from wchar_t is/was
    rarely used, and so are non-BMP characters; it's very likely that
    the problem hasn't occurred in practice (and I doubt it would occur
    in 3.0 if not fixed - there are more severe problems around).

    @rpetrov
    Copy link
    Mannequin

    rpetrov mannequin commented Nov 30, 2008

    it is fine on linux (tested with UTF-8 codeset for locale):
    $ ./python test𐅭.py
    ('My arguments are: ', ['test\xf0\x90\x85\xad.py'])
    \xf0\x90\x85\xad (UTF-8) = 0001016d (USC-4) = 65901

    @rpetrov
    Copy link
    Mannequin

    rpetrov mannequin commented Nov 30, 2008

    s/USC-4/UCS-4/g

    @mdickinson
    Copy link
    Member Author

    it is fine on linux

    Interesting. Which version of Python is that? And is PyUNICODE 2 bytes
    or 4 bytes for that build of Python?

    @mdickinson
    Copy link
    Member Author

    Just to be clear, the defect in PyUnicode_FromWideChar is present both in
    Python 2.x and Python 3.x.

    The problem with command-line arguments only occurs in Python 3.x, since
    2.x doesn't use PyUnicode_FromWideChar in converting arguments.

    I can reproduce the 'No such file or directory' error on both OS X and
    Linux, for Python 3.0.

    @malemburg
    Copy link
    Member

    This is due to the function downcasting the wchar_t values to
    Py_UNICODE, which is a 2-byte value if you build Python as UCS2 version
    on Unix.

    Most Unixes ship with UCS4 builds, so you don't see the problem there.
    Mac OS X ships with a UCS2 build, which is why you run into the problem
    on that platform.

    UCS2 builds are also the default build on Unix, so if you compile Python
    yourself, it will result in a UCS2 build, unless you explicitly specify
    the UCS4 build configure option or configure happens to find a Tcl/Tk
    version installed that uses UCS4 internally.

    @rpetrov
    Copy link
    Mannequin

    rpetrov mannequin commented Dec 1, 2008

    Marc-Andre explain all. For the protocol my version is from trunk,
    python is build with default options. Since system tcl limit UTF-8 to 3
    bytes, python is build for UCS-2.

    In the report output from python is with character 010d(UCS-2).

    May be issue is not for versions before 3.0.

    @vstinner
    Copy link
    Member

    Patch fixing PyUnicode_FromWideChar() for UCS-2 build: create
    surrogates for character > U+FFFF like PyUnicode_FromOrdinal() does.

    @vstinner
    Copy link
    Member

    Note: I wrote my patch against py3k r68646.

    @mdickinson
    Copy link
    Member Author

    Thanks for the patch, Victor!

    Looks pretty good at first glance, except that it seems that the UTF-32 to
    UTF-16 translation is skipped if HAVE_USABLE_WCHAR_T is defined. Is that
    deliberate?

    A test would be good, too.

    @vstinner
    Copy link
    Member

    Looks pretty good at first glance, except that it seems that the UTF-32 to
    UTF-16 translation is skipped if HAVE_USABLE_WCHAR_T is defined. Is that
    deliberate?

    #ifdef HAVE_USABLE_WCHAR_T
        memcpy(unicode->str, w, size * sizeof(wchar_t));
    #else
        ...
    #endif

    I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I
    misunderstood the code, it's a a heap overflow :-) So there is no not
    conversion from UTF-32 to UTF-16 using memcpy if HAVE_USABLE_WCHAR_T is
    defined, right?

    A test would be good, too.

    PyUnicode_FromWideChar() is not a public API. Should I write a function in
    _testcapi?

    @mdickinson
    Copy link
    Member Author

    I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I
    misunderstood the code, it's a a heap overflow :-)

    Yep, sorry. You're right.

    > A test would be good, too.

    PyUnicode_FromWideChar() is not a public API. Should I write a function
    in _testcapi?

    I was actually thinking of a test for the "No such file or directory"
    bug that's mentioned at the start of the discussion.

    @malemburg
    Copy link
    Member

    On 2009-01-17 14:00, STINNER Victor wrote:

    > A test would be good, too.

    PyUnicode_FromWideChar() is not a public API. Should I write a function in
    _testcapi?

    It is a public C API. Regardless of this aspect, we should always
    add tests for bugs in order to make sure they don't pop up again.

    @malemburg
    Copy link
    Member

    On 2009-01-17 14:00, STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    > Looks pretty good at first glance, except that it seems that the UTF-32 to
    > UTF-16 translation is skipped if HAVE_USABLE_WCHAR_T is defined. Is that
    > deliberate?

    #ifdef HAVE_USABLE_WCHAR_T
    memcpy(unicode->str, w, size * sizeof(wchar_t));
    #else
    ...
    #endif

    I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I
    misunderstood the code, it's a a heap overflow :-) So there is no not
    conversion from UTF-32 to UTF-16 using memcpy if HAVE_USABLE_WCHAR_T is
    defined, right?

    If HAVE_USABLE_WCHAR_T is defined, Py_UNICODE is defined as wchar_t,
    so a memcpy can be used. Note that this does not provide any information
    about sizeof(wchar_t), e.g. with GLIBC, wchar_t is 4 bytes. MS C lib defines
    it as 2 bytes.

    That said, if Py_UNICODE is the same as wchar_t, no conversion is
    necessary and that's why the function simply copies over the data.

    @vstinner
    Copy link
    Member

    Updated patch including a test in _testcapi module: create two
    PyUnicode objects from wide string (PyUnicode_FromWideChar) and
    another from utf-8 (PyUnicode_FromString) and compare the value. Patch
    is still for py3k branch and can be ported easily on python trunk.

    @vstinner
    Copy link
    Member

    I run my test on py3k on Linux with 32 bits wchar_t:

    • 16 bits Py_UNICODE: test is failing without
      PyUnicode_FromWideChar() patch
    • 32 bits Py_UNICODE: test pass without the patch, so the issue only
      impact 16 bits Py_UNICODE

    Can someone test with 16 bits wchar_t (eg. Windows)? I think that the
    bug only impact 16 bits Py_UNICODE with 32 bits wchar_t.

    @vstinner
    Copy link
    Member

    (with the full patch, all tests pass with 16 or 32 bits Py_UNICODE)

    @mdickinson
    Copy link
    Member Author

    Looks good to me.

    I'm not in a position to test with 16-bit wchar_t, but I can't see why
    anything would go wrong. I think we can take our chances: check this in
    and watch the buildbots for signs of trouble.

    Some minor whitespace issues in the unicodeobject.c part of the patch
    (mixing of tabs and spaces, one brace indented oddly), but those can
    easily be taken care of before committing; not worth regenerating the
    patch for.

    Marc-André, is it okay with you to check this in?

    @malemburg
    Copy link
    Member

    On 2009-01-18 22:59, Mark Dickinson wrote:

    Mark Dickinson <dickinsm@gmail.com> added the comment:

    Looks good to me.

    I'm not in a position to test with 16-bit wchar_t, but I can't see why
    anything would go wrong. I think we can take our chances: check this in
    and watch the buildbots for signs of trouble.

    Some minor whitespace issues in the unicodeobject.c part of the patch
    (mixing of tabs and spaces, one brace indented oddly), but those can
    easily be taken care of before committing; not worth regenerating the
    patch for.

    Marc-André, is it okay with you to check this in?

    I'd structure the patch differently, ie. put the whole support code
    into a single #ifndef Py_UNICODE_WIDE section as part of the
    #ifdef HAVE_USABLE_WCHAR_T pre-processor statement.

    Also note that on platforms with 16-bit wchar_t, the comparison
    (0xffff < *w) will always be false, so an additional check for
    (Py_UNICODE_SIZE > 2) is needed.

    BTW: Please always use upper-case hex literals, or at leat don't
    mix the case within the same function.

    Thanks,

    Marc-Andre Lemburg
    eGenix.com


    ::: Try our new mxODBC.Connect Python Database Interface for free ! ::::

    eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
    D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
    Registered at Amtsgericht Duesseldorf: HRB 46611
    http://www.egenix.com/company/contact/

    @vstinner
    Copy link
    Member

    Also note that on platforms with 16-bit wchar_t, the comparison
    (0xffff < *w) will always be false, so an additional check for
    (Py_UNICODE_SIZE > 2) is needed.

    Yes, but the right test is (SIZEOF_WCHAR_T > 2). I wrote a new test:

    #if (Py_UNICODE_SIZE == 2) && (SIZEOF_WCHAR_T > 2)
    #define USE_WCHAR_SURROGATE
        const wchar_t *orig_w;
    #endif

    BTW: Please always use upper-case hex literals, or at leat don't
    mix the case within the same function.

    I try, but it would be easier if the rule was already respected: they
    are many tabs and many lower-case hex literals. I used copy/paste from
    existing code of unicodeobject.c...

    Patch version 3:

    • disable the UTF-16 surrogate for 16 bits wchar_t: so my patch is
      only used for 16 bits Py_UNICODE and 32 bits wchar_t... which is the
      default case for python 2.6 and 3.0 on Linux
    • replace tabulation by spaces (in existing code)
    • use upper case literals

    @vstinner
    Copy link
    Member

    @marketdickinson, @lemburg: ping! I updated the patch, does it look
    better?

    @malemburg
    Copy link
    Member

    On 2009-01-26 17:56, STINNER Victor wrote:

    STINNER Victor <victor.stinner@haypocalc.com> added the comment:

    @marketdickinson, @lemburg: ping! I updated the patch, does it look
    better?

    Yes, but there are a few things that still need fixing:

     * SIZEOF_WCHAR_T is not defined for Windows builds, so needs
       to be added to PC/pyconfig.h (OTOH wchar_t is 2 bytes on Windows)
     * USE_WCHAR_SURROGATE should be #defined just before the
       function and #undef'ed right after it; I'd also use a more
       accurate name
     * please use pre-processor indents, e.g.
       #ifdef ...
       # define ...
       #endif

    I'd write

    #if (Py_UNICODE_SIZE == 2) && defined((SIZEOF_WCHAR_T) && (SIZEOF_WCHAR_T > 2)
    # define CONVERT_WCHAR_TO_SURROGATES
    #endif

    ...

    #undef CONVERT_WCHAR_TO_SURROGATES

    Thanks.

    @vstinner
    Copy link
    Member

    For lemburg, updated patch:

    • Move USE_WCHAR_SURROGATE define outside PyUnicode_FromWideChar()
      (and indent the defines, sorry)
    • Add "#define SIZEOF_WCHAR_T 2" to PC/pyconfig.h

    @mdickinson
    Copy link
    Member Author

    Updated Victor's patch:

    • applies cleanly against newly whitespace-normalized unicodeobject.c
    • renamed USE_WCHAR_SURROGATE to CONVERT_WCHAR_TO_SURROGATES
    • add defined(SIZEOF_WCHAR_T) check

    I find the patched version of PyUnicode_FromWideChar quite hard to follow
    with all the #ifdefery. I wonder whether it might be better to define a
    _PyUnicode16_FromWideChar32 helper function instead.

    @malemburg
    Copy link
    Member

    On 2009-02-24 20:39, Mark Dickinson wrote:

    Mark Dickinson <dickinsm@gmail.com> added the comment:

    Updated Victor's patch:

    • applies cleanly against newly whitespace-normalized unicodeobject.c
    • renamed USE_WCHAR_SURROGATE to CONVERT_WCHAR_TO_SURROGATES
    • add defined(SIZEOF_WCHAR_T) check

    I find the patched version of PyUnicode_FromWideChar quite hard to follow
    with all the #ifdefery. I wonder whether it might be better to define a
    _PyUnicode16_FromWideChar32 helper function instead.

    Same here. It would be better to have a single #ifdef #else #endif
    block with one branch for the CONVERT_WCHAR_TO_SURROGATES case and the
    other for the normal operation.

    No need for a new helper function.

    @mdickinson
    Copy link
    Member Author

    It would be better to have a single #ifdef #else #endif

    Yes, of course it would. :)

    @mdickinson
    Copy link
    Member Author

    New patch, with two separate versions of PyUnicode_FromWideChar.

    @malemburg
    Copy link
    Member

    On 2009-02-24 21:50, Mark Dickinson wrote:

    Mark Dickinson <dickinsm@gmail.com> added the comment:

    New patch, with two separate versions of PyUnicode_FromWideChar.

    Thanks, much better :-)

    @vstinner
    Copy link
    Member

    add defined(SIZEOF_WCHAR_T) check

    I don't understand why SIZEOF_WCHAR_T could be unset, but the patch
    version 6 only checks defined(SIZEOF_WCHAR_T) in unicodeobject.c, not
    in _testcapimodule.c (#if SIZEOF_WCHAR_T == 4).

    @mdickinson
    Copy link
    Member Author

    Good catch! Added defined(SIZEOF_WCHAR) to the testcapi code as well,
    and removed the change to PC/pyconfig.h, since we don't need it any
    more...

    @mdickinson
    Copy link
    Member Author

    Committed to py3k, r70452.

    Since this is partway between a bugfix and a new feature, I suggest that
    it's not worth merging it to 3.0 (or 2.6). It should be backported to
    2.7, however; I'll do this after verifying that the py3k buildbots are
    happy.

    @mdickinson
    Copy link
    Member Author

    Backported to the trunk in r70454. Thanks, all!

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    3 participants