classification
Title: c_types.c_wchar should not assume that sizeof(wchar_t) == sizeof(Py_UNICODE)
Type: behavior Stage: resolved
Components: ctypes, Unicode Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: theller Nosy List: ezio.melotti, lemburg, stutzbach, theller, vstinner
Priority: normal Keywords: patch

Created on 2010-05-09 06:38 by stutzbach, last changed 2010-10-02 15:33 by stutzbach. This issue is now closed.

Files
File name Uploaded Description Edit
aswidechar_nonbmp-4.patch vstinner, 2010-10-01 22:56
ctypes_nonbmp.patch vstinner, 2010-10-01 23:02
Messages (20)
msg105374 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-09 06:38
Using a UCS2 Python on a platform with a 32-bit wchar_t, the following code throws an exception (but should not):

>>> ctypes.c_wchar('\u10000')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: one character unicode string expected

The trouble is in the u_set() function in Modules/_ctypes/cfield.c.  The corresponding u_get() function looks correct.

On a UCS4 Python running on a system with a 16-bit wchar_t, u_set() will corrupt the data by silently truncating the character to 16-bits.

For reference, Linux and Mac OS use a 32-bit wchar_t while Windows uses a 16-bit wchar_t.
msg105656 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-13 21:10
Support of characters outside the Unicode BMP (code > 0xffff) is not complete in narrow build (sizeof(Py_UNICODE) == 2) for Python2:

$ ./python
Python 2.7b2+ (trunk:81139M, May 13 2010, 18:45:37) 
>>> x=u'\U00010000'
>>> x[0], x[1]
(u'\ud800', u'\udc00')
>>> len(x)
2
>>> ord(x)
Traceback (most recent call last):
  ...
TypeError: ord() expected a character, but string of length 2 found
>>> unichr(0x10000)
Traceback (most recent call last):
  ...
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

It looks better in Python3:

$ ./python 
Python 3.2a0 (py3k:81137:81138, May 13 2010, 18:50:51) 
>>> x='\U00010000'
>>> x[0], x[1]
('\ud800', '\udc00')
>>> len(x)
2
>>> ord(x)
65536
>>> chr(0x10000)
'\U00010000'

About the issue, the problem is in function u_set(). This function should use PyUnicode_AsWideChar() but PyUnicode_AsWideChar() doesn't support surrogates... whereas PyUnicode_FromWideChar() does support surrogates.
msg105670 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-13 23:56
Patch for Python3:
 - Fix PyUnicode_AsWideChar() to support surrogates (Py_UNICODE: 2 bytes, wchar_t: 4 bytes)
 - u_set() of _ctypes uses PyUnicode_AsWideChar()
 - add a test (skipped if sizeof(wchar_t) is smaller than 4 bytes)

It's too late to fix Python2: 2.7 beta 2 was released (it doesn't support non BMP characters for chr()/ord()).

TODO:
 - I'm not sure that my patch on PyUnicode_AsWideChar() works if sizeof(wchar_t)==2
 - Test the patch on Windows
 - Check that it doesn't break other functions calling PyUnicode_AsWideChar()

ctypes, _locale.strcoll() and time.strftime() use PyUnicode_AsWideChar(). time has interesting comments:

    /* This assumes that PyUnicode_AsWideChar doesn't do any UTF-16
       expansion. */
    if (PyUnicode_AsWideChar((PyUnicodeObject*)format,
                             (wchar_t*)PyBytes_AS_STRING(tmpfmt),
                             PyUnicode_GetSize(format)+1) == (size_t)-1)
        /* This shouldn't fail. */
        Py_FatalError("PyUnicode_AsWideChar failed");
msg117567 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-29 00:21
#9979 proposes to create a new PyUnicode_AsWideCharString() function.
msg117568 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-09-29 00:27
I know enough about Unicode to have reported this bug, but I don't feel knowledgeable enough about Python's Unicode implementation to comment on your suggested solution.

I'm adding the other people listed in Misc/maintainers.rst as interested in Unicode to the nosy list on this issue and the one you just linked to.
msg117787 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-01 11:53
Update the patch for the new PyUnicode_AsWideCharString() function:
 - use Py_UNICODE_SIZE and SIZEOF_WCHAR_T in the preprocessor tests
 - faster loop: don't use a counter + pointer, but only use pointers (for the stop condition)

The patch is not finished: I have to implement "#elif Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2" case.
msg117789 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-01 12:24
Patch version 3:
 - fix unicode_aswidechar if Py_UNICODE_SIZE == SIZEOF_WCHAR_T and w == NULL (return the number of characters, don't write into w!)
 - improve unicode_aswidechar() comment
msg117790 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-01 12:28
I don't know how to test "if Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2". On Windows, sizeof(wchar_t) is 2, but it looks like Python is not prepared to have Py_UNICODE != wchar_t for is Windows implementation.

wchar_t is 32 bits long on Linux and Mac OS X. So how can I test it? Or should we just drop support of "Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2"?
msg117791 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-10-01 12:44
I, too, can't think of any platforms where Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2 and I'm not sure what the previous policy has been.  Have you noticed any other code that would set a precedent?

If no one else chimes in, perhaps ask on IRC or python-dev?
msg117792 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-01 12:46
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> I don't know how to test "if Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2". On Windows, sizeof(wchar_t) is 2, but it looks like Python is not prepared to have Py_UNICODE != wchar_t for is Windows implementation.
>
> wchar_t is 32 bits long on Linux and Mac OS X. So how can I test it? Or should we just drop support of "Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2"?

You can tweak the Windows pyconfig.h to use UCS4, AFAIK, if you want to
test drive this case.

But it's probably easier to configure with "gcc -fshort-wchar" on
Linux :-)

Dropping support for this is not a good idea.
msg117795 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-10-01 13:01
> You can tweak the Windows pyconfig.h to use UCS4, AFAIK, if you want to
> test drive this case.

I seem to recall seeing some other code that assumed Windows implied UCS2.  Proceed with caution. ;-)

> But it's probably easier to configure with "gcc -fshort-wchar" on
> Linux :-)

libc will still be using sizeof(wchar_t) == 4, though.  Won't that cause Bad Things to happen when calling libc wide-character functions?
msg117796 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-01 13:06
Daniel Stutzbach wrote:
> 
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
> 
>> You can tweak the Windows pyconfig.h to use UCS4, AFAIK, if you want to
>> test drive this case.
> 
> I seem to recall seeing some other code that assumed Windows implied UCS2.  Proceed with caution. ;-)

Probably, yes. I've never tried it myself.

>> But it's probably easier to configure with "gcc -fshort-wchar" on
>> Linux :-)
> 
> libc will still be using sizeof(wchar_t) == 4, though.  Won't that cause Bad Things to happen when calling libc wide-character functions?

Sure, but this is just about testing an interface, not running
applications :-)

Here's what the GCC man-page has to say:

       -fshort-wchar
           Override the underlying type for wchar_t to be short unsigned int instead
           of the default for the target.  This option is useful for building
           programs to run under WINE.

           Warning: the -fshort-wchar switch causes GCC to generate code that is not
           binary compatible with code generated without that switch.  Use it to
           conform to a non-default application binary interface.
msg117845 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-01 22:56
Patch version 4:
 - implement unicode_aswidechar() for 16 bits wchar_t and 32 bits Py_UNICODE
 - PyUnicode_AsWideWcharString() returns the number of wide characters excluding the nul character as does PyUnicode_AsWideChar()

For 16 bits wchar_t and 32 bits Py_UNICODE, I extracted the "as wide char" unicode functions into a small C file and compiled it with -fshort-wchar on Linux.
msg117848 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-01 23:02
Ooops, I lost my patch to fix the initial (ctypes) issue. Here is an updated patch: ctypes_nonbmp.patch (which needs aswidechar_nonbmp-4.patch).
msg117866 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-02 11:17
r85172 changes PyUnicode_AsWideCharString() (don't count the trailing nul character in the output size) and add unit tests.

r85173 patches unicode_aswidechar() to supports non-BMP characters for all known wchar_t/Py_UNICODE size combinaisons (2/2, 2/4 and 4/2).

I noticed that PyUnicode_AsWideChar() and PyUnicode_AsWideCharString() accept embeded nul characters. I don't know if it is a bug or an expected behaviour. Anyway, there is now a test for this case.
msg117868 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-02 11:55
r85174+r85177: ctypes.c_wchar supports non-BMP characters with 32 bits wchar_t => fix this issue

(I commited also an unwanted change on _testcapi to fix r85172 in r85174: r85175 reverts this change, and r85176 fixes the _testcapi bug again)
msg117869 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-02 11:59
> r85173 patches unicode_aswidechar() to supports non-BMP characters 
> for all known wchar_t/Py_UNICODE size combinaisons (2/2, 2/4 and 4/2).

Oh, and 4/4 ;-)
msg117872 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-10-02 13:15
Thanks for working on this!

Since this was a bugfix, it should be merged back into 2.7, yes?
msg117882 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-02 15:29
> Since this was a bugfix, it should be merged back into 2.7, yes?

Mmmh, the fix requires to change PyUnicode_AsWideChar() function (support non-BMP characters and surrogate pairs) (and maybe also to create PyUnicode_AsWideCharString()). I don't really want to change such important function in a stable branch (Python2).

Is it really important to support non-BMP characters for ctypes.c_wchar in Python2? I would like to say: if you want better unicode support, use Python 3. And Python 3.2 if it's possible :-)
msg117883 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-10-02 15:33
Since I noticed the bug through source code inspection and no one has reported it occurring in practice, that sounds reasonable to me.
History
Date User Action Args
2010-10-02 15:33:48stutzbachsetmessages: + msg117883
versions: - Python 2.7
2010-10-02 15:29:09vstinnersetmessages: + msg117882
2010-10-02 13:15:31stutzbachsetmessages: + msg117872
stage: test needed -> resolved
2010-10-02 11:59:14vstinnersetmessages: + msg117869
2010-10-02 11:55:58vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg117868
2010-10-02 11:17:32vstinnersetmessages: + msg117866
2010-10-01 23:02:37vstinnersetfiles: + ctypes_nonbmp.patch

messages: + msg117848
2010-10-01 22:56:28vstinnersetfiles: - aswidechar_nonbmp-3.patch
2010-10-01 22:56:25vstinnersetfiles: - aswidechar_nonbmp-2.patch
2010-10-01 22:56:16vstinnersetfiles: + aswidechar_nonbmp-4.patch

messages: + msg117845
2010-10-01 13:06:02lemburgsetmessages: + msg117796
2010-10-01 13:01:40stutzbachsetmessages: + msg117795
2010-10-01 12:46:53lemburgsetmessages: + msg117792
title: c_types.c_wchar should not assume that sizeof(wchar_t) == sizeof(Py_UNICODE) -> c_types.c_wchar should not assume that sizeof(wchar_t) == sizeof(Py_UNICODE)
2010-10-01 12:44:59stutzbachsetmessages: + msg117791
2010-10-01 12:28:35vstinnersetmessages: + msg117790
2010-10-01 12:24:56vstinnersetfiles: + aswidechar_nonbmp-3.patch

messages: + msg117789
2010-10-01 11:53:40vstinnersetfiles: - pyunicode_aswidechar_surrogates-py3k.patch
2010-10-01 11:53:25vstinnersetfiles: + aswidechar_nonbmp-2.patch

messages: + msg117787
2010-09-29 00:27:50stutzbachsetnosy: + lemburg, ezio.melotti
messages: + msg117568
2010-09-29 00:21:32vstinnersetmessages: + msg117567
2010-05-13 23:56:52vstinnersetfiles: + pyunicode_aswidechar_surrogates-py3k.patch
keywords: + patch
messages: + msg105670
2010-05-13 21:10:20vstinnersetmessages: + msg105656
2010-05-13 01:00:48vstinnersetnosy: theller, vstinner, stutzbach
components: + Unicode
2010-05-13 01:00:42vstinnersetnosy: + vstinner
2010-05-09 06:38:33stutzbachcreate