classification
Title: Create PyUnicode_AsWideCharString() function
Type:
Stage:
Components: Interpreter Core, Unicode
Versions: Python 3.2
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: amaury.forgeotdarc, ezio.melotti, lemburg, vstinner
Priority: normal
Keywords: patch

Created on 2010-09-29 00:20 by vstinner, last changed 2010-09-29 10:41 by vstinner. This issue is now closed.

Files
File name Uploaded Description
pyunicode_aswidecharstring-2.patch vstinner, 2010-09-29 01:00
Messages (6)
msg117566 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-29 00:20
PyUnicode_AsWideChar() doesn't merge surrogate pairs on a system with a 32-bit wchar_t and Python compiled in narrow mode (sizeof(wchar_t) == 4 and sizeof(Py_UNICODE) == 2) => see issue #8670.

This problem is not easy to fix because the callers of PyUnicode_AsWideChar() assume that the output (wide character) string has the same length (in characters) as the input (PyUnicode) string, i.e. that sizeof(wchar_t) == sizeof(Py_UNICODE). In addition, PyUnicode_AsWideChar() doesn't write a nul character at the end if the output string is truncated.

To prepare for this change, a new PyUnicode_AsWideCharString() function would help because it computes the size of the output buffer itself (whereas PyUnicode_AsWideChar() requires the caller to pass the output buffer as an argument).

The attached patch implements it:
-------
/* Convert the Unicode object to a wide character string. The output string
   always ends with a nul character. If size is not NULL, write the number of
   wide characters (including the final nul character) into *size.

   Returns a buffer allocated by PyMem_Alloc() (use PyMem_Free() to free it) on
   success. On error, returns NULL and *size is undefined. */

PyAPI_FUNC(wchar_t*) PyUnicode_AsWideCharString(
    PyUnicodeObject *unicode,   /* Unicode object */
    Py_ssize_t *size            /* number of characters of the result */
    );
-------
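For readers who want to see the allocate-and-return pattern in isolation, here is a minimal sketch in plain C, without the Python headers. It is not the code from the patch: `widen_alloc` is a hypothetical name, `uint16_t` stands in for a narrow Py_UNICODE, and `malloc()`/`free()` stand in for the PyMem allocator. Following the comment quoted above, the value written to `*size` is assumed to count the final nul character.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for the proposed API: copy `len` 16-bit units into
   a newly allocated wchar_t buffer that always ends with a nul character.
   Returns NULL on allocation failure; the caller releases the buffer with
   free(). */
static wchar_t *
widen_alloc(const uint16_t *units, size_t len, size_t *size)
{
    wchar_t *buf;
    size_t i;

    assert(units != NULL || len == 0);

    /* Guard against overflow of (len + 1) * sizeof(wchar_t). */
    if (len > SIZE_MAX / sizeof(wchar_t) - 1)
        return NULL;

    buf = malloc((len + 1) * sizeof(wchar_t));
    if (buf == NULL)
        return NULL;

    for (i = 0; i < len; i++)
        buf[i] = (wchar_t)units[i];
    buf[len] = L'\0';

    if (size != NULL)
        *size = len + 1;    /* assumption: counts the final nul character */
    return buf;
}
```

Because the function owns the allocation, the caller no longer has to guess the buffer size, and the result is nul-terminated even when the caller does not ask for the size.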
msg117570 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-29 01:00
New version of the patch:
 - fix PyUnicode_AsWideCharString() :-)
 - replace PyUnicode_AsWideChar() with PyUnicode_AsWideCharString() in most functions using PyUnicode_AsWideChar()
 - indicate that PyUnicode_AsWideCharString() raises a MemoryError on error

Keep the call to PyUnicode_AsWideChar() in:
 - Modules/getpath.c because getpath.c uses a global limit of MAXPATHLEN+1 characters
 - WCharArray_set_value() and U_set() of ctypes because the output buffer size is fixed
msg117577 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-09-29 06:54
Great idea!
msg117578 - Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-09-29 07:11
+1 from me as well.
But shouldn't PyUnicode_AsWideCharString() merge surrogate pairs when it can? The implementation doesn't do this.
msg117586 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-29 09:14
> But shouldn't PyUnicode_AsWideCharString() merge surrogate pairs when it
> can? The implementation doesn't do this.

I don't want to do two different things at the same time. My plan is:
 - create PyUnicode_AsWideCharString()
 - use PyUnicode_AsWideCharString() everywhere
 - patch unicode_aswidechar() (used by PyUnicode_AsWideChar() and PyUnicode_AsWideCharString()) to convert surrogates when needed
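
As an illustration of what converting surrogates involves, here is a rough sketch in plain C. It is not the unicode_aswidechar() code: `merge_surrogates` is a hypothetical helper, `uint16_t` stands in for a narrow Py_UNICODE, and `uint32_t` stands in for a 32-bit wchar_t. A lone surrogate is copied unchanged, which is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: merge UTF-16 surrogate pairs from `in` into 32-bit
   code points. If `out` is NULL, only count. Returns the number of code
   points, which can be smaller than `len`. */
static size_t
merge_surrogates(const uint16_t *in, size_t len, uint32_t *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL || len == 0);

    while (i < len) {
        uint32_t ch = in[i++];

        /* High surrogate followed by a low surrogate: combine them. */
        if (ch >= 0xD800 && ch <= 0xDBFF && i < len
            && in[i] >= 0xDC00 && in[i] <= 0xDFFF) {
            ch = 0x10000 + ((ch - 0xD800) << 10) + (in[i] - 0xDC00);
            i++;
        }
        if (out != NULL)
            out[n] = ch;
        n++;
    }
    return n;
}
```

Calling it once with `out == NULL` gives the number of output characters, so the output buffer can be sized before the second, converting pass. This is also why the output length cannot be assumed to equal the input length once pairs are merged.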

So, you agree with the API (and the documentation)?
msg117592 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-29 10:41
I fixed this issue in multiple commits:
 - r85093: create PyUnicode_AsWideCharString()
 - r85094: use it in import.c
 - r85095: use it for _locale.strcoll()
 - r85096: use it for time.strftime()
 - r85097: use it in _ctypes module

> So, you agree with the API (and the documentation)?

Well, you can now directly patch the documentation. I think that the API is simple and fine :-)
History
Date User Action Args
2010-09-29 10:41:57 vstinner set status: open -> closed; resolution: fixed; messages: + msg117592
2010-09-29 09:14:24 vstinner set messages: + msg117586
2010-09-29 07:11:26 amaury.forgeotdarc set nosy: + amaury.forgeotdarc; messages: + msg117578
2010-09-29 06:54:20 lemburg set messages: + msg117577
2010-09-29 01:01:02 vstinner set files: - pyunicode_aswidecharstring.patch
2010-09-29 01:00:57 vstinner set files: + pyunicode_aswidecharstring-2.patch; messages: + msg117570
2010-09-29 00:28:14 stutzbach set nosy: + lemburg, ezio.melotti
2010-09-29 00:20:49 vstinner create