Issue 9979: Create PyUnicode_AsWideCharString() function


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/54188

classification

Title:	Create PyUnicode_AsWideCharString() function
Type:		Stage:
Components:	Interpreter Core, Unicode	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, ezio.melotti, lemburg, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-09-29 00:20 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
pyunicode_aswidecharstring-2.patch	vstinner, 2010-09-29 01:00

Messages (6)
msg117566 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-29 00:20
PyUnicode_AsWideChar() doesn't merge surrogate pairs on a system with a 32-bit wchar_t and Python compiled in narrow mode (sizeof(wchar_t) == 4 and sizeof(Py_UNICODE) == 2) => see issue #8670.

It is not easy to fix this problem because the callers of PyUnicode_AsWideChar() assume that the output (wide character) string has the same length (in characters) as the input (PyUnicode) string (that is, they assume sizeof(wchar_t) == sizeof(Py_UNICODE)). Also, PyUnicode_AsWideChar() doesn't write a nul character at the end if the output string is truncated.

To prepare this change, a new PyUnicode_AsWideCharString() function would help because it computes the size of the output buffer itself (whereas PyUnicode_AsWideChar() requires the caller to pass the output buffer as an argument).

Attached patch implements it:
-------
/* Convert the Unicode object to a wide character string. The output string
   always ends with a nul character. If size is not NULL, write the number of
   wide characters (including the final nul character) into *size.

   Returns a buffer allocated by PyMem_Alloc() (use PyMem_Free() to free it)
   on success. On error, returns NULL and *size is undefined. */

PyAPI_FUNC(wchar_t*) PyUnicode_AsWideCharString(
    PyUnicodeObject *unicode,   /* Unicode object */
    Py_ssize_t *size            /* number of characters of the result */
    );
-------
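
For illustration only, here is a minimal sketch of the contract described in the comment above, written as plain C without the Python headers. The name as_wide_string(), the uint16_t input standing in for a narrow-build Py_UNICODE buffer, and the use of malloc()/free() in place of PyMem_Alloc()/PyMem_Free() are all assumptions made for this example. It is not the code from the attached patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <wchar.h>

/* Hypothetical stand-in for the proposed API (not CPython code).
   Copies `len` 16-bit code units into a newly allocated wchar_t buffer
   that always ends with a nul character. If size is not NULL, stores the
   number of wide characters written, including the final nul, as in the
   comment quoted in this message.
   Returns NULL on allocation failure; release the result with free(). */
wchar_t *
as_wide_string(const uint16_t *src, size_t len, size_t *size)
{
    wchar_t *buf;
    size_t i;

    /* Guard against overflow in the allocation size. */
    if (len >= SIZE_MAX / sizeof(wchar_t))
        return NULL;

    buf = malloc((len + 1) * sizeof(wchar_t));
    if (buf == NULL)
        return NULL;

    /* Plain one-to-one copy: like the first patch, no surrogate merging. */
    for (i = 0; i < len; i++)
        buf[i] = (wchar_t)src[i];
    buf[len] = L'\0';

    if (size != NULL)
        *size = len + 1;
    return buf;
}
```

The point of this shape is that the caller no longer has to guess a buffer size or add the terminator itself: the function owns the allocation, so its internals can later change how many wide characters it produces without touching the callers.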
msg117570 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-29 01:00
New version of the patch:
- fix PyUnicode_AsWideCharString() :-)
- replace PyUnicode_AsWideChar() by PyUnicode_AsWideCharString() in most functions using PyUnicode_AsWideChar()
- indicate that PyUnicode_AsWideCharString() raises a MemoryError on error

Keep the call to PyUnicode_AsWideChar() in:
- Modules/getpath.c because getpath.c uses a global limitation of MAXPATHLEN+1 characters
- WCharArray_set_value() and U_set() of ctypes because the output buffer size is fixed
msg117577 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-09-29 06:54
STINNER Victor wrote:
> New submission from STINNER Victor <victor.stinner@haypocalc.com>:
>
> PyUnicode_AsWideChar() doesn't merge surrogate pairs on a system with a 32-bit wchar_t and Python compiled in narrow mode (sizeof(wchar_t) == 4 and sizeof(Py_UNICODE) == 2) => see issue #8670.
>
> It is not easy to fix this problem because the callers of PyUnicode_AsWideChar() assume that the output (wide character) string has the same length (in characters) as the input (PyUnicode) string (that is, they assume sizeof(wchar_t) == sizeof(Py_UNICODE)). Also, PyUnicode_AsWideChar() doesn't write a nul character at the end if the output string is truncated.
>
> To prepare this change, a new PyUnicode_AsWideCharString() function would help because it computes the size of the output buffer itself (whereas PyUnicode_AsWideChar() requires the caller to pass the output buffer as an argument).

Great idea!
msg117578 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2010-09-29 07:11
+1 from me as well. But shouldn't PyUnicode_AsWideCharString() merge surrogate pairs when it can? The implementation doesn't do this.
msg117586 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-29 09:14
> But shouldn't PyUnicode_AsWideCharString() merge surrogate pairs when it
> can? The implementation doesn't do this.

I don't want to do two different things at the same time. My plan is:
- create PyUnicode_AsWideCharString()
- use PyUnicode_AsWideCharString() everywhere
- patch unicode_aswidechar() (used by PyUnicode_AsWideChar() and PyUnicode_AsWideCharString()) to convert surrogates when needed

So, you agree with the API (and the documentation)?
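
As a rough illustration of the last step of the plan, converting surrogates when the source uses 16-bit code units and the destination uses 32-bit characters could look like the sketch below. The function name merge_surrogates() and the uint16_t/uint32_t types (standing in for a narrow Py_UNICODE and a 4-byte wchar_t) are assumptions for this example. It is not the unicode_aswidechar() implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not CPython code).
   Converts `len` UTF-16 code units to 32-bit code points, merging each
   valid high/low surrogate pair into a single code point. Lone surrogates
   are copied unchanged. If dst is NULL, nothing is written and only the
   number of output characters is computed, so a caller can size the
   buffer first. Returns the number of output characters (no terminator). */
size_t
merge_surrogates(const uint16_t *src, size_t len, uint32_t *dst)
{
    size_t out = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        uint32_t ch = src[i];

        /* High surrogate followed by a low surrogate: combine them. */
        if (ch >= 0xD800 && ch <= 0xDBFF && i + 1 < len
            && src[i + 1] >= 0xDC00 && src[i + 1] <= 0xDFFF) {
            ch = 0x10000 + ((ch - 0xD800) << 10)
                         + ((uint32_t)src[i + 1] - 0xDC00);
            i++;  /* the low surrogate has been consumed */
        }
        if (dst != NULL)
            dst[out] = ch;
        out++;
    }
    return out;
}
```

Because each merged pair shrinks the output by one character, the result can be shorter than the input. That is why callers that assume equal input and output lengths have to be moved to an API that reports the size before this conversion can be enabled.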
msg117592 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-09-29 10:41
I fixed this issue in multiple commits:
- r85093: create PyUnicode_AsWideCharString()
- r85094: use it in import.c
- r85095: use it for _locale.strcoll()
- r85096: use it for time.strftime()
- r85097: use it in _ctypes module

> So, you agree with the API (and the documentation)?

Well, you can now directly patch the documentation. I think that the API is simple and fine :-)

History
Date	User	Action	Args
2022-04-11 14:57:07	admin	set	github: 54188
2010-09-29 10:41:57	vstinner	set	status: open -> closed resolution: fixed messages: + msg117592
2010-09-29 09:14:24	vstinner	set	messages: + msg117586
2010-09-29 07:11:26	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg117578
2010-09-29 06:54:20	lemburg	set	messages: + msg117577
2010-09-29 01:01:02	vstinner	set	files: - pyunicode_aswidecharstring.patch
2010-09-29 01:00:57	vstinner	set	files: + pyunicode_aswidecharstring-2.patch messages: + msg117570
2010-09-29 00:28:14	stutzbach	set	nosy: + lemburg, ezio.melotti
2010-09-29 00:20:49	vstinner	create