Message 135366 - Python tracker

Author	lemburg
Recipients	dcoles, lemburg, pitrou, vstinner
Date	2011-05-06.20:31:02
SpamBayes Score	0.0
Marked as misclassified	No
Message-id	<4DC45A81.3000101@egenix.com>
In-reply-to	<1304709169.2.0.592458854747.issue12010@psf.upfronthosting.co.za>

Content

David Coles wrote:
> 
> David Coles <coles.david@gmail.com> added the comment:
> 
> After doing some more investigation it appears that Android's wchar_t support before android-9 is totally broken (see http://android.git.kernel.org/?p=platform/ndk.git;a=blob_plain;f=docs/STANDALONE-TOOLCHAIN.html;hb=HEAD). With android-9 you get 4 byte wchar_t and working wide character functions.
>
> Possibly of more interest for Python is that it's no longer buildable without wchar_t support. While unicodeobject is pretty good at checking HAVE_WCHAR_H, a number of modules and even pythonrun.c directly use wchar_t or functions like PyUnicode_FromWideChar without providing a fallback. Does Python 3 now require wchar_t or are these bugs? (either option seems sensible).

wchar_t should be fairly portable these days. I think the main
problem is that we never assumed sizeof(wchar_t) == 1 to be a
possibility. On Windows, wchar_t is 16 bits wide, while glibc
started out with 32 bits.

> A few other notes:
> HAVE_USABLE_WCHAR_T looks like it was a check for unsigned/>16 bits wchar_t that would allow them to be directly memcpy'd. The code in unicodeobject.c seems not to really use this anymore except (it's happy with signed or unsigned) and it looks like the check is only used for Windows now.

Note that HAVE_USABLE_WCHAR_T is only used to check whether
Python can use wchar_t as an alias for Py_UNICODE. Python's Unicode
implementation needs Py_UNICODE to be an unsigned type that is
either 2 or 4 bytes wide. If wchar_t does not provide one of these
sizes or is a signed type, Python cannot use it for Py_UNICODE
and must instead use "unsigned short".
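
As a rough sketch of the requirement above (this is not the actual
configure check; the helper names usable_for_py_unicode and
wchar_is_usable are made up for illustration), the test could look
something like this:

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not CPython code: returns 1 if a type with the
   given size and signedness could serve as Py_UNICODE as described
   above (unsigned, 2 or 4 bytes wide), else 0. */
static int usable_for_py_unicode(size_t size, int is_unsigned)
{
    return is_unsigned && (size == 2 || size == 4);
}

/* Hypothetical helper: apply the rule to the platform's wchar_t.
   (wchar_t)-1 compares greater than 0 only if wchar_t is unsigned. */
static int wchar_is_usable(void)
{
    int is_unsigned = ((wchar_t)-1 > (wchar_t)0);
    return usable_for_py_unicode(sizeof(wchar_t), is_unsigned);
}
```

With a 1-byte wchar_t, or with a signed wchar_t, wchar_is_usable()
returns 0, which corresponds to the case where Py_UNICODE would have
to fall back to "unsigned short".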

If the configure script does not detect this case, then a patch
would be helpful.

The other wchar_t C lib functions should still remain usable,
though.

> To properly support wchar_t of size 1 you would basically implement multibyte character storage either with UTF-8 or just packing two wchar_t's with UTF-16. At least in Android the distinction doesn't seem to matter as Android's internationalization/localization policy seems to be "use Java".

Python should not use wchar_t for Py_UNICODE on such platforms
and instead go with "unsigned short".

I would assume that the wchar_t C lib routines work based on UTF-8
with sizeof(wchar_t) == 1, so the PyUnicode_*WideChar*() APIs would
need to be adjusted to work more or less like the UTF-8 codecs.
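
To illustrate what "work more or less like the UTF-8 codecs" could
mean, here is a minimal sketch. It assumes, as above, that the 1-byte
units really hold UTF-8; decode_utf8_units is a hypothetical helper,
not an existing CPython API, and it does not reject overlong forms or
surrogates:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not a CPython API: decode UTF-8 stored in
   one-byte units into code points. Returns the number of code points
   written, or -1 on truncated or malformed input, or if out is too
   small. Simplification: overlong forms and surrogates are accepted. */
static long decode_utf8_units(const unsigned char *in, size_t in_len,
                              uint32_t *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    while (i < in_len) {
        unsigned char b = in[i];
        uint32_t cp;
        size_t extra;   /* number of continuation bytes expected */
        size_t k;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return -1;  /* stray continuation or invalid lead byte */

        /* i < in_len holds here, so in_len - i - 1 cannot underflow */
        if (extra > in_len - i - 1 || n >= out_cap)
            return -1;

        for (k = 1; k <= extra; k++) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80)
                return -1;          /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        out[n++] = cp;
        i += extra + 1;
    }
    return (long)n;
}
```

The point is only that one code point may span several 1-byte units,
so the conversion cannot be a plain element-by-element copy.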

History
Date	User	Action	Args
2011-05-06 20:31:05	lemburg	set	recipients: + lemburg, pitrou, vstinner, dcoles
2011-05-06 20:31:02	lemburg	link	issue12010 messages
2011-05-06 20:31:02	lemburg	create