Issue 8781: 32-bit wchar_t doesn't need to be unsigned to be usable (I think)

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/53027

classification

Title:	32-bit wchar_t doesn't need to be unsigned to be usable (I think)
Type:	performance	Stage:	resolved
Components:	Interpreter Core, Unicode	Versions:	Python 3.2

process

Status:	closed	Resolution:	accepted
Dependencies:	9684	Superseder:
Assigned To:	stutzbach	Nosy List:	brian.curtin, flox, lemburg, loewis, pitrou, stutzbach, tim.golden
Priority:	normal	Keywords:	buildbot, patch

Created on 2010-05-21 12:43 by stutzbach, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue8781.patch	stutzbach, 2010-08-18 16:09

Messages (19)
msg106235 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-05-21 12:43
If ./configure detects that the system's wchar_t type is compatible, it will define "#define PY_UNICODE_TYPE wchar_t" and enable certain optimizations when converting between Py_UNICODE and wchar_t (i.e., it can just do a memcpy). Right now, ./configure considers wchar_t to be compatible if it is the same bit-width as Py_UNICODE and if wchar_t is unsigned. In practice, that means Python only uses wchar_t on Windows, which uses an unsigned 16-bit wchar_t. On Linux, wchar_t is 32-bit and signed. In the original Unicode implementation for Python, Py_UNICODE was always 16-bit. I believe the "unsigned" requirement heralds back to that time. A 32-bit wchar_t gives us plenty of space to hold the maximum Unicode code point of 0x10FFFF, regardless of whether wchar_t is signed or unsigned. I believe the condition could be relaxed to the following: - wchar_t must be the same bit-width as Py_UNICODE, and - if wchar_t is 16-bit, it must be unsigned That would allow a UCS4 Python to use wchar_t on Linux. I experimented by manually tweaking my pyconfig.h to treat Linux's signed 32-bit wchar_t as compatible. The unit test suite encountered no problems. However, it's quite possible that I'm missing some important detail here. Someone familiar with the guts of Python's Unicode implementation will presumably have a much better idea of whether I have this right or not. ;-)
msg106236 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-05-21 13:19
The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE. Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard?
msg106240 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-05-21 13:55
Usually you wouldn't want to cast a char directly to a Py_UNICODE, because you need to take into account the encoding of the char and map it to the appropriate Unicode character. The exception is when you're certain the char is 7-bit ASCII, which is a subset of Unicode; that's safe since 7-bit ASCII never uses the sign bit. However, again, I'm not an expert on the internals of Python's Unicode implementation and it's possible that I'm missing something. ;) You also raise a good point about third-party code. Your other suggestion is quite workable. ./configure could define HAVE_USABLE_WCHAR_T (which is used to enable the optimizations) when sizeof(wchar_t) == sizeof(Py_UNICODE), yet still define Py_UNICODE as unsigned. Using Google Code I could not find any instances of HAVE_USABLE_WCHAR_T being used outside of CPython itself. Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T). The plus side is that we wouldn't be changing the semantics of anything.
msg106241 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2010-05-21 13:59
> Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to > enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T). > The plus side is that we wouldn't be changing the semantics of > anything. I guess it's sufficient, indeed. Besides, the optimizations don't really seem to be used in any critical paths anyway...
msg106242 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-05-21 14:05
Yeah, this is a "I noticed this small optimization while working on something else" kind of thing. ;) ("something else" being Issue8654) I can make a patch to change the #if's to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T, though I'll give others a few days to chime in first.
msg106522 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-26 11:21
Antoine Pitrou wrote: > > Antoine Pitrou <pitrou@free.fr> added the comment: > > The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE. Right. The Unicode code was written with an unsigned data type in mind (range checks, conversions, etc.). We'd have to do some serious code review to allow switching to a signed data type. > Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard? The memcpy optimizations don't rely on the unsignedness of wchar_t, so they would work just as well.
msg114238 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-08-18 16:09
Attached is a patch implementing the change agreed upon in the earlier discussion. This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux).
msg114242 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-08-18 16:18
Daniel Stutzbach wrote: > > Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment: > > Attached is a patch implementing the change agreed upon in the earlier discussion. This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux). Please make sure that those places where you replaced HAVE_USABLE_WCHAR_T are enclosed in #ifdef HAVE_WCHAR_H ... #endif sections. For unicodeobject.c that's true, not sure about the other places.
msg114246 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-08-18 16:34
Thanks. I dug into that a little just now, and it turns out to happen automatically. If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T. If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE. In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false.
msg114251 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-08-18 17:22
Daniel Stutzbach wrote: > > Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment: > > Thanks. I dug into that a little just now, and it turns out to happen automatically. > > If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T. If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE. In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false. Ok, thanks for checking. Other than that detail, I think the patch looks good.
msg114838 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-08-24 21:57
Commited in r84307.
msg114883 - (view)	Author: Florent Xicluna (flox) *	Date: 2010-08-25 07:15
Hi Daniel, there's a test failure which is related with r84307 on windows buildbot. ====================================================================== FAIL: test_cw_strings (ctypes.test.test_parameters.SimpleTypesTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\ctypes\test\test_parameters.py", line 78, in test_cw_strings self.assertTrue(c_wchar_p.from_param(s)._obj is s) AssertionError: False is not True http://www.python.org/dev/buildbot/builders/x86%20XP-4%203.x/builds/2837
msg114891 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-08-25 11:02
Thanks, I will take a look sometime today.
msg114893 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-08-25 11:31
The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows. My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows). This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H: http://bugs.python.org/issue4474 http://trac.wxwidgets.org/ticket/12013 The problem with my patch can be solved in one of two ways: 1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or 2. Change the #if's to: HAVE_USABLE_WCHAR_T \|\| Py_UNICODE_SIZE == SIZEOF_WCHAR_T I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue. Thoughts?
msg114894 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-08-25 11:33
Adding other Windows developers to the nosy list. See msg114893 where your input would be helpful.
msg114895 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-08-25 11:54
Daniel Stutzbach wrote: > > Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment: > > The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows. My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows). > > This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H: > http://bugs.python.org/issue4474 > http://trac.wxwidgets.org/ticket/12013 > > The problem with my patch can be solved in one of two ways: > 1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or > 2. Change the #if's to: HAVE_USABLE_WCHAR_T \|\| Py_UNICODE_SIZE == SIZEOF_WCHAR_T > > I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue. Thoughts? It possible, we should do the right thing and implement #1. One thing I'm not sure about is how other Windows compilers deal with wchar_t, e.g. MinGW or the Borland compiler. I suppose that they all use the standard Windows C lib, so the parameter should be 2 for them as well, but I'm not 100% sure.
msg114896 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-08-25 12:08
On Windows, the Python headers define HAVE_USABLE_WCHAR_T and Py_UNICODE_SIZE 2, so we are already relying on sizeof(wchar_t) == 2 on Windows. My patch ran into trouble because it inadvertently disabled that assumption in a few places, causing code to take the slow path and convert between wchar_t * and Py_UNICODE *. The test that failed checks that the fast path was taken on Windows.
msg114916 - (view)	Author: Brian Curtin (brian.curtin) *	Date: 2010-08-25 14:43
+1 on option 1
msg114928 - (view)	Author: Daniel Stutzbach (stutzbach)	Date: 2010-08-25 19:21
I opened a separate issue for the SIZEOF_WCHAR_T issue so I could refer to that issue number in Misc/NEWS. Fixed in r84317.

History
Date	User	Action	Args
2022-04-11 14:57:01	admin	set	github: 53027
2010-08-25 19:21:24	stutzbach	set	status: open -> closed messages: + msg114928
2010-08-25 19:17:21	stutzbach	set	dependencies: + PC/pyconfig.h should define SIZEOF_WCHAR_T
2010-08-25 14:43:21	brian.curtin	set	messages: + msg114916
2010-08-25 12:08:32	stutzbach	set	messages: + msg114896
2010-08-25 11:54:55	lemburg	set	messages: + msg114895 title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think) -> 32-bit wchar_t doesn't need to be unsigned to be usable (I think)
2010-08-25 11:33:37	stutzbach	set	nosy: + tim.golden, brian.curtin messages: + msg114894
2010-08-25 11:31:42	stutzbach	set	messages: + msg114893
2010-08-25 11:02:17	stutzbach	set	messages: + msg114891
2010-08-25 07:15:16	flox	set	status: closed -> open nosy: + flox messages: + msg114883 keywords: + buildbot
2010-08-24 21:57:58	stutzbach	set	status: open -> closed messages: + msg114838 keywords: - needs review resolution: accepted stage: patch review -> resolved
2010-08-18 17:22:18	lemburg	set	messages: + msg114251 title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think) -> 32-bit wchar_t doesn't need to be unsigned to be usable (I think)
2010-08-18 16:34:12	stutzbach	set	messages: + msg114246
2010-08-18 16:18:15	lemburg	set	messages: + msg114242
2010-08-18 16:09:12	stutzbach	set	keywords: + patch, needs review files: + issue8781.patch messages: + msg114238 stage: needs patch -> patch review
2010-05-26 11:21:13	lemburg	set	messages: + msg106522 title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think) -> 32-bit wchar_t doesn't need to be unsigned to be usable (I think)
2010-05-21 14:05:30	stutzbach	set	assignee: stutzbach messages: + msg106242
2010-05-21 13:59:24	pitrou	set	messages: + msg106241
2010-05-21 13:55:03	stutzbach	set	messages: + msg106240
2010-05-21 13:19:03	pitrou	set	nosy: + loewis, lemburg, pitrou messages: + msg106236
2010-05-21 12:43:59	stutzbach	create