classification
Title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think)
Type: performance Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 3.2
process
Status: closed Resolution: accepted
Dependencies: 9684 Superseder:
Assigned To: stutzbach Nosy List: brian.curtin, flox, lemburg, loewis, pitrou, stutzbach, tim.golden
Priority: normal Keywords: buildbot, patch

Created on 2010-05-21 12:43 by stutzbach, last changed 2010-08-25 19:21 by stutzbach. This issue is now closed.

Files
File name Uploaded Description Edit
issue8781.patch stutzbach, 2010-08-18 16:09
Messages (19)
msg106235 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-21 12:43
If ./configure detects that the system's wchar_t type is compatible, it will define "#define PY_UNICODE_TYPE wchar_t" and enable certain optimizations when converting between Py_UNICODE and wchar_t (i.e., it can just do a memcpy).

Right now, ./configure considers wchar_t to be compatible if it is the same bit-width as Py_UNICODE and if wchar_t is unsigned.  In practice, that means Python only uses wchar_t on Windows, which uses an unsigned 16-bit wchar_t.  On Linux, wchar_t is 32-bit and signed.

In the original Unicode implementation for Python, Py_UNICODE was always 16-bit.  I believe the "unsigned" requirement heralds back to that time.  A 32-bit wchar_t gives us plenty of space to hold the maximum Unicode code point of 0x10FFFF, regardless of whether wchar_t is signed or unsigned.

I believe the condition could be relaxed to the following:
- wchar_t must be the same bit-width as Py_UNICODE, and
- if wchar_t is 16-bit, it must be unsigned

That would allow a UCS4 Python to use wchar_t on Linux.

I experimented by manually tweaking my pyconfig.h to treat Linux's signed 32-bit wchar_t as compatible.  The unit test suite encountered no problems.

However, it's quite possible that I'm missing some important detail here.  Someone familiar with the guts of Python's Unicode implementation  will presumably have a much better idea of whether I have this right or not. ;-)
msg106236 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-05-21 13:19
The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE.

Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard?
msg106240 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-21 13:55
Usually you wouldn't want to cast a char directly to a Py_UNICODE, because you need to take into account the encoding of the char and map it to the appropriate Unicode character.  The exception is when you're certain the char is 7-bit ASCII, which is a subset of Unicode; that's safe since 7-bit ASCII never uses the sign bit.

However, again, I'm not an expert on the internals of Python's Unicode implementation and it's possible that I'm missing something.  ;) You also raise a good point about third-party code.

Your other suggestion is quite workable.  ./configure could define HAVE_USABLE_WCHAR_T (which is used to enable the optimizations) when sizeof(wchar_t) == sizeof(Py_UNICODE), yet still define Py_UNICODE as unsigned.  Using Google Code I could not find any instances of HAVE_USABLE_WCHAR_T being used outside of CPython itself.

Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T).  The plus side is that we wouldn't be changing the semantics of anything.
msg106241 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-05-21 13:59
> Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to
> enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T).
> The plus side is that we wouldn't be changing the semantics of
> anything.

I guess it's sufficient, indeed.
Besides, the optimizations don't really seem to be used in any critical
paths anyway...
msg106242 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-21 14:05
Yeah, this is a "I noticed this small optimization while working on something else" kind of thing. ;) ("something else" being Issue8654)

I can make a patch to change the #if's to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T, though I'll give others a few days to chime in first.
msg106522 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-26 11:21
Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
> The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE.

Right.

The Unicode code was written with an unsigned data type in mind (range
checks, conversions, etc.). We'd have to do some serious code review to
allow switching to a signed data type.

> Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard?

The memcpy optimizations don't rely on the unsignedness of
wchar_t, so they would work just as well.
msg114238 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-08-18 16:09
Attached is a patch implementing the change agreed upon in the earlier discussion.  This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux).
msg114242 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-08-18 16:18
Daniel Stutzbach wrote:
> 
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
> 
> Attached is a patch implementing the change agreed upon in the earlier discussion.  This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux).

Please make sure that those places where you replaced HAVE_USABLE_WCHAR_T
are enclosed in

#ifdef HAVE_WCHAR_H
...
#endif

sections. For unicodeobject.c that's true, not sure about the other
places.
msg114246 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-08-18 16:34
Thanks.  I dug into that a little just now, and it turns out to happen automatically.

If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T.  If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE.  In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false.
msg114251 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-08-18 17:22
Daniel Stutzbach wrote:
> 
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
> 
> Thanks.  I dug into that a little just now, and it turns out to happen automatically.
> 
> If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T.  If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE.  In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false.

Ok, thanks for checking.

Other than that detail, I think the patch looks good.
msg114838 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-08-24 21:57
Commited in r84307.
msg114883 - (view) Author: Florent Xicluna (flox) * (Python committer) Date: 2010-08-25 07:15
Hi Daniel,

there's a test failure which is related with r84307 on windows buildbot.

======================================================================
FAIL: test_cw_strings (ctypes.test.test_parameters.SimpleTypesTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\ctypes\test\test_parameters.py", line 78, in test_cw_strings
    self.assertTrue(c_wchar_p.from_param(s)._obj is s)
AssertionError: False is not True

http://www.python.org/dev/buildbot/builders/x86%20XP-4%203.x/builds/2837
msg114891 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-08-25 11:02
Thanks, I will take a look sometime today.
msg114893 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-08-25 11:31
The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows.  My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows).

This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H:
http://bugs.python.org/issue4474
http://trac.wxwidgets.org/ticket/12013

The problem with my patch can be solved in one of two ways:
1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or
2. Change the #if's to: HAVE_USABLE_WCHAR_T || Py_UNICODE_SIZE == SIZEOF_WCHAR_T

I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue.  Thoughts?
msg114894 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-08-25 11:33
Adding other Windows developers to the nosy list.  See msg114893 where your input would be helpful.
msg114895 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-08-25 11:54
Daniel Stutzbach wrote:
> 
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
> 
> The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows.  My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows).
> 
> This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H:
> http://bugs.python.org/issue4474
> http://trac.wxwidgets.org/ticket/12013
> 
> The problem with my patch can be solved in one of two ways:
> 1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or
> 2. Change the #if's to: HAVE_USABLE_WCHAR_T || Py_UNICODE_SIZE == SIZEOF_WCHAR_T
> 
> I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue.  Thoughts?

It possible, we should do the right thing and implement #1.

One thing I'm not sure about is how other Windows compilers deal
with wchar_t, e.g. MinGW or the Borland compiler. I suppose
that they all use the standard Windows C lib, so the parameter
should be 2 for them as well, but I'm not 100% sure.
msg114896 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-08-25 12:08
On Windows, the Python headers define HAVE_USABLE_WCHAR_T and Py_UNICODE_SIZE 2, so we are already relying on sizeof(wchar_t) == 2 on Windows.

My patch ran into trouble because it inadvertently disabled that assumption in a few places, causing code to take the slow path and convert between wchar_t * and Py_UNICODE *.  The test that failed checks that the fast path was taken on Windows.
msg114916 - (view) Author: Brian Curtin (brian.curtin) * (Python committer) Date: 2010-08-25 14:43
+1 on option 1
msg114928 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-08-25 19:21
I opened a separate issue for the SIZEOF_WCHAR_T issue so I could refer to that issue number in Misc/NEWS.  Fixed in r84317.
History
Date User Action Args
2010-08-25 19:21:24stutzbachsetstatus: open -> closed

messages: + msg114928
2010-08-25 19:17:21stutzbachsetdependencies: + PC/pyconfig.h should define SIZEOF_WCHAR_T
2010-08-25 14:43:21brian.curtinsetmessages: + msg114916
2010-08-25 12:08:32stutzbachsetmessages: + msg114896
2010-08-25 11:54:55lemburgsetmessages: + msg114895
title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think) -> 32-bit wchar_t doesn't need to be unsigned to be usable (I think)
2010-08-25 11:33:37stutzbachsetnosy: + tim.golden, brian.curtin
messages: + msg114894
2010-08-25 11:31:42stutzbachsetmessages: + msg114893
2010-08-25 11:02:17stutzbachsetmessages: + msg114891
2010-08-25 07:15:16floxsetstatus: closed -> open

nosy: + flox
messages: + msg114883

keywords: + buildbot
2010-08-24 21:57:58stutzbachsetstatus: open -> closed
messages: + msg114838

keywords: - needs review
resolution: accepted
stage: patch review -> resolved
2010-08-18 17:22:18lemburgsetmessages: + msg114251
title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think) -> 32-bit wchar_t doesn't need to be unsigned to be usable (I think)
2010-08-18 16:34:12stutzbachsetmessages: + msg114246
2010-08-18 16:18:15lemburgsetmessages: + msg114242
2010-08-18 16:09:12stutzbachsetkeywords: + patch, needs review
files: + issue8781.patch
messages: + msg114238

stage: needs patch -> patch review
2010-05-26 11:21:13lemburgsetmessages: + msg106522
title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think) -> 32-bit wchar_t doesn't need to be unsigned to be usable (I think)
2010-05-21 14:05:30stutzbachsetassignee: stutzbach
messages: + msg106242
2010-05-21 13:59:24pitrousetmessages: + msg106241
2010-05-21 13:55:03stutzbachsetmessages: + msg106240
2010-05-21 13:19:03pitrousetnosy: + loewis, lemburg, pitrou
messages: + msg106236
2010-05-21 12:43:59stutzbachcreate