msg106235 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-05-21 12:43 |
If ./configure detects that the system's wchar_t type is compatible, it will define "#define PY_UNICODE_TYPE wchar_t" and enable certain optimizations when converting between Py_UNICODE and wchar_t (i.e., it can just do a memcpy).
Right now, ./configure considers wchar_t to be compatible if it is the same bit-width as Py_UNICODE and if wchar_t is unsigned. In practice, that means Python only uses wchar_t on Windows, which uses an unsigned 16-bit wchar_t. On Linux, wchar_t is 32-bit and signed.
In the original Unicode implementation for Python, Py_UNICODE was always 16-bit. I believe the "unsigned" requirement heralds back to that time. A 32-bit wchar_t gives us plenty of space to hold the maximum Unicode code point of 0x10FFFF, regardless of whether wchar_t is signed or unsigned.
I believe the condition could be relaxed to the following:
- wchar_t must be the same bit-width as Py_UNICODE, and
- if wchar_t is 16-bit, it must be unsigned
That would allow a UCS4 Python to use wchar_t on Linux.
I experimented by manually tweaking my pyconfig.h to treat Linux's signed 32-bit wchar_t as compatible. The unit test suite encountered no problems.
However, it's quite possible that I'm missing some important detail here. Someone familiar with the guts of Python's Unicode implementation will presumably have a much better idea of whether I have this right or not. ;-)
|
msg106236 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2010-05-21 13:19 |
The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE.
Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard?
|
msg106240 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-05-21 13:55 |
Usually you wouldn't want to cast a char directly to a Py_UNICODE, because you need to take into account the encoding of the char and map it to the appropriate Unicode character. The exception is when you're certain the char is 7-bit ASCII, which is a subset of Unicode; that's safe since 7-bit ASCII never uses the sign bit.
However, again, I'm not an expert on the internals of Python's Unicode implementation and it's possible that I'm missing something. ;) You also raise a good point about third-party code.
Your other suggestion is quite workable. ./configure could define HAVE_USABLE_WCHAR_T (which is used to enable the optimizations) when sizeof(wchar_t) == sizeof(Py_UNICODE), yet still define Py_UNICODE as unsigned. Using Google Code I could not find any instances of HAVE_USABLE_WCHAR_T being used outside of CPython itself.
Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T). The plus side is that we wouldn't be changing the semantics of anything.
|
msg106241 - (view) |
Author: Antoine Pitrou (pitrou) *  |
Date: 2010-05-21 13:59 |
> Another option would be to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T to
> enable the optimizations, instead of defined(HAVE_USABLE_WCHAR_T).
> The plus side is that we wouldn't be changing the semantics of
> anything.
I guess it's sufficient, indeed.
Besides, the optimizations don't really seem to be used in any critical
paths anyway...
|
msg106242 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-05-21 14:05 |
Yeah, this is a "I noticed this small optimization while working on something else" kind of thing. ;) ("something else" being Issue8654)
I can make a patch to change the #if's to test Py_UNICODE_SIZE == SIZEOF_WCHAR_T, though I'll give others a few days to chime in first.
|
msg106522 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-05-26 11:21 |
Antoine Pitrou wrote:
>
> Antoine Pitrou <pitrou@free.fr> added the comment:
>
> The problem with a signed Py_UNICODE is implicit sign extension (rather than zero extension) in some conversions, for example from "char" or "unsigned char" to "Py_UNICODE". The effects could go anywhere from incorrect results to plain crashes. Not only in our code, but in C extensions relying on the unsignedness of Py_UNICODE.
Right.
The Unicode code was written with an unsigned data type in mind (range
checks, conversions, etc.). We'd have to do some serious code review to
allow switching to a signed data type.
> Is there a way to enable those optimizations while keeping an unsigned Py_UNICODE type? It seems Py_UNICODE doesn't have to be typedef'ed to wchar_t, it can be defined to be an unsigned integer of the same width. Or would it break some part of the C standard?
The memcpy optimizations don't rely on the unsignedness of
wchar_t, so they would work just as well.
|
msg114238 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-08-18 16:09 |
Attached is a patch implementing the change agreed upon in the earlier discussion. This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux).
|
msg114242 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-08-18 16:18 |
Daniel Stutzbach wrote:
>
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
>
> Attached is a patch implementing the change agreed upon in the earlier discussion. This will allow wchar_t <-> Py_UNICODE conversions to use memcpy on systems where wchar_t and Py_UNICODE have the same size but different signs (e.g., Linux).
Please make sure that those places where you replaced HAVE_USABLE_WCHAR_T
are enclosed in
#ifdef HAVE_WCHAR_H
...
#endif
sections. For unicodeobject.c that's true, not sure about the other
places.
|
msg114246 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-08-18 16:34 |
Thanks. I dug into that a little just now, and it turns out to happen automatically.
If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T. If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE. In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false.
|
msg114251 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-08-18 17:22 |
Daniel Stutzbach wrote:
>
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
>
> Thanks. I dug into that a little just now, and it turns out to happen automatically.
>
> If ./configure doesn't define HAVE_WCHAR_H then it also will not define SIZEOF_WCHAR_T. If SIZEOF_WCHAR_T is not defined, the preprocessor will treat it as 0 causing it to be unequal to Py_UNICODE_SIZE. In other words, if HAVE_WCHAR_H is not defined then SIZEOF_WCHAR_T == Py_UNICODE_SIZE will be false.
Ok, thanks for checking.
Other than that detail, I think the patch looks good.
|
msg114838 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-08-24 21:57 |
Commited in r84307.
|
msg114883 - (view) |
Author: Florent Xicluna (flox) *  |
Date: 2010-08-25 07:15 |
Hi Daniel,
there's a test failure which is related with r84307 on windows buildbot.
======================================================================
FAIL: test_cw_strings (ctypes.test.test_parameters.SimpleTypesTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
File "D:\cygwin\home\db3l\buildarea\3.x.bolen-windows\build\lib\ctypes\test\test_parameters.py", line 78, in test_cw_strings
self.assertTrue(c_wchar_p.from_param(s)._obj is s)
AssertionError: False is not True
http://www.python.org/dev/buildbot/builders/x86%20XP-4%203.x/builds/2837
|
msg114891 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-08-25 11:02 |
Thanks, I will take a look sometime today.
|
msg114893 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-08-25 11:31 |
The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows. My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows).
This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H:
http://bugs.python.org/issue4474
http://trac.wxwidgets.org/ticket/12013
The problem with my patch can be solved in one of two ways:
1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or
2. Change the #if's to: HAVE_USABLE_WCHAR_T || Py_UNICODE_SIZE == SIZEOF_WCHAR_T
I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue. Thoughts?
|
msg114894 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-08-25 11:33 |
Adding other Windows developers to the nosy list. See msg114893 where your input would be helpful.
|
msg114895 - (view) |
Author: Marc-Andre Lemburg (lemburg) *  |
Date: 2010-08-25 11:54 |
Daniel Stutzbach wrote:
>
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
>
> The underlying problem here is that SIZEOF_WCHAR_T is not defined in pyconfig.h on Windows. My patch assumed that it would be defined on all platforms where HAVE_WCHAR_H is defined (I had checked ./configure, but forgotten Windows).
>
> This has come up before and caused problems for other projects that assume including python.h will define SIZEOF_WCHAR_T on all platforms with HAVE_WCHAR_H:
> http://bugs.python.org/issue4474
> http://trac.wxwidgets.org/ticket/12013
>
> The problem with my patch can be solved in one of two ways:
> 1. In PC/pyconfig.h, #define SIZEOF_WCHAR_T 2, or
> 2. Change the #if's to: HAVE_USABLE_WCHAR_T || Py_UNICODE_SIZE == SIZEOF_WCHAR_T
>
> I prefer option #1, but it's also a more visible change than my original patch and may warrant its own issue. Thoughts?
It possible, we should do the right thing and implement #1.
One thing I'm not sure about is how other Windows compilers deal
with wchar_t, e.g. MinGW or the Borland compiler. I suppose
that they all use the standard Windows C lib, so the parameter
should be 2 for them as well, but I'm not 100% sure.
|
msg114896 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-08-25 12:08 |
On Windows, the Python headers define HAVE_USABLE_WCHAR_T and Py_UNICODE_SIZE 2, so we are already relying on sizeof(wchar_t) == 2 on Windows.
My patch ran into trouble because it inadvertently disabled that assumption in a few places, causing code to take the slow path and convert between wchar_t * and Py_UNICODE *. The test that failed checks that the fast path was taken on Windows.
|
msg114916 - (view) |
Author: Brian Curtin (brian.curtin) *  |
Date: 2010-08-25 14:43 |
+1 on option 1
|
msg114928 - (view) |
Author: Daniel Stutzbach (stutzbach)  |
Date: 2010-08-25 19:21 |
I opened a separate issue for the SIZEOF_WCHAR_T issue so I could refer to that issue number in Misc/NEWS. Fixed in r84317.
|
|
Date |
User |
Action |
Args |
2022-04-11 14:57:01 | admin | set | github: 53027 |
2010-08-25 19:21:24 | stutzbach | set | status: open -> closed
messages:
+ msg114928 |
2010-08-25 19:17:21 | stutzbach | set | dependencies:
+ PC/pyconfig.h should define SIZEOF_WCHAR_T |
2010-08-25 14:43:21 | brian.curtin | set | messages:
+ msg114916 |
2010-08-25 12:08:32 | stutzbach | set | messages:
+ msg114896 |
2010-08-25 11:54:55 | lemburg | set | messages:
+ msg114895 title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think) -> 32-bit wchar_t doesn't need to be unsigned to be usable (I think) |
2010-08-25 11:33:37 | stutzbach | set | nosy:
+ tim.golden, brian.curtin messages:
+ msg114894
|
2010-08-25 11:31:42 | stutzbach | set | messages:
+ msg114893 |
2010-08-25 11:02:17 | stutzbach | set | messages:
+ msg114891 |
2010-08-25 07:15:16 | flox | set | status: closed -> open
nosy:
+ flox messages:
+ msg114883
keywords:
+ buildbot |
2010-08-24 21:57:58 | stutzbach | set | status: open -> closed messages:
+ msg114838
keywords:
- needs review resolution: accepted stage: patch review -> resolved |
2010-08-18 17:22:18 | lemburg | set | messages:
+ msg114251 title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think) -> 32-bit wchar_t doesn't need to be unsigned to be usable (I think) |
2010-08-18 16:34:12 | stutzbach | set | messages:
+ msg114246 |
2010-08-18 16:18:15 | lemburg | set | messages:
+ msg114242 |
2010-08-18 16:09:12 | stutzbach | set | keywords:
+ patch, needs review files:
+ issue8781.patch messages:
+ msg114238
stage: needs patch -> patch review |
2010-05-26 11:21:13 | lemburg | set | messages:
+ msg106522 title: 32-bit wchar_t doesn't need to be unsigned to be usable (I think) -> 32-bit wchar_t doesn't need to be unsigned to be usable (I think) |
2010-05-21 14:05:30 | stutzbach | set | assignee: stutzbach messages:
+ msg106242 |
2010-05-21 13:59:24 | pitrou | set | messages:
+ msg106241 |
2010-05-21 13:55:03 | stutzbach | set | messages:
+ msg106240 |
2010-05-21 13:19:03 | pitrou | set | nosy:
+ loewis, lemburg, pitrou messages:
+ msg106236
|
2010-05-21 12:43:59 | stutzbach | create | |