classification
Title: Improve ABI compatibility between UCS2 and UCS4 builds
Type: behavior Stage: resolved
Components: Interpreter Core, Unicode Versions: Python 3.2
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: stutzbach Nosy List: Arfrever, belopolsky, eric.araujo, lemburg, loewis, pitrou, r.david.murray, scoder, stutzbach, vstinner, zooko
Priority: normal Keywords: needs review, patch

Created on 2010-05-07 19:02 by stutzbach, last changed 2011-09-29 19:39 by stutzbach. This issue is now closed.

Files
File name Uploaded Description
unicode.patch stutzbach, 2010-05-09 22:14
unicode-2.patch stutzbach, 2010-09-03 11:05 Updated patch
unicode-3.patch stutzbach, 2010-09-03 17:36 Updated based on Antoine's feedback
Messages (31)
msg105222 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-07 19:02
Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4.  To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions.  For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.

Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:

        undefined symbol: PyUnicodeUCS2_FromString

In Python 2, only some extension modules were stung by this problem.  For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object.  It's basically fruitless to upload a binary egg for Python 3 to PyPI, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way).

Right now, nearly all the functions in unicodeobject.h are wrapped.  Several functions are not.  Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added.

Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4.  We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view.

For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects.  It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped.

On the other hand, PyUnicode_FromString creates a Unicode object from a char *.  The caller can treat the returned object as opaque, so the function should not be wrapped.

The attached patch implements that rule.  It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped.  "make test" works with both UCS2 and UCS4 builds.

I previously brought this issue up on python-ideas, see:
http://mail.python.org/pipermail/python-ideas/2009-November/006543.html

Here's a summary of that discussion:

Zooko Wilcox-O'Hearn pointed out that my proposal is complementary to his proposal to standardize on UCS4, which would reduce the risk of extension modules being built with a mismatched encoding.

Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require.  PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care".  My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers.  In other words, they are complementary fixes.

Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction.

With my current patch, there are still two ways for that to happen:

1) The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or

2) The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros.

Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions.  Consequently, they will still generate a linker error if the encoding is mismatched, as desired.

I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions.  I'll post about that in a separate comment to this bug.
msg105224 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-07 19:24
Forcing the compile-time and link-time encodings to match is tricky.  The goal is:

- Allow Unicode-agnostic modules to link, regardless of unicode settings
- Cause a link failure if the unicode settings are mismatched for a module that pokes into PyUnicodeObject

All of the solutions I've come up with have trade-offs.  Here is one approach:

Expose PyUnicodeObject if and only if the extension #defines a special flag (PY_EXPOSE_UNICODE_OBJECT?) before including Python.h.

If an extension does NOT define the flag, then PyUnicodeObject will not be defined nor the accessor macros for accessing its fields.  All of the opaque and non-opaque functions are still defined; the module just can't poke directly into PyUnicodeObject.  Linking will succeed as long as the module avoids non-opaque functions.

If the flag IS defined, then PyUnicodeObject will be defined along with the accessor macros.  The unicodeobject.h header will also arrange to require an appropriate symbol so that linking will succeed only if the module is compiled with the same Unicode settings as Python.

The upside of this approach is that it works: Unicode-agnostic modules will link and all others will not.

The drawback is a few extension modules will need to have a #define before including Python.h.  Only modules that poke into PyUnicodeObject will be impacted.  No change will be required for modules that depend on the Unicode setting yet stick to functions (opaque or not).

Thoughts?
msg105225 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-07 19:28
Adding people to the Nosy List who participated in the original thread on python-ideas, several months ago.  Hope you don't mind. :-)
msg105239 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-07 21:54
I've been thinking about this a bit more.  There are three types of symbols in unicodeobject.h:

1. Functions that are always safe to use
2. Functions that are only safe if the module is compiled with the same Unicode settings as Python
3. Structures and macros that are only safe if the module is compiled with the same Unicode settings as Python

The functions in #2 will generate a link error if the module actually uses them.  

We can add some symbols next to the structures and macros (#3), such that there will always be a link error if the Unicode settings are mismatched.  However, we can't tell if the module actually uses the structure or not.

The hard question is: what should be declared by default?

Option 1: Make Unicode-agnosticism the default and force anyone who cares about the Unicode setting to include a separate header.  If they don't include that header, they can only call safe functions and can't poke at PyUnicodeObject's internals.  If they include the header, their module will always generate a link failure if the Unicode settings are mismatched.  (Guido proposed this solution in the python-ideas thread)

Option 2: Make Unicode-dependence the default.  If the compilation settings don't match, force a link failure.  Allow extension authors to define a flag (Py_UNICODE_AGNOSTIC) before including Python.h to avoid defining any unsafe functions, structures, or macros.  In practice, this is what Python 3 does today, except there's currently no way to declare Unicode-agnosticism.

Option 3: Go for a middle ground.  Modules are Unicode agnostic by default, unless they call a non-agnostic function (which will cause a link error if there's a mismatch).  If they want to poke directly into PyUnicodeObject, they still need to include a separate header.  I fear that this is the worst of both worlds, though.

The more I think about it, the more I like the first option.

Maybe I should bring this up on capi-sig and try to gather a consensus?
msg105242 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2010-05-07 22:38
Adding MvL because he wrote the ABI PEP, and MAL because he cares about the Unicode interface.
msg105260 - (view) Author: Zooko O'Whielacronx (zooko) Date: 2010-05-08 04:46
> Option 1: Make Unicode-agnosticism the default and force anyone who cares about the Unicode setting to include a separate header.  If they don't include that header, they can only call safe functions and can't poke at PyUnicodeObject's internals.  If they include the header, their module will always generate a link failure if the Unicode settings are mismatched.  (Guido proposed this solution in the python-ideas thread)

+1

The packaging and compatibility problems are pressing concerns for many people. Poking at PyUnicodeObject's internals seems like a pretty rare need for an extension module. If I understand correctly, this option provides the best solution to the packaging and compatibility issues (over the long term, as Python and the authors of extension modules upgrade).
msg105272 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-08 10:03
Daniel Stutzbach wrote:
> 
> New submission from Daniel Stutzbach <daniel@stutzbachenterprises.com>:
> 
> Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4.  To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions.  For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.
> 
> Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:
> 
>         undefined symbol: PyUnicodeUCS2_FromString
> 
> In Python 2, only some extension modules were stung by this problem.  For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object.  It's basically fruitless to upload a binary egg for Python 3 to PyPi, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way).
> 
> Right now, nearly all the functions in unicodeobject.h are wrapped.  Several functions are not.  Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added.

That's true. The main point in wrapping the APIs was to make sure that
if an extension module uses Unicode, it will likely use one of the
wrapped APIs and then be protected by the name mangling to prevent
extensions compiled as UCS2 to be loaded into a UCS4 interpreter
and vice-versa.

The main difference between those two build variants is the definition
of Py_UNICODE. Unfortunately, there's no way to have the linker check
that type definition. The wrapping was chosen as alternative protection.

> Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4.  We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view.
> 
> For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects.  It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped.
>
> On the other hand, PyUnicode_FromString creates a Unicode object from a char *.  The caller can treat the returned object as opaque, so the function should not be wrapped.

The point of the wrapping was not to protect the APIs themselves.
It is just meant to be able to use the linker as protection device
when loading a module.

If you can propose a different method of reliably protecting against
mixed Unicode build module loads, that would be great. We could then
get rid of the wrapping altogether.

> The attached patch implements that rule.  It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped.  "make test" works with both UCS2 and UCS4 builds.

I don't think that removing a few wrapped functions is going to help
with the problem.

Extension modules can still use Py_UNICODE internally and make
use of the macros we have for accessing the PyUnicodeObject
internals. You simply don't catch such usage with wrapping
the APIs - as I said above: it's not a perfect solution, but
only a method that works most of the time.

I'd much rather like to see a solution that always works rather
than making mixed Unicode build extension loading easier.

Perhaps we could do something in the module import mechanism
to check for Unicode build compatibility (much like the Python API
version check, but with raising an error instead of just issuing
a warning).

> I previously brought this issue up on python-ideas, see:
> http://mail.python.org/pipermail/python-ideas/2009-November/006543.html
> 
> Here's a summary of that discussion:
> 
> Zooko Wilcox-O'Hearn pointed out that my proposal is complimentary to his proposal to standardize on UCS4, to reduce the risk of extension modules built with a mismatched encoding.
> 
> Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require.  PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care".  My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers.  In other words, they are complimentary fixes.
> 
> Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction.
> 
> With my current patch, there are still two ways for that to happen:
> 
> 1) The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or
> 
> 2) The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros.
> 
> Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions.  Consequently, they will still generate a linker error if the encoding is mismatched, as desired.
> 
> I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions.  I'll post about that in a separate comment to this bug.

Please note that UCS2 and UCS4 builds of Python are different in
more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
use surrogates when converting between Unicode and bytes which
UCS4 don't, sys.maxunicode is different, range checks use different
bounds, unichr() behaves differently, etc. etc.

It is simply not a good idea to have modules build for a UCS2
interpreter work in a UCS4 interpreter and vice-versa.
msg105284 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-05-08 13:07
I propose a different approach:

1. Add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse to load the module, instead of having the dynamic linker do so.
2. Provide a mode for the header files where Py_UNICODE is not defined. Add another flag to PyModuleDef indicating whether that mode was used when compiling the extension.

Module authors then can make a choice whether or not to refer to the Unicode internal representation in their module. If they do, a UCS-2 version won't load into a UCS-4 interpreter. If they don't refer to Py_UNICODE at all, the module can be used across the two modes.

There is a slight risk that a module may already crash before calling PyModule_Create. To prevent that, we need to specify that no Unicode API must be used before calling PyModule_Create.
msg105286 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-08 13:15
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> I propose a different approach:
> 
> 1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, instead of having the dynamic linker do so.
> 2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension.
> 
> Module authors then can make a choice whether or not to refer to the Unicode internal representation in their module. If they do, a UCS-2 version won't load into a UCS-4 interpreter. If they don't refer to Py_UNICODE at all, the module can be used across the two modes.
> 
> There is a slight risk that a module may already crash before calling PyModule_Create. To prevent that, we need to specify that no Unicode API must be used before calling PyModule_Create.

+1

We could then get rid of the API renaming altogether.
msg105295 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-08 15:01
On Sat, May 8, 2010 at 5:03 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
> If you can propose a different method of reliably protecting against
> mixed Unicode build module loads, that would be great. We could then
> get rid off the wrapping altogether.

The following code will cause a link error 100% of the time iff
there's a mismatch:

#ifndef Py_UNICODE_WIDE
#define _Py_Unicode_Build_Symbol _Py_UCS2_Build_Symbol
#else
#define _Py_Unicode_Build_Symbol _Py_UCS4_Build_Symbol
#endif
extern int _Py_Unicode_Build_Symbol; /* Defined in unicodeobject.c */
static int *_Py_Unicode_Build_Symbol_Check = &_Py_Unicode_Build_Symbol;

In practice, I'd surround it with a bunch of #ifdefs to disable the
"defined but not used" warning in gcc and MSVC.

> Please note that UCS2 and UCS4 builds of Python are different in
> more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
> use surrogates when converting between Unicode and bytes which
> UCS4 don't, sys.maxunicode is different, range checks use different
> bounds, unichr() behaves differently, etc. etc.

That's true, but those differences are visible from pure-Python code
as well aren't they?
msg105297 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-08 15:16
Daniel Stutzbach wrote:
> 
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
> 
> On Sat, May 8, 2010 at 5:03 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
>> If you can propose a different method of reliably protecting against
>> mixed Unicode build module loads, that would be great. We could then
>> get rid off the wrapping altogether.
> 
> The following code will cause a link error 100% of the time iff
> there's a mismatch:
> 
> #ifndef Py_UNICODE_WIDE
> #define _Py_Unicode_Build_Symbol _Py_UCS2_Build_Symbol
> #else
> #define _Py_Unicode_Build_Symbol _Py_UCS4_Build_Symbol
> #endif
> extern int _Py_Unicode_Build_Symbol; /* Defined in unicodeobject.c */
> static int *_Py_Unicode_Build_Symbol_Check = &_Py_Unicode_Build_Symbol;
> 
> In practice, I'd surrounded it with a bunch #ifdefs to disable the
> "defined but not used" warning in gcc and MSVC.

Are you sure this doesn't get optimized away in practice ?

FWIW: I still like the import logic solution better, since that
will make it possible for the user to see a truly useful
error message rather than just some missing symbol linker
error.

>> Please note that UCS2 and UCS4 builds of Python are different in
>> more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
>> use surrogates when converting between Unicode and bytes which
>> UCS4 don't, sys.maxunicode is different, range checks use different
>> bounds, unichr() behaves differently, etc. etc.
> 
> That's true, but those differences are visible from pure-Python code
> as well aren't they?

Sure, though, I don't see how this relates to C code relying
on these details, e.g. a C extension will probably use different
conversion code depending on whether UCS2 or UCS4 is compatible
with some external library, etc.
msg105302 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-08 15:48
On Sat, May 8, 2010 at 10:16 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
> Are you sure this doesn't get optimized away in practice ?

I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)

> Sure, though, I don't see how this relates to C code relying
> on these details, e.g. a C extension will probably use different
> conversion code depending on whether UCS2 or UCS4 is compatible
> with some external library, etc.

Can you give an example?

All of the examples I can think of either:
- poke into PyUnicodeObject's internals,
- call a Python function that exposes Py_UNICODE or PyUnicodeObject

I'm explicitly trying to protect those two cases.  It's quite possible
that I'm missing something, but I can't think of any other unsafe way
for a C extension to convert a Python Unicode object to a byte string.
msg105304 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-05-08 16:02
>> Are you sure this doesn't get optimized away in practice ?
> 
> I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)

Did you mean to include the hunk from msg105295 as part of the patch?
If so, wouldn't that defeat the whole point of the patch?
msg105307 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-08 16:11
On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org> wrote:
> 1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, instead of having the dynamic linker do so.
> 2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension.

I notice that PyModule_Create currently works like this:

PyAPI_FUNC(PyObject *) PyModule_Create2(struct PyModuleDef*, int apiver);
#define PyModule_Create(module) \
        PyModule_Create2(module, PYTHON_API_VERSION)

Instead of modifying PyModuleDef, what if we changed PyModule_Create
to something like the following?

PyAPI_FUNC(PyObject *) PyModule_Create3(struct PyModuleDef*, int apiver, int unicode_setting);
#define PyModule_Create(module) \
        PyModule_Create3(module, PYTHON_API_VERSION, PYTHON_UNICODE_SETTING)

In most cases that will Just Work, without requiring the module writer
to modify their PyModuleDef.
msg105309 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-08 16:19
On Sat, May 8, 2010 at 11:02 AM, Martin v. Löwis <report@bugs.python.org> wrote:
> Did you mean to include the hunk from msg105295 as part of the patch?
> If so, wouldn't that defeat the whole point of the patch?

My intention is to offer two compilation modes to extension authors:

1) A mode that defines Py_UNICODE, PyUnicodeObject, and various unsafe
functions and macros.
2) A mode that does not define them.

In mode #1, importing the module should fail if the Unicode settings
are mismatched.  I had meant to include the hunk from msg105295 only
in mode #1.

Right now I favor your idea of generating a runtime error when loading
the module, instead of a linker error.  Assuming we go that route, the
hunk from msg105295 will not be used at all.
msg105310 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-05-08 16:20
> In most cases that will Just Work, without requiring the module writer
> to modify their PyModuleDef.

I'd rather modify PyModuleDef_Base, to include a flags field, and
perhaps put the api version into it, as well. That's an ABI change,
unfortunately, but such a flags field may turn out useful, anyway.
msg105311 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-08 16:33
On Sat, May 8, 2010 at 11:20 AM, Martin v. Löwis <report@bugs.python.org> wrote:
> I'd rather modify PyModuleDef_Base, to include a flags field, and
> perhaps put the api version into it, as well. That's an ABI change,
> unfortunately, but such a flags field may turn out useful, anyway.

That works for me.  I'll work on a patch (which may end up needing
further revision, but it will at least give us a concrete basis for
further discussion).
msg105312 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-08 16:35
Daniel Stutzbach wrote:
> 
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
> 
> On Sat, May 8, 2010 at 10:16 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
>> Are you sure this doesn't get optimized away in practice ?
> 
> I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)
> 
>> Sure, though, I don't see how this relates to C code relying
>> on these details, e.g. a C extension will probably use different
>> conversion code depending on whether UCS2 or UCS4 is compatible
>> with some external library, etc.
> 
> Can you give an example?
> 
> All of the examples I can think of either:
> - poke into PyUnicodeObject's internals,
> - call a Python function that exposes Py_UNICODE or PyUnicodeObject
> 
> I'm explicitly trying to protect those two cases.  It's quite possible
> that I'm missing something, but I can't think of any other unsafe way
> for a C extension to convert a Python Unicode object to a byte string.

One of the more important cases you are missing is the
argument parser in Python:

Py_UNICODE *x;
Py_ssize_t y;
PyArg_ParseTuple(args, "u#", &x, &y);

This uses the native Py_UNICODE type, but doesn't rely on any
Unicode APIs.

Same for the tuple builder:

args = Py_BuildValue("(u#)", x, y);
msg105315 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-08 16:47
On Sat, May 8, 2010 at 11:35 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:
> One of the more important cases you are missing is the
> argument parser in Python:

Thanks.  I've had my head buried in c-api/unicode.html and unicodeobject.h.

> Py_UNICODE *x;
> Py_ssize_t y;
> PyArg_ParseTuple(args, "u#", &x, &y);

My plan is to not define Py_UNICODE in Unicode-agnostic mode, to
prevent that from compiling.
msg105321 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-08 17:50
In Unicode-agnostic mode, instead of leaving Py_UNICODE, PyUnicodeObject, and many functions undefined, I wonder if it would be sufficient to declare Py_UNICODE like this:

struct PY_UNICODE_TYPE;
typedef struct PY_UNICODE_TYPE Py_UNICODE;

That would allow extensions to pass opaque Py_UNICODE pointers around, but not allow them to dereference the pointers nor evaluate sizeof(Py_UNICODE).

That would make PyUnicodeObject safe, as well as PyUnicode_Encode* and several other functions (anything that doesn't manipulate individual characters, basically).
msg105327 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-05-08 19:35
> In Unicode-agnostic mode, instead of leaving Py_UNICODE,
> PyUnicodeObject, and many functions undefined, I wonder if it would
> be sufficient to declare Py_UNICODE like this:

Sounds good.
msg105398 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-09 15:12
Daniel Stutzbach wrote:
> 
> Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment:
> 
> On Sat, May 8, 2010 at 11:35 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
>> One of the more important cases you are missing is the
>> argument parser in Python:
> 
> Thanks.  I've had my head buried in c-api/unicode.html and unicodeobject.h.
> 
>> Py_UNICODE *x;
>> Py_ssize_t y;
>> PyArg_ParseTuple(args, "u#", &x, &y);
> 
> My plan is to not define Py_UNICODE in Unicode-agnostic mode, to
> prevent that from compiling.
>

And later...

> In Unicode-agnostic mode, instead of leaving Py_UNICODE, PyUnicodeObject, and many functions
> undefined, I wonder if it would be sufficient to declare Py_UNICODE like this:
>
> struct PY_UNICODE_TYPE;
> typedef struct PY_UNICODE_TYPE Py_UNICODE;
>
> That would allow extensions to pass opaque Py_UNICODE pointers around, but not allow them to
> dereference the pointers nor evaluate sizeof(Py_UNICODE).
>
> That would make PyUnicodeObject safe, as well as PyUnicode_Encode* and several other functions
> (anything that doesn't manipulate individual characters, basically).

Please make sure that these compiler tricks are portable
enough across the supported Python platforms. Just checking
with GCC 4.3 is not good enough.

I suppose you could use the buildbots to gather information
on how different compilers behave (e.g. by checking in a patch,
letting the buildbots run and then reverting it, if there's
a problem). I'm just not sure how you could check for optimization
compiler bugs/features using the buildbots.
msg105400 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-09 16:32
On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org>wrote:

> 1. add a flag to PyModuleDef, indicating whether the module was built in
> UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module,
> instead of having the dynamic linker do so.

I just thought of another risk inherent in this approach.  If the extension
module is composed of more than one C file, the extension author may
inadvertently compile the file defining the PyModuleDef in Unicode-agnostic
mode but compile another file in Unicode-sensitive mode.  Then they would
have a Unicode-sensitive extension labeled as Unicode-agnostic, which would
lead to mysterious crashes if the Unicode settings are mismatched. :-(
msg105401 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-09 16:42
On Sun, May 9, 2010 at 10:12 AM, Marc-Andre Lemburg
<report@bugs.python.org>wrote:

> I'm just not sure how you could check for optimization
> compiler bugs/features using the buildbots.
>

Once I have a patch that I'm modestly confident in, I'll write automated
tests to go with it.

The tests will (at minimum):

- Create a tweaked copy of pyconfig.h that uses the opposite Unicode
settings (UCS4 v UCS2)

- Build an extension module in Unicode-agnostic mode
- Make sure it loads and works

- Build an extension module in Unicode-sensitive mode
- Make sure it doesn't load
msg105402 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-05-09 16:53
> I just thought of another risk inherit in this approach.  If the extension
> module is composed of more than one C file, the extension author may
> inadvertently compile the file defining the PyModuleDef in Unicode-agnostic
> mode but compile another file in Unicode-sensitive mode.

I wouldn't worry about this case.
msg105419 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-05-09 22:14
Here is a new patch.

Usage, short version: 

By default, modules compile in Unicode-agnostic mode, where Py_UNICODE is an incomplete type.  Unicode-agnostic modules will load in both UCS2 and UCS4 interpreters.

If a module defines PY_REAL_PY_UNICODE before including Python.h, Py_UNICODE will be a complete type.  The module can then dereference Py_UNICODE pointers and use sizeof(Py_UNICODE).  Attempting to load the module into a mismatched interpreter will cause an ImportError.

Longer version:

In Unicode-agnostic mode, extensions can pass around Py_UNICODE pointers freely, but they will get a compilation error if they attempt to dereference the pointer or use sizeof(Py_UNICODE).  The following are never defined in this mode: Py_UNICODE_SIZE, Py_UNICODE_WIDE, HAVE_USABLE_CHAR_T, and any functions and macros that take Py_UNICODE as a parameter or return it.

Passing a Py_UNICODE pointer to a function that expects a wchar_t * (or something similar) will compile, but will generate an incompatible pointer types warning.

Per MvL's suggestion, the patch adds a flags field to PyModuleDef_Base and defines two flags: one indicating whether the module depends on the Unicode width, and another indicating the width the module was compiled with.  These flags are checked in PyModule_Create2.

The patch includes unit tests to ensure that importing works or doesn't work in the correct cases.

19 of the 77 modules in Modules/ needed PY_REAL_PY_UNICODE.  

18 of them failed to compile without it (good!).

1 of them (_tkinter.c) emitted an "incompatible pointer types" warning without it but compiled anyway, resulting in bad code (without evidence to the contrary, it assumed Python was UCS2).

I ran the test suite with a UCS2 Python and a UCS4 Python on Linux.  I had hoped to run the test suite on Windows as well, but many tests are failing on Windows even without the patch.
msg115437 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-09-03 11:05
I just noticed that the previous "new" patch I uploaded was actually my original patch.  Oops.  Anyway...

In a discussion on the capi list, a few people objected to compiling in Unicode-agnostic mode by default (e.g., it would be annoying for Cython).  Attached is a new patch that compiles in Unicode-agnostic mode only when Py_UNICODE_AGNOSTIC is defined (and brings the patch up to date with other changes to the py3k branch).

It includes automated tests, and I have tested UCS2 Linux, UCS4 Linux, and Windows builds.

After the patch, trying to import a module with a mismatched Unicode setting looks like this:

>>> import _testunicodemismatched
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: module's Unicode representation does not match the interpreter's

Here's a birds-eye view of the patch:

moduleobject.h: Add a flags field to PyModuleDef_Base
moduleobject.c: Detect import with mismatched Unicode setting
pyunicode.h: New file; non-agnostic code from unicodeobject.h moved here
unicodeobject.h: Deleted ~150 lines of #defines (yay!)
Everything else: tests

I'm pretty happy with the patch.  In the earlier discussion here, there was generally positive feedback for the overall idea.  Someone want to do a patch review?
msg115461 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-09-03 15:38
+#undef HAVE_USABLE_CHAR_T

Isn't it HAVE_USABLE_WCHAR_T?

+struct PY_UNICODE_STRUCT;
+typedef struct PY_UNICODE_STRUCT Py_UNICODE_STRUCT;
+#define Py_UNICODE Py_UNICODE_STRUCT

Why not simply `typedef struct PY_UNICODE_STRUCT Py_UNICODE`?

The _testunicodeagnostic module could have a simple test method making a couple of calls to PyUnicode_* functions.
msg115474 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2010-09-03 17:36
Thanks, Antoine.  Those are all good suggestions.  Attached is an updated patch.

Marc-Andre, Martin: would you like to look this over as well?
msg144619 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-09-29 19:38
PEP 393 has been accepted.  There are no longer narrow and wide builds; Py_UNICODE is now always wchar_t.

We also have a stable ABI which doesn't depend on the internal structure of Unicode strings.

In Python 3.2, the name of a dynamic library also differs depending on whether the module was compiled in narrow or wide mode.  See PEP 3149.

I think that it is enough. Can we close the issue?
msg144620 - (view) Author: Daniel Stutzbach (stutzbach) (Python committer) Date: 2011-09-29 19:39
Gladly. :-)
History
Date User Action Args
2011-09-29 19:39:30  stutzbach       set     status: open -> closed; resolution: out of date; messages: + msg144620; stage: patch review -> resolved
2011-09-29 19:38:10  vstinner        set     messages: + msg144619
2010-12-29 01:46:04  eric.araujo     set     nosy: + eric.araujo
2010-12-07 17:48:17  belopolsky      set     nosy: + belopolsky
2010-09-03 17:36:48  stutzbach       set     files: + unicode-3.patch; messages: + msg115474
2010-09-03 15:38:44  pitrou          set     nosy: + pitrou; messages: + msg115461
2010-09-03 14:57:22  gvanrossum      set     nosy: - gvanrossum
2010-09-03 11:05:24  stutzbach       set     keywords: + needs review; files: + unicode-2.patch; messages: + msg115437; stage: needs patch -> patch review
2010-05-09 22:14:49  stutzbach       set     files: + unicode.patch; messages: + msg105419
2010-05-09 21:22:51  stutzbach       set     files: - unicode.patch
2010-05-09 16:53:48  loewis          set     messages: + msg105402
2010-05-09 16:45:55  stutzbach       set     files: - unnamed
2010-05-09 16:45:51  stutzbach       set     files: - unnamed
2010-05-09 16:42:33  stutzbach       set     files: + unnamed; messages: + msg105401
2010-05-09 16:32:09  stutzbach       set     files: + unnamed; messages: + msg105400
2010-05-09 15:12:35  lemburg         set     messages: + msg105398
2010-05-08 19:35:22  loewis          set     messages: + msg105327
2010-05-08 17:55:07  Arfrever        set     nosy: + Arfrever
2010-05-08 17:50:21  stutzbach       set     messages: + msg105321
2010-05-08 16:47:27  stutzbach       set     messages: + msg105315
2010-05-08 16:35:34  lemburg         set     messages: + msg105312
2010-05-08 16:33:59  stutzbach       set     messages: + msg105311
2010-05-08 16:20:35  loewis          set     messages: + msg105310
2010-05-08 16:19:27  stutzbach       set     messages: + msg105309
2010-05-08 16:11:05  stutzbach       set     messages: + msg105307
2010-05-08 16:02:49  loewis          set     messages: + msg105304
2010-05-08 15:48:38  stutzbach       set     messages: + msg105302
2010-05-08 15:16:07  lemburg         set     messages: + msg105297
2010-05-08 15:01:06  stutzbach       set     messages: + msg105295
2010-05-08 13:15:28  lemburg         set     messages: + msg105286
2010-05-08 13:07:46  loewis          set     messages: + msg105284
2010-05-08 10:03:50  lemburg         set     messages: + msg105272
2010-05-08 04:46:24  zooko           set     messages: + msg105260
2010-05-07 22:42:25  vstinner        set     nosy: + vstinner
2010-05-07 22:38:23  r.david.murray  set     nosy: + r.david.murray, lemburg, loewis; messages: + msg105242
2010-05-07 21:54:29  stutzbach       set     messages: + msg105239
2010-05-07 19:28:56  stutzbach       set     nosy: + gvanrossum, zooko, scoder; messages: + msg105225
2010-05-07 19:24:09  stutzbach       set     messages: + msg105224
2010-05-07 19:04:05  stutzbach       set     files: + unicode.patch; keywords: + patch
2010-05-07 19:02:43  stutzbach       create