
Improve ABI compatibility between UCS2 and UCS4 builds #52900

Closed
stutzbach mannequin opened this issue May 7, 2010 · 31 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

Comments

stutzbach mannequin commented May 7, 2010

BPO 8654
Nosy @malemburg, @loewis, @abalkin, @pitrou, @scoder, @vstinner, @merwok, @bitdancer
Files
  • unicode.patch
  • unicode-2.patch: Updated patch
  • unicode-3.patch: Updated based on Antoine's feedback
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2011-09-29.19:39:30.360>
    created_at = <Date 2010-05-07.19:02:43.448>
    labels = ['interpreter-core', 'type-bug', 'expert-unicode']
    title = 'Improve ABI compatibility between UCS2 and UCS4 builds'
    updated_at = <Date 2011-09-29.19:39:30.359>
    user = 'https://bugs.python.org/stutzbach'

    bugs.python.org fields:

    activity = <Date 2011-09-29.19:39:30.359>
    actor = 'stutzbach'
    assignee = 'stutzbach'
    closed = True
    closed_date = <Date 2011-09-29.19:39:30.360>
    closer = 'stutzbach'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2010-05-07.19:02:43.448>
    creator = 'stutzbach'
    dependencies = []
    files = ['17278', '18723', '18730']
    hgrepos = []
    issue_num = 8654
    keywords = ['patch', 'needs review']
    message_count = 31.0
    messages = ['105222', '105224', '105225', '105239', '105242', '105260', '105272', '105284', '105286', '105295', '105297', '105302', '105304', '105307', '105309', '105310', '105311', '105312', '105315', '105321', '105327', '105398', '105400', '105401', '105402', '105419', '115437', '115461', '115474', '144619', '144620']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'loewis', 'zooko', 'belopolsky', 'pitrou', 'scoder', 'vstinner', 'stutzbach', 'eric.araujo', 'Arfrever', 'r.david.murray']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue8654'
    versions = ['Python 3.2']

    stutzbach mannequin commented May 7, 2010

    Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions. For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.

    Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:

        undefined symbol: PyUnicodeUCS2_FromString
    

In Python 2, only some extension modules were stung by this problem. For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object. It's basically fruitless to upload a binary egg for Python 3 to PyPI, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way).

    Right now, nearly all the functions in unicodeobject.h are wrapped. Several functions are not. Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added.

    Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4. We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view.

    For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects. It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped.

    On the other hand, PyUnicode_FromString creates a Unicode object from a char *. The caller can treat the returned object as opaque, so the function should not be wrapped.

    The attached patch implements that rule. It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped. "make test" works with both UCS2 and UCS4 builds.

    I previously brought this issue up on python-ideas, see:
    http://mail.python.org/pipermail/python-ideas/2009-November/006543.html

    Here's a summary of that discussion:

Zooko Wilcox-O'Hearn pointed out that my proposal is complementary to his proposal to standardize on UCS4, which would reduce the risk of extension modules built with a mismatched encoding.

Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care". My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers. In other words, they are complementary fixes.

    Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction.

    With my current patch, there are still two ways for that to happen:

    1. The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or

    2. The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros.

    Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions. Consequently, they will still generate a linker error if the encoding is mismatched, as desired.

    I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions. I'll post about that in a separate comment to this bug.

    @stutzbach stutzbach mannequin self-assigned this May 7, 2010
    @stutzbach stutzbach mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error labels May 7, 2010
    stutzbach mannequin commented May 7, 2010

    Forcing the compile-time and link-time encodings to match is tricky. The goal is:

    • Allow Unicode-agnostic modules to link, regardless of unicode settings
    • Cause a link failure if the unicode settings are mismatched for a module that pokes into PyUnicodeObject

    All of the solutions I've come up with have trade-offs. Here is one approach:

    Expose PyUnicodeObject if and only if the extension #defines a special flag (PY_EXPOSE_UNICODE_OBJECT?) before including Python.h.

If an extension does NOT define the flag, then neither PyUnicodeObject nor the accessor macros for its fields will be defined. All of the opaque and non-opaque functions are still defined; the module just can't poke directly into PyUnicodeObject. Linking will succeed as long as the module avoids non-opaque functions.

    If the flag IS defined, then PyUnicodeObject will be defined along with the accessor macros. The unicodeobject.h header will also arrange to require an appropriate symbol so that linking will succeed only if the module is compiled with the same Unicode settings as Python.

    The upside of this approach is that it works: Unicode-agnostic modules will link and all others will not.

The drawback is that a few extension modules will need a #define before including Python.h. Only modules that poke into PyUnicodeObject will be affected. No change will be required for modules that depend on the Unicode setting yet stick to functions (opaque or not).

    Thoughts?

    stutzbach mannequin commented May 7, 2010

    Adding people to the Nosy List who participated in the original thread on python-ideas, several months ago. Hope you don't mind. :-)

    stutzbach mannequin commented May 7, 2010

    I've been thinking about this a bit more. There are three types of symbols in unicodeobject.h:

    1. Functions that are always safe to use
    2. Functions that are only safe if the module is compiled with the same Unicode settings as Python
    3. Structures and macros that are only safe if the module is compiled with the same Unicode settings as Python

    The functions in #2 will generate a link error if the module actually uses them.

    We can add some symbols next to the structures and macros (#3), such that there will always be a link error if the Unicode settings are mismatched. However, we can't tell if the module actually uses the structure or not.

    The hard question is: what should be declared by default?

    Option 1: Make Unicode-agnosticism the default and force anyone who cares about the Unicode setting to include a separate header. If they don't include that header, they can only call safe functions and can't poke at PyUnicodeObject's internals. If they include the header, their module will always generate a link failure if the Unicode settings are mismatched. (Guido proposed this solution in the python-ideas thread)

    Option 2: Make Unicode-dependence the default. If the compilation settings don't match, force a link failure. Allow extension authors to define a flag (Py_UNICODE_AGNOSTIC) before including Python.h to avoid defining any unsafe functions, structures, or macros. In practice, this is what Python 3 does today, except there's currently no way to declare Unicode-agnosticism.

    Option 3: Go for a middle ground. Modules are Unicode agnostic by default, unless they call a non-agnostic function (which will cause a link error if there's a mismatch). If they want to poke directly into PyUnicodeObject, they still need to include a separate header. I fear that this is the worst of both worlds, though.

    The more I think about it, the more I like the first option.

    Maybe I should bring this up on capi-sig and try to gather a consensus?

@bitdancer (Member) commented:

    Adding MvL because he wrote the ABI PEP, and MAL because he cares about the Unicode interface.

    zooko mannequin commented May 8, 2010

> Option 1: Make Unicode-agnosticism the default and force anyone who cares about the Unicode setting to include a separate header. If they don't include that header, they can only call safe functions and can't poke at PyUnicodeObject's internals. If they include the header, their module will always generate a link failure if the Unicode settings are mismatched. (Guido proposed this solution in the python-ideas thread)

    +1

The packaging and compatibility problems are pressing concerns for many people. Poking at PyUnicodeObject's internals seems like a pretty rare need for an extension module. If I understand correctly, this option provides the best solution to the packaging and compatibility issues (over the long term, as Python and the authors of extension modules upgrade).

@malemburg (Member) commented:

    Daniel Stutzbach wrote:


> Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions. For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.
>
> Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:
>
>     undefined symbol: PyUnicodeUCS2_FromString
>
> In Python 2, only some extension modules were stung by this problem. For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object. It's basically fruitless to upload a binary egg for Python 3 to PyPi, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way).
>
> Right now, nearly all the functions in unicodeobject.h are wrapped. Several functions are not. Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added.

That's true. The main point in wrapping the APIs was to make sure that
if an extension module uses Unicode, it will likely use one of the
wrapped APIs and then be protected by the name mangling, which prevents
extensions compiled as UCS2 from being loaded into a UCS4 interpreter
and vice versa.

    The main difference between those two build variants is the definition
    of Py_UNICODE. Unfortunately, there's no way to have the linker check
    that type definition. The wrapping was chosen as alternative protection.

> Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4. We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view.
>
> For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects. It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped.
>
> On the other hand, PyUnicode_FromString creates a Unicode object from a char *. The caller can treat the returned object as opaque, so the function should not be wrapped.

The point of the wrapping was not to protect the APIs themselves.
It is just meant to make it possible to use the linker as a protection
device when loading a module.

If you can propose a different method of reliably protecting against
mixed Unicode build module loads, that would be great. We could then
get rid of the wrapping altogether.

> The attached patch implements that rule. It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped. "make test" works with both UCS2 and UCS4 builds.

I don't think that removing a few wrapped functions is going to help
with the problem.

Extension modules can still use Py_UNICODE internally and make
use of the macros we have for accessing the PyUnicodeObject
internals. You simply can't catch such usage by wrapping
the APIs - as I said above, it's not a perfect solution, but
a method that works most of the time.

    I'd much rather like to see a solution that always works rather
    than making mixed Unicode build extension loading easier.

    Perhaps we could do something in the module import mechanism
    to check for Unicode build compatibility (much like the Python API
    version check, but with raising an error instead of just issuing
    a warning).

> I previously brought this issue up on python-ideas, see:
> http://mail.python.org/pipermail/python-ideas/2009-November/006543.html
>
> Here's a summary of that discussion:
>
> Zooko Wilcox-O'Hearn pointed out that my proposal is complimentary to his proposal to standardize on UCS4, to reduce the risk of extension modules built with a mismatched encoding.
>
> Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care". My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers. In other words, they are complimentary fixes.
>
> Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction.
>
> With my current patch, there are still two ways for that to happen:
>
> 1. The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or
>
> 2. The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros.
>
> Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions. Consequently, they will still generate a linker error if the encoding is mismatched, as desired.
>
> I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions. I'll post about that in a separate comment to this bug.

    Please note that UCS2 and UCS4 builds of Python are different in
    more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
    use surrogates when converting between Unicode and bytes which
    UCS4 don't, sys.maxunicode is different, range checks use different
    bounds, unichr() behaves differently, etc. etc.

It is simply not a good idea to have modules built for a UCS2
interpreter work in a UCS4 interpreter and vice versa.

    loewis mannequin commented May 8, 2010

    I propose a different approach:

1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse to load the module, instead of having the dynamic linker do so.
    2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension.

    Module authors then can make a choice whether or not to refer to the Unicode internal representation in their module. If they do, a UCS-2 version won't load into a UCS-4 interpreter. If they don't refer to Py_UNICODE at all, the module can be used across the two modes.

There is a slight risk that a module may already crash before calling PyModule_Create. To prevent that, we need to specify that no Unicode API may be used before calling PyModule_Create.

@malemburg (Member) commented:

Martin v. Löwis wrote:

> I propose a different approach:
>
> 1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, instead of having the dynamic linker do so.
> 2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension.
>
> Module authors then can make a choice whether or not to refer to the Unicode internal representation in their module. If they do, a UCS-2 version won't load into a UCS-4 interpreter. If they don't refer to Py_UNICODE at all, the module can be used across the two modes.
>
> There is a slight risk that a module may already crash before calling PyModule_Create. To prevent that, we need to specify that no Unicode API must be used before calling PyModule_Create.

    +1

We could then get rid of the API renaming altogether.

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 5:03 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

> If you can propose a different method of reliably protecting against
> mixed Unicode build module loads, that would be great. We could then
> get rid off the wrapping altogether.

    The following code will cause a link error 100% of the time iff
    there's a mismatch:

    #ifndef Py_UNICODE_WIDE
    #define _Py_Unicode_Build_Symbol _Py_UCS2_Build_Symbol
    #else
    #define _Py_Unicode_Build_Symbol _Py_UCS4_Build_Symbol
    #endif
    extern int _Py_Unicode_Build_Symbol; /* Defined in unicodeobject.c */
    static int *_Py_Unicode_Build_Symbol_Check = &_Py_Unicode_Build_Symbol;

In practice, I'd surround it with a bunch of #ifdefs to disable the
"defined but not used" warning in gcc and MSVC.

> Please note that UCS2 and UCS4 builds of Python are different in
> more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
> use surrogates when converting between Unicode and bytes which
> UCS4 don't, sys.maxunicode is different, range checks use different
> bounds, unichr() behaves differently, etc. etc.

That's true, but those differences are visible from pure-Python code
as well, aren't they?

@malemburg (Member) commented:

Daniel Stutzbach wrote:

> On Sat, May 8, 2010 at 5:03 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> > If you can propose a different method of reliably protecting against
> > mixed Unicode build module loads, that would be great. We could then
> > get rid off the wrapping altogether.
>
> The following code will cause a link error 100% of the time iff
> there's a mismatch:
>
> #ifndef Py_UNICODE_WIDE
> #define _Py_Unicode_Build_Symbol _Py_UCS2_Build_Symbol
> #else
> #define _Py_Unicode_Build_Symbol _Py_UCS4_Build_Symbol
> #endif
> extern int _Py_Unicode_Build_Symbol; /* Defined in unicodeobject.c */
> static int *_Py_Unicode_Build_Symbol_Check = &_Py_Unicode_Build_Symbol;
>
> In practice, I'd surrounded it with a bunch #ifdefs to disable the
> "defined but not used" warning in gcc and MSVC.

    Are you sure this doesn't get optimized away in practice ?

    FWIW: I still like the import logic solution better, since that
    will make it possible for the user to see a truly useful
    error message rather than just some missing symbol linker
    error.

> > Please note that UCS2 and UCS4 builds of Python are different in
> > more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
> > use surrogates when converting between Unicode and bytes which
> > UCS4 don't, sys.maxunicode is different, range checks use different
> > bounds, unichr() behaves differently, etc. etc.
>
> That's true, but those differences are visible from pure-Python code
> as well aren't they?

    Sure, though, I don't see how this relates to C code relying
    on these details, e.g. a C extension will probably use different
    conversion code depending on whether UCS2 or UCS4 is compatible
    with some external library, etc.

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 10:16 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

> Are you sure this doesn't get optimized away in practice ?

    I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)

> Sure, though, I don't see how this relates to C code relying
> on these details, e.g. a C extension will probably use different
> conversion code depending on whether UCS2 or UCS4 is compatible
> with some external library, etc.

    Can you give an example?

    All of the examples I can think of either:

    • poke into PyUnicodeObject's internals,
    • call a Python function that exposes Py_UNICODE or PyUnicodeObject

    I'm explicitly trying to protect those two cases. It's quite possible
    that I'm missing something, but I can't think of any other unsafe way
    for a C extension to convert a Python Unicode object to a byte string.

    loewis mannequin commented May 8, 2010

> > Are you sure this doesn't get optimized away in practice ?
>
> I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)

    Did you mean to include the hunk from msg105295 as part of the patch?
    If so, wouldn't that defeat the whole point of the patch?

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org> wrote:

> 1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, instead of having the dynamic linker do so.
> 2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension.

    I notice that PyModule_Create currently works like this:

    PyAPI_FUNC(PyObject *) PyModule_Create2(struct PyModuleDef*, int apiver);
    #define PyModule_Create(module) \
            PyModule_Create2(module, PYTHON_API_VERSION)

    Instead of modifying PyModuleDef, what if we changed PyModule_Create
    to something like the following?

PyAPI_FUNC(PyObject *) PyModule_Create3(struct PyModuleDef*, int apiver, int unicode_setting);
#define PyModule_Create(module) \
        PyModule_Create3(module, PYTHON_API_VERSION, PYTHON_UNICODE_SETTING)

    In most cases that will Just Work, without requiring the module writer
    to modify their PyModuleDef.

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 11:02 AM, Martin v. Löwis <report@bugs.python.org> wrote:

> Did you mean to include the hunk from msg105295 as part of the patch?
> If so, wouldn't that defeat the whole point of the patch?

    My intention is to offer two compilation modes to extension authors:

    1. A mode that defines Py_UNICODE, PyUnicodeObject, and various unsafe
      functions and macros.
    2. A mode that does not define them.

    In mode #1, importing the module should fail if the Unicode settings
    are mismatched. I had meant to include the hunk from msg105295 only
    in mode #1.

    Right now I favor your idea of generating a runtime error when loading
    the module, instead of a linker error. Assuming we go that route, the
    hunk from msg105295 will not be used at all.

    loewis mannequin commented May 8, 2010

> In most cases that will Just Work, without requiring the module writer
> to modify their PyModuleDef.

    I'd rather modify PyModuleDef_Base, to include a flags field, and
    perhaps put the api version into it, as well. That's an ABI change,
    unfortunately, but such a flags field may turn out useful, anyway.

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 11:20 AM, Martin v. Löwis <report@bugs.python.org> wrote:

> I'd rather modify PyModuleDef_Base, to include a flags field, and
> perhaps put the api version into it, as well. That's an ABI change,
> unfortunately, but such a flags field may turn out useful, anyway.

    That works for me. I'll work on a patch (which may end up needing
    further revision, but it will at least give us a concrete basis for
    further discussion).

@malemburg (Member) commented:

Daniel Stutzbach wrote:

> On Sat, May 8, 2010 at 10:16 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> > Are you sure this doesn't get optimized away in practice ?
>
> I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)
>
> > Sure, though, I don't see how this relates to C code relying
> > on these details, e.g. a C extension will probably use different
> > conversion code depending on whether UCS2 or UCS4 is compatible
> > with some external library, etc.
>
> Can you give an example?
>
> All of the examples I can think of either:
>
> • poke into PyUnicodeObject's internals,
> • call a Python function that exposes Py_UNICODE or PyUnicodeObject
>
> I'm explicitly trying to protect those two cases. It's quite possible
> that I'm missing something, but I can't think of any other unsafe way
> for a C extension to convert a Python Unicode object to a byte string.

    One of the more important cases you are missing is the
    argument parser in Python:

    Py_UNICODE *x;
    Py_ssize_t y;
    PyArg_ParseTuple(args, "u#", &x, &y);

    This uses the native Py_UNICODE type, but doesn't rely on any
    Unicode APIs.

    Same for the tuple builder:

    args = Py_BuildValue("(u#)", x, y);

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 11:35 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

> One of the more important cases you are missing is the
> argument parser in Python:

    Thanks. I've had my head buried in c-api/unicode.html and unicodeobject.h.

> Py_UNICODE *x;
> Py_ssize_t y;
> PyArg_ParseTuple(args, "u#", &x, &y);

    My plan is to not define Py_UNICODE in Unicode-agnostic mode, to
    prevent that from compiling.

    stutzbach mannequin commented May 8, 2010

    In Unicode-agnostic mode, instead of leaving Py_UNICODE, PyUnicodeObject, and many functions undefined, I wonder if it would be sufficient to declare Py_UNICODE like this:

    struct PY_UNICODE_TYPE;
    typedef struct PY_UNICODE_TYPE Py_UNICODE;

    That would allow extensions to pass opaque Py_UNICODE pointers around, but not allow them to dereference the pointers nor evaluate sizeof(Py_UNICODE).

    That would make PyUnicodeObject safe, as well as PyUnicode_Encode* and several other functions (anything that doesn't manipulate individual characters, basically).

    loewis mannequin commented May 8, 2010

> In Unicode-agnostic mode, instead of leaving Py_UNICODE,
> PyUnicodeObject, and many functions undefined, I wonder if it would
> be sufficient to declare Py_UNICODE like this:

    Sounds good.

@malemburg (Member) commented:

Daniel Stutzbach wrote:

> On Sat, May 8, 2010 at 11:35 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> > One of the more important cases you are missing is the
> > argument parser in Python:
>
> Thanks. I've had my head buried in c-api/unicode.html and unicodeobject.h.
>
> > Py_UNICODE *x;
> > Py_ssize_t y;
> > PyArg_ParseTuple(args, "u#", &x, &y);
>
> My plan is to not define Py_UNICODE in Unicode-agnostic mode, to
> prevent that from compiling.

And later...

> undefined, I wonder if it would be sufficient to declare Py_UNICODE like this:
>
> struct PY_UNICODE_TYPE;
> typedef struct PY_UNICODE_TYPE Py_UNICODE;
>
> That would allow extensions to pass opaque Py_UNICODE pointers around, but not allow them to
> dereference the pointers nor evaluate sizeof(Py_UNICODE).
>
> That would make PyUnicodeObject safe, as well as PyUnicode_Encode* and several other functions
> (anything that doesn't manipulate individual characters, basically).

    Please make sure that these compiler tricks are portable
    enough across the supported Python platforms. Just checking
    with GCC 4.3 is not good enough.

    I suppose you could use the buildbots to gather information
    on how different compilers behave (e.g. by checking in a patch,
    letting the buildbots run and then reverting it, if there's
    a problem). I'm just not sure how you could check for optimization
    compiler bugs/features using the buildbots.

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented May 9, 2010

On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org> wrote:

1. add a flag to PyModuleDef, indicating whether the module was built in
  UCS-2 or UCS-4 mode. Then let the interpreter refuse to load the module,
  instead of having the dynamic linker do so.

I just thought of another risk inherent in this approach. If the extension
module is composed of more than one C file, the extension author may
inadvertently compile the file defining the PyModuleDef in Unicode-agnostic
mode but compile another file in Unicode-sensitive mode. Then they would
have a Unicode-sensitive extension flagged as Unicode-agnostic, which would lead to
mysterious crashes if the Unicode settings are mismatched. :-(

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented May 9, 2010

On Sun, May 9, 2010 at 10:12 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:

    I'm just not sure how you could check for optimization
    compiler bugs/features using the buildbots.

    Once I have a patch that I'm modestly confident in, I'll write automated
    tests to go with it.

    The tests will (at minimum):

    • Create a tweaked copy of pyconfig.h that uses the opposite Unicode
      settings (UCS4 v UCS2)

    • Build an extension module in Unicode-agnostic mode

    • Make sure it loads and works

    • Build an extension module in Unicode-sensitive mode

    • Make sure it doesn't load

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 9, 2010

I just thought of another risk inherent in this approach. If the extension
module is composed of more than one C file, the extension author may
inadvertently compile the file defining the PyModuleDef in Unicode-agnostic
mode but compile another file in Unicode-sensitive mode.

    I wouldn't worry about this case.

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented May 9, 2010

    Here is a new patch.

    Usage, short version:

    By default, modules compile in Unicode-agnostic mode, where Py_UNICODE is an incomplete type. Unicode-agnostic modules will load in both UCS2 and UCS4 interpreters.

    If a module defines PY_REAL_PY_UNICODE before including Python.h, Py_UNICODE will be a complete type. The module can then dereference Py_UNICODE pointers and use sizeof(Py_UNICODE). Attempting to load the module into a mismatched interpreter will cause an ImportError.

    Longer version:

In Unicode-agnostic mode, extensions can pass around Py_UNICODE pointers freely, but they will get a compilation error if they attempt to dereference the pointer or use sizeof(Py_UNICODE). In Unicode-agnostic mode, the following are never defined: Py_UNICODE_SIZE, Py_UNICODE_WIDE, HAVE_USABLE_CHAR_T, and any functions and macros that take Py_UNICODE as a parameter or return it.

    Passing a Py_UNICODE pointer to a function that expects a wchar_t * (or something similar) will compile, but will generate an incompatible pointer types warning.

    Per MvL's suggestion, the patch adds a flag field to PyModuleDef_Base and defines two flags: one to indicate if the module depends on the Unicode width, and another to indicate the width the module was compiled with. These flags are checked in PyModule_Create2.

    The patch includes unit tests to ensure that importing works or doesn't work in the correct cases.

    19 of the 77 modules in Modules/ needed PY_REAL_PY_UNICODE.

    18 of them failed to compile without it (good!).

1 of them (_tkinter.c) emitted an "incompatible pointer types" warning without it, but compiled anyway, resulting in bad code: without evidence to the contrary, it assumed Python was UCS2.

    I ran the test suite with a UCS2 Python and a UCS4 Python on Linux. I had hoped to run the test suite on Windows as well, but many tests are failing on Windows even without the patch.

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented Sep 3, 2010

    I just noticed that the previous "new" patch I uploaded was actually my original patch. Oops. Anyway...

    In a discussion on the capi list, a few people objected to compiling in Unicode-agnostic mode by default (e.g., it would be annoying for Cython). Attached is a new patch that compiles in Unicode-agnostic mode only when Py_UNICODE_AGNOSTIC is defined (and brings the patch up to date with other changes to the py3k branch).

    It includes automated tests, and I have tested UCS2 Linux, UCS4 Linux, and Windows builds.

    After the patch, trying to import a module with a mismatched Unicode setting looks like this:

    >>> import _testunicodemismatched
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: module's Unicode representation does not match the interpreter's

    Here's a birds-eye view of the patch:

    moduleobject.h: Add a flags field to PyModuleDef_Base
    moduleobject.c: Detect import with mismatched Unicode setting
    pyunicode.h: New file; non-agnostic code from unicodeobject.h moved here
    unicodeobject.h: Deleted ~150 lines of #defines (yay!)
    Everything else: tests

    I'm pretty happy with the patch. In the earlier discussion here, there was generally positive feedback for the overall idea. Someone want to do a patch review?

    @pitrou
    Copy link
    Member

    pitrou commented Sep 3, 2010

    +#undef HAVE_USABLE_CHAR_T

    Isn't it HAVE_USABLE_WCHAR_T?

    +struct PY_UNICODE_STRUCT;
    +typedef struct PY_UNICODE_STRUCT Py_UNICODE_STRUCT;
    +#define Py_UNICODE Py_UNICODE_STRUCT

    Why not simply typedef struct PY_UNICODE_STRUCT Py_UNICODE?

    The _testunicodeagnostic module could have a simple test method making a couple of calls to PyUnicode_* functions.

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented Sep 3, 2010

    Thanks, Antoine. Those are all good suggestions. Attached is an updated patch.

    Marc-Andre, Martin: would you like to look this over as well?

    @vstinner
    Copy link
    Member

PEP 393 has been accepted. There are no longer narrow and wide builds; Py_UNICODE is now always wchar_t.

We also have a stable ABI which doesn't depend on the internal structure of Unicode strings.

In Python 3.2, the names of dynamic libraries also differ depending on whether the module was compiled in narrow or wide mode. See PEP 3149.

    I think that it is enough. Can we close the issue?

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented Sep 29, 2011

    Gladly. :-)

    @stutzbach stutzbach mannequin closed this as completed Sep 29, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

    4 participants