Improve ABI compatibility between UCS2 and UCS4 builds #52900
Comments
Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions: for example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString. Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an undefined-symbol error when trying to import it.
In Python 2, only some extension modules were stung by this problem. For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object. It's basically fruitless to upload a binary egg for Python 3 to PyPI, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way).

Right now, nearly all the functions in unicodeobject.h are wrapped, but several are not. Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added.

Most extensions treat PyUnicodeObjects as opaque and do not care whether the internal representation is UCS2 or UCS4. We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view. For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE values. It will interpret the data differently on UCS2 vs. UCS4 builds, so it should be wrapped. On the other hand, PyUnicode_FromString creates a Unicode object from a char *. The caller can treat the returned object as opaque, so it should not be wrapped.

The attached patch implements that rule. It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped. "make test" passes with both UCS2 and UCS4 builds.

I previously brought this issue up on python-ideas; here's a summary of that discussion:

Zooko Wilcox-O'Hearn pointed out that my proposal is complementary to his proposal to standardize on UCS4, which would reduce the risk of extension modules built with a mismatched encoding.

Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care".

My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers. In other words, they are complementary fixes.

Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction. With my current patch, there are still two ways for that to happen:
Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions, so they will still generate a linker error if the encoding is mismatched, as desired. I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error on a mismatch, even if it doesn't call any non-opaque functions. I'll post about that in a separate comment on this bug.
Forcing the compile-time and link-time encodings to match is tricky. The goal is:
All of the solutions I've come up with have trade-offs. Here is one approach: expose PyUnicodeObject if and only if the extension #defines a special flag (PY_EXPOSE_UNICODE_OBJECT?) before including Python.h.

If an extension does NOT define the flag, then neither PyUnicodeObject nor the accessor macros for its fields will be defined. All of the opaque and non-opaque functions are still declared; the module just can't poke directly into PyUnicodeObject. Linking will succeed as long as the module avoids non-opaque functions.

If the flag IS defined, then PyUnicodeObject will be defined along with the accessor macros. The unicodeobject.h header will also arrange to require an appropriate symbol, so that linking will succeed only if the module is compiled with the same Unicode settings as Python.

The upside of this approach is that it works: Unicode-agnostic modules will link and all others will not. The drawback is that a few extension modules will need to add a #define before including Python.h. Only modules that poke into PyUnicodeObject will be affected; no change is required for modules that depend on the Unicode setting yet stick to functions (opaque or not).

Thoughts?
Adding people to the Nosy List who participated in the original thread on python-ideas, several months ago. Hope you don't mind. :-)
I've been thinking about this a bit more. There are three types of symbols in unicodeobject.h:

1. Opaque functions that behave the same on UCS2 and UCS4 builds
2. Non-opaque functions whose behavior depends on the build's Unicode width
3. Structures and macros that expose the internal representation
The functions in #2 will generate a link error if the module actually uses them. We can add some symbols next to the structures and macros (#3), such that there will always be a link error if the Unicode settings are mismatched. However, we can't tell whether the module actually uses the structure or not. The hard question is: what should be declared by default?

Option 1: Make Unicode-agnosticism the default and force anyone who cares about the Unicode setting to include a separate header. If they don't include that header, they can only call safe functions and can't poke at PyUnicodeObject's internals. If they include the header, their module will always generate a link failure if the Unicode settings are mismatched. (Guido proposed this solution in the python-ideas thread.)

Option 2: Make Unicode-dependence the default. If the compilation settings don't match, force a link failure. Allow extension authors to define a flag (Py_UNICODE_AGNOSTIC) before including Python.h to avoid defining any unsafe functions, structures, or macros. In practice, this is what Python 3 does today, except there's currently no way to declare Unicode-agnosticism.

Option 3: Go for a middle ground. Modules are Unicode-agnostic by default, unless they call a non-agnostic function (which will cause a link error if there's a mismatch). If they want to poke directly into PyUnicodeObject, they still need to include a separate header. I fear that this is the worst of both worlds, though.

The more I think about it, the more I like the first option. Maybe I should bring this up on capi-sig and try to gather a consensus?
Adding MvL because he wrote the ABI PEP, and MAL because he cares about the Unicode interface.
+1. The packaging and compatibility problems are pressing concerns for many people. Poking at PyUnicodeObject's internals seems like a pretty rare thing for an extension module to need. If I understand correctly, this option provides the best solution to the packaging and compatibility issues (over the long term, as Python and the authors of extension modules upgrade).
Daniel Stutzbach wrote:
That's true. The main point in wrapping the APIs was to make sure that …

The main difference between those two build variants is the definition …

The point of the wrapping was not to protect the APIs themselves.

If you can propose a different method of reliably protecting against …

I don't think that removing a few wrapped functions is going to help …

Extension modules can still use Py_UNICODE internally and make …

I'd much rather like to see a solution that always works, rather …

Perhaps we could do something in the module import mechanism …

Please note that UCS2 and UCS4 builds of Python are different in …

It is simply not a good idea to have modules built for a UCS2 …
I propose a different approach:
Module authors can then choose whether or not to refer to the Unicode internal representation in their module. If they do, a UCS2 version won't load into a UCS4 interpreter. If they don't refer to Py_UNICODE at all, the module can be used across the two modes.

There is a slight risk that a module may already crash before calling PyModule_Create. To prevent that, we need to specify that no Unicode API may be used before calling PyModule_Create.
Martin v. Löwis wrote:
+1. We could then get rid of the API renaming altogether.
On Sat, May 8, 2010 at 5:03 AM, Marc-Andre Lemburg wrote:

The following code will cause a link error 100% of the time iff the Unicode build settings are mismatched:

```c
#ifndef Py_UNICODE_WIDE
#define _Py_Unicode_Build_Symbol _Py_UCS2_Build_Symbol
#else
#define _Py_Unicode_Build_Symbol _Py_UCS4_Build_Symbol
#endif

extern int _Py_Unicode_Build_Symbol; /* Defined in unicodeobject.c */
static int *_Py_Unicode_Build_Symbol_Check = &_Py_Unicode_Build_Symbol;
```

In practice, I'd surround it with a bunch of #ifdefs to disable the …

That's true, but those differences are visible from pure-Python code …
Daniel Stutzbach wrote:
Are you sure this doesn't get optimized away in practice?

FWIW: I still like the import-logic solution better, since that …

Sure, though I don't see how this relates to C code relying …
On Sat, May 8, 2010 at 10:16 AM, Marc-Andre Lemburg wrote:
I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)
Can you give an example? All of the examples I can think of either:
I'm explicitly trying to protect those two cases. It's quite possible …
Did you mean to include the hunk from msg105295 as part of the patch?
On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org> wrote:

I notice that PyModule_Create currently works like this:

```c
PyAPI_FUNC(PyObject *) PyModule_Create2(struct PyModuleDef*, int apiver);

#define PyModule_Create(module) \
    PyModule_Create2(module, PYTHON_API_VERSION)
```

Instead of modifying PyModuleDef, what if we changed PyModule_Create to:

```c
PyAPI_FUNC(PyObject *) PyModule_Create3(struct PyModuleDef*, int apiver,
                                        int unicode_setting);

#define PyModule_Create(module) \
    PyModule_Create3(module, PYTHON_API_VERSION, PYTHON_UNICODE_SETTING)
```

In most cases that will Just Work, without requiring the module writer …
On Sat, May 8, 2010 at 11:02 AM, Martin v. Löwis <report@bugs.python.org> wrote:
My intention is to offer two compilation modes to extension authors:

In mode #1, importing the module should fail if the Unicode settings …

Right now I favor your idea of generating a runtime error when loading …
I'd rather modify PyModuleDef_Base to include a flags field, and …
On Sat, May 8, 2010 at 11:20 AM, Martin v. Löwis <report@bugs.python.org> wrote:
That works for me. I'll work on a patch (which may end up needing …)
Daniel Stutzbach wrote:
One of the more important cases you are missing is the following use of PyArg_ParseTuple:

```c
Py_UNICODE *x;
Py_ssize_t y;
PyArg_ParseTuple(args, "u#", &x, &y);
```

This uses the native Py_UNICODE type, but doesn't rely on any …

Same for the tuple builder:

```c
args = Py_BuildValue("(u#)", x, y);
```
On Sat, May 8, 2010 at 11:35 AM, Marc-Andre Lemburg wrote:
Thanks. I've had my head buried in c-api/unicode.html and unicodeobject.h.
My plan is to not define Py_UNICODE in Unicode-agnostic mode, to …
In Unicode-agnostic mode, instead of leaving Py_UNICODE, PyUnicodeObject, and many functions undefined, I wonder if it would be sufficient to declare Py_UNICODE like this:

```c
struct PY_UNICODE_TYPE;
typedef struct PY_UNICODE_TYPE Py_UNICODE;
```

That would allow extensions to pass opaque Py_UNICODE pointers around, but not dereference them nor evaluate sizeof(Py_UNICODE). That would make PyUnicodeObject safe, as well as PyUnicode_Encode* and several other functions (anything that doesn't manipulate individual characters, basically).
Sounds good. |
Daniel Stutzbach wrote:
And later: "… undefined, I wonder if it would be sufficient to declare Py_UNICODE like this: …"

Please make sure that these compiler tricks are portable …

I suppose you could use the buildbots to gather information …
On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org> wrote:

I just thought of another risk inherent in this approach. If the extension …
On Sun, May 9, 2010 at 10:12 AM, Marc-Andre Lemburg wrote:

Once I have a patch that I'm modestly confident in, I'll write automated tests. The tests will (at minimum): …
I wouldn't worry about this case.
Here is a new patch.

Usage, short version: By default, modules compile in Unicode-agnostic mode, where Py_UNICODE is an incomplete type. Unicode-agnostic modules will load in both UCS2 and UCS4 interpreters. If a module defines PY_REAL_PY_UNICODE before including Python.h, Py_UNICODE will be a complete type. The module can then dereference Py_UNICODE pointers and use sizeof(Py_UNICODE). Attempting to load the module into a mismatched interpreter will cause an ImportError.

Longer version: In Unicode-agnostic mode, extensions can pass around Py_UNICODE pointers freely, but they will get a compilation error if they attempt to dereference the pointer or use sizeof(Py_UNICODE). In Unicode-agnostic mode, the following are never defined: Py_UNICODE_SIZE, Py_UNICODE_WIDE, HAVE_USABLE_CHAR_T, and any functions and macros that take Py_UNICODE as a parameter or return it. Passing a Py_UNICODE pointer to a function that expects a wchar_t * (or something similar) will compile, but will generate an incompatible-pointer-types warning.

Per MvL's suggestion, the patch adds a flags field to PyModuleDef_Base and defines two flags: one to indicate whether the module depends on the Unicode width, and another to indicate the width the module was compiled with. These flags are checked in PyModule_Create2. The patch includes unit tests to ensure that importing works or fails in the correct cases.

19 of the 77 modules in Modules/ needed PY_REAL_PY_UNICODE. 18 of them failed to compile without it (good!). 1 of them (_tkinter.c) emitted an "incompatible pointer types" warning without it, but compiled anyway (resulting in bad code: without evidence to the contrary, it assumed Python was UCS2).

I ran the test suite with a UCS2 Python and a UCS4 Python on Linux. I had hoped to run the test suite on Windows as well, but many tests are failing on Windows even without the patch.
I just noticed that the previous "new" patch I uploaded was actually my original patch. Oops. Anyway...

In a discussion on the capi list, a few people objected to compiling in Unicode-agnostic mode by default (e.g., it would be annoying for Cython). Attached is a new patch that compiles in Unicode-agnostic mode only when Py_UNICODE_AGNOSTIC is defined (and brings the patch up to date with other changes to the py3k branch). It includes automated tests, and I have tested UCS2 Linux, UCS4 Linux, and Windows builds.

After the patch, trying to import a module with a mismatched Unicode setting looks like this:

```
>>> import _testunicodemismatched
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: module's Unicode representation does not match the interpreter's
```

Here's a bird's-eye view of the patch:

- moduleobject.h: Add a flags field to PyModuleDef_Base
- …

I'm pretty happy with the patch. In the earlier discussion here, there was generally positive feedback for the overall idea. Someone want to do a patch review?
+#undef HAVE_USABLE_CHAR_T

Isn't it HAVE_USABLE_WCHAR_T?

+struct PY_UNICODE_STRUCT;

Why not simply …

The _testunicodeagnostic module could have a simple test method making a couple of calls to PyUnicode_* functions.
Thanks, Antoine. Those are all good suggestions. Attached is an updated patch. Marc-Andre, Martin: would you like to look this over as well?
PEP 393 has been accepted: there are no more narrow or wide builds, and Py_UNICODE is now always wchar_t. We also have a stable ABI which doesn't depend on the internal structure of Unicode strings. Since Python 3.2, the name of a dynamic library is also different depending on whether the module was compiled in narrow or wide mode; see PEP 3149. I think that is enough. Can we close the issue?
Gladly. :-)