Issue8654
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2010-05-07 19:02 by stutzbach, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
unicode.patch | stutzbach, 2010-05-09 22:14 | |||
unicode-2.patch | stutzbach, 2010-09-03 11:05 | Updated patch | ||
unicode-3.patch | stutzbach, 2010-09-03 17:36 | Updated based on Antoine's feedback |
Messages (31) | |||
---|---|---|---|
msg105222 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-07 19:02 | |
Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions. For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString. Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it: undefined symbol: PyUnicodeUCS2_FromString In Python 2, only some extension modules were stung by this problem. For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object. It's basically fruitless to upload a binary egg for Python 3 to PyPi, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way). Right now, nearly all the functions in unicodeobject.h are wrapped. Several functions are not. Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added. Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4. We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view. For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects. It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped. On the other hand, PyUnicode_FromString creates a Unicode object from a char *. The caller can treat the returned object as opaque, so the function should not be wrapped. The attached patch implements that rule. It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped. "make test" works with both UCS2 and UCS4 builds. I previously brought this issue up on python-ideas, see: http://mail.python.org/pipermail/python-ideas/2009-November/006543.html Here's a summary of that discussion: Zooko Wilcox-O'Hearn pointed out that my proposal is complimentary to his proposal to standardize on UCS4, to reduce the risk of extension modules built with a mismatched encoding. Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care". My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers. In other words, they are complimentary fixes. Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction. With my current patch, there are still two ways for that to happen: 1) The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or 2) The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros. Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions. Consequently, they will still generate a linker error if the encoding is mismatched, as desired. I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions. I'll post about that in a separate comment to this bug. |
|||
msg105224 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-07 19:24 | |
Forcing the compile-time and link-time encodings to match is tricky. The goal is: - Allow Unicode-agnostic modules to link, regardless of unicode settings - Cause a link failure if the unicode settings are mismatched for a module that pokes into PyUnicodeObject All of the solutions I've come up with have trade-offs. Here is one approach: Expose PyUnicodeObject if and only if the extension #defines a special flag (PY_EXPOSE_UNICODE_OBJECT?) before including Python.h. If an extension does NOT define the flag, then PyUnicodeObject will not be defined nor the accessor macros for accessing its fields. All of the opaque and non-opaque functions are still defined; the module just can't poke directly into PyUnicodeObject. Linking will succeed as long as the module avoids non-opaque functions. If the flag IS defined, then PyUnicodeObject will be defined along with the accessor macros. The unicodeobject.h header will also arrange to require an appropriate symbol so that linking will succeed only if the module is compiled with the same Unicode settings as Python. The upside of this approach is that it works: Unicode-agnostic modules will link and all others will not. The drawback is a few extension modules will need to have a #define before including Python.h. Only modules that poke into PyUnicodeObject will be impacted. No change will be required for modules that depend on the Unicode setting yet stick to functions (opaque or not). Thoughts? |
|||
msg105225 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-07 19:28 | |
Adding people to the Nosy List who participated in the original thread on python-ideas, several months ago. Hope you don't mind. :-) |
|||
msg105239 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-07 21:54 | |
I've been thinking about this a bit more. There are three types of symbols in unicodeobject.h: 1. Functions that are always safe to use 2. Functions that are only safe if the module is compiled with the same Unicode settings as Python 3. Structures and macros that are only safe if the module is compiled with the same Unicode settings as Python The functions in #2 will generate a link error if the module actually uses them. We can add some symbols next to the structures and macros (#3), such that there will always be a link error if the Unicode settings are mismatched. However, we can't tell if the module actually uses the structure or not. The hard question is: what should be declared by default? Option 1: Make Unicode-agnosticism the default and force anyone who cares about the Unicode setting to include a separate header. If they don't include that header, they can only call safe functions and can't poke at PyUnicodeObject's internals. If they include the header, their module will always generate a link failure if the Unicode settings are mismatched. (Guido proposed this solution in the python-ideas thread) Option 2: Make Unicode-dependence the default. If the compilation settings don't match, force a link failure. Allow extension authors to define a flag (Py_UNICODE_AGNOSTIC) before including Python.h to avoid defining any unsafe functions, structures, or macros. In practice, this is what Python 3 does today, except there's currently no way to declare Unicode-agnosticism. Option 3: Go for a middle ground. Modules are Unicode agnostic by default, unless they call a non-agnostic function (which will cause a link error if there's a mismatch). If they want to poke directly into PyUnicodeObject, they still need to include a separate header. I fear that this is the worst of both worlds, though. The more I think about it, the more I like the first option. Maybe I should bring this up on capi-sig and try to gather a consensus? |
|||
msg105242 - (view) | Author: R. David Murray (r.david.murray) * | Date: 2010-05-07 22:38 | |
Adding MvL because he wrote the ABI PEP, and MAL because he cares about the Unicode interface. |
|||
msg105260 - (view) | Author: Zooko O'Whielacronx (zooko) | Date: 2010-05-08 04:46 | |
> Option 1: Make Unicode-agnosticism the default and force anyone who cares about the Unicode setting to include a separate header. If they don't include that header, they can only call safe functions and can't poke at PyUnicodeObject's internals. If they include the header, their module will always generate a link failure if the Unicode settings are mismatched. (Guido proposed this solution in the python-ideas thread) +1 The packaging and compatibility problems are pressing concerns for many people. Poking at PyUnicodeObject's internals seems like a rare pretty rare need for an extension module to do. If I understand correctly, this option provides the best solution to the packaging and compatibility issues (over the long term, as Python and the authors of extension modules upgrade). |
|||
msg105272 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2010-05-08 10:03 | |
Daniel Stutzbach wrote: > > New submission from Daniel Stutzbach <daniel@stutzbachenterprises.com>: > > Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions. For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString. > > Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it: > > undefined symbol: PyUnicodeUCS2_FromString > > In Python 2, only some extension modules were stung by this problem. For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object. It's basically fruitless to upload a binary egg for Python 3 to PyPi, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way). > > Right now, nearly all the functions in unicodeobject.h are wrapped. Several functions are not. Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added. That's true. The main point in wrapping the APIs was to make sure that if an extension module uses Unicode, it will likely use one of the wrapped APIs and then be protected by the name mangling to prevent extensions compiled as UCS2 to be loaded into a UCS4 interpreter and vice-versa. The main difference between those two build variants is the definition of Py_UNICODE. Unfortunately, there's no way to have the linker check that type definition. The wrapping was chosen as alternative protection. > Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4. We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view. > > For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects. It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped. > > On the other hand, PyUnicode_FromString creates a Unicode object from a char *. The caller can treat the returned object as opaque, so the function should not be wrapped. The point of the wrapping was not to protect the APIs themselves. It is just meant to be able to use the linker as protection device when loading a module. If you can propose a different method of reliably protecting against mixed Unicode build module loads, that would be great. We could then get rid off the wrapping altogether. > The attached patch implements that rule. It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped. "make test" works with both UCS2 and UCS4 builds. I don't think that removing a few wrapped function is going to help with the problem. Extension modules can still use Py_UNICODE internally and make use of the macros we have for accessing the PyUnicodeObject internals. You simply don't catch such usage with wrapping the APIs - as I said above: it's not a perfect solution, but only a method that works most of the time. I'd much rather like to see a solution that always works rather than making mixed Unicode build extension loading easier. Perhaps we could do something in the module import mechanism to check for Unicode build compatibility (much like the Python API version check, but with raising an error instead of just issuing a warning). > I previously brought this issue up on python-ideas, see: > http://mail.python.org/pipermail/python-ideas/2009-November/006543.html > > Here's a summary of that discussion: > > Zooko Wilcox-O'Hearn pointed out that my proposal is complimentary to his proposal to standardize on UCS4, to reduce the risk of extension modules built with a mismatched encoding. > > Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care". My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers. In other words, they are complimentary fixes. > > Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction. > > With my current patch, there are still two ways for that to happen: > > 1) The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or > > 2) The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros. > > Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions. Consequently, they will still generate a linker error if the encoding is mismatched, as desired. > > I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions. I'll post about that in a separate comment to this bug. Please note that UCS2 and UCS4 builds of Python are different in more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds use surrogates when converting between Unicode and bytes which UCS4 don't, sys.maxunicode is different, range checks use different bounds, unichr() behaves differently, etc. etc. It is simply not a good idea to have modules build for a UCS2 interpreter work in a UCS4 interpreter and vice-versa. |
|||
msg105284 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2010-05-08 13:07 | |
I propose a different approach: 1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, instead of having the dynamic linker do so. 2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension. Module authors then can make a choice whether or not to refer to the Unicode internal representation in their module. If they do, a UCS-2 version won't load into a UCS-4 interpreter. If they don't refer to Py_UNICODE at all, the module can be used across the two modes. There is a slight risk that a module may already crash before calling PyModule_Create. To prevent that, we need to specify that no Unicode API must be used before calling PyModule_Create. |
|||
msg105286 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2010-05-08 13:15 | |
Martin v. Löwis wrote: > > Martin v. Löwis <martin@v.loewis.de> added the comment: > > I propose a different approach: > > 1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, instead of having the dynamic linker do so. > 2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension. > > Module authors then can make a choice whether or not to refer to the Unicode internal representation in their module. If they do, a UCS-2 version won't load into a UCS-4 interpreter. If they don't refer to Py_UNICODE at all, the module can be used across the two modes. > > There is a slight risk that a module may already crash before calling PyModule_Create. To prevent that, we need to specify that no Unicode API must be used before calling PyModule_Create. +1 We could then get rid off the API renaming altogether. |
|||
msg105295 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-08 15:01 | |
On Sat, May 8, 2010 at 5:03 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote: > If you can propose a different method of reliably protecting against > mixed Unicode build module loads, that would be great. We could then > get rid off the wrapping altogether. The following code will cause a link error 100% of the time iff there's a mismatch: #ifndef Py_UNICODE_WIDE #define _Py_Unicode_Build_Symbol _Py_UCS2_Build_Symbol #else #define _Py_Unicode_Build_Symbol _Py_UCS4_Build_Symbol #endif extern int _Py_Unicode_Build_Symbol; /* Defined in unicodeobject.c */ static int *_Py_Unicode_Build_Symbol_Check = &_Py_Unicode_Build_Symbol; In practice, I'd surrounded it with a bunch #ifdefs to disable the "defined but not used" warning in gcc and MSVC. > Please note that UCS2 and UCS4 builds of Python are different in > more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds > use surrogates when converting between Unicode and bytes which > UCS4 don't, sys.maxunicode is different, range checks use different > bounds, unichr() behaves differently, etc. etc. That's true, but those differences are visible from pure-Python code as well aren't they? |
|||
msg105297 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2010-05-08 15:16 | |
Daniel Stutzbach wrote: > > Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment: > > On Sat, May 8, 2010 at 5:03 AM, Marc-Andre Lemburg > <report@bugs.python.org> wrote: >> If you can propose a different method of reliably protecting against >> mixed Unicode build module loads, that would be great. We could then >> get rid off the wrapping altogether. > > The following code will cause a link error 100% of the time iff > there's a mismatch: > > #ifndef Py_UNICODE_WIDE > #define _Py_Unicode_Build_Symbol _Py_UCS2_Build_Symbol > #else > #define _Py_Unicode_Build_Symbol _Py_UCS4_Build_Symbol > #endif > extern int _Py_Unicode_Build_Symbol; /* Defined in unicodeobject.c */ > static int *_Py_Unicode_Build_Symbol_Check = &_Py_Unicode_Build_Symbol; > > In practice, I'd surrounded it with a bunch #ifdefs to disable the > "defined but not used" warning in gcc and MSVC. Are you sure this doesn't get optimized away in practice ? FWIW: I still like the import logic solution better, since that will make it possible for the user to see a truly useful error message rather than just some missing symbol linker error. >> Please note that UCS2 and UCS4 builds of Python are different in >> more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds >> use surrogates when converting between Unicode and bytes which >> UCS4 don't, sys.maxunicode is different, range checks use different >> bounds, unichr() behaves differently, etc. etc. > > That's true, but those differences are visible from pure-Python code > as well aren't they? Sure, though, I don't see how this relates to C code relying on these details, e.g. a C extension will probably use different conversion code depending on whether UCS2 or UCS4 is compatible with some external library, etc. |
|||
msg105302 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-08 15:48 | |
On Sat, May 8, 2010 at 10:16 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote: > Are you sure this doesn't get optimized away in practice ? I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :) > Sure, though, I don't see how this relates to C code relying > on these details, e.g. a C extension will probably use different > conversion code depending on whether UCS2 or UCS4 is compatible > with some external library, etc. Can you give an example? All of the examples I can think of either: - poke into PyUnicodeObject's internals, - call a Python function that exposes Py_UNICODE or PyUnicodeObject I'm explicitly trying to protect those two cases. It's quite possible that I'm missing something, but I can't think of any other unsafe way for a C extension to convert a Python Unicode object to a byte string. |
|||
msg105304 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2010-05-08 16:02 | |
>> Are you sure this doesn't get optimized away in practice ? > > I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :) Did you mean to include the hunk from msg105295 as part of the patch? If so, wouldn't that defeat the whole point of the patch? |
|||
msg105307 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-08 16:11 | |
On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org> wrote: > 1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, instead of having the dynamic linker do so. > 2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension. I notice that PyModule_Create currently works like this: PyAPI_FUNC(PyObject *) PyModule_Create2(struct PyModuleDef*, int apiver); #define PyModule_Create(module) \ PyModule_Create2(module, PYTHON_API_VERSION) Instead of modifying PyModuleDef, what if we changed PyModule_Create to something like the following? PyAPI_FUNC(PyObject *) PyModule_Create3(struct PyModuleDef*, int apiver); #define PyModule_Create(module) \ PyModule_Create3(module, PYTHON_API_VERSION, PYTHON_UNICODE_SETTING) In most cases that will Just Work, without requiring the module writer to modify their PyModuleDef. |
|||
msg105309 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-08 16:19 | |
On Sat, May 8, 2010 at 11:02 AM, Martin v. Löwis <report@bugs.python.org> wrote: > Did you mean to include the hunk from msg105295 as part of the patch? > If so, wouldn't that defeat the whole point of the patch? My intention is to offer two compilation modes to extension authors: 1) A mode that defines Py_UNICODE, PyUnicodeObject, and various unsafe functions and macros. 2) A mode that does not define them. In mode #1, importing the module should fail if the Unicode settings are mismatched. I had meant to include the hunk from msg105295 only in mode #1. Right now I favor your idea of generating a runtime error when loading the module, instead of a linker error. Assuming we go that route, the hunk from msg105295 will not be used at all. |
|||
msg105310 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2010-05-08 16:20 | |
> In most cases that will Just Work, without requiring the module writer > to modify their PyModuleDef. I'd rather modify PyModuleDef_Base, to include a flags field, and perhaps put the api version into it, as well. That's an ABI change, unfortunately, but such a flags field may turn out useful, anyway. |
|||
msg105311 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-08 16:33 | |
On Sat, May 8, 2010 at 11:20 AM, Martin v. Löwis <report@bugs.python.org> wrote: > I'd rather modify PyModuleDef_Base, to include a flags field, and > perhaps put the api version into it, as well. That's an ABI change, > unfortunately, but such a flags field may turn out useful, anyway. That works for me. I'll work on a patch (which may end up needing further revision, but it will at least give us a concrete basis for further discussion). |
|||
msg105312 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2010-05-08 16:35 | |
Daniel Stutzbach wrote: > > Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment: > > On Sat, May 8, 2010 at 10:16 AM, Marc-Andre Lemburg > <report@bugs.python.org> wrote: >> Are you sure this doesn't get optimized away in practice ? > > I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :) > >> Sure, though, I don't see how this relates to C code relying >> on these details, e.g. a C extension will probably use different >> conversion code depending on whether UCS2 or UCS4 is compatible >> with some external library, etc. > > Can you give an example? > > All of the examples I can think of either: > - poke into PyUnicodeObject's internals, > - call a Python function that exposes Py_UNICODE or PyUnicodeObject > > I'm explicitly trying to protect those two cases. It's quite possible > that I'm missing something, but I can't think of any other unsafe way > for a C extension to convert a Python Unicode object to a byte string. One of the more important cases you are missing is the argument parser in Python: Py_UNICODE *x; Py_ssize_t y; PyArg_ParseTuple(args, "u#", &x, &y); This uses the native Py_UNICODE type, but doesn't rely on any Unicode APIs. Same for the tuple builder: args = Py_BuildValue("(u#)", x, y); |
|||
msg105315 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-08 16:47 | |
On Sat, May 8, 2010 at 11:35 AM, Marc-Andre Lemburg <report@bugs.python.org> wrote: > One of the more important cases you are missing is the > argument parser in Python: Thanks. I've had my head buried in c-api/unicode.html and unicodeobject.h. > Py_UNICODE *x; > Py_ssize_t y; > PyArg_ParseTuple(args, "u#", &x, &y); My plan is to not define Py_UNICODE in Unicode-agnostic mode, to prevent that from compiling. |
|||
msg105321 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-08 17:50 | |
In Unicode-agnostic mode, instead of leaving Py_UNICODE, PyUnicodeObject, and many functions undefined, I wonder if it would be sufficient to declare Py_UNICODE like this: struct PY_UNICODE_TYPE; typedef struct PY_UNICODE_TYPE Py_UNICODE; That would allow extensions to pass opaque Py_UNICODE pointers around, but not allow them to dereference the pointers nor evaluate sizeof(Py_UNICODE). That would make PyUnicodeObject safe, as well as PyUnicode_Encode* and several other functions (anything that doesn't manipulate individual characters, basically). |
|||
msg105327 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2010-05-08 19:35 | |
> In Unicode-agnostic mode, instead of leaving Py_UNICODE, > PyUnicodeObject, and many functions undefined, I wonder if it would > be sufficient to declare Py_UNICODE like this: Sounds good. |
|||
msg105398 - (view) | Author: Marc-Andre Lemburg (lemburg) * | Date: 2010-05-09 15:12 | |
Daniel Stutzbach wrote: > > Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment: > > On Sat, May 8, 2010 at 11:35 AM, Marc-Andre Lemburg > <report@bugs.python.org> wrote: >> One of the more important cases you are missing is the >> argument parser in Python: > > Thanks. I've had my head buried in c-api/unicode.html and unicodeobject.h. > >> Py_UNICODE *x; >> Py_ssize_t y; >> PyArg_ParseTuple(args, "u#", &x, &y); > > My plan is to not define Py_UNICODE in Unicode-agnostic mode, to > prevent that from compiling. > And later... undefined, I wonder if it would be sufficient to declare Py_UNICODE like this: > > struct PY_UNICODE_TYPE; > typedef struct PY_UNICODE_TYPE Py_UNICODE; > > That would allow extensions to pass opaque Py_UNICODE pointers around, but not allow them to dereference the pointers nor evaluate sizeof(Py_UNICODE). > > That would make PyUnicodeObject safe, as well as PyUnicode_Encode* and several other functions (anything that doesn't manipulate individual characters, basically). Please make sure that these compiler tricks are portable enough across the supported Python platforms. Just checking with GCC 4.3 is not good enough. I suppose you could use the buildbots to gather information on how different compilers behave (e.g. by checking in a patch, letting the buildbots run and then reverting it, if there's a problem). I'm just not sure how you could check for optimization compiler bugs/features using the buildbots. |
|||
msg105400 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-09 16:32 | |
On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org>wrote: > 1. add a flag to PyModuleDef, indicating whether the module was built in > UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, > instead of having the dynamic linker do so. I just thought of another risk inherit in this approach. If the extension module is composed of more than one C file, the extension author may inadvertently compile the file defining the PyModuleDef in Unicode-agnostic mode but compile another file in Unicode-sensitive mode. Then they would have a Unicode-sensitive extension as Unicode-agnostic, which would lead to mysterious crashes if the Unicode settings are mismatched. :-( |
|||
msg105401 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-09 16:42 | |
On Sun, May 9, 2010 at 10:12 AM, Marc-Andre Lemburg <report@bugs.python.org>wrote: > I'm just not sure how you could check for optimization > compiler bugs/features using the buildbots. > Once I have a patch that I'm modestly confident in, I'll write automated tests to go with it. The tests will (at minimum): - Create a tweaked copy of pyconfig.h that uses the opposite Unicode settings (UCS4 v UCS2) - Build an extension module in Unicode-agnostic mode - Make sure it loads and works - Build an extension module in Unicode-sensitive mode - Make sure it doesn't load |
|||
msg105402 - (view) | Author: Martin v. Löwis (loewis) * | Date: 2010-05-09 16:53 | |
> I just thought of another risk inherit in this approach. If the extension > module is composed of more than one C file, the extension author may > inadvertently compile the file defining the PyModuleDef in Unicode-agnostic > mode but compile another file in Unicode-sensitive mode. I wouldn't worry about this case. |
|||
msg105419 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-05-09 22:14 | |
Here is a new patch. Usage, short version: By default, modules compile in Unicode-agnostic mode, where Py_UNICODE is an incomplete type. Unicode-agnostic modules will load in both UCS2 and UCS4 interpreters. If a module defines PY_REAL_PY_UNICODE before including Python.h, Py_UNICODE will be a complete type. The module can then dereference Py_UNICODE pointers and use sizeof(Py_UNICODE). Attempting to load the module into a mismatched interpreter will cause an ImportError. Longer version: In Unicode-agnostic mode, extensions can pass around Py_UNICODE pointers freely, but they will get a compilation error if they attempt to dereference the pointer or use sizeof(Py_UNICODE). In Unicode-agnostic mode, the following are never defined: Py_UNICODE_SIZE, Py_UNICODE_WIDE, HAVE_USABLE_CHAR_T, and functions and macros take Py_UNICODE as parameter or return it. Passing a Py_UNICODE pointer to a function that expects a wchar_t * (or something similar) will compile, but will generate an incompatible pointer types warning. Per MvL's suggestion, the patch adds a flag field to PyModuleDef_Base and defines two flags: one to indicate if the module depends on the Unicode width, and another to indicate the width the module was compiled with. These flags are checked in PyModule_Create2. The patch includes unit tests to ensure that importing works or doesn't work in the correct cases. 19 of the 77 modules in Modules/ needed PY_REAL_PY_UNICODE. 18 of them failed to compile without it (good!). 1 of them (_tkinter.c) output an "incompatible pointer types" warning without it, but compiled anyway (resulting in bad code - without evidence to the contrary it assumed Python was UCS2). I ran the test suite with a UCS2 Python and a UCS4 Python on Linux. I had hoped to run the test suite on Windows as well, but many tests are failing on Windows even without the patch. |
|||
msg115437 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-09-03 11:05 | |
I just noticed that the previous "new" patch I uploaded was actually my original patch. Oops. Anyway... In a discussion on the capi list, a few people objected to compiling in Unicode-agnostic mode by default (e.g., it would be annoying for Cython). Attached is a new patch that compiles in Unicode-agnostic mode only when Py_UNICODE_AGNOSTIC is defined (and brings the patch up to date with other changes to the py3k branch). It includes automated tests, and I have tested UCS2 Linux, UCS4 Linux, and Windows builds. After the patch, trying to import a module with a mismatched Unicode setting looks like this: >>> import _testunicodemismatched Traceback (most recent call last): File "<stdin>", line 1, in <module> ImportError: module's Unicode representation does not match the interpreter's Here's a birds-eye view of the patch: moduleobject.h: Add a flags field to PyModuleDef_Base moduleobject.c: Detect import with mismatched Unicode setting pyunicode.h: New file; non-agnostic code from unicodeobject.h moved here unicodeobject.h: Deleted ~150 lines of #defines (yay!) Everything else: tests I'm pretty happy with the patch. In the earlier discussion here, there was generally positive feedback for the overall idea. Someone want to do a patch review? |
|||
msg115461 - (view) | Author: Antoine Pitrou (pitrou) * | Date: 2010-09-03 15:38 | |
+#undef HAVE_USABLE_CHAR_T Isn't it HAVE_USABLE_WCHAR_T? +struct PY_UNICODE_STRUCT; +typedef struct PY_UNICODE_STRUCT Py_UNICODE_STRUCT; +#define Py_UNICODE Py_UNICODE_STRUCT Why not simply `typedef struct PY_UNICODE_STRUCT Py_UNICODE`? The _testunicodeagnostic module could have a simple test method making a couple of calls to PyUnicode_* functions. |
|||
msg115474 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2010-09-03 17:36 | |
Thanks, Antoine. Those are all good suggestions. Attached is an updated patch. Marc-Andre, Martin: would you like to look this over as well? |
|||
msg144619 - (view) | Author: STINNER Victor (vstinner) * | Date: 2011-09-29 19:38 | |
The PEP 393 has been accepted. There is no more narrow or wide build, Py_UNICODE is now always wchar_t. We have also a stable ABI which doesn't depend on the internal stucture of Unicode strings. In Python 3.2, the name of dynamic libraries is also different if the module was compiled in narrow or wide mode. See the PEP 3149. I think that it is enough. Can we close the issue? |
|||
msg144620 - (view) | Author: Daniel Stutzbach (stutzbach) | Date: 2011-09-29 19:39 | |
Gladly. :-) |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:00 | admin | set | github: 52900 |
2011-09-29 19:39:30 | stutzbach | set | status: open -> closed resolution: out of date messages: + msg144620 stage: patch review -> resolved |
2011-09-29 19:38:10 | vstinner | set | messages: + msg144619 |
2010-12-29 01:46:04 | eric.araujo | set | nosy:
+ eric.araujo |
2010-12-07 17:48:17 | belopolsky | set | nosy:
+ belopolsky |
2010-09-03 17:36:48 | stutzbach | set | files:
+ unicode-3.patch messages: + msg115474 |
2010-09-03 15:38:44 | pitrou | set | nosy:
+ pitrou messages: + msg115461 |
2010-09-03 14:57:22 | gvanrossum | set | nosy:
- gvanrossum |
2010-09-03 11:05:24 | stutzbach | set | keywords:
+ needs review files: + unicode-2.patch messages: + msg115437 stage: needs patch -> patch review |
2010-05-09 22:14:49 | stutzbach | set | files:
+ unicode.patch messages: + msg105419 |
2010-05-09 21:22:51 | stutzbach | set | files: - unicode.patch |
2010-05-09 16:53:48 | loewis | set | messages: + msg105402 |
2010-05-09 16:45:55 | stutzbach | set | files: - unnamed |
2010-05-09 16:45:51 | stutzbach | set | files: - unnamed |
2010-05-09 16:42:33 | stutzbach | set | files:
+ unnamed messages: + msg105401 |
2010-05-09 16:32:09 | stutzbach | set | files:
+ unnamed messages: + msg105400 |
2010-05-09 15:12:35 | lemburg | set | messages: + msg105398 |
2010-05-08 19:35:22 | loewis | set | messages: + msg105327 |
2010-05-08 17:55:07 | Arfrever | set | nosy:
+ Arfrever |
2010-05-08 17:50:21 | stutzbach | set | messages: + msg105321 |
2010-05-08 16:47:27 | stutzbach | set | messages: + msg105315 |
2010-05-08 16:35:34 | lemburg | set | messages: + msg105312 |
2010-05-08 16:33:59 | stutzbach | set | messages: + msg105311 |
2010-05-08 16:20:35 | loewis | set | messages: + msg105310 |
2010-05-08 16:19:27 | stutzbach | set | messages: + msg105309 |
2010-05-08 16:11:05 | stutzbach | set | messages: + msg105307 |
2010-05-08 16:02:49 | loewis | set | messages: + msg105304 |
2010-05-08 15:48:38 | stutzbach | set | messages: + msg105302 |
2010-05-08 15:16:07 | lemburg | set | messages: + msg105297 |
2010-05-08 15:01:06 | stutzbach | set | messages: + msg105295 |
2010-05-08 13:15:28 | lemburg | set | messages: + msg105286 |
2010-05-08 13:07:46 | loewis | set | messages: + msg105284 |
2010-05-08 10:03:50 | lemburg | set | messages: + msg105272 |
2010-05-08 04:46:24 | zooko | set | messages: + msg105260 |
2010-05-07 22:42:25 | vstinner | set | nosy:
+ vstinner |
2010-05-07 22:38:23 | r.david.murray | set | nosy:
+ r.david.murray, lemburg, loewis messages: + msg105242 |
2010-05-07 21:54:29 | stutzbach | set | messages: + msg105239 |
2010-05-07 19:28:56 | stutzbach | set | nosy:
+ gvanrossum, zooko, scoder messages: + msg105225 |
2010-05-07 19:24:09 | stutzbach | set | messages: + msg105224 |
2010-05-07 19:04:05 | stutzbach | set | files:
+ unicode.patch keywords: + patch |
2010-05-07 19:02:43 | stutzbach | create |