
Improve ABI compatibility between UCS2 and UCS4 builds #52900

Closed
stutzbach mannequin opened this issue May 7, 2010 · 31 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

Comments

stutzbach mannequin commented May 7, 2010

BPO 8654
Nosy @malemburg, @loewis, @abalkin, @pitrou, @scoder, @vstinner, @merwok, @bitdancer
Files
  • unicode.patch
  • unicode-2.patch: Updated patch
  • unicode-3.patch: Updated based on Antoine's feedback
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = <Date 2011-09-29.19:39:30.360>
    created_at = <Date 2010-05-07.19:02:43.448>
    labels = ['interpreter-core', 'type-bug', 'expert-unicode']
    title = 'Improve ABI compatibility between UCS2 and UCS4 builds'
    updated_at = <Date 2011-09-29.19:39:30.359>
    user = 'https://bugs.python.org/stutzbach'

    bugs.python.org fields:

    activity = <Date 2011-09-29.19:39:30.359>
    actor = 'stutzbach'
    assignee = 'stutzbach'
    closed = True
    closed_date = <Date 2011-09-29.19:39:30.360>
    closer = 'stutzbach'
    components = ['Interpreter Core', 'Unicode']
    creation = <Date 2010-05-07.19:02:43.448>
    creator = 'stutzbach'
    dependencies = []
    files = ['17278', '18723', '18730']
    hgrepos = []
    issue_num = 8654
    keywords = ['patch', 'needs review']
    message_count = 31.0
    messages = ['105222', '105224', '105225', '105239', '105242', '105260', '105272', '105284', '105286', '105295', '105297', '105302', '105304', '105307', '105309', '105310', '105311', '105312', '105315', '105321', '105327', '105398', '105400', '105401', '105402', '105419', '115437', '115461', '115474', '144619', '144620']
    nosy_count = 11.0
    nosy_names = ['lemburg', 'loewis', 'zooko', 'belopolsky', 'pitrou', 'scoder', 'vstinner', 'stutzbach', 'eric.araujo', 'Arfrever', 'r.david.murray']
    pr_nums = []
    priority = 'normal'
    resolution = 'out of date'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue8654'
    versions = ['Python 3.2']

    stutzbach mannequin commented May 7, 2010

    Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions. For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.

    Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:

        undefined symbol: PyUnicodeUCS2_FromString
    

In Python 2, only some extension modules were stung by this problem. For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object. It's basically fruitless to upload a binary egg for Python 3 to PyPI, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way).

    Right now, nearly all the functions in unicodeobject.h are wrapped. Several functions are not. Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added.

    Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4. We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view.

    For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects. It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped.

    On the other hand, PyUnicode_FromString creates a Unicode object from a char *. The caller can treat the returned object as opaque, so the function should not be wrapped.

    The attached patch implements that rule. It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped. "make test" works with both UCS2 and UCS4 builds.

    I previously brought this issue up on python-ideas, see:
    http://mail.python.org/pipermail/python-ideas/2009-November/006543.html

    Here's a summary of that discussion:

Zooko Wilcox-O'Hearn pointed out that my proposal is complementary to his proposal to standardize on UCS4, which would reduce the risk of extension modules built with a mismatched encoding.

Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care". My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers. In other words, they are complementary fixes.

    Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction.

    With my current patch, there are still two ways for that to happen:

    1. The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or

    2. The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros.

    Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions. Consequently, they will still generate a linker error if the encoding is mismatched, as desired.

    I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions. I'll post about that in a separate comment to this bug.

    @stutzbach stutzbach mannequin self-assigned this May 7, 2010
    @stutzbach stutzbach mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error labels May 7, 2010
    stutzbach mannequin commented May 7, 2010

    Forcing the compile-time and link-time encodings to match is tricky. The goal is:

    • Allow Unicode-agnostic modules to link, regardless of unicode settings
    • Cause a link failure if the unicode settings are mismatched for a module that pokes into PyUnicodeObject

    All of the solutions I've come up with have trade-offs. Here is one approach:

    Expose PyUnicodeObject if and only if the extension #defines a special flag (PY_EXPOSE_UNICODE_OBJECT?) before including Python.h.

If an extension does NOT define the flag, then neither PyUnicodeObject nor the accessor macros for its fields will be defined. All of the opaque and non-opaque functions are still defined; the module just can't poke directly into PyUnicodeObject. Linking will succeed as long as the module avoids non-opaque functions.

    If the flag IS defined, then PyUnicodeObject will be defined along with the accessor macros. The unicodeobject.h header will also arrange to require an appropriate symbol so that linking will succeed only if the module is compiled with the same Unicode settings as Python.

    The upside of this approach is that it works: Unicode-agnostic modules will link and all others will not.

The drawback is that a few extension modules will need a #define before including Python.h. Only modules that poke into PyUnicodeObject will be affected. No change will be required for modules that depend on the Unicode setting yet stick to functions (opaque or not).

    Thoughts?

    stutzbach mannequin commented May 7, 2010

    Adding people to the Nosy List who participated in the original thread on python-ideas, several months ago. Hope you don't mind. :-)

    stutzbach mannequin commented May 7, 2010

    I've been thinking about this a bit more. There are three types of symbols in unicodeobject.h:

    1. Functions that are always safe to use
    2. Functions that are only safe if the module is compiled with the same Unicode settings as Python
    3. Structures and macros that are only safe if the module is compiled with the same Unicode settings as Python

    The functions in #2 will generate a link error if the module actually uses them.

    We can add some symbols next to the structures and macros (#3), such that there will always be a link error if the Unicode settings are mismatched. However, we can't tell if the module actually uses the structure or not.

    The hard question is: what should be declared by default?

    Option 1: Make Unicode-agnosticism the default and force anyone who cares about the Unicode setting to include a separate header. If they don't include that header, they can only call safe functions and can't poke at PyUnicodeObject's internals. If they include the header, their module will always generate a link failure if the Unicode settings are mismatched. (Guido proposed this solution in the python-ideas thread)

    Option 2: Make Unicode-dependence the default. If the compilation settings don't match, force a link failure. Allow extension authors to define a flag (Py_UNICODE_AGNOSTIC) before including Python.h to avoid defining any unsafe functions, structures, or macros. In practice, this is what Python 3 does today, except there's currently no way to declare Unicode-agnosticism.

    Option 3: Go for a middle ground. Modules are Unicode agnostic by default, unless they call a non-agnostic function (which will cause a link error if there's a mismatch). If they want to poke directly into PyUnicodeObject, they still need to include a separate header. I fear that this is the worst of both worlds, though.

    The more I think about it, the more I like the first option.

    Maybe I should bring this up on capi-sig and try to gather a consensus?

@bitdancer (Member) commented:

    Adding MvL because he wrote the ABI PEP, and MAL because he cares about the Unicode interface.

    zooko mannequin commented May 8, 2010

> Option 1: Make Unicode-agnosticism the default and force anyone who cares about the Unicode setting to include a separate header. If they don't include that header, they can only call safe functions and can't poke at PyUnicodeObject's internals. If they include the header, their module will always generate a link failure if the Unicode settings are mismatched. (Guido proposed this solution in the python-ideas thread)

    +1

The packaging and compatibility problems are pressing concerns for many people. Poking at PyUnicodeObject's internals seems like a pretty rare need for an extension module. If I understand correctly, this option provides the best solution to the packaging and compatibility issues (over the long term, as Python and the authors of extension modules upgrade).

@malemburg (Member) commented:

    Daniel Stutzbach wrote:


> Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4. To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions. For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.
>
> Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:
>
>     undefined symbol: PyUnicodeUCS2_FromString
>
> In Python 2, only some extension modules were stung by this problem. For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object. It's basically fruitless to upload a binary egg for Python 3 to PyPi, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way).
>
> Right now, nearly all the functions in unicodeobject.h are wrapped. Several functions are not. Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added.

That's true. The main point in wrapping the APIs was to make sure that
if an extension module uses Unicode, it will likely use one of the
wrapped APIs and then be protected by the name mangling, which prevents
extensions compiled as UCS2 from being loaded into a UCS4 interpreter
and vice versa.

    The main difference between those two build variants is the definition
    of Py_UNICODE. Unfortunately, there's no way to have the linker check
    that type definition. The wrapping was chosen as alternative protection.

> Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4. We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view.
>
> For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects. It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped.
>
> On the other hand, PyUnicode_FromString creates a Unicode object from a char *. The caller can treat the returned object as opaque, so the function should not be wrapped.

The point of the wrapping was not to protect the APIs themselves.
It is just meant to make it possible to use the linker as a protection
device when loading a module.

If you can propose a different method of reliably protecting against
mixed Unicode build module loads, that would be great. We could then
get rid of the wrapping altogether.

> The attached patch implements that rule. It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped. "make test" works with both UCS2 and UCS4 builds.

I don't think that removing a few wrapped functions is going to help
with the problem.

Extension modules can still use Py_UNICODE internally and make
use of the macros we have for accessing the PyUnicodeObject
internals. You simply can't catch such usage by wrapping
the APIs - as I said above, it's not a perfect solution, but
a method that works most of the time.

    I'd much rather like to see a solution that always works rather
    than making mixed Unicode build extension loading easier.

    Perhaps we could do something in the module import mechanism
    to check for Unicode build compatibility (much like the Python API
    version check, but with raising an error instead of just issuing
    a warning).

> I previously brought this issue up on python-ideas, see:
> http://mail.python.org/pipermail/python-ideas/2009-November/006543.html
>
> Here's a summary of that discussion:
>
> Zooko Wilcox-O'Hearn pointed out that my proposal is complimentary to his proposal to standardize on UCS4, to reduce the risk of extension modules built with a mismatched encoding.
>
> Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require. PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care". My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers. In other words, they are complimentary fixes.
>
> Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction.
>
> With my current patch, there are still two ways for that to happen:
>
> 1. The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or
>
> 2. The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros.
>
> Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions. Consequently, they will still generate a linker error if the encoding is mismatched, as desired.
>
> I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions. I'll post about that in a separate comment to this bug.

    Please note that UCS2 and UCS4 builds of Python are different in
    more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
    use surrogates when converting between Unicode and bytes which
    UCS4 don't, sys.maxunicode is different, range checks use different
    bounds, unichr() behaves differently, etc. etc.

It is simply not a good idea to have modules built for a UCS2
interpreter work in a UCS4 interpreter and vice versa.

    loewis mannequin commented May 8, 2010

    I propose a different approach:

1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse to load the module, instead of having the dynamic linker do so.
    2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension.

    Module authors then can make a choice whether or not to refer to the Unicode internal representation in their module. If they do, a UCS-2 version won't load into a UCS-4 interpreter. If they don't refer to Py_UNICODE at all, the module can be used across the two modes.

There is a slight risk that a module may already crash before calling PyModule_Create. To prevent that, we need to specify that no Unicode API may be used before calling PyModule_Create.

@malemburg (Member) commented:

Martin v. Löwis wrote:

> I propose a different approach:
>
> 1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, instead of having the dynamic linker do so.
> 2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension.
>
> Module authors then can make a choice whether or not to refer to the Unicode internal representation in their module. If they do, a UCS-2 version won't load into a UCS-4 interpreter. If they don't refer to Py_UNICODE at all, the module can be used across the two modes.
>
> There is a slight risk that a module may already crash before calling PyModule_Create. To prevent that, we need to specify that no Unicode API must be used before calling PyModule_Create.

    +1

We could then get rid of the API renaming altogether.

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 5:03 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

> If you can propose a different method of reliably protecting against
> mixed Unicode build module loads, that would be great. We could then
> get rid off the wrapping altogether.

    The following code will cause a link error 100% of the time iff
    there's a mismatch:

    #ifndef Py_UNICODE_WIDE
    #define _Py_Unicode_Build_Symbol _Py_UCS2_Build_Symbol
    #else
    #define _Py_Unicode_Build_Symbol _Py_UCS4_Build_Symbol
    #endif
    extern int _Py_Unicode_Build_Symbol; /* Defined in unicodeobject.c */
    static int *_Py_Unicode_Build_Symbol_Check = &_Py_Unicode_Build_Symbol;

In practice, I'd surround it with a bunch of #ifdefs to disable the
"defined but not used" warning in gcc and MSVC.

> Please note that UCS2 and UCS4 builds of Python are different in
> more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
> use surrogates when converting between Unicode and bytes which
> UCS4 don't, sys.maxunicode is different, range checks use different
> bounds, unichr() behaves differently, etc. etc.

That's true, but those differences are visible from pure-Python code
as well, aren't they?

@malemburg (Member) commented:

Daniel Stutzbach wrote:

> On Sat, May 8, 2010 at 5:03 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> > If you can propose a different method of reliably protecting against
> > mixed Unicode build module loads, that would be great. We could then
> > get rid off the wrapping altogether.
>
> The following code will cause a link error 100% of the time iff
> there's a mismatch:
>
> #ifndef Py_UNICODE_WIDE
> #define _Py_Unicode_Build_Symbol _Py_UCS2_Build_Symbol
> #else
> #define _Py_Unicode_Build_Symbol _Py_UCS4_Build_Symbol
> #endif
> extern int _Py_Unicode_Build_Symbol; /* Defined in unicodeobject.c */
> static int *_Py_Unicode_Build_Symbol_Check = &_Py_Unicode_Build_Symbol;
>
> In practice, I'd surrounded it with a bunch #ifdefs to disable the
> "defined but not used" warning in gcc and MSVC.

    Are you sure this doesn't get optimized away in practice ?

    FWIW: I still like the import logic solution better, since that
    will make it possible for the user to see a truly useful
    error message rather than just some missing symbol linker
    error.

> > Please note that UCS2 and UCS4 builds of Python are different in
> > more ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
> > use surrogates when converting between Unicode and bytes which
> > UCS4 don't, sys.maxunicode is different, range checks use different
> > bounds, unichr() behaves differently, etc. etc.
>
> That's true, but those differences are visible from pure-Python code
> as well aren't they?

    Sure, though, I don't see how this relates to C code relying
    on these details, e.g. a C extension will probably use different
    conversion code depending on whether UCS2 or UCS4 is compatible
    with some external library, etc.

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 10:16 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

> Are you sure this doesn't get optimized away in practice ?

    I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)

> Sure, though, I don't see how this relates to C code relying
> on these details, e.g. a C extension will probably use different
> conversion code depending on whether UCS2 or UCS4 is compatible
> with some external library, etc.

    Can you give an example?

    All of the examples I can think of either:

    • poke into PyUnicodeObject's internals,
    • call a Python function that exposes Py_UNICODE or PyUnicodeObject

    I'm explicitly trying to protect those two cases. It's quite possible
    that I'm missing something, but I can't think of any other unsafe way
    for a C extension to convert a Python Unicode object to a byte string.

    loewis mannequin commented May 8, 2010

> > Are you sure this doesn't get optimized away in practice ?
>
> I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)

    Did you mean to include the hunk from msg105295 as part of the patch?
    If so, wouldn't that defeat the whole point of the patch?

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org> wrote:

> 1. add a flag to PyModuleDef, indicating whether the module was built in UCS-2 or UCS-4 mode. Then let the interpreter refuse the load the module, instead of having the dynamic linker do so.
> 2. provide a mode for the header files where Py_UNICODE is not defined. add another flag to PyModuleDef indicating whether that mode was used when compiling the extension.

    I notice that PyModule_Create currently works like this:

    PyAPI_FUNC(PyObject *) PyModule_Create2(struct PyModuleDef*, int apiver);
    #define PyModule_Create(module) \
            PyModule_Create2(module, PYTHON_API_VERSION)

    Instead of modifying PyModuleDef, what if we changed PyModule_Create
    to something like the following?

PyAPI_FUNC(PyObject *) PyModule_Create3(struct PyModuleDef*, int apiver, int unicode_setting);
#define PyModule_Create(module) \
        PyModule_Create3(module, PYTHON_API_VERSION, PYTHON_UNICODE_SETTING)

    In most cases that will Just Work, without requiring the module writer
    to modify their PyModuleDef.

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 11:02 AM, Martin v. Löwis <report@bugs.python.org> wrote:

> Did you mean to include the hunk from msg105295 as part of the patch?
> If so, wouldn't that defeat the whole point of the patch?

    My intention is to offer two compilation modes to extension authors:

    1. A mode that defines Py_UNICODE, PyUnicodeObject, and various unsafe
      functions and macros.
    2. A mode that does not define them.

    In mode #1, importing the module should fail if the Unicode settings
    are mismatched. I had meant to include the hunk from msg105295 only
    in mode #1.

    Right now I favor your idea of generating a runtime error when loading
    the module, instead of a linker error. Assuming we go that route, the
    hunk from msg105295 will not be used at all.

    loewis mannequin commented May 8, 2010

> In most cases that will Just Work, without requiring the module writer
> to modify their PyModuleDef.

    I'd rather modify PyModuleDef_Base, to include a flags field, and
    perhaps put the api version into it, as well. That's an ABI change,
    unfortunately, but such a flags field may turn out useful, anyway.

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 11:20 AM, Martin v. Löwis <report@bugs.python.org> wrote:

> I'd rather modify PyModuleDef_Base, to include a flags field, and
> perhaps put the api version into it, as well. That's an ABI change,
> unfortunately, but such a flags field may turn out useful, anyway.

    That works for me. I'll work on a patch (which may end up needing
    further revision, but it will at least give us a concrete basis for
    further discussion).

@malemburg (Member) commented:

Daniel Stutzbach wrote:

> On Sat, May 8, 2010 at 10:16 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> > Are you sure this doesn't get optimized away in practice ?
>
> I'm sure it doesn't get optimized away by gcc 4.3, where I tested it. :)
>
> > Sure, though, I don't see how this relates to C code relying
> > on these details, e.g. a C extension will probably use different
> > conversion code depending on whether UCS2 or UCS4 is compatible
> > with some external library, etc.
>
> Can you give an example?
>
> All of the examples I can think of either:
>
> • poke into PyUnicodeObject's internals,
> • call a Python function that exposes Py_UNICODE or PyUnicodeObject
>
> I'm explicitly trying to protect those two cases. It's quite possible
> that I'm missing something, but I can't think of any other unsafe way
> for a C extension to convert a Python Unicode object to a byte string.

    One of the more important cases you are missing is the
    argument parser in Python:

    Py_UNICODE *x;
    Py_ssize_t y;
    PyArg_ParseTuple(args, "u#", &x, &y);

    This uses the native Py_UNICODE type, but doesn't rely on any
    Unicode APIs.

    Same for the tuple builder:

    args = Py_BuildValue("(u#)", x, y);

    stutzbach mannequin commented May 8, 2010

    On Sat, May 8, 2010 at 11:35 AM, Marc-Andre Lemburg
    <report@bugs.python.org> wrote:

> One of the more important cases you are missing is the
> argument parser in Python:

    Thanks. I've had my head buried in c-api/unicode.html and unicodeobject.h.

> Py_UNICODE *x;
> Py_ssize_t y;
> PyArg_ParseTuple(args, "u#", &x, &y);

    My plan is to not define Py_UNICODE in Unicode-agnostic mode, to
    prevent that from compiling.

    stutzbach mannequin commented May 8, 2010

    In Unicode-agnostic mode, instead of leaving Py_UNICODE, PyUnicodeObject, and many functions undefined, I wonder if it would be sufficient to declare Py_UNICODE like this:

    struct PY_UNICODE_TYPE;
    typedef struct PY_UNICODE_TYPE Py_UNICODE;

    That would allow extensions to pass opaque Py_UNICODE pointers around, but not allow them to dereference the pointers nor evaluate sizeof(Py_UNICODE).

    That would make PyUnicodeObject safe, as well as PyUnicode_Encode* and several other functions (anything that doesn't manipulate individual characters, basically).

    loewis mannequin commented May 8, 2010

> In Unicode-agnostic mode, instead of leaving Py_UNICODE,
> PyUnicodeObject, and many functions undefined, I wonder if it would
> be sufficient to declare Py_UNICODE like this:

    Sounds good.

@malemburg (Member) commented:

Daniel Stutzbach wrote:

> On Sat, May 8, 2010 at 11:35 AM, Marc-Andre Lemburg
> <report@bugs.python.org> wrote:
> > One of the more important cases you are missing is the
> > argument parser in Python:
>
> Thanks. I've had my head buried in c-api/unicode.html and unicodeobject.h.
>
> > Py_UNICODE *x;
> > Py_ssize_t y;
> > PyArg_ParseTuple(args, "u#", &x, &y);
>
> My plan is to not define Py_UNICODE in Unicode-agnostic mode, to
> prevent that from compiling.

And later...

> undefined, I wonder if it would be sufficient to declare Py_UNICODE like this:
>
> struct PY_UNICODE_TYPE;
> typedef struct PY_UNICODE_TYPE Py_UNICODE;
>
> That would allow extensions to pass opaque Py_UNICODE pointers around, but not allow them to
> dereference the pointers nor evaluate sizeof(Py_UNICODE).
>
> That would make PyUnicodeObject safe, as well as PyUnicode_Encode* and several other functions
> (anything that doesn't manipulate individual characters, basically).

    Please make sure that these compiler tricks are portable
    enough across the supported Python platforms. Just checking
    with GCC 4.3 is not good enough.

    I suppose you could use the buildbots to gather information
    on how different compilers behave (e.g. by checking in a patch,
    letting the buildbots run and then reverting it, if there's
    a problem). I'm just not sure how you could check for optimization
    compiler bugs/features using the buildbots.

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented May 9, 2010

On Sat, May 8, 2010 at 8:07 AM, Martin v. Löwis <report@bugs.python.org> wrote:

1. add a flag to PyModuleDef, indicating whether the module was built in
  UCS-2 or UCS-4 mode. Then let the interpreter refuse to load the module,
  instead of having the dynamic linker do so.

I just thought of another risk inherent in this approach. If the extension
module is composed of more than one C file, the extension author may
inadvertently compile the file defining the PyModuleDef in Unicode-agnostic
mode but compile another file in Unicode-sensitive mode. Then they would
have a Unicode-sensitive extension flagged as Unicode-agnostic, which would lead to
mysterious crashes if the Unicode settings are mismatched. :-(

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented May 9, 2010

On Sun, May 9, 2010 at 10:12 AM, Marc-Andre Lemburg
<report@bugs.python.org> wrote:

    I'm just not sure how you could check for optimization
    compiler bugs/features using the buildbots.

    Once I have a patch that I'm modestly confident in, I'll write automated
    tests to go with it.

    The tests will (at minimum):

    • Create a tweaked copy of pyconfig.h that uses the opposite Unicode
      settings (UCS4 v UCS2)

    • Build an extension module in Unicode-agnostic mode

    • Make sure it loads and works

    • Build an extension module in Unicode-sensitive mode

    • Make sure it doesn't load

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented May 9, 2010

I just thought of another risk inherent in this approach. If the extension
module is composed of more than one C file, the extension author may
inadvertently compile the file defining the PyModuleDef in Unicode-agnostic
mode but compile another file in Unicode-sensitive mode.

    I wouldn't worry about this case.

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented May 9, 2010

    Here is a new patch.

    Usage, short version:

    By default, modules compile in Unicode-agnostic mode, where Py_UNICODE is an incomplete type. Unicode-agnostic modules will load in both UCS2 and UCS4 interpreters.

    If a module defines PY_REAL_PY_UNICODE before including Python.h, Py_UNICODE will be a complete type. The module can then dereference Py_UNICODE pointers and use sizeof(Py_UNICODE). Attempting to load the module into a mismatched interpreter will cause an ImportError.

    Longer version:

In Unicode-agnostic mode, extensions can pass around Py_UNICODE pointers freely, but they will get a compilation error if they attempt to dereference the pointer or use sizeof(Py_UNICODE). In Unicode-agnostic mode, the following are never defined: Py_UNICODE_SIZE, Py_UNICODE_WIDE, HAVE_USABLE_CHAR_T, and any functions and macros that take Py_UNICODE as a parameter or return it.

    Passing a Py_UNICODE pointer to a function that expects a wchar_t * (or something similar) will compile, but will generate an incompatible pointer types warning.

    Per MvL's suggestion, the patch adds a flag field to PyModuleDef_Base and defines two flags: one to indicate if the module depends on the Unicode width, and another to indicate the width the module was compiled with. These flags are checked in PyModule_Create2.

    The patch includes unit tests to ensure that importing works or doesn't work in the correct cases.

    19 of the 77 modules in Modules/ needed PY_REAL_PY_UNICODE.

    18 of them failed to compile without it (good!).

1 of them (_tkinter.c) emitted an "incompatible pointer types" warning without it, but compiled anyway, resulting in bad code: without evidence to the contrary, it assumed Python was UCS2.

    I ran the test suite with a UCS2 Python and a UCS4 Python on Linux. I had hoped to run the test suite on Windows as well, but many tests are failing on Windows even without the patch.

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented Sep 3, 2010

    I just noticed that the previous "new" patch I uploaded was actually my original patch. Oops. Anyway...

    In a discussion on the capi list, a few people objected to compiling in Unicode-agnostic mode by default (e.g., it would be annoying for Cython). Attached is a new patch that compiles in Unicode-agnostic mode only when Py_UNICODE_AGNOSTIC is defined (and brings the patch up to date with other changes to the py3k branch).

    It includes automated tests, and I have tested UCS2 Linux, UCS4 Linux, and Windows builds.

    After the patch, trying to import a module with a mismatched Unicode setting looks like this:

    >>> import _testunicodemismatched
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ImportError: module's Unicode representation does not match the interpreter's

    Here's a birds-eye view of the patch:

    moduleobject.h: Add a flags field to PyModuleDef_Base
    moduleobject.c: Detect import with mismatched Unicode setting
    pyunicode.h: New file; non-agnostic code from unicodeobject.h moved here
    unicodeobject.h: Deleted ~150 lines of #defines (yay!)
    Everything else: tests

    I'm pretty happy with the patch. In the earlier discussion here, there was generally positive feedback for the overall idea. Someone want to do a patch review?

    @pitrou
    Copy link
    Member

    pitrou commented Sep 3, 2010

    +#undef HAVE_USABLE_CHAR_T

    Isn't it HAVE_USABLE_WCHAR_T?

    +struct PY_UNICODE_STRUCT;
    +typedef struct PY_UNICODE_STRUCT Py_UNICODE_STRUCT;
    +#define Py_UNICODE Py_UNICODE_STRUCT

    Why not simply typedef struct PY_UNICODE_STRUCT Py_UNICODE?

    The _testunicodeagnostic module could have a simple test method making a couple of calls to PyUnicode_* functions.

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented Sep 3, 2010

    Thanks, Antoine. Those are all good suggestions. Attached is an updated patch.

    Marc-Andre, Martin: would you like to look this over as well?

    @vstinner
    Copy link
    Member

PEP 393 has been accepted. There are no longer narrow and wide builds; Py_UNICODE is now always wchar_t.

We also have a stable ABI which doesn't depend on the internal structure of Unicode strings.

In Python 3.2, the names of dynamic libraries also differ depending on whether the module was compiled in narrow or wide mode. See PEP 3149.

    I think that it is enough. Can we close the issue?

    @stutzbach
    Copy link
    Mannequin Author

    stutzbach mannequin commented Sep 29, 2011

    Gladly. :-)

    @stutzbach stutzbach mannequin closed this as completed Sep 29, 2011
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Labels
    interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode type-bug An unexpected behavior, bug, or error

    4 participants