
Author lemburg
Recipients gvanrossum, lemburg, loewis, r.david.murray, scoder, stutzbach, vstinner, zooko
Date 2010-05-08.10:03:46
Message-id <4BE53701.7030207@egenix.com>
In-reply-to <1273258965.02.0.175407225836.issue8654@psf.upfronthosting.co.za>
Content
Daniel Stutzbach wrote:
> 
> New submission from Daniel Stutzbach <daniel@stutzbachenterprises.com>:
> 
> Currently, Python can be built with an internal Unicode representation of UCS2 or UCS4.  To prevent extension modules compiled with the wrong Unicode representation from linking, unicodeobject.h #defines many of the Unicode functions.  For example, PyUnicode_FromString becomes either PyUnicodeUCS2_FromString or PyUnicodeUCS4_FromString.
> 
> Consequently, if one installs a binary egg (e.g., with easy_install), there's a good chance one will get an error such as the following when trying to use it:
> 
>         undefined symbol: PyUnicodeUCS2_FromString
> 
> In Python 2, only some extension modules were stung by this problem.  For Python 3, virtually every extension type will need to call a PyUnicode_* function, since __repr__ must return a Unicode object.  It's basically fruitless to upload a binary egg for Python 3 to PyPI, since it will generate link errors for a large fraction of downloaders (I discovered this the hard way).
> 
> Right now, nearly all the functions in unicodeobject.h are wrapped.  Several functions are not.  Many of the unwrapped functions also have no documentation, so I'm guessing they are newer functions that were not wrapped when they were added.

That's true. The main point in wrapping the APIs was to make sure that
if an extension module uses Unicode, it will likely use one of the
wrapped APIs and then be protected by the name mangling, which prevents
extensions compiled as UCS2 from being loaded into a UCS4 interpreter
and vice-versa.

The main difference between those two build variants is the definition
of Py_UNICODE. Unfortunately, there's no way to have the linker check
that type definition. The wrapping was chosen as an alternative protection.
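
As a rough illustration of the name mangling described above, here is
a minimal, self-contained C sketch. The Demo_* names and
DEMO_UNICODE_SIZE are made up for this example and are not part of
unicodeobject.h; the sketch only mimics the technique of #defining the
public name to a build-specific symbol.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the build-time Unicode width (2 or 4 bytes). */
#ifndef DEMO_UNICODE_SIZE
#define DEMO_UNICODE_SIZE 4
#endif

/* The public name is redirected to a build-specific symbol, in the same
   spirit as PyUnicode_FromString -> PyUnicodeUCS2/UCS4_FromString. */
#if DEMO_UNICODE_SIZE == 2
#define Demo_FromString DemoUCS2_FromString
#else
#define Demo_FromString DemoUCS4_FromString
#endif

/* Stringify after macro expansion, to show the name the linker sees. */
#define DEMO_STR_(x) #x
#define DEMO_STR(x) DEMO_STR_(x)

/* "Interpreter" side: the function exists only under its mangled name. */
const char *Demo_FromString(const char *s)
{
    assert(s != NULL);
    return s;
}

/* Name of the symbol that a call to Demo_FromString actually links to. */
const char *demo_linked_symbol(void)
{
    return DEMO_STR(Demo_FromString);
}
```

An extension compiled with the other width would reference
DemoUCS2_FromString, which this "interpreter" never defines, so the
load fails with an undefined symbol error like the one quoted earlier.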

> Most extensions treat PyUnicodeObjects as opaque and do not care if the internal representation is UCS2 or UCS4.  We can improve ABI compatibility by only wrapping functions where the representation matters from the caller's point of view.
> 
> For example, PyUnicode_FromUnicode creates a Unicode object from an array of Py_UNICODE objects.  It will interpret the data differently on UCS2 vs UCS4, so the function should be wrapped.
>
> On the other hand, PyUnicode_FromString creates a Unicode object from a char *.  The caller can treat the returned object as opaque, so the function should not be wrapped.

The point of the wrapping was not to protect the APIs themselves.
It is only meant to let us use the linker as a protection device
when loading a module.

If you can propose a different method of reliably protecting against
mixed Unicode build module loads, that would be great. We could then
get rid of the wrapping altogether.

> The attached patch implements that rule.  It unwraps 64 opaque functions that were previously wrapped, and wraps 11 non-opaque functions that were previously unwrapped.  "make test" works with both UCS2 and UCS4 builds.

I don't think that removing a few wrapped functions is going to help
with the problem.

Extension modules can still use Py_UNICODE internally and make
use of the macros we have for accessing the PyUnicodeObject
internals. You simply don't catch such usage by wrapping
the APIs. As I said above, it's not a perfect solution, only
a method that works most of the time.

I'd much rather see a solution that always works than one that
makes mixed Unicode build extension loading easier.

Perhaps we could do something in the module import mechanism
to check for Unicode build compatibility (much like the Python API
version check, but with raising an error instead of just issuing
a warning).
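
A minimal sketch of what such a check could look like, in plain C.
Everything here is hypothetical (the demo_module_info struct, the
function name and the return codes); it is not an existing CPython
interface, just one way to express "warn on an API version mismatch,
refuse on a Unicode build mismatch".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record an extension would carry from its compile time. */
typedef struct {
    const char *name;
    int api_version;   /* API version the extension was built against */
    int unicode_size;  /* sizeof(Py_UNICODE) at build time: 2 or 4 */
} demo_module_info;

enum {
    DEMO_LOAD_ERROR_UNICODE = -1,  /* refuse to load the module */
    DEMO_LOAD_OK = 0,
    DEMO_LOAD_WARN_API = 1         /* load, but issue a warning */
};

/* Compare the extension's build settings with the interpreter's. */
int demo_check_module(const demo_module_info *mod,
                      int interp_api_version,
                      int interp_unicode_size)
{
    assert(mod != NULL);
    if (mod->unicode_size != interp_unicode_size)
        return DEMO_LOAD_ERROR_UNICODE;  /* hard error */
    if (mod->api_version != interp_api_version)
        return DEMO_LOAD_WARN_API;       /* warning only */
    return DEMO_LOAD_OK;
}
```

The Unicode test comes first on purpose: a width mismatch has to
win over a mere API version warning.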

> I previously brought this issue up on python-ideas, see:
> http://mail.python.org/pipermail/python-ideas/2009-November/006543.html
> 
> Here's a summary of that discussion:
> 
> Zooko Wilcox-O'Hearn pointed out that my proposal is complementary to his proposal to standardize on UCS4, to reduce the risk of extension modules built with a mismatched encoding.
> 
> Stefan Behnel pointed out that easy_install should allow eggs to specify the encoding they require.  PJE's proposed implementation of that feature (http://bit.ly/1bO62) would allow eggs to specify UCS2, UCS4, or "Don't Care".  My proposal greatly increases the number of eggs that could label themselves "Don't Care", reducing maintenance work for package maintainers.  In other words, they are complementary fixes.
> 
> Guido liked the idea but expressed concern about the possibility of extension modules that link successfully, but later crash because they actually do depend on the UCS2/UCS4 distinction.
> 
> With my current patch, there are still two ways for that to happen:
> 
> 1) The extension uses only opaque functions, but casts the returned PyObject * to PyUnicodeObject * and accesses the str member, or
> 
> 2) The extension uses only opaque functions, but uses the PyUnicode_AS_UNICODE or PyUnicode_AS_DATA macros.
> 
> Most packages that poke into the internals of PyUnicodeObject also call non-opaque functions.  Consequently, they will still generate a linker error if the encoding is mismatched, as desired.
> 
> I'm trying to come up with a way to 100% guarantee that any extension poking into the internals will generate a linker error if the encoding is mismatched, even if they don't call any non-opaque functions.  I'll post about that in a separate comment to this bug.

Please note that UCS2 and UCS4 builds of Python differ in more
ways than just the underlying Py_UNICODE type. E.g. UCS2 builds
use surrogates when converting between Unicode and bytes, which
UCS4 builds don't, sys.maxunicode is different, range checks use
different bounds, unichr() behaves differently, etc.
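
To make the surrogate difference concrete, here is a small standalone
C sketch. The demo_* helpers are invented for this example and do not
exist in CPython; they only show how the same code point occupies two
16-bit units in a UCS2-style buffer but a single 32-bit unit in a
UCS4-style buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store one code point as 16-bit units, UCS2-build style.
   Returns the number of units written (1 or 2). */
size_t demo_store_ucs2(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Store one code point as a 32-bit unit, UCS4-build style.
   Always writes exactly one unit. */
size_t demo_store_ucs4(uint32_t cp, uint32_t out[1])
{
    assert(cp <= 0x10FFFF);
    out[0] = cp;
    return 1;
}
```

For U+1D11E the first helper writes the pair 0xD834 0xDD1E, while the
second writes the single value 0x1D11E, so code that assumes one
buffer layout will see a different length and different contents under
the other.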

It is simply not a good idea to have modules built for a UCS2
interpreter work in a UCS4 interpreter and vice-versa.
History
Date                 User     Action  Args
2010-05-08 10:03:52  lemburg  set     recipients: + lemburg, gvanrossum, loewis, zooko, scoder, vstinner, stutzbach, r.david.murray
2010-05-08 10:03:50  lemburg  link    issue8654 messages
2010-05-08 10:03:46  lemburg  create