classification
Title: Simplify linking of shared libraries on the AIX OS
Type: enhancement
Stage: resolved
Components: Build, Extension Modules
Versions: Python 3.9, Python 3.8, Python 3.7, Python 3.6, Python 3.5, Python 2.7
process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: David.Edelsohn, Michael.Felt, ericvw, pablogsal
Priority: normal
Keywords: patch

Created on 2019-07-26 17:32 by ericvw, last changed 2019-07-29 16:42 by ericvw. This issue is now closed.

Files
File name Uploaded Description
aix-extension-simplify.patch ericvw, 2019-07-26 17:32 Patch from 3.7 branch
Pull Requests
URL Status Linked
PR 14965 closed ericvw, 2019-07-26 17:34
Messages (6)
msg348511 - Author: Eric N. Vander Weele (ericvw) * Date: 2019-07-26 17:32
Have the approach of building shared libraries on the AIX operating
system be similar to that of a System V system.  The primary benefit of
this change is the elimination of custom AIX paths and reducing the
changes at `./configure` to affect just the `LDSHARED` environment
variable.

For background context, AIX sees shared libraries as fully linked and
resolved, where symbol references are resolved at link-time and cannot
be rebound at load-time.  System V resolves all global symbols by the
run-time linker.  Thus, conventional shared libraries in AIX cannot have
undefined symbols, while System V can.

However, AIX also supports run-time linking, which allows symbols to
remain undefined until load-time.

Therefore, this change affects how linking of shared libraries is
performed on AIX so that it behaves similarly to System V.

Given that symbols are now going to be allowed to be undefined for AIX,
all the code paths for generating exported symbols and the related
wrapper scripts go away.

The real magic is in the `-G` flag for `LDSHARED`.  Effectively, `-G`
is equivalent to specifying the following:

* -berok: Suppress errors even if there are unresolved symbols
* -brtl: Enable run-time linking
* -bnortllib: Do not include a reference to the run-time linker
* -bnosymbolic: Assigns 'nosymbolic' attribute to most symbols (i.e.,
                can be rebound)
* -bnoautoexp: Prevent auto exportation of any symbols
* -bM:SRE: Set the module type to reusable (i.e., require a private copy
           of the data area for each process).
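
As a bookkeeping illustration only, the expansion listed above can be written out in Python. The names `G_EQUIVALENT` and `expand_g` are hypothetical and are not part of the patch or of any build tool; the linker performs this expansion itself, and the flag list is taken directly from the bullets above.

```python
# Flags that the list above gives as the equivalent of -G.
G_EQUIVALENT = [
    "-berok",
    "-brtl",
    "-bnortllib",
    "-bnosymbolic",
    "-bnoautoexp",
    "-bM:SRE",
]


def expand_g(ldshared):
    """Return the LDSHARED tokens with every -G replaced by its listed expansion."""
    expanded = []
    for token in ldshared.split():
        if token == "-G":
            expanded.extend(G_EQUIVALENT)
        else:
            expanded.append(token)
    return expanded


# Show what an LDSHARED value of "xlc_r -G" stands for under this listing.
print(" ".join(expand_g("xlc_r -G")))
```

Spelling the flags out this way makes it easier to see that `-G` switches on run-time linking (`-brtl`) and suppresses unresolved-symbol errors (`-berok`) in one step.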

I have been using this patch with Python 3.7, 3.6, and 2.7 (with appropriate backporting adaptations) for about 6 months now to build and use Python C/C++ extensions on AIX.  Given that we haven't had any issues, I felt it was appropriate to see if this would be accepted upstream.
msg348519 - Author: David Edelsohn (David.Edelsohn) * Date: 2019-07-26 19:12
Absolutely, positively no.  This is horrible and completely wrong.

Applications on AIX should not be compiled to allow dynamic linking to make them operate more like SVR4/Linux.  Python does not require dynamic linking. This simply is masking a symptom in a naive and incorrect manner. Use of runtime linking causes many internal changes to the behavior of AIX applications, severely affecting performance and potentially causing overflow of data structures.

We currently are going through the process of removing this brain damage from CMake.  I absolutely will not allow Python to go down this path and introduce this type of mistake.
msg348523 - Author: Eric N. Vander Weele (ericvw) * Date: 2019-07-26 20:01
> This is horrible and completely wrong.

I'm not an expert in AIX and xlc, by any means.  I would greatly appreciate your help in understanding this better, so I can see the problem the way you do and figure out the best approach to take.

My primary motivation was to simplify/homogenize the mechanism by which Python C/C++ extensions are built.  For background, I have Python applications and libraries that need to run on Linux, Solaris, and AIX.  One of the challenges we ran into was how and when symbol resolution occurs, which is fundamentally different in AIX.

> Python does not require dynamic linking.

I understand Python does not require dynamic linking.  However, the problem I am running into is how this should work/behave for Python C/C++ extensions, which are imported (loaded) at runtime of a Python application.  Maybe this is where I have a fundamental misunderstanding, but it led me to believe that in AIX this should behave similarly to SVR4/Linux.  When scouring how Python interplays with AIX for building Python C/C++ extensions, this problem piqued my interest.

When conducting my own research, I came across http://download.boulder.ibm.com/ibmdl/pub/software/dw/aix/es-aix_ll.pdf, which helped me understand the differences between dynamic loading and run-time linking.  Thus, I went down the path of run-time linking with the '-G' flag, which appeared similar to what was done in Python for other operating systems.

> This simply is masking a symptom in a naive and incorrect manner.

This is leading up to my misunderstanding of what I was observing during my initial investigation of what was going on.

I'll need to revisit the symptom being observed, but I vaguely recall missing symbols when building Python C/C++ extensions when the interpreter is configured with '--enable-shared'.  Let me go back, undo the patch I have, and recreate the symptom/issue that was observed.

> Use of runtime linking causes many internal changes to the behavior of AIX applications, severely affecting performance and potentially causing overflow of data structures.

I'm really curious about this one.  What internal changes, performance concerns, and overflow of data structures could occur?  Luckily, I have neither observed nor experienced anything egregiously negative thus far.  Understanding these concerns will help bolster my understanding.

> I absolutely will not allow Python to go down this path and introduce this type of mistake.

No worries.  I'm trying to solve a problem and appear to have gone down an incorrect path.  Better understanding what is expected for Python and its associated C/C++ extensions will help me focus on where the misunderstanding is and on the problem that actually needs to be addressed.
msg348535 - Author: David Edelsohn (David.Edelsohn) * Date: 2019-07-27 00:50
Runtime linking allows a dynamically loaded library to interpose symbols. The classic example is allowing a program or dynamic library to overload C++ operator new. A library or program overrides the symbol by name.

Python does not require this. Python does not need to allow an extension module to override a function in Python.

If one needs to add AIX ld -G and runtime linking, 99% of the time one is covering up a problem.

The downside of -G is that it forces all global functions to be called through the AIX glink code (equivalent to SVR4 PLT) and not inlined.  This allows every global function call to be overridden, but forces every call to go through a function pointer. This is expensive.

Calling functions through the "PLT" requires that the function pointers for each global function be placed in the AIX TOC (equivalent to SVR4 GOT).  If the program or shared library is large enough, this can overflow the "GOT", which then requires even more expensive fixup code.
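
The difference between a call bound once and a call routed through a table can be modelled loosely in Python. This is a toy analogy, not how the AIX loader or glink code is implemented; every name in it (`symbol_table`, `alloc`, and the functions) is hypothetical and exists only for illustration.

```python
def original_alloc():
    return "original"


def interposed_alloc():
    return "interposed"


# Runtime-linking analogy: every call looks the target up in a shared table
# at call time, so rewriting the table entry changes what gets called.
symbol_table = {"alloc": original_alloc}


def call_via_table():
    return symbol_table["alloc"]()  # extra indirection on every call


# Link-time binding analogy: the target is captured once and never rebound.
bound_alloc = original_alloc


def call_bound():
    return bound_alloc()


# A later-loaded "library" overrides the symbol by name.
symbol_table["alloc"] = interposed_alloc

print(call_via_table())  # interposed
print(call_bound())      # original
```

The table lookup is what makes interposition possible, and it is also the per-call cost described above: the flexibility is paid for on every call, whether or not anything is ever overridden.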

The mistaken use of this option leads down a path with bad performance and potentially requiring more and more effort to recover from problems introduced by the choice.

I don't know exactly the symptoms that you observed, but one possibility is that the shared object you are building is not being linked against all of the dependent libraries.

Separate from runtime linking, SVR4 allows unresolved symbols when a shared library is created and used to export all global symbols by default (before the efforts on symbol visibility). A simplistic way of describing this is that a process into which an executable and shared libraries are loaded sort of has this soup of all global symbols floating around and available to the runtime loader.  When a new shared library is loaded, the dynamic linker can resolve the symbols from any definitions available in the process.  Allowing the unresolved symbols at shared library link time is a promise that the symbols will be provided by someone at runtime. At runtime, all of the symbol needs and definitions are thrown in the air and hopefully match up correctly when first referenced at runtime.

AIX requires that all shared objects be fully resolved at link-edit time.  It requires that the shared object refer to all dependent libraries at link time, even if those libraries also will be present and provided by other shared libraries or executable at runtime.

In other words, on AIX, one must link all C++ shared objects against the C++ standard library, even if the main executable is linked against the library.

So, again, one possible explanation for the error of missing symbols is that one or more dependent libraries are missing from the link command building the shared object and that omission coincidentally happens to work on SVR4/Linux because of its semantics, but it doesn't work in the more strict environment of AIX.
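
The two link-time policies described above can be sketched as a small Python model. This is a simplified illustration under stated assumptions, not a description of either linker: `link_shared_object`, `UnresolvedSymbols`, and the symbol and library names are all hypothetical placeholders.

```python
class UnresolvedSymbols(Exception):
    """Raised when strict linking finds symbols no listed library provides."""


def link_shared_object(needed, libraries, allow_unresolved):
    """Return the symbols still unresolved after consulting the listed libraries.

    needed: set of symbol names the shared object references
    libraries: dict mapping a library name to the set of symbols it exports
    allow_unresolved: True models SVR4-style linking, False models strict
        link-edit-time resolution as described for AIX
    """
    provided = set()
    for exported in libraries.values():
        provided |= exported
    missing = set(needed) - provided
    if missing and not allow_unresolved:
        raise UnresolvedSymbols(sorted(missing))
    return missing


needed = {"py_api_symbol", "cxx_runtime_symbol"}
only_python = {"libpython": {"py_api_symbol"}}
all_libs = {"libpython": {"py_api_symbol"}, "libcxx": {"cxx_runtime_symbol"}}

# SVR4-style: the omission is tolerated and deferred to run time.
print(link_shared_object(needed, only_python, allow_unresolved=True))

# Strict: the same link command fails until the missing library is listed.
try:
    link_shared_object(needed, only_python, allow_unresolved=False)
except UnresolvedSymbols as exc:
    print("link failed, missing:", exc.args[0])

print(link_shared_object(needed, all_libs, allow_unresolved=False))
```

In this model the fix for the strict case is to add the missing library to the link command, not to relax the policy.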

This type of error should not be solved through runtime linking to borrow the missing symbols from the running process, which is a very expensive solution.
msg348605 - Author: Michael Felt (Michael.Felt) * Date: 2019-07-29 10:22
David gives several reasons why this PR should not be used.

And, in reading them - while I follow them at face value, there may be things I miss due to ignorance or being naive (more the system admin than tool developer).

Isn't there a configure option, --enable-shared, that (sadly!) gives an SVR4-like shared library?  From a sys-admin view, it is a .so file (libpython3.7m.so) rather than "the same file" as a member of an archive (e.g., libpython3.a[libpython3.7m.so]).

While it may be common on other operating systems to have two "lib" directories, e.g., /usr/lib and /usr/lib64, on AIX one directory (/usr/lib) is expected, and the archives (.a files) may have multiple members, e.g., a 32-bit and a 64-bit member.

Not using .a files makes it very hard to keep a "tight-ship" on an AIX server - and I feel it is incorrect for a tool to dictate system administration policy.

As I do not know how Python looks on other systems - here is a short view of Python and ldd when --enable-shared is used:

/opt/bin/python3 needs:
         /usr/lib/libc.a(shr.o)
         /usr/lib/libpthreads.a(shr_xpg5.o)
         /opt/lib/libpython3.7m.so
         /unix
         /usr/lib/libcrypt.a(shr.o)
         /usr/lib/libpthreads.a(shr_comm.o)
         /usr/lib/libdl.a(shr.o)


Here is an example not using --enable-shared:
/opt/bin/python3 needs:
         /usr/lib/libc.a(shr.o)
         /usr/lib/libpthreads.a(shr_xpg5.o)
         /usr/lib/libpthreads.a(shr_comm.o)
         /usr/lib/libdl.a(shr.o)
         /usr/lib/libintl.a(libintl.so.8)
         /unix
         /usr/lib/libcrypt.a(shr.o)
         /usr/lib/libpthread.a(shr_xpg5.o)
         /usr/lib/libiconv.a(libiconv.so.2)
         /usr/lib/libc.a(shr_64.o)
         /usr/lib/libcrypt.a(shr_64.o)
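
The `archive(member)` notation in the listings above can be split mechanically. The following is a minimal Python sketch for reading such lines; `split_dependency` is a hypothetical helper, and the pattern assumes only the two shapes that appear above (a plain path, or a path followed by a member in parentheses).

```python
import re

# Matches "path(member)" at the end of a line, e.g. /usr/lib/libc.a(shr.o).
_MEMBER = re.compile(r"^(?P<archive>.+)\((?P<member>[^()]+)\)$")


def split_dependency(line):
    """Split one dependency line into (path, member); member is None for plain files."""
    entry = line.strip()
    match = _MEMBER.match(entry)
    if match:
        return match.group("archive"), match.group("member")
    return entry, None


for line in (
    "         /usr/lib/libc.a(shr.o)",
    "         /opt/lib/libpython3.7m.so",
    "         /unix",
):
    print(split_dependency(line))
```

A dependency with no member, such as /opt/lib/libpython3.7m.so, is a standalone file rather than a member of an archive, which is the distinction drawn above.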

Both versions build ".so" files, that are accessed using dlopen()

root@x066:[/home/root]find /opt/lib/python3.7 -name \*.so | head
/opt/lib/python3.7/lib-dynload/_asyncio.so
/opt/lib/python3.7/lib-dynload/_bisect.so
/opt/lib/python3.7/lib-dynload/_blake2.so
/opt/lib/python3.7/lib-dynload/_bz2.so
/opt/lib/python3.7/lib-dynload/_codecs_cn.so
/opt/lib/python3.7/lib-dynload/_codecs_hk.so
/opt/lib/python3.7/lib-dynload/_codecs_iso2022.so
/opt/lib/python3.7/lib-dynload/_codecs_jp.so
/opt/lib/python3.7/lib-dynload/_codecs_kr.so
/opt/lib/python3.7/lib-dynload/_codecs_tw.so

and

root@x064:[/opt/lib/python3.7]find /opt/lib/python3.7 -name \*.so | head
/opt/lib/python3.7/lib-dynload/_asyncio.so
/opt/lib/python3.7/lib-dynload/_bisect.so
/opt/lib/python3.7/lib-dynload/_blake2.so
/opt/lib/python3.7/lib-dynload/_bz2.so
/opt/lib/python3.7/lib-dynload/_codecs_cn.so
/opt/lib/python3.7/lib-dynload/_codecs_hk.so
/opt/lib/python3.7/lib-dynload/_codecs_iso2022.so
/opt/lib/python3.7/lib-dynload/_codecs_jp.so
/opt/lib/python3.7/lib-dynload/_codecs_kr.so
/opt/lib/python3.7/lib-dynload/_codecs_tw.so

Lastly, the PR, as is, appears to be broken.

make: *** [Makefile:613: sharedmods] Illegal instruction (core dumped)
/opt/bin/make returned an error
root@x066:[/data/prj/python/python3-3.9]make V=1
 CC='xlc_r' LDSHARED='xlc_r -G    ' OPT='-DNDEBUG -O'   _TCLTK_INCLUDES='' _TCLTK_LIBS=''       ./python -E ../git/python3-3.9/setup.py  build
make: *** [Makefile:613: sharedmods] Illegal instruction (core dumped)

Note also: LDSHARED has added xlc_r to its flags - that does not seem right either.

-1
msg348675 - Author: Eric N. Vander Weele (ericvw) * Date: 2019-07-29 16:42
Thanks for the in-depth responses and feedback.

When reinvestigating in more detail what led me to create this patch, I discovered that the premise I was operating on did not reflect the default (desired) compiler and linker flags.  It turns out the environment I am working in builds all of the software using -bsvr4 and -brtl on AIX.
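
One way to check whether an existing Python build carries such flags is to inspect its recorded build variables. This is a minimal sketch that assumes the flags would show up in `LDSHARED`, `BLDSHARED`, or `LDFLAGS` as reported by the standard `sysconfig` module; `runtime_linking_flags` is a hypothetical helper, and flags injected elsewhere (for example through compiler wrappers) would not be caught.

```python
import sysconfig

# Flags named in this issue that enable or imply run-time linking on AIX.
RUNTIME_LINKING_FLAGS = {"-G", "-brtl", "-bsvr4"}


def runtime_linking_flags(command):
    """Return the run-time-linking flags found in a command string, in order."""
    return [
        token
        for token in (command or "").split()
        if token in RUNTIME_LINKING_FLAGS
    ]


# get_config_var returns None when a variable is not recorded for this build.
for name in ("LDSHARED", "BLDSHARED", "LDFLAGS"):
    value = sysconfig.get_config_var(name)
    print(name, "->", runtime_linking_flags(value))
```

An empty list for every variable suggests the interpreter itself was not configured with these flags, which narrows the search to the surrounding build environment.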

I have a lot more to unravel now.  I already closed the PR and will abandon this issue since it has been clearly illustrated that this is masking an underlying problem.

Thanks for taking the time to provide feedback and detail of what is problematic with this change.
History
Date User Action Args
2019-07-29 16:42:55 ericvw set status: open -> closed; messages: + msg348675; stage: patch review -> resolved
2019-07-29 10:22:41 Michael.Felt set messages: + msg348605
2019-07-27 00:50:06 David.Edelsohn set messages: + msg348535
2019-07-26 20:01:46 ericvw set messages: + msg348523
2019-07-26 19:12:29 David.Edelsohn set messages: + msg348519
2019-07-26 18:10:45 pablogsal set nosy: + David.Edelsohn
2019-07-26 17:39:06 pablogsal set nosy: + Michael.Felt, pablogsal
2019-07-26 17:34:00 ericvw set stage: patch review; pull_requests: + pull_request14732
2019-07-26 17:32:29 ericvw create