This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author Henrik Bengtsson
Recipients Henrik Bengtsson, methane, vstinner
Date 2018-12-04.01:23:51
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1543886641.01.0.788709270274.issue35305@psf.upfronthosting.co.za>
In-reply-to
Content
Thank you both for the comments and suggests.  Before I'm getting to the "interesting" part, first to the "easy" one:

>> What is also useful to know, is that I'm observing this on a legacy RHEL 6 system *with a customized kernel* part of the Scyld ClusterWare (https://www.penguincomputing.com/products/software/scyld-clusterware/) that *cannot* be updated:

> Do you mean this trouble is happened only on the system?
> Or can this trouble be reproducible on normal Linux kernel?

The issue only occurs on this particular setup and I have not seen it on the two other RHEL/Centos systems I have access to.  More below.


> Are you able to reproduce the issue with Python 3?

Yes, I can confirm that I'm seeing this issue also with Python 3.6.5 where:

    import subprocess
    p = subprocess.Popen(['/sbin/ldconfig', '-p'], stdout=subprocess.PIPE)

stalls.


> Can you please also try http://pypi.org/project/subprocess32/?

Confirming that the following stalls on both Python 2.7.9 and 2.7.15:

    import subprocess32
    p = subprocess32.Popen(['/sbin/ldconfig', '-p'], stdout=subprocess32.PIPE)



POOR MAN'S DEBUGGING:

Some more narrowing down on exactly where in the code it get stuck, with the disclaimer that I'm kind of a rookie when it comes to Python so I don't really know how to debug interactively etc.

Using poor man's debug, that is, lots of print statements, I've narrowed down the stall of

    import subprocess
    p = subprocess.Popen(['/sbin/ldconfig', '-p'], stdout=subprocess.PIPE)

to occur in the call:

  _execute_child(args = ['/sbin/ldconfig', '-p'], executable = None, preexec_fn = None, close_fds = False, cwd = None, env = None, universal_newlines = False, startupinfo = None, creationflags = 0, shell = False, to_close = set([3, 4]), p2cread = None, p2cwrite = None, c2pread = 3, c2pwrite = 4, errread = None, errwrite = None)

where the *child* process (`self.pid == 0`) get stuck while calling _dup2(c2pwrite = 4, 1) which in turn calls os.dup2(a = 4, b = 1).  The *parent* process get stuck in the call data = _eintr_retry_call(os.read, errpipe_read, 1048576).  Not sure if that rules out certain things, or it just confirms what is already known/what strace is already saying.



SOME MORE TROUBLESHOOTING:

> So I think there are some bug in your kernel, relating to CLOEXEC.

I'm also leaning towards the issue is related to the kernel.  What is really interesting is that executable '/sbin/ldconfig' does *not* exist:

    $ ls /sbin/ldconfig
    ls: cannot access /sbin/ldconfig: No such file or directory

but yet, I can call it:

    $ /sbin/ldconfig -p | wc -c
    102460

and the output is indeed valid(*).  For me, this strongly suggests that this particular system call is intercepted.  I don't know the details, but I think this reflects the gist of the Scyld Clusterware kernel where it intercepts certain system calls makes a multi-node computer cluster to appear as one machine.  (*) By valid I mean '/sbin/ldconfig -p' gives identical out on the compute nodes where this problem occurs as on the master node (where /sbin/ldconfig indeed exists) which Scyld is trying to "mirror" on the compute nodes.  (I don't ask you to waste brain cycles on trying to follow this but I thought it's useful to put all this down for the record and potential future readers.)


> Maybe, you can consult with the company.

I will try to reach out to them to have them confirm my troubleshooting and see if this has been fixed in a later release of theirs.   (Unfortunately, it won't help our current legacy system, which we are slowly moving away.  Regardless, by posting this here, I hope I'll spare at least one other person some troubleshooting if they find this post; it caused me and lots of other users 100's of hours of confusion and troubleshooting before we got this far.)


It would also be interesting to understand exactly what causes the stall.  Is it indeed the pipe that gets filled up?  Is that because the kernel does *not* respect the pipe limit and just dumps all output at once (> 65,536 bytes), i.e. it is a bug?  Or is it that Python or one of its dependencies runs into a race condition because, say, it does not have a chance to set up the parent-child communication before the child (== the kernel) dumps too much data?

Inada Naoki, does the above new info give further evidence to your comment:

> So I think there are some bug in your kernel, relating to CLOEXEC.

or did it possibly bring something new to the table?  (I'm not too familiar with the Linux system calls and what to infer from strace logs).



A BROKEN DESIGN?

Finally, I don't know if the fact that `/sbin/ldconfig` does not exist but you can yet call it is (i) poorly designed kernel, or (ii) a valid design in the Unix world.  I don't know the answer to this and I don't claim one is more correct than the other.  I also don't know if there are other kernels out there that does this type of interception.  If it is indeed a valid design, then one could argue that Python and other software tools should be able to handle it.  FYI, this far I've/we've only hit this issue with Python (>= 2.7.13), maybe because of pure luck.  It did not cause a problem in Python (< 2.7.13) and it does not cause a problem if we use subprocess.Popen(..., 'shell = True').  On the other hand, if one would argue that it is a poor design, then would it make sense to protect against by for instance asserting that the executable actually exists before calling it:


--- subprocess.py         2018-12-03 16:34:05.777608059 -0800
+++ subprocess.py.2.7.15  2018-12-03 16:33:21.124909394 -0800
@@ -918,9 +918,6 @@
             if executable is None:
                 executable = args[0]

-            if not os.path.isfile(executable):
-                raise RuntimeError("No such executable: " + executable)
-
             def _close_in_parent(fd):
                 os.close(fd)
                 to_close.remove(fd)
History
Date User Action Args
2018-12-04 01:24:01Henrik Bengtssonsetrecipients: + Henrik Bengtsson, vstinner, methane
2018-12-04 01:24:01Henrik Bengtssonsetmessageid: <1543886641.01.0.788709270274.issue35305@psf.upfronthosting.co.za>
2018-12-04 01:24:00Henrik Bengtssonlinkissue35305 messages
2018-12-04 01:23:52Henrik Bengtssoncreate