This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: pass_fds sometimes fails
Type: behavior Stage: resolved
Components: Library (Lib), Tests Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: gregory.p.smith Nosy List: gregory.p.smith, loewis, pitrou
Priority: normal Keywords: patch

Created on 2011-01-04 14:44 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
spdoors.patch pitrou, 2011-01-29 10:48
Messages (10)
msg125326 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-04 14:44
This happens selectively (and intermittently) on the Solaris buildbot:

test_pass_fds (test.test_subprocess.POSIXProcessTestCase) ...  -- maxfd = 256
[36787 refs]
 -- fds that should have been closed: {5, 6, 7, 8, 9, 10, 11, 12, 13}
 -- fds that remained open: {0, 1, 2, 3, 5}
FAIL

test_pass_fds (test.test_subprocess.ProcessTestCasePOSIXPurePython) ...  -- maxfd = 256
[36787 refs]
 -- fds that should have been closed: {5, 6, 7, 8, 9, 10, 11, 12, 13}
 -- fds that remained open: {0, 1, 2, 3, 5}
FAIL

As you see (thanks to debug output), all fds have been closed except number 5.

(http://www.python.org/dev/buildbot/all/builders/sparc%20solaris10%20gcc%203.x/builds/2466/steps/test/logs/stdio)
msg125329 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-04 14:55
Another (perhaps more likely) possibility is that fd 5 was properly closed, but another one created at startup by the child Python interpreter. How could we diagnose that?
msg125358 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-04 19:10
Happened also on the OpenIndiana buildbot:

http://www.python.org/dev/buildbot/all/builders/x86%20OpenIndiana%203.x/builds/492/steps/test/logs/stdio
msg127060 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-25 21:55
So, I added some debug info to test_pass_fds:

 -- fds that should have been closed: {5, 6, 7, 8, 9, 10, 11, 12, 13}
 -- fds that remained open: {0, 1, 2, 3, 5}
 -- debug info:
0 8194 posix.stat_result(st_mode=8592, st_ino=12582920, st_dev=82051072, st_nlink=1, st_uid=1140, st_gid=7, st_size=0, st_atime=1295990436, st_mtime=1295991135, st_ctime=1295990436)
1 2 posix.stat_result(st_mode=4096, st_ino=79119887, st_dev=84148224, st_nlink=0, st_uid=1140, st_gid=1, st_size=0, st_atime=1295991135, st_mtime=1295991135, st_ctime=1295991135)
2 8194 posix.stat_result(st_mode=8592, st_ino=12582920, st_dev=82051072, st_nlink=1, st_uid=1140, st_gid=7, st_size=0, st_atime=1295990436, st_mtime=1295991135, st_ctime=1295990436)
3 2 posix.stat_result(st_mode=4096, st_ino=79119882, st_dev=84148224, st_nlink=0, st_uid=1140, st_gid=1, st_size=0, st_atime=1295991135, st_mtime=1295991135, st_ctime=1295991135)
5 8192 posix.stat_result(st_mode=53540, st_ino=56, st_dev=84410368, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1269710532, st_mtime=1269710532, st_ctime=1269710532)
FAIL

Each line is in the following form: <fd> <value returned by F_GETFD> <os.fstat(fd)>

Here is my interpretation for fd 5 (which is the offending fd):
- a F_GETFD value of 8192 is different from both pipes (which return 2) and standard stdin/stderr (which return 8194); if the values haven't changed between Solaris versions, 8192 is O_LARGEFILE + O_RDONLY
- a st_mode of 53540 is stat.S_IFSOCK + stat.S_IFIFO + stat.S_IRUSR + stat.S_IRGRP + stat.S_IROTH; so this seems to be some kind of read-only (filesystem-based?) socket

I think it is unlikely that this socket is inherited from the parent process, or present before calling exec(). Instead, perhaps it's the startup of the child Python interpreter (after exec()) which creates such a file descriptor, and doesn't close it. Unfortunately, it seems difficult to diagnose this in more detail.
msg127063 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-25 22:51
I managed to get further debug info from the OpenIndiana buildbot:

 -- maxfd = 65536
 -- fds that should have been closed: {5, 6, 7, 8, 9, 10, 11, 12, 13}
 -- fds that remained open: {0, 1, 2, 3, 5}
 -- debug info:
0 2 posix.stat_result(st_mode=4096, st_ino=38002789, st_dev=142868480, st_nlink=0, st_uid=101, st_gid=101, st_size=0, st_atime=1295994652, st_mtime=1295994652, st_ctime=1295994652)
1 2 posix.stat_result(st_mode=4096, st_ino=38019207, st_dev=142868480, st_nlink=0, st_uid=101, st_gid=101, st_size=0, st_atime=1295994988, st_mtime=1295994988, st_ctime=1295994988)
2 2 posix.stat_result(st_mode=4096, st_ino=38002791, st_dev=142868480, st_nlink=0, st_uid=101, st_gid=101, st_size=0, st_atime=1295994987, st_mtime=1295994987, st_ctime=1295994987)
3 2 posix.stat_result(st_mode=4096, st_ino=38019202, st_dev=142868480, st_nlink=0, st_uid=101, st_gid=101, st_size=0, st_atime=1295994988, st_mtime=1295994988, st_ctime=1295994988)
5 8192 posix.stat_result(st_mode=53540, st_ino=210, st_dev=146014208, st_nlink=1, st_uid=0, st_gid=0, st_size=0, st_atime=1290335283, st_mtime=1290335283, st_ctime=1290335283)
total 2
dr-x------ 2 buildbot buildbot 528 Jan 25 22:36 .
dr-x--x--x 5 buildbot buildbot 864 Jan 25 22:36 ..
p--------- 0 buildbot buildbot   0 Jan 25 22:30 0
p--------- 0 buildbot buildbot   0 Jan 25 22:36 1
p--------- 0 buildbot buildbot   0 Jan 25 22:36 2
p--------- 0 buildbot buildbot   0 Jan 25 22:36 3
D--------- 1 root     root       0 Nov 21 10:28 5
  File: `/proc/22816/fd/5'
  Size: 0         	Blocks: 0          IO Block: 0      weird file
Device: 8b40000h/146014208d	Inode: 210         Links: 1
Access: (0000/D---------)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2010-11-21 10:28:03.659679536 +0000
Modify: 2010-11-21 10:28:03.659679536 +0000
Change: 2010-11-21 10:28:03.659679536 +0000


So, the offending fd (5) points to a "weird file" (!!) created in November 2010... It definitely can't be the same file as pointed to by fd 5 in the parent process, since that is a standard anonymous pipe. This reinforces the idea that starting up the child Python interpreter sometimes creates that fd and doesn't close it.

Also, while the sparc solaris buildbot doesn't have the "stat" command, it still displays the following debug output:

total 4
dr-x------   2 buildbot other        528 Jan 25 23:58 .
dr-x--x--x   5 buildbot other        832 Jan 25 23:58 ..
c---------   1 buildbot tty       24,  2 Jan 25 23:58 0
p---------   0 buildbot other          0 Jan 25 23:58 1
c---------   1 buildbot tty       24,  2 Jan 25 23:58 2
p---------   0 buildbot other          0 Jan 25 23:58 3
D---------   1 root     root           0 Mar 27  2010 5

... where fd number 5 seems to be the same kind of "weird file" created by root a long time ago. Martin, does that date (Mar 27 2010) ring a bell? Is it when the system was installed? When the buildslave was started?
msg127098 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2011-01-26 10:41
'D' denotes a door file. There are several of them in the system; the following ones (from /var/run) match the time stamp:

Dr--r--r--   1 daemon   daemon         0 Mar 27  2010 kcfd_door
Dr--r--r--   1 root     root           0 Mar 27  2010 name_service_door
Dr--r--r--   1 root     root           0 Mar 27  2010 picld_door
Drw-r--r--   1 root     root           0 Mar 27  2010 syslog_door

It's common for library routines to open a door and keep it open for the entire process lifetime.

If this was somehow reproducible, I could try to find out what specific door gets opened.
msg127099 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-26 10:48
> 'D' denotes a door file. There are several of them in the system; the
> following ones (from /var/run) match the time stamp:

Thanks for the explanation!

> It's common for library routines to open a door and keep it open for
> the entire process lifetime.
> 
> If this was somehow reproducible, I could try to find out what
> specific door gets opened.

test_subprocess fails quite often on your buildbot, so with a bit of
patience you can probably reproduce it for sure. If you use "-r
--randseed 2463396" it will schedule test_subprocess quite early, so you
can stop the test run as soon as that test is finished.

Also, debug output has been added in branches/test_subprocess_10826 :

http://www.python.org/dev/buildbot/all/builders/sparc%20solaris10%20gcc%203.x/builds/2574/steps/test/logs/stdio
msg127399 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-29 10:48
Here is a patch to ignore door files when running the fd_status script. It seems to eradicate the intermittent failures on the solaris buildbots.
msg128035 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2011-02-06 01:05
Your patch makes sense to me.  I'll commit it after the 3.2 release for 3.2.1.  The elease manager can feel free to commit it earlier if it is bothering an important buildbot.
msg129001 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-02-21 21:59
Committed in r88484 (3.x), r88485 (3.2).
History
Date User Action Args
2022-04-11 14:57:10adminsetgithub: 55035
2011-02-21 21:59:10pitrousetstatus: open -> closed
nosy: loewis, gregory.p.smith, pitrou
messages: + msg129001

resolution: fixed
stage: resolved
2011-02-06 01:05:31gregory.p.smithsetassignee: gregory.p.smith
messages: + msg128035
nosy: loewis, gregory.p.smith, pitrou
2011-01-29 10:48:53pitrousetfiles: + spdoors.patch

messages: + msg127399
keywords: + patch
nosy: loewis, gregory.p.smith, pitrou
2011-01-26 10:48:09pitrousetnosy: loewis, gregory.p.smith, pitrou
messages: + msg127099
2011-01-26 10:41:17loewissetnosy: loewis, gregory.p.smith, pitrou
messages: + msg127098
2011-01-25 22:51:24pitrousetnosy: + loewis
messages: + msg127063
2011-01-25 21:55:43pitrousetmessages: + msg127060
2011-01-04 19:10:23pitrousetmessages: + msg125358
2011-01-04 14:55:07pitrousetmessages: + msg125329
2011-01-04 14:44:49pitroucreate