classification
Title: BlockingIOError: [Errno 11] Resource temporarily unavailable: on GPFS.
Type: behavior Stage: resolved
Components: IO Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: PEAR, alexeicolin, giampaolo.rodola, gregory.p.smith, p.conesa.mingo
Priority: normal Keywords:

Created on 2021-04-06 08:21 by p.conesa.mingo, last changed 2021-05-10 20:24 by giampaolo.rodola. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 26024 merged giampaolo.rodola, 2021-05-10 20:24
Messages (7)
msg390297 - (view) Author: Pablo Conesa (p.conesa.mingo) Date: 2021-04-06 08:21
Hi, one of our users is reporting that this has started happening on a GPFS filesystem. Everything has been working fine on NTFS for many years.

I had a look at the shutil code, and I can see the try/except logic trying to fall back to the "slower" copyfileobj(fsrc, fdst).

But judging by the traceback below, that "catch" is not happening.

Any idea how to fix this?

I guess something like:

import shutil
shutil._USE_CP_SENDFILE = False

should avoid the fast_copy attempt.



> Traceback (most recent call last):
>   File "/opt/pxsoft/scipion/v3/ubuntu20.04/scipion-em-esrf/esrf/workflow/esrf_launch_workflow.py", line 432, in <module>
>     project.scheduleProtocol(prot)
>   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/project/project.py", line 633, in scheduleProtocol
>     pwutils.path.copyFile(self.dbPath, protocol.getDbPath())
>   File "/opt/px/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/site-packages/pyworkflow/utils/path.py", line 247, in copyFile
>     shutil.copy(source, dest)
>   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/shutil.py", line 415, in copy
>     copyfile(src, dst, follow_symlinks=follow_symlinks)
>   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/shutil.py", line 272, in copyfile
>     _fastcopy_sendfile(fsrc, fdst)
>   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/shutil.py", line 169, in _fastcopy_sendfile
>     raise err
>   File "/opt/pxsoft/scipion/v3/ubuntu20.04/anaconda3/envs/.scipion3env/lib/python3.8/shutil.py", line 149, in _fastcopy_sendfile
>     sent = os.sendfile(outfd, infd, offset, blocksize)
> BlockingIOError: [Errno 11] Resource temporarily unavailable: 'project.sqlite' -> 'Runs/000002_ProtImportMovies/logs/run.db'
msg391649 - (view) Author: Alexei Colin (alexeicolin) Date: 2021-04-23 03:05
Can confirm that this BlockingIOError happens on GPFS (alpine) on Summit supercomputer, tested with Python 3.8 and 3.10a7.

I found that it happens only for file sizes of 65536 bytes and above. Minimal example:

This filesize works:

$ rm -f srcfile dstfile && truncate --size 65535 srcfile && python3.10 -c "import shutil; shutil.copyfile(b'srcfile', b'dstfile')"

This file size (and larger) does not work:

$ rm -f srcfile dstfile && truncate --size 65536 srcfile && python3.10 -c "import shutil; shutil.copyfile(b'srcfile', b'dstfile')"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/.../usr/lib/python3.10/shutil.py", line 265, in copyfile
    _fastcopy_sendfile(fsrc, fdst)
  File "/.../usr/lib/python3.10/shutil.py", line 162, in _fastcopy_sendfile
    raise err
  File "/.../usr/lib/python3.10/shutil.py", line 142, in _fastcopy_sendfile
    sent = os.sendfile(outfd, infd, offset, blocksize)
BlockingIOError: [Errno 11] Resource temporarily unavailable: b'srcfile' -> b'dstfile'

I tried patching shutil.py to retry the call on this EAGAIN, but subsequent attempts fail with EAGAIN again indefinitely.

I also use the OP's workaround: setting _USE_CP_SENDFILE = False in shutil.py.
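The retry patch described above can be sketched roughly like this (a minimal illustration only, not the actual patch; the helper name and retry parameters are made up). On the affected GPFS setups the EAGAIN apparently persists forever, so the loop simply exhausts its attempts and re-raises:

```python
import errno
import os
import time


def sendfile_with_retry(outfd, infd, offset, blocksize,
                        retries=5, delay=0.1):
    """Retry os.sendfile() on EAGAIN (hypothetical helper).

    On the affected GPFS mounts every retry fails with EAGAIN
    again, so this eventually re-raises the last error.
    """
    last_err = None
    for _ in range(retries):
        try:
            return os.sendfile(outfd, infd, offset, blocksize)
        except BlockingIOError as err:
            if err.errno != errno.EAGAIN:
                raise
            last_err = err
            time.sleep(delay)
    raise last_err
```

On a healthy filesystem the first call succeeds and no retry happens, which is why the bug went unnoticed elsewhere.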
msg392985 - (view) Author: PEAR (PEAR) Date: 2021-05-05 07:57
Most probably related: https://www.ibm.com/support/pages/apar/IJ28891
msg393134 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2021-05-06 19:04
I don't believe CPython's standard library should work around a bug in specific Linux kernel versions unless it is extremely pernicious, or is not considered a bug upstream and thus will never be fixed in the OS kernel.

Since the sendfile system call appears to return one of EAGAIN, EALREADY, EWOULDBLOCK, or EINPROGRESS indefinitely in this case, there isn't anything CPython can do.  A retry/backoff loop won't help.

This should be worked around at the application level by whatever means are appropriate.
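The application-level workaround referred to here is the one from the original report: flip the private shutil switch so the sendfile() fast path is skipped entirely.

```python
import shutil

# _USE_CP_SENDFILE is a private shutil switch (Python 3.8+ on POSIX).
# Setting it to False makes copyfile() skip os.sendfile() entirely and
# fall back to the plain copyfileobj() read/write loop.
shutil._USE_CP_SENDFILE = False
```

After this, shutil.copy() and shutil.copyfile() use the slower but portable read/write loop; being a private name, the flag could change in a future release.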
msg393364 - (view) Author: Pablo Conesa (p.conesa.mingo) Date: 2021-05-10 08:08
So, when the fast copy fails, is it ok not to call _GiveupOnFastCopy(err)?

I can understand that the fast copy might fail, but then the give-up fallback should happen, and it wasn't happening.

Additionally, could _USE_CP_SENDFILE optionally be read from an environment variable, to disable the fast copy once we know it will fail?
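The environment-variable idea was never implemented in CPython, but an application could approximate it at startup along these lines (the variable name SHUTIL_NO_SENDFILE is made up for this sketch):

```python
import os
import shutil


def maybe_disable_sendfile(environ=os.environ):
    """Disable shutil's sendfile() fast path when a (made-up)
    SHUTIL_NO_SENDFILE environment variable is set, e.g. for
    processes that write to GPFS mounts."""
    if environ.get("SHUTIL_NO_SENDFILE"):
        shutil._USE_CP_SENDFILE = False


maybe_disable_sendfile()
```

Calling this once early in the application gives operators an opt-out without patching shutil.py on disk.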
msg393419 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2021-05-10 16:53
The logic for bailing out to a slow copy is currently:

https://github.com/python/cpython/blob/main/Lib/shutil.py#L158

That condition appears not to be met in Alexei's test, suggesting that either at least one sendfile call succeeded (and thus offset is non-zero) or the lseek failed.

Run that test under pdb and walk through the code, or under strace to look at the syscalls, and find out.

The question seems to be whether it should be okay to _GiveUpOnFastCopy after a partial (incomplete) copy has already occurred via sendfile.
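For reference, the bail-out condition linked above looks roughly like this (a simplified paraphrase of _fastcopy_sendfile from Lib/shutil.py; error-attribute bookkeeping and the ENOTSOCK/ENOSPC branches are omitted):

```python
import os


class _GiveupOnFastCopy(Exception):
    """Raised to make shutil fall back to copyfileobj()."""


def fastcopy_sendfile_sketch(fsrc, fdst):
    # Simplified sketch of shutil._fastcopy_sendfile's control flow.
    infd, outfd = fsrc.fileno(), fdst.fileno()
    blocksize = max(os.fstat(infd).st_size, 2 ** 23)  # at least 8 MiB
    offset = 0
    while True:
        try:
            sent = os.sendfile(outfd, infd, offset, blocksize)
        except OSError as err:
            # Give up only on the *first* call and only if no data has
            # been written yet -- the condition that no longer triggers
            # once a partial copy has already happened.
            if offset == 0 and os.lseek(outfd, 0, os.SEEK_CUR) == 0:
                raise _GiveupOnFastCopy(err)
            raise err
        if sent == 0:
            break  # EOF
        offset += sent
```

So if sendfile copies some bytes and only then starts returning EAGAIN, the code deliberately raises rather than silently restarting the copy with copyfileobj().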
msg393429 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2021-05-10 20:06
> The question seems to be whether it should be okay to _GiveUpOnFastCopy after a partial (incomplete) copy has already occurred via sendfile.

I think it should not. For posterity: my rationale for introducing _USE_CP_SENDFILE was to allow monkey patching for corner cases such as this one (see also bpo-36610 / GH-13675), but to expose it as a private name because I expected such cases to be rare and likely due to a broken underlying implementation, as appears to be the case here. FWIW, I deem _USE_CP_SENDFILE usage in production code legitimate, and as such it should stay private but never be removed.
History
Date User Action Args
2021-05-10 20:24:52  giampaolo.rodola  set     pull_requests: + pull_request24674
2021-05-10 20:06:07  giampaolo.rodola  set     messages: + msg393429
                                               versions: + Python 3.9
2021-05-10 16:53:14  gregory.p.smith   set     nosy: + giampaolo.rodola
                                               messages: + msg393419
2021-05-10 08:08:26  p.conesa.mingo    set     messages: + msg393364
2021-05-06 19:04:38  gregory.p.smith   set     status: open -> closed
                                               type: crash -> behavior
                                               nosy: + gregory.p.smith
                                               messages: + msg393134
                                               resolution: not a bug
                                               stage: resolved
2021-05-05 07:57:44  PEAR              set     messages: + msg392985
2021-04-27 13:40:33  PEAR              set     nosy: + PEAR
2021-04-23 03:06:34  alexeicolin       set     versions: + Python 3.8
2021-04-23 03:06:16  alexeicolin       set     versions: + Python 3.10, - Python 3.8
2021-04-23 03:05:57  alexeicolin       set     nosy: + alexeicolin
                                               messages: + msg391649
2021-04-06 08:21:31  p.conesa.mingo    create