classification
Title: Efficient zero-copy for shutil.copy* functions (Linux, OSX and Win)
Type: performance Stage: resolved
Components: Library (Lib) Versions: Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: SilentGhost, StyXman, asvetlov, facundobatista, giampaolo.rodola, gps, josh.r, martin.panter, ncoghlan, neologix, petr.viktorin, pitrou, python-dev, r.david.murray, scoder, socketpair, tarek, vstinner
Priority: normal Keywords: needs review, patch

Created on 2018-05-28 16:17 by giampaolo.rodola, last changed 2018-06-26 10:45 by desbma. This issue is now closed.

Files
File name Uploaded Description Edit
shutil-zero-copy.diff giampaolo.rodola, 2018-05-28 16:17
Pull Requests
URL Status Linked Edit
PR 7160 closed giampaolo.rodola, 2018-05-28 16:21
PR 7681 merged giampaolo.rodola, 2018-06-13 11:22
PR 7919 merged vstinner, 2018-06-26 00:11
PR 7800 merged vstinner, 2018-06-26 00:11
PR 7875 closed 4383, 2018-06-26 05:12
Messages (13)
msg317878 - Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2018-05-28 16:17
The attached patch uses platform-specific zero-copy syscalls on Linux and Solaris (os.sendfile(2)), Windows (CopyFileW) and OSX (fcopyfile(2)), speeding up shutil.copyfile() and the other functions that use it (copy(), copy2(), copytree(), move()).
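
As a rough illustration of the Linux path described above, here is a minimal sketch of a sendfile-based copy. It is not the attached patch: the helper name copy_with_sendfile and the max_chunk parameter are made up for this example. It assumes a platform where os.sendfile() accepts a regular file as the output descriptor (Linux 2.6.33 or later) and falls back to shutil.copyfileobj() elsewhere.

```python
import os
import shutil


def copy_with_sendfile(src, dst, max_chunk=2**30):
    """Copy src to dst with os.sendfile() where possible (hypothetical helper)."""
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        if not hasattr(os, "sendfile"):
            # e.g. Windows: no sendfile, use a plain read/write loop instead.
            shutil.copyfileobj(fsrc, fdst)
            return dst
        in_fd, out_fd = fsrc.fileno(), fdst.fileno()
        size = os.fstat(in_fd).st_size
        offset = 0
        while offset < size:
            try:
                count = min(size - offset, max_chunk)
                sent = os.sendfile(out_fd, in_fd, offset, count)
            except OSError:
                if offset:
                    raise  # some data was already written: a real I/O error
                # sendfile is not usable for regular files here (e.g. macOS).
                shutil.copyfileobj(fsrc, fdst)
                return dst
            if sent == 0:
                break  # source was truncated while copying
            offset += sent
    return dst
```

The data never passes through a Python-level buffer on the sendfile path, which is where the reduced user time in the benchmarks would come from.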

Average speedup for a 512MB file copy is +26% on Linux, +50% on OSX and +48% on Windows when copying a file within the same partition (an SSD was used).

Some benchmarks follow.

Setup
=====

Create a 128K, 8M or 512M file (each command overwrites f1, so run one at a time):

    $ python -c "import os; f = open('f1', 'wb'); f.write(os.urandom(128 * 1024))"
    $ python -c "import os; f = open('f1', 'wb'); f.write(os.urandom(8 * 1024 * 1024))"
    $ python -c "import os; f = open('f1', 'wb'); f.write(os.urandom(512 * 1024 * 1024))"

Benchmark:

    $ time ./python -m timeit -s 'import shutil; p1 = "f1"; p2 = "f2"' 'shutil.copyfile(p1, p2)'
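
For anyone who prefers to reproduce this from a single script, the setup and timing steps above could be sketched roughly as follows. The function name bench_copyfile and its number/repeat defaults are assumptions for this example and do not match the timeit command line exactly.

```python
import os
import shutil
import tempfile
import timeit


def bench_copyfile(size, number=5, repeat=3):
    """Return the best per-copy time in seconds for a file of `size` bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "f1")
        dst = os.path.join(tmp, "f2")
        # Same setup as the one-liners above: a file full of random bytes.
        with open(src, "wb") as f:
            f.write(os.urandom(size))
        timer = timeit.Timer(lambda: shutil.copyfile(src, dst))
        # Best of `repeat` runs, each run doing `number` copies.
        return min(timer.repeat(repeat=repeat, number=number)) / number


if __name__ == "__main__":
    for label, size in [("128K", 128 * 1024), ("8M", 8 * 1024 * 1024)]:
        print(f"{label}: {bench_copyfile(size) * 1e3:.3f} msec per copy")
```

The 512M case is left out of the loop only to keep the script quick; pass 512 * 1024 * 1024 to include it.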

Linux
=====

128K copy (+13%):

    without patch:
        1000 loops, best of 5: 228 usec per loop
        real    0m1.756s
        user    0m0.386s
        sys     0m1.116s

    with patch:
        1000 loops, best of 5: 198 usec per loop
        real    0m1.464s
        user    0m0.281s
        sys     0m0.958s

8MB copy (+24%):

    without patch:
        50 loops, best of 5: 10.1 msec per loop
        real    0m2.703s
        user    0m0.316s
        sys     0m1.847s

    with patch:
        50 loops, best of 5: 7.78 msec per loop
        real    0m2.447s
        user    0m0.086s
        sys     0m1.682s

512MB copy (+26%):

    without patch:
        1 loop, best of 5: 872 msec per loop
        real    0m5.574s
        user    0m0.402s
        sys     0m3.115s

    with patch:
        1 loop, best of 5: 646 msec per loop
        real    0m5.475s
        user    0m0.037s
        sys     0m2.959s

OSX
===

128K copy (+8.5%):

    without patch:
        500 loops, best of 5: 508 usec per loop
        real    0m2.971s
        user    0m0.442s
        sys     0m2.168s

    with patch:
        500 loops, best of 5: 464 usec per loop
        real    0m2.798s
        user    0m0.379s
        sys     0m2.031s

8MB copy (+67%):

    without patch:
        20 loops, best of 5: 32.8 msec per loop
        real    0m3.672s
        user    0m0.357s
        sys     0m1.434s

    with patch:
        20 loops, best of 5: 10.8 msec per loop
        real    0m1.860s
        user    0m0.079s
        sys     0m0.719s

512MB copy (+50%):

    without patch:
        1 loop, best of 5: 953 msec per loop
        real    0m5.930s
        user    0m1.021s
        sys     0m4.835s
    
    with patch:
        1 loop, best of 5: 480 msec per loop
        real    0m3.150s
        user    0m0.067s
        sys     0m2.740s

Windows
=======

128K copy (+69%):

    without patch:
        50 loops, best of 5: 6.45 msec per loop
    with patch:
        50 loops, best of 5: 1.99 msec per loop

8M copy (+64%):

    without patch:
        10 loops, best of 5: 22.6 msec per loop
    with patch:
        50 loops, best of 5: 7.95 msec per loop

512M copy (+48%):

    without patch:
        1 loop, best of 5: 1.21 sec per loop
    with patch:
        1 loop, best of 5: 629 msec per loop
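
The percentages in the headings appear to be the relative reduction in per-loop time. A small sketch of that arithmetic, using the 512MB timings listed above (the function name speedup_percent is made up for this example):

```python
def speedup_percent(before, after):
    """Relative reduction in per-loop time, as a percentage."""
    return (before - after) / before * 100


# Per-loop timings in milliseconds, taken from the 512MB results above.
print(round(speedup_percent(872, 646)))   # Linux   -> 26
print(round(speedup_percent(953, 480)))   # OSX     -> 50
print(round(speedup_percent(1210, 629)))  # Windows -> 48
```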
msg317880 - Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2018-05-28 16:22
PR: https://github.com/python/cpython/pull/7160
msg317905 - Author: Stefan Behnel (scoder) * Date: 2018-05-28 19:36
Nice, I really like this.

Apart from the usual bit of minor style issues, I couldn't see anything inherently wrong with the PR, but I'll leave the detailed reviews to those who'd have to maintain the code in the future. :)
msg317906 - Author: Stefan Behnel (scoder) * Date: 2018-05-28 19:38
Regarding the benchmarks, just to be sure, did you try reversing the run order to make sure you don't get unfair caching effects for the later runs?
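
One way to check for such ordering effects is to interleave the variants and flip their order on every round. A minimal sketch, assuming the two copy implementations are available as plain callables (best_times and the names passed to it are hypothetical):

```python
import time


def best_times(funcs, rounds=6):
    """Time each callable in `funcs` (name -> callable), reversing order every round."""
    best = {name: float("inf") for name in funcs}
    order = list(funcs)
    for _ in range(rounds):
        for name in order:
            start = time.perf_counter()
            funcs[name]()
            elapsed = time.perf_counter() - start
            best[name] = min(best[name], elapsed)
        order.reverse()  # the next round runs the variants in the opposite order
    return best
```

If one variant only wins when it runs second, that would point at page-cache effects and not at the copy method itself.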
msg317932 - Author: Марк Коренберг (socketpair) * Date: 2018-05-28 22:05
http://man7.org/linux/man-pages/man2/ioctl_ficlonerange.2.html

That should possibly be used on Linux in order to really achieve zero-copy, just like the modern cp command does.
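
For reference, a reflink attempt with that ioctl could look roughly like the sketch below. This is only an illustration, not something from the patch: clone_or_copy is a made-up name, the FICLONE value is hard-coded from <linux/fs.h> as _IOW(0x94, 9, int) in case the fcntl module does not expose it, and the fallback to shutil.copyfileobj() is an assumption about how a caller might want to degrade.

```python
import shutil
import sys

try:
    import fcntl
except ImportError:  # the fcntl module does not exist on Windows
    fcntl = None

# FICLONE from <linux/fs.h>: _IOW(0x94, 9, int). Hard-coded as a fallback.
FICLONE = getattr(fcntl, "FICLONE", 0x40049409)


def clone_or_copy(src, dst):
    """Try a reflink clone of src into dst; fall back to a regular copy.

    Returns True if the file was cloned, False if it was copied.
    """
    with open(src, "rb") as fsrc, open(dst, "wb") as fdst:
        if fcntl is not None and sys.platform.startswith("linux"):
            try:
                # Ask the filesystem to share src's extents with dst (CoW).
                fcntl.ioctl(fdst.fileno(), FICLONE, fsrc.fileno())
                return True
            except OSError:
                pass  # no reflink support here, or src/dst on different filesystems
        shutil.copyfileobj(fsrc, fdst)
        return False
```

On filesystems without reflink support (ext4, for example) the ioctl fails and the sketch simply copies the data.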
msg317989 - Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2018-05-29 08:43
Yes, I tried changing the benchmark order, and the zero-copy variants are always faster. As for an instantaneous CoW copy, that is debatable. E.g. the "cp" command does not do it by default:
https://unix.stackexchange.com/questions/80351/why-is-cp-reflink-auto-not-the-default-behaviour
I think shutil should follow the same lead, and perhaps provide a cow=bool argument in the future.
msg319401 - Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2018-06-12 21:04
New changeset 4a172ccc739065bb658c75e8929774a8e94af9e9 by Giampaolo Rodola in branch 'master':
bpo-33671: efficient zero-copy for shutil.copy* functions (Linux, OSX and Win) (#7160)
https://github.com/python/cpython/commit/4a172ccc739065bb658c75e8929774a8e94af9e9
msg319405 - Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2018-06-12 21:48
For future reference: as per the discussion on https://github.com/python/cpython/pull/7160, we decided not to use CopyFileEx on Windows and instead to increase the read() buffer size from 16KB to 1MB (Windows only), resulting in a 40.8% speedup (instead of 48%). Also, copyfileobj() has been optimized on all platforms by using readinto()/memoryview()/bytearray().
Updated benchmarks on Windows:

128KB copy (+27%)

    without patch:
        50 loops, best of 5: 7.69 msec per loop
    with patch:
        50 loops, best of 5: 5.61 msec per loop

8MB copy (+45.6%)

    without patch:
        10 loops, best of 5: 20.8 msec per loop
    with patch:
        20 loops, best of 5: 11.3 msec per loop

512MB copy (+40.8%)

    without patch:
        1 loop, best of 5: 1.26 sec per loop
    with patch:
        1 loop, best of 5: 646 msec per loop
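
To make the readinto()/memoryview()/bytearray() idea mentioned above more concrete, here is a minimal sketch of such a copy loop. It is an approximation written for this thread, not a copy of the merged code; the function name copy_readinto and the 1MB default are assumptions.

```python
def copy_readinto(fsrc, fdst, length=1024 * 1024):
    """Copy fsrc to fdst by reusing one preallocated buffer (illustrative sketch)."""
    readinto = fsrc.readinto
    write = fdst.write
    # One bytearray is allocated up front and refilled on every iteration,
    # so no new bytes object is created per chunk as with read().
    with memoryview(bytearray(length)) as mv:
        while True:
            n = readinto(mv)
            if not n:
                break  # end of file
            if n < length:
                # Short read: write only the filled part, without copying it.
                with mv[:n] as chunk:
                    write(chunk)
            else:
                write(mv)
```

Both file objects need to be opened in binary mode, and fsrc has to support readinto(), which regular files opened with open(..., "rb") do.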
msg319484 - Author: Marcos Dione (StyXman) * Date: 2018-06-13 19:50
Thanks Giampaolo for pushing for this. Great job.
msg319486 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-13 21:20
> Thanks Giampaolo for pushing for this. Great job.

I concur: great job! Cool optimization.
msg319980 - Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2018-06-19 15:27
New changeset c7f02a965936f197354d7f4e6360f4cfc86817ed by Giampaolo Rodola in branch 'master':
bpo-33671 / shutil.copyfile: use memoryview() with dynamic size on Windows (#7681)
https://github.com/python/cpython/commit/c7f02a965936f197354d7f4e6360f4cfc86817ed
msg320247 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-22 17:25
New changeset 8fbbdf0c3107c3052659e166f73990b466eacbb0 by Victor Stinner in branch 'master':
bpo-33671: Add support.MS_WINDOWS and support.MACOS (GH-7800)
https://github.com/python/cpython/commit/8fbbdf0c3107c3052659e166f73990b466eacbb0
msg320456 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-06-26 00:11
New changeset 937ee9e745d7ff3c2010b927903c0e2a83623324 by Victor Stinner in branch 'master':
Revert "bpo-33671: Add support.MS_WINDOWS and support.MACOS (GH-7800)" (GH-7919)
https://github.com/python/cpython/commit/937ee9e745d7ff3c2010b927903c0e2a83623324
History
Date User Action Args
2018-07-27 13:26:04 berker.peksag link issue25156 superseder
2018-06-26 10:45:30 desbma set nosy: - desbma
2018-06-26 05:12:02 4383 set pull_requests: + pull_request7536
2018-06-26 00:11:14 vstinner set pull_requests: + pull_request7529
2018-06-26 00:11:09 vstinner set messages: + msg320456
2018-06-26 00:11:08 vstinner set pull_requests: + pull_request7528
2018-06-25 23:19:26 giampaolo.rodola set pull_requests: - pull_request7403
2018-06-25 23:18:58 giampaolo.rodola set pull_requests: - pull_request7525
2018-06-25 23:18:47 giampaolo.rodola set pull_requests: - pull_request7482
2018-06-25 23:16:35 vstinner set pull_requests: + pull_request7525
2018-06-23 13:47:57 python-dev set pull_requests: + pull_request7482
2018-06-22 17:25:46 vstinner set messages: + msg320247
2018-06-19 16:18:09 vstinner set pull_requests: + pull_request7403
2018-06-19 15:27:32 giampaolo.rodola set messages: + msg319980
2018-06-13 21:20:16 vstinner set messages: + msg319486
2018-06-13 19:50:34 StyXman set messages: + msg319484
2018-06-13 11:22:33 giampaolo.rodola set pull_requests: + pull_request7293
2018-06-12 21:50:44 yselivanov set nosy: - yselivanov
2018-06-12 21:48:06 giampaolo.rodola set status: open -> closed; resolution: fixed; messages: + msg319405; stage: patch review -> resolved
2018-06-12 21:04:57 giampaolo.rodola set messages: + msg319401
2018-05-29 08:43:22 giampaolo.rodola set messages: + msg317989
2018-05-28 22:05:46 socketpair set nosy: + socketpair; messages: + msg317932
2018-05-28 19:38:51 scoder set messages: + msg317906
2018-05-28 19:36:41 scoder set nosy: + scoder; messages: + msg317905
2018-05-28 16:33:01 giampaolo.rodola set nosy: + facundobatista, ncoghlan, pitrou, vstinner, gps, StyXman, tarek, r.david.murray, petr.viktorin, asvetlov, SilentGhost, neologix, python-dev, martin.panter, desbma, yselivanov, josh.r
2018-05-28 16:28:30 giampaolo.rodola link issue33639 superseder
2018-05-28 16:22:15 giampaolo.rodola set messages: + msg317880
2018-05-28 16:21:02 giampaolo.rodola set pull_requests: + pull_request6795
2018-05-28 16:18:28 giampaolo.rodola set title: Efficient efficient zero-copy syscalls for shutil.copy* functions (Linux, OSX and Win) -> Efficient zero-copy for shutil.copy* functions (Linux, OSX and Win)
2018-05-28 16:17:23 giampaolo.rodola create