classification
Title: Increase shutil.COPY_BUFSIZE
Type: resource usage
Stage: resolved
Components: Library (Lib)
Versions: Python 3.8
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: desbma, giampaolo.rodola, inada.naoki
Priority: normal
Keywords: patch

Created on 2019-02-25 09:27 by inada.naoki, last changed 2019-03-02 04:32 by inada.naoki. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 12115 merged inada.naoki, 2019-03-01 07:23
Messages (10)
msg336505 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-02-25 09:38
shutil.COPY_BUFSIZE is 16 KiB on non-Windows platforms.
That seems a bit small for good performance.

According to this article [1], 128 KiB gives the best performance on common systems.
[1]: https://eklitzke.org/efficient-file-copying-on-linux

Another resource: EBS document [2] uses 128KiB I/O for throughput.
[2]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

Can we increase shutil.COPY_BUFSIZE to 128KiB by default?

Note that 128 KiB is still small compared with the Windows default (1 MiB).
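
For reference, a minimal sketch of how the default can be inspected and overridden per call. It assumes Python 3.8 or later for the `shutil.COPY_BUFSIZE` module attribute and falls back to the older 16 KiB default otherwise; the in-memory streams and the 128 KiB value are illustrative only.

```python
import io
import shutil

# COPY_BUFSIZE is a module attribute in Python 3.8+; older versions
# hard-coded 16 KiB as the default length of copyfileobj().
default_bufsize = getattr(shutil, "COPY_BUFSIZE", 16 * 1024)
print(f"default buffer size: {default_bufsize // 1024} KiB")

# The third argument of copyfileobj() overrides the default for one call.
src = io.BytesIO(b"x" * (1024 * 1024))
dst = io.BytesIO()
shutil.copyfileobj(src, dst, 128 * 1024)
copied = dst.getvalue()
```

Passing the length explicitly lets a caller try a larger buffer without depending on the library default.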
msg336523 - (view) Author: desbma (desbma) * Date: 2019-02-25 14:24
Your first link explains why 128kB buffer size is faster in the context of cp: it's due to fadvise and kernel read ahead.

None of the shutil functions call fadvise, so the benchmark and conclusions are irrelevant to the Python buffer size IMO.

In general, the bigger the buffer, the better, to reduce syscall frequency (also explained in the article), but going from 16kB to 128kB is clearly in the micro-optimization range and unlikely to make any significant difference.

Also with 3.8, in many typical file copy cases (but not all), sendfile will be used, which makes buffer size even less important.
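
To put rough numbers on "syscall frequency", here is a back-of-the-envelope sketch. It assumes a hypothetical 1 GiB source, that every read returns a full buffer, and that the copy loop issues one extra read to detect EOF; the helper name `read_calls` is made up for illustration.

```python
import math

FILE_SIZE = 1024 ** 3  # hypothetical 1 GiB source file


def read_calls(file_size, bufsize):
    """Number of read() calls a simple read/write loop makes.

    Assumes each read returns a full buffer; the +1 is the final
    read that returns b"" and ends the loop.
    """
    return math.ceil(file_size / bufsize) + 1


calls_16k = read_calls(FILE_SIZE, 16 * 1024)
calls_128k = read_calls(FILE_SIZE, 128 * 1024)
print(calls_16k, calls_128k)
```

Each read is paired with a write, so the total call count is roughly double these figures. Whether that difference is measurable depends on how expensive each call is relative to the actual I/O.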
msg336533 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-02-25 15:52
> Your first link explains why 128kB buffer size is faster in the context of cp: it's due to fadvise and kernel read ahead.
> 
> None of the shutil functions call fadvise, so the benchmark and conclusions are irrelevant to the Python buffer size IMO.

Even without fadvise, readahead works automatically.  fadvise doubles the readahead size on Linux.  But I don't know whether it really doubles the readahead size when the block device advertises its own readahead size.


> In general, the bigger the buffer, the better, to reduce syscall frequency (also explained in the article), but going from 16kB to 128kB is clearly in the micro-optimization range and unlikely to make any significant difference.
>
> Also with 3.8, in many typical file copy cases (but not all), sendfile will be used, which makes buffer size even less important.

The buffer size is still used by copyfileobj, so a better default value may be worthwhile.

In my Linux box, SATA SSD (Samsung SSD 500GB 860EVO) is used.
It has unstable sequential write performance.

Here is a quick test:

$ dd if=/dev/urandom of=f1 bs=1M count=1k

$ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 8*1024); f1.close(); f2.close()'
raw times: 301 msec, 302 msec, 301 msec, 301 msec, 300 msec

1 loop, best of 5: 300 msec per loop

$ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 16*1024); f1.close(); f2.close()'
raw times: 194 msec, 194 msec, 193 msec, 193 msec, 193 msec

1 loop, best of 5: 193 msec per loop

$ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 32*1024); f1.close(); f2.close()'
raw times: 140 msec, 140 msec, 140 msec, 140 msec, 140 msec

1 loop, best of 5: 140 msec per loop

$ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 64*1024); f1.close(); f2.close()'
raw times: 112 msec, 112 msec, 112 msec, 112 msec, 112 msec

1 loop, best of 5: 112 msec per loop

$ ./python -m timeit -n1 -r5 -v -s 'import shutil;' -- 'f1=open("f1","rb"); f2=open("/dev/null", "wb"); shutil.copyfileobj(f1, f2, 128*1024); f1.close(); f2.close()'
raw times: 101 msec, 101 msec, 101 msec, 101 msec, 101 msec


Judging from this result, I think 64 KiB is the best balance.
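
A rough in-process equivalent of the timeit commands above, as a sketch only. It assumes a much smaller temporary file (8 MiB instead of 1 GiB) so it runs quickly, times only the copy loop, and uses a hypothetical helper named `time_copy`. Absolute numbers will differ from the ones quoted.

```python
import os
import shutil
import tempfile
import time


def time_copy(path, bufsize, repeat=3):
    """Best-of-N wall time for copying `path` to the null device."""
    best = float("inf")
    for _ in range(repeat):
        with open(path, "rb") as src, open(os.devnull, "wb") as dst:
            start = time.perf_counter()
            shutil.copyfileobj(src, dst, bufsize)
            best = min(best, time.perf_counter() - start)
    return best


# Create a small random sample file (assumption: 8 MiB is enough for a sketch).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(8 * 1024 * 1024))
    sample_path = tmp.name

try:
    results = {kib: time_copy(sample_path, kib * 1024)
               for kib in (8, 16, 32, 64, 128)}
finally:
    os.unlink(sample_path)

for kib, seconds in results.items():
    print(f"{kib:>4} KiB: {seconds * 1000:.2f} msec")
```

Because the file was just written, reads are most likely served from the page cache, so this measures call overhead rather than disk speed, like the commands above.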
msg336546 - (view) Author: desbma (desbma) * Date: 2019-02-25 18:01
If you do a benchmark by reading from a file, and then writing to /dev/null several times, without clearing caches, you are measuring *only* the syscall overhead:
* input data is read from the Linux page cache, not the file on your SSD itself
* no data is written (obviously because output is /dev/null)

Your current command line also measures open/close timings; without that, I think the speed should increase linearly when doubling the buffer size. But of course this is misleading, because it's a synthetic benchmark.

Also, if you clear caches in between tests and write the output file to the SSD itself, sendfile will be used, and it should be even faster.

So again I'm not sure this means much compared to real world usage.
msg336623 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-02-26 07:12
> If you do a benchmark by reading from a file, and then writing to /dev/null several times, without clearing caches, you are measuring *only* the syscall overhead:
> * input data is read from the Linux page cache, not the file on your SSD itself

Yes.  I am measuring syscall overhead to determine a reasonable buffer size.
shutil may be used when the page cache is warm.

> * no data is written (obviously because output is /dev/null)

As I said before, my SSD doesn't have stable write performance.  (This
is typical for consumer SSDs.)
So this is intentional.
There are also use cases that copy from/to io.BytesIO or other file-like objects.
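
As an illustration of the file-like case, a small sketch: copyfileobj only needs a `read(size)` method on the source and a `write(data)` method on the destination, so the buffer size directly sets how many Python-level calls are made. `CountingReader` is a hypothetical wrapper written for this example, not part of shutil.

```python
import io
import shutil


class CountingReader:
    """Hypothetical file-like wrapper that counts read() calls."""

    def __init__(self, raw):
        self._raw = raw
        self.calls = 0

    def read(self, size=-1):
        self.calls += 1
        return self._raw.read(size)


payload = b"\0" * (1024 * 1024)  # 1 MiB of in-memory data
calls_by_bufsize = {}
for kib in (16, 64, 128):
    reader = CountingReader(io.BytesIO(payload))
    sink = io.BytesIO()
    shutil.copyfileobj(reader, sink, kib * 1024)
    # Includes the final read() that returns b"" to signal EOF.
    calls_by_bufsize[kib] = reader.calls

print(calls_by_bufsize)
```

No syscalls are involved here at all, so any time saved with a larger buffer comes purely from fewer interpreter-level calls.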

>
> Your current command line also measures open/close timings; without that, I think the speed should increase linearly when doubling the buffer size. But of course this is misleading, because it's a synthetic benchmark.

I'm not measuring the speed of my cheap SSD.  The goal of this benchmark is to find
a reasonable buffer size.
Real-world usage varies, so reducing syscall overhead with a reasonable
buffer size is worthwhile.

>
> Also if you clear caches in between tests, and  write the output file to the SSD itself, sendfile will be used, and should be even faster.

No.  sendfile is not used by shutil.copyfileobj, even if dst is real
file on disk.

>
> So again I'm not sure this means much compared to real world usage.
>

"Real world usage" varies.  Sometimes it is not affected by the buffer size.  Sometimes it is.

On the other hand, what are the downsides of changing 16 KiB to 64 KiB?
Windows already uses 1 MiB.  And the CPython runtime itself uses a few MB of memory too.
msg336643 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2019-02-26 10:59
@Inada: having played with this in the past I seem to remember that on Linux a bigger bufsize doesn't make a noticeable difference (but I may be wrong), which is why I suggest trying some benchmarks. In issue33671 I pasted some one-liners you can use (and you should target copyfileobj() instead of copyfile() in order to skip the os.sendfile() path). Also on Linux "echo 3 | sudo tee /proc/sys/vm/drop_caches" is supposed to disable the cache.
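
To make the copyfile() versus copyfileobj() distinction concrete, here is a minimal sketch. In Python 3.8+, copyfile() may take a platform-specific fast path (such as os.sendfile on Linux), while copyfileobj() always runs a plain read/write loop with the given buffer size. The file names and sizes below are arbitrary.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    src_path = os.path.join(tmpdir, "src.bin")
    with open(src_path, "wb") as f:
        f.write(os.urandom(256 * 1024))

    # copyfile(): may use a platform fast-copy path, so the buffer
    # size matters less here.
    fast_path = shutil.copyfile(src_path, os.path.join(tmpdir, "fast.bin"))

    # copyfileobj(): plain read()/write() loop, so the buffer size
    # determines how many calls are made.
    loop_path = os.path.join(tmpdir, "loop.bin")
    with open(src_path, "rb") as fsrc, open(loop_path, "wb") as fdst:
        shutil.copyfileobj(fsrc, fdst, 64 * 1024)

    same_size = (
        os.path.getsize(fast_path)
        == os.path.getsize(loop_path)
        == 256 * 1024
    )

print(same_size)
```

A benchmark meant to evaluate the buffer size should therefore go through copyfileobj(), as suggested above.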
msg336648 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-02-26 11:24
> Also on Linux "echo 3 | sudo tee /proc/sys/vm/drop_caches" is supposed to  disable the cache.

As I said already, shutil is not used only with a cold cache.

If the cache is cold, disk speed will be the upper bound in most cases.
But when the cache is hot, or when using a very fast NVMe disk, syscall overhead
can be non-negligible.
msg336685 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-02-26 15:38
Read this file too.
http://git.savannah.gnu.org/cgit/coreutils.git/tree/src/ioblksize.h

coreutils chose 128 KiB as the *minimum* buffer size to reduce syscall overhead.
In the case of shutil, we have Python interpreter overhead in addition to syscall overhead.
Who has deeper insight than the coreutils authors?

I think 128 KiB is the best, but I'm OK with 64 KiB as a conservative choice.
msg336987 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-03-02 04:31
New changeset 4f1903061877776973c1bbfadd3d3f146920856e by Inada Naoki in branch 'master':
bpo-36103: change default buffer size of shutil.copyfileobj() (GH-12115)
https://github.com/python/cpython/commit/4f1903061877776973c1bbfadd3d3f146920856e
msg336988 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-03-02 04:32
I chose 64 KiB because the performance difference I can see between 64 KiB
and 128 KiB is only up to 5%.
History
Date                 User              Action  Args
2019-03-02 04:32:31  inada.naoki       set     status: open -> closed
                                               resolution: fixed
                                               messages: + msg336988
                                               stage: patch review -> resolved
2019-03-02 04:31:03  inada.naoki       set     messages: + msg336987
2019-03-01 07:23:15  inada.naoki       set     keywords: + patch
                                               stage: patch review
                                               pull_requests: + pull_request12121
2019-02-26 15:38:20  inada.naoki       set     messages: + msg336685
2019-02-26 11:24:02  inada.naoki       set     messages: + msg336648
2019-02-26 10:59:12  giampaolo.rodola  set     nosy: + giampaolo.rodola
                                               messages: + msg336643
2019-02-26 07:12:05  inada.naoki       set     messages: + msg336623
2019-02-25 18:01:04  desbma            set     messages: + msg336546
2019-02-25 15:52:19  inada.naoki       set     messages: + msg336533
2019-02-25 14:24:12  desbma            set     nosy: + desbma
                                               messages: + msg336523
2019-02-25 09:38:53  inada.naoki       set     versions: + Python 3.8
                                               title: Increase -> Increase shutil.COPY_BUFSIZE
                                               messages: + msg336505
                                               components: + Library (Lib)
                                               type: resource usage
2019-02-25 09:27:56  inada.naoki       create