Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

shutil.copyfile should internally use os.sendfile when possible #69343

Closed
desbma mannequin opened this issue Sep 17, 2015 · 21 comments
Closed

shutil.copyfile should internally use os.sendfile when possible #69343

desbma mannequin opened this issue Sep 17, 2015 · 21 comments
Labels
stdlib Python modules in the Lib dir type-feature A feature request or enhancement

Comments

@desbma
Copy link
Mannequin

desbma mannequin commented Sep 17, 2015

BPO 25156
Nosy @vstinner, @giampaolo, @bitdancer, @vadmium, @desbma, @MojoVampire
Superseder
  • bpo-33671: Efficient zero-copy for shutil.copy* functions (Linux, OSX and Win)
  • Files
  • issue25156.patch: First patch
  • issue25156_v2.patch
  • issue25156_v3.patch
  • issue25156_v4.patch
  • issue25156_v5.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2018-07-27.13:13:35.509>
    created_at = <Date 2015-09-17.20:08:04.996>
    labels = ['type-feature', 'library']
    title = 'shutil.copyfile should internally use os.sendfile when possible'
    updated_at = <Date 2018-07-27.13:26:04.738>
    user = 'https://github.com/desbma'

    bugs.python.org fields:

    activity = <Date 2018-07-27.13:26:04.738>
    actor = 'berker.peksag'
    assignee = 'none'
    closed = True
    closed_date = <Date 2018-07-27.13:13:35.509>
    closer = 'giampaolo.rodola'
    components = ['Library (Lib)']
    creation = <Date 2015-09-17.20:08:04.996>
    creator = 'desbma'
    dependencies = []
    files = ['40881', '40906', '40911', '41243', '41252']
    hgrepos = []
    issue_num = 25156
    keywords = ['patch']
    message_count = 21.0
    messages = ['250918', '251246', '253643', '253648', '253649', '253650', '253669', '253701', '253767', '253778', '253781', '253790', '253799', '255769', '255908', '255981', '255983', '257345', '259808', '317628', '322489']
    nosy_count = 8.0
    nosy_names = ['vstinner', 'giampaolo.rodola', 'gps', 'r.david.murray', 'SilentGhost', 'martin.panter', 'desbma', 'josh.r']
    pr_nums = []
    priority = 'normal'
    resolution = 'duplicate'
    stage = 'resolved'
    status = 'closed'
    superseder = '33671'
    type = 'enhancement'
    url = 'https://bugs.python.org/issue25156'
    versions = ['Python 3.6']

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Sep 17, 2015

    This is related to bpo-25063 (https://bugs.python.org/issue25063).

    Trying to use sendfile internally in shutil.copyfileobj was considered risky because of special Python files that expose a file descriptor but wrap it to add special behavior (eg: GzipFile).

    I believe such risk does not exist for shutil.copyfile, and it would be possible to use sendfile if available.

    @desbma desbma mannequin added stdlib Python modules in the Lib dir type-feature A feature request or enhancement labels Sep 17, 2015
    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Sep 21, 2015

    Additional advantage of calling sendfile from shutil.copyfile: other fonctions in shutil module would automatically benefit from the use of senfile because they call copyfile directly (copy, copy2) or indirectly (copytree).

    So for example, the performance of shutil.copytree should be improved for free for directory trees containing big files.

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Oct 28, 2015

    Thoughts anyone?
    Here is a patch that implements the change.

    My tests show a 30-40% performance improvement for 128KB-512MB single file copy:

    128 KB file copy:

    $ dd if=/dev/urandom of=/tmp/f1 bs=1K count=128

    Without the patch:
    $ ./python -m timeit -s 'import shutil; p1 = "/tmp/f1"; p2 = "/tmp/f2"' 'shutil.copyfile(p1, p2)'
    10000 loops, best of 3: 109 usec per loop

    With the patch:
    $ ./python -m timeit -s 'import shutil; p1 = "/tmp/f1"; p2 = "/tmp/f2"' 'shutil.copyfile(p1, p2)'
    10000 loops, best of 3: 75.7 usec per loop

    --------
    8 MB file copy:

    $ dd if=/dev/urandom of=/tmp/f1 bs=1M count=8

    Without the patch:
    $ ./python -m timeit -s 'import shutil; p1 = "/tmp/f1"; p2 = "/tmp/f2"' 'shutil.copyfile(p1, p2)'
    100 loops, best of 3: 4.99 msec per loop

    With the patch:
    $ ./python -m timeit -s 'import shutil; p1 = "/tmp/f1"; p2 = "/tmp/f2"' 'shutil.copyfile(p1, p2)'
    100 loops, best of 3: 3.03 msec per loop

    --------
    512 MB file copy:

    $ dd if=/dev/urandom of=/tmp/f1 bs=1M count=512

    Without the patch:
    $ ./python -m timeit -s 'import shutil; p1 = "/tmp/f1"; p2 = "/tmp/f2"' 'shutil.copyfile(p1, p2)'
    10 loops, best of 3: 305 msec per loop

    With the patch:
    $ ./python -m timeit -s 'import shutil; p1 = "/tmp/f1"; p2 = "/tmp/f2"' 'shutil.copyfile(p1, p2)'
    10 loops, best of 3: 178 msec per loop

    @MojoVampire
    Copy link
    Mannequin

    MojoVampire mannequin commented Oct 29, 2015

    Adding interested parties from earlier ticket.

    @vadmium
    Copy link
    Member

    vadmium commented Oct 29, 2015

    I’ve never used sendfile() nor shutil.copyfile(), but my immediate reaction is maybe we need a backup plan if os.sendfile() is available but not supported in the circumstances. E.g. if it is practical to use copyfile() to copy from a named socket in the filesystem, the Linux man page <http://man7.org/linux/man-pages/man2/sendfile.2.html\> says it will raise EINVAL in this case. Maybe a test case would be good to prove this is still handled.

    @vadmium
    Copy link
    Member

    vadmium commented Oct 29, 2015

    Also, the os.sendfile() doc suggests that some platforms only support writing to sockets, so I definitely think a backup plan is needed.

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Oct 29, 2015

    Thanks for the comment.

    Also, the os.sendfile() doc suggests that some platforms only support writing to sockets, so I definitely think a backup plan is needed.

    You are right, the man page clearly says:

    Applications may wish to fall back to read(2)/write(2) in the case
    where sendfile() fails with EINVAL or ENOSYS.

    I will improve the code and add tests for conditions where sendfile fails.

    I have tested it manually, but I will also add a test with a copy of a file > 4GB (it causes several calls to sendfile).

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Oct 29, 2015

    I played a bit with Unix domain sockets, and it appears you can not open them like a file with open().
    So they do no work with the current implementation of shutil.copyfile anyway.

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Oct 30, 2015

    Here is an improved patch with the following changes:

    • Fallback to copyfileobj if sendfile fails with errno set to EINVAL or ENOSYS
    • Add a test for > 4GB file

    @vadmium
    Copy link
    Member

    vadmium commented Oct 31, 2015

    I left a few comments. But it might be good if someone more familiar with the APIs could review this.

    Have you seen the socket.sendfile() implementation? It’s a bit of a mess, and the circumstances are slightly different, but it may offer partial inspiration. It seems a shame to have two separate high-level sendfile() wrappers, but I guess the use cases are just a little too far apart to be worth sharing much code.

    For the test case, it may be worth skipping the test if you run out of disk space. Similar to test_mmap and test_largefile.

    @vadmium
    Copy link
    Member

    vadmium commented Oct 31, 2015

    Also, man pages for Free BSD and OS X (where writing to a disk file is not supported) say it raises:

    • ENOTSUP if “the ‘fd’ argument does not refer to a regular file”
    • EBADF if “the ‘s’ argument is not a valid socket descriptor”
    • ENOTSOCK if “the ‘s’ argument is not a socket”
    • EOPNOTSUPP if “the file system for descriptor fd does not support sendfile()”

    It is not clear what the priority of these errors is, so it might be safest to catch them all. But I wouldn’t catch any arbitrary OSError, because you may end up doing weird double copying or something for an out-of-space error.

    @bitdancer
    Copy link
    Member

    I'm not at all sure this is worth the maintenance burden at this point in time. So let's say I'm -0.5. I agree that a review and opinion by someone more familiar with the API would be best. I'm adding gps as nosy since this feels like the kind of performance improvement he might find interesting, so if he thinks it is worth it I'll change my vote :)

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Oct 31, 2015

    Here is an updated patch that takes into account Martin's suggestions, both here and in the code review.

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Dec 2, 2015

    Ping

    A small patch, but a good performance improvement :)

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Dec 5, 2015

    Here is a new patch, with changes suggested by SilentGhost and josh.rosenberg in the review.

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Dec 5, 2015

    Thank you SilentGhost for the second review on the v4 patch.

    Attached is the v5 patch which hopefully is getting even better.

    @SilentGhost
    Copy link
    Mannequin

    SilentGhost mannequin commented Dec 5, 2015

    No further comments from me. I haven't run the test, but I trust it passes without any warnings.

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Jan 2, 2016

    Can this patch be merged, or is there something I can do to improve it?

    @desbma
    Copy link
    Mannequin Author

    desbma mannequin commented Feb 7, 2016

    If anyone is interested, I have created a package to monkey patch shutil.copyfile to benefit from sendfile, similarly to the last patch, but it also works on older Python versions down to 2.7.

    PyPI link: https://pypi.python.org/pypi/pyfastcopy/

    @vstinner
    Copy link
    Member

    Different implementation: bpo-33639.

    @giampaolo
    Copy link
    Contributor

    Closing as duplicate of bpo-33671.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    stdlib Python modules in the Lib dir type-feature A feature request or enhancement
    Projects
    None yet
    Development

    No branches or pull requests

    4 participants