Date 2018-06-12.21:48:06
For future reference: as per the discussion, we decided not to use CopyFileEx on Windows and instead increased the read() buffer size from 16KB to 1MB (Windows only), resulting in a 40.8% speedup (instead of the 48% seen with CopyFileEx). In addition, copyfileobj() has been optimized on all platforms by using readinto()/memoryview()/bytearray().
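
To illustrate the readinto()/memoryview()/bytearray() idea, here is a minimal sketch, not the actual patch. The names copy_with_readinto and COPY_BUFSIZE are hypothetical, and the 1MB default is only borrowed from the Windows figure above. The point is that a single preallocated buffer is reused for every read, and slicing the memoryview for a short final read does not copy the data.

```python
import io

# Hypothetical constant: 1MB, mirroring the Windows buffer size mentioned above.
COPY_BUFSIZE = 1024 * 1024


def copy_with_readinto(src, dst, bufsize=COPY_BUFSIZE):
    """Copy src to dst through one reusable buffer; return the bytes copied.

    Illustrative helper only. src must support readinto(), dst must support
    write() with a bytes-like object.
    """
    total = 0
    # Allocate the buffer once instead of creating a new bytes object per read.
    with memoryview(bytearray(bufsize)) as view:
        while True:
            n = src.readinto(view)
            if not n:  # 0 at EOF (None is possible for non-blocking raw streams)
                break
            if n < bufsize:
                # Short read: write only the filled part. Slicing a memoryview
                # creates a new view, not a copy of the underlying bytes.
                with view[:n] as chunk:
                    dst.write(chunk)
            else:
                dst.write(view)
            total += n
    return total


# Small in-memory demonstration with a size that is not a multiple of the buffer.
source = io.BytesIO(b"x" * (3 * 1024 * 1024 + 123))
target = io.BytesIO()
copied = copy_with_readinto(source, target)
```

The same function should work with binary file objects opened via open(path, "rb") and open(path, "wb"), since those also provide readinto() and write().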
Updated benchmarks on Windows:

128KB copy (+27%)

    without patch:
        50 loops, best of 5: 7.69 sec per loop
    with patch:
        50 loops, best of 5: 5.61 sec per loop

8MB copy (+45.6%)

    without patch:
        10 loops, best of 5: 20.8 sec per loop
    with patch:
        20 loops, best of 5: 11.3 sec per loop

512MB copy (+40.8%)

    without patch:
        1 loop, best of 5: 1.26 sec per loop
    with patch:
        1 loop, best of 5: 646 msec per loop
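
The figures above have the shape of timeit output ("N loops, best of 5"), but the exact command used is not given here. As a rough sketch of how a comparable measurement could be taken, assuming shutil.copyfile() is the function under test and using a hypothetical helper time_copy:

```python
import os
import shutil
import tempfile
import timeit


def time_copy(size_bytes, number=5, repeat=5):
    """Return the best per-copy time in seconds for a file of size_bytes.

    Hypothetical helper: this mirrors timeit's "best of N" reporting, not
    necessarily the setup used for the numbers above.
    """
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src.bin")
        dst = os.path.join(tmp, "dst.bin")
        with open(src, "wb") as f:
            f.write(os.urandom(size_bytes))
        # Each of the `repeat` runs performs `number` copies and returns the
        # total time for that run.
        timings = timeit.repeat(
            lambda: shutil.copyfile(src, dst), number=number, repeat=repeat
        )
    # Best run divided by the loop count gives the per-loop time.
    return min(timings) / number


best = time_copy(128 * 1024)
print(f"128KB copy: {best * 1000:.3f} msec per loop")
```

Results will vary with disk, filesystem cache, and OS, so numbers from such a script are only comparable when both variants run on the same machine.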
