
shutil.rmtree can have O(n^2) performance on large dirs #76634

Open
nh2 mannequin opened this issue Dec 30, 2017 · 12 comments
Labels
3.7 (EOL), stdlib, type-feature

Comments

nh2 mannequin commented Dec 30, 2017

BPO 32453
Nosy @pitrou, @giampaolo, @nh2, @serhiy-storchaka
Files
  • bench_rmtree.py

nh2 mannequin commented Dec 30, 2017

    See http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=24412edeaf556a for the explanation and equivalent fix in coreutils.

    The gist is that deleting entries in inode order can improve deletion performance dramatically.

    To obtain inode numbers and sort by them, one needs to getdents() all the entries ahead of time, but rmtree() already gets all dirents ahead of the deletion. https://bugs.python.org/issue28564 recently improved shutil.rmtree() performance by using scandir(), but the returned iterator is list()ed anyway, so we already have all the necessary information in memory and would just have to sort the entries by inode number.

    I propose we check if the improvements made to rm -r in coreutils should be ported to shutil.rmtree().
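    A minimal sketch of the proposed approach (`rmtree_inode_sorted` is a hypothetical helper, not the stdlib implementation; it assumes a POSIX filesystem where `os.DirEntry.inode()` returns meaningful inode numbers):

    ```python
    import os

    def rmtree_inode_sorted(path):
        # Sketch only: collect all dirents first (as shutil.rmtree already
        # does via scandir), then delete them in ascending inode order,
        # mirroring the coreutils patch linked above.
        with os.scandir(path) as it:
            entries = sorted(it, key=lambda e: e.inode())
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                rmtree_inode_sorted(entry.path)
            else:
                os.unlink(entry.path)
        os.rmdir(path)
    ```

    On Linux, scandir() caches the inode number from the underlying getdents() results, so the sort adds no extra stat() calls.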

    nh2 mannequin added the 3.7 (EOL), stdlib, and type-feature labels on Dec 30, 2017
    pitrou commented Dec 30, 2017

    I don't think filesystem-specific optimizations belong in the Python stdlib. Python runs on multiple operating systems, including Windows, macOS, Android, and many POSIX variants. It would be much better if this were fixed in the kernel...

    serhiy-storchaka commented:

    I have tested shutil.rmtree() with a large number of files using a modified benchmark from bpo-28564. For 400,000 files it takes less than 5 seconds. From the comment in the coreutils benchmark (http://git.savannah.gnu.org/cgit/coreutils.git/commit/?id=24412edeaf556a):

    # Using rm -rf to remove a 400k-entry directory takes:
    # - 9 seconds with the patch, on a 2-yr-old system
    # - 350 seconds without the patch, on a high-end system (disk 20-30% faster)
    threshold_seconds=60

    Running the coreutils benchmark gives the result 3 seconds on my computer.

    It seems to me that this issue has been fixed in the kernel.

    nh2 mannequin commented Dec 31, 2017

    Serhiy, did you run your benchmark on an SSD or a spinning disk?

    The coreutils bug mentions that the problem is seek times.

    My tests on a spinning disk with 400k files suggest that indeed rmtree() is ~30x slower than rm -r:

    # time (mkdir dirtest && cd dirtest && seq 1 100000 | xargs touch)
    
    real  0m0.722s
    user  0m0.032s
    sys 0m0.680s
    
    # time rm -rf dirtest/
    
    real  0m0.519s
    user  0m0.074s
    sys 0m0.437s
    
    # time (mkdir dirtest && cd dirtest && seq 1 100000 | xargs touch)
    
    real  0m0.693s
    user  0m0.039s
    sys 0m0.659s
    
    # time python -c 'import shutil; shutil.rmtree("dirtest")'
    
    real  0m0.756s
    user  0m0.225s
    sys 0m0.499s
    
    # time (mkdir dirtest && cd dirtest && seq 1 100000 | xargs touch)
    
    real  0m0.685s
    user  0m0.032s
    sys 0m0.658s
    
    # time python3 -c 'import shutil; shutil.rmtree("dirtest")'
    
    real  0m0.965s
    user  0m0.424s
    sys 0m0.528s
    
    # time (mkdir dirtest && cd dirtest && seq 1 400000 | xargs touch)
    
    real  0m4.249s
    user  0m0.098s
    sys 0m2.804s
    
    # time rm -rf dirtest/
    
    real  0m10.782s
    user  0m0.265s
    sys 0m2.213s
    
    # time (mkdir dirtest && cd dirtest && seq 1 400000 | xargs touch)
    
    real  0m5.236s
    user  0m0.107s
    sys 0m2.832s
    
    # time python -c 'import shutil; shutil.rmtree("dirtest")'
    
    real  3m8.006s
    user  0m1.323s
    sys 0m3.929s
    
    # time (mkdir dirtest && cd dirtest && seq 1 400000 | xargs touch)
    
    real  0m4.671s
    user  0m0.097s
    sys 0m2.832s
    
    # time python3 -c 'import shutil; shutil.rmtree("dirtest")'
    
    real  2m49.476s
    user  0m2.196s
    sys 0m3.695s
    

    The tests were done with coreutils rm 8.28, Python 2.7.14, Python 3.6.3, on ext4 (rw,relatime,data=ordered), on a dmraid RAID1 across 2 WDC_WD4000FYYZ disks (WD 4 TB Enterprise).

    Also note how deleting 100k files takes under a second with rm -r and both Pythons, but deleting 4x as many files takes ~20x longer with rm -r and a couple of hundred times longer with the Pythons.

    There is clearly some boundary below which we are hitting some nice cached behaviour.

    nh2 mannequin commented Dec 31, 2017

    It turns out I was wrong when saying that there's some cache we're hitting.

    In fact, rm -r is still precisely O(n^2), even with the coreutils patch I linked.

    Quick overview table of the benchmark:

     nfiles     real     user     sys

     100000    0.51s    0.07s   0.43s
     200000    2.46s    0.15s   0.89s
     400000   10.78s    0.26s   2.21s
     800000   44.72s    0.58s   6.03s
    1600000  180.37s    1.06s  10.70s

    Each 2x increase of number of files results in 4x increased deletion time.
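    A quick scaling check on the 'real' times from the table makes this concrete: the per-doubling exponent k in t ∝ n^k comes out close to 2 for every step (an illustrative calculation using only the numbers above):

    ```python
    import math

    # (nfiles, wall-clock seconds) taken from the table above
    times = [(100_000, 0.51), (200_000, 2.46), (400_000, 10.78),
             (800_000, 44.72), (1_600_000, 180.37)]

    for (n1, t1), (n2, t2) in zip(times, times[1:]):
        # exponent k such that t2/t1 == (n2/n1)**k
        k = math.log(t2 / t1) / math.log(n2 / n1)
        print(f"{n1} -> {n2}: k = {k:.2f}")
    ```

    All four exponents land between roughly 2.0 and 2.3, consistent with quadratic behaviour.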

    Full benchmark output:

    # time (mkdir dirtest && cd dirtest && seq 1 100000 | xargs touch)
    
    real  0m0.722s
    user  0m0.032s
    sys 0m0.680s
    
    # time rm -rf dirtest/
    
    real  0m0.519s
    user  0m0.074s
    sys 0m0.437s
    
    # time (mkdir dirtest && cd dirtest && seq 1 200000 | xargs touch)
    
    real  0m1.576s
    user  0m0.044s
    sys 0m1.275s
    
    # time rm -r dirtest/
    
    real  0m2.469s
    user  0m0.150s
    sys 0m0.890s
    
    # time (mkdir dirtest && cd dirtest && seq 1 400000 | xargs touch)
    
    real  0m4.249s
    user  0m0.098s
    sys 0m2.804s
    
    # time rm -rf dirtest/
    
    real  0m10.782s
    user  0m0.265s
    sys 0m2.213s
    
    # time (mkdir dirtest && cd dirtest && seq 1 800000 | xargs touch)
    
    real  0m10.533s
    user  0m0.204s
    sys 0m5.758s
    
    # time rm -rf dirtest/
    
    real  0m44.725s
    user  0m0.589s
    sys 0m6.037s
    
    # time (mkdir dirtest && cd dirtest && seq 1 1600000 | xargs touch)
    
    real  0m34.480s
    user  0m0.382s
    sys 0m12.057s
    
    # time rm -r dirtest/
    
    real  3m0.371s
    user  0m1.069s
    sys 0m10.704s
    

    pitrou commented Dec 31, 2017

    Did you try to sync and flush caches before running rm -r?

    nh2 mannequin commented Dec 31, 2017

    Did you try to sync and flush caches before running rm -r?

    Yes, it doesn't make a difference for me, I still see the same O(n²) behaviour in rm -r.

    I've sent an email "O(n^2) performance of rm -r" to bug-coreutils@gnu.org just now; unfortunately I can't link it because the mailman archive doesn't show it yet. It should appear soon on http://lists.gnu.org/archive/html/bug-coreutils/2017-12/threads.html.

    nh2 mannequin commented Jan 1, 2018

    A better location to view the whole coreutils thread is:

    https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29921

    serhiy-storchaka commented:

    Thanks for your investigations, Niklas. I ran my benchmark on a spinning disk, but significant nonlinearity is observed only for directory sizes around 1,000,000 entries and more. And yes, sorting by inode number helps.

    $ for j in 1 2 3 4 6 8 12 16; do i=$((j*100000)); mkdir x; (cd x && seq $i|xargs touch); env time -f "%e $i" python3.7-unpatched -c 'import shutil; shutil.rmtree("x")'; done
    1.01 100000
    3.80 200000
    3.64 300000
    4.89 400000
    8.72 600000
    11.86 800000
    56.80 1200000
    209.82 1600000
    
    $ for j in 1 2 3 4 6 8 12 16; do i=$((j*100000)); mkdir x; (cd x && seq $i|xargs touch); env time -f "%e $i" python3.7-patched -c 'import shutil; shutil.rmtree("x")'; done
    0.97 100000
    2.42 200000
    3.84 300000
    4.48 400000
    7.07 600000
    10.01 800000
    15.53 1200000
    23.24 1600000
    
    $ for j in 1 2 3 4 6 8 12 16; do i=$((j*100000)); mkdir x; (cd x && seq $i|xargs touch); env time -f "%e $i" env rm -rf x; done
    0.68 100000
    1.34 200000
    2.10 300000
    3.95 400000
    5.95 600000
    10.28 800000
    47.66 1200000
    89.32 1600000

    On an SSD:

    $ for j in 1 2 3 4 6 8 12 16; do i=$((j*100000)); mkdir x; (cd x && seq $i|xargs touch); env time -f "%e $i" python3.7-unpatched -c 'import shutil; shutil.rmtree("x")'; done
    1.00 100000
    1.93 200000
    2.90 300000
    4.98 400000
    7.05 600000
    9.87 800000
    21.45 1200000
    36.19 1600000
    
    $ for j in 1 2 3 4 6 8 12 16; do i=$((j*100000)); mkdir x; (cd x && seq $i|xargs touch); env time -f "%e $i" python3.7-patched -c 'import shutil; shutil.rmtree("x")'; done
    0.96 100000
    1.99 200000
    3.09 300000
    4.85 400000
    7.55 600000
    9.44 800000
    16.03 1200000
    21.28 1600000
    
    $ for j in 1 2 3 4 6 8 12 16; do i=$((j*100000)); mkdir x; (cd x && seq $i|xargs touch); env time -f "%e $i" env rm -rf x; done
    0.67 100000
    1.38 200000
    2.41 300000
    2.82 400000
    5.24 600000
    7.02 800000
    18.60 1200000
    30.58 1600000

    Interestingly, the patched Python is faster than rm (GNU coreutils 8.26) for large directory sizes.

    pitrou commented Jan 2, 2018

    Yes, so rm -rf is quadratic on my SSD too...

    nh2 mannequin commented Oct 30, 2021

    A small update / summary so far:

    From here this developed into a coreutils discussion:

    GNU coreutils bug #29921: O(n^2) performance of rm -r
    https://debbugs.gnu.org/cgi/bugreport.cgi?bug=29921
    

    and finally a linux-fsdevel discussion:

    O(n^2) deletion performance
    https://lore.kernel.org/linux-fsdevel/5ca3808d-4eea-afec-75a6-2cc41f44b868@nh2.me/t/#u
    

    Dave Chinner (an XFS developer) suggests that on XFS there is no quadratic behaviour once the problem is bound by the seek time of the spinning disk.

    Somebody should try to confirm that deletion time becomes linear in even larger tests, e.g. with deletion times well beyond 21 minutes.
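    A minimal harness for such a test might look like the sketch below (the file counts in the loop are illustrative; confirming linearity would need far larger counts, a cold cache, and a spinning disk):

    ```python
    import os
    import shutil
    import tempfile
    import time

    def bench_rmtree(nfiles):
        """Create nfiles empty files in a fresh directory and time shutil.rmtree."""
        d = tempfile.mkdtemp()
        for i in range(nfiles):
            open(os.path.join(d, str(i)), "w").close()
        start = time.perf_counter()
        shutil.rmtree(d)
        return time.perf_counter() - start

    if __name__ == "__main__":
        # For meaningful numbers, drop caches between runs
        # (e.g. echo 3 > /proc/sys/vm/drop_caches as root).
        for n in (100_000, 200_000, 400_000, 800_000):
            print(n, round(bench_rmtree(n), 2))
    ```

    Plotting time against nfiles (or fitting the exponent as above) would show whether the curve flattens to linear at very large sizes.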

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022