Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python 3.8 regression: endless loop in shutil.copytree #82869

Closed
cboltz mannequin opened this issue Nov 4, 2019 · 10 comments
Closed

Python 3.8 regression: endless loop in shutil.copytree #82869

cboltz mannequin opened this issue Nov 4, 2019 · 10 comments
Labels
3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@cboltz
Copy link
Mannequin

cboltz mannequin commented Nov 4, 2019

BPO 38688
Nosy @giampaolo, @kinow
PRs
  • bpo-38688: Consume iterator and create list of entries to prevent infinite loop #17098
  • [3.8] bpo-38688, shutil.copytree: consume iterator and create list of entries to prevent infinite recursion (GH-17098) #17397
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-11-27.04:50:18.931>
    created_at = <Date 2019-11-04.21:21:38.595>
    labels = ['3.8', 'type-bug', 'library', '3.9']
    title = 'Python 3.8 regression: endless loop in shutil.copytree'
    updated_at = <Date 2019-11-27.04:50:18.930>
    user = 'https://bugs.python.org/cboltz'

    bugs.python.org fields:

    activity = <Date 2019-11-27.04:50:18.930>
    actor = 'giampaolo.rodola'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-11-27.04:50:18.931>
    closer = 'giampaolo.rodola'
    components = ['Library (Lib)']
    creation = <Date 2019-11-04.21:21:38.595>
    creator = 'cboltz'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 38688
    keywords = ['patch', '3.8regression']
    message_count = 10.0
    messages = ['355981', '356302', '357134', '357141', '357146', '357149', '357156', '357157', '357544', '357556']
    nosy_count = 3.0
    nosy_names = ['giampaolo.rodola', 'kinow', 'cboltz']
    pr_nums = ['17098', '17397']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue38688'
    versions = ['Python 3.8', 'Python 3.9']

    @cboltz
    Copy link
    Mannequin Author

    cboltz mannequin commented Nov 4, 2019

    The following test script works with Python 3.7 (and older), but triggers an endless loop with Python 3.8:

    #!/usr/bin/python3

    import shutil
    import os
    
    os.mkdir('/dev/shm/t')
    os.mkdir('/dev/shm/t/pg')
    
    with open('/dev/shm/t/pg/pol', 'w+') as f:
        f.write('pol')
    
    shutil.copytree('/dev/shm/t/pg', '/dev/shm/t/pg/somevendor/1.0')

    The important point is probably that 'pg' gets copied into a subdirectory of itsself. While this worked in Python up to 3.7, doing the same in Python 3.8 runs into an endless loop:

    # python3 /home/abuild/rpmbuild/SOURCES/test.py
    Traceback (most recent call last):
      File "/home/abuild/rpmbuild/SOURCES/test.py", line 15, in <module> 
        shutil.copytree('/dev/shm/t/pg', '/dev/shm/t/pg/somevendor/1.0')
      File "/usr/lib/python3.8/shutil.py", line 547, in copytree 
        return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
      File "/usr/lib/python3.8/shutil.py", line 486, in _copytree
        copytree(srcobj, dstname, symlinks, ignore, copy_function,
    ...
        copytree(srcobj, dstname, symlinks, ignore, copy_function,
      File "/usr/lib/python3.8/shutil.py", line 547, in copytree 
        return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
      File "/usr/lib/python3.8/shutil.py", line 449, in _copytree
        os.makedirs(dst, exist_ok=dirs_exist_ok)
      File "/usr/lib/python3.8/os.py", line 206, in makedirs 
        head, tail = path.split(name)
      File "/usr/lib/python3.8/posixpath.py", line 104, in split 
        sep = _get_sep(p)
      File "/usr/lib/python3.8/posixpath.py", line 42, in _get_sep 
        if isinstance(path, bytes):
    RecursionError: maximum recursion depth exceeded while calling a Python object

    I also reported this at https://bugzilla.opensuse.org/show_bug.cgi?id=1155839

    @cboltz cboltz mannequin added type-crash A hard crash of the interpreter, possibly with a core dump 3.8 only security fixes stdlib Python modules in the Lib dir labels Nov 4, 2019
    @terryjreedy terryjreedy added type-bug An unexpected behavior, bug, or error and removed type-crash A hard crash of the interpreter, possibly with a core dump labels Nov 8, 2019
    @kinow
    Copy link
    Mannequin

    kinow mannequin commented Nov 9, 2019

    Hi,

    I did a quick git bisect using the example provided, and looks like this regression was added in the fix for bpo-33695, commit 19c46a4.

    It looks to me that the iterator returned by with os.scandir(...) is including the newly created dst directory (see the call for os.makedirs(dst, exist_ok=dirs_exist_ok) in

    os.makedirs(dst, exist_ok=dirs_exist_ok)
    ).

    This results in the function finding an extra directory, and repeating the steps for this folder and its subfolder recursively. This only happens because in the example in this issue, dst is a subdirectory of src.

    The bpo-33695 commit had more changes, so I've reverted just this block of the copytree as a tentative fix, and added a unit test: #17098

    --

    Here's a simplified version of what's going on:

    import os
    import shutil
    
    shutil.rmtree('/tmp/test/', True)
    os.makedirs('/tmp/test')
    with open('/tmp/test/foo', 'w+') as f:
      f.write('foo')
    
    # now we have /tmp/test/foo, let's simulate what happens in copytree on master
    
    with os.scandir('/tmp/test') as entries:
      # up to this point, /tmp/test/foo is the only entry
      os.makedirs('/tmp/test/1/2/3/', exist_ok=True) # <---- when the iterator starts below in `f in entries`, it will find 1 too
      # now 1 will have been added too
      for f in entries:
        print(f)
    

    @giampaolo
    Copy link
    Contributor

    PR-17098 basically reverts https://bugs.python.org/issue33695. Not good. =)
    I think we can simply consume the iterator immediately as in:

        def copytree(src, ...):
            with os.scandir(src) as itr:
                entries = list(itr)
            return _copytree(entries=entries, ...)

    @kinow
    Copy link
    Mannequin

    kinow mannequin commented Nov 21, 2019

    Hi Giampaolo,

    I think it is more or less the same as the previous code, which was using os.list to return a list in memory. My first tentative fix was:

        def copytree(src, ...):
            entries = os.list(src)
            return _copytree(entries=entries, ...)

    But the previous PR also altered _copytree to use the return of os.scandir DirEntry's, so the change above results in AttributeError: 'str' object has no attribute 'name'.

    Would be better to avoid using iterator to populate a list, and also using the DirEntry in _copytree, and just stick with the previous code with (i.e. os.listdir() and a single copytree method as before)? Or if you think we should go with your suggestion, I'm good with it as well - especially as it'd be a much simpler PR :)

    Thanks
    Bruno

    @giampaolo
    Copy link
    Contributor

    The speedup introduced in bpo-33695 is mostly because the number of os.stat() syscall was reduced from 6 to 1 per file (both by using scandir() and because stat() results are cached and passed around between function calls). As such, even if we immediately consume scandir() iterator I believe it won't have a meaningful impact in terms of speed. FWIW bpo-33695 has a benchmark script attached (but it's not very stable).

    @kinow
    Copy link
    Mannequin

    kinow mannequin commented Nov 21, 2019

    I really liked that improvement, and didn't think it needed to be removed. That's why the PR reverts it partially. I think the os.stat improvements were in the other methods changed, and should not be changed in my PR - unless I changed it by accident.

    So with the current PR for this issue, or with your suggested patch, both should have similar performance I think. (I hadn't seen that script to measure performance, thanks.)

    I am happy if we revert partially, or if we build the fix on top of the current code consuming the iterator. Should I update the PR with your suggested fix then?

    @giampaolo
    Copy link
    Contributor

    PR-17098 as it stands re-introduces some stat() syscall. I suggest to just consume the iterator: it's a small change and it should fix the issue.

    @kinow
    Copy link
    Mannequin

    kinow mannequin commented Nov 21, 2019

    Done. Rebased on master too, and edited commit message & GH PR title. Thanks Giampaolo!

    @giampaolo
    Copy link
    Contributor

    New changeset 9bbcbc9 by Giampaolo Rodola (Bruno P. Kinoshita) in branch 'master':
    bpo-38688, shutil.copytree: consume iterator and create list of entries to prevent infinite recursion (GH-17098)
    9bbcbc9

    @giampaolo giampaolo added the 3.9 only security fixes label Nov 27, 2019
    @giampaolo
    Copy link
    Contributor

    New changeset 65c92c5 by Giampaolo Rodola (Bruno P. Kinoshita) in branch '3.8':
    [3.8] bpo-38688, shutil.copytree: consume iterator and create list of entries to prevent infinite recursion (GH-17397)
    65c92c5

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    2 participants