Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test.support.unlink issue on Windows platform #51692

Closed
asvetlov opened this issue Dec 5, 2009 · 34 comments
Closed

test.support.unlink issue on Windows platform #51692

asvetlov opened this issue Dec 5, 2009 · 34 comments
Assignees
Labels
OS-windows tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error

Comments

@asvetlov
Copy link
Contributor

asvetlov commented Dec 5, 2009

BPO 7443
Nosy @loewis, @jaraco, @ncoghlan, @pitrou, @tjguk, @bitdancer, @briancurtin, @asvetlov, @florentx, @cjerdonek
Files
  • python3.patch: patch for py3k
  • python2.patch: patch for trunk
  • watch_dir.py
  • test-case.py
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = 'https://github.com/tjguk'
    closed_at = <Date 2013-07-31.15:12:19.626>
    created_at = <Date 2009-12-05.15:20:02.770>
    labels = ['type-bug', 'tests', 'OS-windows']
    title = 'test.support.unlink issue on Windows platform'
    updated_at = <Date 2013-07-31.15:26:00.800>
    user = 'https://github.com/asvetlov'

    bugs.python.org fields:

    activity = <Date 2013-07-31.15:26:00.800>
    actor = 'r.david.murray'
    assignee = 'tim.golden'
    closed = True
    closed_date = <Date 2013-07-31.15:12:19.626>
    closer = 'tim.golden'
    components = ['Tests', 'Windows']
    creation = <Date 2009-12-05.15:20:02.770>
    creator = 'asvetlov'
    dependencies = []
    files = ['15454', '15455', '16839', '16869']
    hgrepos = []
    issue_num = 7443
    keywords = ['patch', 'needs review']
    message_count = 34.0
    messages = ['95991', '95992', '95993', '96004', '96012', '99089', '99093', '101355', '102049', '102137', '102685', '102696', '102698', '102720', '102833', '102837', '102860', '102863', '102869', '102873', '102880', '114124', '132203', '132326', '132330', '132333', '132352', '132589', '132766', '132787', '132793', '167191', '194002', '194003']
    nosy_count = 14.0
    nosy_names = ['loewis', 'jafo', 'jaraco', 'ncoghlan', 'pitrou', 'techtonik', 'tim.golden', 'nadeem.vawda', 'r.david.murray', 'brian.curtin', 'asvetlov', 'flox', 'chris.jerdonek', 'jdigital']
    pr_nums = []
    priority = 'high'
    resolution = 'works for me'
    stage = 'needs patch'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue7443'
    versions = ['Python 3.3']

    @asvetlov
    Copy link
    Contributor Author

    asvetlov commented Dec 5, 2009

    On Windows there are tiny delay between call to os.unlink and real file
    removing. Periodically it leads to unittest crashes in cases like this:

    test.support.unlink(filename)
    f = open(filename, 'wb')

    Proposed solution: wait in support.unlink for end of deletion asking
    os.stat for removed file (only if os.name == 'nt', of course).

    Also test.test_linecache:LineCacheTests.test_checkcache should be fixed

    • this one miss to close last opened file and Windows cannot remove it.

    Both patches for trunk and py3k is attached.

    @asvetlov asvetlov added tests Tests in the Lib/test dir OS-windows type-bug An unexpected behavior, bug, or error labels Dec 5, 2009
    @asvetlov
    Copy link
    Contributor Author

    asvetlov commented Dec 5, 2009

    patch for python 2 is attached

    @asvetlov
    Copy link
    Contributor Author

    asvetlov commented Dec 5, 2009

    Problem can be reproduced with several run of test_bufio in both python
    versions. trunk contains more tests (+1 test case) so it's easer to catch
    'access denied' exception on it.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Dec 5, 2009

    I'm fairly skeptical that your analysis is right. Can you prove that a
    file still exists after unlink returns?

    @asvetlov
    Copy link
    Contributor Author

    asvetlov commented Dec 5, 2009

    You right, problem is not in os.unlink (Windows call DeleteFile) itself.

    I have TortoiseSVN (very popular Explorer extension to work with
    Subversion) installed.
    This tool run TSVNCache.exe process to update own data in background.
    TSVNCache.exe receive NotifyChangeDirectory events for svn checkout
    folders.
    support.TESTFN is tempfile name in current working directory. If I try
    to run unittests from some folder from python checkout TSVNCache.exe get
    change notify and analyze it.

    So, sometimes I can see race condition:

    • python: os.unlink(file)
    • TSVNCache.exe: get change event
    • TSVNCache.exe: query changes
    • python: open(file) - oops, TSVNCache.exe still processing notification
      and lock deleted file by holding opened handles to it.
      According to MSDN for DeleteFile function:
      The DeleteFile function marks a file for deletion on close. Therefore,
      the file deletion does not occur until the last handle to the file is
      closed. Subsequent calls to CreateFile to open the file fail with
      ERROR_ACCESS_DENIED.
      We get exactly same.
    • TSVNCache.exe finish of change update and release handle. File is
      actually deleted.

    I see this situation in sysinternals Process Monitor tool.
    Probability of race condition is tiny but non-zero. Intensive
    create/drop/create again sequences can catch this one.

    @florentx
    Copy link
    Mannequin

    florentx mannequin commented Feb 9, 2010

    Another report of something looking very similar on bpo-7712.

    @florentx
    Copy link
    Mannequin

    florentx mannequin commented Feb 9, 2010

    Note: the fix in test_linecache.py is useless.

    The "with open(source_name, 'w') as source:" context manager is in charge of closing the file on __exit__.

    @jafo
    Copy link
    Mannequin

    jafo mannequin commented Mar 20, 2010

    Andrew: There have been changes committed within the last week to bpo-7712, which Florent suggested might be related. Can you please re-test this to see if it still exists, and if it does, bug me and I'll try to get some more movement on this.

    @techtonik
    Copy link
    Mannequin

    techtonik mannequin commented Apr 1, 2010

    It would be nice to see standalone test case that illustrates the problem.

    @asvetlov
    Copy link
    Contributor Author

    asvetlov commented Apr 1, 2010

    Florent, sorry.
    I have no Windows workstation now, so I cannot follow this issue.

    @tjguk
    Copy link
    Member

    tjguk commented Apr 9, 2010

    This is basically a rerun of this discussion a couple of years ago:

    http://mail.python.org/pipermail/python-dev/2008-April/078333.html

    The problem certainly still happens against trunk -- I have a semi-aggressive test-harness which can cause it to reproduce pretty much on-demand. I proposed an approach here:

    http://mail.python.org/pipermail/python-dev/2008-April/078339.html

    but when I started digging into test_support it all got a bit hairy because -- naturally -- test.support.unlink is used in a *lot* of places. In short, there's still a problem to be fixed. I believe that a rename-unlink dance would solve it, but only at the cost of affecting a lot of tests.

    @techtonik
    Copy link
    Mannequin

    techtonik mannequin commented Apr 9, 2010

    It is unlikely that it will go further then discussion unless this bug can be reliably reproduced to be debugged. If not testcase when at least Process Monitor log would be helpful.

    @tjguk
    Copy link
    Member

    tjguk commented Apr 9, 2010

    In one window run the attached script (assumes you have pywin32 installed) with a parameter of the directory the TESTFN file will end up in. Then run, eg, test_zipfile in another window. For me:

    c:\temp> watch_dir.py C:\work_in_progress\make-snapshots\trunk\python\Lib

    C:\work_in_progress\make-snapshots\trunk\python\Lib> ..\pcbuild\python.exe -m test.test_zipfile

    Obviously, you'd have to change the path to be wherever you're running the test suite from.

    The watch_dir script sits there looking for file activity, then takes and releases a delete-share handle on the file. It's enough to disrupt certain tests (such as test_zipfile) pretty much every time. Other tests are affected less, or only the first few times. Not sure why, but it's certainly enough to reproduce the general effect of TortoiseSVN or indexer or virus checker.

    @bitdancer
    Copy link
    Member

    If the problem with the fix is that lots of tests use test_support.unlink, then I don't see why the rename dance can't be implemented in test_support.unlink. (Possibly conditioned on whether or not the tests are running on a windows platform.) Dealing with unlink problems is why that method exists in the first place.

    There are probably places in the test suite that *don't* use test_support.unlink, though, so fixing test_support.unlink will not necessarily fix all of the problems. We'll have to fix those other tests (probably by using the new test_support.unlink) as we find them.

    An actual patch will need a test that doesn't rely on win32file (ctypes would be OK). It may be necessary to rename to a unique filename, too. (To be clear, I think a unit test that reproduces the problem by doing an open with FILE_SHARE_DELETE is fine, we don't need a test that reproduces the race condition itself. The windows experts will correct me if I'm wrong :)

    I'm changing the stage to patch needed because it seems to me that using a technique like rename that doesn't introduce additional delays into the test suite is to be preferred.

    @tjguk
    Copy link
    Member

    tjguk commented Apr 11, 2010

    I put together a trivial patch against the 2.7 trunk (basically: I added
    a os.rename before the os.remove in test_support.unlink) and reran my
    test harness with test_zipfile... and it still failed because, of course,
    test_zipfile calls shutil.rmtree which bypasses the test_support.unlink
    altogether etc. etc.

    At this point several things occur to me:

    1. There's little point in targetting the 2.x tree since 2.7 is due
      out any time now and is effectively end-of-line for 2.x and this
      isn't a release-blocker. Therefore whatever effort is brought to bear
      should be against the 3.x latest

    2. This is a repeatable but relatively minority case: it only applies
      to Windows boxes and only to those where some behind-the-scenes process
      is holding this kind of lock on files for long enough to affect the
      tests. I'd certainly like to nail it but...

    3. The amount of work -- and intrusion in the tests -- is quite substantial.
      Just looking [*] for straight os.unlink instances, without even considering
      shutil use gives 71 instances; os.remove gives another 90. That's even
      without the issues of the tests for shutil and os in any case.

    I'm willing to do the legwork of moving all the tests use, eg, support.unlink,
    support.rmtree and so on, but quis custodiet? who'll test the tests?

    grep "os\.unlink" *.py | wc -l
    grep "os\.remove" *.py | wc -l

    1. All that said, the result should be a cleaner, more controllable test environment,
      at least for temp files. Another solution would be to rework the use of
      TESTFN on Windows to use a new temporary file every time. But that would be as much
      work and more than the unlink / rmtree work above.

    I'd like to hear opinions before I move forward with a non-trivial patch
    for this.

    For the sake of completeness, I attach a tiny test case which shows that the
    rename/remove approach should in fact work for the symptom we're seeing.

    @jaraco
    Copy link
    Member

    jaraco commented Apr 11, 2010

    When I was working on a routine to checkout/patch/build/test/cleanup Python (see https://svn.jaraco.com/jaraco/python/jaraco.develop, and particularly scripts/test-windows-symlink-patch.py), I ran into a similar problem during the cleanup step. I tried using shutil.rmtree to clean up the folder that was checked out, but I repeatedly got 'access denied' exceptions.

    I ended up working around the problem by using subprocess and cmd.exe's "rmdir /s /q".

    I think this demonstrates three facets to this problem:

    1. It doesn't just affect the test suite. It happens to other Python programs that are using shutil.rmtree (and possibly remove/unlink) to remove files that are in use.

    2. It doesn't have to be that way. At the very least, there could (should?) be a function in Python that replicates 'rmdir /s /q', which is not subject to the 'access denied' error.

    3. We could use subprocess and cmd.exe to perform the unlink and rmtree operations in the test suite to bypass the transient failures until Python supports the behavior natively.

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 11, 2010

    1. I agree that we should fix the unlinking problem on Windows. I also think that such a fix should be independent of the test suite - many people run into failed unlink problems.

    2. Tim already said it, but I repeat: the common theory is that the culprit for this kind of problem is software like virus checkers, desktop search spiders, Tortoise, ...

    3. I'm not convinced that "rmdir/s/q" *really* solves the problem reliably. Because it's a timing issue, it may be that the additional startup cost to invoke rmdir was enough to let the virus scanner win the race, so that rmdir actually had no problems with removing the file.

    4. I believe the official party line for removing files on Windows is this: "If DeleteFile fails, move the file to the trash bin (of the disk), and use NtSetInformationFile to set the delete disposition for the file". See cygwin's unlink_nt for an elaborate implementation of unlinking:

    http://tinyurl.com/y7w6rrj

    @tjguk
    Copy link
    Member

    tjguk commented Apr 11, 2010

    I'm afraid that the problem doesn't lie in the unlink: DeleteFile
    succeeds. The problem is that the file is only marked for delete
    until such time as the last SHARE_DELETE handle on it is closed.
    Until that time, an attempt to (re)create the file for anything
    other than SHARE_DELETE will fail. As you say, it's a timing
    issue.

    Making os.unlink on Windows more robust may be a good
    idea, but it's not going to help this issue. See my test-case.py
    on an earlier message for reproduction:

    http://bugs.python.org/file16869

    TJG

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 11, 2010

    I'm afraid that the problem doesn't lie in the unlink: DeleteFile
    succeeds. The problem is that the file is only marked for delete
    until such time as the last SHARE_DELETE handle on it is closed.

    Then we shouldn't use DeleteFile in the first place to delete the file,
    but instead CreateFile, with DELETE access (and FILE_SHARE_DELETE
    sharing). If that fails, we need to move the file to the bin
    (see unlink_nt for details).

    Making os.unlink on Windows more robust may be a good
    idea, but it's not going to help this issue. See my test-case.py
    on an earlier message for reproduction:

    It certainly will help this case also. It would detect that the file is
    still open, and move it into the trash bin.

    @tjguk
    Copy link
    Member

    tjguk commented Apr 11, 2010

    Then we shouldn't use DeleteFile in the first place to delete the file,
    but instead CreateFile, with DELETE access (and FILE_SHARE_DELETE
    sharing). If that fails, we need to move the file to the bin
    (see unlink_nt for details).

    I see what you're getting at. I'm getting to the end of my day
    here, but I'll try to put a patch together for posixmodule.c
    when I can, if no-one else gets there first.

    Would you agree that py3k is the only target branch worth aiming for?

    @loewis
    Copy link
    Mannequin

    loewis mannequin commented Apr 11, 2010

    Would you agree that py3k is the only target branch worth aiming for?

    Most certainly, yes.

    @tjguk tjguk self-assigned this Aug 6, 2010
    @ncoghlan
    Copy link
    Contributor

    I closed bpo-9627 as a duplicate of this - the buildbot failure referenced there was most likely due to something else holding open a temporary file that the test suite thought was closed.

    @asvetlov
    Copy link
    Contributor Author

    As I can see cygwin's unlink_nt uses Nt* functions family (NtSetInformationFile etc) which are part of Windows DDK.

    Do you like to use them or prefer SDK ones (say SetFileInformationByHandle)?

    In second case python unlink can borrow deletion schema from cygwin for modern Windows versions (Vista+) and return to legacy trivial DeleteFile call if OS is WinXP.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 27, 2011

    FWIW, Mercurial uses the following dance:
    http://selenic.com/repo/hg/file/463aca32a937/mercurial/windows.py#l296

    (Mercurial is under the GPL, so we can't copy that code verbatim; but it can serve as an inspiration)

    @tjguk
    Copy link
    Member

    tjguk commented Mar 27, 2011

    For clarity, while making unlink more robust is no bad thing, the error occurs when the unlink *succeeds* but a subsequent create of the same name fails. This happens when an indexer, Virus scanner or TortoiseSvn etc. has opened the file with SHARE_DELETE. This allows a DeleteFile to succeed but continues to hold an open handle on the file. A following test which uses an identically named file (such as the global TESTFN) will fail if not enough time has elapsed for the background process to release its handle. A good candidate to see this in action is the test for tarfile.

    I did start to undertake a conversion of TESTFN to a named temporary, but it started to sprawl all over the place and came up against a number of corner cases (eg where tests deliberately wanted two filenames to be the same) so I gave up.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 27, 2011

    I did start to undertake a conversion of TESTFN to a named temporary,
    but it started to sprawl all over the place and came up against a
    number of corner cases (eg where tests deliberately wanted two
    filenames to be the same) so I gave up.

    How about renaming to a unique random name just before the unlink(), as
    Mercurial does?
    Would it alleviate the problem or am I missing something?

    @tjguk
    Copy link
    Member

    tjguk commented Mar 27, 2011

    Well http://bugs.python.org/issue7443#msg102833 outlines the problems I encountered while trying to do essentially that. Nothing insurmountable, but definitely bigger than simply adding one line of code.

    Looks to me like there are two avenues of approach (and, given the chatter on this issue, several people willing to address them):

    • Patch Py_DeleteFileW in posixmodule.c so that it renames before deleting: should solve the problem overall but obviously has a possible wider impact, in general and on performance in particular. This rename might be a simple rename-to-guid or something more sophisticated such as the rename-to-recycler which cygwin uses.

    • Patch support.unlink in the test package to do the rename dance on the basis that it'll fix at least some of the problems with less impact overall.

    Opinions? I'm willing to do either.

    @pitrou
    Copy link
    Member

    pitrou commented Mar 30, 2011

    • Patch Py_DeleteFileW in posixmodule.c so that it renames before
      deleting: should solve the problem overall but obviously has a
      possible wider impact, in general and on performance in particular.
      This rename might be a simple rename-to-guid or something more
      sophisticated such as the rename-to-recycler which cygwin uses.

    • Patch support.unlink in the test package to do the rename dance on
      the basis that it'll fix at least some of the problems with less
      impact overall.

    Opinions? I'm willing to do either.

    Well, since I'm not a Windows expert, I can only give an intuitive
    opinion. I think that we should start with patching support.unlink();
    tweaking Py_DeleteFile() so as to do something more sophisticated than a
    simple removal sounds like it could hide some behaviour change that
    could hit some legitimate uses.

    (as an aside, for higher-level variants of OS functions, the shutil may
    be an appropriate recipient)

    @jdigital
    Copy link
    Mannequin

    jdigital mannequin commented Apr 1, 2011

    • Patch Py_DeleteFileW in posixmodule.c so that it renames before
      deleting: should solve the problem overall but obviously has a
      possible wider impact, in general and on performance in particular.
      This rename might be a simple rename-to-guid or something more
      sophisticated such as the rename-to-recycler which cygwin uses.

    • Patch support.unlink in the test package to do the rename dance on
      the basis that it'll fix at least some of the problems with less
      impact overall.

    Opinions? I'm willing to do either.

    I vote for fixing the test package.

    File system "extensions" may track and record this activity. To use DropBox as an example, doing the rename and delete will cause the renamed and deleted file to be recorded.

    Just my opinion, but the code path to delete a file should be as short as possible. Making lots of other OS calls just doesn't seem right.

    I understand the wish to have a reliable unlink call but I'd be uncomfortable with a workaround that may be visible around the edges.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Apr 2, 2011

    I'll throw in my 2 cents as to a possible way forward:

    • fix test.support.unlink in all current maintenance branches (2.7, 3.1, 3.2, default)
    • expose the feature to users as "shutil.rmfile" for 3.3

    @asvetlov
    Copy link
    Contributor Author

    asvetlov commented Apr 2, 2011

    FYI: implementation of unlink already has multiple calls for unicode version to process symlinks. Ansi version is the simple DeleteFileA call.
    Now we have different behavior for unicode and ansi variants as I can see.

    Now I'm working on unlink implementations partially borrowed from cygwin as Martin suggested. I still not have a final patch covering all known cases. Perhaps I can introduce it next week.

    @ncoghlan
    Copy link
    Contributor

    ncoghlan commented Aug 2, 2012

    See bpo-15496 for an alternative approach to solving this problem, at least in the test suite - as noted in that issue, the rename dance isn't sufficient when the problem gets triggered by a different sequence like:

    unlink(file_in_parent_dir)
    unlink(parent_dir)  # This fails if file is still locked

    @tjguk
    Copy link
    Member

    tjguk commented Jul 31, 2013

    This has been covered off by work done with the test.support package including context managers for temporary files / directories, and a waitfor mechanism which waits some time if a file can't be accessed.

    @tjguk tjguk closed this as completed Jul 31, 2013
    @bitdancer
    Copy link
    Member

    The support package is fixed (I presume :), but there was also a desire expressed to expose this functionality in shutil, since it can arise in "normal" Python programs as well as in the test suite. Given that the fix has been successful in the test suite, shouldn't we open an issue to expose it in shutil?

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    OS-windows tests Tests in the Lib/test dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants