classification
Title: shutil.copyfile throws incorrect SameFileError on Google Drive File Stream
Type: behavior Stage:
Components: Library (Lib), Windows Versions: Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Deniz Bozyigit, eryksun, gary ruben, giampaolo.rodola, paul.moore, r.david.murray, steve.dower, tim.golden, zach.ware
Priority: normal Keywords:

Created on 2018-06-22 00:38 by Deniz Bozyigit, last changed 2019-02-20 01:45 by eryksun.

Messages (11)
msg320199 - (view) Author: Deniz Bozyigit (Deniz Bozyigit) Date: 2018-06-22 00:38
When using shutil.copyfile on the Google Drive File Stream file system, a incorrect SameFileError can occur. 

MWE (assuming foo.txt exists in your google drive G:\\):
>>> f1 = 'G:\\My Drive\\foo.txt'
>>> f2 = 'G:\\My Drive\\foo2.txt'
>>> import shutil
>>> shutil.copyfile(f1, f2)
>>> shutil.copyfile(f1, f2)

--> Last line throws incorrect SameFileError. In comparison, executing the same code on a different file system (e.g. local hard drive) will result in no errors.

More details described here: https://github.com/jupyter/notebook/issues/3615

The error originates in the library in generalpath.py in the function samestat: Google Drive File Stream reports inode==0 which makes os.path.samefile(f1, f2) == True for any files f1 and f2 on Google File Stream.

I propose the following patch, which currently works for me:

--- genericpath.py      2018-06-22 02:14:27.145744900 +0200
+++ genericpath_new.py  2018-06-22 02:10:44.485961100 +0200
@@ -86,8 +86,11 @@
 # describing the same file?
 def samestat(s1, s2):
     """Test whether two stat buffers reference the same file"""
-    return (s1.st_ino == s2.st_ino and
-            s1.st_dev == s2.st_dev)
+    return (s1.st_ino != 0 and
+                       s2.st_ino != 0 and
+                       s1.st_ino == s2.st_ino and
+            s1.st_dev == s2.st_dev)
+


 # Are two filenames really pointing to the same file?
msg320201 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-06-22 02:00
That patch would cause references to the same file on a google drive to report that the files were different.

I'd say this is a bug in Google Drive's posix emulation.

I'm not sure there's a good answer here, because even if every other element of the stat were equal, that wouldn't mean it was the same file: if you use copystat as well as copyfile the stats would otherwise be equal.

I think we should close this as a third party bug.
msg320202 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2018-06-22 02:07
David, this is a bug in Python, and the proposed patch is insufficient. We use the volume serial number as st_dev, and this is not guaranteed to be unique in Windows, and may be 0, just as a file's index number is also allowed to be 0. Both possibilities are well documented. With a WebDAV volume you will find that both numbers are 0. Using the VSN and file index as if they're the same as POSIX st_dev and st_ino is technically wrong. There is no guarantee that this tuple uniquely identifies a file in Windows.
msg320204 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2018-06-22 02:25
OK.  Finding a solution for this (other than never raising samefile on such systems and documenting it as a limitation) may be difficult.
msg320210 - (view) Author: Deniz Bozyigit (Deniz Bozyigit) Date: 2018-06-22 05:30
Hi, thank you for looking into this. I'm aware that the shown patch is not the ideal solution and a mere fix to get my jupyter operational. 

An indication on a workable solution could be the _samefile function in shutil that wraps os.path.samefile:

def _samefile(src, dst):
    # Macintosh, Unix.
    if hasattr(os.path, 'samefile'):
        try:
            return os.path.samefile(src, dst)
        except OSError:
            return False

    # All other platforms: check for same pathname.
    return (os.path.normcase(os.path.abspath(src)) ==
            os.path.normcase(os.path.abspath(dst)))


I understand that the implicit platform differentiation that is done here (see the comment line) is not valid anymore since os.path.samefile is now available on windows systems. It seems that for a windows system the here implemented file name comparison could be workable (even moving it into os.path.samefile?), if the platform is identified correctly.
msg320243 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2018-06-22 16:03
For Windows it would be best (though slower) to pass the paths through os._getfinalpathname before comparison. Detecting that function is an easy way to get the platform right, too.

Unfortunately, the MSDN docs don't make clear that the VSN can be modified, and even goes as far as saying you can compare the two values Eryk has pointed out that you shouldn't :(

The problem is that the values on Windows are coming directly from the filesystem, and apparently there's no requirement that they actually be provided by the filesystem...
msg335770 - (view) Author: gary ruben (gary ruben) Date: 2019-02-17 11:20
I wanted to add a datapoint to this. I also experience this problem in Windows but not with Google Drive File Stream. In my case it is also being triggered by Jupyter, which was where Deniz first noticed it, but I was saving the notebook to my Z: drive, which is a mapping of a network drive that is backed up with Windows' Sync Center; nothing to do with Google's file stream, but a standard Windows feature that triggers the same bug. For the moment, I find that Deniz's workaround lets me continue to use Jupyter in conjunction with that drive path.
msg335771 - (view) Author: Giampaolo Rodola' (giampaolo.rodola) * (Python committer) Date: 2019-02-17 11:51
@eryksun

> Using the VSN and file index as if they're the same as POSIX st_dev and st_ino is technically wrong. There is no guarantee that this tuple uniquely identifies a file in Windows.

I agree. FWIW, I faced the same issue on pyftpdlib and ended up not supporting Windows: 
https://github.com/giampaolo/pyftpdlib/blob/eef8a5650cd611da1df5fce16974ce90f43f4dc0/pyftpdlib/filesystems.py#L596-L605
It seems to me the problem here is os.path.samefile():
 https://github.com/python/cpython/blob/0185f34ddcf07b78feb6ac666fbfd4615d26b028/Lib/genericpath.py#L87-L98

@steve.dower
> For Windows it would be best (though slower) to pass the paths through os._getfinalpathname before comparison.

It seems this is how samefile() was originally implemented but it was changed:
https://github.com/python/cpython/commit/490b32a3976d84eaf1d6ca8cdcb00eac0ce5055b
https://bugs.python.org/issue11939
msg335881 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-02-19 03:27
Here's a WebDAV example:

    net use Z: \\live.sysinternals.com\tools

Both st_dev (volume serial number) and st_ino (file number) are 0 for all files on this drive. _getfinalpathname also fails since WebDAV doesn't support the default FILE_NAME_NORMALIZED flag (i.e. replace 8.3 short names in the path with normal names). We need realpath(), which will reasonably handle cases where _getfinalpathname fails. See issue 14094.

samefile() has to go back into ntpath.py in Windows. The generic implementation relies on the POSIX guarantee that the tuple (st_dev, st_ino) uniquely identifies a file, which Windows doesn't provide.

I suggest the following for samefile(). If either st_ino or st_dev is different, return False. If they're equal and non-zero, return True. Otherwise compare the final paths. If st_ino is zero, compare the entire paths. If st_ino is non-zero, then compare only the drives. (This supports the unusual case of hardlinks on a volume that has no serial number.) The final paths can come from realpath() if issue 14094 is resolved. For example:

    def samefile(fn1, fn2):
        """Test whether two file names reference the same file"""
        s1 = os.stat(fn1)
        s2 = os.stat(fn2)
        if s1.st_ino != s2.st_ino or s1.st_dev != s2.st_dev:
            return False
        if s1.st_ino and s1.st_dev:
            return True
        rp1, rp2 = realpath(fn1), realpath(fn2)
        if s1.st_ino:
            return splitdrive(rp1)[0] == splitdrive(rp2)[0]
        return rp1 == rp2

For sameopenfile(), it's trivial to extend _getfinalpathname to support the argument as a file-descriptor, in which case it simply calls _get_osfhandle to get the handle instead of opening the file via CreateFileW. This loses the flexibility of realpath(), but below I propose extending the range of paths supported by _getfinalpathname.

---

Note that the root directory in FAT32 is file number 0. For NTFS, the file number is never 0. The high word of the 64-bit file reference number is a sequence number that begins at 1. 

For local drives the volume serial number shouldn't be zero. It's possible to manually change it to zero, but that's intentional mischief. There's a small chance that two drives have the same serial number, and an even smaller chance that we get an (st_dev, st_ino) match that's a false positive. I'm not happy with that, however small the probability, but I don't know a simple way to address the problem.

For local storage, Windows does have a device number that's similar to POSIX st_dev. It's actually three numbers -- device type (16-bit), device number (32-bit), and partition number (32-bit) -- that taken together constitute an 80-bit ID. The problem is that we have to query IOCTL_STORAGE_GET_DEVICE_NUMBER directly using a handle for the volume device. Getting a handle for the volume can be expensive since we may be starting from a file handle or have a volume that's mounted as a filesystem junction. Plus this lacks generality since it's not implemented by MUP (Multiple UNC Provider, the proxy device for UNC providers) -- not even for a local SMB share such as "\\localhost\C$". 

To improve reliability for corner cases, _getfinalpathname could be extended to try all path types, with and without normalization. Start with the DOS name. Next try the GUID name (i.e. a device that supports the mountpoint manager but isn't auto-mounted as a DOS drive or file-system junction) and finally the NT name (i.e. a device that doesn't support the mountpoint manager, such as an ImDisk virtual disk, the named-pipe file system, or mailslot file system). For an NT path, _getfinalpathname can try to manually resolve a normal mountpoint via QueryDosDeviceW, but limit this to just drive-letter names and well-known global names from "HKLM\System\CurrentControlSet\Control\Session Manager\DOS Devices" (e.g. PIPE, MAILSLOT). If there's no normal mountpoint, prefix the NT path with the "\\?\GLOBALROOT" link. For example a file named "\spam" on the devices "\Device\ImDisk0" (a ramdisk, say it's mounted as R:), "\Device\NamedPipe", and "\Device\NtOnly" would resolve to "\\?\R:\spam", "\\?\PIPE\spam", and "\\?\GLOBALROOT\Device\NtOnly\spam".
msg335977 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2019-02-19 16:53
Would it make sense to add a parameter to _getfinalpathname that specifies the type of the path? For same[open]file(), we can probably just go to the most unreadable but broadly supported type, whereas the other functions that are actually going to return the path can work their way up through easiest-to-read options.

I like the proposed logic for st_dev and st_ino, though I don't see any value in testing the drive name separately. At that point we've already gone to the filesystem driver, so any chances of doing it fast are gone and we may as well have simpler logic that goes straight to the full path.
msg336034 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-02-20 01:45
> I don't see any value in testing the drive name separately. 

If the file number is non-zero, then the volume is the only question. I limited the comparison to just the drives in case the volume supports hardlinks. We can change the check to `s1.st_ino and s1.st_nlink > 1`, so we only incur the expense when necessary. Maybe that's never in practice, but I'd rather error on the side of caution.

Anyway, it turns out we can't use splitdrive(). It doesn't handle UNC device paths the same way as regular UNC paths. It returns "\\?\UNC" as the drive for "\\?\UNC\server\share", whereas for a regular UNC path the drive is "\\server\share". 

> Would it make sense to add a parameter to _getfinalpathname that 
> specifies the type of the path? For same[open]file(), we can 
> probably just go to the most unreadable but broadly supported 
> type

That's probably the simplest change we can make here, in addition to file-descriptor support. The non-normalized NT path (VOLUME_NAME_NT | FILE_NAME_OPENED) is the most broadly supported and cheapest path to get. It's just two system calls: NtQueryObject and NtQueryInformationFile.
History
Date User Action Args
2019-02-20 01:45:32eryksunsetmessages: + msg336034
2019-02-19 16:53:10steve.dowersetmessages: + msg335977
versions: - Python 3.4, Python 3.5, Python 3.6
2019-02-19 03:27:28eryksunsetmessages: + msg335881
2019-02-17 11:51:01giampaolo.rodolasetmessages: + msg335771
2019-02-17 11:20:42gary rubensetnosy: + gary ruben
messages: + msg335770
2018-07-11 07:55:46serhiy.storchakasettype: crash -> behavior
2018-06-25 22:24:19giampaolo.rodolasetnosy: + giampaolo.rodola
2018-06-22 16:03:30steve.dowersetmessages: + msg320243
2018-06-22 05:30:49Deniz Bozyigitsetmessages: + msg320210
2018-06-22 02:25:34r.david.murraysetmessages: + msg320204
2018-06-22 02:07:35eryksunsetnosy: + eryksun
messages: + msg320202
2018-06-22 02:00:36r.david.murraysetnosy: + r.david.murray
messages: + msg320201
2018-06-22 00:39:10Deniz Bozyigitsettitle: shutil.copyfile throws incorrect SameFileError -> shutil.copyfile throws incorrect SameFileError on Google Drive File Stream
2018-06-22 00:38:26Deniz Bozyigitcreate