Issue 33935: [Windows] samefile() should not use zero-valued st_dev and st_ino

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/78116

classification

Title:	[Windows] samefile() should not use zero-valued st_dev and st_ino
Type:	behavior	Stage:
Components:	Extension Modules, Library (Lib), Windows	Versions:	Python 3.10, Python 3.9, Python 3.8

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Deniz Bozyigit, alexkowel, eryksun, gary ruben, giampaolo.rodola, paul.moore, r.david.murray, steve.dower, tim.golden, zach.ware
Priority:	normal	Keywords:

Created on 2018-06-22 00:38 by Deniz Bozyigit, last changed 2022-04-11 14:59 by admin.

Messages (13)
msg320199 - (view)	Author: Deniz Bozyigit (Deniz Bozyigit)	Date: 2018-06-22 00:38
When using shutil.copyfile on the Google Drive File Stream file system, a incorrect SameFileError can occur. MWE (assuming foo.txt exists in your google drive G:\\): >>> f1 = 'G:\\My Drive\\foo.txt' >>> f2 = 'G:\\My Drive\\foo2.txt' >>> import shutil >>> shutil.copyfile(f1, f2) >>> shutil.copyfile(f1, f2) --> Last line throws incorrect SameFileError. In comparison, executing the same code on a different file system (e.g. local hard drive) will result in no errors. More details described here: https://github.com/jupyter/notebook/issues/3615 The error originates in the library in generalpath.py in the function samestat: Google Drive File Stream reports inode==0 which makes os.path.samefile(f1, f2) == True for any files f1 and f2 on Google File Stream. I propose the following patch, which currently works for me: --- genericpath.py 2018-06-22 02:14:27.145744900 +0200 +++ genericpath_new.py 2018-06-22 02:10:44.485961100 +0200 @@ -86,8 +86,11 @@ # describing the same file? def samestat(s1, s2): """Test whether two stat buffers reference the same file""" - return (s1.st_ino == s2.st_ino and - s1.st_dev == s2.st_dev) + return (s1.st_ino != 0 and + s2.st_ino != 0 and + s1.st_ino == s2.st_ino and + s1.st_dev == s2.st_dev) + # Are two filenames really pointing to the same file?
msg320201 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2018-06-22 02:00
That patch would cause references to the same file on a google drive to report that the files were different. I'd say this is a bug in Google Drive's posix emulation. I'm not sure there's a good answer here, because even if every other element of the stat were equal, that wouldn't mean it was the same file: if you use copystat as well as copyfile the stats would otherwise be equal. I think we should close this as a third party bug.
msg320202 - (view)	Author: Eryk Sun (eryksun) *	Date: 2018-06-22 02:07
David, this is a bug in Python, and the proposed patch is insufficient. We use the volume serial number as st_dev, and this is not guaranteed to be unique in Windows, and may be 0, just as a file's index number is also allowed to be 0. Both possibilities are well documented. With a WebDAV volume you will find that both numbers are 0. Using the VSN and file index as if they're the same as POSIX st_dev and st_ino is technically wrong. There is no guarantee that this tuple uniquely identifies a file in Windows.
msg320204 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2018-06-22 02:25
OK. Finding a solution for this (other than never raising samefile on such systems and documenting it as a limitation) may be difficult.
msg320210 - (view)	Author: Deniz Bozyigit (Deniz Bozyigit)	Date: 2018-06-22 05:30
Hi, thank you for looking into this. I'm aware that the shown patch is not the ideal solution and a mere fix to get my jupyter operational. An indication on a workable solution could be the _samefile function in shutil that wraps os.path.samefile: def _samefile(src, dst): # Macintosh, Unix. if hasattr(os.path, 'samefile'): try: return os.path.samefile(src, dst) except OSError: return False # All other platforms: check for same pathname. return (os.path.normcase(os.path.abspath(src)) == os.path.normcase(os.path.abspath(dst))) I understand that the implicit platform differentiation that is done here (see the comment line) is not valid anymore since os.path.samefile is now available on windows systems. It seems that for a windows system the here implemented file name comparison could be workable (even moving it into os.path.samefile?), if the platform is identified correctly.
msg320243 - (view)	Author: Steve Dower (steve.dower) *	Date: 2018-06-22 16:03
For Windows it would be best (though slower) to pass the paths through os._getfinalpathname before comparison. Detecting that function is an easy way to get the platform right, too. Unfortunately, the MSDN docs don't make clear that the VSN can be modified, and even goes as far as saying you can compare the two values Eryk has pointed out that you shouldn't :( The problem is that the values on Windows are coming directly from the filesystem, and apparently there's no requirement that they actually be provided by the filesystem...
msg335770 - (view)	Author: gary ruben (gary ruben)	Date: 2019-02-17 11:20
I wanted to add a datapoint to this. I also experience this problem in Windows but not with Google Drive File Stream. In my case it is also being triggered by Jupyter, which was where Deniz first noticed it, but I was saving the notebook to my Z: drive, which is a mapping of a network drive that is backed up with Windows' Sync Center; nothing to do with Google's file stream, but a standard Windows feature that triggers the same bug. For the moment, I find that Deniz's workaround lets me continue to use Jupyter in conjunction with that drive path.
msg335771 - (view)	Author: Giampaolo Rodola' (giampaolo.rodola) *	Date: 2019-02-17 11:51
@eryksun > Using the VSN and file index as if they're the same as POSIX st_dev and st_ino is technically wrong. There is no guarantee that this tuple uniquely identifies a file in Windows. I agree. FWIW, I faced the same issue on pyftpdlib and ended up not supporting Windows: https://github.com/giampaolo/pyftpdlib/blob/eef8a5650cd611da1df5fce16974ce90f43f4dc0/pyftpdlib/filesystems.py#L596-L605 It seems to me the problem here is os.path.samefile(): https://github.com/python/cpython/blob/0185f34ddcf07b78feb6ac666fbfd4615d26b028/Lib/genericpath.py#L87-L98 @steve.dower > For Windows it would be best (though slower) to pass the paths through os._getfinalpathname before comparison. It seems this is how samefile() was originally implemented but it was changed: https://github.com/python/cpython/commit/490b32a3976d84eaf1d6ca8cdcb00eac0ce5055b https://bugs.python.org/issue11939
msg335881 - (view)	Author: Eryk Sun (eryksun) *	Date: 2019-02-19 03:27
Here's a WebDAV example: net use Z: \\live.sysinternals.com\tools Both st_dev (volume serial number) and st_ino (file number) are 0 for all files on this drive. _getfinalpathname also fails since WebDAV doesn't support the default FILE_NAME_NORMALIZED flag (i.e. replace 8.3 short names in the path with normal names). We need realpath(), which will reasonably handle cases where _getfinalpathname fails. See issue 14094. samefile() has to go back into ntpath.py in Windows. The generic implementation relies on the POSIX guarantee that the tuple (st_dev, st_ino) uniquely identifies a file, which Windows doesn't provide. I suggest the following for samefile(). If either st_ino or st_dev is different, return False. If they're equal and non-zero, return True. Otherwise compare the final paths. If st_ino is zero, compare the entire paths. If st_ino is non-zero, then compare only the drives. (This supports the unusual case of hardlinks on a volume that has no serial number.) The final paths can come from realpath() if issue 14094 is resolved. For example: def samefile(fn1, fn2): """Test whether two file names reference the same file""" s1 = os.stat(fn1) s2 = os.stat(fn2) if s1.st_ino != s2.st_ino or s1.st_dev != s2.st_dev: return False if s1.st_ino and s1.st_dev: return True rp1, rp2 = realpath(fn1), realpath(fn2) if s1.st_ino: return splitdrive(rp1)[0] == splitdrive(rp2)[0] return rp1 == rp2 For sameopenfile(), it's trivial to extend _getfinalpathname to support the argument as a file-descriptor, in which case it simply calls _get_osfhandle to get the handle instead of opening the file via CreateFileW. This loses the flexibility of realpath(), but below I propose extending the range of paths supported by _getfinalpathname. --- Note that the root directory in FAT32 is file number 0. For NTFS, the file number is never 0. The high word of the 64-bit file reference number is a sequence number that begins at 1. For local drives the volume serial number shouldn't be zero. It's possible to manually change it to zero, but that's intentional mischief. There's a small chance that two drives have the same serial number, and an even smaller chance that we get an (st_dev, st_ino) match that's a false positive. I'm not happy with that, however small the probability, but I don't know a simple way to address the problem. For local storage, Windows does have a device number that's similar to POSIX st_dev. It's actually three numbers -- device type (16-bit), device number (32-bit), and partition number (32-bit) -- that taken together constitute an 80-bit ID. The problem is that we have to query IOCTL_STORAGE_GET_DEVICE_NUMBER directly using a handle for the volume device. Getting a handle for the volume can be expensive since we may be starting from a file handle or have a volume that's mounted as a filesystem junction. Plus this lacks generality since it's not implemented by MUP (Multiple UNC Provider, the proxy device for UNC providers) -- not even for a local SMB share such as "\\localhost\C$". To improve reliability for corner cases, _getfinalpathname could be extended to try all path types, with and without normalization. Start with the DOS name. Next try the GUID name (i.e. a device that supports the mountpoint manager but isn't auto-mounted as a DOS drive or file-system junction) and finally the NT name (i.e. a device that doesn't support the mountpoint manager, such as an ImDisk virtual disk, the named-pipe file system, or mailslot file system). For an NT path, _getfinalpathname can try to manually resolve a normal mountpoint via QueryDosDeviceW, but limit this to just drive-letter names and well-known global names from "HKLM\System\CurrentControlSet\Control\Session Manager\DOS Devices" (e.g. PIPE, MAILSLOT). If there's no normal mountpoint, prefix the NT path with the "\\?\GLOBALROOT" link. For example a file named "\spam" on the devices "\Device\ImDisk0" (a ramdisk, say it's mounted as R:), "\Device\NamedPipe", and "\Device\NtOnly" would resolve to "\\?\R:\spam", "\\?\PIPE\spam", and "\\?\GLOBALROOT\Device\NtOnly\spam".
msg335977 - (view)	Author: Steve Dower (steve.dower) *	Date: 2019-02-19 16:53
Would it make sense to add a parameter to _getfinalpathname that specifies the type of the path? For same[open]file(), we can probably just go to the most unreadable but broadly supported type, whereas the other functions that are actually going to return the path can work their way up through easiest-to-read options. I like the proposed logic for st_dev and st_ino, though I don't see any value in testing the drive name separately. At that point we've already gone to the filesystem driver, so any chances of doing it fast are gone and we may as well have simpler logic that goes straight to the full path.
msg336034 - (view)	Author: Eryk Sun (eryksun) *	Date: 2019-02-20 01:45
> I don't see any value in testing the drive name separately. If the file number is non-zero, then the volume is the only question. I limited the comparison to just the drives in case the volume supports hardlinks. We can change the check to `s1.st_ino and s1.st_nlink > 1`, so we only incur the expense when necessary. Maybe that's never in practice, but I'd rather error on the side of caution. Anyway, it turns out we can't use splitdrive(). It doesn't handle UNC device paths the same way as regular UNC paths. It returns "\\?\UNC" as the drive for "\\?\UNC\server\share", whereas for a regular UNC path the drive is "\\server\share". > Would it make sense to add a parameter to _getfinalpathname that > specifies the type of the path? For same[open]file(), we can > probably just go to the most unreadable but broadly supported > type That's probably the simplest change we can make here, in addition to file-descriptor support. The non-normalized NT path (VOLUME_NAME_NT \| FILE_NAME_OPENED) is the most broadly supported and cheapest path to get. It's just two system calls: NtQueryObject and NtQueryInformationFile.
msg387145 - (view)	Author: Alexander (alexkowel)	Date: 2021-02-17 11:27
Hi, This issue was also confirmed by a Verge3D user (Python 3.7.7) https://www.soft8soft.com/topic/export-gltf-error
msg387519 - (view)	Author: Eryk Sun (eryksun) *	Date: 2021-02-22 17:28
On second thought, I think it's cleaner and avoids potential race conditions to implement ntpath.sameopenfile(). Then use that to implement ntpath.samefile(). Here's a summary of suggested changes, with sample code to flesh out the concepts: * Extend nt._getfinalpathname to support file descriptors via _get_osfhandle(), and to support a flags parameter that defaults to 0 (i.e. VOLUME_NAME_DOS \| FILE_NAME_NORMALIZED). * Add the constants VOLUME_NAME_NONE, VOLUME_NAME_NT, VOLUME_NAME_GUID, and FILE_NAME_OPENED. * Fix _winapi.CreateFile to link with CreateFileW instead of CreateFileA, and add the constant FILE_FLAG_BACKUP_SEMANTICS. * Implement ntpath.sameopenfile(): def _getfinalpathname_try_norm(file, flags): flags &= ~_nt.FILE_NAME_OPENED try: return _getfinalpathname(file, flags) except OSError as e: # 1: ERROR_INVALID_FUNCTION # 50: ERROR_NOT_SUPPORTED # 87: ERROR_INVALID_PARAMETER if e.winerror not in (1, 87, 50): raise # Try to resolve the path as opened, which may contain # legacy 8.3 short names. return _getfinalpathname(file, flags \| _nt.FILE_NAME_OPENED) def sameopenfile(fp1, fp2): """Test whether two open file objects reference the same file""" s1 = os.fstat(fp1) s2 = os.fstat(fp2) if s1.st_ino != s2.st_ino or s1.st_dev != s2.st_dev: return False if s1.st_dev and s1.st_ino: return True # st_dev or st_ino is 0, e.g. both are 0 for a WebDAV filesystem. # Compare the final paths instead. if s1.st_dev or s1.st_ino: # Get the final paths without the device name. r1 = _getfinalpathname_try_norm(fp1, _nt.VOLUME_NAME_NONE) r2 = _getfinalpathname_try_norm(fp2, _nt.VOLUME_NAME_NONE) if s1.st_dev: # st_dev is non-zero, so just compare these # device-relative paths. return r1 == r2 n1 = _getfinalpathname_try_norm(fp1, _nt.VOLUME_NAME_NT) n2 = _getfinalpathname_try_norm(fp2, _nt.VOLUME_NAME_NT) if s1.st_ino: # st_ino is non-zero, so ignore the paths on the device(s), # which could be different hardlink paths for the same file. # Just compare the device names. index1, index2 = n1.rfind(r1), n2.rfind(r2) if index1 == -1 or index2 == -1: # This should never happen, but never say never. index1 = index2 = None d1 = n1[:index1] d2 = n2[:index2] return d1 == d2 # st_dev and st_ino are both 0. Compare the full NT paths. return n1 == n2 * Implement ntpath.samefile() via sameopenfile(): def samefile(fn1, fn2): """Test whether two file names reference the same file""" # Open the files to avoid race conditions. to_close = [] if not isinstance(fn1, int): h = _winapi.CreateFile( os.fsdecode(fn1), 0, 0, 0, _winapi.OPEN_EXISTING, _winapi.FILE_FLAG_BACKUP_SEMANTICS, 0) try: fn1 = _msvcrt.open_osfhandle(h, 0) except: _winapi.CloseHandle(h) raise to_close.append(fn1) try: if not isinstance(fn2, int): h = _winapi.CreateFile( os.fsdecode(fn2), 0, 0, 0, _winapi.OPEN_EXISTING, _winapi.FILE_FLAG_BACKUP_SEMANTICS, 0) try: fn2 = _msvcrt.open_osfhandle(h, 0) except: _winapi.CloseHandle(h) raise to_close.append(fn2) return sameopenfile(fn1, fn2) finally: for fd in to_close: os.close(fd)

History
Date	User	Action	Args
2022-04-11 14:59:02	admin	set	github: 78116
2021-03-21 05:25:57	eryksun	set	title: shutil.copyfile throws incorrect SameFileError on Google Drive File Stream -> [Windows] samefile() should not use zero-valued st_dev and st_ino
2021-03-21 05:16:10	eryksun	set	components: + Extension Modules versions: + Python 3.9, Python 3.10, - Python 3.7
2021-02-22 17:28:06	eryksun	set	messages: + msg387519
2021-02-17 11:27:49	alexkowel	set	nosy: + alexkowel messages: + msg387145
2019-02-20 01:45:32	eryksun	set	messages: + msg336034
2019-02-19 16:53:10	steve.dower	set	messages: + msg335977 versions: - Python 3.4, Python 3.5, Python 3.6
2019-02-19 03:27:28	eryksun	set	messages: + msg335881
2019-02-17 11:51:01	giampaolo.rodola	set	messages: + msg335771
2019-02-17 11:20:42	gary ruben	set	nosy: + gary ruben messages: + msg335770
2018-07-11 07:55:46	serhiy.storchaka	set	type: crash -> behavior
2018-06-25 22:24:19	giampaolo.rodola	set	nosy: + giampaolo.rodola
2018-06-22 16:03:30	steve.dower	set	messages: + msg320243
2018-06-22 05:30:49	Deniz Bozyigit	set	messages: + msg320210
2018-06-22 02:25:34	r.david.murray	set	messages: + msg320204
2018-06-22 02:07:35	eryksun	set	nosy: + eryksun messages: + msg320202
2018-06-22 02:00:36	r.david.murray	set	nosy: + r.david.murray messages: + msg320201
2018-06-22 00:39:10	Deniz Bozyigit	set	title: shutil.copyfile throws incorrect SameFileError -> shutil.copyfile throws incorrect SameFileError on Google Drive File Stream
2018-06-22 00:38:26	Deniz Bozyigit	create