Issue40095
Created on 2020-03-28 08:27 by mbrijun@gmail.com, last changed 2022-04-11 14:59 by admin.
Messages (4) | |||
---|---|---|---|
msg365206 - (view) | Author: Martynas Brijunas (mbrijun@gmail.com) | Date: 2020-03-28 08:27 | |
On a Windows 10 volume formatted with ReFS, pathlib.Path.stat() returns an incorrect value for "st_ino".

The correct value returned by the OS:

C:\Users>fsutil file queryfileid u:\test\test.jpg
File ID is 0x00000000000029d500000000000004ae

An incorrect value obtained with pathlib.Path.stat():

Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pathlib import Path
>>> hex(Path('U:/test/test.jpg').stat().st_ino)
'0x4000000004ae29d5'

The problem does *not* exist on an NTFS volume:

C:\Users>fsutil file queryfileid o:\OneDrive\test\test.jpg
File ID is 0x0000000000000000000300000001be39

>>> hex(Path('O:/OneDrive/test/test.jpg').stat().st_ino)
'0x300000001be39' |
|||
msg365207 - (view) | Author: Steve Dower (steve.dower) * | Date: 2020-03-28 12:14 | |
There's no guarantee that st_ino will be the same as any other queryable value. Only that it will (hopefully) be unique on that disk. stat() on Windows is just an approximation of stat() from POSIX - it's not a reliable way of reading Windows-specific filesystems. If you can generate a collision, it may be worth investing in a solution. Otherwise, if you have a specific need to get the same file ID as fsutil, you should probably go to the native API yourself. |
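
For readers who do need the same ID that fsutil reports, the native-API route could be sketched with ctypes roughly as follows. This is a minimal sketch under stated assumptions, not an endorsed implementation: it assumes the FileIdInfo information class (value 18) of GetFileInformationByHandleEx, and it assumes that reading the 16-byte identifier as a little-endian integer matches the way fsutil prints it. The helper name query_file_id_128 is hypothetical and not part of Python.

```python
import ctypes
import sys

FILE_ID_INFO_CLASS = 18                  # FileIdInfo in FILE_INFO_BY_HANDLE_CLASS
FILE_SHARE_ALL = 0x7                     # read | write | delete
OPEN_EXISTING = 3
FILE_FLAG_BACKUP_SEMANTICS = 0x02000000  # needed to open directories


class FILE_ID_INFO(ctypes.Structure):
    # Mirrors the Win32 FILE_ID_INFO layout: a 64-bit volume serial
    # number followed by a FILE_ID_128 (an array of 16 bytes).
    _fields_ = [
        ("VolumeSerialNumber", ctypes.c_ulonglong),
        ("FileId", ctypes.c_ubyte * 16),
    ]


def query_file_id_128(path):
    """Return (volume_serial, file_id) for path; hypothetical helper."""
    if sys.platform != "win32":
        raise OSError("FileIdInfo is only available on Windows")
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.CreateFileW.restype = wintypes.HANDLE
    kernel32.CreateFileW.argtypes = (
        wintypes.LPCWSTR, wintypes.DWORD, wintypes.DWORD, wintypes.LPVOID,
        wintypes.DWORD, wintypes.DWORD, wintypes.HANDLE)
    kernel32.GetFileInformationByHandleEx.restype = wintypes.BOOL
    kernel32.GetFileInformationByHandleEx.argtypes = (
        wintypes.HANDLE, ctypes.c_int, wintypes.LPVOID, wintypes.DWORD)
    kernel32.CloseHandle.argtypes = (wintypes.HANDLE,)

    # Desired access 0 is enough to query metadata.
    handle = kernel32.CreateFileW(
        str(path), 0, FILE_SHARE_ALL, None,
        OPEN_EXISTING, FILE_FLAG_BACKUP_SEMANTICS, None)
    if handle is None or handle == wintypes.HANDLE(-1).value:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        info = FILE_ID_INFO()
        ok = kernel32.GetFileInformationByHandleEx(
            handle, FILE_ID_INFO_CLASS, ctypes.byref(info), ctypes.sizeof(info))
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())
    finally:
        kernel32.CloseHandle(handle)
    # Assumption: the identifier bytes form a little-endian 128-bit integer.
    return info.VolumeSerialNumber, int.from_bytes(bytes(info.FileId), "little")
```

On Windows, `hex(query_file_id_128(r"u:\test\test.jpg")[1])` would be the value to compare against the fsutil output; whether the two agree digit for digit depends on the byte-order assumption above.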
|||
msg365214 - (view) | Author: Eryk Sun (eryksun) * | Date: 2020-03-28 14:57 | |
> C:\Users>fsutil file queryfileid u:\test\test.jpg
> File ID is 0x00000000000029d500000000000004ae

ReFS uses a 128-bit file ID, which I gather consists of a 64-bit directory ID and a 64-bit relative ID. (Take this with a grain of salt. AFAIK, Microsoft hasn't published a spec for ReFS.) The latter is 0 for the directory itself and increments by 1 for each file created in the directory, with no reuse of previous values if a file is deleted or moved. If that's correct, and if "test.jpg" was created in "\test", then the directory ID of "\test" is 0x29d5, and the relative file ID is 0x4ae.

> >>> from pathlib import Path
> >>> hex(Path('U:/test/test.jpg').stat().st_ino)
> '0x4000000004ae29d5'

os.stat calls WINAPI GetFileInformationByHandle, which returns a 64-bit file ID. It appears that ReFS generates this ID by concatenating the relative ID and directory ID in a way that is "not guaranteed to be unique" according to the BY_HANDLE_FILE_INFORMATION [1] docs.

I haven't checked whether this 64-bit file ID can even be used successfully with OpenFileById [2]. It could be that ReFS simply fails an open-by-ID request unless it includes the full 128-bit ID (i.e. ExtendedFileIdType).

You can request the 128-bit ID as a FILE_ID_128 record (an array of 16 bytes) via GetFileInformationByHandleEx: FileIdInfo [3][4]. Maybe os.stat should try to query the 128-bit ID and use it as st_ino (or st_ino_128) if it's available. However, looking into my crystal ball, I don't see this happening, unless someone makes a strong case in its favor.

> The problem does *not* exist on an NTFS volume:
>
> C:\Users>fsutil file queryfileid o:\OneDrive\test\test.jpg
> File ID is 0x0000000000000000000300000001be39

NTFS uses a 64-bit file ID, which consists of a 48-bit MFT record number and a 16-bit sequence number. The latter gets incremented when an MFT record is reused in order to detect stale references. In the above case, the 48-bit record number is 0x00000001be39, and the sequence number is 0x0003.

[1]: https://docs.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information
[2]: https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-openfilebyid
[3]: https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-getfileinformationbyhandleex
[4]: https://docs.microsoft.com/en-us/windows/win32/api/winbase/ns-winbase-file_id_info |
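
The two ID layouts described in this message can be illustrated with plain bit arithmetic. The sketch below assumes the NTFS split given above (48-bit record number, 16-bit sequence number) and the conjectured ReFS split (64-bit directory ID, 64-bit relative ID), which the message itself marks as unconfirmed. The function names are made up for illustration.

```python
def split_ntfs_file_id(file_id):
    """Split a 64-bit NTFS file ID into (record_number, sequence_number)."""
    record_number = file_id & ((1 << 48) - 1)   # low 48 bits: MFT record number
    sequence_number = file_id >> 48             # high 16 bits: reuse counter
    return record_number, sequence_number


def split_refs_file_id(file_id_128):
    """Split a 128-bit ReFS file ID into (directory_id, relative_id).

    The 64/64 split is the conjecture from the message, not a published spec.
    """
    directory_id = file_id_128 >> 64
    relative_id = file_id_128 & ((1 << 64) - 1)
    return directory_id, relative_id


# NTFS example from the report: sequence 0x3, record 0x1be39.
ntfs_id = (0x3 << 48) | 0x1be39
record, sequence = split_ntfs_file_id(ntfs_id)
print(hex(record), hex(sequence))

# ReFS example from the report: directory 0x29d5, relative ID 0x4ae.
refs_id = (0x29d5 << 64) | 0x4ae
directory, relative = split_refs_file_id(refs_id)
print(hex(directory), hex(relative))
```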
|||
msg365228 - (view) | Author: Martynas Brijunas (mbrijun@gmail.com) | Date: 2020-03-28 20:13 | |
Hi Steve, Eryk,

thank you very much for looking into this. I was looking into "st_ino" as a potential substitute for the full path of a file when it comes to uniquely identifying that file in a database.

> ReFS uses a 128-bit file ID, which I gather consists of a 64-bit directory ID and a 64-bit relative ID. (Take this with a grain of salt. AFAIK, Microsoft hasn't published a spec for ReFS.) The latter is 0 for the directory itself and increments by 1 for each file created in the directory, with no reuse of previous values if a file is deleted or moved. If that's correct, and if "test.jpg" was created in "\test", then the directory ID of "\test" is 0x29d5, and the relative file ID is 0x4ae.

This assumption seems to be correct. All files within the same directory have an identical first half of their ID, as reported by "fsutil".

U:\test>fsutil file queryfileid test.jpg
File ID is 0x00000000000029d500000000000004ae

U:\test>fsutil file queryfileid test.nef
File ID is 0x00000000000029d50000000000000483

U:\test>fsutil file queryfileid test.ARW
File ID is 0x00000000000029d50000000000000484

U:\test>fsutil file queryfileid test.db
File ID is 0x00000000000029d50000000000000495

> > >>> from pathlib import Path
> > >>> hex(Path('U:/test/test.jpg').stat().st_ino)
> > '0x4000000004ae29d5'
>
> os.stat calls WINAPI GetFileInformationByHandle, which returns a 64-bit file ID. It appears that ReFS generates this ID by concatenating the relative ID and directory ID in a way that is "not guaranteed to be unique" according to the BY_HANDLE_FILE_INFORMATION [1] docs.

The output of "st_ino" appears to be in total sync with "fsutil". The only real difference (apart from the missing leading zeros in each half, and the two halves appearing in swapped order) is the inclusion of a hex "4" at the very beginning of the hex sequence. But even that is consistent, as the "4" is present in all cases.

>>> hex(Path('U:/test/test.jpg').stat().st_ino)
'0x4000000004ae29d5'

>>> hex(Path('U:/test/test.nef').stat().st_ino)
'0x40000000048329d5'

>>> hex(Path('U:/test/test.arw').stat().st_ino)
'0x40000000048429d5'

>>> hex(Path('U:/test/test.db').stat().st_ino)
'0x40000000049529d5' |
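
The pattern observed in these four samples can be checked mechanically. The sketch below is only a guess that happens to fit the data in this thread: it assumes the 64-bit value is a leading hex 4 in the top nibble, followed by the relative ID shifted left by 16 bits, with the directory ID in the low bits. The real field widths are not documented, so guess_st_ino is a hypothetical name and the layout should not be relied upon.

```python
# (file name, 128-bit ID reported by fsutil, st_ino reported by Python),
# taken from the samples in this thread.
SAMPLES = [
    ("test.jpg", (0x29d5 << 64) | 0x4ae, 0x4000000004ae29d5),
    ("test.nef", (0x29d5 << 64) | 0x483, 0x40000000048329d5),
    ("test.ARW", (0x29d5 << 64) | 0x484, 0x40000000048429d5),
    ("test.db", (0x29d5 << 64) | 0x495, 0x40000000049529d5),
]


def guess_st_ino(file_id_128):
    """Rebuild the 64-bit ID from the 128-bit one (hypothetical layout)."""
    directory_id = file_id_128 >> 64
    relative_id = file_id_128 & ((1 << 64) - 1)
    # Assumed layout: 0x4 in the top nibble, relative ID above the
    # low 16 bits, directory ID in the low 16 bits.
    return (0x4 << 60) | (relative_id << 16) | directory_id


for name, file_id_128, st_ino in SAMPLES:
    matches = guess_st_ino(file_id_128) == st_ino
    print(name, hex(guess_st_ino(file_id_128)), matches)
```

If the directory ID or relative ID ever exceeded the widths assumed here, the fields would overlap, which is one way the collisions mentioned earlier in the thread could arise.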
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:28 | admin | set | github: 84276 |
2020-03-28 20:13:28 | mbrijun@gmail.com | set | messages: + msg365228 |
2020-03-28 14:57:07 | eryksun | set | nosy: + eryksun messages: + msg365214 |
2020-03-28 12:14:47 | steve.dower | set | messages: + msg365207 |
2020-03-28 09:39:22 | eryksun | set | nosy: + paul.moore, tim.golden, zach.ware, steve.dower components: + Windows versions: + Python 3.9 |
2020-03-28 08:27:27 | mbrijun@gmail.com | create |