classification
Title: Incorrect st_ino returned for ReFS on Windows 10
Type: behavior
Stage:
Components: Library (Lib), Windows
Versions: Python 3.9, Python 3.8
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: eryksun, mbrijun@gmail.com, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal
Keywords:

Created on 2020-03-28 08:27 by mbrijun@gmail.com, last changed 2022-04-11 14:59 by admin.

Messages (4)
msg365206 - Author: Martynas Brijunas (mbrijun@gmail.com) Date: 2020-03-28 08:27
On a Windows 10 volume formatted with ReFS, pathlib.Path.stat() returns an incorrect value for "st_ino".

The correct value returned by the OS:

C:\Users>fsutil file queryfileid u:\test\test.jpg
File ID is 0x00000000000029d500000000000004ae

An incorrect value obtained with pathlib.Path.stat():

Python 3.8.2 (tags/v3.8.2:7b3ab59, Feb 25 2020, 23:03:10) [MSC v.1916 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from pathlib import Path
>>> hex(Path('U:/test/test.jpg').stat().st_ino)
'0x4000000004ae29d5'

The problem does *not* exist on an NTFS volume:

C:\Users>fsutil file queryfileid o:\OneDrive\test\test.jpg
File ID is 0x0000000000000000000300000001be39

>>> hex(Path('O:/OneDrive/test/test.jpg').stat().st_ino)
'0x300000001be39'
msg365207 - Author: Steve Dower (steve.dower) (Python committer) Date: 2020-03-28 12:14
There's no guarantee that st_ino will be the same as any other queryable value. Only that it will (hopefully) be unique on that disk.

stat() on Windows is just an approximation of stat() from POSIX - it's not a reliable way of reading Windows-specific filesystems.

If you can generate a collision, it may be worth investing in a solution. Otherwise, if you have a specific need to get the same file ID as fsutil, you should probably go to the native API yourself.
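
One way to look for such a collision is to walk a directory tree and group paths by (st_dev, st_ino). The following is only a minimal sketch: the function names are made up for illustration, and it assumes that st_nlink is reliable enough to tell hard links (which legitimately share an ID) apart from genuine collisions.

```python
import os
from collections import defaultdict


def group_by_inode(root):
    """Map (st_dev, st_ino) to a list of (path, st_nlink) found under root."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path, follow_symlinks=False)
            except OSError:
                continue  # file vanished or is inaccessible; skip it
            groups[(st.st_dev, st.st_ino)].append((path, st.st_nlink))
    return dict(groups)


def suspect_collisions(groups):
    """Keep groups with more paths than the reported link count allows.

    If N paths share an ID but the file reports fewer than N hard links,
    the paths cannot all refer to the same file.
    """
    return {
        key: entries
        for key, entries in groups.items()
        if len(entries) > max(nlink for _path, nlink in entries)
    }
```

Calling suspect_collisions(group_by_inode("U:/")) on the affected volume would list candidate collisions; an empty result does not prove that none can occur.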
msg365214 - Author: Eryk Sun (eryksun) (Python triager) Date: 2020-03-28 14:57
> C:\Users>fsutil file queryfileid u:\test\test.jpg
> File ID is 0x00000000000029d500000000000004ae

ReFS uses a 128-bit file ID, which I gather consists of a 64-bit directory ID and a 64-bit relative ID. (Take this with a grain of salt. AFAIK, Microsoft hasn't published a spec for ReFS.) The latter is 0 for the directory itself and increments by 1 for each file created in the directory, with no reuse of previous values if a file is deleted or moved. If that's correct, and if "test.jpg" was created in "\test", then the directory ID of "\test" is 0x29d5, and the relative file ID is 0x4ae. 
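
As a small illustration of that guess, the 128-bit value printed by fsutil can be split into two 64-bit halves. The function name is hypothetical, and the layout (directory ID in the high half, relative ID in the low half) is the undocumented assumption described above, not a published format.

```python
MASK_64 = (1 << 64) - 1


def split_refs_file_id(file_id):
    """Split a 128-bit ReFS file ID into (directory_id, relative_id).

    Assumes the high 64 bits are the directory ID and the low 64 bits
    are the relative ID; ReFS has no published spec confirming this.
    """
    return file_id >> 64, file_id & MASK_64


directory_id, relative_id = split_refs_file_id(0x00000000000029d500000000000004ae)
print(hex(directory_id), hex(relative_id))  # 0x29d5 0x4ae
```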

> >>> from pathlib import Path
> >>> hex(Path('U:/test/test.jpg').stat().st_ino)
> '0x4000000004ae29d5'

os.stat calls WINAPI GetFileInformationByHandle, which returns a 64-bit file ID. It appears that ReFS generates this ID by concatenating the relative ID and directory ID in a way that is "not guaranteed to be unique" according to the BY_HANDLE_FILE_INFORMATION [1] docs. 

I haven't checked whether this 64-bit file ID can even be used successfully with OpenFileById [2]. It could be that ReFS simply fails an open-by-ID request unless it includes the full 128-bit ID (i.e. ExtendedFileIdType).

You can request the 128-bit ID as a FILE_ID_128 record (an array of 16 bytes) via GetFileInformationByHandleEx: FileIdInfo [3][4]. Maybe os.stat should try to query the 128-bit ID and use it as st_ino (or st_ino_128) if it's available. However, looking into my crystal ball, I don't see this happening, unless someone makes a strong case in its favor.
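
For anyone who needs the full ID today, a rough ctypes sketch of that query follows. It has not been tested on a ReFS volume. The helper names are invented for illustration, and treating the 16 identifier bytes as a little-endian integer is an assumption made so that the result lines up with what fsutil prints.

```python
import ctypes
import sys

FILE_ID_INFO_CLASS = 18                  # FileIdInfo in FILE_INFO_BY_HANDLE_CLASS
OPEN_EXISTING = 3
FILE_SHARE_ALL = 0x7                     # read | write | delete
FILE_FLAG_BACKUP_SEMANTICS = 0x02000000  # lets CreateFileW open directories


class FILE_ID_INFO(ctypes.Structure):
    _fields_ = [
        ("VolumeSerialNumber", ctypes.c_ulonglong),
        ("FileId", ctypes.c_ubyte * 16),  # FILE_ID_128.Identifier
    ]


def id_bytes_to_int(raw):
    # Assumption: little-endian byte order matches the fsutil display.
    return int.from_bytes(bytes(raw), "little")


def query_file_id_128(path):
    """Return (volume_serial_number, file_id) for path; Windows only."""
    if sys.platform != "win32":
        raise NotImplementedError("GetFileInformationByHandleEx requires Windows")
    k32 = ctypes.WinDLL("kernel32", use_last_error=True)
    k32.CreateFileW.restype = ctypes.c_void_p
    k32.CreateFileW.argtypes = [
        ctypes.c_wchar_p, ctypes.c_uint32, ctypes.c_uint32, ctypes.c_void_p,
        ctypes.c_uint32, ctypes.c_uint32, ctypes.c_void_p,
    ]
    k32.GetFileInformationByHandleEx.restype = ctypes.c_int
    k32.GetFileInformationByHandleEx.argtypes = [
        ctypes.c_void_p, ctypes.c_int,
        ctypes.POINTER(FILE_ID_INFO), ctypes.c_uint32,
    ]
    k32.CloseHandle.argtypes = [ctypes.c_void_p]

    # Desired access 0 is enough to query metadata.
    handle = k32.CreateFileW(path, 0, FILE_SHARE_ALL, None, OPEN_EXISTING,
                             FILE_FLAG_BACKUP_SEMANTICS, None)
    if handle is None or handle == ctypes.c_void_p(-1).value:
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        info = FILE_ID_INFO()
        ok = k32.GetFileInformationByHandleEx(
            handle, FILE_ID_INFO_CLASS, ctypes.byref(info), ctypes.sizeof(info))
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())
        return info.VolumeSerialNumber, id_bytes_to_int(info.FileId)
    finally:
        k32.CloseHandle(handle)
```

On Windows, hex(query_file_id_128(r"U:\test\test.jpg")[1]) could then be compared against the fsutil output. File systems that do not support FileIdInfo are expected to fail the call, which surfaces here as an OSError.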

> The problem does *not* exist on an NTFS volume:
> 
> C:\Users>fsutil file queryfileid o:\OneDrive\test\test.jpg
> File ID is 0x0000000000000000000300000001be39

NTFS uses a 64-bit file ID, which consists of a 48-bit MFT record number and a 16-bit sequence number. The latter gets incremented when an MFT record is reused in order to detect stale references. In the above case, the 48-bit record number is 0x00000001be39, and the sequence number is 0x0003.
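
The same split can be written out in a few lines. This is just a sketch of the bit layout described above, with a hypothetical function name.

```python
def split_ntfs_file_id(file_id):
    """Split a 64-bit NTFS file ID into (record_number, sequence_number)."""
    record_number = file_id & ((1 << 48) - 1)   # low 48 bits: MFT record number
    sequence_number = (file_id >> 48) & 0xFFFF  # high 16 bits: sequence number
    return record_number, sequence_number


record, sequence = split_ntfs_file_id(0x300000001be39)
print(hex(record), hex(sequence))  # 0x1be39 0x3
```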

[1]: https://docs.microsoft.com/en-us/windows/win32/api/fileapi/ns-fileapi-by_handle_file_information
[2]: https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-openfilebyid
[3]: https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-getfileinformationbyhandleex
[4]: https://docs.microsoft.com/en-us/windows/win32/api/winbase/ns-winbase-file_id_info
msg365228 - Author: Martynas Brijunas (mbrijun@gmail.com) Date: 2020-03-28 20:13
Hi Steve, Eryk,

thank you very much for looking into this. I was looking into "st_ino"
as a potential substitute for the full path of a file when it comes to
uniquely identifying that file in a database.

> ReFS uses a 128-bit file ID, which I gather consists of a 64-bit directory ID and a 64-bit relative ID. (Take this with a grain of salt. AFAIK, Microsoft hasn't published a spec for ReFS.) The latter is 0 for the directory itself and increments by 1 for each file created in the directory, with no reuse of previous values if a file is deleted or moved. If that's correct, and if "test.jpg" was created in "\test", then the directory ID of "\test" is 0x29d5, and the relative file ID is 0x4ae.

This assumption seems to be correct. All files within the same
directory have an identical first half of their ID, as reported by
"fsutil".

U:\test>fsutil file queryfileid test.jpg
File ID is 0x00000000000029d500000000000004ae

U:\test>fsutil file queryfileid test.nef
File ID is 0x00000000000029d50000000000000483

U:\test>fsutil file queryfileid test.ARW
File ID is 0x00000000000029d50000000000000484

U:\test>fsutil file queryfileid test.db
File ID is 0x00000000000029d50000000000000495

>
> > >>> from pathlib import Path
> > >>> hex(Path('U:/test/test.jpg').stat().st_ino)
> > '0x4000000004ae29d5'
>
> os.stat calls WINAPI GetFileInformationByHandle, which returns a 64-bit file ID. It appears that ReFS generates this ID by concatenating the relative ID and directory ID in a way that is "not guaranteed to be unique" according to the BY_HANDLE_FILE_INFORMATION [1] docs.

The feedback from "st_ino" appears to be in total sync with "fsutil".
The only real difference (apart from the missing leading zeros
in each half) is the inclusion of a hex "4" at the very beginning of
the hex sequence. But even that is consistent, as the "4" is present in
all cases.

>>> hex(Path('U:/test/test.jpg').stat().st_ino)
'0x4000000004ae29d5'
>>> hex(Path('U:/test/test.nef').stat().st_ino)
'0x40000000048329d5'
>>> hex(Path('U:/test/test.arw').stat().st_ino)
'0x40000000048429d5'
>>> hex(Path('U:/test/test.db').stat().st_ino)
'0x40000000049529d5'
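
The pattern in these four samples can be checked mechanically. The sketch below is only a guess that fits this data: the function name, the 16-bit shift, and the constant 0x4 in the top nibble are assumptions inferred from the values above, not a documented ReFS or Windows behavior.

```python
MASK_64 = (1 << 64) - 1
TAG = 0x4 << 60  # leading hex "4" seen in every sample; its meaning is unknown


def guess_st_ino(file_id_128):
    """Rebuild the observed st_ino from a 128-bit fsutil file ID.

    Hypothetical layout: directory ID in bits 0-15, relative ID from
    bit 16 upward, constant tag in bits 60-63. Larger IDs would overlap
    under this guess, so it cannot be unique in general.
    """
    directory_id = file_id_128 >> 64
    relative_id = file_id_128 & MASK_64
    return TAG | (relative_id << 16) | directory_id


# fsutil file ID -> st_ino, both copied from this thread
SAMPLES = {
    0x00000000000029d500000000000004ae: 0x4000000004ae29d5,  # test.jpg
    0x00000000000029d50000000000000483: 0x40000000048329d5,  # test.nef
    0x00000000000029d50000000000000484: 0x40000000048429d5,  # test.ARW
    0x00000000000029d50000000000000495: 0x40000000049529d5,  # test.db
}

for file_id, st_ino in SAMPLES.items():
    assert guess_st_ino(file_id) == st_ino
```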
History
Date                 User               Action  Args
2022-04-11 14:59:28  admin              set     github: 84276
2020-03-28 20:13:28  mbrijun@gmail.com  set     messages: + msg365228
2020-03-28 14:57:07  eryksun            set     nosy: + eryksun; messages: + msg365214
2020-03-28 12:14:47  steve.dower        set     messages: + msg365207
2020-03-28 09:39:22  eryksun            set     nosy: + paul.moore, tim.golden, zach.ware, steve.dower; components: + Windows; versions: + Python 3.9
2020-03-28 08:27:27  mbrijun@gmail.com  create