
Title: os.path.realpath() normalizes paths before resolving links on Windows
Type:
Stage:
Components: Library (Lib), Windows
Versions: Python 3.11, Python 3.10, Python 3.9, Python 3.8
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: barneygale, eryksun, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal
Keywords:

Created on 2021-04-24 21:22 by barneygale, last changed 2022-04-11 14:59 by admin.

Files: one attachment, uploaded by eryksun on 2021-04-26 02:11
Messages (2)
msg391804 - Author: Barney Gale (barneygale) Date: 2021-04-24 21:22
Capturing a write-up by eryksun on GitHub into a new bug.


> `nt._getfinalpathname()` opens a handle to a file/directory with `CreateFileW()` and calls `GetFinalPathNameByHandleW()`. The latter makes a few system calls to get the final opened path in the filesystem (e.g. "\Windows\explorer.exe") and the canonical DOS name of the volume device on which the filesystem is mounted (e.g. "\Device\HarddiskVolume2" -> "\\?\C:") in order to return a canonical DOS path (e.g. "\\?\C:\Windows\explorer.exe").
> Opening a handle with `CreateFileW()` entails first getting a fully-qualified and normalized NT path, which, among other things, entails resolving ".." components naively in the path string. This does not take reparse points such as symlinks and mountpoints into account. The only time Windows parses ".." components in an opened path the way POSIX does is in the kernel when they're in the target path of a relative symlink.
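
The string-level handling of ".." described above can be illustrated with `ntpath.normpath()`. This is only a stand-in: it is assumed here to behave like the system's normalization for these simple cases, and it does not cover everything the Windows API does (for example, it does not strip trailing dots and spaces or map reserved device names).

```python
import ntpath

# ntpath is importable on every platform, so this runs without Windows.
# ".." is removed together with the preceding component, purely as text;
# whether "symlink" is a link, and where it points, is never consulted.
naive = ntpath.normpath(r"C:\foo\symlink\..\bar")
print(naive)

# Forward slashes, repeated separators and "." are handled the same way.
mixed = ntpath.normpath("C:/spam//eggs/./ham")
print(mixed)
```

If "symlink" targets a directory elsewhere, the textual result `C:\foo\bar` names a different file than the one a POSIX-style resolution would reach.
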
> `nt.readlink()` opens a handle to the file with the flag `FILE_FLAG_OPEN_REPARSE_POINT`. If the final path component is a reparse point, it opens it instead of traversing it. Then the reparse point is read with the filesystem control request, `FSCTL_GET_REPARSE_POINT`. System symlinks and mountpoints (`IO_REPARSE_TAG_SYMLINK` and `IO_REPARSE_TAG_MOUNT_POINT`) are the only supported name-surrogate reparse-point types, though `os.stat()` and `os.lstat()` handle all name-surrogate types as 'links'. Moreover, only symlinks get the `S_IFLNK` mode flag in a stat result, because they're the only ones we can create with `os.symlink()` to satisfy the usage `if os.path.islink(src): os.symlink(os.readlink(src), dst)`.
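
The link-copying usage quoted above can be written out as a small function. `copy_link` is a hypothetical name chosen for this sketch; it relies only on `os.path.islink()`, `os.readlink()` and `os.symlink()`, and on Windows it assumes the process is allowed to create symlinks.

```python
import os

def copy_link(src, dst):
    """Recreate the symlink at src as dst; return False if src is not a symlink."""
    if os.path.islink(src):
        # The target string is copied verbatim, so a relative target stays
        # relative and is interpreted from dst's directory.
        os.symlink(os.readlink(src), dst)
        return True
    return False
```
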
> > What would it take to do a POSIX-style "normalize as we resolve",
> > and would we want to? I guess we'd need to call nt._getfinalpathname()
> > on each path component in turn (C:, C:\Users, C:\Users\Barney etc),
> > which from my pretty basic Windows knowledge might be rather slow if
> > that involves file handles.
> You asked, so I decided to write up an outline of what implementing a POSIX-style `realpath()` might look like in Windows. At its core, it's similar to POSIX: lstat(), and, for a symlink, readlink() and recur. The equivalent calls in Windows are the following:
>     * `CreateFileW()` (open a handle)
>     * `GetFileInformationByHandleEx()`: `FileAttributeTagInfo`
>     * `DeviceIoControl()`: `FSCTL_GET_REPARSE_POINT`
> A symlink has the reparse tag `IO_REPARSE_TAG_SYMLINK`.
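
The lstat/readlink recursion outlined above can be sketched in portable Python. This is a minimal illustration under assumptions, not the attached prototype: `resolve_components` is a hypothetical helper, it uses `os.path.islink()` and `os.readlink()` in place of the Windows calls listed above, and it ignores substitute and mapped drives, mountpoints, and the "\\?\" prefix that `os.readlink()` may return on Windows.

```python
import os

def resolve_components(path, max_links=40):
    """Hypothetical helper: resolve symlinks left to right, applying each
    '..' only after the components before it have been resolved."""
    if not os.path.isabs(path):
        path = os.path.join(os.getcwd(), path)
    drive, rest = os.path.splitdrive(path)
    # Stack of unresolved components; pop() yields the next one.
    pending = [p for p in rest.replace("/", os.sep).split(os.sep) if p]
    pending.reverse()
    resolved = drive + os.sep
    links_seen = 0
    while pending:
        name = pending.pop()
        if name == ".":
            continue
        if name == "..":
            # dirname() of the root is the root, so '..' is ignored there
            # (the POSIX rule).
            resolved = os.path.dirname(resolved)
            continue
        candidate = os.path.join(resolved, name)
        if not os.path.islink(candidate):
            resolved = candidate  # plain file, directory, or missing name
            continue
        links_seen += 1
        if links_seen > max_links:
            raise OSError("too many levels of symbolic links: %r" % path)
        target = os.readlink(candidate).replace("/", os.sep)
        target_drive, target_rest = os.path.splitdrive(target)
        if target_rest.startswith(os.sep):
            # Rooted target: restart at the root of the target's drive, or
            # of the current drive if the target names none.
            resolved = (target_drive or os.path.splitdrive(resolved)[0]) + os.sep
        # Queue the target's components ahead of the unresolved tail.
        pending.extend(reversed([p for p in target_rest.split(os.sep) if p]))
    return resolved
```

Because the link is expanded before the following ".." is applied, "foo/symlink/../bar" ends up under the parent of the link's target rather than under "foo". Missing components are kept as-is, which corresponds to non-strict resolution.
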
> Filesystem mountpoints (aka junctions, which are like Unix bind mountpoints) must be retained in the resolved path in order to correctly resolve relative symlinks such as "\spam" (relative to the resolved device) and "..\..\spam". Anyway, this is consistent with the UNC case, since mountpoints on a remote server can never be resolved (i.e. a final UNC path never resolves mountpoints).
> Here are some of the notable differences compared to POSIX:
>     * If the source path is not a "\\?\" verbatim path, `GetFullPathNameW()` must be called initially.  However, ".." components in the target path of a relative symlink must be resolved the POSIX way, else symlinks in the target path may be removed incorrectly before their target is resolved (e.g. "foo\symlink\..\bar" incorrectly resolved as "foo\bar"). The opened path is initially normalized as follows:
>       * replace forward slashes with backslashes
>       * collapse repeated backslashes (except the UNC root must have exactly two backslashes)
>       * resolve a relative path (e.g. "spam"), drive-relative path (e.g. "Z:spam"), or rooted path (e.g. "\spam") as a fully-qualified path (e.g. "Z:\eggs\spam")
>       * resolve "." and ".." components in the opened path (naive to symlinks)
>       * strip trailing spaces and dots from the final component (e.g. "C:\spam. . ." -> "C:\spam")
>       * resolve reserved device names in the final component of a non-UNC path (e.g. "C:\nul" -> "\\.\nul")
>     * Substitute drives (e.g. created by "subst.exe", or `DefineDosDeviceW`) and mapped drives (e.g. created by "net.exe", or `WNetAddConnection2W`) must be resolved, respectively via `QueryDosDeviceW()` and `WNetGetUniversalNameW()`. Like all DOS 'devices', these drives are implemented as object symlinks (i.e. symlinks in the object namespace, not to be confused with filesystem symlinks). The target path of these drives, however, is not a Device object, but rather a filesystem path on a device that can include any number of path components, some of which may be filesystem symlinks that need to be resolved. Normally when a path is opened, the system object manager reparses all DOS 'devices' to the path of an actual Device object, or a path on a Device object, before the I/O manager's parse routine ever sees the path. Such drives need to be resolved whenever parsing starts or restarts at a drive, but the result can be cached in case multiple filesystem symlinks target the same drive.
>       * Substitute drives can target paths on other substitute drives, so `QueryDosDeviceW()` has to be called in a loop that accumulates the tail path components until it reaches a real device (i.e. a target path that doesn't begin with "\??\").
>       * `WNetGetUniversalNameW()` has to be called after resolving substitute drives. It resolves the underlying UNC  path of a mapped drive. The target path of the object symlink that implements a mapped drive is of the form "\Device\<redirector device name>\;<something>\server\share\some\filesystem\path". The "redirector device name" component is usually (post Windows Vista) an object symlink to a path on the system's Multiple UNC Provider (MUP) device, "\Device\Mup". The mapped-drive target path ultimately resolves to a redirected filesystem that's mounted in the MUP device namespace at the "share" name. This is an implementation detail of the filesystem redirector and MUP device, which the Multiple Provider Router (MPR) WNet API encapsulates. For example, for the mapped drive path "Z:\spam\eggs", it returns a UNC path of the form "\\server\share\some\filesystem\path\spam\eggs".
>     * A join that tries to resolve ".." against the drive or share root path must fail, whereas this is ignored for the root path in POSIX. For example, `symlink_join("C:\\", "..\\spam")` must fail, since the system would fail an open that tried to reparse that symlink target.
>     * At the end, the resolved path should be tested to try to remove "\\?\" if the source path didn't have this prefix. Call `GetFullPathNameW()` to check for a reserved name in the final component and `PathCchCanonicalizeEx()` to check for long-path support. (The latter calls the system runtime library function `RtlAreLongPathsEnabled`, but that's an undocumented implementation detail.)
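
The root check for joining a symlink target can be sketched with `ntpath` string functions. The name `symlink_join` comes from the list above, but this body is an assumption written for illustration, not the prototype's code. It only shows the failure at the drive or share root; a real resolver would also have to test each joined component for reparse points instead of collapsing ".." textually.

```python
import ntpath

def symlink_join(base, target):
    """Join a relative or rooted symlink target onto base, the resolved
    directory that contains the link. Fail if '..' would pass the root."""
    drive, rest = ntpath.splitdrive(base)
    parts = [p for p in rest.split("\\") if p]
    target_drive, target_rest = ntpath.splitdrive(target.replace("/", "\\"))
    if target_drive:
        # Drive-qualified and UNC targets are out of scope for this sketch.
        raise ValueError("target is not relative to base: %r" % target)
    if target_rest.startswith("\\"):
        parts = []  # rooted target such as "\spam": relative to the drive root
    for name in target_rest.split("\\"):
        if not name or name == ".":
            continue
        if name == "..":
            if not parts:
                # Unlike POSIX, '..' at the drive or share root is an error.
                raise OSError("'..' cannot traverse above %s\\" % drive)
            parts.pop()
        else:
            parts.append(name)
    return drive + "\\" + "\\".join(parts)
```

For a UNC base, `ntpath.splitdrive()` treats "\\server\share" as the drive, so the same check stops ".." at the share root.
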
> `GetFinalPathNameByHandleW()` is not required. Optionally, it can be called for the last valid component if the caller wants a final path with all mountpoints resolved, i.e. add a `final_path=False` option. Of course, a final UNC path must retain mountpoints, so there's nothing we can do in that case. It's fine that this `realpath()` implementation would return a path that contains mountpoints in Windows (as the current implementation also does for UNC paths). They are not symlinks, and this matches the behavior of POSIX.
> I'd include a warning in the documentation that getting a final path via `GetFinalPathNameByHandleW()` in the non-strict case may be dysfunctional. The unresolved tail end of the path may become valid again if a server or device comes back online. If the unresolved part contains symlinks with relative targets such as "\spam" and "..\..\spam", and the `realpath()` call resolved away mountpoints, the remaining path may not resolve correctly against the final path, as compared to how it would resolve against the original path. It definitely will not resolve the same for a rooted target path such as "\spam" if the last resolved reparse point in the original path was a mountpoint, since it will reparse to the root path of the mountpoint device instead of the original opened device, or instead of the last resolved device of a symlink in the path.
msg391875 - Author: Eryk Sun (eryksun) (Python triager) Date: 2021-04-26 02:11
> os.path.realpath() normalizes paths before resolving links 
> on Windows

Normalizing the input path is required in order to be consistent with the Windows file API. OTOH, the target path of a relative symlink gets resolved in a POSIX-ly correct manner in the kernel, and ntpath._readlink_deep() doesn't ensure this. 

I've attached a prototype that I wrote for a POSIX-like implementation that recursively resolves both the drive and the path. It uses the final path only as a shortcut to normalize volume GUID names as drives and the proper casing of UNC server and share names. However, it's considerably more work than the final-path approach, and more work always has the potential for more bugs. I'm providing it for the sake of discussion, or just for people to point to it as an example of what not to do... ;-)

Patching up the current implementation would probably involve extending _getfinalpathname() to support follow_symlinks=False. Aspects of the POSIX implementation would have to be adopted, but I think it can be kept relatively simple when integrated with _getfinalpathname(path, follow_symlinks=False). The latter also makes it easy to identify a UNC path, which is necessary because mountpoints should never be resolved in a UNC path, which is something the current implementation gets wrong.

What this wouldn't support is resolving an inaccessible drive as much as possible. Mapped drives are object symlinks that expand to UNC paths that can include an arbitrary filepath on a share. Substitute drives by definition target an arbitrary filepath, and can even target other substitute and mapped drives. A final-path only approach would leave the inaccessible drive in the result, along with any symlinks that are internal to the drive.

A final-path approach also can't support targets with rooted paths or ".." components that traverse a mountpoint. The final path will be on the mountpoint's device, which will change how such relative symlinks resolve. That said, rooted symlink targets are almost never seen in Windows, and targets that traverse a mountpoint by way of a ".." component should be rare, in principle. 

One problem is the frequent use of bind mountpoints in place of symlinks in Windows. In CMD, bind mountpoints can be created by anyone via `mklink /j`. Here's a fabricated example with a mountpoint (i.e. junction) that's used where normally a symlink should be used.

    C:\
        work
            foo
                bar [junction -> C:\work\bar]
                remote [symlink -> \\baz\spam]
            bar
                remote [symlink -> ..\remote]
            remote [symlink -> \\qux\eggs]

C:\work\foo\bar\remote normally resolves as follows:

        -> C:\work\foo\bar + ..\remote
        -> C:\work\foo\remote
        -> \\baz\spam

Assume that \\baz\spam is down, so C:\work\foo\bar\remote can't be strictly resolved. If the non-strict algorithm relies on getting the final path of C:\work\foo\bar\remote before resolving the target of "remote", then the result for this case will be incorrect.

        -> C:\work\bar\remote
        -> C:\work\bar + ..\remote
        -> C:\work\remote
        -> \\qux\eggs
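
The two resolution orders above can be reproduced with a small simulation that touches no real filesystem. The tables mirror the fabricated layout from this message; `final_path` and `resolve` are hypothetical helpers, and `ntpath.normpath()` is used for the ".." step only because no symlink sits between the base directory and the target in this example.

```python
import ntpath

# Virtual layout from the example (nothing is opened on disk).
JUNCTIONS = {r"C:\work\foo\bar": r"C:\work\bar"}
SYMLINKS = {
    r"C:\work\foo\remote": r"\\baz\spam",
    r"C:\work\bar\remote": r"..\remote",
    r"C:\work\remote": r"\\qux\eggs",
}

def final_path(path):
    """Mimic a final path by replacing a junction prefix with its target."""
    for junction, target in JUNCTIONS.items():
        if path == junction or path.startswith(junction + "\\"):
            return target + path[len(junction):]
    return path

def resolve(path, final_path_first):
    """Follow symlinks, joining relative targets either onto the path as
    opened (junctions retained) or onto the final path (junctions resolved)."""
    for _ in range(8):  # small hop limit, enough for this layout
        physical = final_path(path)  # where the link data is stored
        if physical not in SYMLINKS:
            return path
        target = SYMLINKS[physical]
        if ntpath.splitdrive(target)[0]:
            path = target  # drive-qualified or UNC target replaces the path
        else:
            base = physical if final_path_first else path
            path = ntpath.normpath(ntpath.join(ntpath.dirname(base), target))
    raise RuntimeError("too many links")

print(resolve(r"C:\work\foo\bar\remote", final_path_first=False))
print(resolve(r"C:\work\foo\bar\remote", final_path_first=True))
```

With junctions retained the walk ends at `\\baz\spam`; taking the final path first ends at `\\qux\eggs`, matching the incorrect chain shown above.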
Date                 User        Action  Args
2022-04-11 14:59:44  admin       set     github: 88102
2021-04-26 02:11:50  eryksun     set     files: +
                                         versions: - Python 3.6, Python 3.7
                                         nosy: + paul.moore, tim.golden, eryksun, zach.ware, steve.dower
                                         messages: + msg391875
                                         components: + Windows
2021-04-24 21:22:05  barneygale  create