Author steve.dower
Recipients eryksun, paul.moore, steve.dower, tim.golden, zach.ware
Date 2019-08-16.16:31:03
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1565973064.24.0.103734698374.issue37834@roundup.psfhosted.org>
In-reply-to
Content
>> I don't want to add any parameters - I want to have predictable and
>> reasonable default behaviour. os.readlink() already exists for 
>> "open reparse point" behaviour.
>
> I'd appreciate a parameter to always open reparse points, even if a
> filter-driver or the I/O manager handles them. 

But that's only required if lstat() traverses junctions, which I'm not proposing to do.

> I'm no longer a big fan of mapping "follow_symlinks" to name surrogates
> ... But it's not up to me. If follow_symlinks means name surrogates, at
> least then lstat can open any reparse point that claims to link to
> another path and thus *should* have link-like behavior (hard link or
> soft link).

Yeah, ultimately it looks like it'll be up to me, which is why I want as much of your input as I can get before having to make a call :)

I'm leaning towards "if the OS follows it by default, stat() will follow it", and (given some of your later comments), "when ERROR_CANT_ACCESS_FILE occurs on a reparse point (is that the only scenario?) behave like lstat()".

(And I carefully phrased that to not imply that Python is going to partially follow a link chain for you if the eventual end result is not valid. e.g. stat() on a symlink to an appexeclink will return the symlink, because ERROR_CANT_ACCESS_FILE was the result. Then the suggestion will be "if stat() returns a link, use os.path.realpath() if you want to resolve it as far as possible".)

> The questions for me are whether os.readlink() should also read
> junctions and exactly what follow_symlinks means in Windows. We have a
> complicated story to tell if follow_symlinks=False (lstat) opens any
> reparse point or opens just name-surrogate reparse points, and islink()
> is made consistent with this, but then readlink() doesn't work.

I don't think it's that complicated:
* os.lstat() will not traverse any links or reparse points
* os.stat() will traverse any links and reparse points supported by the OS, or else return the same as os.lstat()
* os.readlink() can read the target of a symlink or (on Windows) junction point

> If junctions are handled as symlinks, then islink(), readlink(),
> symlink() would be used to copy a junction 'link' while copying a tree
> (e.g. shutil.copytree with symlinks=True). This would transform
> junctions into directory symlinks. In this case, we potentially have a
> problem that relative symlinks in the tree no longer target the same
> files when accessed via a directory symlink instead of a junction. No
> one thinks about this problem on the POSIX side because it would be
> weird to copy a mountpoint as a symlink. In POSIX, a mountpoint is
> always seen as just a directory and always traversed.

This isn't difficult to handle specifically if we want to, though, since the stat result now includes the reparse tag. And you're right, we probably have to handle it.

>> I'm still not convinced that this is what we want to do. I don't
>> have a true Linux machine handy to try it out (Python 3.6 and 3.7 on
>> WSL behave exactly like the semantics I'm proposing, but that may
>> just be because it's the Windows kernel below it).

> ... This is a decision in WSL's drvfs file-system driver, and I have to
> assume it's intentional.

I assumed it was intentional as well, though it'll almost certainly be based on pragmatically getting things to work (e.g. the '/mnt/c/Document and Settings' junction... though now I try it that those don't actually work...)

>> ismount() is currently not true for junctions. And I can't find any
>> reference that says that POSIX symlinks can't point to directories,
>
> Our current implementation for junctions is based on GetVolumePathNameW,
> which will be true for junctions that use the "Volume{...}" name to
> mount the file-system root directory. That's a volume mount point.
> 
> I don't know why someone decided that's the sum total of "mount point"
> in Windows. DOS drives and UNC drives can refer to arbitrary file system
> directories. They don't have to refer to file-system root directory. We
> can have "W:" -> "\\??\\C:\\Windows", etc.
> 
> Per the docs, a mount point for ismount() is a "point in a file system
> where a different file system has been mounted". The mounted directory
> doesn't have to be the root directory of the file system. I'd relax this
> definition to include all "hard" name grafting links to other
> directories, even within the same file system. What matter to me is the
> semantics of how this differs from the soft name grafting of a symlink.

I think the intent is that it's mounted the root of another file system. There was discussion of just using the reparse tag in issue9035, but it looks like from msg138197 that returning True for regular directory links was not the intent (despite the tests in that message being insufficient).

> If follow_symlinks=False applies to name surrogates, then a junction
> would be detectable via os.lstat(filename).st_reparse_tag, which is not
> only much cheaper than GetVolumePathNameW, but also more generally
> correct and consistent with DOS and UNC drive mount points.

Ah, the future possibilities of getting that one field added in this PR :) Let's save other enhancements for another PR - especially when they won't be impacted by this one.

> S_IFDIR is suppressed for directory symlinks in the stat result. But
> os.path.isdir() is supposed to be based on os.stat, and thus follows
> symlinks. To that end, our nt._isdir is broken because it assumes
> GetFileAttributesW is sufficient. Since we're supposed to follow links,
> it's not working right for link targets that don't exist. It should
> return False in that case.

That's easily fixed.

> I assumed stat() would return the reparse point for all tags that fail
> to reparse with ERROR_CANT_ACCESS_FILE since it's not an invalid reparse
> point, just an unhandled one that has no other significance at the file
> system level. To stat(), it can never be anything other than a reparse
> point. Whatever relevance it has depends on some other context (e.g. a
> CreateProcessW call).

This is easy - just remove the "call again with traverse=TRUE" - and provided we don't set S_IFLNK in this case then we are being "correct", right? (Nobody has yet defined whether S_IFLNK is required for st_reparse_tag to be set.)

> That doesn't mean, however, that I wouldn't like the ability to detect
> "name surrogate" reparse points in general to implement safer behavior
> for shutil.rmtree and anything else that walks a directory tree.

nt.lstat().st_reparse_tag and the constants in the stat module are sufficient for this, right?

> If islink() is true, then st_mode has S_IFLNK and not S_IFDIR. So we
> have a mount point that's a symlink, which is not possible in POSIX, and
> it's not a directory, which is unusual in POSIX to say the least.

This entire discussion is because POSIX can't fully describe Windows filesystems :) The two goals need to be making the "most consistent" behaviour the default, while enabling special cases to be specialised.

So I care less about the correct information being reported than it being acted upon in the correct way - if that means lying about junctions and mount points being links and not directories (since POSIX apparently doesn't support setting both flags?), then let's lie about it. lstat().st_reparse_tag will reveal whether it's a junction, and ismount() whether it's linking to a different device.
History
Date User Action Args
2019-08-16 16:31:04steve.dowersetrecipients: + steve.dower, paul.moore, tim.golden, zach.ware, eryksun
2019-08-16 16:31:04steve.dowersetmessageid: <1565973064.24.0.103734698374.issue37834@roundup.psfhosted.org>
2019-08-16 16:31:04steve.dowerlinkissue37834 messages
2019-08-16 16:31:03steve.dowercreate