Message 349888 - Python tracker

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

Author	eryksun
Recipients	eryksun, paul.moore, steve.dower, tim.golden, zach.ware
Date	2019-08-16.22:14:03
SpamBayes Score	-1.0
Marked as misclassified	Yes
Message-id	<1565993644.53.0.0320037706152.issue37834@roundup.psfhosted.org>
In-reply-to

Content
> Where "links" are the generic term for the set that includes > "reparse point", "symlink", "mount point", "junction", etc.) Why group all reparse points under the banner of 'link'? If we have a typical HSM reparse point (the vast majority of allocated tags), then all operations, such as delete and rename, act on the file itself, not simply the reparse point. We should be able to delete or rename a link without affecting the target. In this case, there's also no chance that the reparse point is a surrogate for another path on the system, so code that walks paths doesn't have to worry about loops with regard to these reparse points. The only practical use case I can think of for detecting/opening this type of reparse point is backup software that should avoid triggering an HSM recall. For example: https://www.ibm.com/support/knowledgecenter/en/SSEQVQ_8.1.0/client/r_opt_hsmreparsetag.html As I've previously suggested (and this is the last time because I'm becoming a broken record), lstat() should at least be restricted to opening only name-surrogate reparse points that are supposed to be like links in that they target another path in the system. Plus it also has to open unhandled reparse points. Personally, I'm only comfortable with opening it up to name surrogates if islink() and readlink() are still limited to just Unix-like symlinks that we can create via symlink(). Nothing changes there. It's just a restriction of how lstat() currently works. The addition of the reparse tag in the stat result enables special handling of non-symlink surrogates. > shutil.copytree(path): Unchanged. (requires a minor fix to > continue to recursively copy through junctions (using above test), > but not symlinks.) Everyone else who relies on islink(), readlink(), and symlink() to copy symlinks isn't special casing their code to look for junctions or anything else we lump under the banner of islink(). They could code defensively if readlink() fails for a 'link' that we can't read. But that leaves the problem of readlink() succeeding for a junction. That can causes problems if the target is passed to os.symlink(), which changes the link from a hard name grafting to a soft name grafting. Why would we need to read the target of a junction? It's not needed for realpath() in Windows. We should only have to resolve symlinks. For example: C:/Mount/junction/spam/eggs junction -> Z:/bar/baz We don't have to resolve this as "Z:/bar/baz/spam/eggs", and doing so may even be wrong for someone using it to manually resolve a relative symlink. "C:/Mount/junction/spam/eggs" is a solid path. In Unix it would not be resolved by realpath(). A solid path is needed to figure out how to create a relative symlink, or how to manually resolve one for a given path. For example, if "foo_link" in "C:/Mount/junction/spam/eggs" targets "../../../foo", this refers to "C:/Mount/foo". On the other hand, if the junction mount point were replaced by a soft symlink, then "C:/Mount/symlink/spam/eggs" is not a solid path. "foo_link" is instead evaluated over the target path: "Z:/bar/baz/spam/eggs/foo_link", so the link resolves to "Z:/bar/foo". IMO, S_IFLNK need not be set for anything other than Unix-like symbolic links. We would just need to document that on Windows, lstat opens any link-like reparse point that indicates it targets another path on the system, plus any reparse point that's not handled, but that islink() is only true for actual Unix symlinks that can be created via os.symlink() and read via os.readlink(). This preserves how islink() and readlink() currently work, while still leaving the door open to fix misbehavior in particular cases. Code, including our own code, that needs to look for the broader Windows category of "name surrogate" can examine the reparse tag. For convenience we can provide issurrogate() that checks lstat(filename).st_reparse_tag & 0x2000_0000. This can be true for directories. Also, a surrogate doesn't have to behave like a Unix "soft" symlink, i.e. it applies to "hard" mount points. In Unix, issurrogate() could just be an alias for islink() since Unix provides only one type of name surrogate. Currently the name surrogate category includes the following tags: Microsoft name surrogate (bits 31 and 29) IO_REPARSE_TAG_MOUNT_POINT 0xA0000003 IO_REPARSE_TAG_SYMLINK 0xA000000C IO_REPARSE_TAG_IIS_CACHE 0xA0000010 IO_REPARSE_TAG_GLOBAL_REPARSE 0xA0000019 IO_REPARSE_TAG_LX_SYMLINK 0xA000001D IO_REPARSE_TAG_WCI_TOMBSTONE 0xA000001F IO_REPARSE_TAG_PROJFS_TOMBSTONE 0xA0000022 Non-Microsoft name surrogate (bit 29) IO_REPARSE_TAG_SOLUTIONSOFT 0x2000000D IO_REPARSE_TAG_OSR_SAMPLE 0x20000017 IO_REPARSE_TAG_QI_TECH_HSM 0x2000002F IO_REPARSE_TAG_MAXISCALE_HSM 0x20000035 IO_REPARSE_TAG_ALERTBOOT 0x2000004C IO_REPARSE_TAG_NVIDIA_UNIONFS 0x20000054 IO_REPARSE_TAG_OSR_SAMPLE is used by OSR sample code in their Windows driver curriculum, so that one is unlikely to be seen in practice. I don't know anything about the other non-Microsoft tags. NVidia's UnionFS looks interesting. Using reparse points to merge file systems is probably not the most efficient way to handle that problem, but I'm sure the devil is in the details there. > os.unlink(path): unchanged (still removes the junction, not the > contents) Whatever we're calling a link should be capable of being deleted via os.unlink. If we apply S_IFLNK, then it won't have S_IFDIR (at least how POSIX code expects it), and unlink should work on it. The current state of affairs in which unlink/remove works on a junction, which is reported by stat() as a directory, is inconsistent. It's not specified to remove directories, so nothing that it can remove should be a directory. > shutil.rmtree(path): Will now remove a junction rather than > recursively deleting its contents (net improvement, IMHO) I'd like for it to remove all name-surrogate directories like CMD's `rmdir /s` does. In contrast, Unix shutil.rmtree traverses into a mount point, deletes everything, and then fails because the directory is mounted and can't be removed. That's hideous, IMO.

> Where "links" are the generic term for the set that includes 
> "reparse point", "symlink", "mount point", "junction", etc.)

Why group all reparse points under the banner of 'link'? If we have a typical HSM reparse point (the vast majority of allocated tags), then all operations, such as delete and rename, act on the file itself, not simply the reparse point. We should be able to delete or rename a link without affecting the target. 

In this case, there's also no chance that the reparse point is a surrogate for another path on the system, so code that walks paths doesn't have to worry about loops with regard to these reparse points. The only practical use case I can think of for detecting/opening this type of reparse point is backup software that should avoid triggering an HSM recall. For example:

https://www.ibm.com/support/knowledgecenter/en/SSEQVQ_8.1.0/client/r_opt_hsmreparsetag.html

As I've previously suggested (and this is the last time because I'm becoming a broken record), lstat() should at least be restricted to opening only name-surrogate reparse points that are supposed to be like links in that they target another path in the system. Plus it also has to open unhandled reparse points. 

Personally, I'm only comfortable with opening it up to name surrogates if islink() and readlink() are still limited to just Unix-like symlinks that we can create via symlink(). Nothing changes there. It's just a restriction of how lstat() currently works. The addition of the reparse tag in the stat result enables special handling of non-symlink surrogates.

> shutil.copytree(path): Unchanged. (requires a minor fix to 
> continue to recursively copy through junctions (using above test), 
> but not symlinks.)

Everyone else who relies on islink(), readlink(), and symlink() to copy symlinks isn't special casing their code to look for junctions or anything else we lump under the banner of islink(). They could code defensively if readlink() fails for a 'link' that we can't read. But that leaves the problem of readlink() succeeding for a junction. That can causes problems if the target is passed to os.symlink(), which changes the link from a hard name grafting to a soft name grafting.

Why would we need to read the target of a junction? It's not needed for realpath() in Windows. We should only have to resolve symlinks. For example:

    C:/Mount/junction/spam/eggs

             junction -> Z:/bar/baz

We don't have to resolve this as "Z:/bar/baz/spam/eggs", and doing so may even be wrong for someone using it to manually resolve a relative symlink. "C:/Mount/junction/spam/eggs" is a solid path. In Unix it would not be resolved by realpath(). A solid path is needed to figure out how to create a relative symlink, or how to manually resolve one for a given path. 

For example, if "foo_link" in "C:/Mount/junction/spam/eggs" targets "../../../foo", this refers to "C:/Mount/foo". On the other hand, if the junction mount point were replaced by a soft symlink, then "C:/Mount/symlink/spam/eggs" is not a solid path. "foo_link" is instead evaluated over the target path: "Z:/bar/baz/spam/eggs/foo_link", so the link resolves to "Z:/bar/foo".

IMO, S_IFLNK need not be set for anything other than Unix-like symbolic links. We would just need to document that on Windows, lstat opens any link-like reparse point that indicates it targets another path on the system, plus any reparse point that's not handled, but that islink() is only true for actual Unix symlinks that can be created via os.symlink() and read via os.readlink(). 

This preserves how islink() and readlink() currently work, while still leaving the door open to fix misbehavior in particular cases. Code, including our own code, that needs to look for the broader Windows category of "name surrogate" can examine the reparse tag. For convenience we can provide issurrogate() that checks lstat(filename).st_reparse_tag & 0x2000_0000. This can be true for directories. Also, a surrogate doesn't have to behave like a Unix "soft" symlink, i.e. it applies to "hard" mount points. In Unix, issurrogate() could just be an alias for islink() since Unix provides only one type of name surrogate.

Currently the name surrogate category includes the following tags:

    Microsoft name surrogate (bits 31 and 29)

    IO_REPARSE_TAG_MOUNT_POINT                  0xA0000003
    IO_REPARSE_TAG_SYMLINK                      0xA000000C
    IO_REPARSE_TAG_IIS_CACHE                    0xA0000010
    IO_REPARSE_TAG_GLOBAL_REPARSE               0xA0000019
    IO_REPARSE_TAG_LX_SYMLINK                   0xA000001D
    IO_REPARSE_TAG_WCI_TOMBSTONE                0xA000001F
    IO_REPARSE_TAG_PROJFS_TOMBSTONE             0xA0000022

    Non-Microsoft name surrogate (bit 29)

    IO_REPARSE_TAG_SOLUTIONSOFT                 0x2000000D
    IO_REPARSE_TAG_OSR_SAMPLE                   0x20000017
    IO_REPARSE_TAG_QI_TECH_HSM                  0x2000002F
    IO_REPARSE_TAG_MAXISCALE_HSM                0x20000035
    IO_REPARSE_TAG_ALERTBOOT                    0x2000004C
    IO_REPARSE_TAG_NVIDIA_UNIONFS               0x20000054

IO_REPARSE_TAG_OSR_SAMPLE is used by OSR sample code in their Windows driver curriculum, so that one is unlikely to be seen in practice. I don't know anything about the other non-Microsoft tags. NVidia's UnionFS looks interesting. Using reparse points to merge file systems is probably not the most efficient way to handle that problem, but I'm sure the devil is in the details there.

> os.unlink(path): unchanged (still removes the junction, not the 
> contents)

Whatever we're calling a link should be capable of being deleted via os.unlink. If we apply S_IFLNK, then it won't have S_IFDIR (at least how POSIX code expects it), and unlink should work on it. The current state of affairs in which unlink/remove works on a junction, which is reported by stat() as a directory, is inconsistent. It's not specified to remove directories, so nothing that it can remove should be a directory.

> shutil.rmtree(path): Will now remove a junction rather than 
> recursively deleting its contents (net improvement, IMHO)

I'd like for it to remove all name-surrogate directories like CMD's `rmdir /s` does. In contrast, Unix shutil.rmtree traverses into a mount point, deletes everything, and then fails because the directory is mounted and can't be removed. That's hideous, IMO.

History
Date	User	Action	Args
2019-08-16 22:14:04	eryksun	set	recipients: + eryksun, paul.moore, tim.golden, zach.ware, steve.dower
2019-08-16 22:14:04	eryksun	set	messageid: <1565993644.53.0.0320037706152.issue37834@roundup.psfhosted.org>
2019-08-16 22:14:04	eryksun	link	issue37834 messages
2019-08-16 22:14:03	eryksun	create