classification
Title: shutil.copyfile mutates symlink for absolute path
Type: behavior Stage: resolved
Components: Windows Versions: Python 3.8
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, jaraco, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal Keywords: 3.8regression

Created on 2020-05-17 03:26 by jaraco, last changed 2020-05-28 12:25 by eryksun. This issue is now closed.

Messages (18)
msg369091 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-17 03:26
In https://github.com/jaraco/path/issues/186, the Path project discovered a regression with Python 3.8. It seems that if one creates a symlink with an absolute path. I used `shutil.copytree('temp', 'temp2', True)` and it produced this result:

```
~ # dir temp
 Volume in drive C has no label.
 Volume Serial Number is B8F4-40BB

 Directory of C:\Users\jaraco\temp

2020-05-16  11:05 PM    <DIR>          .
2020-05-16  11:05 PM    <DIR>          ..
2020-05-16  11:05 PM    <SYMLINK>      bar [c:\Users\jaraco\temp\foo]
2020-05-16  11:04 PM                 0 foo
               2 File(s)              0 bytes
               2 Dir(s)  17,495,805,952 bytes free
~ # dir temp2
 Volume in drive C has no label.
 Volume Serial Number is B8F4-40BB

 Directory of C:\Users\jaraco\temp2

2020-05-16  11:05 PM    <DIR>          .
2020-05-16  11:05 PM    <DIR>          ..
2020-05-16  11:06 PM    <SYMLINK>      bar [\\?\c:\Users\jaraco\temp\foo]
2020-05-16  11:04 PM                 0 foo
               2 File(s)              0 bytes
               2 Dir(s)  17,495,846,912 bytes free
```

As you can see, in the copy, bar has an additional `\\?\` prefix on the symlink path. On Python 3.7 and earlier, the copy was made without mutating the metadata.
msg369094 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-05-17 06:25
Copying a symlink verbatim requires copying both the print name and the substitute name in the reparse buffer [1]. For a file, CopyFileExW: COPY_FILE_COPY_SYMLINK implements this by enabling the symlink privilege for the thread and copying the reparse point via FSCTL_GET_REPARSE_POINT and FSCTL_SET_REPARSE_POINT. For a directory, CreateDirectoryExW is implemented similarly when lpTemplateDirectory is a symlink or mount point. (For a "\\\\?\\Volume{GUID}\\" volume mountpoint as opposed to a bind mountpoint, CreateDirectoryExW punts to SetVolumeMountPointW, which also updates the system mountpoint manager.)

If you can only have one or the other, the substitute name is more reliable according to the wording in [MS-FSCC] [2]. 

symlinks:

    A symbolic link has a substitute name and a print name associated 
    with it. The substitute name is a pathname (section 2.1.5) 
    identifying the target of the symbolic link. The print name SHOULD 
    be an informative pathname, suitable for display to a user, that 
    also identifies the target of the symbolic link. Either pathname 
    can contain dot directory names as specified in section 2.1.5.1. 

mount points (junctions):

    A mount point has a substitute name and a print name associated 
    with it. The substitute name is a pathname (section 2.1.5) 
    identifying the target of the mount point. The print name SHOULD 
    be an informative pathname (section 2.1.5), suitable for display 
    to a user, that also identifies the target of the mount point. 
    Neither of these pathnames can contain dot directory names.

The operative weasel word is "should", instead of a reliable "must" (RFC2119). 

An example of the power of "should" is that PowerShell doesn't even set a print name when it creates a mount point via `New-Item -ItemType Junction`. I don't agree that nt.readlink should read junctions, but it does, so the potentially missing print name is a problem. If it were just symlinks created by CreateSymbolicLinkW, the print name is reliable because we know that it sets the print name to whatever was passed as lpTargetFileName. I suppose nt.readlink could fall back on using the substitute name if there's no print name.

Also, if nt.readlink is used to manually resolve a broken path (e.g. ntpath._readlink_deep), and the process doesn't have long paths enabled, then the "\\?\" extended path from the substitute name is more reliable. (But one could also call _getfullpathname on the print name and convert the result to extended form if it's not already an extended path.)

If you search around, you'll find some projects using the print name and some using the substitute name to implement POSIX readlink, but using the print name is more popular.

Do you want 3.8 to revert to using the print name, at least for symlinks? (ntpath._readlink_deep would need to be modified to support long target paths.) Or would you rather that shutil used a more reliable way to copy symlinks verbatim on Windows? For example, use CopyFileExW for a file and CreateDirectoryEx for a directory.

[1]: https://docs.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/ns-ntifs-_reparse_data_buffer
[2]: https://docs.microsoft.com/en-us/openspecs/windows_protocols/ms-fscc/b41f1cbf-10df-4a47-98d4-1c52a833d913
msg369123 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-17 14:56
Thank you Eryk for the thorough and informative investigation. Seriously, wow.

> Do you want 3.8 to revert to using the print name, at least for symlinks? (ntpath._readlink_deep would need to be modified to support long target paths.) Or would you rather that shutil used a more reliable way to copy symlinks verbatim on Windows? For example, use CopyFileExW for a file and CreateDirectoryEx for a directory.

In this case, my instinct is that `shutil` should devise a more reliable way to copy symlinks. It claims that it does to the extent the platform allows [1].

> symbolic links in the source tree are represented as symbolic links in the new tree and the metadata of the original links will be copied as far as the platform allows

It hadn't occurred to me that the effect may be manifest in readlink, but I see that now too:

~ # py -3.7 -c "import os; print(os.readlink('temp/bar'))"
c:\Users\jaraco\temp\foo
~ # py -3.8 -c "import os; print(os.readlink('temp/bar'))"
\\?\c:\Users\jaraco\temp\foo

So even if shutil.copyfile were to copy the symlinks exactly, the expectation would still be missed that the output of readlink doesn't match the input to symlink, so the expectation would still be missed. A user/programmer should be able to predict the output of 'readlink' given the input to 'symlink' (or cmd /c mklink).

Based on that reasoning, my answer would be "both", but I'd put a priority on restoring the use of "print name" for `readlink`, as that may be sufficient to satisfy most (if not all) use-cases.


[1]: https://docs.python.org/3/library/shutil.html#shutil.copytree
msg369124 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-17 15:02
I strongly suspect bpo-37834 and GH-15231 is where the difference was introduced.
msg369202 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-05-18 10:22
This is only an issue for broken symlinks, right? (More precisely, those that cannot be resolved, which may include very long paths on some machines.) 

Fixing copy is my preferred option.

Also Jason, what do you need to be able to test against prerelease? It would be nice to find these impacts during alpha/beta rather than three bug fix releases later.
msg369212 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-18 13:10
> This is only an issue for broken symlinks, right? (More precisely, those that cannot be resolved, which may include very long paths on some machines.) 

Unfortunately, no. In the original post above, you can see temp/bar points to C:\Users\jaraco\temp\foo, which exists. Yet os.readlink returns \\?\C:\Users\jaraco\temp\foo. That's the root cause of the issue with copytree.

> what do you need to be able to test against prerelease? It would be nice to find these impacts during alpha/beta rather than three bug fix releases later.

I need a mechanically-reproducible technique to test dozens of projects against pre-releases on Windows. Historically, I've had Travis as my prime CI (non-existent Windows support) and AppVeyor for select projects that demanded some Windows testing. I've recently started migrating to Azure Pipelines, which has nice multi-platform support and that's why I've started catching these issues. Presumably, I'll be able to add a pre-release of Python 3.9 to the pipelines config at jaraco/skeleton - that will ensure that all projects I maintain going forward will get tested on the pre-release.
msg369232 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-05-18 14:33
Ah right, it's realpath() that has the final step to remove an unnecessary \\?\ prefix.

Firstly, I still think fixing copy is more important and more valuable.

Given the possibility that the print name won't point to the real target, I'd much prefer to keep reading the substitute name. In this case, this looks like a change between versions (3.7->3.8) rather than a bug (assuming the symlink works, which hasn't been mentioned yet), and while it does result in needing a test update in Path, I don't think it makes sense to push that fix upstream. Changing readlink to always return the correct path was deliberate.

Unless you're specifically testing single steps through symlink chains, you probably want to just use realpath anyway. Presumably you weren't using that beforehand because it didn't work, but the same fix fixed that, and we do extra work to retain the existing behaviour of the higher-level function (including rejecting junctions, unlike readlink).
msg369233 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-05-18 14:35
> Presumably, I'll be able to add a pre-release of Python 3.9 to the pipelines config at jaraco/skeleton - that will ensure that all projects I maintain going forward will get tested on the pre-release.

I don't have a straightforward task for it, but these should give you a Python 3.9 prerelease version (or any version you like if you replace "-Prerelease" with "-Version x.y.z", IIRC):

nuget install python -Prerelease -OutputDirectory . -ExcludeVersion
.\python\tools\python.exe ...

nuget install pythonx86 -Prerelease -OutputDirectory . -ExcludeVersion
.\pythonx86\tools\python.exe ...
msg369616 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-22 16:51
> Changing readlink to always return the correct path was deliberate.

Understood. However, this statement assumes the "correct path" is the most precise path to resolve the target. If you instead define "correct path" as the one that would be most friendly to the user who created the path, readlink no longer honors that expectation. With this change, the following invariant holds on every platform except Python 3.8 on Windows (at least in the general case):

>>> os.symlink(x, y)
>>> assert os.readlink(y) == x

More importantly, AFAIK, Python provides no function to transform `x` into what one can expect as the result of `os.readlink(y)`. In other words, what value of `f` would make this invariant pass?

>>> os.symlink(x, y)
>>> assert os.readlink(y) == f(x)

Or put another way, is "C:\Users\jaraco\temp" an "incorrect path"?
msg369617 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-22 16:58
> Unless you're specifically testing single steps through symlink chains, you probably want to just use realpath anyway.

I do see now the references to `realpath` in the docs... and I think that satisfies the need I described above. It doesn't supply the `f`, but it provides a mechanism to resolve the ultimate file. As you say, there's still no regular solution for resolving intermediate links in a chain of symlinks with user-friendly names, but it seems like a much less severe limitation (platform variation).

I'll see if `realpath` satisfies the test suite needs for path pie.
msg369624 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-22 18:07
Perhaps related, I've encountered another apparent regression on Python 3.8 on Windows when the current working directory is in a symlink to another volume and one runs `setup.py develop` on a project using setuptools_scm (https://github.com/pypa/setuptools_scm/issues/436).
msg369628 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-22 18:26
> I'll see if `realpath` satisfies the test suite needs for path pie.

I've tried replacing `readlink` with `realpath` and the tests still pass on Unix-like OSs, and also passes on Python 3.8 on Windows, but now fails on older Pythons on Windows.

Is there a backport of the new realpath for older Pythons? I see there's [one here](https://github.com/jaraco/jaraco.windows/blob/1318d7afce2a9257f5bd7342783fdb796462d66b/jaraco/windows/filesystem/backports.py#L7). I wonder if that one can be used.
msg369630 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-22 18:32
> Perhaps related...

Now I'm thinking the issue is different, so I've created issue40732 to track the realpath issue.
msg369916 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-05-25 19:49
I think we can safely say this is by design (I know Jason got his backport working).

> Understood. However, this statement assumes the "correct path" is the most precise path to resolve the target. If you instead define "correct path" as the one that would be most friendly to the user who created the path, readlink no longer honors that expectation.

Nothing about the os module is meant to be user-friendly first - it's based on the POSIX spec ;)

The most important thing is that operations that traverse symlinks should end up at the same file as a manual traversal using readlink. The easiest way to spoil this is to optimise for readability over correctness.

As discussed, realpath does a little more work to ensure readability, and anything else that cares about UI can do similar work. But if the lowest-level function loses critical information, there's no way for the developer to get it back.
msg369946 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-26 06:31
Thanks Steve. While I was able to avoid the original symptom by not using readlink, I still think there's an issue here, both in the surprising behavior observed by shutil.copyfile, but also by the essential behavior of readlink. The more essential issue can be illustrated this simple problem: how could one write the `cmd.exe` `dir` command using Python? Using cmd.exe:

```
C:\Users\jaraco>mklink foo C:\USERS\jaraco\bar
symbolic link created for foo <<===>> C:\USERS\jaraco\bar

C:\Users\jaraco>dir foo
 Volume in drive C has no label.
 Volume Serial Number is B8F4-40BB

 Directory of C:\Users\jaraco

2020-05-26  02:04 AM    <SYMLINK>      foo [C:\USERS\jaraco\bar]
               1 File(s)              0 bytes
               0 Dir(s)  21,078,786,048 bytes free
```

Similarly in powershell:

```
PS C:\Users\jaraco> cmd /c mklink foo c:\users\jaraco\bar
symbolic link created for foo <<===>> c:\users\jaraco\bar
PS C:\Users\jaraco> dir 'foo' | ?{$_.LinkType} | select Target

Target
------
{c:\users\jaraco\bar}
```

Whether 'bar' exists or not, it seems to me that these tools use the "print name" to render the next hop of the link to the user (even if that target doesn't exist).

`realpath` doesn't provide this functionality and neither does `readlink`.

Perhaps more importantly, the "print name" is lost when copying the file using shutil, but for the same underlying reason--the print name is not exposed by the low level API, and the result is that implementation details about the platform leak out to the user interface.

If "\\?\C:\Users\jaraco\bar" is the more correct form of the target, then shouldn't these tools also be exposing that value to the user? How would a Python developer wishing to implement the dir command do so? Is there a way to retain the fidelity of the print name during an shutil.copy?
msg369964 - (view) Author: Steve Dower (steve.dower) * (Python committer) Date: 2020-05-26 09:44
> how could one write the `cmd.exe` `dir` command using Python?

I haven't checked, but if the dir command resolves the entire symlink chain rather than one step at a time, you'd use realpath (though that seems unlikely, tbh).

Failing that, you'd add extra logic to prepare the path for display to the user, and then you _would not_ use that prepared display path to resolve the link.

readlink() is intended for resolving links, not for displaying links. That's why it doesn't prepare them for display, but preserves the most accurate information possible. (We chose a more pragmatic route with realpath(), by checking that the nicer path resolves to the same target as the accurate one, and only then returning the nicer one.)

Really, for dir, you'd read the print name and display it when it's in the contents of the target. But you'd use the substitute name to resolve the target when "dir <symlink>" is called. We _could_ add a new API to Python for this, but it wouldn't be cross-platform and it doesn't solve a correctness issue, so we didn't. Feel free to request it, bearing in mind that it would be Windows-only (AFAIK).
msg370034 - (view) Author: Jason R. Coombs (jaraco) * (Python committer) Date: 2020-05-26 21:55
What do you think about readlink returning something like:

class Link(str):
  print_name = None  # type: str | None

  @property
  def friendly_name(self) -> str:
    return self.print_name or self

os.readlink would always return one of these Link objects, and since it's a substr, it would behave the way it currently behaves, but with an additional property. This approach would expose the underlying duality of the "name" of a link or junction, such that tools that need a friendly name could use .friendly_name and tools that need to inspect the print name could do that... but for the cases that wish for the most precise target could use the default value.
msg370198 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-05-28 12:25
Modifying readlink() to return an str subclass that preserves the print name is an interesting idea, but not on topic here. 

For the cases where os.readlink is used to manually follow links (*), such as ntpath.realpath, using the substitute name is the most reliable option since that's the actual path that the system uses. But this issue is about the use of os.readlink in order to copy symlinks in shutil.move, shutil.copyfile, and shutil.copytree. I'd still be happy to assist with the development of an os.copylink function that copies the reparse point exactly via low-level FSCTL_GET_REPARSE_POINT and FSCTL_SET_REPARSE_POINT. It has to use the low-level API because it turns out that CopyFileExW and CreateDirectoryExW fail if they can't enable the symlink privilege, which is not actually required if the system is in developer mode. For shutil, it would be used as shutil._copylink. In POSIX, shutil._copylink would continue to just use readlink and symlink.

---

(*) off-topic note

Manually following remote mountpoints is never correct. They are intended to be evaluated by the remote system using its local devices. 

Manually following remote-to-local (R2L) symlinks is almost always incorrect and should be disallowed by local symlink evaluation policy Check `fsutil behavior query symlinkevaluation`. A remote SMB server opens a path in a way that has the remote I/O manager stop parsing at the first symlink, which may be a directory symlink in the path (it's not necessarily the final component, unlike FILE_FLAG_OPEN_REPARSE_POINT). The server completes the request with a message to the client redirector that contains the parsed path of the symlink, the remaining unparsed path, and the target of the symlink. The local SMB redirector decides whether to allow reparsing based on its L2L, L2R, R2L, and R2R symlink policies. This reparse is implemented locally, with local devices. It is unlikely that a non-relative symlink that targets local device paths (e.g. "//server/share/symlink" -> "E:/work/file") was intended to be evaluated on another machine, so R2L should almost always be disallowed. Bypassing the system's R2L policy when manually following a symlink is always wrong.
History
Date User Action Args
2020-05-28 12:25:46eryksunsetmessages: + msg370198
2020-05-26 21:55:30jaracosetmessages: + msg370034
2020-05-26 09:44:46steve.dowersetmessages: + msg369964
2020-05-26 06:31:37jaracosetmessages: + msg369946
2020-05-25 19:49:29steve.dowersetstatus: open -> closed
resolution: not a bug
messages: + msg369916

stage: resolved
2020-05-22 18:32:07jaracosetmessages: + msg369630
2020-05-22 18:26:51jaracosetmessages: + msg369628
2020-05-22 18:07:47jaracosetmessages: + msg369624
2020-05-22 16:58:01jaracosetmessages: + msg369617
2020-05-22 16:51:04jaracosetmessages: + msg369616
2020-05-18 14:35:33steve.dowersetmessages: + msg369233
2020-05-18 14:33:18steve.dowersetmessages: + msg369232
2020-05-18 13:10:30jaracosetmessages: + msg369212
2020-05-18 10:22:39steve.dowersetmessages: + msg369202
2020-05-17 15:02:57jaracosetmessages: + msg369124
2020-05-17 14:56:41jaracosetmessages: + msg369123
2020-05-17 06:25:19eryksunsetnosy: + eryksun
messages: + msg369094
2020-05-17 03:26:35jaracocreate