classification
Title: Update shutil to work with max file path length on Windows
Type:
Stage:
Components: Windows
Versions:
process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: brett.cannon, eryksun, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal
Keywords:

Created on 2016-08-10 23:02 by brett.cannon, last changed 2022-04-11 14:58 by admin.

Messages (5)
msg272384 - Author: Brett Cannon (brett.cannon) (Python committer) Date: 2016-08-10 23:02
It would be nice to have a place in the stdlib that can work with long file names on Windows no matter what. shutil seems like a reasonable place.
msg272386 - Author: Steve Dower (steve.dower) (Python committer) Date: 2016-08-10 23:20
Issue #27731 will resolve this for the latest versions of Windows 10, but if we want to support this for earlier versions of Windows, here's the algorithm I'd suggest:

* if len(path) < 260 - we're good, don't do anything
* get the last '\' or '/' character before the 260th character and split the path into two parts (path1 and path2)
* use CreateFile to get a handle to path1, then GetFinalPathNameByHandle to get a normalized npath1 (which will have the '\\?\' prefix)
* split path2 by '/' and '\' characters, trim leading and trailing spaces from each segment, trim trailing dots from each segment, and append them to npath1 separated by '\' characters
* use this normalized path instead of path

It's a relatively expensive operation, but it is the most reliable way to normalize a path. Where it falls down is that trimming spaces and trailing dots and replacing '/' with '\' is not guaranteed to be the only processing that Windows does. However, it should solve 99% of cases, which is better than the 0% that currently work.
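
The string-handling steps of this proposal can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `split_for_normalization` and `append_segments` are hypothetical, the handle-based step (CreateFile plus GetFinalPathNameByHandle) is not performed, and `npath1` is assumed to be its result, passed in by the caller.

```python
import re

MAX_PATH = 260
_SEPARATORS = re.compile(r"[\\/]")


def split_for_normalization(path, limit=MAX_PATH):
    """Split path at the last separator before the limit-th character.

    Returns (path1, path2). Short paths come back as (path, "").
    """
    if len(path) < limit:
        return path, ""
    head = path[:limit - 1]  # the characters before the limit-th one
    cut = max(head.rfind("\\"), head.rfind("/"))
    if cut < 0:
        raise ValueError("no separator before the length limit")
    return path[:cut], path[cut + 1:]


def append_segments(npath1, path2):
    """Append the cleaned segments of path2 to the normalized npath1.

    npath1 is assumed to come from GetFinalPathNameByHandle, so it
    already carries the \\\\?\\ prefix. Each segment has its leading and
    trailing spaces trimmed first, then its trailing dots.
    """
    segments = []
    for segment in _SEPARATORS.split(path2):
        segment = segment.strip(" ").rstrip(".")
        if segment:
            segments.append(segment)
    return "\\".join([npath1.rstrip("\\")] + segments)
```

For example, `append_segments("\\\\?\\C:\\dir", "sub /file.txt.")` joins the cleaned segments `sub` and `file.txt` onto the prefixed directory.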
msg272405 - Author: Eryk Sun (eryksun) (Python triager) Date: 2016-08-11 06:58
Standard users have SeChangeNotifyPrivilege, which allows traversing a directory that they can't access, so Python should only work with paths as strings instead of trying to open a directory handle.

I think it's best to make Windows do as much of the normalization work as possible. Why reinvent the wheel instead of relying on GetFullPathNameW [1]? 

For example, you propose to trim leading and trailing spaces and trailing dots from each component, but Windows itself doesn't go that far. Leading spaces are never removed. Only the final path component has all trailing spaces and dots trimmed. From the preceding components Windows will strip one and only one trailing dot, and a trailing space is never removed. Some examples:

    >>> os.path.exists(r'C:\Temp\test\dir1\dir2\file')
    True
    >>> os.path.exists(r'C:\Temp\test\dir1\dir2\file. . . . . .')
    True
    >>> os.path.exists(r'C:\Temp\test\dir1.\dir2.\file')
    True
    >>> os.path.exists(r'C:\Temp\test\dir1..\dir2\file')
    False
    >>> os.path.exists(r'C:\Temp\test\dir1 \dir2\file')
    False
    >>> os.path.exists(r'C:\Temp\test\ dir1\dir2\file')
    False

Components that consist of only "." and ".." should also be normalized:

    >>> os.path.abspath(r'C:\Temp\test\dir1\..\dir1\.\dir2\...\file')
    'C:\\Temp\\test\\dir1\\dir2\\...\\file'

Paths with DOS devices also need to be translated beforehand, since the existence of classic DOS devices in every directory is emulated by the NT runtime library when it translates from DOS paths to native NT paths. For example:

    >>> os.path.abspath(r'C:\Temp\con')
    '\\\\.\\con'
    >>> os.path.abspath(r'C:\Temp\nul')
    '\\\\.\\nul'
    >>> os.path.abspath(r'C:\Temp\prn')
    '\\\\.\\prn'
    >>> os.path.abspath(r'C:\Temp\aux')
    '\\\\.\\aux'

GetFullPathNameW handles all of these corner cases already, so I think a simpler algorithm is to just rely on Windows to do most of the work:

* If len(path) < 260 or the path starts with L"\\\\?\\" or L"\\\\.\\", don't do anything.
* Call GetFullPathNameW to calculate the required path length.
* If the path starts with L"\\\\", over-allocate by sizeof(WCHAR) * 6. Otherwise over-allocate by sizeof(WCHAR) * 4.
* Call GetFullPathNameW again, with the buffer pointer adjusted past the over-allocation.
* If the path is a UNC path, copy the L"\\\\?\\UNC" prefix to the start of the buffer. Otherwise copy L"\\\\?\\".
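
A rough ctypes sketch of these steps follows. It is an illustration, not a tested implementation: the function names `get_full_path_name`, `add_prefix` and `long_path` are hypothetical, and because Python strings are concatenated rather than written into a shared buffer, the over-allocation trick from the list is replaced by plain string concatenation. The guard for results that already start with \\?\ or \\.\ is an extra assumption that is not part of the list.

```python
import ctypes
import sys

_PREFIX = "\\\\?\\"          # \\?\
_DEVICE_PREFIX = "\\\\.\\"   # \\.\
_UNC_PREFIX = "\\\\?\\UNC"   # \\?\UNC


def get_full_path_name(path):
    """Call GetFullPathNameW twice: once for the size, once for the data."""
    if sys.platform != "win32":
        raise OSError("GetFullPathNameW is only available on Windows")
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    func = kernel32.GetFullPathNameW
    func.argtypes = (ctypes.c_wchar_p, ctypes.c_uint32,
                     ctypes.c_wchar_p, ctypes.c_void_p)
    func.restype = ctypes.c_uint32
    # With a zero-length buffer the call returns the required size in
    # characters, including the terminating null.
    size = func(path, 0, None, None)
    while size:
        buf = ctypes.create_unicode_buffer(size)
        result = func(path, size, buf, None)
        if result == 0:
            break
        if result < size:
            # Success: the return value excludes the terminating null.
            return buf.value
        size = result  # the required size grew between calls; retry
    raise ctypes.WinError(ctypes.get_last_error())


def add_prefix(full_path):
    """Prefix a fully qualified path with \\\\?\\ or \\\\?\\UNC."""
    # Assumption beyond the list: leave device paths untouched.
    if full_path.startswith((_PREFIX, _DEVICE_PREFIX)):
        return full_path
    if full_path.startswith("\\\\"):
        # \\server\share -> \\?\UNC\server\share
        return _UNC_PREFIX + full_path[1:]
    return _PREFIX + full_path


def long_path(path):
    """Apply the steps above; short or already prefixed paths pass through."""
    if len(path) < 260 or path.startswith((_PREFIX, _DEVICE_PREFIX)):
        return path
    return add_prefix(get_full_path_name(path))
```

Only `get_full_path_name` needs Windows. `add_prefix` is pure string handling, so it can be exercised on any platform.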

Contrary to the documentation on MSDN, Windows doesn't need the \\?\ prefix to use a long path with GetFullPathNameW. On NT systems it has always worked with long paths. The implementation uses the RtlGetFullPathName_U* family of functions, which immediately wrap the input buffer in a UNICODE_STRING, which has a limit of 32,768 characters. 

The only MAX_PATH limit here is one that can't be avoided. The process working directory is limited to MAX_PATH, as are the per-drive working directories (stored in hidden environment variables, e.g. "=C:"). At least that's the case prior to the upcoming change in Windows 10. With the change you propose in issue 27731, Windows 10 users should be able to set a working directory that exceeds MAX_PATH. 

For example, the following demonstrates (in Windows 10.0.10586) that the value of "=Z:" is only used when its length is less than MAX_PATH and the target directory exists.

Create a long test path:

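    >>> import ctypes, os, shutil
    >>> kernel32 = ctypes.WinDLL('kernel32')  # assumed setup for the calls below
    >>> path = 'Z:' + r'\test' * 50
    >>> os.makedirs('\\\\?\\' + path + r'\last\test')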

A drive-relative path is resolved relative to the root directory if the current directory on the drive doesn't exist or is inaccessible:

    >>> kernel32.SetEnvironmentVariableW('=Z:', path + r'\test')
    1
    >>> os.path._getfullpathname('Z:file')
    'Z:\\file'

It also uses the root directory if the current directory on the drive exceeds MAX_PATH:

    >>> kernel32.SetEnvironmentVariableW('=Z:', path + r'\last\test')
    1
    >>> os.path._getfullpathname('Z:file')
    'Z:\\file'

It resolves correctly if the current directory can be opened and the path length doesn't exceed MAX_PATH:

    >>> kernel32.SetEnvironmentVariableW('=Z:', path + r'\last')
    1
    >>> os.path._getfullpathname('Z:file')
    'Z:\\test\\test\\test\\test\\test\\test\\test\\test\\test\\test\\
    test\\test\\test\\test\\test\\test\\test\\test\\test\\test\\test\\
    test\\test\\test\\test\\test\\test\\test\\test\\test\\test\\test\\
    test\\test\\test\\test\\test\\test\\test\\test\\test\\test\\test\\
    test\\test\\test\\test\\test\\test\\test\\last\\file'

    >>> shutil.rmtree(r'\\?\Z:\test')

[1]: https://msdn.microsoft.com/en-us/library/aa364963
msg272446 - Author: Steve Dower (steve.dower) (Python committer) Date: 2016-08-11 13:39
I thought I'd tested GetFullPathNameW and seen the limit kick in at 260, but if that's not actually the case (across all platforms we support) then yes, let's use that.

When I reread the documentation yesterday it didn't guarantee the result would include the prefix, whereas GetFinalPathNameByHandle does. Again, if the documentation is incorrect here, then we should use the simpler function.

The fact that I described the normalization process inadequately shows why we really need to be careful trying to emulate it.
msg272530 - Author: Eryk Sun (eryksun) (Python triager) Date: 2016-08-12 12:17
I overlooked some aspects of the problem:

* A short relative path may end up exceeding MAX_PATH when normalized as a fully qualified path.
* The working directory may be a UNC path or may already have the \\?\ prefix. It's not thread-safe to check this beforehand.
* A path that contains a reserved DOS device name results in a \\.\ path.
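
The first point is simple arithmetic, which a quick check makes concrete. This is a sketch with made-up values: the working directory and the relative path are hypothetical, and `ntpath.join` only stands in for the join that Windows performs during normalization.

```python
import ntpath

MAX_PATH = 260

# Hypothetical working directory: 'C:' plus 62 components of '\dir'.
cwd = "C:" + "\\dir" * 62
relative = "data\\report.txt"

# The relative path is far below the limit on its own...
full = ntpath.join(cwd, relative)

# ...but the fully qualified result is not.
print(len(relative), len(cwd), len(full))  # prints: 15 250 266
```

A length check on the path as given would therefore skip exactly the paths that need the prefix.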
 
Thus on second thought I think it's safer to call GetFullPathNameW for all paths that lack the \\?\ prefix, and then copy the result if it needs to be prefixed by \\?\ or \\?\UNC. The final path, if it's a filesystem path, should always use the \\?\ namespace to ensure that functions such as shutil.rmtree won't fail. 

For example:

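    import os

    _DOS_DEVICES = "\\\\.\\"        # \\.\ prefix
    _NT_DOS_DEVICES = "\\\\?\\"     # \\?\ prefix
    _NT_UNC_DEVICE = "\\\\?\\UNC"   # \\?\UNC prefix

    def winapi_path(path):
        # An empty path is treated as the current directory.
        path = os.fsdecode(path) or '.'
        # Already in the \\?\ namespace, so use it as-is.
        if path.startswith(_NT_DOS_DEVICES):
            return path
        # Windows only: let GetFullPathNameW normalize the path.
        temp = os.path._getfullpathname(path)
        # The result is already a device path (a DOS device name, or a
        # working directory that has the prefix), so don't prefix it.
        if temp.startswith((_NT_DOS_DEVICES, _DOS_DEVICES)):
            return path if temp == path else temp
        # UNC: \\server\share becomes \\?\UNC\server\share.
        if temp.startswith('\\\\'):
            return _NT_UNC_DEVICE + temp[1:]
        return _NT_DOS_DEVICES + temp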

For reference, here's a typical call pattern when Windows 10.0.10586 converts a DOS path to an NT path:

    RtlInitUnicodeStringEx
    RtlDosPathNameToRelativeNtPathName_U_WithStatus
        RtlInitUnicodeStringEx
        RtlDosPathNameToRelativeNtPathName
            RtlGetFullPathName_Ustr
            RtlDetermineDosPathNameType_Ustr
            RtlAllocateHeap
            memcpy

RtlGetFullPathName_Ustr is called with a buffer that's sizeof(WCHAR) * MAX_PATH bytes. GetFullPathNameW also calls RtlGetFullPathName_Ustr, but with a caller-supplied buffer that can be up to sizeof(WCHAR) * 32768 bytes.

Here's the call pattern for a \\?\ path:

    RtlInitUnicodeStringEx
    RtlDosPathNameToRelativeNtPathName_U_WithStatus
        RtlInitUnicodeStringEx
        RtlDosPathNameToRelativeNtPathName
            RtlpWin32NtNameToNtPathName
                RtlAllocateHeap
                RtlAppendUnicodeStringToString
                RtlAppendUnicodeStringToString

RtlpWin32NtNameToNtPathName copies the path, replacing \\? with the object manager's \?? virtual DOS devices directory. 
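
As a string operation, that rewrite amounts to the following. This is a toy model for illustration only: the function name `win32_nt_name_to_nt_path` is hypothetical, and the real routine works on UNICODE_STRING buffers allocated from the heap.

```python
WIN32_PREFIX = "\\\\?\\"   # \\?\ as seen by the Windows API
NT_PREFIX = "\\??\\"       # \??\ in the object manager namespace


def win32_nt_name_to_nt_path(path):
    # Only the prefix changes; the rest of the path is copied verbatim,
    # with no normalization of separators, dots or spaces.
    if not path.startswith(WIN32_PREFIX):
        raise ValueError("expected a path that starts with the \\\\?\\ prefix")
    return NT_PREFIX + path[len(WIN32_PREFIX):]


print(win32_nt_name_to_nt_path("\\\\?\\C:\\Temp\\file"))
# prints: \??\C:\Temp\file
```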

Here's some background information for those who don't already know the basics of how Windows implements DOS devices in NT's object namespace, which you can explore using Microsoft's free WinObj tool.

In Windows NT 3 and 4 (before Terminal Services) there was a single \DosDevices directory, which is where the system created DOS device links to the actual NT devices in \Device, such as C: => \Device\HarddiskVolume2. Windows 2000 changed this in ways that were problematic, mostly due to using a per-session directory instead of a per-logon directory. (Tokens for multiple logon sessions can be used in a single Windows session, and almost always are since UAC split tokens arrived in Vista.) 

The design was changed again in Windows XP. \DosDevices is now just a link to the virtual \?? directory. The system creates DOS devices in a local (per logon) directory, except for system threads and LocalSystem logons (typically services), which use the \GLOBAL?? directory. The per-logon directories are located at \Sessions\0\DosDevices\[LogonAuthenticationId]. The object manager parses \?? by first checking the local DOS devices and then the global DOS devices. Each local DOS devices directory also has a Global link back to \GLOBAL??. It's accessible as \\?\Global\[Device Name], which is useful when a local device has the same name as a global one. The root directory of the object namespace is accessible to administrators using the \GLOBAL??\GLOBALROOT link, which from the Windows API is \\?\GLOBALROOT.
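
The lookup order can be modelled with two dictionaries. This is a toy sketch, not how the object manager is implemented: the `lookup` helper and the `X:` entries with their targets are made up for illustration, real device names are not case-sensitive, and real links are symbolic links that get reparsed.

```python
from collections import ChainMap

# Stand-in for \GLOBAL??. The C: target is the example used above;
# the X: target is hypothetical.
global_devices = {
    "C:": r"\Device\HarddiskVolume2",
    "X:": r"\Device\GlobalExample",
}

# Stand-in for one per-logon directory. It shadows X: and carries the
# Global link back to the global directory.
local_devices = {
    "X:": r"\Device\LocalExample",
    "Global": global_devices,
}


def lookup(name):
    # Resolve a name relative to the virtual \?? directory: local DOS
    # devices are checked first, then the global ones.
    first, _, rest = name.partition("\\")
    target = ChainMap(local_devices, global_devices)[first]
    if isinstance(target, dict):
        # The Global link: continue the lookup in the global directory.
        return target[rest]
    return target


print(lookup("C:"))          # found only in the global directory
print(lookup("X:"))          # the local entry shadows the global one
print(lookup("Global\\X:"))  # the Global link reaches the shadowed entry
```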