classification
Title: tarfile and zipfile: handling Windows (path) illegal characters in archive member names
Type: enhancement Stage:
Components: Library (Lib), Windows Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: CristiFati, eryksun, lars.gustaebel, paul.moore, steve.dower, tim.golden, vstinner, zach.ware
Priority: normal Keywords:

Created on 2019-04-05 10:04 by CristiFati, last changed 2020-10-16 09:21 by vstinner.

Messages (10)
msg339486 - (view) Author: Cristi Fati (CristiFati) * Date: 2019-04-05 10:04
Although tar is a *nix-based format (and mostly used there), it is gaining popularity on Windows too.

Since tarfile also runs on Windows, I think it should handle (work around) path incompatibilities, as zipfile (`ZipFile._sanitize_windows_name`) does.

Applies to all branches.

More details on [Tarfile/Zipfile extractall() changing filename of some files](https://stackoverflow.com/questions/55340013/tarfile-zipfile-extractall-changing-filename-of-some-files/55348443#55348443).

The current zipfile handling can also be improved, as it has a small bug: for example, if the archive contains 2 files named "file:" and "file_", both end up with the same sanitized name, so extraction won't work as expected. This is a rare corner case, though.
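
To illustrate the collision, here is a minimal sketch. It does not call zipfile itself; `sanitize` is a hypothetical stand-in that, like the zipfile helper, is assumed to replace each of the characters `:<>|"?*` with an underscore.

```python
# Hypothetical stand-in for the zipfile helper: every character in
# ILLEGAL is assumed to be replaced by an underscore.
ILLEGAL = ':<>|"?*'
TABLE = str.maketrans(ILLEGAL, "_" * len(ILLEGAL))


def sanitize(name):
    """Return name with the Windows-illegal characters replaced."""
    return name.translate(TABLE)


members = ["file:", "file_"]
targets = [sanitize(m) for m in members]

# Both members map to one target, so the second extraction would
# overwrite the first.
collision = len(set(targets)) < len(members)
```

A fix would need to detect such duplicates and pick a different target name for one of them.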

I didn't prepare a patch because I did so for another issue (https://bugs.python.org/issue36247, which I consider an ugly one); it wasn't well received and was rejected (for different reasons). If this issue gets the green light from whoever is in charge, I'll be happy to provide one.
msg339501 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-04-05 12:59
_sanitize_windows_name() fails to translate the reserved control characters (0x01-0x1F) and backslash in names. 

What I've seen done in some cases (e.g. Unix network shares mapped to SMB) is to translate names using the private use area block, e.g. 0xF001 - 0xF07F. Windows has no problem with characters in this range in a filename. (Displaying these characters sensibly is another matter.) For Windows 10, this is especially useful since the Linux subsystem automatically translates this PUA block back to ASCII when accessing a Windows volume via drvfs. For example:

    C:\Temp\pua>python -q
    >>> import sys
    >>> sys.platform
    'win32'
    >>> name = ''.join(map(chr, range(0xf001, 0xf080)))
    >>> _ = open(name, 'w')
    >>> ^Z

    C:\Temp\pua>bash -c "python3 -q"
    >>> import os, sys
    >>> sys.platform
    'linux'
    >>> os.listdir()
    ['\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f
      \x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
       !"#$%&\'()*+,-./0123456789:;<=>?
      @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_
      `abcdefghijklmnopqrstuvwxyz{|}~\x7f']
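
The translation itself can be sketched as below. This is only an illustration under assumptions: the helper names `to_pua` and `from_pua` are hypothetical, the offset 0xF000 is inferred from the range shown above, and the set of translated characters (control characters, backslash, and the usual Windows-illegal punctuation) is my reading of the discussion rather than an agreed list.

```python
# Characters assumed to need translation in a single name component:
# ASCII control characters 0x01-0x1F, backslash, and : * ? " < > |
_RESERVED = {chr(i) for i in range(1, 32)} | set('\\:*?"<>|')
_PUA_OFFSET = 0xF000  # assumption: ASCII code point + 0xF000

_TO_PUA = {ord(c): ord(c) + _PUA_OFFSET for c in _RESERVED}
_FROM_PUA = {pua: ascii_ for ascii_, pua in _TO_PUA.items()}


def to_pua(component):
    """Map reserved characters into the private use area block."""
    return component.translate(_TO_PUA)


def from_pua(component):
    """Reverse the mapping applied by to_pua()."""
    return component.translate(_FROM_PUA)
```

Unlike replacing everything with an underscore, this mapping is reversible, so two distinct member names never collapse into one.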

Also, while _sanitize_windows_name() handles trailing dots, for some reason it overlooks trailing spaces. It also doesn't handle reserved DOS device names. The reserved names include NUL, CON, CONIN$, CONOUT$, AUX, PRN, COM[1-9], LPT[1-9], and these names plus zero or more spaces and possibly a dot or colon and any subsequent characters. For example:

    >>> os.path._getfullpathname('con')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con  ')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con:')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con :')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con : spam')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con . eggs')
    '\\\\.\\con'

It's not a reserved device name if the first character after zero or more spaces is not a dot or colon. For example:

    >>> os.path._getfullpathname('con spam')
    'C:\\con spam'
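
The rule described above can be written as a regular expression. This is a sketch, not what any stdlib module does today; `is_dos_device_name` is a hypothetical helper and it expects a single path component, not a full path.

```python
import re

# A reserved device name, then zero or more spaces, then either the end
# of the name or a dot/colon followed by anything.
_DEVICE_NAME = re.compile(
    r"(?:NUL|CON|CONIN\$|CONOUT\$|AUX|PRN|COM[1-9]|LPT[1-9])"
    r" *(?:[.:].*)?",
    re.IGNORECASE | re.DOTALL,
)


def is_dos_device_name(basename):
    """Return True if basename would resolve to a DOS device."""
    return _DEVICE_NAME.fullmatch(basename) is not None


reserved = ["con", "con  ", "con:", "con :", "con : spam", "con . eggs"]
results = [is_dos_device_name(name) for name in reserved]
```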

We can create filenames with reserved device names or trailing spaces and dots by using a \\?\ prefixed path (i.e. a non-normalized device path). However, most programs don't use \\?\ paths, so it's probably better to translate these names.
msg378113 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-06 11:06
> Also, while _sanitize_windows_name() handles trailing dots, for some reason it overlooks trailing spaces. It also doesn't handle reserved DOS device names.

The pathlib module has a _WindowsFlavour.reserved_names list of Windows
reserved names:

>>> import pathlib, pprint
>>> pprint.pprint(sorted(pathlib._WindowsFlavour.reserved_names))
['AUX',
 'COM1',
 'COM2',
 'COM3',
 'COM4',
 'COM5',
 'COM6',
 'COM7',
 'COM8',
 'COM9',
 'CON',
 'LPT1',
 'LPT2',
 'LPT3',
 'LPT4',
 'LPT5',
 'LPT6',
 'LPT7',
 'LPT8',
 'LPT9',
 'NUL',
 'PRN']
msg378126 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-10-06 14:27
> The pathlib module has _WindowsFlavour.reserved_names list of 
> Windows reserved names:

pathlib._WindowsFlavour.reserved_names is missing "CONIN$" and "CONOUT$". Prior to Windows 8 these two are reserved as relative names. In Windows 8+, they're also reserved in directories, just like the other reserved device names.

pathlib._WindowsFlavour.is_reserved() fails to reserve names containing ASCII control characters [0-31], vertical bar [|], the file-stream delimiter [:] (i.e. "filename:streamname:streamtype"), and the five wildcard characters [*?"<>]. (Maybe it should allow the file-stream delimiter, but that requires validating that a file stream is proper.) It fails to reserve names that end with a dot or space, which includes UNC and device paths except for \\?\ verbatim paths. It fails to match all reserved base names, which begin with a reserved device name, followed by zero or more spaces, a dot or colon, and zero or more characters. If names that contain colon are already reserved, then this check only has to be modified to strip trailing spaces before comparing against the list of reserved device names.
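
A stricter check along these lines might look like the sketch below. It is not the pathlib implementation; `is_reserved_name` is a hypothetical function that takes one path component, treats every colon as reserved (so no file-stream validation), and leaves "." and ".." to the caller.

```python
RESERVED_NAMES = (
    {"CON", "PRN", "AUX", "NUL", "CONIN$", "CONOUT$"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
# Vertical bar, file-stream delimiter, and the five wildcard characters.
RESERVED_CHARS = set('|:*?"<>')


def is_reserved_name(name):
    """Return True if a single path component is reserved on Windows."""
    if name in (".", ".."):
        return False  # assumption: special entries are handled elsewhere
    if any(ord(c) < 32 or c in RESERVED_CHARS for c in name):
        return True
    if name.endswith((".", " ")):
        return True
    # Colons are rejected above, so only the part before the first dot
    # matters; strip trailing spaces before comparing with the list.
    base = name.partition(".")[0].rstrip(" ").upper()
    return base in RESERVED_NAMES
```

For example, `is_reserved_name("con . eggs")` is true because the part before the dot is "con " and the trailing space is stripped, while "con spam" stays an ordinary name.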
msg378140 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-06 23:24
> pathlib._WindowsFlavour.is_reserved() fails to reserve names (...)

This issue is about tarfile. Maybe create another issue to enhance the pathlib module?
msg378143 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-10-06 23:48
> This issue is about tarfile. Maybe create another issue to enhance 
> the pathlib module?

IIRC there's already an open issue for that. But in case anyone were to look to pathlib as an example of what should be reserved, I wanted to highlight here how its reserved_names list is incomplete and how its is_reserved() method is insufficient.
msg378152 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-07 08:41
> IIRC there's already an open issue for that.

Ah, I found bpo-27827 "pathlib is_reserved fails for some reserved paths on Windows", open since 2016 (by you ;-)).
msg378154 - (view) Author: Cristi Fati (CristiFati) * Date: 2020-10-07 09:33
As I see things now, there are multiple things (not necessarily related to this issue) to deal with:

1. Update *tarfile* by adding a *\_sanitize\_windows\_name* function (the name can change) that uses *pathlib.\_WindowsFlavour.reserved\_names* (or some public wrapper) and also handles control chars (pointed out by @eryksun), so that it covers as many cases as possible (I'd say all, but there's almost always one that gets away)

2. Fix *pathlib.\_WindowsFlavour.reserved\_names*

3. Apply the fix to *zipfile* as well

4. (optional) extract the sanitizing function into a common module (could be *pathlib*?) to avoid duplicates
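
For item 1, something along these lines could already be done from user code. This is only a sketch under assumptions: `sanitize_member_name` and `sanitized_members` are hypothetical helpers, the character set mirrors the one zipfile replaces, and control characters, device names, and name collisions are deliberately left out to keep it short.

```python
import io
import tarfile

_ILLEGAL = ':<>|"?*'
_TABLE = str.maketrans(_ILLEGAL, "_" * len(_ILLEGAL))


def sanitize_member_name(name):
    """Sanitize each '/'-separated component of a tar member name."""
    parts = (p.translate(_TABLE).rstrip(". ") for p in name.split("/"))
    # Components that become empty (including "." and "..") are dropped.
    return "/".join(p for p in parts if p)


def sanitized_members(tar):
    """Yield the archive's TarInfo objects with sanitized names."""
    for member in tar.getmembers():
        member.name = sanitize_member_name(member.name)
        yield member


def _demo_archive():
    """Build a small in-memory tar with awkward member names."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in ("dir/file:1.txt", "dir/what?.txt"):
            data = b"x"
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=_demo_archive()) as tar:
    demo_names = [m.name for m in sanitized_members(tar)]
```

The generator can be passed to extraction as `tar.extractall(path, members=sanitized_members(tar))`.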
msg378162 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-10-07 14:13
> extract the sanitizing function into a common module 
> (could be *pathlib*?) to avoid duplicates

I would prefer something common, cross-platform, and function-based such as os.path.isreservedname and os.path.sanitizename. In posixpath, it would just have to reserve and sanitize slash [/] and null [\0]. The real work would be in ntpath.
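
The posixpath side of that proposal would be small. A sketch, with hypothetical function names (the proposed os.path.isreservedname and os.path.sanitizename do not exist in the versions discussed here) and an assumed underscore replacement:

```python
def posix_isreservedname(name):
    """Return True if name contains a slash or a null character."""
    return "/" in name or "\0" in name


def posix_sanitizename(name, replacement="_"):
    """Replace slash and null in a single file name component."""
    return name.replace("/", replacement).replace("\0", replacement)
```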
msg378709 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2020-10-16 09:21
zipfile is also impacted by the issue of reserved Windows filenames like "NUL". ZipFile._sanitize_windows_name() does not handle them.
History
Date User Action Args
2020-10-16 09:21:22 vstinner set messages: + msg378709
title: tarfile: handling Windows (path) illegal characters in archive member names -> tarfile and zipfile: handling Windows (path) illegal characters in archive member names
2020-10-07 14:13:23 eryksun set messages: + msg378162
2020-10-07 09:33:38 CristiFati set messages: + msg378154
2020-10-07 08:41:58 vstinner set messages: + msg378152
2020-10-06 23:48:30 eryksun set messages: + msg378143
2020-10-06 23:24:35 vstinner set messages: + msg378140
2020-10-06 14:27:58 eryksun set messages: + msg378126
2020-10-06 11:06:11 vstinner set nosy: + vstinner
messages: + msg378113
2019-04-05 12:59:15 eryksun set nosy: + eryksun
messages: + msg339501
2019-04-05 10:12:42 xtreak set nosy: + lars.gustaebel, paul.moore, tim.golden, zach.ware, steve.dower
components: + Windows
2019-04-05 10:04:31 CristiFati create