classification
Title: tarfile: handling Windows (path) illegal characters in archive member names
Type: enhancement Stage:
Components: Library (Lib), Windows Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: CristiFati, eryksun, lars.gustaebel, paul.moore, steve.dower, tim.golden, zach.ware
Priority: normal Keywords:

Created on 2019-04-05 10:04 by CristiFati, last changed 2019-04-05 12:59 by eryksun.

Messages (2)
msg339486 - (view) Author: Cristi Fati (CristiFati) * Date: 2019-04-05 10:04
Although tar is a *nix-based format (and mostly used there), it is gaining popularity on Windows too.

Since tarfile also runs on Windows, I think it should handle (work around) path incompatibilities, as zipfile (`ZipFile._sanitize_windows_name`) does.

Applies to all branches.

More details on [Tarfile/Zipfile extractall() changing filename of some files](https://stackoverflow.com/questions/55340013/tarfile-zipfile-extractall-changing-filename-of-some-files/55348443#55348443).

The current zipfile handling can also be improved, as it has a small bug: for example, if the archive contains two files named "file:" and "file_", it won't work as expected. But this is a rare corner case.
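
A minimal sketch of that corner case follows. The `sanitize` helper is hypothetical: it only mimics replacing Windows-illegal characters with an underscore and is not zipfile's actual code.

```python
# Hypothetical stand-in for the character replacement step (an assumption,
# not the real zipfile implementation).
ILLEGAL = ':<>|"?*'
TABLE = str.maketrans(ILLEGAL, "_" * len(ILLEGAL))


def sanitize(name):
    """Replace each Windows-illegal character with an underscore."""
    return name.translate(TABLE)


members = ["file:", "file_"]
targets = [sanitize(m) for m in members]

# Both members map to the same on-disk name, so extracting the second
# one would overwrite the first.
collision = len(set(targets)) < len(members)
```

Any fix would presumably need to detect such collisions or use a reversible mapping.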

I didn't prepare a patch, since I did so for another issue (https://bugs.python.org/issue36247, which I consider an ugly one) and it wasn't well received; it was also rejected (for different reasons). If this issue gets the green light from whoever is in charge, I'll be happy to provide one.
msg339501 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2019-04-05 12:59
_sanitize_windows_name() fails to translate the reserved control characters (0x01-0x1F) and backslash in names. 

What I've seen done in some cases (e.g. Unix network shares mapped to SMB) is to translate names using the private use area block, e.g. 0xF001 - 0xF07F. Windows has no problem with characters in this range in a filename. (Displaying these characters sensibly is another matter.) For Windows 10, this is especially useful since the Linux subsystem automatically translates this PUA block back to ASCII when accessing a Windows volume via drvfs. For example:

    C:\Temp\pua>python -q
    >>> import sys
    >>> sys.platform
    'win32'
    >>> name = ''.join(map(chr, range(0xf001, 0xf080)))
    >>> _ = open(name, 'w')
    >>> ^Z

    C:\Temp\pua>bash -c "python3 -q"
    >>> import os, sys
    >>> sys.platform
    'linux'
    >>> os.listdir()
    ['\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f
      \x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
       !"#$%&\'()*+,-./0123456789:;<=>?
      @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_
      `abcdefghijklmnopqrstuvwxyz{|}~\x7f']
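
A rough sketch of how such a translation might look for a single path component. The names `to_pua`, `from_pua`, `PUA_OFFSET` and `RESERVED` are made up for illustration, and the exact set of characters to translate is an assumption:

```python
# Assumption: adding 0xF000 maps ASCII 0x01-0x7F onto the 0xF001-0xF07F
# private use block mentioned above.
PUA_OFFSET = 0xF000

# Assumption: an illustrative set of characters to translate, i.e. the
# control characters plus characters Windows rejects in a name component.
RESERVED = {chr(c) for c in range(0x01, 0x20)} | set('<>:"|?*\\')


def to_pua(component):
    """Translate reserved characters in one path component to the PUA block."""
    return "".join(
        chr(ord(ch) + PUA_OFFSET) if ch in RESERVED else ch
        for ch in component
    )


def from_pua(component):
    """Translate characters in 0xF001-0xF07F back to ASCII."""
    return "".join(
        chr(ord(ch) - PUA_OFFSET) if 0xF001 <= ord(ch) <= 0xF07F else ch
        for ch in component
    )
```

Unlike replacing everything with an underscore, this mapping is reversible, so "file:" and "file_" stay distinct. The round trip only holds if the original name contains no characters from that PUA block itself.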

Also, while _sanitize_windows_name() handles trailing dots, for some reason it overlooks trailing spaces. It also doesn't handle reserved DOS device names. The reserved names include NUL, CON, CONIN$, CONOUT$, AUX, PRN, COM[1-9], LPT[1-9], and these names plus zero or more spaces and possibly a dot or colon and any subsequent characters. For example:

    >>> os.path._getfullpathname('con')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con  ')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con:')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con :')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con : spam')
    '\\\\.\\con'
    >>> os.path._getfullpathname('con . eggs')
    '\\\\.\\con'

It's not a reserved device name if the first character after zero or more spaces is not a dot or colon. For example:

    >>> os.path._getfullpathname('con spam')
    'C:\\con spam'
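
A sketch of a check based on the rule as described above. The function name and the regex are illustrative assumptions; this is not what Windows itself runs, and the actual behavior may differ between Windows versions:

```python
import re

# Reserved base name, then zero or more spaces, then optionally a dot or
# colon followed by anything. Built from the description above; illustrative only.
_RESERVED_DEVICE = re.compile(
    r"(?:NUL|CON|CONIN\$|CONOUT\$|AUX|PRN|COM[1-9]|LPT[1-9])"
    r" *(?:[.:].*)?",
    re.IGNORECASE | re.DOTALL,
)


def is_reserved_device_name(component):
    """Return True if a single path component looks like a DOS device name."""
    return _RESERVED_DEVICE.fullmatch(component) is not None
```

With this pattern, 'con', 'con  ', 'con:', 'con : spam' and 'con . eggs' are all reported as reserved, while 'con spam' is not.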

We can create filenames with reserved device names or trailing spaces and dots by using a \\?\ prefixed path (i.e. a non-normalized device path). However, most programs don't use \\?\ paths, so it's probably better to translate these names.
History
Date                 User        Action  Args
2019-04-05 12:59:15  eryksun     set     nosy: + eryksun; messages: + msg339501
2019-04-05 10:12:42  xtreak      set     nosy: + lars.gustaebel, paul.moore, tim.golden, zach.ware, steve.dower; components: + Windows
2019-04-05 10:04:31  CristiFati  create