msg339486 - (view) |
Author: Cristi Fati (CristiFati) * |
Date: 2019-04-05 10:04 |
Although tar is a *nix-based format (and mostly used there), it is gaining popularity on Windows too.
Since tarfile runs on Windows, I think it should handle (work around) path incompatibilities, as zipfile (`ZipFile._sanitize_windows_name`) does.
Applies to all branches.
More details on [Tarfile/Zipfile extractall() changing filename of some files](https://stackoverflow.com/questions/55340013/tarfile-zipfile-extractall-changing-filename-of-some-files/55348443#55348443).
Regarding the current zipfile handling: it can also be improved, as it has a small bug. For example, if the archive contains 2 files ("file:" and "file_"), both end up with the same sanitized name ("file_"), so extraction won't work as expected. But this is a rare corner case.
I haven't prepared a patch, since I did so for another issue (https://bugs.python.org/issue36247, which I consider an ugly one), and it wasn't well received and was eventually rejected (for different reasons). If this issue gets the green light from whoever is in charge, I'll be happy to provide one.
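The "file:" / "file_" collision can be illustrated with a small sketch. It assumes a sanitizer that replaces each Windows-illegal character with an underscore; `sanitize` and `find_collisions` are hypothetical names used only for illustration, not the zipfile implementation itself.

```python
# Simplified stand-in for an underscore-replacing sanitizer (assumption).
ILLEGAL = ':<>|"?*'
TABLE = str.maketrans(ILLEGAL, "_" * len(ILLEGAL))


def sanitize(name):
    """Replace every Windows-illegal character with an underscore."""
    return name.translate(TABLE)


def find_collisions(names):
    """Group member names that map to the same sanitized name."""
    seen = {}
    for name in names:
        seen.setdefault(sanitize(name), []).append(name)
    # Keep only sanitized names produced by more than one member.
    return {target: srcs for target, srcs in seen.items() if len(srcs) > 1}


if __name__ == "__main__":
    print(find_collisions(["file:", "file_", "other"]))
```

A real fix would need some disambiguation scheme for such names; which one to use is an open design choice.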
|
msg339501 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2019-04-05 12:59 |
_sanitize_windows_name() fails to translate the reserved control characters (0x01-0x1F) and backslash in names.
What I've seen done in some cases (e.g. Unix network shares mapped to SMB) is to translate names using the private use area block, e.g. 0xF001 - 0xF07F. Windows has no problem with characters in this range in a filename. (Displaying these characters sensibly is another matter.) For Windows 10, this is especially useful since the Linux subsystem automatically translates this PUA block back to ASCII when accessing a Windows volume via drvfs. For example:
C:\Temp\pua>python -q
>>> import sys
>>> sys.platform
'win32'
>>> name = ''.join(map(chr, range(0xf001, 0xf080)))
>>> _ = open(name, 'w')
>>> ^Z
C:\Temp\pua>bash -c "python3 -q"
>>> import os, sys
>>> sys.platform
'linux'
>>> os.listdir()
['\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f
\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
!"#$%&\'()*+,-./0123456789:;<=>?
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_
`abcdefghijklmnopqrstuvwxyz{|}~\x7f']
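A minimal sketch of the private-use-area translation described above. The 0xF000 offset is inferred from the 0xF001 - 0xF07F range, and the exact set of characters to translate is an assumption; `to_pua` and `from_pua` are hypothetical helper names.

```python
PUA_BASE = 0xF000  # assumed offset: 0x01-0x7F maps to 0xF001-0xF07F

# Assumed set: control characters 0x01-0x1F plus backslash, colon,
# vertical bar and the wildcard characters.
RESERVED = set(map(chr, range(0x01, 0x20))) | set('\\:*?"<>|')


def to_pua(name):
    """Move reserved characters into the private use area."""
    return "".join(
        chr(PUA_BASE + ord(c)) if c in RESERVED else c for c in name
    )


def from_pua(name):
    """Map characters in 0xF001-0xF07F back to ASCII."""
    return "".join(
        chr(ord(c) - PUA_BASE) if 0xF001 <= ord(c) <= 0xF07F else c
        for c in name
    )


if __name__ == "__main__":
    translated = to_pua("a:b|c")
    print([hex(ord(c)) for c in translated])
    print(from_pua(translated))
```

The mapping is reversible as long as the original name does not already contain characters from that PUA range.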
Also, while _sanitize_windows_name() handles trailing dots, for some reason it overlooks trailing spaces. It also doesn't handle reserved DOS device names. The reserved names include NUL, CON, CONIN$, CONOUT$, AUX, PRN, COM[1-9], LPT[1-9], as well as any of these names followed by zero or more spaces and, optionally, a dot or colon and any subsequent characters. For example:
>>> os.path._getfullpathname('con')
'\\\\.\\con'
>>> os.path._getfullpathname('con ')
'\\\\.\\con'
>>> os.path._getfullpathname('con:')
'\\\\.\\con'
>>> os.path._getfullpathname('con :')
'\\\\.\\con'
>>> os.path._getfullpathname('con : spam')
'\\\\.\\con'
>>> os.path._getfullpathname('con . eggs')
'\\\\.\\con'
It's not a reserved device name if the first character after zero or more spaces is not a dot or colon. For example:
>>> os.path._getfullpathname('con spam')
'C:\\con spam'
We can create filenames with reserved device names or trailing spaces and dots by using a \\?\ prefixed path (i.e. a non-normalized device path). However, most programs don't use \\?\ paths, so it's probably better to translate these names.
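The matching rule described above (a reserved name, then zero or more spaces, then either the end of the name or a dot or colon followed by anything) could be sketched as a regular expression. `is_reserved_device_name` is a hypothetical helper, not an existing API.

```python
import re

# Reserved name, optional spaces, then nothing or a dot/colon plus any tail.
_RESERVED_RE = re.compile(
    r"(NUL|CON|CONIN\$|CONOUT\$|AUX|PRN|COM[1-9]|LPT[1-9]) *([.:].*)?",
    re.IGNORECASE,
)


def is_reserved_device_name(name):
    """Return True if a single path component matches the rule above."""
    return _RESERVED_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("con", "con ", "con:", "con : spam", "con . eggs", "con spam"):
        print(repr(candidate), is_reserved_device_name(candidate))
```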
|
msg378113 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2020-10-06 11:06 |
> Also, while _sanitize_windows_name() handles trailing dots, for some reason it overlooks trailing spaces. It also doesn't handle reserved DOS device names.
The pathlib module has _WindowsFlavour.reserved_names list of Windows
reserved names:
>>> pprint.pprint(sorted(pathlib._WindowsFlavour.reserved_names))
['AUX',
'COM1',
'COM2',
'COM3',
'COM4',
'COM5',
'COM6',
'COM7',
'COM8',
'COM9',
'CON',
'LPT1',
'LPT2',
'LPT3',
'LPT4',
'LPT5',
'LPT6',
'LPT7',
'LPT8',
'LPT9',
'NUL',
'PRN']
|
msg378126 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2020-10-06 14:27 |
> The pathlib module has _WindowsFlavour.reserved_names list of
> Windows reserved names:
pathlib._WindowsFlavour.reserved_names is missing "CONIN$" and "CONOUT$". Prior to Windows 8 these two are reserved as relative names. In Windows 8+, they're also reserved in directories, just like the other reserved device names.
pathlib._WindowsFlavour.is_reserved() fails to reserve names containing ASCII control characters [0-31], vertical bar [|], the file-stream delimiter [:] (i.e. "filename:streamname:streamtype"), and the five wildcard characters [*?"<>]. (Maybe it should allow the file-stream delimiter, but that requires validating that a file stream is proper.) It fails to reserve names that end with a dot or space, which includes UNC and device paths except for \\?\ verbatim paths. It fails to match all reserved base names, which begin with a reserved device name, followed by zero or more spaces, a dot or colon, and zero or more characters. If names that contain colon are already reserved, then this check only has to be modified to strip trailing spaces before comparing against the list of reserved device names.
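A rough sketch of a stricter check along these lines, for a single path component. `is_reserved_component` is a hypothetical function, it treats every colon as reserved (no file-stream validation), and the special handling of "." and ".." is an assumption of this sketch.

```python
RESERVED_NAMES = frozenset(
    {"CON", "PRN", "AUX", "NUL", "CONIN$", "CONOUT$"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)
# ASCII control characters 0-31, vertical bar, colon, and the wildcards.
RESERVED_CHARS = frozenset(map(chr, range(32))) | frozenset('|:*?"<>')


def is_reserved_component(name):
    """Return True if a single path component should be treated as reserved."""
    if name in (".", ".."):
        return False  # assumption: current/parent directory are not flagged
    if any(c in RESERVED_CHARS for c in name):
        return True
    if name.endswith((".", " ")):
        return True
    # Colons are already rejected, so only a dot can end the base name;
    # strip trailing spaces before comparing against the reserved list.
    base = name.partition(".")[0].rstrip(" ")
    return base.upper() in RESERVED_NAMES


if __name__ == "__main__":
    for candidate in ("nul", "con .txt", "a|b", "name.", "con spam", "readme.txt"):
        print(repr(candidate), is_reserved_component(candidate))
```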
|
msg378140 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2020-10-06 23:24 |
> pathlib._WindowsFlavour.is_reserved() fails to reserve names (...)
This issue is about tarfile. Maybe create another issue to enhance the pathlib module?
|
msg378143 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2020-10-06 23:48 |
> This issue is about tarfile. Maybe create another issue to enhance
> the pathlib module?
IIRC there's already an open issue for that. But in case anyone were to look to pathlib as an example of what should be reserved, I wanted to highlight here how its reserved_names list is incomplete and how its is_reserved() method is insufficient.
|
msg378152 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2020-10-07 08:41 |
> IIRC there's already an open issue for that.
Ah, I found bpo-27827 "pathlib is_reserved fails for some reserved paths on Windows", open since 2016 (by you ;-)).
|
msg378154 - (view) |
Author: Cristi Fati (CristiFati) * |
Date: 2020-10-07 09:33 |
As I see things now, there are multiple things (not necessarily related to this issue) to deal with:
1. Update *tarfile* and add *\_sanitize\_windows\_name* (name can change), that uses *pathlib.\_WindowsFlavour.reserved\_names* (or some public wrapper), and also handles control chars (pointed out by @eryksun), so that it covers as many cases as possible (I'd say all, but there's almost always one that gets away)
2. Fix *pathlib.\_WindowsFlavour.reserved\_names*
3. Apply the fix to *zipfile* as well
4. (optional) extract the sanitizing function into a common module (could be *pathlib*?) to avoid duplicates
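Until something like item 1 exists, a caller can approximate it by rewriting member names before extraction. The sketch below is only an illustration under simplifying assumptions: `sanitize_member_name` and `extract_sanitized` are hypothetical helpers, only illegal characters and trailing dots/spaces are handled, and reserved device names and name collisions are ignored.

```python
import io
import os
import tarfile
import tempfile

ILLEGAL = ':<>|"?*'
_TABLE = str.maketrans(ILLEGAL, "_" * len(ILLEGAL))


def sanitize_member_name(name):
    """Replace illegal characters and strip trailing dots/spaces per component."""
    parts = (p.translate(_TABLE).rstrip(". ") for p in name.split("/"))
    return "/".join(p for p in parts if p)


def extract_sanitized(tar, dest):
    """Extract all members under dest, using sanitized names."""
    members = tar.getmembers()
    for member in members:
        member.name = sanitize_member_name(member.name)
    tar.extractall(dest, members=members)


def demo():
    """Build a tiny in-memory archive and extract it with sanitized names."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        data = b"hello"
        info = tarfile.TarInfo("dir/a:b.txt")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    with tempfile.TemporaryDirectory() as dest:
        with tarfile.open(fileobj=buf) as tar:
            extract_sanitized(tar, dest)
        return sorted(os.listdir(os.path.join(dest, "dir")))


if __name__ == "__main__":
    print(demo())
```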
|
msg378162 - (view) |
Author: Eryk Sun (eryksun) * |
Date: 2020-10-07 14:13 |
> extract the sanitizing function into a common module
> (could be *pathlib*?) to avoid duplicates
I would prefer something common, cross-platform, and function-based such as os.path.isreservedname and os.path.sanitizename. In posixpath, it would just have to reserve and sanitize slash [/] and null [\0]. The real work would be in ntpath.
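For the POSIX side, the two functions would be close to trivial. A sketch, with hypothetical names and a hypothetical `replacement` parameter (neither is an existing os.path API):

```python
def posix_isreservedname(name):
    """Return True if the name contains a slash or a null character."""
    return "/" in name or "\0" in name


def posix_sanitizename(name, replacement="_"):
    """Replace slash and null characters in a single file name."""
    return name.replace("/", replacement).replace("\0", replacement)


if __name__ == "__main__":
    print(posix_isreservedname("a/b"), posix_sanitizename("a/b\0c"))
```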
|
msg378709 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2020-10-16 09:21 |
zipfile is also impacted by the issue of reserved Windows filenames like "NUL". ZipFile._sanitize_windows_name() does not handle them.
|
|
Date | User | Action | Args |
2022-04-11 14:59:13 | admin | set | github: 80715 |
2021-03-30 19:37:29 | vstinner | set | nosy: - vstinner |
2021-03-30 18:28:59 | eryksun | set | versions: + Python 3.10, - Python 3.7 |
2020-10-16 09:21:22 | vstinner | set | messages: + msg378709 title: tarfile: handling Windows (path) illegal characters in archive member names -> tarfile and zipfile: handling Windows (path) illegal characters in archive member names |
2020-10-07 14:13:23 | eryksun | set | messages: + msg378162 |
2020-10-07 09:33:38 | CristiFati | set | messages: + msg378154 |
2020-10-07 08:41:58 | vstinner | set | messages: + msg378152 |
2020-10-06 23:48:30 | eryksun | set | messages: + msg378143 |
2020-10-06 23:24:35 | vstinner | set | messages: + msg378140 |
2020-10-06 14:27:58 | eryksun | set | messages: + msg378126 |
2020-10-06 11:06:11 | vstinner | set | nosy: + vstinner messages: + msg378113 |
2019-04-05 12:59:15 | eryksun | set | nosy: + eryksun messages: + msg339501 |
2019-04-05 10:12:42 | xtreak | set | nosy: + lars.gustaebel, paul.moore, tim.golden, zach.ware, steve.dower components: + Windows |
2019-04-05 10:04:31 | CristiFati | create | |