classification
Title: tarfile doesn't support undecodable filename in PAX format
Type: Stage:
Components: Library (Lib) Versions: Python 3.2
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: lars.gustaebel Nosy List: lars.gustaebel, loewis, vstinner
Priority: normal Keywords:

Created on 2010-05-05 22:14 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
z-pax.tar vstinner, 2010-05-05 22:14
Messages (6)
msg105094 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-05-05 22:14
tarfile is unable to open a TAR archive in PAX format that embeds invalid filenames (a filename not encoded in UTF-8, i.e. an undecodable filename). The attached file is an example (it contains the file b'z/\xff', which is not decodable from UTF-8).

The PAX specification has an "invalid" option with 4 values: bypass (default), rename, UTF-8, write.
http://www.opengroup.org/onlinepubs/009695399/utilities/pax.html

As was done for the other formats in issue #8390, PAX can use Python's surrogateescape error handler to store undecodable bytes as Unicode surrogates.

I think that PAX should be strict by default, but have an option to enable surrogateescape mode.
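
A minimal sketch of the difference between the two modes, using only the built-in bytes.decode() and str.encode() methods on the filename from the attachment (this only illustrates the error handler, it is not tarfile code):

```python
# The raw filename stored in z-pax.tar, as given in this report.
raw = b'z/\xff'

# Strict decoding (the proposed default) rejects the invalid byte.
try:
    raw.decode('utf-8')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# surrogateescape maps each undecodable byte 0xNN to the lone
# surrogate U+DCNN, so the original bytes can be recovered later.
name = raw.decode('utf-8', 'surrogateescape')
roundtrip = name.encode('utf-8', 'surrogateescape')
```

With surrogateescape the name can be carried around as a str and converted back to the exact original bytes when the file is extracted.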
msg105102 - Author: Martin v. Löwis (loewis) (Python committer) Date: 2010-05-05 22:32
I think you are misinterpreting the spec. A PAX file MUST encode its file names in UTF-8. The "invalid" flag only applies when the names stored in the archive cannot be mapped to file names, either because they are not supported in the locale, or because they are not supported by the file system on which you want to extract the files (e.g. if they contain a colon ':' and you try to extract to a FAT filesystem).

The case that the file names are not actually in UTF-8 in the PAX file is a format error, just like any other format error in the file.
msg105104 - Author: STINNER Victor (vstinner) (Python committer) Date: 2010-05-05 22:37
I didn't read the whole spec; I only skimmed the description of the "invalid" option.

The idea behind this issue is to be able to read a file generated by GNU tar, which keeps the filename unchanged if it cannot be converted to UTF-8. (The z-pax.tar attachment was generated by GNU tar.)

See also msg105085.
msg105135 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2010-05-06 11:25
Victor, you misunderstood the pax definition, but it wouldn't harm tarfile if it knew how to handle these corrupt GNU tar archives. In the meantime I filed a bug report with bug-tar@gnu.org about this.

I said in msg105085 that POSIX gives no advice on how to handle broken filename encodings, but POSIX:2008 does. libarchive (bsdtar) uses the approach described there. The solution is to use a field called "hdrcharset".

See http://www.opengroup.org/onlinepubs/9699919799/utilities/pax.html
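
As a rough illustration of how a reader could honour that field: pax extended header records have the form "<length> <keyword>=<value>\n", where <length> counts the whole record, and POSIX:2008 defines hdrcharset=BINARY to mark values that are not guaranteed to be UTF-8. The helper names parse_pax_records and decode_path below are hypothetical, and the choice to decode BINARY values with surrogateescape is an assumption made for this sketch, not necessarily what tarfile or libarchive do.

```python
def parse_pax_records(buf):
    """Split a pax extended header block into {keyword: raw value bytes}."""
    records = {}
    pos = 0
    while pos < len(buf):
        space = buf.index(b' ', pos)
        length = int(buf[pos:space])            # length of the whole record
        record = buf[space + 1:pos + length - 1]  # drop the trailing newline
        keyword, _, value = record.partition(b'=')
        records[keyword.decode('ascii')] = value
        pos += length
    return records


def decode_path(records):
    """Decode the 'path' record, strictly unless hdrcharset=BINARY is set."""
    if records.get('hdrcharset') == b'BINARY':
        # Assumption for this sketch: keep the raw bytes recoverable.
        return records['path'].decode('utf-8', 'surrogateescape')
    return records['path'].decode('utf-8')


# Hand-built header for the file name from this report.
header = b'21 hdrcharset=BINARY\n12 path=z/\xff\n'
records = parse_pax_records(header)
path = decode_path(records)
```

Without the hdrcharset record, decode_path would raise UnicodeDecodeError for the same path, which matches the view that such an archive is a format error.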
msg105141 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2010-05-06 12:28
I am currently working on a patch to let tarfile use the hdrcharset field.
msg105924 - Author: Lars Gustäbel (lars.gustaebel) (Python committer) Date: 2010-05-17 18:04
I added support for the hdrcharset method and a workaround for the GNU tar bug, see r81273.
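
For readers who want to try this, here is a small sketch that round-trips an undecodable name through a pax archive in memory. It assumes a current CPython, where tarfile.open() accepts the encoding and errors arguments; it does not claim to reproduce exactly what r81273 changed, and it writes its own archive rather than using the z-pax.tar attachment.

```python
import io
import tarfile

# The undecodable name from this report, carried as a str via surrogateescape.
undecodable = b'z/\xff'.decode('utf-8', 'surrogateescape')

# Write a pax archive containing one empty member with that name.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode='w', format=tarfile.PAX_FORMAT,
                  encoding='utf-8', errors='surrogateescape') as tar:
    tar.addfile(tarfile.TarInfo(undecodable))
raw_archive = buf.getvalue()

# Read it back with the same encoding and error handler.
buf.seek(0)
with tarfile.open(fileobj=buf, mode='r',
                  encoding='utf-8', errors='surrogateescape') as tar:
    names = tar.getnames()
```

Encoding names[0] back with surrogateescape should give the original bytes b'z/\xff', which is what an extractor needs in order to create the file under its real name.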
History
Date User Action Args
2022-04-11 14:57:00 admin set github: 52879
2010-05-17 18:04:56 lars.gustaebel set status: open -> closed; resolution: accepted; messages: + msg105924
2010-05-16 07:12:38 lars.gustaebel set assignee: lars.gustaebel
2010-05-06 12:28:42 lars.gustaebel set messages: + msg105141
2010-05-06 11:25:48 lars.gustaebel set messages: + msg105135
2010-05-05 22:37:16 vstinner set messages: + msg105104
2010-05-05 22:32:25 loewis set messages: + msg105102
2010-05-05 22:14:49 vstinner create