tarfile: use surrogates for undecode fields #52637
Comments
When reading a tar archive, tarfile decodes fields using the "replace" error handler by default. As a result, we lose information whenever a field contains an undecodable character. Since PEP 383, undecodable filenames are stored using surrogates in Python 3. I think it's a good idea to use surrogates for tar as well, because undecodable data in a tar archive is a common problem (see the Unicode section of the tarfile documentation).
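To illustrate the difference between the two error handlers, here is a minimal sketch using plain `bytes.decode`. The sample filename `caf\xe9.txt` is made up for the example (Latin-1 bytes that are not valid UTF-8); it is not taken from the patch.

```python
# Hypothetical sample: a Latin-1 encoded name that is not valid UTF-8.
raw = b"caf\xe9.txt"

# "replace" substitutes U+FFFD for the bad byte, so the original byte is gone.
lossy = raw.decode("utf-8", "replace")

# "surrogateescape" (PEP 383) maps the bad byte 0xE9 to the lone surrogate U+DCE9.
escaped = raw.decode("utf-8", "surrogateescape")

print(ascii(lossy))
print(ascii(escaped))

# Encoding with the same handler restores the original bytes exactly.
restored = escaped.encode("utf-8", "surrogateescape")
print(restored == raw)

# The "replace" result cannot be turned back into the original bytes.
print(lossy.encode("utf-8") == raw)
```

With "replace" the undecodable byte cannot be recovered, while the surrogate form round-trips back to the original bytes.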
lars: Do you have an opinion about this suggestion?
Yes, I will soon have ;-) Please give me a few days...
A better fix might be to store fields as bytes, but that would break compatibility, and Unicode is practical in Python 3.
I think it is helpful to read the pax specification here: http://www.opengroup.org/onlinepubs/009695399/utilities/pax.html pax defines (IIUC) that all strings in a pax-compliant tar file are UTF-8 encoded. For the "invalid" option, they offer the alternatives bypass, rename, UTF-8, and write. It may be useful to provide the same options, in some form.
My patch changes test_uname_unicode() of test_tarfile for the GNU and ustar formats (but not PAX). In GNU and ustar formats, the fields can be encoded in any encoding, and may contain invalid byte sequences.
I think it is a good suggestion to use "surrogateescape" as the default, because (I hope) it produces the fewest errors and is the best choice if tarfile is used in connection with Python's filesystem calls.
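As a rough sketch of what this means in practice, the snippet below builds a small ustar archive in memory and reads it back with both error handlers, using the `encoding` and `errors` arguments of `tarfile.open`. The member name and the helper functions `build_archive` and `read_names` are invented for illustration and are not part of the patch.

```python
import io
import tarfile

# Hypothetical member name: Latin-1 bytes that are not valid UTF-8.
raw_name = b"caf\xe9.txt"


def build_archive() -> bytes:
    """Write a ustar archive whose single member has an undecodable name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT,
                      encoding="utf-8", errors="surrogateescape") as tar:
        # The surrogate-escaped name is encoded back to the raw bytes on write.
        info = tarfile.TarInfo(raw_name.decode("utf-8", "surrogateescape"))
        tar.addfile(info)  # empty regular file, header only
    return buf.getvalue()


def read_names(data: bytes, errors: str) -> list:
    """Return the member names decoded as UTF-8 with the given error handler."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r",
                      encoding="utf-8", errors=errors) as tar:
        return tar.getnames()


archive = build_archive()
name_replaced = read_names(archive, "replace")[0]
name_escaped = read_names(archive, "surrogateescape")[0]

print(ascii(name_replaced))
print(ascii(name_escaped))

# Only the surrogate-escaped name can be mapped back to the stored bytes.
print(name_escaped.encode("utf-8", "surrogateescape") == raw_name)
```

Because the surrogate-escaped name encodes back to the stored bytes, it can be passed on to filesystem calls that use the same convention.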
@martin: As I understand it, the pax "invalid" option is supposed to deal with the case where strings from a pax header are not representable in the user's encoding. In tarfile's case, we don't have this problem when reading the archive until we try to extract it. Unfortunately, POSIX says nothing about how to store bad filenames in a pax archive. tarfile raises an error. GNU tar fails silently: it just puts the unchanged original filename into the pax header without converting it to UTF-8, thus violating the standard.
Thank you for your review. I committed the patch as r80824 (I fixed the documentation, :versionadded => :versionchanged), blocked as r80825 (3.2).
Right. I opened a new issue about that: bpo-8333. I consider it a different problem.