Message 105085 - Python tracker



Author	lars.gustaebel
Recipients	lars.gustaebel, loewis, vstinner
Date	2010-05-05.20:23:46
Message-id	<1273091033.25.0.390040104355.issue8390@psf.upfronthosting.co.za>
In-reply-to

Content

I think it is a good suggestion to use "surrogateescape" as the default, because (I hope) it produces the fewest errors and is the best choice if tarfile is used in connection with Python's filesystem calls.

- When reading tar headers, undecodable chars in filenames end up as surrogates. This way no information is lost. In principle tarfile is merely a gateway to a filesystem inside an archive, so it feels natural if it treats filenames the same as Python's filesystem calls.

- When writing tar headers, filenames with surrogate chars (e.g. from os.listdir()) will be converted back to bytes in the header (in case of gnu and ustar formats). Filenames will remain unchanged, which is exactly what one would expect.

- When writing pax headers, filenames with surrogates will raise a UnicodeError because we may only use strict UTF-8 inside a pax header. This is actually no different from the status quo.
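
To illustrate the three points above, here is a minimal sketch. It is not tarfile's actual implementation; the helper names and the sample filename are made up for the example. It shows the codec behaviour at the string level and then a round trip through an in-memory archive in GNU format, with encoding and errors passed explicitly so the result does not depend on the default under discussion.

```python
import io
import tarfile

# Hypothetical example name: Latin-1 bytes that are not valid UTF-8.
RAW_NAME = b"caf\xe9.txt"


def decode_name(raw: bytes) -> str:
    """Decode header bytes; undecodable bytes become lone surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def encode_name(name: str) -> bytes:
    """Encode a filename; surrogates are turned back into the original bytes."""
    return name.encode("utf-8", "surrogateescape")


def is_strict_utf8(name: str) -> bool:
    """Return True if the name can be encoded as strict UTF-8."""
    try:
        name.encode("utf-8", "strict")
    except UnicodeEncodeError:
        return False
    return True


def gnu_roundtrip(name: str):
    """Write one empty member in GNU format and read its name back.

    Returns the leading bytes of the name field of the first header block
    and the member name as tarfile reports it after reading.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                      encoding="utf-8", errors="surrogateescape") as tar:
        tar.addfile(tarfile.TarInfo(name))
    data = buf.getvalue()
    with tarfile.open(fileobj=io.BytesIO(data), mode="r",
                      encoding="utf-8", errors="surrogateescape") as tar:
        stored = tar.getnames()[0]
    # The name field starts at offset 0 of the header block.
    return data[:len(encode_name(name))], stored


name = decode_name(RAW_NAME)          # contains the surrogate U+DCE9
header_bytes, stored_name = gnu_roundtrip(name)
print(ascii(name), header_bytes, ascii(stored_name), is_strict_utf8(name))
```

The name survives the round trip byte for byte, while the strict UTF-8 check fails for the same string, which is the situation a pax header cannot represent.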

@Martin: As I understand it, the pax "invalid"-option is supposed to deal with the case when strings from a pax header are not representable in the user's encoding. In tarfile's case we don't have this problem when reading the archive until we try to extract it.

Unfortunately, POSIX says nothing about how to store bad filenames in a pax archive. tarfile raises an error. GNU tar fails silently: it just puts the unchanged original filename into the pax header without converting it to UTF-8, thus violating the standard.

History
Date	User	Action	Args
2010-05-05 20:23:53	lars.gustaebel	set	recipients: + lars.gustaebel, loewis, vstinner
2010-05-05 20:23:53	lars.gustaebel	set	messageid: <1273091033.25.0.390040104355.issue8390@psf.upfronthosting.co.za>
2010-05-05 20:23:51	lars.gustaebel	link	issue8390 messages
2010-05-05 20:23:47	lars.gustaebel	create