Issue 7693: tarfile.extractall can't have unicode extraction path



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/51942

classification

Title:	tarfile.extractall can't have unicode extraction path
Type:	behavior	Stage:	test needed
Components:	Library (Lib)	Versions:	Python 2.7, Python 2.6

process

Status:	closed	Resolution:	works for me
Dependencies:		Superseder:
Assigned To:	lars.gustaebel	Nosy List:	ezio.melotti, lars.gustaebel, pbienst
Priority:	normal	Keywords:

Created on 2010-01-13 14:08 by pbienst, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (11)
msg97717 - (view)	Author: Peter Bienstman (pbienst)	Date: 2010-01-13 14:08
    import tarfile

    fname = unichr(40960) + u"a.ogg"
    f = file(fname, "w")
    f.write("A")
    f.close()

    tar_pipe = tarfile.open("test.tar", mode="w|", format=tarfile.PAX_FORMAT)
    tar_pipe.add(fname)
    tar_pipe.close()

    tar_pipe = tarfile.open("test.tar")
    tar_pipe.extractall(u".") # Just "." as string works fine.

This gives:

    Traceback (most recent call last):
      File "a.py", line 15, in <module>
        tar_pipe.extractall(u".") # Just "." as string works fine.
      File "/usr/lib/python2.6/tarfile.py", line 2031, in extractall
        self.extract(tarinfo, path)
      File "/usr/lib/python2.6/tarfile.py", line 2068, in extract
        self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
      File "/usr/lib/python2.6/posixpath.py", line 70, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 1: ordinal not in range(128)
msg97746 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-14 01:15
When test.tar is opened, the filename is read as a string, so when os.path.join() is called in self._extract_member(tarinfo, os.path.join(path, tarinfo.name)), path is u'.' and tarinfo.name is '\xea\x80\x80a.ogg'. tarinfo.name is a byte string, so in os.path.join it is converted implicitly to Unicode using the ascii codec because the path is unicode and since it contains non-ascii chars the error is raised.
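
The implicit decode described above can be reproduced by hand. This is only a sketch, written for Python 3, where the decode has to be spelled out because os.path.join refuses to mix str and bytes rather than converting silently. The variable names are illustrative and do not come from tarfile.

```python
# The member name as Python 2's tarfile holds it: unichr(40960) is U+A000,
# and its UTF-8 encoding is the three bytes EA 80 80.
name = u"\ua000a.ogg".encode("utf-8")

# posixpath.join evaluates  path += '/' + b.  With a unicode path, Python 2
# decodes the byte string '/' + b using the ascii codec behind the scenes.
# Here the same decode is done explicitly.
joined_tail = b"/" + name
try:
    joined_tail.decode("ascii")
except UnicodeDecodeError as exc:
    failure = exc  # keep a reference; "exc" is unbound after the except block

# The offending byte sits at index 1, right after the '/' separator, which
# matches "byte 0xea in position 1" in the reported traceback.
print(failure.start, hex(joined_tail[failure.start]))
```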
msg97803 - (view)	Author: Lars Gustäbel (lars.gustaebel) *	Date: 2010-01-15 10:05
In the 2.x branch tarfile is not prepared to deal with unicode pathnames at all. This changed in Python 3. The fact that it works anyway (in the majority of cases) to add filenames as unicode objects is pure coincidence - I suppose you have a utf-8 system encoding. On a latin-1 system your script would fail much earlier during the add() call. Some reading: http://docs.python.org/library/tarfile.html#unicode-issues
msg97804 - (view)	Author: Peter Bienstman (pbienst)	Date: 2010-01-15 10:31
So what do you suggest then as the best approach if I want to use unicode paths in tar files in Python 2.x in a way that is portable across different systems?
msg97805 - (view)	Author: Lars Gustäbel (lars.gustaebel) *	Date: 2010-01-15 10:51
First, use a string pathname for extractall(). Most likely, your script is going to work. Convert all pathnames to strings using sys.getfilesystemencoding() before you add() them. Ensure that all systems you are going to use the archives on have the same filesystem encoding, e.g. utf-8. Pax archives are probably the best choice if you plan to keep the archives for several years. If you simply want to transfer data from one system to the other throwing the archives away afterwards, the format is rather irrelevant.
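
One possible way to follow this advice is a small conversion helper applied to every pathname before it reaches tarfile. The function name to_fs_bytes is hypothetical and not part of tarfile; the sketch assumes the filesystem encoding can represent the name, and it is written so that it runs on both Python 2 and Python 3.

```python
import sys


def to_fs_bytes(pathname):
    """Hypothetical helper: return pathname as a byte string.

    Unicode input is encoded with the filesystem encoding; byte strings
    are passed through unchanged.
    """
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    if isinstance(pathname, text_type):
        # Assumption: fall back to the default encoding if the platform
        # does not report a filesystem encoding.
        encoding = sys.getfilesystemencoding() or sys.getdefaultencoding()
        return pathname.encode(encoding)
    return pathname
```

On Python 2 the result would then be passed to add() and extractall() in place of the unicode object, for example tar_pipe.add(to_fs_bytes(fname)).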
msg97807 - (view)	Author: Peter Bienstman (pbienst)	Date: 2010-01-15 12:26
On Friday 15 January 2010 11:51:24 am Lars Gustäbel wrote:
> Lars Gustäbel <lars@gustaebel.de> added the comment:
>
> First, use a string pathname for extractall(). Most likely, your script is
> going to work. Convert all pathnames to strings using
> sys.getfilesystemencoding() before you add() them. Ensure that all systems
> you are going to use the archives on have the same filesystem encoding,
> e.g. utf-8.

Unfortunately, that is beyond my control. Am I then totally out of luck? Would the implementation of tarfile in 3.0 be useable on 2.6 (perhaps with small modifications?)

> Pax archives are probably the best choice if you plan to keep
> the archives for several years. If you simply want to transfer data from
> one system to the other throwing the archives away afterwards, the format
> is rather irrelevant.

The archives are throw-away, transfer only, but they could be used on any system.
msg97810 - (view)	Author: Lars Gustäbel (lars.gustaebel) *	Date: 2010-01-15 13:14
I suppose you do not have a real problem here. I thought your problem was that you want to use unicode pathnames as input and output to tarfile. You don't need that. You want to transfer an archive from one system to another. You can do that with tarfile already. Python 3.x's tarfile does the same as Python 2.x's tarfile, except that in 3.x all strings are unicode strings. If you have different encodings on these systems, that should not be a problem unless these encodings are not compatible with each other. If you want to use a tar archive created on a utf-8 system on an iso-8859-1 system that is no problem, as long as you use the pax format and all the utf-8 characters used are also valid iso-8859-1 characters.
msg97811 - (view)	Author: Peter Bienstman (pbienst)	Date: 2010-01-15 13:36
On Friday 15 January 2010 02:14:30 pm Lars Gustäbel wrote:
> Lars Gustäbel <lars@gustaebel.de> added the comment:
>
> I suppose you do not have a real problem here. I thought your problem was
> that you want to use unicode pathnames as input and output to tarfile. You
> don't need that.
>
> You want to transfer an archive from one system to another. You can do that
> with tarfile already. Python 3.x's tarfile does the same as Python 2.x's
> tarfile, except that in 3.x all strings are unicode strings.
>
> If you have different encodings on these systems, that should not be a
> problem unless these encodings are not compatible with each other. If you
> want to use a tar archive created on a utf-8 system on a iso-8859-1 system
> that is no problem, as long as you use the pax format and all the utf-8
> characters used are also valid iso-8859-1 characters.

I think I do have a problem. I want to create a tar archive on one system, where the filenames could contain non latin characters. I'm sending this tar file over a socket to a different system (with potentially a different encoding), where I want to extract it to a directory which name could contain non-latin characters.
msg97875 - (view)	Author: Lars Gustäbel (lars.gustaebel) *	Date: 2010-01-16 13:07
So, use the pax format. It stores the filenames as utf-8 and this way you will be on the safe side. I hope we both agree that the solution to your particular problem is nothing tarfile.py can provide. So, I am going to close this issue now.
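
The claim that pax stores filenames as utf-8 can be checked with a short experiment. This sketch targets Python 3's tarfile (the thread itself is about 2.x) and builds the archive in memory; the names member_name, payload and buf are illustrative only.

```python
import io
import tarfile

member_name = u"\ua000a.ogg"  # same name as in the original report
payload = b"A"

# Write a pax archive with one member into an in-memory buffer.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    info = tarfile.TarInfo(name=member_name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

# A non-ASCII name goes into a pax extended header record of the form
# "<length> path=<value>\n", with the value encoded as UTF-8.
raw = buf.getvalue()
has_utf8_path_record = (b"path=" + member_name.encode("utf-8")) in raw

# Reading the archive back yields the original name.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()

print(has_utf8_path_record, names == [member_name])
```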
msg97887 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2010-01-16 17:59
Lars, I think the situation can still be improved. If tarfile works with bytes strings it should accept only bytes strings or unicode strings that can be encoded in ASCII, and encode them as soon as it gets them. In the problem reported by Peter, he was passing u"." that is a unicode ASCII-only string. Later in the program this string gets mixed with a byte string and this causes an implicit decoding, i.e. it turns the byte strings to unicode (and possibly fails if the filename is non-ASCII). Even if the decoding succeeds, eventually tarfile will have to convert the unicode string to a byte string again. A better approach would be to encode using the ASCII codec all the unicode strings that are passed. If the unicode strings are ASCII-only (like the u"." Peter was passing), they can be encoded without problems. When they get mixed with other strings they are all bytes strings so no implicit decoding happens. If the unicode strings are non-ASCII, the encoding will fail immediately and warn the user that he will have to encode the unicode string before passing it to the function.
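
A minimal sketch of the approach proposed above, assuming it were applied at the point where tarfile receives a path argument. The function coerce_to_bytes is hypothetical; nothing like it exists in tarfile, and the code is written to run on both Python 2 and Python 3.

```python
def coerce_to_bytes(path):
    """Hypothetical helper illustrating the proposal; not a tarfile API.

    ASCII-only unicode strings become byte strings immediately, so they
    never get mixed with byte strings later. Non-ASCII unicode strings
    fail here, at the call site, with UnicodeEncodeError.
    """
    text_type = type(u"")  # unicode on Python 2, str on Python 3
    if isinstance(path, text_type):
        return path.encode("ascii")
    return path  # already a byte string, leave untouched


# u"." is ASCII-only, so it is accepted and converted.
dot = coerce_to_bytes(u".")

# A non-ASCII path is rejected up front instead of failing deep inside
# os.path.join.
try:
    coerce_to_bytes(u"\ua000dir")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```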
msg97932 - (view)	Author: Peter Bienstman (pbienst)	Date: 2010-01-17 08:31
> Lars Gustäbel <lars@gustaebel.de> added the comment:
>
> So, use the pax format. It stores the filenames as utf-8 and this way you
> will be on the safe side.
>
> I hope we both agree that the solution to your particular problem is
> nothing tarfile.py can provide.

If I want to extract a pax archive to a unicode path with non-latin characters, how should I encode the path before passing it to 'extractall'? would utf-8 be OK?

Peter

History
Date	User	Action	Args
2022-04-11 14:56:56	admin	set	github: 51942
2010-01-17 08:31:33	pbienst	set	messages: + msg97932
2010-01-16 17:59:04	ezio.melotti	set	messages: + msg97887
2010-01-16 13:07:41	lars.gustaebel	set	status: open -> closed resolution: works for me messages: + msg97875
2010-01-15 13:36:45	pbienst	set	messages: + msg97811
2010-01-15 13:14:28	lars.gustaebel	set	messages: + msg97810
2010-01-15 12:26:18	pbienst	set	messages: + msg97807
2010-01-15 10:51:23	lars.gustaebel	set	messages: + msg97805
2010-01-15 10:31:49	pbienst	set	messages: + msg97804
2010-01-15 10:05:02	lars.gustaebel	set	messages: + msg97803
2010-01-14 01:15:36	ezio.melotti	set	messages: + msg97746
2010-01-13 21:38:55	lars.gustaebel	set	assignee: lars.gustaebel nosy: + lars.gustaebel
2010-01-13 21:27:50	ezio.melotti	set	nosy: + ezio.melotti versions: + Python 2.7 priority: normal components: + Library (Lib), - Extension Modules type: crash -> behavior stage: test needed
2010-01-13 14:08:22	pbienst	create