classification
Title: tarfile.extractall can't have unicode extraction path
Type: behavior
Stage: test needed
Components: Library (Lib)
Versions: Python 2.7, Python 2.6
process
Status: closed
Resolution: works for me
Dependencies:
Superseder:
Assigned To: lars.gustaebel
Nosy List: ezio.melotti, lars.gustaebel, pbienst
Priority: normal
Keywords:

Created on 2010-01-13 14:08 by pbienst, last changed 2010-01-17 08:31 by pbienst. This issue is now closed.

Messages (11)
msg97717 - Author: Peter Bienstman (pbienst) Date: 2010-01-13 14:08
import tarfile

fname = unichr(40960) + u"a.ogg"  # unichr(40960) is U+A000, a non-ASCII character

f = file(fname, "w")
f.write("A")
f.close()
        
tar_pipe = tarfile.open("test.tar", mode="w|",
    format=tarfile.PAX_FORMAT)
tar_pipe.add(fname)
tar_pipe.close()

tar_pipe = tarfile.open("test.tar")
tar_pipe.extractall(u".") # Just "." as string works fine.

This gives:

Traceback (most recent call last):
  File "a.py", line 15, in <module>
    tar_pipe.extractall(u".") # Just "." as string works fine.
  File "/usr/lib/python2.6/tarfile.py", line 2031, in extractall
    self.extract(tarinfo, path)
  File "/usr/lib/python2.6/tarfile.py", line 2068, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "/usr/lib/python2.6/posixpath.py", line 70, in join
    path += '/' + b
UnicodeDecodeError: 'ascii' codec can't decode byte 0xea in position 1: ordinal not in range(128)
msg97746 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-01-14 01:15
When test.tar is opened, the filename is read as a byte string. So when self._extract_member(tarinfo, os.path.join(path, tarinfo.name)) is called, path is u'.' and tarinfo.name is '\xea\x80\x80a.ogg'.
Because path is unicode, os.path.join() implicitly decodes the byte string tarinfo.name to unicode using the ascii codec. The name contains non-ASCII bytes, so the decoding fails and the error is raised.
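The implicit decoding described above can be made explicit in a short sketch. Python 3 refuses to mix str and bytes, so the sketch calls .decode("ascii") by hand to mimic what Python 2 does on its own inside posixpath.join(); the variable names are illustrative and not taken from tarfile.

```python
# Sketch: reproduce by hand the implicit ASCII decode that Python 2 performs
# when posixpath.join() evaluates `path += '/' + b` with a unicode path.
name = u"\ua000a.ogg".encode("utf-8")    # the byte string held in tarinfo.name
assert name == b"\xea\x80\x80a.ogg"

joined = b"/" + name                      # '/' + b is still a byte string
try:
    # Python 2 decodes `joined` with the ascii codec before appending it
    # to the unicode path; here the decode is spelled out.
    u"." + joined.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(error)
```

The failing byte 0xea sits at position 1 because the '/' separator is prepended before the decode, which matches the position reported in the traceback.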
msg97803 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2010-01-15 10:05
In the 2.x branch tarfile is not prepared to deal with unicode pathnames at all. This changed in Python 3. The fact that it works anyway (in the majority of cases) to add filenames as unicode objects is pure coincidence - I suppose you have a utf-8 system encoding. On a latin-1 system your script would fail much earlier during the add() call.

Some reading: http://docs.python.org/library/tarfile.html#unicode-issues
msg97804 - Author: Peter Bienstman (pbienst) Date: 2010-01-15 10:31
So what do you suggest then as the best approach if I want to use unicode paths in tar files in Python 2.x in a way that is portable across different systems?
msg97805 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2010-01-15 10:51
First, use a string pathname for extractall(). Most likely, your script is going to work. Convert all pathnames to strings using sys.getfilesystemencoding() before you add() them. Ensure that all systems you are going to use the archives on have the same filesystem encoding, e.g. utf-8. Pax archives are probably the best choice if you plan to keep the archives for several years. If you simply want to transfer data from one system to the other throwing the archives away afterwards, the format is rather irrelevant.
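The conversion step suggested above might look like the following sketch. The helper name to_fs_bytes and the fallback to utf-8 when no filesystem encoding is reported are assumptions made for illustration; neither is part of tarfile.

```python
import sys

def to_fs_bytes(path, encoding=None):
    """Return path as a byte string (hypothetical helper, not a tarfile API)."""
    if encoding is None:
        # Assumption: fall back to utf-8 if the platform reports no encoding.
        encoding = sys.getfilesystemencoding() or "utf-8"
    if isinstance(path, bytes):
        return path                       # already a byte string: leave untouched
    # Raises UnicodeEncodeError if the name cannot be represented.
    return path.encode(encoding)

# An explicit encoding keeps this example independent of the machine it runs on.
encoded = to_fs_bytes(u"\ua000a.ogg", encoding="utf-8")
```

In Python 2 the result would then be handed to add(), for example tar_pipe.add(to_fs_bytes(fname)), so that tarfile only ever sees byte strings.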
msg97807 - Author: Peter Bienstman (pbienst) Date: 2010-01-15 12:26
On Friday 15 January 2010 11:51:24 am Lars Gustäbel wrote:
> Lars Gustäbel <lars@gustaebel.de> added the comment:
> 
> First, use a string pathname for extractall(). Most likely, your script is
>  going to work. Convert all pathnames to strings using
>  sys.getfilesystemencoding() before you add() them. Ensure that all systems
>  you are going to use the archives on have the same filesystem encoding,
>  e.g. utf-8. 

Unfortunately, that is beyond my control. Am I then totally out of luck? Would 
the implementation of tarfile in 3.0 be usable on 2.6 (perhaps with small 
modifications)?

>  Pax archives are probably the best choice if you plan to keep
>  the archives for several years. If you simply want to transfer data from
>  one system to the other throwing the archives away afterwards, the format
>  is rather irrelevant.

The archives are throw-away, transfer only, but they could be used on any 
system.
msg97810 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2010-01-15 13:14
I suppose you do not have a real problem here. I thought your problem was that you want to use unicode pathnames as input and output to tarfile. You don't need that.

You want to transfer an archive from one system to another. You can do that with tarfile already. Python 3.x's tarfile does the same as Python 2.x's tarfile, except that in 3.x *all* strings are unicode strings.

If you have different encodings on these systems, that should not be a problem unless these encodings are not compatible with each other. If you want to use a tar archive created on a utf-8 system on an iso-8859-1 system, that is no problem as long as you use the pax format and all the utf-8 characters used are also valid iso-8859-1 characters.
msg97811 - Author: Peter Bienstman (pbienst) Date: 2010-01-15 13:36
On Friday 15 January 2010 02:14:30 pm Lars Gustäbel wrote:
> Lars Gustäbel <lars@gustaebel.de> added the comment:
> 
> I suppose you do not have a real problem here. I thought your problem was
>  that you want to use unicode pathnames as input and output to tarfile. You
>  don't need that.
> 
> You want to transfer an archive from one system to another. You can do that
>  with tarfile already. Python 3.x's tarfile does the same as Python 2.x's
>  tarfile, except that in 3.x *all* strings are unicode strings.
> 
> If you have different encodings on these systems, that should not be a
>  problem unless these encodings are not compatible with each other. If you
>  want to use a tar archive created on a utf-8 system on an iso-8859-1 system
>  that is no problem, as long as you use the pax format and all the utf-8
>  characters used are also valid iso-8859-1 characters.

I think I *do* have a problem. I want to create a tar archive on one system, 
where the filenames could contain non-Latin characters. I'm sending this tar 
file over a socket to a different system (with potentially a different encoding), 
where I want to extract it to a directory whose name could contain non-Latin 
characters.
msg97875 - Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2010-01-16 13:07
So, use the pax format. It stores the filenames as utf-8 and this way you will be on the safe side.
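The claim that pax stores filenames as utf-8 can be checked with a small round trip. This is a sketch written for Python 3's tarfile, not the 2.x module discussed in this issue, and it builds the archive in memory rather than on disk.

```python
import io
import tarfile

name = "\ua000a.ogg"                      # same non-ASCII name as in the report
payload = b"A"

# Write a pax archive into an in-memory buffer.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.PAX_FORMAT) as tar:
    info = tarfile.TarInfo(name=name)
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

raw = buf.getvalue()
# The pax extended header carries the name as a UTF-8 encoded "path" record.
has_utf8_path = (b"path=" + name.encode("utf-8")) in raw

# Reading the archive back recovers the original name.
buf.seek(0)
with tarfile.open(fileobj=buf, mode="r") as tar:
    names = tar.getnames()
```

Because the name travels in the extended header as utf-8, the reader does not depend on the writer's filesystem encoding to recover it.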

I hope we both agree that the solution to your particular problem is nothing tarfile.py can provide. So, I am going to close this issue now.
msg97887 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2010-01-16 17:59
Lars, I think the situation can still be improved. If tarfile works with byte strings, it should accept only byte strings or unicode strings that can be encoded in ASCII, and encode them as soon as it gets them.
In the problem reported by Peter, he was passing u".", which is an ASCII-only unicode string. Later in the program this string gets mixed with a byte string, which causes an implicit decoding, i.e. the byte string is converted to unicode (and the conversion possibly fails if the filename is non-ASCII). Even if the decoding succeeds, tarfile will eventually have to convert the unicode string back to a byte string.

A better approach would be to encode all unicode strings that are passed in using the ASCII codec.
If the unicode strings are ASCII-only (like the u"." Peter was passing), they can be encoded without problems. When they get mixed with other strings, they are all byte strings, so no implicit decoding happens.
If the unicode strings are non-ASCII, the encoding will fail immediately, telling the user to encode the unicode string before passing it to the function.
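The proposal above could be sketched as follows. The function coerce_path is hypothetical and only illustrates the idea; it is not something tarfile provides.

```python
def coerce_path(path):
    """Return path as a byte string, rejecting non-ASCII unicode up front.

    Hypothetical helper illustrating the proposal; not part of tarfile.
    """
    if isinstance(path, bytes):
        return path
    # Raises UnicodeEncodeError immediately for non-ASCII input.
    return path.encode("ascii")

# An ASCII-only unicode path becomes a byte string right away ...
extract_dir = coerce_path(u".")
# ... so joining it with a non-ASCII byte-string name needs no implicit decode.
joined = extract_dir + b"/" + b"\xea\x80\x80a.ogg"

# A non-ASCII unicode path is rejected at the point where it is passed in.
try:
    coerce_path(u"\ua000dir")
    rejected = False
except UnicodeEncodeError:
    rejected = True
```

The error then surfaces at the call that received the bad argument, not deep inside os.path.join().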
msg97932 - Author: Peter Bienstman (pbienst) Date: 2010-01-17 08:31
> Lars Gustäbel <lars@gustaebel.de> added the comment:
> 
> So, use the pax format. It stores the filenames as utf-8 and this way you
>  will be on the safe side.
> 
> I hope we both agree that the solution to your particular problem is
>  nothing tarfile.py can provide.

If I want to extract a pax archive to a unicode path with non-Latin 
characters, how should I encode the path before passing it to 'extractall'? 
Would utf-8 be OK?

Peter
History
Date                 User            Action  Args
2010-01-17 08:31:33  pbienst         set     messages: + msg97932
2010-01-16 17:59:04  ezio.melotti    set     messages: + msg97887
2010-01-16 13:07:41  lars.gustaebel  set     status: open -> closed; resolution: works for me; messages: + msg97875
2010-01-15 13:36:45  pbienst         set     messages: + msg97811
2010-01-15 13:14:28  lars.gustaebel  set     messages: + msg97810
2010-01-15 12:26:18  pbienst         set     messages: + msg97807
2010-01-15 10:51:23  lars.gustaebel  set     messages: + msg97805
2010-01-15 10:31:49  pbienst         set     messages: + msg97804
2010-01-15 10:05:02  lars.gustaebel  set     messages: + msg97803
2010-01-14 01:15:36  ezio.melotti    set     messages: + msg97746
2010-01-13 21:38:55  lars.gustaebel  set     assignee: lars.gustaebel; nosy: + lars.gustaebel
2010-01-13 21:27:50  ezio.melotti    set     nosy: + ezio.melotti; versions: + Python 2.7; priority: normal; components: + Library (Lib), - Extension Modules; type: crash -> behavior; stage: test needed
2010-01-13 14:08:22  pbienst         create