classification
Title: TarFile.list() fails on some files
Type: behavior Stage: resolved
Components: IO, Library (Lib), Unicode Versions: Python 3.4, Python 3.3, Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: benjamin.peterson, berker.peksag, ezio.melotti, haypo, lars.gustaebel, lemburg, pitrou, python-dev, serhiy.storchaka, vajrasky
Priority: normal Keywords: patch

Created on 2013-12-07 17:37 by serhiy.storchaka, last changed 2014-02-05 18:59 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
fix_tarfile_list_print_lone_surrogate.patch vajrasky, 2013-12-17 08:13 Python 3.4
fix_tarfile_list_print_lone_surrogate_v2.patch vajrasky, 2013-12-19 03:43
fix_tarfile_list_print_lone_surrogate_v3.patch vajrasky, 2014-01-22 05:07
fix_tarfile_list_print_lone_surrogate_v4.patch vajrasky, 2014-01-29 07:56
fix_tarfile_list_print_lone_surrogate_v5.patch serhiy.storchaka, 2014-01-29 12:35
Messages (11)
msg205475 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2013-12-07 17:37
TarFile.list() fails on some files, in particular on Lib/test/testtar.tar.

>>> import tarfile
>>> tarfile.open('Lib/test/testtar.tar').list()
?rw-r--r-- tarfile/tarfile       7011 2003-01-06 01:19:43 ustar/conttype 
?rw-r--r-- tarfile/tarfile       7011 2003-01-06 01:19:43 ustar/regtype 
?rwxr-xr-x tarfile/tarfile          0 2003-01-06 01:19:43 ustar/dirtype/ 
?rwxr-xr-x tarfile/tarfile        255 2003-01-06 01:19:43 ustar/dirtype-with-size/ 
?rw-r--r-- tarfile/tarfile          0 2003-01-06 01:19:43 ustar/lnktype link to ustar/regtype 
?rwxrwxrwx tarfile/tarfile          0 2003-01-06 01:19:43 ustar/symtype -> regtype 
?rw-rw---- tarfile/tarfile        3,0 2003-01-06 01:19:43 ustar/blktype 
?rw-rw-rw- tarfile/tarfile        1,3 2003-01-06 01:19:43 ustar/chrtype 
?rw-r--r-- tarfile/tarfile          0 2003-01-06 01:19:43 ustar/fifotype 
?rw-r--r-- tarfile/tarfile      86016 2003-01-06 01:19:43 ustar/sparse 
?rw-r--r-- tarfile/tarfile       7011 2003-01-06 01:19:43 Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 1846, in list
    print(tarinfo.name + ("/" if tarinfo.isdir() else ""), end=' ')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc4' in position 14: surrogates not allowed
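
The '\udcc4' in the traceback is a lone surrogate. One likely source, assuming tarfile's default errors='surrogateescape' handling of member names, is a name stored in a legacy 8-bit encoding: the byte 0xC4 is not valid UTF-8 on its own, so it is decoded to U+DCC4, which a strict UTF-8 encoder then rejects. A minimal sketch of that mechanism follows; the byte string is an illustrative assumption modeled on the traceback, not data read from testtar.tar:

```python
# Hypothetical member name as raw header bytes: 0xC4 is "Ä" in Latin-1,
# but it is not a valid UTF-8 sequence on its own.
raw_name = b"ustar/umlauts-\xc4"

# surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
name = raw_name.decode("utf-8", "surrogateescape")
print(ascii(name))

# A strict encode of the decoded name fails, as in the report.
try:
    name.encode("utf-8")
    encode_error = None
except UnicodeEncodeError as exc:
    encode_error = exc
print(encode_error)

# Encoding with the same error handler restores the original bytes.
round_trip = name.encode("utf-8", "surrogateescape")
```

The round trip is why such names are safe to pass back to the file system but not to a text stream that encodes strictly.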

The command-line interface of the tarfile module also fails:

$ ./python -m tarfile -v -l Lib/test/testtar.tar
?rw-r--r-- tarfile/tarfile       7011 2003-01-06 01:19:43 ustar/conttype 
?rw-r--r-- tarfile/tarfile       7011 2003-01-06 01:19:43 ustar/regtype 
?rwxr-xr-x tarfile/tarfile          0 2003-01-06 01:19:43 ustar/dirtype/ 
?rwxr-xr-x tarfile/tarfile        255 2003-01-06 01:19:43 ustar/dirtype-with-size/ 
?rw-r--r-- tarfile/tarfile          0 2003-01-06 01:19:43 ustar/lnktype link to ustar/regtype 
?rwxrwxrwx tarfile/tarfile          0 2003-01-06 01:19:43 ustar/symtype -> regtype 
?rw-rw---- tarfile/tarfile        3,0 2003-01-06 01:19:43 ustar/blktype 
?rw-rw-rw- tarfile/tarfile        1,3 2003-01-06 01:19:43 ustar/chrtype 
?rw-r--r-- tarfile/tarfile          0 2003-01-06 01:19:43 ustar/fifotype 
?rw-r--r-- tarfile/tarfile      86016 2003-01-06 01:19:43 ustar/sparse 
Traceback (most recent call last):
  File "/home/serhiy/py/cpython/Lib/runpy.py", line 160, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/home/serhiy/py/cpython/Lib/runpy.py", line 73, in _run_code
    exec(code, run_globals)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2500, in <module>
    main()
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 2444, in main
    tf.list(verbose=args.verbose)
  File "/home/serhiy/py/cpython/Lib/tarfile.py", line 1846, in list
    print(tarinfo.name + ("/" if tarinfo.isdir() else ""), end=' ')
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcc4' in position 14: surrogates not allowed
?rw-r--r-- tarfile/tarfile       7011 2003-01-06 01:19:43 serhiy@raxxla:~/py/cpython$
msg206412 - Author: Vajrasky Kok (vajrasky) * Date: 2013-12-17 08:13
Here is the preliminary patch.
msg206574 - Author: Vajrasky Kok (vajrasky) * Date: 2013-12-19 03:43
Thanks, Serhiy, for your review. Here is the patch to address your concern.
msg208747 - Author: Vajrasky Kok (vajrasky) * Date: 2014-01-22 04:47
Here is the patch addressing some of Serhiy's concerns. Thanks for the review.

There are some things that I could not resolve:
1. The test for an unencodable tarinfo.linkname is not done yet, because it may be better done in a separate ticket. To keep the test simple, we would need to modify the testtar.tar file by adding a file with an unencodable linkname. Is that too much to do in this ticket?

2. "should exist separate test (not in CommandLineTest) for the TarFile.list() method itself." -> I have not yet figured out how to write this test so that it adds value.
msg208967 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-23 16:41
I have added a new batch of nitpicks on Rietveld.

For your questions:

1. Yes, we can add unencodable tarinfo.linkname later. Just add tests for external tar files.

2. Use support.captured_stdout() (see test_list_command_verbose).
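
A test along these lines could capture what list() prints and inspect it. The sketch below is only an illustration under stated assumptions: it uses contextlib.redirect_stdout as a stand-in for support.captured_stdout(), builds a small archive in memory instead of reading testtar.tar, and the helper names build_archive and captured_listing are hypothetical.

```python
import contextlib
import io
import tarfile


def build_archive(names):
    """Return an in-memory tar archive holding empty regular files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in names:
            # A TarInfo with the default size of 0 needs no file object.
            tar.addfile(tarfile.TarInfo(name))
    buf.seek(0)
    return buf


def captured_listing(fileobj, verbose=False):
    """Run TarFile.list() and return everything it wrote to stdout."""
    out = io.StringIO()
    with tarfile.open(fileobj=fileobj, mode="r") as tar, \
            contextlib.redirect_stdout(out):
        tar.list(verbose=verbose)
    return out.getvalue()


listing = captured_listing(build_archive(["a.txt", "dir/b.txt"]))
print(listing, end="")
```

Because the captured stream here is a StringIO, this also exercises the case where sys.stdout has no usable encoding.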
msg209331 - Author: Vajrasky Kok (vajrasky) * Date: 2014-01-26 15:41
I already have a patch addressing your concerns, Serhiy. But before I upload it here, I have some questions:

1. "Yes, we can add unencodable tarinfo.linkname later. Just add tests for external tar files." You mean, we need to create a tar file containing unencodable tarinfo.linkname dynamically in the test? Wouldn't modifying testtar.tar be easier?

2. "stdout encoding is just sys.stdout.encoding. Be aware that it can be None (if
sys.stdout is StringIO), in that case the encoding/decoding is not needed".

from io import StringIO
import sys

old_stdout = sys.stdout
sys.stdout = mystdout = StringIO()

import locale
print(locale.getpreferredencoding())

sys.stdout = old_stdout

print(mystdout.getvalue())

I always get UTF-8. Is there anything I am missing?

Once I get clarity from these questions, I'll upload the patch. Thanks!
msg209333 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-26 15:57
1. No. Just use the existing unmodified testtar.tar for now. Later we will add new items to it.

2. The encoding of sys.stdout is not always the same as the default locale encoding. You can redirect sys.stdout to a text file (opened with a different encoding) or to a socket or pipe (wrapped with TextIOWrapper). So you should use sys.stdout.encoding and nothing else. But even this does not always work, because it can be None (in the case of StringIO) or be absent (in the case of a user file-like object). Be aware of all these corner cases.
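
The three cases can be checked directly. This is a minimal sketch, assuming Python 3; stream_encoding and BareWriter are hypothetical names used only for illustration and are not part of any patch on this issue.

```python
import io


def stream_encoding(stream):
    """Return the stream's encoding, or None if it is missing or unknown."""
    return getattr(stream, "encoding", None)


class BareWriter:
    """A user file-like object with write() but no encoding attribute."""

    def __init__(self):
        self.chunks = []

    def write(self, text):
        self.chunks.append(text)
        return len(text)


# A binary stream wrapped with TextIOWrapper, as for a pipe or socket.
wrapped = io.TextIOWrapper(io.BytesIO(), encoding="ascii")

encodings = {
    "text wrapper": stream_encoding(wrapped),
    "StringIO": stream_encoding(io.StringIO()),
    "bare writer": stream_encoding(BareWriter()),
}
print(encodings)
```

Using getattr with a default covers both the None case and the missing-attribute case with a single check.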
msg209622 - Author: Vajrasky Kok (vajrasky) * Date: 2014-01-29 07:56
Here is the patch to accommodate Serhiy's request. I added new tests outside CommandLineTest. The fix always uses sys.stdout.encoding. test_list_command and test_list_command_verbose use testtarnames now.
msg209637 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-01-29 12:35
Well, good.

But I still have several nitpicks. Here is a revised patch.

* The ASCII encoding is now used to test list() output, and the tests now run even if sys.stdout is a StringIO.
* test_list_verbose now tests that printed files are separated by exactly one newline, not by just spaces and not by two newlines.
* safe_print was simplified and renamed to _safe_print. Streams without the "encoding" attribute are now supported.
* Minor style fixes.
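
One way such a helper can work is to round-trip the text through the stream's encoding with the backslashreplace error handler, so that unencodable characters become escapes instead of raising. The following is a sketch of that idea, not the committed _safe_print; the name safe_print_sketch and its stream parameter are assumptions added to make the example testable.

```python
import io
import sys


def safe_print_sketch(text, stream=None):
    """Print text followed by a space, escaping unencodable characters."""
    if stream is None:
        stream = sys.stdout
    # The attribute may be absent (user file-like object) or None (StringIO).
    encoding = getattr(stream, "encoding", None)
    if encoding is not None:
        # Unencodable characters, including lone surrogates, are turned
        # into backslash escapes such as \udcc4 before printing.
        text = text.encode(encoding, "backslashreplace").decode(encoding)
    print(text, end=" ", file=stream)


name = "ustar/umlauts-\udcc4"  # hypothetical name with a lone surrogate

# A stream with a real encoding: the surrogate is written as an escape.
ascii_bytes = io.BytesIO()
ascii_stream = io.TextIOWrapper(ascii_bytes, encoding="ascii")
safe_print_sketch(name, stream=ascii_stream)
ascii_stream.flush()

# A StringIO has no encoding, so the text is passed through unchanged.
string_stream = io.StringIO()
safe_print_sketch(name, stream=string_stream)
```

The trailing space mirrors the end=' ' used by the print() call shown in the traceback.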
msg210337 - Author: Roundup Robot (python-dev) Date: 2014-02-05 18:55
New changeset a5895fca91f3 by Serhiy Storchaka in branch '3.3':
Issue #19920: TarFile.list() no longer fails when outputs a listing
http://hg.python.org/cpython/rev/a5895fca91f3

New changeset 077aa6d4f3b7 by Serhiy Storchaka in branch 'default':
Issue #19920: TarFile.list() no longer fails when outputs a listing
http://hg.python.org/cpython/rev/077aa6d4f3b7

New changeset 48c5c18110ae by Serhiy Storchaka in branch '2.7':
Issue #19920: Added tests for TarFile.list(). Based on patch by Vajrasky Kok.
http://hg.python.org/cpython/rev/48c5c18110ae
msg210338 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2014-02-05 18:59
Thank you for your patch, Vajrasky.
History
Date User Action Args
2014-02-05 18:59:53 serhiy.storchaka set status: open -> closed; resolution: fixed; messages: + msg210338; stage: resolved
2014-02-05 18:55:28 python-dev set nosy: + python-dev; messages: + msg210337
2014-01-29 12:35:36 serhiy.storchaka set files: + fix_tarfile_list_print_lone_surrogate_v5.patch; messages: + msg209637
2014-01-29 07:56:20 vajrasky set files: + fix_tarfile_list_print_lone_surrogate_v4.patch; messages: + msg209622
2014-01-26 15:57:03 serhiy.storchaka set messages: + msg209333
2014-01-26 15:41:16 vajrasky set messages: + msg209331
2014-01-23 16:41:02 serhiy.storchaka set messages: + msg208967
2014-01-22 05:07:18 vajrasky set files: + fix_tarfile_list_print_lone_surrogate_v3.patch
2014-01-22 04:51:02 vajrasky set files: - fix_tarfile_list_print_lone_surrogate_v3.patch
2014-01-22 04:47:58 vajrasky set files: + fix_tarfile_list_print_lone_surrogate_v3.patch; messages: + msg208747
2014-01-21 21:32:26 serhiy.storchaka set assignee: serhiy.storchaka
2013-12-19 03:43:59 vajrasky set files: + fix_tarfile_list_print_lone_surrogate_v2.patch; messages: + msg206574
2013-12-17 08:13:17 vajrasky set files: + fix_tarfile_list_print_lone_surrogate.patch; nosy: + vajrasky; messages: + msg206412; keywords: + patch
2013-12-08 07:25:55 berker.peksag set nosy: + berker.peksag
2013-12-07 17:37:40 serhiy.storchaka create