classification
Title: Allow reading member names with bogus encodings in zipfile
Type: enhancement Stage: patch review
Components: Library (Lib) Versions: Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: inada.naoki, ned.deily, serhiy.storchaka, sjt
Priority: normal Keywords: needs review, patch

Created on 2016-09-11 18:23 by sjt, last changed 2016-12-27 22:57 by sjt.

Files
File name Uploaded Description Edit
encoded-member-names sjt, 2016-09-11 19:21 Mercurial queue style patch for Lib/zipfile.py, Doc/Lib/zipfile.rst, Lib/test/test_zipfile.py review
encoded-member-names-v2 sjt, 2016-09-11 23:58 review
encoded-member-names-v3 sjt, 2016-09-12 14:42 review
Messages (14)
msg275828 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-11 18:59
Could you please provide more information sjt?
msg275834 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-11 19:21
Suggested NEWS/whatsnew entry:

Add a new *memberNameEncoding* argument to the ZipFile constructor, allowing
:mod:`zipfile` to read filenames in non-conforming encodings from the
zipfile as Unicode.  This implementation assumes all member names have the same encoding.

Motivation:

There are applications in Japan that create zipfiles with directories containing filenames encoded in Shift JIS.  There may be such software in other countries as well.  As this is a violation of the Zip format definition, this library implements only an option to read such files.

Done:

(1) Add a memberNameEncoding argument to the main() function, which may be set from the command line with "--membernameencoding={codec}".  This command line option may be used with -e or -l, but not -c or -t.  There is no point to it in the latter, since the member names are not printed.
(2) Add a memberNameEncoding argument to the ZipFile constructor.  This is the only way to set it, so this is global to the ZipFile.
(3) Add this attribute to repr.
(4) Add a check that the mode is `read` in main() and in the ZipFile constructor, and if not invoke USAGE and exit or raise RuntimeError.
(5) When retrieving member names in constructing ZipInfo instances, check if memberNameEncoding is set, and if so use it, unless the UTF-8 bit is set. In that case, obey the UTF-8 bit, as the specified encoding is surely user error.
(6) Add a CODEC_USAGE message.
(7) Update the docs (docstrings, library reference, NEWS).
(8) Add tests:
    (a) List a zipfile's SJIS-encoded directory.
    (b) List a UTF-8-encoded directory and an ISO-8859-1-encoded directory as Shift-JIS.
    (c) Check that USAGE is invoked on attempts to write a zipfile in main().
    (d) Check that an appropriate error is raised on attempts to write in other functions.
    Many other tests are run as well.
    ALL TESTS PASS.
(9) Docs build without error.

To do (?):

(10) NEWS/whatsnew
(11) Check relevant code paths are all covered by tests.
(12) Review docs for clarity and organization.

Not done:

I don't think these are appropriate/needed at this time, but listed in case somebody thinks otherwise.

(13) Add a subtype of RuntimeError (see 7d)?
(14) Issue warning if both membernameencoding and utf-8 bit are set (see 4)?
(15) Support InfoZip encoding extension mentioned in APPNOTE.TXT - .ZIP File Format Specification, v6.3.4.
(16) Support per-member encodings (I think the zipfile standard permits, but not sure).
msg275837 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-11 19:27
I should have a contributor agreement form on file.

Ned Deily suggested that I try to get this patch in before the 12 noon deadline Sept. 12, so here it is.

I believe the patch is "safe" in the sense that its functionality needs to be explicitly enabled, and it should be very difficult to persuade it to inadvertantly write to any file.  No existing execution paths should be changed at all.
msg275842 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-11 20:02
This issue is a duplicate of issue10614. But proposed patch looks more ready.
msg275844 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-09-11 20:04
Stephen, thanks for submitting the patch.  Unless another core developer has time to review and commit this prior to the feature code off tomorrow, this will probably need to wait for 3.7.  Also, at the moment, your tracker user record does not indicate that there is a contributor agreement on file for you.  You could followup on that with Ewa at the PSF or just submit a new form.
msg275847 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-11 20:28
Re: wait for 3.7 if reviewers are busy, understood.  N.B. Contributor agreement is now on file (I received the PDF from python.org already).

Re: existing patches:
My patch is very similar in the basic approach to Sergey Dorofeev's patch in issue10614.  Main differences:
(1) Sergey's patch treats the "encoding" parameter as a first class citizen with a default to cp437, whereas mine treats it as a special case defaulting to None, with utf-8 and cp437 getting special treatment as the standard encodings.  Subtle point, but I like it this way.
(2) My patch includes support for the argument in the __main__ script.
(3) Sergey's patch misses one execution path in the current code so needs update before application.

The Japanese patches by umedoblock are very Japanese-centric, and worse, they try to guess the encoding by the crude method of seeing what decodes successfully.  They are not acceptable IMO.

Aaargh.  Just noticed the Japanese in test_zipfile.py.  Will change it to use \u escapes soon.
msg275856 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-11 21:12
Added comments on Rietveld. Maybe I'll commit rewritten patch tomorrow before 12:00 UTC (oh, already today!).
msg275891 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-11 23:58
Can't reply on Rietveld?  Lost 2 hours work!

Patch updated (encoded-member-names-v2), most changes accepted.  Not happy about name change or default to cp437, I want this API to be hard to use and not be part of the normal process (utf-8 or cp437).  Considering errors= argument, but that must default to 'strict' -- the problem this patch solves is zip utilities extracting to files with unreadable names, surrogateescape is more of the same.  Two incomplete tests (assertRaisesRegex and capture main() stderr) still in progress, must do dayjob now.
msg275996 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-12 08:33
Sorry, I can't push the patch in a haste. In needs more design discussion, comparing with other implementations (I found only that ZipFile in Java can take the charset argument, but "charset" is common name for text encoding in Java), wider discussion. The problem exists for years, and there are workarounds.
msg276025 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-12 11:56
If you have a workaround that's available to nonprogrammers, I'd like to hear about it.  I have found none, that's why I went to the trouble to put together a patch even though I knew that the odds of actually getting it in to Python 3.6 was very low -- my patch (or Sergey Dorofeev's, but that needs work to be applicable to trunk) does everything I've ever needed, so I suppose it would do for all the use cases so far posted (except umedoblock's encoding-guessing approach, but that can be handled by many 3rd-party encoding-guessing codecs, I think, and IMO should be).
msg276051 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-12 14:15
Python is programming language, I don't understand what you mean saying "available to nonprogrammers". As a programmer you can recode ZipInfo name before outputting or what you want to do with it:

    filename = filename.encode('cp437').decode(encoding)

In command line you can can use iconv or other converters:

    python3 -m zipfile -l myzip.zip | iconv -t cp437 | iconv -f "$encoding"

This is not very handy, but works in most cases.
msg276054 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-12 14:42
Cleaned up a few loose ends while it's all fresh in mind.  Will ping python-dev in 4-6 weeks for review for 3.7.

Thanks to Serhiy for review.  The current version of the patch is much improved over the initial submission due to his efforts.
msg284029 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-26 13:29
A ZipFile can be read when open in not read-only mode. Thus the encoding argument should be accepted when mode != 'r'.

It would be weird to read file names and write new entries with different encodings. Thus the encoding argument should affect output encoding too.

You have named the new ZipFile attribute metadataEncoding. Indeed, I missed this, and other developers missed this when ported to Python 3, but the specification says, that the UTF-8 bit affect not just the encoding of file names, but the encoding of comments. Thus a file comment must be a string, and be decoded with the same encoding as a file name. Currently it is of type bytes. I don't know what is the best way to resolve this issue without breaking backward compatibility. Perhaps add the text_comment property.
msg284139 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-12-27 22:57
Thanks for followup!  I was just about to write you, now that 3.6 is out.  Season's Greetings!

First, how do you propose to proceed with issue28115 ("use argparse for the ZipFile module")?  If you expect to commit that first (I'm in no hurry for this patch, BTW, as long as it gets into 3.7 I'm happy), this issue should depend on it and use argparse too.

I don't see any good reason for allowing non-UTF-8 encoding to a file open for writing, and a good reason (the ZipFile standard) for not allowing it.  Certainly the CLI should not allow it, any more than it does now.  At least in my experiments InfoZip and the default zip utilities on Windows and Mac DTRT with UTF-8 zipfiles, so there is no absolute need for writing nonconforming zipfiles.  If you want to block on a convert-to-UTF-8 option I can do that (but I don't need it myself).  (Note to self: if writing to existing zipfile is extension of existing file, need to prevent mixed encodings.  Also warn about conversion.)

I thought I checked that comments were decoded.  Maybe that's only on the UTF-8 path?  Or maybe I needed more coffee.  (Hope so, that would be a messy problem if ASCII/Latin1 returns bytes and UTF-8 returns str!)  I'll think about this.  Yes, it's a backwards-compatibility issue so needs care.  Would be weird if names are decoded but other metadata (comments) not, though.  Surely someone would complain if they actually used comments?  (I'm thinking maybe a compatibility break might be OK?  With deprecation cycle?)  I expect to check all execution paths accessing metadata and have a proposed patch by 12/31.

I think I'm still short some tests, will check and write them if needed.
History
Date User Action Args
2019-11-27 11:45:52serhiy.storchakalinkissue38861 dependencies
2016-12-27 22:57:12sjtsetmessages: + msg284139
2016-12-26 13:29:49serhiy.storchakasetmessages: + msg284029
2016-12-26 13:10:11inada.naokisetnosy: + inada.naoki
2016-12-26 13:06:43inada.naokilinkissue10614 superseder
2016-09-12 14:42:44sjtsetfiles: + encoded-member-names-v3

messages: + msg276054
2016-09-12 14:15:57serhiy.storchakasetmessages: + msg276051
2016-09-12 11:56:26sjtsetmessages: + msg276025
2016-09-12 08:33:53serhiy.storchakasetmessages: + msg275996
versions: - Python 3.6
2016-09-11 23:58:41sjtsetfiles: + encoded-member-names-v2

messages: + msg275891
2016-09-11 21:12:50serhiy.storchakasetmessages: + msg275856
2016-09-11 20:28:40sjtsetmessages: + msg275847
2016-09-11 20:04:13ned.deilysetnosy: + ned.deily

messages: + msg275844
versions: + Python 3.7
2016-09-11 20:02:41serhiy.storchakasetassignee: serhiy.storchaka
messages: + msg275842
stage: patch review
2016-09-11 19:27:57sjtsetmessages: + msg275837
2016-09-11 19:21:52sjtsetkeywords: + needs review
files: + encoded-member-names
status: pending -> open
messages: + msg275834
2016-09-11 18:59:35serhiy.storchakasetstatus: open -> pending
nosy: + serhiy.storchaka
messages: + msg275828

2016-09-11 18:23:14sjtcreate