This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Allow reading member names with bogus encodings in zipfile
Type: enhancement Stage: commit review
Components: Library (Lib) Versions: Python 3.11
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: serhiy.storchaka Nosy List: gregory.p.smith, methane, ned.deily, serhiy.storchaka, sjt
Priority: normal Keywords: needs review, patch

Created on 2016-09-11 18:23 by sjt, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
encoded-member-names sjt, 2016-09-11 19:21 Mercurial queue style patch for Lib/zipfile.py, Doc/Lib/zipfile.rst, Lib/test/test_zipfile.py review
encoded-member-names-v2 sjt, 2016-09-11 23:58 review
encoded-member-names-v3 sjt, 2016-09-12 14:42 review
Pull Requests
URL Status Linked Edit
PR 32007 merged serhiy.storchaka, 2022-03-20 14:03
Messages (19)
msg275828 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-11 18:59
Could you please provide more information sjt?
msg275834 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-11 19:21
Suggested NEWS/whatsnew entry:

Add a new *memberNameEncoding* argument to the ZipFile constructor, allowing
:mod:`zipfile` to read filenames in non-conforming encodings from the
zipfile as Unicode.  This implementation assumes all member names have the same encoding.

Motivation:

There are applications in Japan that create zipfiles with directories containing filenames encoded in Shift JIS.  There may be such software in other countries as well.  As this is a violation of the Zip format definition, this library implements only an option to read such files.

Done:

(1) Add a memberNameEncoding argument to the main() function, which may be set from the command line with "--membernameencoding={codec}".  This command line option may be used with -e or -l, but not -c or -t.  There is no point to it in the latter, since the member names are not printed.
(2) Add a memberNameEncoding argument to the ZipFile constructor.  This is the only way to set it, so this is global to the ZipFile.
(3) Add this attribute to repr.
(4) Add a check that the mode is `read` in main() and in the ZipFile constructor, and if not invoke USAGE and exit or raise RuntimeError.
(5) When retrieving member names in constructing ZipInfo instances, check if memberNameEncoding is set, and if so use it, unless the UTF-8 bit is set. In that case, obey the UTF-8 bit, as the specified encoding is surely user error.
(6) Add a CODEC_USAGE message.
(7) Update the docs (docstrings, library reference, NEWS).
(8) Add tests:
    (a) List a zipfile's SJIS-encoded directory.
    (b) List a UTF-8-encoded directory and an ISO-8859-1-encoded directory as Shift-JIS.
    (c) Check that USAGE is invoked on attempts to write a zipfile in main().
    (d) Check that an appropriate error is raised on attempts to write in other functions.
    Many other tests are run as well.
    ALL TESTS PASS.
(9) Docs build without error.

To do (?):

(10) NEWS/whatsnew
(11) Check relevant code paths are all covered by tests.
(12) Review docs for clarity and organization.

Not done:

I don't think these are appropriate/needed at this time, but listed in case somebody thinks otherwise.

(13) Add a subtype of RuntimeError (see 7d)?
(14) Issue warning if both membernameencoding and utf-8 bit are set (see 4)?
(15) Support InfoZip encoding extension mentioned in APPNOTE.TXT - .ZIP File Format Specification, v6.3.4.
(16) Support per-member encodings (I think the zipfile standard permits, but not sure).
msg275837 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-11 19:27
I should have a contributor agreement form on file.

Ned Deily suggested that I try to get this patch in before the 12 noon deadline Sept. 12, so here it is.

I believe the patch is "safe" in the sense that its functionality needs to be explicitly enabled, and it should be very difficult to persuade it to inadvertantly write to any file.  No existing execution paths should be changed at all.
msg275842 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-11 20:02
This issue is a duplicate of issue10614. But proposed patch looks more ready.
msg275844 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2016-09-11 20:04
Stephen, thanks for submitting the patch.  Unless another core developer has time to review and commit this prior to the feature code off tomorrow, this will probably need to wait for 3.7.  Also, at the moment, your tracker user record does not indicate that there is a contributor agreement on file for you.  You could followup on that with Ewa at the PSF or just submit a new form.
msg275847 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-11 20:28
Re: wait for 3.7 if reviewers are busy, understood.  N.B. Contributor agreement is now on file (I received the PDF from python.org already).

Re: existing patches:
My patch is very similar in the basic approach to Sergey Dorofeev's patch in issue10614.  Main differences:
(1) Sergey's patch treats the "encoding" parameter as a first class citizen with a default to cp437, whereas mine treats it as a special case defaulting to None, with utf-8 and cp437 getting special treatment as the standard encodings.  Subtle point, but I like it this way.
(2) My patch includes support for the argument in the __main__ script.
(3) Sergey's patch misses one execution path in the current code so needs update before application.

The Japanese patches by umedoblock are very Japanese-centric, and worse, they try to guess the encoding by the crude method of seeing what decodes successfully.  They are not acceptable IMO.

Aaargh.  Just noticed the Japanese in test_zipfile.py.  Will change it to use \u escapes soon.
msg275856 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-11 21:12
Added comments on Rietveld. Maybe I'll commit rewritten patch tomorrow before 12:00 UTC (oh, already today!).
msg275891 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-11 23:58
Can't reply on Rietveld?  Lost 2 hours work!

Patch updated (encoded-member-names-v2), most changes accepted.  Not happy about name change or default to cp437, I want this API to be hard to use and not be part of the normal process (utf-8 or cp437).  Considering errors= argument, but that must default to 'strict' -- the problem this patch solves is zip utilities extracting to files with unreadable names, surrogateescape is more of the same.  Two incomplete tests (assertRaisesRegex and capture main() stderr) still in progress, must do dayjob now.
msg275996 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-12 08:33
Sorry, I can't push the patch in a haste. In needs more design discussion, comparing with other implementations (I found only that ZipFile in Java can take the charset argument, but "charset" is common name for text encoding in Java), wider discussion. The problem exists for years, and there are workarounds.
msg276025 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-12 11:56
If you have a workaround that's available to nonprogrammers, I'd like to hear about it.  I have found none, that's why I went to the trouble to put together a patch even though I knew that the odds of actually getting it in to Python 3.6 was very low -- my patch (or Sergey Dorofeev's, but that needs work to be applicable to trunk) does everything I've ever needed, so I suppose it would do for all the use cases so far posted (except umedoblock's encoding-guessing approach, but that can be handled by many 3rd-party encoding-guessing codecs, I think, and IMO should be).
msg276051 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-12 14:15
Python is programming language, I don't understand what you mean saying "available to nonprogrammers". As a programmer you can recode ZipInfo name before outputting or what you want to do with it:

    filename = filename.encode('cp437').decode(encoding)

In command line you can can use iconv or other converters:

    python3 -m zipfile -l myzip.zip | iconv -t cp437 | iconv -f "$encoding"

This is not very handy, but works in most cases.
msg276054 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-09-12 14:42
Cleaned up a few loose ends while it's all fresh in mind.  Will ping python-dev in 4-6 weeks for review for 3.7.

Thanks to Serhiy for review.  The current version of the patch is much improved over the initial submission due to his efforts.
msg284029 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-12-26 13:29
A ZipFile can be read when open in not read-only mode. Thus the encoding argument should be accepted when mode != 'r'.

It would be weird to read file names and write new entries with different encodings. Thus the encoding argument should affect output encoding too.

You have named the new ZipFile attribute metadataEncoding. Indeed, I missed this, and other developers missed this when ported to Python 3, but the specification says, that the UTF-8 bit affect not just the encoding of file names, but the encoding of comments. Thus a file comment must be a string, and be decoded with the same encoding as a file name. Currently it is of type bytes. I don't know what is the best way to resolve this issue without breaking backward compatibility. Perhaps add the text_comment property.
msg284139 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2016-12-27 22:57
Thanks for followup!  I was just about to write you, now that 3.6 is out.  Season's Greetings!

First, how do you propose to proceed with issue28115 ("use argparse for the ZipFile module")?  If you expect to commit that first (I'm in no hurry for this patch, BTW, as long as it gets into 3.7 I'm happy), this issue should depend on it and use argparse too.

I don't see any good reason for allowing non-UTF-8 encoding to a file open for writing, and a good reason (the ZipFile standard) for not allowing it.  Certainly the CLI should not allow it, any more than it does now.  At least in my experiments InfoZip and the default zip utilities on Windows and Mac DTRT with UTF-8 zipfiles, so there is no absolute need for writing nonconforming zipfiles.  If you want to block on a convert-to-UTF-8 option I can do that (but I don't need it myself).  (Note to self: if writing to existing zipfile is extension of existing file, need to prevent mixed encodings.  Also warn about conversion.)

I thought I checked that comments were decoded.  Maybe that's only on the UTF-8 path?  Or maybe I needed more coffee.  (Hope so, that would be a messy problem if ASCII/Latin1 returns bytes and UTF-8 returns str!)  I'll think about this.  Yes, it's a backwards-compatibility issue so needs care.  Would be weird if names are decoded but other metadata (comments) not, though.  Surely someone would complain if they actually used comments?  (I'm thinking maybe a compatibility break might be OK?  With deprecation cycle?)  I expect to check all execution paths accessing metadata and have a proposed patch by 12/31.

I think I'm still short some tests, will check and write them if needed.
msg415605 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-03-20 14:22
I experimented with this a lot. There is a problem with the append mode. We can read in the append mode, therefore we need an encoding. But when we close a ZipFile after appending, non-ASCII file names will be encoded in UTF-8 in the central directory. Next time when we open the archive for reading with different encoding we will get an error because filenames in the central directory and in local headers are different. We need to write non-ASCII files back with the specified encoding to get a self-consistent data.

Finally I left this as it was initially. We can return to the problem with the append module later.

The differences between PR 32007 and your patches:

* The parameter was renamed to metadata_encoding to avoid confusion with existing parameter of ZipFile.open() encoding. In future I am going to use it also for comments. The attribute and the CLI option were renamed correspondingly.
* --metadata-encoding can also be used with the -t option.
* "surrogateescape" no longer used. If the encoding in not suitable, you will get an error. Use the default and decode filenames manually in such cases. We can change this in future.
* Updated documentation.
* Tests were significantly rewritten. Now they test the behavior with wrong metadata_encoding, mixed UTF-8 and legacy encodings, and reading after append.

I was going to make more changes, but left it for future.
msg415746 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-03-22 06:40
Your PR looks good to me.

I agree with not making it easy to _write_ zipfiles with non-standard encoding used for names.

There is a possibility that someone wants that ability when writing zip files (not yet clear) in https://bugs.python.org/issue40172.  I don't like it, but if the need exists, that feature can be addressed on that issue via relevant options.
msg415756 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2022-03-22 09:53
New changeset a25a985535ccbb7df8caddc0017550ff4eae5855 by Serhiy Storchaka in branch 'main':
bpo-28080: Add support for the fallback encoding in ZIP files (GH-32007)
https://github.com/python/cpython/commit/a25a985535ccbb7df8caddc0017550ff4eae5855
msg415876 - (view) Author: Stephen J. Turnbull (sjt) * (Python triager) Date: 2022-03-23 14:02
I'm not going to have time to look at the PR for a couple days.

I don't understand what the use case is for writing or appending with filenames in a non-UTF-8 encoding.  At least in my experience, reading such files is rare, but I have never been asked to write one.  The correspondents who send me zipfiles with the directory encoded in shift_jisx0213 are perfectly happy to read zipfiles with the directory encoded in UTF-8.

If that is true for other users, then unzipping the file to a temporary directory with the appropriate --metadata-encoding, adding the required paths there, and zipping a new archive seems perfectly workable.  In that case making this feature read-only makes the most sense to me.
msg415917 - (view) Author: Gregory P. Smith (gregory.p.smith) * (Python committer) Date: 2022-03-24 00:33
Thanks Serhiy!
History
Date User Action Args
2022-04-11 14:58:36adminsetgithub: 72267
2022-03-24 00:33:09gregory.p.smithsetstatus: open -> closed
resolution: fixed
messages: + msg415917

stage: patch review -> commit review
2022-03-23 14:02:06sjtsetmessages: + msg415876
2022-03-22 09:53:24serhiy.storchakasetmessages: + msg415756
2022-03-22 06:40:23gregory.p.smithsetnosy: + gregory.p.smith
messages: + msg415746
2022-03-20 14:22:14serhiy.storchakasetmessages: + msg415605
versions: + Python 3.11, - Python 3.7
2022-03-20 14:03:40serhiy.storchakasetpull_requests: + pull_request30095
2019-11-27 11:45:52serhiy.storchakalinkissue38861 dependencies
2016-12-27 22:57:12sjtsetmessages: + msg284139
2016-12-26 13:29:49serhiy.storchakasetmessages: + msg284029
2016-12-26 13:10:11methanesetnosy: + methane
2016-12-26 13:06:43methanelinkissue10614 superseder
2016-09-12 14:42:44sjtsetfiles: + encoded-member-names-v3

messages: + msg276054
2016-09-12 14:15:57serhiy.storchakasetmessages: + msg276051
2016-09-12 11:56:26sjtsetmessages: + msg276025
2016-09-12 08:33:53serhiy.storchakasetmessages: + msg275996
versions: - Python 3.6
2016-09-11 23:58:41sjtsetfiles: + encoded-member-names-v2

messages: + msg275891
2016-09-11 21:12:50serhiy.storchakasetmessages: + msg275856
2016-09-11 20:28:40sjtsetmessages: + msg275847
2016-09-11 20:04:13ned.deilysetnosy: + ned.deily

messages: + msg275844
versions: + Python 3.7
2016-09-11 20:02:41serhiy.storchakasetassignee: serhiy.storchaka
messages: + msg275842
stage: patch review
2016-09-11 19:27:57sjtsetmessages: + msg275837
2016-09-11 19:21:52sjtsetkeywords: + needs review
files: + encoded-member-names
status: pending -> open
messages: + msg275834
2016-09-11 18:59:35serhiy.storchakasetstatus: open -> pending
nosy: + serhiy.storchaka
messages: + msg275828

2016-09-11 18:23:14sjtcreate