classification
Title: zipfile: allow surrogates in filenames
Type: Stage:
Components: Library (Lib), Unicode Versions: Python 3.4, Python 3.3
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, a.badger, asvetlov, ethan.furman, ezio.melotti, haypo, r.david.murray, serhiy.storchaka, stefanholek
Priority: normal Keywords: patch

Created on 2012-10-24 12:33 by stefanholek, last changed 2013-10-14 22:52 by ethan.furman.

Files
File name Uploaded Description Edit
python3-zipfile-surrogate.patch a.badger, 2013-03-21 00:54 ver 1 of a patch to allow surrogates in zipfiles review
python3-zipfile-surrogate.patch a.badger, 2013-03-21 17:23 ver 2 of patch for surrogates in zipfiles review
Messages (13)
msg173676 - (view) Author: Stefan Holek (stefanholek) Date: 2012-10-24 12:33
Please allow for surrogates in the zipfile module like it was done for tarfile in #8390.

Currently zipfile breaks when encountering surrogates:

Traceback (most recent call last):
  File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 392, in _encodeFilenameFlags
    return self.filename.encode('ascii'), self.flag_bits
UnicodeEncodeError: 'ascii' codec can't encode character '\udcfc' in position 21: ordinal not in range(128)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "setup.py", line 20, in <module>
    'setuptools',
  File "/usr/local/python3.3/lib/python3.3/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 917, in run_commands
    self.run_command(cmd)
  File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 936, in run_command
    cmd_obj.run()
  File "/home/stefan/sandbox/setuptools-git/lib/python3.3/site-packages/distribute-0.6.30-py3.3.egg/setuptools/command/sdist.py", line 161, in run
    self.make_distribution()
  File "/usr/local/python3.3/lib/python3.3/distutils/command/sdist.py", line 447, in make_distribution
    file = self.make_archive(base_name, fmt, base_dir=base_dir)
  File "/usr/local/python3.3/lib/python3.3/distutils/cmd.py", line 370, in make_archive
    dry_run=self.dry_run)
  File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 178, in make_archive
    filename = func(base_name, base_dir, **kwargs)
  File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 118, in make_zipfile
    zip.write(path, path)
  File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 1328, in write
    self.fp.write(zinfo.FileHeader())
  File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 382, in FileHeader
    filename, flag_bits = self._encodeFilenameFlags()
  File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 394, in _encodeFilenameFlags
    return self.filename.encode('utf-8'), self.flag_bits | 0x800
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 21: surrogates not allowed
msg173682 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-10-24 13:15
The problem you are reporting looks different than the problem addressed in issue 8390.  There, the surrogates are being introduced when reading filenames from the archive file.  Here, the surrogates presumably arose because the filename on your file system was not utf-8 encoded and so Python introduced the surrogates to preserve the filename.  The bug is that zipfile is not handling surrogates when *building* the archive...which may in fact be correct.  If I understand correctly there are two encodings supported by zipfile, a Microsoft code page and utf-8.  Anything else should probably be rejected as invalid, but with a better error message.  If you really need to include invalid filenames in an archive, we would introduce an explict flag for allowing that.

But, that's just my opinion.  ("Be generous in what you accept, and strict in what you send")
msg173687 - (view) Author: Stefan Holek (stefanholek) Date: 2012-10-24 14:19
A little more context perhaps:

The use-case is building Python distributions containing non-ASCII filenames. These seemingly "invalid" filenames can occur in real-life when the files have been created by, say, a 'git clone' operation.

So yes, I have Latin-1 bytes on the filesystem, even though my locale is UTF-8. And yes, Python 3 decodes that filename using surrogates. Creating .tar.gz distributions in this situation appears to work (even re-creating the foreign bytes when the archive is later extracted), whereas .zip archives fail in the way described above.

I was hoping zipfile could be made to work the same as tarfile in this regard. Concerns for standards certainly didn't keep tarfile from supporting surrogates. ;-)
msg173689 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2012-10-24 14:38
I'm guessing that is because (if you read the issue) there are no specified standards for the filenames in tar (other than PAX format).  Although I would personally have preferred to need to specify a "yes really use these binary filenames" flag to tar, as well.

I'm not sure there are real "standards" for zip, either.  I'll have to leave that answer to someone more knowledgeable.

As for your immediate issue, can't you just set your locale to latin-1 while building the archive?  The filenames should then get encoded to utf-8 in the zip archive, which should do the right thing with respect to the user's locale when extracted.  I would think that that would be more portable.
msg173754 - (view) Author: Stefan Holek (stefanholek) Date: 2012-10-25 11:56
What we are trying to do is make distribute work with non-ASCII filenames, and this is one of the things we ran into.

Fact 1: Filenames are bytes, whether you like it or not. Treating them as strings is going to give you more trouble than dragging the bytes along.

Fact 2: Surrogates are Python 3's way of dealing with bytes.

Fact 3: What follows is that surrogates must be supported wherever Python 3 deals with filenames.

Fact 4: This is a *bug* since Python breaks its own rules here (I have removed the enhancement marker). The issue is not what ZIP can do, but what Python 3 *must* do. Creating a potentially non-standard ZIP file is fine, exploding in the user's face is not.
msg173756 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-25 13:30
If we allow for surrogates in the names, it will not correct UTF-8.  This can breaks other software.

We should clear 11th flag bit in this case.
msg173766 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-10-25 16:26
Related issues: issue10614, issue10757, issue10972.
msg174166 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-10-30 00:37
> The use-case is building Python distributions containing
> non-ASCII filenames.

It's possible to distribute Python packages with non-ASCII filenames.

> So yes, I have Latin-1 bytes on the filesystem,
> even though my locale is UTF-8.

You system is not configured correctly. If you would like to distribute such invalid filename, how do you plan to access it on other platforms where the filename is decoded differently? It would be safer to build your project on a well configured system.

See issues mentionned in msg173766 to support: creating a ZIP archive with invalid filenames, and be able to specify the encoding of filenames when decoding a ZIP archive.
msg174196 - (view) Author: Stefan Holek (stefanholek) Date: 2012-10-30 10:06
> It's possible to distribute Python packages with non-ASCII filenames.

Well, it wasn't until very recently (distribute 0.6.29):
https://bitbucket.org/tarek/distribute/issue/303/no-support-for-unicode-manifest-files
Unless we are not talking about the same thing, which is possible. ;-)

>> So yes, I have Latin-1 bytes on the filesystem,
>> even though my locale is UTF-8.

> You system is not configured correctly. If you would like to distribute such invalid filename,
> how do you plan to access it on other platforms where the filename is decoded differently?
> It would be safer to build your project on a well configured system.

This was done on purpose, to test how Python fares. Such files can easily come into existence, e.g. when cloning a Git repo created on a different system. I am not after "correct" ZIP files in this case, I am after Python not raising UnicodeErrors when it is supposed to a) support non-ASCII module names and b) support surrogates.

    python setup.py sdist --formats=gztar -> works

    python setup.py sdist --formats=zip -> UnicodeError

If I am the only one to think this is wrong, then so be it. Our current workaround is to disallow surrogates in the manifest. /me shrugs.
msg174201 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2012-10-30 10:59
> If I am the only one to think this is wrong, then so be it.
> Our current workaround is to disallow surrogates in the manifest. /me shrugs.

You are not alone, that's why there are 3 open issues. But someone
should finish the different proposition and write a new fully
functionnal patch to support bytes filenames.
msg184288 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2013-03-16 03:44
I found some "standards" docs that could bear on this:

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Appendix D:
"D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437."
[..]
"D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding.  If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification."
[..]

So there's two choices for a filename in a zipfile:

* bytes that make valid UTF-8 strings
* bytes that make valid strings in code page 437

http://en.wikipedia.org/wiki/Code_page_437#Standard_code_page

Code Page 437 takes up all 256 possible bit patterns available in a byte.

These two factors mean that if a filename in a zipfile is considered from the POV of a sequence of bytes, it can (according to the zipfile standard) contain any possible sequence of bytes.  If a filename is considered from the POV of a sequence of human characters, it can contain any possible sequence of unicode code points encoded as utf-8.  

The tricky bit: if the bytes are not valid utf-8 then officially the characters should be limited to the 256 characters of Code Page 437.   However, the client tools I've looked at exploit the fact that all bytes are possible to simply save the bytes that make up the filename into the zip file.
msg184820 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2013-03-21 00:54
Okay, here's the first version of a patch to add surrogate support to a zipfile.  I think it's the minimum required to fix this bug.

When archiving, if a filename contains surrogateescape'd bytes, it switches to cp437 when it saves the filename into the zipfile.  This seems to be the strategy of other zip tools.  Nothing changes when unarchiving (probably to deal with what comes out of other tools).

The documentation is also updated to mention that unknown encodings are a problem that the zipfile module doesn't handle automatically for you.

I think we could do better but this is a major improvement over the status quo (no tracebacks).  Would someone care to review this for merge and then we could work on adding some notion of a user-specified encoding to override cp437 encoding on dearchiving.  (which I think would satisfy:  issue10614, issue10972).

The use case in issue10757 might be fixed by this patch (or this patch plus the user specified encoding).  Have to look a little harder at it.
msg184890 - (view) Author: Toshio Kuratomi (a.badger) * Date: 2013-03-21 17:23
Version 2 of the patch

* fixes for the style problems noted by ezio.melotti
History
Date User Action Args
2013-10-14 22:52:10ethan.furmansetnosy: + ethan.furman
2013-03-21 17:23:31a.badgersetfiles: + python3-zipfile-surrogate.patch

messages: + msg184890
2013-03-21 00:54:21a.badgersetfiles: + python3-zipfile-surrogate.patch
keywords: + patch
messages: + msg184820
2013-03-16 03:44:45a.badgersetnosy: + a.badger
messages: + msg184288
2012-10-30 10:59:18hayposetmessages: + msg174201
2012-10-30 10:06:21stefanholeksetmessages: + msg174196
2012-10-30 00:37:31hayposetmessages: + msg174166
2012-10-28 12:30:07asvetlovsetnosy: + asvetlov
2012-10-25 16:26:18serhiy.storchakasetmessages: + msg173766
2012-10-25 13:30:57serhiy.storchakasetmessages: + msg173756
2012-10-25 11:56:45stefanholeksettype: enhancement ->
messages: + msg173754
versions: + Python 3.3
2012-10-24 17:48:56pitrousetnosy: + haypo
2012-10-24 16:16:09Arfreversetnosy: + Arfrever
2012-10-24 14:38:34r.david.murraysetmessages: + msg173689
2012-10-24 14:19:18stefanholeksetmessages: + msg173687
2012-10-24 13:15:01r.david.murraysetnosy: + r.david.murray
messages: + msg173682
2012-10-24 12:59:55serhiy.storchakasetnosy: + serhiy.storchaka

type: enhancement
versions: + Python 3.4, - Python 3.3
2012-10-24 12:33:45stefanholekcreate