Created on 2012-10-24 12:33 by stefanholek, last changed 2015-07-21 08:10 by ethan.furman.
|python3-zipfile-surrogate.patch||a.badger, 2013-03-21 00:54||ver 1 of a patch to allow surrogates in zipfiles||review|
|python3-zipfile-surrogate.patch||a.badger, 2013-03-21 17:23||ver 2 of patch for surrogates in zipfiles||review|
|msg173676 - (view)||Author: Stefan Holek (stefanholek)||Date: 2012-10-24 12:33|
Please allow for surrogates in the zipfile module like it was done for tarfile in #8390. Currently zipfile breaks when encountering surrogates: Traceback (most recent call last): File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 392, in _encodeFilenameFlags return self.filename.encode('ascii'), self.flag_bits UnicodeEncodeError: 'ascii' codec can't encode character '\udcfc' in position 21: ordinal not in range(128) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "setup.py", line 20, in <module> 'setuptools', File "/usr/local/python3.3/lib/python3.3/distutils/core.py", line 148, in setup dist.run_commands() File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 917, in run_commands self.run_command(cmd) File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 936, in run_command cmd_obj.run() File "/home/stefan/sandbox/setuptools-git/lib/python3.3/site-packages/distribute-0.6.30-py3.3.egg/setuptools/command/sdist.py", line 161, in run self.make_distribution() File "/usr/local/python3.3/lib/python3.3/distutils/command/sdist.py", line 447, in make_distribution file = self.make_archive(base_name, fmt, base_dir=base_dir) File "/usr/local/python3.3/lib/python3.3/distutils/cmd.py", line 370, in make_archive dry_run=self.dry_run) File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 178, in make_archive filename = func(base_name, base_dir, **kwargs) File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 118, in make_zipfile zip.write(path, path) File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 1328, in write self.fp.write(zinfo.FileHeader()) File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 382, in FileHeader filename, flag_bits = self._encodeFilenameFlags() File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 394, in _encodeFilenameFlags return self.filename.encode('utf-8'), self.flag_bits | 0x800 UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 21: surrogates not allowed
|msg173682 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-10-24 13:15|
The problem you are reporting looks different than the problem addressed in issue 8390. There, the surrogates are being introduced when reading filenames from the archive file. Here, the surrogates presumably arose because the filename on your file system was not utf-8 encoded and so Python introduced the surrogates to preserve the filename. The bug is that zipfile is not handling surrogates when *building* the archive...which may in fact be correct. If I understand correctly there are two encodings supported by zipfile, a Microsoft code page and utf-8. Anything else should probably be rejected as invalid, but with a better error message. If you really need to include invalid filenames in an archive, we would introduce an explict flag for allowing that. But, that's just my opinion. ("Be generous in what you accept, and strict in what you send")
|msg173687 - (view)||Author: Stefan Holek (stefanholek)||Date: 2012-10-24 14:19|
A little more context perhaps: The use-case is building Python distributions containing non-ASCII filenames. These seemingly "invalid" filenames can occur in real-life when the files have been created by, say, a 'git clone' operation. So yes, I have Latin-1 bytes on the filesystem, even though my locale is UTF-8. And yes, Python 3 decodes that filename using surrogates. Creating .tar.gz distributions in this situation appears to work (even re-creating the foreign bytes when the archive is later extracted), whereas .zip archives fail in the way described above. I was hoping zipfile could be made to work the same as tarfile in this regard. Concerns for standards certainly didn't keep tarfile from supporting surrogates. ;-)
|msg173689 - (view)||Author: R. David Murray (r.david.murray) *||Date: 2012-10-24 14:38|
I'm guessing that is because (if you read the issue) there are no specified standards for the filenames in tar (other than PAX format). Although I would personally have preferred to need to specify a "yes really use these binary filenames" flag to tar, as well. I'm not sure there are real "standards" for zip, either. I'll have to leave that answer to someone more knowledgeable. As for your immediate issue, can't you just set your locale to latin-1 while building the archive? The filenames should then get encoded to utf-8 in the zip archive, which should do the right thing with respect to the user's locale when extracted. I would think that that would be more portable.
|msg173754 - (view)||Author: Stefan Holek (stefanholek)||Date: 2012-10-25 11:56|
What we are trying to do is make distribute work with non-ASCII filenames, and this is one of the things we ran into. Fact 1: Filenames are bytes, whether you like it or not. Treating them as strings is going to give you more trouble than dragging the bytes along. Fact 2: Surrogates are Python 3's way of dealing with bytes. Fact 3: What follows is that surrogates must be supported wherever Python 3 deals with filenames. Fact 4: This is a *bug* since Python breaks its own rules here (I have removed the enhancement marker). The issue is not what ZIP can do, but what Python 3 *must* do. Creating a potentially non-standard ZIP file is fine, exploding in the user's face is not.
|msg173756 - (view)||Author: Serhiy Storchaka (serhiy.storchaka) *||Date: 2012-10-25 13:30|
If we allow for surrogates in the names, it will not correct UTF-8. This can breaks other software. We should clear 11th flag bit in this case.
|msg173766 - (view)||Author: Serhiy Storchaka (serhiy.storchaka) *||Date: 2012-10-25 16:26|
Related issues: issue10614, issue10757, issue10972.
|msg174166 - (view)||Author: STINNER Victor (haypo) *||Date: 2012-10-30 00:37|
> The use-case is building Python distributions containing > non-ASCII filenames. It's possible to distribute Python packages with non-ASCII filenames. > So yes, I have Latin-1 bytes on the filesystem, > even though my locale is UTF-8. You system is not configured correctly. If you would like to distribute such invalid filename, how do you plan to access it on other platforms where the filename is decoded differently? It would be safer to build your project on a well configured system. See issues mentionned in msg173766 to support: creating a ZIP archive with invalid filenames, and be able to specify the encoding of filenames when decoding a ZIP archive.
|msg174196 - (view)||Author: Stefan Holek (stefanholek)||Date: 2012-10-30 10:06|
> It's possible to distribute Python packages with non-ASCII filenames. Well, it wasn't until very recently (distribute 0.6.29): https://bitbucket.org/tarek/distribute/issue/303/no-support-for-unicode-manifest-files Unless we are not talking about the same thing, which is possible. ;-) >> So yes, I have Latin-1 bytes on the filesystem, >> even though my locale is UTF-8. > You system is not configured correctly. If you would like to distribute such invalid filename, > how do you plan to access it on other platforms where the filename is decoded differently? > It would be safer to build your project on a well configured system. This was done on purpose, to test how Python fares. Such files can easily come into existence, e.g. when cloning a Git repo created on a different system. I am not after "correct" ZIP files in this case, I am after Python not raising UnicodeErrors when it is supposed to a) support non-ASCII module names and b) support surrogates. python setup.py sdist --formats=gztar -> works python setup.py sdist --formats=zip -> UnicodeError If I am the only one to think this is wrong, then so be it. Our current workaround is to disallow surrogates in the manifest. /me shrugs.
|msg174201 - (view)||Author: STINNER Victor (haypo) *||Date: 2012-10-30 10:59|
> If I am the only one to think this is wrong, then so be it. > Our current workaround is to disallow surrogates in the manifest. /me shrugs. You are not alone, that's why there are 3 open issues. But someone should finish the different proposition and write a new fully functionnal patch to support bytes filenames.
|msg184288 - (view)||Author: Toshio Kuratomi (a.badger) *||Date: 2013-03-16 03:44|
I found some "standards" docs that could bear on this: http://www.pkware.com/documents/casestudies/APPNOTE.TXT Appendix D: "D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437." [..] "D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification." [..] So there's two choices for a filename in a zipfile: * bytes that make valid UTF-8 strings * bytes that make valid strings in code page 437 http://en.wikipedia.org/wiki/Code_page_437#Standard_code_page Code Page 437 takes up all 256 possible bit patterns available in a byte. These two factors mean that if a filename in a zipfile is considered from the POV of a sequence of bytes, it can (according to the zipfile standard) contain any possible sequence of bytes. If a filename is considered from the POV of a sequence of human characters, it can contain any possible sequence of unicode code points encoded as utf-8. The tricky bit: if the bytes are not valid utf-8 then officially the characters should be limited to the 256 characters of Code Page 437. However, the client tools I've looked at exploit the fact that all bytes are possible to simply save the bytes that make up the filename into the zip file.
|msg184820 - (view)||Author: Toshio Kuratomi (a.badger) *||Date: 2013-03-21 00:54|
Okay, here's the first version of a patch to add surrogate support to a zipfile. I think it's the minimum required to fix this bug. When archiving, if a filename contains surrogateescape'd bytes, it switches to cp437 when it saves the filename into the zipfile. This seems to be the strategy of other zip tools. Nothing changes when unarchiving (probably to deal with what comes out of other tools). The documentation is also updated to mention that unknown encodings are a problem that the zipfile module doesn't handle automatically for you. I think we could do better but this is a major improvement over the status quo (no tracebacks). Would someone care to review this for merge and then we could work on adding some notion of a user-specified encoding to override cp437 encoding on dearchiving. (which I think would satisfy: issue10614, issue10972). The use case in issue10757 might be fixed by this patch (or this patch plus the user specified encoding). Have to look a little harder at it.
|msg184890 - (view)||Author: Toshio Kuratomi (a.badger) *||Date: 2013-03-21 17:23|
Version 2 of the patch * fixes for the style problems noted by ezio.melotti
messages: + msg184890
keywords: + patch
messages: + msg184820
messages: + msg184288
|2012-10-30 10:59:18||haypo||set||messages: + msg174201|
|2012-10-30 10:06:21||stefanholek||set||messages: + msg174196|
|2012-10-30 00:37:31||haypo||set||messages: + msg174166|
|2012-10-25 16:26:18||serhiy.storchaka||set||messages: + msg173766|
|2012-10-25 13:30:57||serhiy.storchaka||set||messages: + msg173756|
|2012-10-25 11:56:45||stefanholek||set||type: enhancement -> |
messages: + msg173754
versions: + Python 3.3
|2012-10-24 14:38:34||r.david.murray||set||messages: + msg173689|
|2012-10-24 14:19:18||stefanholek||set||messages: + msg173687|
messages: + msg173682
versions: + Python 3.4, - Python 3.3