Issue 16310: zipfile: allow surrogates in filenames

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/60514

classification

Title:	zipfile: allow surrogates in filenames
Type:		Stage:
Components:	Library (Lib), Unicode	Versions:	Python 3.3, Python 3.4

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Arfrever, a.badger, asvetlov, ezio.melotti, r.david.murray, serhiy.storchaka, stefanholek, vstinner
Priority:	normal	Keywords:	patch

Created on 2012-10-24 12:33 by stefanholek, last changed 2022-04-11 14:57 by admin.

Files
File name	Uploaded	Description	Edit
python3-zipfile-surrogate.patch	a.badger, 2013-03-21 00:54	ver 1 of a patch to allow surrogates in zipfiles	review
python3-zipfile-surrogate.patch	a.badger, 2013-03-21 17:23	ver 2 of patch for surrogates in zipfiles	review

Messages (13)
msg173676 - (view)	Author: Stefan Holek (stefanholek)	Date: 2012-10-24 12:33
Please allow for surrogates in the zipfile module like it was done for tarfile in #8390. Currently zipfile breaks when encountering surrogates: Traceback (most recent call last): File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 392, in _encodeFilenameFlags return self.filename.encode('ascii'), self.flag_bits UnicodeEncodeError: 'ascii' codec can't encode character '\udcfc' in position 21: ordinal not in range(128) During handling of the above exception, another exception occurred: Traceback (most recent call last): File "setup.py", line 20, in <module> 'setuptools', File "/usr/local/python3.3/lib/python3.3/distutils/core.py", line 148, in setup dist.run_commands() File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 917, in run_commands self.run_command(cmd) File "/usr/local/python3.3/lib/python3.3/distutils/dist.py", line 936, in run_command cmd_obj.run() File "/home/stefan/sandbox/setuptools-git/lib/python3.3/site-packages/distribute-0.6.30-py3.3.egg/setuptools/command/sdist.py", line 161, in run self.make_distribution() File "/usr/local/python3.3/lib/python3.3/distutils/command/sdist.py", line 447, in make_distribution file = self.make_archive(base_name, fmt, base_dir=base_dir) File "/usr/local/python3.3/lib/python3.3/distutils/cmd.py", line 370, in make_archive dry_run=self.dry_run) File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 178, in make_archive filename = func(base_name, base_dir, **kwargs) File "/usr/local/python3.3/lib/python3.3/distutils/archive_util.py", line 118, in make_zipfile zip.write(path, path) File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 1328, in write self.fp.write(zinfo.FileHeader()) File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 382, in FileHeader filename, flag_bits = self._encodeFilenameFlags() File "/usr/local/python3.3/lib/python3.3/zipfile.py", line 394, in _encodeFilenameFlags return self.filename.encode('utf-8'), self.flag_bits \| 0x800 UnicodeEncodeError: 'utf-8' codec can't encode character '\udcfc' in position 21: surrogates not allowed
msg173682 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-10-24 13:15
The problem you are reporting looks different than the problem addressed in issue 8390. There, the surrogates are being introduced when reading filenames from the archive file. Here, the surrogates presumably arose because the filename on your file system was not utf-8 encoded and so Python introduced the surrogates to preserve the filename. The bug is that zipfile is not handling surrogates when building the archive...which may in fact be correct. If I understand correctly there are two encodings supported by zipfile, a Microsoft code page and utf-8. Anything else should probably be rejected as invalid, but with a better error message. If you really need to include invalid filenames in an archive, we would introduce an explict flag for allowing that. But, that's just my opinion. ("Be generous in what you accept, and strict in what you send")
msg173687 - (view)	Author: Stefan Holek (stefanholek)	Date: 2012-10-24 14:19
A little more context perhaps: The use-case is building Python distributions containing non-ASCII filenames. These seemingly "invalid" filenames can occur in real-life when the files have been created by, say, a 'git clone' operation. So yes, I have Latin-1 bytes on the filesystem, even though my locale is UTF-8. And yes, Python 3 decodes that filename using surrogates. Creating .tar.gz distributions in this situation appears to work (even re-creating the foreign bytes when the archive is later extracted), whereas .zip archives fail in the way described above. I was hoping zipfile could be made to work the same as tarfile in this regard. Concerns for standards certainly didn't keep tarfile from supporting surrogates. ;-)
msg173689 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-10-24 14:38
I'm guessing that is because (if you read the issue) there are no specified standards for the filenames in tar (other than PAX format). Although I would personally have preferred to need to specify a "yes really use these binary filenames" flag to tar, as well. I'm not sure there are real "standards" for zip, either. I'll have to leave that answer to someone more knowledgeable. As for your immediate issue, can't you just set your locale to latin-1 while building the archive? The filenames should then get encoded to utf-8 in the zip archive, which should do the right thing with respect to the user's locale when extracted. I would think that that would be more portable.
msg173754 - (view)	Author: Stefan Holek (stefanholek)	Date: 2012-10-25 11:56
What we are trying to do is make distribute work with non-ASCII filenames, and this is one of the things we ran into. Fact 1: Filenames are bytes, whether you like it or not. Treating them as strings is going to give you more trouble than dragging the bytes along. Fact 2: Surrogates are Python 3's way of dealing with bytes. Fact 3: What follows is that surrogates must be supported wherever Python 3 deals with filenames. Fact 4: This is a bug since Python breaks its own rules here (I have removed the enhancement marker). The issue is not what ZIP can do, but what Python 3 must do. Creating a potentially non-standard ZIP file is fine, exploding in the user's face is not.
msg173756 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-25 13:30
If we allow for surrogates in the names, it will not correct UTF-8. This can breaks other software. We should clear 11th flag bit in this case.
msg173766 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2012-10-25 16:26
Related issues: issue10614, issue10757, issue10972.
msg174166 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-30 00:37
> The use-case is building Python distributions containing > non-ASCII filenames. It's possible to distribute Python packages with non-ASCII filenames. > So yes, I have Latin-1 bytes on the filesystem, > even though my locale is UTF-8. You system is not configured correctly. If you would like to distribute such invalid filename, how do you plan to access it on other platforms where the filename is decoded differently? It would be safer to build your project on a well configured system. See issues mentionned in msg173766 to support: creating a ZIP archive with invalid filenames, and be able to specify the encoding of filenames when decoding a ZIP archive.
msg174196 - (view)	Author: Stefan Holek (stefanholek)	Date: 2012-10-30 10:06
> It's possible to distribute Python packages with non-ASCII filenames. Well, it wasn't until very recently (distribute 0.6.29): https://bitbucket.org/tarek/distribute/issue/303/no-support-for-unicode-manifest-files Unless we are not talking about the same thing, which is possible. ;-) >> So yes, I have Latin-1 bytes on the filesystem, >> even though my locale is UTF-8. > You system is not configured correctly. If you would like to distribute such invalid filename, > how do you plan to access it on other platforms where the filename is decoded differently? > It would be safer to build your project on a well configured system. This was done on purpose, to test how Python fares. Such files can easily come into existence, e.g. when cloning a Git repo created on a different system. I am not after "correct" ZIP files in this case, I am after Python not raising UnicodeErrors when it is supposed to a) support non-ASCII module names and b) support surrogates. python setup.py sdist --formats=gztar -> works python setup.py sdist --formats=zip -> UnicodeError If I am the only one to think this is wrong, then so be it. Our current workaround is to disallow surrogates in the manifest. /me shrugs.
msg174201 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-10-30 10:59
> If I am the only one to think this is wrong, then so be it. > Our current workaround is to disallow surrogates in the manifest. /me shrugs. You are not alone, that's why there are 3 open issues. But someone should finish the different proposition and write a new fully functionnal patch to support bytes filenames.
msg184288 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2013-03-16 03:44
I found some "standards" docs that could bear on this: http://www.pkware.com/documents/casestudies/APPNOTE.TXT Appendix D: "D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437." [..] "D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding. If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification." [..] So there's two choices for a filename in a zipfile: * bytes that make valid UTF-8 strings * bytes that make valid strings in code page 437 http://en.wikipedia.org/wiki/Code_page_437#Standard_code_page Code Page 437 takes up all 256 possible bit patterns available in a byte. These two factors mean that if a filename in a zipfile is considered from the POV of a sequence of bytes, it can (according to the zipfile standard) contain any possible sequence of bytes. If a filename is considered from the POV of a sequence of human characters, it can contain any possible sequence of unicode code points encoded as utf-8. The tricky bit: if the bytes are not valid utf-8 then officially the characters should be limited to the 256 characters of Code Page 437. However, the client tools I've looked at exploit the fact that all bytes are possible to simply save the bytes that make up the filename into the zip file.
msg184820 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2013-03-21 00:54
Okay, here's the first version of a patch to add surrogate support to a zipfile. I think it's the minimum required to fix this bug. When archiving, if a filename contains surrogateescape'd bytes, it switches to cp437 when it saves the filename into the zipfile. This seems to be the strategy of other zip tools. Nothing changes when unarchiving (probably to deal with what comes out of other tools). The documentation is also updated to mention that unknown encodings are a problem that the zipfile module doesn't handle automatically for you. I think we could do better but this is a major improvement over the status quo (no tracebacks). Would someone care to review this for merge and then we could work on adding some notion of a user-specified encoding to override cp437 encoding on dearchiving. (which I think would satisfy: issue10614, issue10972). The use case in issue10757 might be fixed by this patch (or this patch plus the user specified encoding). Have to look a little harder at it.
msg184890 - (view)	Author: Toshio Kuratomi (a.badger) *	Date: 2013-03-21 17:23
Version 2 of the patch * fixes for the style problems noted by ezio.melotti

History
Date	User	Action	Args
2022-04-11 14:57:37	admin	set	github: 60514
2015-07-21 08:10:54	ethan.furman	set	nosy: - ethan.furman
2013-10-14 22:52:10	ethan.furman	set	nosy: + ethan.furman
2013-03-21 17:23:31	a.badger	set	files: + python3-zipfile-surrogate.patch messages: + msg184890
2013-03-21 00:54:21	a.badger	set	files: + python3-zipfile-surrogate.patch keywords: + patch messages: + msg184820
2013-03-16 03:44:45	a.badger	set	nosy: + a.badger messages: + msg184288
2012-10-30 10:59:18	vstinner	set	messages: + msg174201
2012-10-30 10:06:21	stefanholek	set	messages: + msg174196
2012-10-30 00:37:31	vstinner	set	messages: + msg174166
2012-10-28 12:30:07	asvetlov	set	nosy: + asvetlov
2012-10-25 16:26:18	serhiy.storchaka	set	messages: + msg173766
2012-10-25 13:30:57	serhiy.storchaka	set	messages: + msg173756
2012-10-25 11:56:45	stefanholek	set	type: enhancement -> messages: + msg173754 versions: + Python 3.3
2012-10-24 17:48:56	pitrou	set	nosy: + vstinner
2012-10-24 16:16:09	Arfrever	set	nosy: + Arfrever
2012-10-24 14:38:34	r.david.murray	set	messages: + msg173689
2012-10-24 14:19:18	stefanholek	set	messages: + msg173687
2012-10-24 13:15:01	r.david.murray	set	nosy: + r.david.murray messages: + msg173682
2012-10-24 12:59:55	serhiy.storchaka	set	nosy: + serhiy.storchaka type: enhancement versions: + Python 3.4, - Python 3.3
2012-10-24 12:33:45	stefanholek	create