classification
Title: patch for bug 1170311 "zipfile UnicodeDecodeError"
Type:
Stage:
Components: Library (Lib)
Versions:
process
Status: closed
Resolution: accepted
Dependencies:
Superseder:
Assigned To: loewis
Nosy List: kalt, loewis, snaury
Priority: high
Keywords: patch

Created on 2007-06-10 10:53 by snaury, last changed 2008-05-05 21:16 by loewis. This issue is now closed.

Files
File name Uploaded Description
python-zipfile-unicode-filenames.patch snaury, 2007-06-10 10:53 Patch and test case
python-zipfile-unicode-filenames-utf8.patch snaury, 2007-06-10 20:29 Patch that sets language bit for unicode filenames
python-zipfile-unicode-filenames-utf8-2.patch snaury, 2007-06-11 04:22 Patch falls back to ascii when it can, ZipInfo filenames are not damaged after writing
python-zipfile-unicode-filenames-utf8-3.patch snaury, 2007-06-11 04:27 Forgot to add test case in the previous patch
Messages (10)
msg52744 - Author: Alexey Borzenkov (snaury) Date: 2007-06-10 10:53
This patch fixes the UnicodeDecodeError raised when writing a file to a zipfile under a filename of class unicode.
msg52745 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-06-10 16:48
This patch is incorrect. It relies on the system encoding, and allows non-string things as file names. What it really should do is to encode in code page 437; bonus points if it falls back to the UTF-8 feature of zip files when that encoding fails.
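A minimal sketch of the strategy suggested above, written for Python 3 rather than the Python 2 code under review: try code page 437 first and fall back to UTF-8, setting bit 11 (0x800) of the general purpose flags only in the fallback case. The names encode_filename and UTF8_FLAG are hypothetical and are not taken from the patch.

```python
# Hypothetical helper, not the code from the patch.
UTF8_FLAG = 0x800  # bit 11 of the zip general purpose flags: name is UTF-8


def encode_filename(name, flag_bits=0):
    """Return (encoded_name, flag_bits) for a text filename.

    Code page 437 is tried first; UTF-8 is used only when the name
    cannot be represented in cp437, and the flag is set in that case.
    """
    try:
        return name.encode("cp437"), flag_bits
    except UnicodeEncodeError:
        return name.encode("utf-8"), flag_bits | UTF8_FLAG


for candidate in ("readme.txt", "é.txt", "日本語.txt"):
    encoded, flags = encode_filename(candidate)
    print(candidate, encoded, hex(flags))
```

Neither branch depends on the system default encoding, which was the objection to the first patch.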
msg52746 - Author: Alexey Borzenkov (snaury) Date: 2007-06-10 20:29
File Added: python-zipfile-unicode-filenames-utf8.patch
msg52747 - Author: Alexey Borzenkov (snaury) Date: 2007-06-11 04:22
File Added: python-zipfile-unicode-filenames-utf8-2.patch
msg52748 - Author: Alexey Borzenkov (snaury) Date: 2007-06-11 04:27
File Added: python-zipfile-unicode-filenames-utf8-3.patch
msg65935 - Author: Christophe Kalt (kalt) Date: 2008-04-28 21:32
Any chance of this making it in sometime?
The current behaviour is rather limiting/annoying.
msg65939 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-28 22:13
> Any chance of this making it in sometime?

I'll see what I can do for 2.6, but perhaps it gets delayed until
2.7/3.1.
msg66274 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-05-05 17:18
Thanks for the patch, committed as r62724. I didn't see the need to
clear the UTF-8 flag, so I left it in (in case somebody wants to inspect
it).
msg66277 - Author: Alexey Borzenkov (snaury) Date: 2008-05-05 18:40
Martin, I cleared the flag bit because the filename was changed in place, to
mark that the filename does not need further processing. This was primarily
a compatibility concern, to accommodate situations where users try to
do such decoding in their own code (this way the flag won't be there, so
their code won't trigger). Without clearing the flag bit, calling
_decodeFilenameFlags a second time will fail, as will any similar user
code.

I suggest that if users want to know if filename is unicode, they should
check that filename is of class unicode.
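The double-decoding concern can be illustrated with a short Python 3 sketch. It assumes that unflagged names are decoded as cp437 and that the flag bits are returned unchanged; decode_filename and UTF8_FLAG are hypothetical names, and the exact exception raised by the Python 2 code discussed here may differ.

```python
# Hypothetical helper, not the _decodeFilenameFlags method from the patch.
UTF8_FLAG = 0x800  # bit 11 of the zip general purpose flags: name is UTF-8


def decode_filename(raw_name, flag_bits):
    """Decode a stored filename and return it with the flags untouched."""
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8"), flag_bits
    # Assumption: names without the flag are treated as cp437.
    return raw_name.decode("cp437"), flag_bits


raw = "日本語.txt".encode("utf-8")
name, flags = decode_filename(raw, UTF8_FLAG)

# The flag is still set, so code that decodes whenever it sees the flag
# would try again on an already decoded name and fail.
try:
    decode_filename(name, flags)
    second_pass_failed = False
except AttributeError:  # Python 3 str has no decode() method
    second_pass_failed = True

print(name, hex(flags), second_pass_failed)
```

Code that wants to be safe against this can check the type of the name before decoding instead of relying on the flag alone.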
msg66289 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-05-05 21:15
> Martin, I cleared the flag bit because the filename was changed in place, to
> mark that the filename does not need further processing. This was primarily
> a compatibility concern, to accommodate situations where users try to
> do such decoding in their own code (this way the flag won't be there, so
> their code won't trigger). Without clearing the flag bit, calling
> _decodeFilenameFlags a second time will fail, as will any similar user
> code.

I'm not concerned about the compatibility; code that actually does the
decoding still might break since it would expect the filename to be a
byte string if it doesn't explicitly decode. Such an assumption would still
break under your change.

I am concerned about silently faking data. The library shouldn't do
that; it should present the flags unmodified, as some application might
perform further processing (such as displaying the flags to the user).
It would then be confusing if the data processed isn't the one that was
read from disk.

> I suggest that if users want to know if filename is unicode, they should
> check that filename is of class unicode.

That won't work in Py3k, which will always decode the filename.
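As a rough illustration of inspecting the flag rather than the type of the name, the following sketch uses the Python 3 zipfile module, where filenames are always str. It writes two entries to an in-memory archive and reads ZipInfo.flag_bits back; UTF8_FLAG and utf8_flagged are names chosen here for illustration.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the zip general purpose flags: name is UTF-8

# Build a small archive in memory with one ASCII and one non-ASCII name.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("plain.txt", b"ascii name")
    archive.writestr("日本語.txt", b"non-ASCII name")

# Read it back: both names are str, so only the flag tells them apart.
buffer.seek(0)
with zipfile.ZipFile(buffer) as archive:
    utf8_flagged = {
        info.filename: bool(info.flag_bits & UTF8_FLAG)
        for info in archive.infolist()
    }

for filename, flagged in utf8_flagged.items():
    print(filename, flagged)
```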
History
Date User Action Args
2008-05-05 21:16:03 loewis set messages: + msg66289
2008-05-05 18:40:11 snaury set messages: + msg66277
2008-05-05 17:18:55 loewis set status: open -> closed; resolution: accepted; messages: + msg66274
2008-04-28 22:14:30 loewis set priority: normal -> high
2008-04-28 22:13:53 loewis set messages: + msg65939
2008-04-28 21:32:28 kalt set nosy: + kalt; messages: + msg65935
2007-09-10 20:34:45 loewis set assignee: loewis; severity: normal -> major
2007-06-10 10:53:22 snaury create