classification
Title: zipfile: add "unicode" option to the force the filename encoding to UTF-8
Type:
Stage: resolved
Components: Library (Lib), Unicode
Versions: Python 3.2, Python 3.3
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: ZipFile: add a filename_encoding argument (issue 10614)
Assigned To:
Nosy List: THRlWiTi, alanmcintyre, amaury.forgeotdarc, pitrou, serhiy.storchaka, vstinner
Priority: normal
Keywords: patch

Created on 2011-01-21 12:00 by vstinner, last changed 2017-06-28 03:58 by serhiy.storchaka. This issue is now closed.

Files
File name Uploaded Description
zipfile_unicode.patch vstinner, 2011-01-21 12:00
Messages (12)
msg126724 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-21 12:00
ZipInfo._encodeFilename() tries the cp437 encoding and falls back to UTF-8. The caller cannot choose the encoding.

To work around #10955 (bootstrap issue with python32.zip), it would be nice to be able to create a ZIP file using only UTF-8 filenames.

The attached patch adds a unicode parameter to ZipFile.write(), ZipFile.writestr() and the ZipInfo constructor.
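
For illustration, the fallback and the proposed option could be sketched as follows. This is not the attached patch: the function name encode_filename and the force_utf8 parameter are hypothetical, and the sketch only assumes that bit 11 (0x800) of the ZIP general purpose bit flag marks a UTF-8 filename.

```python
UTF8_FLAG = 0x800  # bit 11 of the ZIP general purpose bit flag: name is UTF-8


def encode_filename(name, force_utf8=False):
    """Return (encoded_name, flag_bits) for a ZIP entry name.

    Hypothetical helper: tries cp437 first and falls back to UTF-8,
    unless force_utf8 is set, in which case UTF-8 is always used.
    """
    if not force_utf8:
        try:
            # No flag: readers are expected to decode the name as cp437.
            return name.encode("cp437"), 0
        except UnicodeEncodeError:
            pass  # not representable in cp437, fall through to UTF-8
    return name.encode("utf-8"), UTF8_FLAG
```

With force_utf8=True even a plain ASCII name carries the UTF-8 flag, which is the behaviour wanted for python32.zip.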
msg126725 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-21 12:03
Oh, this patch also fixes a bug: ZipFile._RealGetContents() doesn't keep the unicode flag, so opening a ZIP file and then writing it somewhere else may change the unicode flag if the flag was set but the filename is also encodable to cp437 (e.g. an ASCII filename).
msg126727 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-21 12:07
7zip and WinRAR use the same algorithm as ZipFile._encodeFilename(): try cp437, otherwise use UTF-8. E.g. if a filename contains ∞ (U+221E), it is encoded to UTF-8.

WinZIP encodes all filenames to cp437: ∞ (U+221E) is replaced by 8 (U+0038), ☺ (U+263A) is replaced by... U+0001!

7zip, WinRAR and WinZIP are able to decode UTF-8 filenames (they handle the unicode flag correctly).
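
Handling the unicode flag on the reading side amounts to something like the sketch below. The helper name decode_filename is hypothetical, and it assumes that names without the flag are cp437, which is what the thread describes; it is not the code of any of these tools.

```python
UTF8_FLAG = 0x800  # bit 11 of the ZIP general purpose bit flag: name is UTF-8


def decode_filename(raw, flag_bits):
    """Decode a raw ZIP entry name according to its unicode flag.

    Hypothetical helper: UTF-8 when the flag is set, cp437 otherwise.
    """
    encoding = "utf-8" if flag_bits & UTF8_FLAG else "cp437"
    return raw.decode(encoding)
```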
msg126731 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-21 12:18
What kind of problem are you trying to solve?
msg126734 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-21 13:00
> What kind of problem are you trying to solve?

Support non-ASCII filenames in python32.zip (#10955): at bootstrap, Python 3.2 can only use the UTF-8 codec (not cp437).

I also suppose that forcing the encoding to UTF-8 gives better Unicode support (when you decompress the archive).
msg126735 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-21 13:03
> Support non-ASCII filenames in python32.zip (#10955): at bootstrap,
> Python 3.2 can only use UTF-8 codec (not cp437).
> 
> But I suppose also that forcing the encoding to UTF-8 gives a better
> Unicode support (when you decompress the archive).

The question is, rather, why you need an external flag for that.
msg126745 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2011-01-21 15:02
> The question is, rather, why you need an external flag for that.

Because I don't want to change the default encoding. I am not sure that all applications support UTF-8 filenames.

But if you control your environment, forcing the UTF-8 encoding should improve your Unicode support.
msg126746 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-01-21 15:12
> > The question is, rather, why you need an external flag for that.
> 
> Because I don't want to change the default encoding. I am not sure
> that all applications support UTF-8 encodings.

If this is a ZIP standard flag, why should we care about applications
which don't support it? Should we add other flags to disable other
features out of fear that other applications might not support them
either?

> But if you control your environment, force UTF-8 encoding should
> improve your Unicode support.

How is a random user supposed to know if their tools support UTF-8
encoding? It's not like everyone is an expert in ZIP files. This is the
kind of situation where asking the user to make a choice is more
confusing than helpful. When adding the flag, not only do you complicate
the API, but you have to support this flag for the rest of your life
(well, almost :-)).

We could instead use utf-8 by default for all non-ascii filenames (and
*perhaps* have a separate force_cp437 flag, but default it to False).
msg126759 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-01-21 17:59
This looks similar to issue10614
msg276182 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-09-13 06:18
Now UTF-8 is used for non-ASCII names. Can this issue be closed as outdated?
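
One way to check this on a given Python version is to write an archive in memory, reopen it and inspect ZipInfo.flag_bits. This is a minimal sketch; the helper name utf8_flags and the sample entry names are made up for the example.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # bit 11 of the ZIP general purpose bit flag: name is UTF-8


def utf8_flags(names):
    """Write the given entry names to an in-memory ZIP file, reopen it,
    and return {filename: True if the UTF-8 flag is set on that entry}."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"data")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}


flags = utf8_flags(["ascii.txt", "\u65e5\u672c\u8a9e.txt"])
```

The second name cannot be encoded to ASCII or cp437, so it should be stored as UTF-8 with the flag set, while the ASCII name should not carry the flag.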
msg297125 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017-06-28 01:37
> This looks similar to issue10614

Right. Let's focus on that one, which has a better design. "unicode" means everything and nothing. It's more reliable to specify an encoding.
msg297148 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017-06-28 03:58
See also issue28080.
History
Date User Action Args
2017-06-28 03:58:24 serhiy.storchaka set messages: + msg297148
2017-06-28 01:37:05 vstinner set status: open -> closed; superseder: ZipFile: add a filename_encoding argument; messages: + msg297125; resolution: duplicate; stage: resolved
2016-09-13 06:18:50 serhiy.storchaka set messages: + msg276182
2015-09-12 05:55:36 THRlWiTi set nosy: + THRlWiTi
2015-07-21 08:11:16 ethan.furman set nosy: - ethan.furman
2013-10-14 22:43:18 ethan.furman set nosy: + ethan.furman
2012-04-07 19:22:03 serhiy.storchaka set nosy: + serhiy.storchaka
2011-01-21 17:59:51 amaury.forgeotdarc set messages: + msg126759
2011-01-21 15:16:13 pitrou set nosy: + amaury.forgeotdarc
2011-01-21 15:12:22 pitrou set messages: + msg126746
2011-01-21 15:02:07 vstinner set messages: + msg126745
2011-01-21 13:03:42 pitrou set messages: + msg126735
2011-01-21 13:00:51 vstinner set messages: + msg126734
2011-01-21 12:18:49 pitrou set nosy: + pitrou; messages: + msg126731
2011-01-21 12:07:38 vstinner set title: zipfile: add unicode option to the choose filename encoding -> zipfile: add "unicode" option to the force the filename encoding to UTF-8
2011-01-21 12:07:08 vstinner set nosy: + alanmcintyre; messages: + msg126727
2011-01-21 12:03:06 vstinner set messages: + msg126725
2011-01-21 12:00:43 vstinner create