classification
Title: zipfile *does* support utf-8 filenames
Type:
Stage: resolved
Components: Documentation
Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Documentation of zipfile.ZipFile().writestr() fails to mention that 'data' may also be bytes
View: 32035
Assigned To: docs@python
Nosy List: cheryl.sabella, dholth, docs@python, r.david.murray, serhiy.storchaka, terry.reedy
Priority: normal
Keywords:

Created on 2016-06-17 15:08 by dholth, last changed 2019-02-06 18:25 by cheryl.sabella. This issue is now closed.

Messages (12)
msg268727 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-17 15:08
The zipfile documentation says "There is no official file name encoding for ZIP files." However, both the ZIP format and zipfile support utf-8 filenames; this has been true for a long time, at least since Python 2.7.
msg268750 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2016-06-18 00:22
There is a difference between 'official' and 'supported', and I don't quite know what you mean by the latter.
msg269035 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-06-22 00:24
See issue 10614 for the current state of play.  This issue should probably be closed in favor of that one.
msg269041 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-22 02:50
This is a simple documentation bug about the ZIP file format supporting utf-8 and 'no encoding' filenames depending on whether two bits are set in a flag inside the archive member. Bug 10614 appears to be a different issue about out-of-band encoding information that you could pass to Python's zipfile implementation.
msg269117 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-06-23 15:24
OK, what do you propose as a documentation change?  The current doc is accurate, but incomplete.  New phrasing could mention the two de facto standards, while noting that one cannot be sure that filenames will be in one of those two encodings.  Issue 10614 addresses the fact that the zipfile module doesn't make it easy to specify the encoding of filenames when creating an archive, IIUC, which also still needs to be addressed in any documentation change.
msg269120 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-23 15:46
The current documentation says "Note There is no official file name encoding for ZIP files. If you have unicode file names, you must convert them to byte strings in your desired encoding before passing them to write(). WinZip interprets all file names as encoded in CP437, also known as DOS Latin."

This is bad advice because if you convert the filenames to bytes before passing them to zipfile, it won't remember that they should be unicode. Instead it should say

"The ZIP file format supports Unicode filenames. If you have unicode filenames, zipfile will encode them to and from utf-8 internally. If you pass bytes filenames to write() then they will be stored without a specified encoding."

I am not sure what current versions of WinZip or Windows file manager do.
msg269121 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-23 15:47
" ... zipfile will encode them to and from utf-8 internally, and the encoding is marked in a standard flag inside the archive member."
msg269123 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-23 16:08
The documentation should read

The ZIP file format supports Unicode filenames. If you have unicode filenames, zipfile will encode them to and from utf-8 internally, but if you pass bytes filenames to write() then they will be stored without a specified encoding.

Even though the format itself supports Unicode, historically Windows' built-in ZIP utility has interpreted all ZIP filenames as CP437, also known as DOS Latin. There is a fix from Microsoft for Windows 7 available here: https://support.microsoft.com/en-us/kb/2704299
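
For illustration only, here is a minimal sketch of the behaviour described in this message, assuming Python 3. The member name "café.txt" is a made-up example, and the "legacy view" is just a simulation of a reader that ignores the utf-8 flag and decodes every name as CP437; it is not a claim about what any particular Windows version does.

```python
import io
import zipfile

name = "café.txt"  # hypothetical member name, chosen only for illustration

# Write a member using a str filename; zipfile encodes it internally.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(name, "hello")

# Reading the archive back with zipfile recovers the original str.
with zipfile.ZipFile(buf) as zf:
    names_read_back = zf.namelist()

# Simulate a reader that assumes CP437 for every name: the stored
# utf-8 bytes are decoded with the wrong codec, giving mojibake.
legacy_view = name.encode("utf-8").decode("cp437")

print(names_read_back)
print(legacy_view)
```

The round trip through zipfile preserves the name, while the simulated CP437 reader shows a different, garbled string for the same stored bytes.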
msg269180 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-06-24 14:28
I bet the existing wording is just left over from the python2 docs.

I think cp437 should still be mentioned explicitly.  And mentioning "setting the utf-8 flag" would probably make the explanation clearer, though I'm not sure.

Technically speaking, I think zipfile supports utf8, not unicode.  Or it supports unicode via utf-8.
msg269190 - (view) Author: Daniel Holth (dholth) * Date: 2016-06-24 16:24
https://hg.python.org/cpython/file/2.6/Lib/zipfile.py#l331

Python 2.6 zipfile supports utf8 properly. It has only improved since then.
msg269201 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-06-24 17:29
This note looks outdated.

In 2.x, 8-bit file names are written as is, implying cp437 or whatever your consumers expect. Unicode file names are encoded to ascii or utf-8 (setting the utf-8 flag in the latter case). In 3.x, only Unicode file names are accepted, and they are always encoded to ascii or utf-8. There is no way to write non-ascii, non-utf-8 file names. cp437 is not used at all.

Maybe just remove this misleading note?
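
The 3.x behaviour described above can be checked with a short sketch like the following. It assumes Python 3 and that the utf-8 marker is general purpose bit 11 (0x800) of the member's flag bits; `stored_name_info` and both file names are hypothetical, introduced only for this example.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # assumed: general purpose bit 11 marks a utf-8 name


def stored_name_info(name):
    """Write one empty member and return (utf8_flag_set, raw_archive_bytes)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"")
    raw = buf.getvalue()
    # Re-open the archive and inspect the flag bits recorded for the member.
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        flag_set = bool(zf.getinfo(name).flag_bits & UTF8_FLAG)
    return flag_set, raw


# An ascii-only name is stored as ascii, without the utf-8 flag.
ascii_flag, ascii_raw = stored_name_info("plain.txt")

# A non-ascii name is stored as utf-8 bytes, with the flag set.
utf8_flag, utf8_raw = stored_name_info("naïve.txt")

print(ascii_flag, utf8_flag)
print("naïve.txt".encode("utf-8") in utf8_raw)
print("naïve.txt".encode("cp437") in utf8_raw)
```

Searching the raw archive bytes is a crude check, but it is enough to see that the name is stored in its utf-8 form rather than its cp437 form.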
msg334969 - (view) Author: Cheryl Sabella (cheryl.sabella) * (Python committer) Date: 2019-02-06 18:25
This wording was removed as part of issue 32035.
History
Date User Action Args
2019-02-06 18:25:03 cheryl.sabella set status: open -> closed
    superseder: Documentation of zipfile.ZipFile().writestr() fails to mention that 'data' may also be bytes
    nosy: + cheryl.sabella
    messages: + msg334969
    resolution: duplicate
    stage: needs patch -> resolved
2016-06-24 17:29:11 serhiy.storchaka set messages: + msg269201
2016-06-24 16:24:40 dholth set messages: + msg269190
2016-06-24 14:28:02 r.david.murray set messages: + msg269180
2016-06-23 16:08:32 dholth set messages: + msg269123
2016-06-23 15:47:29 dholth set messages: + msg269121
2016-06-23 15:46:24 dholth set messages: + msg269120
2016-06-23 15:24:39 r.david.murray set messages: + msg269117
2016-06-22 02:50:33 dholth set messages: + msg269041
2016-06-22 00:24:20 r.david.murray set nosy: + r.david.murray
    messages: + msg269035
2016-06-18 00:22:16 terry.reedy set nosy: + terry.reedy
    messages: + msg268750
2016-06-17 19:11:38 serhiy.storchaka set nosy: + serhiy.storchaka
    stage: needs patch
    versions: + Python 3.5
2016-06-17 15:08:46 dholth create