zipfile: add "unicode" option to force the filename encoding to UTF-8 #55181
Comments
ZipInfo._encodeFilename() tries the cp437 encoding and falls back to UTF-8. The caller cannot choose the encoding. To work around bpo-10955 (bootstrap issue with python32.zip), it would be nice to be able to create a ZIP file using only UTF-8 filenames. The attached patch adds a unicode parameter to ZipFile.write(), ZipFile.writestr() and the ZipInfo constructor.
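The fallback described above can be sketched roughly as follows. This is only an illustration, not the actual ZipInfo._encodeFilename() code or the attached patch: `encode_filename` and its `force_utf8` parameter are hypothetical names, and `0x800` is general-purpose flag bit 11, which the ZIP format uses to mark a UTF-8 filename.

```python
UTF8_FLAG = 0x800  # general-purpose bit 11: filename is encoded as UTF-8


def encode_filename(name, force_utf8=False):
    """Return (encoded_name, flag_bits) for a ZIP entry name.

    Hypothetical helper: tries cp437 first unless force_utf8 is set.
    """
    if not force_utf8:
        try:
            return name.encode("cp437"), 0
        except UnicodeEncodeError:
            pass  # not representable in cp437, fall back to UTF-8
    return name.encode("utf-8"), UTF8_FLAG
```

With `force_utf8=True`, even a pure-ASCII name such as "readme.txt" would be stored with the UTF-8 flag set, which is the behaviour the proposed option is meant to allow.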
Oh, this patch also fixes a bug: ZipFile._RealGetContents() doesn't keep the unicode flag, so opening a ZIP file and then writing it somewhere else may change the unicode flag if the flag was set but the filename is also encodable to UTF-8 (e.g. an ASCII filename).
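One way to check which entries carry the unicode flag after an archive has been read back is to look at `ZipInfo.flag_bits`. The sketch below uses only the existing zipfile API (no `unicode` parameter); `utf8_flags` is a hypothetical helper name, and the result depends on the Python version that writes the archive.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: filename is encoded as UTF-8


def utf8_flags(names):
    """Write the given names to an in-memory ZIP, read it back, and
    report for each entry whether the UTF-8 flag is set."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in names:
            zf.writestr(name, b"data")
    buf.seek(0)
    with zipfile.ZipFile(buf) as zf:
        return {info.filename: bool(info.flag_bits & UTF8_FLAG)
                for info in zf.infolist()}
```

Calling `utf8_flags(["plain.txt", "caf\u00e9.txt"])` shows which of the two names was stored with the flag.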
7zip and WinRAR use the same algorithm as ZipInfo._encodeFilename(): try cp437, otherwise use UTF-8. E.g. if a filename contains ∞ (U+221E), it is encoded to UTF-8. WinZIP encodes all filenames to cp437: ∞ (U+221E) is replaced by 8 (U+0038), ☺ (U+263A) is replaced by... U+0001! 7zip, WinRAR and WinZIP are all able to decode UTF-8 filenames (they handle the unicode flag correctly).
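On the reading side, handling the unicode flag correctly amounts to something like the following. This is a minimal sketch, assuming that names without the flag are cp437; `decode_filename` is a hypothetical helper, not the code of any of the tools mentioned.

```python
UTF8_FLAG = 0x800  # general-purpose bit 11: filename is encoded as UTF-8


def decode_filename(raw, flag_bits):
    """Decode a stored ZIP entry name according to its flag bits."""
    encoding = "utf-8" if flag_bits & UTF8_FLAG else "cp437"
    return raw.decode(encoding)
```

If the flag is lost, UTF-8 bytes get decoded as cp437 and a non-ASCII name comes out garbled, which is why preserving the flag matters.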
What kind of problem are you trying to solve?
Support non-ASCII filenames in python32.zip (bpo-10955): at bootstrap, Python 3.2 can only use the UTF-8 codec (not cp437). But I also suppose that forcing the encoding to UTF-8 gives better Unicode support (when you decompress the archive).
The question is, rather, why you need an external flag for that.
Because I don't want to change the default encoding. I am not sure that all applications support the UTF-8 encoding. But if you control your environment, forcing the UTF-8 encoding should improve your Unicode support.
If this is a ZIP standard flag, why should we care about applications
How is a random user supposed to know if their tools support UTF-8? We could instead use UTF-8 by default for all non-ASCII filenames (and
This looks similar to bpo-10614
Now UTF-8 is used for non-ASCII names. Can this issue be closed as outdated?
Right. Let's focus on that one which has a better design. "unicode" means everything and nothing. It's more reliable to specify an encoding.
See also bpo-28080.