Message 415745 - Python tracker


Author	gregory.p.smith
Recipients	dhillier, gregory.p.smith, iritkatriel, yudilevi
Date	2022-03-22.06:31:22
Message-id	<1647930682.89.0.609479786809.issue40172@roundup.psfhosted.org>
In-reply-to

Content
Having examined the Lib/zipfile.py code, I find the existing behavior makes sense. When writing, Python's zipfile module produces modern zip files: if a filename is not ASCII, it sets the UTF-8 flag and stores the filename as UTF-8.  This is desirable for use with all normal zip implementations from the past 10-15 years.
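
To make the writing side concrete, here is a minimal sketch that writes one ASCII and one non-ASCII member name and then inspects the flag on each. The member names and the UTF8_FLAG constant name are mine, chosen for illustration; 0x800 is general purpose bit 11, which the appnote uses to mark UTF-8 names.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: name and comment are UTF-8

# Write two members into an in-memory archive (names are illustrative).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("ascii.txt", b"a")
    zf.writestr("caf\u00e9.txt", b"b")  # "café.txt", not encodable as ASCII

# Read the archive back and record whether each member carries the flag.
with zipfile.ZipFile(buf) as zf:
    flags = {info.filename: bool(info.flag_bits & UTF8_FLAG)
             for info in zf.infolist()}
```

Only the name that cannot be encoded as ASCII should end up with the flag set.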

When decoding a zip file, if the UTF-8 flag is not set, we assume cp437 per the PKWARE zip APPNOTE.TXT "spec".  So our reading is correct as well, even for very old files.
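
The reading side can be sketched as follows. Since zipfile itself will not write a non-ASCII name without the flag, this fakes a legacy archive by writing an ASCII placeholder name and patching the raw bytes; that patching trick, the cp866 name, and the variable names are all assumptions made for illustration, not anything zipfile provides.

```python
import io
import zipfile

# Hypothetical legacy name: Russian text stored as cp866 bytes, no UTF-8 flag.
legacy_bytes = "тест.txt".encode("cp866")
placeholder = b"XXXX.txt"  # ASCII name with the same byte length
assert len(legacy_bytes) == len(placeholder)

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr(placeholder.decode("ascii"), b"data")

# Patch the name in both the local header and the central directory.
# The CRC covers only the member data, so the archive stays valid.
raw = buf.getvalue().replace(placeholder, legacy_bytes)

with zipfile.ZipFile(io.BytesIO(raw)) as zf:
    info = zf.infolist()[0]
    name_as_read = info.filename
    utf8_flag = info.flag_bits & 0x800
```

With the flag unset, the name comes back as the stored bytes decoded with cp437, which is a faithful reading only if the writer really used cp437.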

This is being strict in what we produce and lenient in what we accept.  Caveats?  Yes:

If someone does need to produce zip files for use with ancient software that does not support UTF-8, and that software also does not treat the unknown UTF-8 flag as an error condition, it will interpret non-ASCII names in a corrupt manner.

Similarly, even if written with cp437 names (as PR 19335 would do), an old zip implementation that blindly uses the user's locale encoding instead of cp437 will always see corrupt data in that scenario. (aka mojibake?)
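
Both caveats come down to decoding bytes with a different codec than the one used to write them. A rough illustration with plain strings, no archive involved; the sample name and the choice of cp1252 as the "locale" codec are assumptions for the example:

```python
name = "caf\u00e9.txt"  # "café.txt"

# Caveat 1: name stored as UTF-8 (what zipfile writes), read by a tool
# that ignores the flag and assumes cp437.
utf8_read_as_cp437 = name.encode("utf-8").decode("cp437")

# Caveat 2: name stored as cp437, read by a tool that blindly applies the
# user's locale codec (cp1252 assumed here).
cp437_read_as_cp1252 = name.encode("cp437").decode("cp1252")
```

In both cases the ASCII part survives and only the non-ASCII character is mangled, which is why the problem tends to go unnoticed until such a name shows up.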

These are not what I'd expect to be normal use cases. Do you have a common practical example of a need for this?

(The PR on issue28080 provides a way to _read_ legacy zip files that used a codec other than cp437 if you know what it was.)
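
Independently of that PR, a common workaround is to undo the cp437 decoding and redo it with the codec you know was used. The sketch below is not the PR's implementation; recover_name is a hypothetical helper, and it only applies to names that were stored without the UTF-8 flag.

```python
def recover_name(name_as_read: str, actual_codec: str) -> str:
    """Re-decode a member name that zipfile decoded as cp437.

    Python's cp437 codec maps all 256 byte values, so encoding the
    decoded name back to cp437 restores the original stored bytes.
    """
    return name_as_read.encode("cp437").decode(actual_codec)


# Simulate what zipfile would hand back for a cp866-encoded legacy name.
garbled = "тест.txt".encode("cp866").decode("cp437")
recovered = recover_name(garbled, "cp866")
```

This still requires knowing the real codec; nothing in the archive records it.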

---

https://www.loc.gov/preservation/digital/formats/fdd/fdd000354.shtml may also be of interest regarding the zip format.

History
Date	User	Action	Args
2022-03-22 06:31:22	gregory.p.smith	set	recipients: + gregory.p.smith, dhillier, yudilevi, iritkatriel
2022-03-22 06:31:22	gregory.p.smith	set	messageid: <1647930682.89.0.609479786809.issue40172@roundup.psfhosted.org>
2022-03-22 06:31:22	gregory.p.smith	link	issue40172 messages
2022-03-22 06:31:22	gregory.p.smith	create