
Author a.badger
Recipients Arfrever, a.badger, asvetlov, ezio.melotti, r.david.murray, serhiy.storchaka, stefanholek, vstinner
Date 2013-03-16.03:44:44
Message-id <1363405485.08.0.633679431542.issue16310@psf.upfronthosting.co.za>
Content
I found some "standards" docs that could bear on this:

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Appendix D:
"D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437."
[..]
"D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding.  If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification."
[..]

So there are two choices for a filename in a zipfile:

* bytes that make valid UTF-8 strings
* bytes that make valid strings in code page 437
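
As a rough illustration of how the two cases are told apart, here is a minimal sketch that checks general purpose bit 11 on an archive member. It assumes Python 3's zipfile module, which exposes the flags as ZipInfo.flag_bits and sets bit 11 itself when a name is not plain ASCII; the helper name utf8_flag_is_set is made up for this example.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11 (1 << 11)

def utf8_flag_is_set(name):
    """Write an empty member called `name` to an in-memory archive
    and report whether bit 11 ends up set for it (hypothetical helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(name, b"")
    with zipfile.ZipFile(buf) as zf:
        info = zf.infolist()[0]
    return bool(info.flag_bits & UTF8_FLAG)

ascii_flag = utf8_flag_is_set("plain.txt")
non_ascii_flag = utf8_flag_is_set("caf\u00e9.txt")
```

This only shows what Python's own writer does; other tools may leave the bit unset even for non-ASCII names.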

http://en.wikipedia.org/wiki/Code_page_437#Standard_code_page

Code Page 437 takes up all 256 possible bit patterns available in a byte.
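
A quick way to check that claim, assuming Python's built-in cp437 codec is a fair stand-in for the code page the APPNOTE refers to:

```python
# Every one of the 256 byte values should decode under cp437,
# and encoding the result should give back the original bytes.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("cp437")
round_tripped = decoded.encode("cp437")
```

If any byte value were unassigned, the decode call would raise UnicodeDecodeError instead.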

These two factors mean that if a filename in a zipfile is considered from the POV of a sequence of bytes, it can (according to the zipfile standard) contain any possible sequence of bytes.  If a filename is considered from the POV of a sequence of human characters, it can contain any possible sequence of Unicode code points encoded as UTF-8.

The tricky bit: if the bytes are not valid UTF-8, then officially the characters should be limited to the 256 characters of Code Page 437.  However, the client tools I've looked at exploit the fact that every byte value is valid in Code Page 437 and simply save the raw bytes that make up the filename into the zip file.
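
To make that concrete, here is a small sketch of decoding a stored name the way Appendix D describes. The function decode_member_name and the sample bytes are made up for illustration; this is not how any particular tool or the zipfile module is claimed to do it.

```python
UTF8_FLAG = 0x800  # general purpose bit 11

def decode_member_name(raw, flag_bits):
    """Decode raw filename bytes per APPNOTE Appendix D (hypothetical helper):
    UTF-8 when bit 11 is set, otherwise Code Page 437."""
    if flag_bits & UTF8_FLAG:
        return raw.decode("utf-8")
    return raw.decode("cp437")

# Bytes that are not valid UTF-8, such as a tool might store verbatim.
raw = b"\xff\xfe.txt"

try:
    raw.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# With bit 11 unset the same bytes still decode, and re-encoding
# the result as cp437 recovers them exactly.
name = decode_member_name(raw, 0)
recovered = name.encode("cp437")
```

The decoded characters may not be what the person who created the archive saw, but no bytes are lost, which is what lets tools get away with storing them as-is.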