Message 184288 - Python tracker


Author	a.badger
Recipients	Arfrever, a.badger, asvetlov, ezio.melotti, r.david.murray, serhiy.storchaka, stefanholek, vstinner
Date	2013-03-16.03:44:44
Message-id	<1363405485.08.0.633679431542.issue16310@psf.upfronthosting.co.za>
In-reply-to

Content
I found some "standards" docs that could bear on this:

http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Appendix D:
"D.1 The ZIP format has historically supported only the original IBM PC character encoding set, commonly referred to as IBM Code Page 437."
[..]
"D.2 If general purpose bit 11 is unset, the file name and comment should conform to the original ZIP character encoding.  If general purpose bit 11 is set, the filename and comment must support The Unicode Standard, Version 4.1.0 or greater using the character encoding form defined by the UTF-8 storage specification."
[..]

So there are two choices for a filename in a zipfile:

* bytes that make valid UTF-8 strings
* bytes that make valid strings in code page 437
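
A minimal sketch of how a reader could apply the D.2 rule quoted above, assuming it already has the raw filename bytes and the general purpose flags for an entry. The names `UTF8_FLAG` and `decode_zip_name` are made up for illustration and are not part of any library API.

```python
# General purpose bit 11 of the ZIP entry flags (1 << 11).
UTF8_FLAG = 0x800


def decode_zip_name(raw: bytes, flag_bits: int) -> str:
    """Decode a stored filename according to APPNOTE Appendix D.2."""
    if flag_bits & UTF8_FLAG:
        # Bit 11 set: the name is declared to be UTF-8.
        return raw.decode("utf-8")
    # Bit 11 unset: the name is declared to be Code Page 437.
    return raw.decode("cp437")


raw = "é".encode("utf-8")          # two bytes: 0xC3 0xA9
as_utf8 = decode_zip_name(raw, UTF8_FLAG)
as_cp437 = decode_zip_name(raw, 0)  # same bytes, read as two cp437 characters
```

The same two bytes produce one character under the first rule and two unrelated characters under the second, which is why the flag matters.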

http://en.wikipedia.org/wiki/Code_page_437#Standard_code_page

Code Page 437 assigns a character to every one of the 256 possible byte values.
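
This can be checked quickly with Python's built-in `cp437` codec (a sketch; it relies on Python's own mapping table, which maps bytes 0x00-0x1F to control characters rather than the IBM PC glyphs):

```python
# Every possible byte value, 0x00 through 0xFF.
all_bytes = bytes(range(256))

# Decoding never fails: each byte has a character assigned.
decoded = all_bytes.decode("cp437")

# Each byte maps to a different character, so the mapping is reversible.
distinct = len(set(decoded))
round_trip = decoded.encode("cp437")

print(len(decoded), distinct, round_trip == all_bytes)
```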

These two factors mean that if a filename in a zipfile is considered from the POV of a sequence of bytes, it can (according to the zipfile standard) contain any possible sequence of bytes.  If a filename is considered from the POV of a sequence of human-readable characters, it can contain any possible sequence of Unicode code points encoded as UTF-8.

The tricky bit: if the bytes are not valid UTF-8, then officially the characters should be limited to the 256 characters of Code Page 437.  However, the client tools I've looked at exploit the fact that every byte value is valid and simply save the raw bytes that make up the filename into the zip file.
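
A small sketch of what that looks like in practice. Latin-1 is only an assumed example of the encoding such a tool might have used for the filename; it is not something the spec or the tools above specify.

```python
# A tool writes the filename bytes as-is, here in Latin-1 (assumed example).
raw_name = "café.txt".encode("latin-1")   # b'caf\xe9.txt'

# The bytes are not valid UTF-8 ...
try:
    raw_name.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# ... but they always decode as Code Page 437, just with the wrong character.
shown = raw_name.decode("cp437")

# Nothing is lost: re-encoding as cp437 returns the original bytes, which
# can then be decoded with the encoding the tool actually used.
recovered = shown.encode("cp437").decode("latin-1")

print(valid_utf8, shown != "café.txt", recovered)
```

So a reader that follows the spec sees a mangled but lossless name, and the original can be recovered only if the real encoding is known from somewhere else.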
