Author jgoerzen
Recipients jgoerzen, jnalley
Date 2019-11-26.04:28:16
Message-id <1574742497.21.0.901658869202.issue38861@roundup.psfhosted.org>
In-reply-to
Content
Hi Jon,

I've read your article in the gist, the ZIP spec, and the article you linked to.  As that article (https://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/) states, "Implementers just encode file names however they want (usually byte for byte as they are in the OS".  That is certainly my observation.  CP437 has NEVER been guaranteed, *even on DOS*.  See https://en.wikipedia.org/wiki/Category:DOS_code_pages and https://www.aivosto.com/articles/charsets-codepages-dos.html for details on DOS code pages.  I do not recall any translation between DOS codepages being done in practice, and it is often not even possible, since the whole point of multiple codepages was the need for more than 256 symbols.  So (leaving aside utf-8 encodings for a second) no operating system or ZIP implementation I am aware of performs a translation to cp437, and such a translation is often not even possible.  They just copy literal bytes into the ZIP, in the same way that the POSIX filesystem itself just stores bytes.
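
To make the codepage point concrete, here is a small sketch using Python's built-in codecs. The byte value and the choice of cp866 as the second codepage are only illustrative assumptions; any pair of DOS codepages with differing upper halves would show the same thing.

```python
# The same byte means different characters under different DOS codepages.
raw = b"\x80"

as_cp437 = raw.decode("cp437")  # LATIN CAPITAL LETTER C WITH CEDILLA
as_cp866 = raw.decode("cp866")  # CYRILLIC CAPITAL LETTER A

# A name written on a cp866 system cannot be translated to cp437 at all,
# because cp437 has no Cyrillic letters.
try:
    as_cp866.encode("cp437")
    translatable = True
except UnicodeEncodeError:
    translatable = False
```

Nothing in the byte itself says which of the two readings was intended.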

So, from the above paragraph, it's clear that the assumption in zipfile that cp437 is in use is faulty.  Your claim that Python "fixes" a problem is also faulty.  Taking a byte that was written as a latin-1 character, decoding it with the cp437 codeset, and generating a filename with the resulting cp437 character represented as a Unicode code point is wrong in many ways.  Python should not take an opinion on this; it should be agnostic and copy the bytes that represent the filename in the ZIP to bytes that represent the filename on the filesystem.
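
As a rough illustration of that failure, assume a name that was stored as latin-1 bytes and an extraction target that uses UTF-8 filenames. The example name is made up.

```python
# Bytes as an archiver on a latin-1 system would have stored them.
stored = "caf\u00e9.txt".encode("latin-1")   # b'caf\xe9.txt'

# Decoding those bytes as cp437 turns 0xE9 into GREEK CAPITAL LETTER THETA.
decoded = stored.decode("cp437")

# Writing that string to a UTF-8 filesystem produces different bytes
# than the archive contained.
on_disk = decoded.encode("utf-8")            # b'caf\xce\x98.txt'
```

The name on disk is neither the original bytes nor the original character.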

POSIX filenames may contain any of 254 byte values (only 0x00 and '/' are invalid).  The filesystem is encoding-agnostic; POSIX filenames are just streams of bytes.  There is no alternative but to treat ZIP filenames (without the Unicode flag) the same way.  Copy bytes to bytes.  It is not possible to identify the encoding of the filename in the absence of the Unicode flag.
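
A minimal sketch of what "just bytes" means in Python terms. The helper name `is_valid_posix_component` is hypothetical, and the explicit UTF-8 codec stands in for whatever the filesystem encoding is; on POSIX, `os.fsdecode` and `os.fsencode` apply the same `surrogateescape` handler.

```python
def is_valid_posix_component(name: bytes) -> bool:
    """Hypothetical check: non-empty, no NUL byte, no '/'."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name

# Count the single-byte values that are allowed in a name.
allowed = sum(is_valid_posix_component(bytes([b])) for b in range(256))

# Bytes that are not valid UTF-8 still round-trip through str unchanged
# when the surrogateescape error handler is used.
raw = b"caf\xe9.txt"
as_str = raw.decode("utf-8", "surrogateescape")
round_tripped = as_str.encode("utf-8", "surrogateescape")
```

Here `allowed` comes out to 254 and `round_tripped` equals `raw`, so an arbitrary byte string can be carried through Python's str-based filename APIs without guessing its encoding.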

zipfile should:

1) expose a bytes interface to filename
2) use byte-for-byte extraction when no Unicode flag is present
3) not make the assumption that cp437 was the original encoding
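
Until something like the above exists, the stored bytes can be recovered by undoing the cp437 decode. The following is only a sketch of that workaround, not a proposed patch: `raw_member_name` and `make_demo_archive` are hypothetical helpers, and the demo archive is produced by patching bytes because zipfile itself will not write a non-ASCII name without setting the Unicode flag. It also assumes the archive was opened with zipfile's default name decoding.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: the name is UTF-8

def raw_member_name(info: zipfile.ZipInfo) -> bytes:
    """Return the filename bytes as stored in the archive (hypothetical helper)."""
    if info.flag_bits & UTF8_FLAG:
        return info.filename.encode("utf-8")
    # Without the flag, zipfile decoded the stored bytes as cp437.
    # Python's cp437 codec maps all 256 byte values, so this is lossless.
    return info.filename.encode("cp437")

def make_demo_archive() -> bytes:
    """Build a tiny archive whose member name contains the raw byte 0xE9."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("cafX.txt", b"hello")
    # Patch the name in both the local header and the central directory.
    # The CRC covers only the file data, so the archive stays valid.
    return buf.getvalue().replace(b"cafX.txt", b"caf\xe9.txt")

with zipfile.ZipFile(io.BytesIO(make_demo_archive())) as zf:
    info = zf.infolist()[0]
    decoded_name = info.filename         # what zipfile reports today
    stored_name = raw_member_name(info)  # what the archive actually holds
```

With the bytes in hand, a caller on POSIX could pass them through `os.fsdecode` to obtain a path that writes those same bytes to disk.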

Your proposal only "works" cross-platform because it is broken on every platform!
History
Date User Action Args
2019-11-26 04:28:17 jgoerzen set recipients: + jgoerzen, jnalley
2019-11-26 04:28:17 jgoerzen set messageid: <1574742497.21.0.901658869202.issue38861@roundup.psfhosted.org>
2019-11-26 04:28:17 jgoerzen link issue38861 messages
2019-11-26 04:28:16 jgoerzen create