Message 407666 - Python tracker


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

Author	accelerator0099
Recipients	accelerator0099
Date	2021-12-04.14:01:19
Message-id	<1638626479.32.0.427915201739.issue45981@roundup.psfhosted.org>
In-reply-to

Content

In file Lib/zipfile.py (lines 1357-1363):
    flags = centdir[5]
    if flags & 0x800:
        # UTF-8 file names extension
        filename = filename.decode('utf-8')
    else:
        # Historical ZIP filename encoding
        filename = filename.decode('cp437')

ZipFile simply decodes every file name that is not flagged as UTF-8 as CP437.

In file Lib/zipfile.py (lines 352-356):
    # This is used to ensure paths in generated ZIP files always use
    # forward slashes as the directory separator, as required by the
    # ZIP format specification.
    if os.sep != "/" and os.sep in filename:
        filename = filename.replace(os.sep, "/")

And on Windows, where os.sep is '\\', it replaces every '\\' with '/'.

Consider a file whose name is stored as the bytes '\x97\x5c\x92\x9b', which is '予兆' in Japanese, encoded in Shift_JIS.
You may have noticed the problem:

  '\x5c' is '\\' (backslash) in ASCII

So ZipFile decodes the bytes as CP437 and then replaces every '\\' with '/'.
The second byte of the Japanese character '予' is overwritten, so the name can no longer be decoded back to the original text.
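
To make the two steps concrete, here is a minimal sketch that reproduces them by hand. It does not call zipfile at all; the separator is hard-coded to '\\' (an assumption standing in for os.sep on Windows) so the result is the same on any OS.

```python
# Raw file name bytes as stored in the archive: '予兆' encoded as Shift_JIS.
raw = "予兆".encode("shift_jis")

# Step 1: names without the UTF-8 flag are decoded as CP437.
decoded = raw.decode("cp437")

# Step 2: on Windows os.sep is "\\", so it gets replaced with "/".
# The separator is hard-coded here to imitate Windows on any platform.
windows_sep = "\\"
normalized = decoded.replace(windows_sep, "/")

# Re-encoding the normalized name no longer yields the original bytes:
# the 0x5c trail byte of '予' has become 0x2f.
round_tripped = normalized.encode("cp437")
print(round_tripped == raw)  # False
```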

Some suggest replacing '/' with '\\' again and then encoding the result as CP437 to recover the raw bytes.
But what if both '/' ('\x2f') and '\\' ('\x5c') appear in the raw filename?
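
A small sketch of that failure case, again imitating the Windows behaviour by hand. The 'dir/' prefix is just an illustrative name, not taken from a real archive.

```python
# A raw name that contains both a real '/' (0x2f) and a 0x5c trail byte.
raw = b"dir/" + "予兆".encode("shift_jis")

# What the reader ends up with on Windows: CP437 decode, then '\\' -> '/'.
normalized = raw.decode("cp437").replace("\\", "/")

# The suggested undo: turn every '/' back into '\\' and re-encode as CP437.
naive_undo = normalized.replace("/", "\\").encode("cp437")

# The genuine directory separator was turned into a backslash as well,
# so the recovered bytes differ from the original ones.
print(naive_undo == raw)  # False
```

After the replacement there is no way to tell which '/' was a real separator and which one was a 0x5c byte, so any undo has to guess.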

Simply replacing '\\' in a byte stream without knowing the encoding is by no means a good approach.
Maybe we could provide a rawname field in the ZipInfo struct?
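
As a rough idea of what such a field would carry, the sketch below pulls the undecoded names straight out of the central directory. read_raw_names is a hypothetical helper, not part of zipfile, and it assumes a single-disk archive without ZIP64 whose comment does not contain the end-of-central-directory signature.

```python
import struct

EOCD_SIG = b"PK\x05\x06"   # end of central directory record
CDIR_SIG = b"PK\x01\x02"   # central directory file header
CDIR_FMT = "<4s6H3L5H2L"   # fixed 46-byte part of a central directory header
CDIR_SIZE = struct.calcsize(CDIR_FMT)


def read_raw_names(data: bytes) -> list[bytes]:
    """Return the undecoded file names of a ZIP archive held in memory.

    Hypothetical helper: no ZIP64, no multi-disk archives.
    """
    eocd = data.rfind(EOCD_SIG)
    if eocd < 0:
        raise ValueError("not a ZIP archive")
    # EOCD: signature, 4 shorts, directory size, directory offset, comment length
    _, _, _, _, total, _, offset, _ = struct.unpack_from("<4s4H2LH", data, eocd)
    names = []
    for _ in range(total):
        fields = struct.unpack_from(CDIR_FMT, data, offset)
        if fields[0] != CDIR_SIG:
            raise ValueError("bad central directory header")
        name_len, extra_len, comment_len = fields[10], fields[11], fields[12]
        start = offset + CDIR_SIZE
        # Keep the bytes exactly as stored: no decoding, no separator replacement.
        names.append(data[start:start + name_len])
        offset = start + name_len + extra_len + comment_len
    return names
```

With the raw bytes available, a caller who knows the archive came from a Japanese system could decode them with 'shift_jis' itself instead of trying to reverse the CP437 step.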

History
Date	User	Action	Args
2021-12-04 14:01:19	accelerator0099	set	recipients: + accelerator0099
2021-12-04 14:01:19	accelerator0099	set	messageid: <1638626479.32.0.427915201739.issue45981@roundup.psfhosted.org>
2021-12-04 14:01:19	accelerator0099	link	issue45981 messages
2021-12-04 14:01:19	accelerator0099	create