
classification
Title: Detect OEM code page for zip archives in ZipFile based on system locale
Type: Stage:
Components: Library (Lib) Versions: Python 3.10
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: eryksun, ivan.sorokin.tech
Priority: normal Keywords:

Created on 2020-10-04 11:36 by ivan.sorokin.tech, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description
windows_cyrillic.zip ivan.sorokin.tech, 2020-10-04 11:36
Messages (2)
msg377932 - Author: Ivan Sorokin (ivan.sorokin.tech) Date: 2020-10-04 11:36
ZipFile decodes file names incorrectly in .zip archives whose file names are encoded in an OEM code page.

ZipFile assumes that the OEM code page is always "cp437". In practice, many popular .zip packers (for example, the built-in Windows "zip folders" tool) write file names using the OEM code page that corresponds to the system locale.

To read such archives correctly, we should detect the correct OEM code page from the system locale instead of always using cp437.

Here is a locale-to-OEM-code-page conversion table, generated from the Wine source code:
https://github.com/unxed/oemcp/blob/master/oemcp.txt

A sample archive is attached. The file inside should be extracted as "Новый текстовый документ.txt" when the system locale is ru_RU.
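
The mis-decoded names can also be repaired after the fact. Below is a minimal sketch, assuming the archive was written with cp866 (the OEM code page for a ru_RU locale); the helper names `redecode_name` and `list_names` are hypothetical. Entries without the UTF-8 flag (bit 11, 0x800) were decoded by ZipFile as cp437, so encoding the name back to cp437 recovers the raw bytes, which can then be decoded with the intended code page.

```python
import zipfile

UTF8_FLAG = 0x800  # general-purpose bit 11: name is stored as UTF-8


def redecode_name(name, flag_bits, codepage="cp866"):
    """Undo ZipFile's cp437 decoding and decode with `codepage` instead.

    `codepage` is an assumption the caller must supply; cp866 is the OEM
    code page for a ru_RU locale.
    """
    if flag_bits & UTF8_FLAG:
        return name  # already decoded as UTF-8, leave untouched
    return name.encode("cp437").decode(codepage)


def list_names(path, codepage="cp866"):
    """Return the member names of the archive at `path`, re-decoded."""
    with zipfile.ZipFile(path) as zf:
        return [redecode_name(info.filename, info.flag_bits, codepage)
                for info in zf.infolist()]
```

For the attached sample, `list_names("windows_cyrillic.zip")` should then yield the expected Cyrillic name, provided the archive was indeed written with cp866.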
msg377948 - Author: Eryk Sun (eryksun) * (Python triager) Date: 2020-10-04 15:42
This is already addressed in bpo-28080, which adds an `encoding` parameter -- e.g. `encoding="oem"` ("oem" is only available in Windows). Unfortunately bpo-28080 has languished without resolution for four years.
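
Because the "oem" codec exists only on Windows, portable code needs a fallback. A small sketch of that check follows; the helper name `pick_oem_codec` and the cp437 fallback are assumptions for illustration, not part of bpo-28080.

```python
import codecs


def pick_oem_codec(fallback="cp437"):
    """Return "oem" where Python provides that codec, else `fallback`."""
    try:
        codecs.lookup("oem")
    except LookupError:
        # The "oem" codec is not registered on non-Windows platforms.
        return fallback
    return "oem"
```

The returned name can be passed to `bytes.decode()` or `str.encode()` like any other codec name.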
History
Date User Action Args
2022-04-11 14:59:36 admin set github: 86095
2020-10-04 15:42:42 eryksun set nosy: + eryksun; messages: + msg377948
2020-10-04 11:36:22 ivan.sorokin.tech create