Message 107455 - Python tracker

Author	vstinner
Recipients	lars.gustaebel, lemburg, loewis, vstinner
Date	2010-06-10.11:47:38
Message-id	<1276170460.83.0.595396051352.issue8784@psf.upfronthosting.co.za>

Content

I created a tarball (.tar.gz) on Windows with Python 3.1 (which uses the "mbcs" encoding). With locale.getpreferredencoding() == 'cp1252', "é" (U+00e9) is encoded as 0xe9 (1 byte) and "à" (U+00e0) as 0xe0 (1 byte). WinRAR displays the file names correctly, but 7-zip displays the wrong glyphs.

So WinRAR expects CP1252 whereas 7-zip expects CP850.
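
The byte values above can be checked with a few lines of Python. This is only a sketch: it assumes the two tools decode names with cp1252 and cp850 respectively, as inferred above, and the helper name decode_both is made up for the illustration.

```python
def decode_both(ch):
    """Encode one character with cp1252, then decode the byte with each code page."""
    raw = ch.encode("cp1252")
    return raw, raw.decode("cp1252"), raw.decode("cp850")

# Each entry is (raw byte, name as a cp1252 reader sees it, name as a cp850 reader sees it).
results = {ch: decode_both(ch) for ch in ("é", "à")}
# results["é"] is (b"\xe9", "é", "Ú")
# results["à"] is (b"\xe0", "à", "Ó")
```

The same byte is a valid character in both code pages, so neither tool reports an error; the cp850 reader just shows a different glyph.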

I also tested an archive encoded with UTF-8: both WinRAR and 7-zip display the wrong glyphs, because they decode the UTF-8 bytes with CP1252 / CP850 :-/
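
A small sketch of what happens to a UTF-8 name under that assumption (that the tools decode with cp1252 or cp850); the variable names are only for illustration.

```python
utf8_bytes = "é".encode("utf-8")        # two bytes: 0xc3 0xa9

# Each byte is decoded on its own, so one character turns into two.
as_cp1252 = utf8_bytes.decode("cp1252")  # "Ã©"
as_cp850 = utf8_bytes.decode("cp850")    # two unrelated cp850 characters
```

Both code pages assign a character to 0xc3 and to 0xa9, so the decoding succeeds silently and the reader gets mojibake instead of an error.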

If an archive will be used on UNIX, I think that it should use UTF-8 (whether it is created on Windows or UNIX). But if the archive is read on Windows with WinRAR or 7-zip, it should use a code page.
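
To make the trade-off concrete, here is a minimal sketch using the encoding argument that tarfile.open() passes on to TarFile. It assumes the GNU tar format (so that the name is stored only in the classic header) and an in-memory archive; make_tar and list_names are hypothetical helpers, not part of tarfile.

```python
import io
import tarfile

def make_tar(name, encoding, errors="strict"):
    """Return the bytes of an in-memory GNU tar archive with one empty member."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                      encoding=encoding, errors=errors) as tar:
        tar.addfile(tarfile.TarInfo(name))  # size is 0, so no file object is needed
    return buf.getvalue()

def list_names(raw, encoding):
    """Read the archive back, decoding member names with the given encoding."""
    with tarfile.open(fileobj=io.BytesIO(raw), mode="r", encoding=encoding) as tar:
        return tar.getnames()

legacy = make_tar("é.txt", "cp1252")   # header starts with b"\xe9.txt"
utf8 = make_tar("é.txt", "utf-8")      # header starts with b"\xc3\xa9.txt"

same_codepage = list_names(legacy, "cp1252")   # ["é.txt"]
other_codepage = list_names(legacy, "cp850")   # ["Ú.txt"]
utf8_as_cp1252 = list_names(utf8, "cp1252")    # ["Ã©.txt"]
```

The archive itself carries no marker for the encoding used, so the reader has to guess; the name only survives when writer and reader agree.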

Since mbcs looks like the least bad choice, it could be used, but with the "replace" error handler (because mbcs doesn't support the "surrogateescape" error handler).
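
A sketch of what the "replace" handler would do to file names. The "mbcs" codec only exists on Windows, so cp1252 is used here as a stand-in (an assumption that matches the machine described above); encode_name is a hypothetical helper.

```python
def encode_name(name, encoding="cp1252", errors="replace"):
    """Encode a file name, substituting '?' for anything the code page lacks."""
    return name.encode(encoding, errors)

ok = encode_name("é.txt")      # b"\xe9.txt": representable, unchanged
lossy = encode_name("中.txt")   # b"?.txt": not in cp1252, replaced

# A name that was decoded with "surrogateescape" contains lone surrogates;
# with "replace" they are also turned into '?', so the original byte is lost.
undecodable = b"\xff.txt".decode("utf-8", "surrogateescape")
escaped = encode_name(undecodable)   # b"?.txt"

# With the "strict" handler the same name raises instead of being mangled.
try:
    encode_name("中.txt", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
```

The cost of "replace" is that different names can collapse to the same bytes, which is why it is only the least bad option.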

--

About the code pages:

 - the chcp command displays "Active code page: 850"
 - python -c "import locale; print(locale.getpreferredencoding())" displays "cp1252"
 - python -c "import sys; print(sys.stdout.encoding)" displays "cp850"

Python calls GetConsoleOutputCP() to get the stdout/stderr encoding (code page), whereas locale.getpreferredencoding() (through _locale._getdefaultlocale()) calls GetACP().
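
The values above can be collected in one place with something like the following sketch. The code_pages helper is hypothetical; the two Win32 calls are made through ctypes and only on Windows, and newer Python versions may report different encodings than the 3.1 values quoted here.

```python
import locale
import sys

def code_pages():
    """Collect the encodings Python reports, plus the Win32 code pages on Windows."""
    info = {
        "preferred": locale.getpreferredencoding(False),
        "stdout": getattr(sys.stdout, "encoding", None),
    }
    if sys.platform == "win32":
        import ctypes
        kernel32 = ctypes.windll.kernel32
        info["GetACP"] = kernel32.GetACP()                          # ANSI code page
        info["GetConsoleOutputCP"] = kernel32.GetConsoleOutputCP()  # console output code page
    return info

pages = code_pages()
```

On the machine described above, GetACP would be expected to return 1252 and GetConsoleOutputCP 850 when run from a console.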

History
Date	User	Action	Args
2010-06-10 11:47:41	vstinner	set	recipients: + vstinner, lemburg, loewis, lars.gustaebel
2010-06-10 11:47:40	vstinner	set	messageid: <1276170460.83.0.595396051352.issue8784@psf.upfronthosting.co.za>
2010-06-10 11:47:38	vstinner	link	issue8784 messages
2010-06-10 11:47:38	vstinner	create