This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: tarfile/Windows: Don't use mbcs as the default encoding
Type: Stage:
Components: Library (Lib), Unicode, Windows Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: lars.gustaebel, lemburg, loewis, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2010-05-22 01:22 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name Uploaded Description
tarfile_windows_utf8.patch vstinner, 2010-05-22 01:22
tarfile_mbcs_errors.patch vstinner, 2010-06-10 17:18
tarfile_windows_utf8-2.patch vstinner, 2010-06-10 21:20
Messages (14)
msg106276 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-22 01:22
The mbcs encoding replaces non-encodable characters (losing information) and doesn't support the surrogateescape error handler. It ignores the error handler argument (see #850997), whereas tarfile now uses the surrogateescape error handler by default (#8390). This encoding is just horrible for Unicode support :-)

Since the native Windows API uses Unicode characters (UTF-16), I think it would be better to use utf-8 as the default encoding on Windows. utf-8 is able to encode and decode the full Unicode charset and supports all error handlers (especially surrogateescape).

The attached patch sets the default encoding to utf-8 on Windows and removes the "ENCODING is None" test, because sys.getfilesystemencoding() can no longer return None (in 3.2 only; it's a recent change: #8610).
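
A minimal sketch of what the surrogateescape handler buys compared with a replacing handler. The file name below is invented for illustration (Latin-1 bytes being decoded as UTF-8); it is not taken from the patch.

```python
# Hypothetical file name stored as Latin-1 bytes; the byte 0xE9 is not valid UTF-8 here.
raw_name = b"caf\xe9.txt"

# surrogateescape maps each undecodable byte to a lone surrogate (U+DC80..U+DCFF)
# instead of raising an error or substituting a placeholder.
decoded = raw_name.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
restored = decoded.encode("utf-8", "surrogateescape")

# The "replace" handler substitutes U+FFFD, so the original byte is lost.
lossy = raw_name.decode("utf-8", "replace")

print(ascii(decoded), restored, ascii(lossy))
```

The round trip is lossless only because the same handler is used in both directions, which is the property a codec that ignores its error handler argument cannot offer.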
msg106758 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2010-05-30 12:31
My expertise on Windows is rather limited, but as far as I understand the issue, I consider this a reasonable idea.
I think it is impossible to find a perfect default encoding, and utf-8 seems to be the best bet with regard to portability. IIRC most of the archivers on the Windows machines I have access to use latin-1, but I don't think that latin-1 is a suitable default value. I don't know much about Windows internals and have no idea what mbcs really is, but it is actually not available on other platforms.
msg107435 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-09 22:57
I created a TAR archive with the 7-zip archiver containing files with diacritics in their names (e.g. "é" and "à"). Then I opened the archive with WinRAR: the file names were not displayed correctly :-/

7-zip encodes "à" (U+00e0) as 0x85 (1 byte), and "é" (U+00e9) as 0x82 (1 byte). I don't know this encoding.
msg107438 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-09 23:09
That's an old DOS code page used in Europe: CP850

http://en.wikipedia.org/wiki/Code_page_850
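
The byte values reported for 7-zip can be checked against Python's own codecs. This is only a quick sanity check, assuming the standard `cp850` and `cp1252` codecs shipped with Python:

```python
# The two characters from the test archive, written as escapes to avoid
# depending on the source file encoding.
text = "\u00e0\u00e9"  # "a" with grave accent, "e" with acute accent

cp850_bytes = text.encode("cp850")    # DOS (OEM) code page
cp1252_bytes = text.encode("cp1252")  # Windows (ANSI) code page

print(cp850_bytes, cp1252_bytes)
```

The CP850 result matches the 0x85 and 0x82 bytes seen in the 7-zip archive, while CP1252 gives the 0xe0 and 0xe9 bytes that coincide with the Unicode code points.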
msg107440 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-09 23:21
Looks like the cmd.exe on WinXP still uses it. At least on my German
WinXP it does for Python 2.3 and older. Starting with Python 2.4,
the behavior changed to use CP1252 instead:

D:\Python26>python
Python 2.6 (r26:66721, Oct  2 2008, 11:35:03) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'àé'
u'\xe0\xe9'

D:\Python25>python
Python 2.5.2 (r252:60911, Feb 21 2008, 13:11:45) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'áé'
u'\xe1\xe9'

D:\Python24>python
Python 2.4 (#60, Nov 30 2004, 11:49:19) [MSC v.1310 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'àé'
u'\xe0\xe9'

D:\Python23>python
Python 2.3.4 (#53, May 25 2004, 21:17:02) [MSC v.1200 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> u'àé'
u'\x85\x82'
>>>
msg107455 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-10 11:47
I created a tarball (.tar.gz) on Windows with Python 3.1 (which uses the "mbcs" encoding). With locale.getpreferredencoding() == 'cp1252', "é" (U+00e9) is encoded as 0xe9 (1 byte) and "à" (U+00e0) as 0xe0 (1 byte). WinRAR displays the file names correctly, but 7-zip displays the wrong glyphs.

So WinRAR expects CP1252 whereas 7-zip expects CP850.

I also tested an archive encoded with UTF-8: both WinRAR and 7-zip display the wrong glyphs, because they decode the UTF-8 bytes with CP1252 / CP850 :-/
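
The same mismatch can be reproduced with tarfile alone. The sketch below is an illustration, not the test that was actually run: the helper name `roundtrip_names` is hypothetical, and GNU format is chosen on purpose so that the name is stored in the header using the requested encoding (pax headers would store it as UTF-8 regardless).

```python
import io
import tarfile

def roundtrip_names(name, write_encoding, read_encoding):
    """Write one empty member to an in-memory archive, then list it back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.GNU_FORMAT,
                      encoding=write_encoding) as tar:
        tar.addfile(tarfile.TarInfo(name))  # zero-length member, no data needed
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r", encoding=read_encoding) as tar:
        return tar.getnames()

same = roundtrip_names("\u00e9.txt", "utf-8", "utf-8")
mixed = roundtrip_names("\u00e9.txt", "utf-8", "cp1252")
print(same, mixed)
```

Reading with the encoding used for writing returns the original name; reading the UTF-8 archive as CP1252 turns the single character into two unrelated ones, which is the kind of wrong glyph the archivers show.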

If an archive will be used on UNIX, I think that the archive should use UTF-8 (on Windows and UNIX). But if the archive is read on Windows with WinRAR or 7-zip, the archive should use a codepage.

Since mbcs looks like the least bad choice, it may be used, but with the "replace" error handler (because it doesn't support the "surrogateescape" error handler).

--

About the code pages:

 - chcp command displays "Active code page: 850"
 - python -c "import locale; print(locale.getpreferredencoding())" displays "cp1252"
 - python -c "import sys; print(sys.stdout.encoding)" displays "cp850"

Python calls GetConsoleOutputCP() to get the stdout/stderr encoding (console code page), whereas locale.getpreferredencoding() (via _locale._getdefaultlocale()) calls GetACP().
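
The three values can be collected in one place with a small helper. This is a sketch: `report_encodings` is a hypothetical name, and the results depend on the platform and Python version, so current interpreters may not report the cp850 / cp1252 pair seen here.

```python
import locale
import sys

def report_encodings():
    """Collect the encodings discussed above; values vary by platform."""
    return {
        # On Windows this reflects the ANSI code page (GetACP).
        "preferred": locale.getpreferredencoding(False),
        # May be None if stdout has been replaced by an object without an encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }

for label, value in report_encodings().items():
    print(f"{label}: {value}")
```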
msg107466 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-10 17:18
My tests with 7-zip and WinRAR convinced me that it's not a good idea to use utf-8 *by default* on Windows. But since mbcs doesn't support the surrogateescape error handler, we should restore the previous behaviour just for this encoding.

tarfile_mbcs_errors.patch creates a function choose_errors() which determines the best error handler depending on the encoding and the mode (read or write):
 - "strict" to write with mbcs
 - "replace" to read with mbcs
 - "surrogateescape" otherwise
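
The three rules above could look roughly like this. The patch itself is not reproduced in this thread, so the signature and the "r"/"w" mode values are assumptions made for illustration only:

```python
def choose_errors(encoding, mode):
    """Pick an error handler; `mode` is assumed to be "r" (read) or "w" (write)."""
    if encoding.lower() == "mbcs":
        # mbcs cannot round-trip undecodable bytes, so fall back to the
        # behaviour described above: tolerant reads, strict writes.
        return "replace" if mode == "r" else "strict"
    return "surrogateescape"

print(choose_errors("mbcs", "r"), choose_errors("mbcs", "w"), choose_errors("utf-8", "r"))
```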

Please review my changes to the documentation :-)

On Windows, patched tarfile works exactly as Python 3.1.
msg107467 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-06-10 17:27
I think you should implement this in a more general way:
have the class test whether the codec supports "surrogateescape"
and then use it. Otherwise fall back to "strict" for writing
and "replace" for reading.
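
One possible way to express that capability test is to probe the codec with a single escaped byte. This is a sketch under stated assumptions: both function names are hypothetical, and the probe treats a codec as supporting surrogateescape only if it turns U+DCFF back into the byte 0xFF.

```python
def supports_surrogateescape(encoding):
    """Return True if the codec round-trips an escaped byte when encoding."""
    try:
        # U+DCFF is what surrogateescape produces for the undecodable byte 0xFF.
        return "\udcff".encode(encoding, "surrogateescape") == b"\xff"
    except (LookupError, UnicodeError):
        # Unknown codec, or a codec that rejects the handler or the surrogate.
        return False

def pick_errors(encoding, mode):
    """Prefer surrogateescape; otherwise "replace" for reads, "strict" for writes."""
    if supports_surrogateescape(encoding):
        return "surrogateescape"
    return "replace" if mode == "r" else "strict"

print(pick_errors("utf-8", "r"), pick_errors("no-such-codec", "w"))
```

Probing the behaviour rather than comparing codec names means the fallback also applies to any other codec that ignores or rejects the handler.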
msg107468 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-06-10 18:40
>> 7-zip encodes "à" (U+00e0) as 0x85 (1 byte), and "é" (U+00e9) as 0x82 (1 byte). I don't know this encoding.
>
> That's an old DOS code paged used in Europe: CP850

There is a good chance that they use it because it is the OEM code page 
on the system.

In any case, I think that both cp850 and cp1252 are inherently incorrect 
for tarfiles (despite these tools using them). tar is a POSIX thing, and 
these encodings have nothing to do with POSIX.

So using UTF-8 is a reasonable choice, IMO. The other reasonable choice 
would be ASCII.
msg107469 - (view) Author: Lars Gustäbel (lars.gustaebel) * (Python committer) Date: 2010-06-10 18:51
Maybe I'm going out on a limb here, but I think we should again consider what tarfile users on Windows(!) actually use it for, and under which circumstances. The following list is probably not exhaustive, but IMHO covers 90%:

1. Download tar archives from a webpage (when no zip is supplied) for viewing or extracting.
2. Create backups for personal use.
3. Create source archives from a project for unix users who hate zipfiles.

I am convinced that the tarfile module is not very popular on Windows, because of a simple reason: tar archives are not. Windows users will always prefer zip archives and hence the zipfile module, because it's something they're familiar with.

The point I am trying to make is, that, first, we should not choose a default encoding based on what works best with WinRAR, 7-zip and such, because they all act very differently which makes it impossible. Second, we must not overemphasize the encoding issue to a point where portability is in danger. This means that in almost all real-life cases there are no encoding issues. In my whole tarfile maintaining career I cannot remember a single incident of a tar archive that I got from an external source that contained special characters. The only tar archives that contain special characters in my experience are backups. But: these backups are created and later restored on one and the same system. Again, no encoding issues.

Long story short, I still vote for utf-8, because it enables Windows users to create backups without losing special characters, and it's ASCII-"compatible" and should be able to read 99% of the files that you get from the internet.
msg107488 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-10 21:14
> 2. Create backups for personal use.

What? Really? I'm sure that all Windows users will use ZIP or maybe RAR, but never the geek choice.

> 1. Download tar archives from a webpage (when no zip is supplied) for viewing or extracting.

Tarballs come from the UNIX/BSD world, which has used UTF-8 by default for some years now.

> 3. Create source archives from a project for unix users who hate zipfiles.

In this case, UTF-8 is also better.

--

Did I mention that 7-zip is only able to create TAR archives? I mean uncompressed archives. Who will use that? (not me ;-))

WinRAR is unable to create tarballs, or even (uncompressed) .tar archives.

--

If the maintainer of the tarfile module agrees that UTF-8 is the best choice, I will commit my initial patch. I would prefer to commit tarfile_windows_utf8.patch because it changes 4 lines, whereas tarfile_mbcs_errors.patch changes... much more code :-)

tarfile_windows_utf8.patch is not complete: the documentation should also be updated:

.. data:: ENCODING

   The default character encoding i.e. the value from either
   :func:`sys.getfilesystemencoding` or :func:`sys.getdefaultencoding`.

=>

.. data:: ENCODING

   The default character encoding: ``'utf-8'`` on Windows,
   :func:`sys.getfilesystemencoding` otherwise.
msg107491 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-10 21:20
Updated version of the utf-8 patch:
 - Use also UTF-8 for Windows CE
 - Update the documentation
 - Prepare the NEWS entry
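
The platform-dependent default described by the documentation change might be computed along these lines. This is a sketch, not the committed code: the name `DEFAULT_ENCODING` is a stand-in, and `"ce"` is assumed to be the `os.name` value reported on Windows CE.

```python
import os
import sys

# Assumption: os.name is "nt" on Windows and "ce" on Windows CE.
if os.name in ("nt", "ce"):
    DEFAULT_ENCODING = "utf-8"
else:
    DEFAULT_ENCODING = sys.getfilesystemencoding()

print(DEFAULT_ENCODING)
```

Callers who need names that 7-zip or WinRAR can display may still pass an explicit code page through the `encoding` argument of `tarfile.open()`.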
msg107492 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-06-10 21:24
FWIW, I agree with Lars: the main use of tar files under Windows is when they come from other systems. Windows users almost never generate tar files by themselves; they will generate zip, rar or 7z files instead.
msg107609 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-06-11 23:49
OK. I committed the patch to set the default encoding to utf-8 on Windows: r81925.
History
Date User Action Args
2022-04-11 14:57:01 admin set github: 53030
2010-06-11 23:49:47 vstinner set status: open -> closed; resolution: fixed
2010-06-11 23:49:39 vstinner set messages: + msg107609
2010-06-10 21:24:05 pitrou set nosy: + pitrou; messages: + msg107492
2010-06-10 21:20:37 vstinner set files: + tarfile_windows_utf8-2.patch; messages: + msg107491
2010-06-10 21:14:10 vstinner set messages: + msg107488
2010-06-10 18:51:58 lars.gustaebel set messages: + msg107469
2010-06-10 18:40:56 loewis set messages: + msg107468
2010-06-10 17:27:25 lemburg set messages: + msg107467
2010-06-10 17:19:02 vstinner set files: + tarfile_mbcs_errors.patch; messages: + msg107466
2010-06-10 11:47:38 vstinner set messages: + msg107455
2010-06-09 23:21:35 lemburg set messages: + msg107440
2010-06-09 23:09:08 lemburg set nosy: + lemburg; messages: + msg107438
2010-06-09 22:57:02 vstinner set nosy: + loewis; messages: + msg107435
2010-05-30 12:31:54 lars.gustaebel set messages: + msg106758
2010-05-22 01:22:13 vstinner create