classification
Title: test_tarfile fails on cygwin (unicode decode error)
Type: behavior Stage: resolved
Components: Versions: Python 3.0
process
Status: closed Resolution: out of date
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, loewis, ocean-city, rpetrov
Priority: normal Keywords: patch

Created on 2008-09-09 20:42 by ocean-city, last changed 2016-01-13 17:39 by ezio.melotti. This issue is now closed.

Messages (15)
msg72907 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-09-09 20:42
I noticed test_tarfile on py3k fails like this.

======================================================================
ERROR: test_directory_size (__main__.WriteTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_tarfile.py", line 598, in test_directory_size
    tarinfo = tar.gettarinfo(path)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1869, in
gettari
nfo
    tarinfo.gname = grp.getgrgid(tarinfo.gid)[0]
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 0:
unexpecte
d code byte

======================================

And I noticed PyUnicode_FromString supposes input as UTF-8, but actually
member of struct grp is MBCS or CP932 on cygwin.

After patched following workaround, test passed. I don't know how to fix
this... Does python have system encoding or something?
(I experienced similar error on test_grp and test_pwd)

Index: Modules/grpmodule.c
===================================================================
--- Modules/grpmodule.c	(revision 66345)
+++ Modules/grpmodule.c	(working copy)
@@ -32,6 +32,8 @@
 static int initialized;
 static PyTypeObject StructGrpType;
 
+#define PyUnicode_FromString(s) PyUnicode_DecodeMBCS(s, strlen(s),
"strict")
+
 static PyObject *
 mkgrent(struct group *p)
 {
@@ -83,6 +85,8 @@
     return v;
 }
 
+#undef PyUnicode_FromString
+
 static PyObject *
 grp_getgrgid(PyObject *self, PyObject *pyo_id)
 {
msg72918 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-09 21:21
I think you should use the locale's encoding to process the data, ie.
either mbstowcs, then Unicode from wchar_t, or decode with the
nl_langinfo(CODESET) encoding. You might have to set the locale before
this can work (which isn't thread-safe), so it might be tricky to implement.

Python already does nl_langinfo at startup, but then restores the
locale. It should probably save the default locale's codeset somewhere,
as C code requires it in many places.

There is also a "system" encoding, but that is UTF-8 independent of the
system.
msg72957 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-09-10 12:44
Sorry, probably I saw illusion... If uses cp932 codec, still
test_tarfile.py reports error. :-(

======================================================================
ERROR: test_tar_size (__main__.WriteTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_tarfile.py", line 570, in test_tar_size
    tar.add(path)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1953, in add
    self.addfile(tarinfo, f)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1976, in
addfile

    buf = tarinfo.tobuf(self.format, self.encoding, self.errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 987, in
tobuf
    return self.create_gnu_header(info, encoding, errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1018, in
create_
gnu_header
    return buf + self._create_header(info, GNU_FORMAT, encoding, errors)
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1107, in
_create
_header
    stn(info.get("gname", "root"), 32, encoding, errors),
  File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 177, in stn
    s = s.encode(encoding, errors)
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-3: ordin
al not in range(128)
msg72959 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-10 12:57
Is PyUnicode_DecodeMBCS available on cygwin?
I get compilation errors when I try your patch.
msg72960 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-09-10 13:10
Yes, when I did it last night, I thought I could compile it and saw OK
on test_tarfile.py, but probably I dreamed. :-(

#define PyUnicode_FromString(s) PyUnicode_Decode(s, strlen(s), "cp932",
"strict")

or following patch should work.
msg73545 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-09-22 00:29
Sorry, the patch didn't work... I didn't understand Martin's word. And
nl_langinfo(CODESET) is useless on cygwin because it's always US-ASCII.
msg73556 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-22 05:16
I didn't mean to suggest that a new codec is created; instead, mbstowcs
should be called directly in grpmodule.c.

By default, mbstowcs will use ASCII, so it is likely to fail - you would
need to call setlocale first.
msg73580 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-09-22 14:48
I'm not cygwin user, but cygwin seems not to support multibyte function.
Following program outputs 5 on VC6 as expected, but 10 on cygwin. 
Hmm...

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>

int main(int argc, char* argv[])
{
	const char s[] = "あいうえお";
	size_t len;
	wchar_t *buf;

	setlocale(LC_ALL, "");

	len = strlen(s); /* 10 */

	buf = (wchar_t*)malloc((len+1)*sizeof(wchar_t));

	len = mbstowcs(buf, s, len+1);

	printf("--------> %d\n", len); /* should be 5 */

	return 0;
}
msg73598 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-22 21:17
In this case, I think there is nothing we can do. Perhaps it is useful
to put a comment into the test, pointing out that this is likely to
break on Cygwin, and refer to this issue.

I don't see that as a problem: it's just a test that fails, and only on
some systems (i.e. when you have non-ASCII characters in the group
file). People running into the problem should first resolve the
underlying problem in Cygwin, and, when Cygwin actually works correctly,
come back to fixing this issue.
msg73658 - (view) Author: Roumen Petrov (rpetrov) * Date: 2008-09-23 19:34
What is test result if the environment variable LANG is set to C ?
msg73681 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-09-24 00:04
> What is test result if the environment variable LANG is set to C ?

There is no change.
msg73707 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-09-24 12:00
Doesn't getgrgid() return the untranslated content of /etc/group?
Then the encoding of this file is relevant.

On cygwin, "mkgroup -l" is often (exclusively?) used to generate this
/etc/group, extracting the user definitions from the Windows settings.
Its source code
http://www.google.com/codesearch?q=file:mkgroup.c+package:cygwin-1.5
shows that it uses MBCS to encode strings.

Maybe we should start considering cygwin as a posix platform with win32
features; but I am not sure to like this idea.
msg73731 - (view) Author: Hirokazu Yamamoto (ocean-city) * (Python committer) Date: 2008-09-24 16:45
>Doesn't getgrgid() return the untranslated content of /etc/group?
>Then the encoding of this file is relevant.

Yes, /etc/group contains "なし" as gr_name in MBCS,("なし" means
"nothing")and I can print it with puts() in grpmodule.c, so it
shouldn't be translated.

>Maybe we should start considering cygwin as a posix platform with win32
>features; but I am not sure to like this idea.

Me neigher.
msg73748 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-09-24 18:42
> Doesn't getgrgid() return the untranslated content of /etc/group?
> Then the encoding of this file is relevant.

That certainly depends on the implementation of getgrgid. On some
systems, it uses NIS, LDAP, or a relational database in addition to
or instead of /etc/group.

I don't think POSIX specifies the charset of gr_name, except perhaps
implicitly by inheriting the notion of "execution character set" from
C, which proposes that all char* should have the same interpretation.
In POSIX, the character set comes from the locale.
The getgrid man page gives an example demonstrating that you are
supposed to be able to printf() gr_name using %s, which suggests that
it should have the locale's encoding.

In Cygwin, I have no doubt that the implementation literally copies
the encoding untranslated. I would claim this to be a bug in Cygwin.
A quality POSIX implementation would convert the /etc/group encoding
to the locale's encoding (just as it should when fetching the data
from LDAP).

> Maybe we should start considering cygwin as a posix platform with win32
> features; but I am not sure to like this idea.

If it is desired that we support this specific implementation aspect, we
can certainly special-case Cygwin in getgrgid. However, I would prefer
if this issue was discussed with the Cygwin people first, suggesting
that it might be a Cygwin bug (one that it certainly shares with many
other POSIX implementations - people just don't use non-ASCII characters
in /etc/group).

If Cygwin ever changes its implementation in that respect, we would need
to have a way for telling which specific implementation is being used.
msg148001 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2011-11-20 16:37
grp.getgrgid() now calls .decode('utf8', errors="surrogateescape").
Even if cygwin does not correctly copy strings from the Windows registry, tarinfo.gname should now contain a string that will at least round trip and give the same value on disk.
Ocean-city, can you check whether this error still reproduces?
History
Date User Action Args
2016-01-13 17:39:53ezio.melottisetstatus: pending -> closed
type: behavior
resolution: out of date
stage: resolved
2014-10-02 16:57:34serhiy.storchakasetstatus: open -> pending
2011-11-20 16:37:50amaury.forgeotdarcsetmessages: + msg148001
2008-09-24 18:42:05loewissetmessages: + msg73748
2008-09-24 16:45:10ocean-citysetmessages: + msg73731
2008-09-24 12:00:46amaury.forgeotdarcsetmessages: + msg73707
2008-09-24 00:04:49ocean-citysetmessages: + msg73681
2008-09-23 19:34:00rpetrovsetnosy: + rpetrov
messages: + msg73658
2008-09-22 21:17:15loewissetmessages: + msg73598
2008-09-22 14:48:13ocean-citysetmessages: + msg73580
2008-09-22 12:24:46ocean-citysetfiles: - experimental_mbcstowcs_codec.patch
2008-09-22 05:16:01loewissetmessages: + msg73556
2008-09-22 00:29:40ocean-citysetmessages: + msg73545
2008-09-10 13:10:30ocean-citysetfiles: + experimental_mbcstowcs_codec.patch
keywords: + patch
messages: + msg72960
2008-09-10 12:57:07amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg72959
2008-09-10 12:44:32ocean-citysetmessages: + msg72957
2008-09-09 21:21:37loewissetnosy: + loewis
messages: + msg72918
2008-09-09 20:42:08ocean-citycreate