Issue 3824: test_tarfile fails on cygwin (unicode decode error)

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/48074

classification

Title:	test_tarfile fails on cygwin (unicode decode error)
Type:	behavior	Stage:	resolved
Components:		Versions:	Python 3.0

process

Status:	closed	Resolution:	out of date
Dependencies:		Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, loewis, ocean-city, rpetrov
Priority:	normal	Keywords:	patch

Created on 2008-09-09 20:42 by ocean-city, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (15)
msg72907 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2008-09-09 20:42
I noticed test_tarfile on py3k fails like this. ====================================================================== ERROR: test_directory_size (__main__.WriteTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_tarfile.py", line 598, in test_directory_size tarinfo = tar.gettarinfo(path) File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1869, in gettari nfo tarinfo.gname = grp.getgrgid(tarinfo.gid)[0] UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 0: unexpecte d code byte ====================================== And I noticed PyUnicode_FromString supposes input as UTF-8, but actually member of struct grp is MBCS or CP932 on cygwin. After patched following workaround, test passed. I don't know how to fix this... Does python have system encoding or something? (I experienced similar error on test_grp and test_pwd) Index: Modules/grpmodule.c =================================================================== --- Modules/grpmodule.c (revision 66345) +++ Modules/grpmodule.c (working copy) @@ -32,6 +32,8 @@ static int initialized; static PyTypeObject StructGrpType; +#define PyUnicode_FromString(s) PyUnicode_DecodeMBCS(s, strlen(s), "strict") + static PyObject * mkgrent(struct group p) { @@ -83,6 +85,8 @@ return v; } +#undef PyUnicode_FromString + static PyObject grp_getgrgid(PyObject self, PyObject pyo_id) {
msg72918 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-09-09 21:21
I think you should use the locale's encoding to process the data, ie. either mbstowcs, then Unicode from wchar_t, or decode with the nl_langinfo(CODESET) encoding. You might have to set the locale before this can work (which isn't thread-safe), so it might be tricky to implement. Python already does nl_langinfo at startup, but then restores the locale. It should probably save the default locale's codeset somewhere, as C code requires it in many places. There is also a "system" encoding, but that is UTF-8 independent of the system.
msg72957 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2008-09-10 12:44
Sorry, probably I saw illusion... If uses cp932 codec, still test_tarfile.py reports error. :-( ====================================================================== ERROR: test_tar_size (__main__.WriteTest) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_tarfile.py", line 570, in test_tar_size tar.add(path) File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1953, in add self.addfile(tarinfo, f) File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1976, in addfile buf = tarinfo.tobuf(self.format, self.encoding, self.errors) File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 987, in tobuf return self.create_gnu_header(info, encoding, errors) File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1018, in create_ gnu_header return buf + self._create_header(info, GNU_FORMAT, encoding, errors) File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 1107, in _create _header stn(info.get("gname", "root"), 32, encoding, errors), File "/home/WhiteRabbit/python-dev/py3k/Lib/tarfile.py", line 177, in stn s = s.encode(encoding, errors) UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordin al not in range(128)
msg72959 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-09-10 12:57
Is PyUnicode_DecodeMBCS available on cygwin? I get compilation errors when I try your patch.
msg72960 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2008-09-10 13:10
Yes, when I did it last night, I thought I could compile it and saw OK on test_tarfile.py, but probably I dreamed. :-( #define PyUnicode_FromString(s) PyUnicode_Decode(s, strlen(s), "cp932", "strict") or following patch should work.
msg73545 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2008-09-22 00:29
Sorry, the patch didn't work... I didn't understand Martin's word. And nl_langinfo(CODESET) is useless on cygwin because it's always US-ASCII.
msg73556 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-09-22 05:16
I didn't mean to suggest that a new codec is created; instead, mbstowcs should be called directly in grpmodule.c. By default, mbstowcs will use ASCII, so it is likely to fail - you would need to call setlocale first.
msg73580 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2008-09-22 14:48
I'm not cygwin user, but cygwin seems not to support multibyte function. Following program outputs 5 on VC6 as expected, but 10 on cygwin. Hmm... #include <stdio.h> #include <stdlib.h> #include <locale.h> int main(int argc, char* argv[]) { const char s[] = "あいうえお"; size_t len; wchar_t buf; setlocale(LC_ALL, ""); len = strlen(s); / 10 / buf = (wchar_t)malloc((len+1)sizeof(wchar_t)); len = mbstowcs(buf, s, len+1); printf("--------> %d\n", len); / should be 5 */ return 0; }
msg73598 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-09-22 21:17
In this case, I think there is nothing we can do. Perhaps it is useful to put a comment into the test, pointing out that this is likely to break on Cygwin, and refer to this issue. I don't see that as a problem: it's just a test that fails, and only on some systems (i.e. when you have non-ASCII characters in the group file). People running into the problem should first resolve the underlying problem in Cygwin, and, when Cygwin actually works correctly, come back to fixing this issue.
msg73658 - (view)	Author: Roumen Petrov (rpetrov) *	Date: 2008-09-23 19:34
What is test result if the environment variable LANG is set to C ?
msg73681 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2008-09-24 00:04
> What is test result if the environment variable LANG is set to C ? There is no change.
msg73707 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-09-24 12:00
Doesn't getgrgid() return the untranslated content of /etc/group? Then the encoding of this file is relevant. On cygwin, "mkgroup -l" is often (exclusively?) used to generate this /etc/group, extracting the user definitions from the Windows settings. Its source code http://www.google.com/codesearch?q=file:mkgroup.c+package:cygwin-1.5 shows that it uses MBCS to encode strings. Maybe we should start considering cygwin as a posix platform with win32 features; but I am not sure to like this idea.
msg73731 - (view)	Author: Hirokazu Yamamoto (ocean-city) *	Date: 2008-09-24 16:45
>Doesn't getgrgid() return the untranslated content of /etc/group? >Then the encoding of this file is relevant. Yes, /etc/group contains "なし" as gr_name in MBCS,（"なし" means "nothing"）and I can print it with puts() in grpmodule.c, so it shouldn't be translated. >Maybe we should start considering cygwin as a posix platform with win32 >features; but I am not sure to like this idea. Me neigher.
msg73748 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2008-09-24 18:42
> Doesn't getgrgid() return the untranslated content of /etc/group? > Then the encoding of this file is relevant. That certainly depends on the implementation of getgrgid. On some systems, it uses NIS, LDAP, or a relational database in addition to or instead of /etc/group. I don't think POSIX specifies the charset of gr_name, except perhaps implicitly by inheriting the notion of "execution character set" from C, which proposes that all char* should have the same interpretation. In POSIX, the character set comes from the locale. The getgrid man page gives an example demonstrating that you are supposed to be able to printf() gr_name using %s, which suggests that it should have the locale's encoding. In Cygwin, I have no doubt that the implementation literally copies the encoding untranslated. I would claim this to be a bug in Cygwin. A quality POSIX implementation would convert the /etc/group encoding to the locale's encoding (just as it should when fetching the data from LDAP). > Maybe we should start considering cygwin as a posix platform with win32 > features; but I am not sure to like this idea. If it is desired that we support this specific implementation aspect, we can certainly special-case Cygwin in getgrgid. However, I would prefer if this issue was discussed with the Cygwin people first, suggesting that it might be a Cygwin bug (one that it certainly shares with many other POSIX implementations - people just don't use non-ASCII characters in /etc/group). If Cygwin ever changes its implementation in that respect, we would need to have a way for telling which specific implementation is being used.
msg148001 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2011-11-20 16:37
grp.getgrgid() now calls .decode('utf8', errors="surrogateescape"). Even if cygwin does not correctly copy strings from the Windows registry, tarinfo.gname should now contain a string that will at least round trip and give the same value on disk. Ocean-city, can you check whether this error still reproduces?

History
Date	User	Action	Args
2022-04-11 14:56:38	admin	set	github: 48074
2016-01-13 17:39:53	ezio.melotti	set	status: pending -> closed type: behavior resolution: out of date stage: resolved
2014-10-02 16:57:34	serhiy.storchaka	set	status: open -> pending
2011-11-20 16:37:50	amaury.forgeotdarc	set	messages: + msg148001
2008-09-24 18:42:05	loewis	set	messages: + msg73748
2008-09-24 16:45:10	ocean-city	set	messages: + msg73731
2008-09-24 12:00:46	amaury.forgeotdarc	set	messages: + msg73707
2008-09-24 00:04:49	ocean-city	set	messages: + msg73681
2008-09-23 19:34:00	rpetrov	set	nosy: + rpetrov messages: + msg73658
2008-09-22 21:17:15	loewis	set	messages: + msg73598
2008-09-22 14:48:13	ocean-city	set	messages: + msg73580
2008-09-22 12:24:46	ocean-city	set	files: - experimental_mbcstowcs_codec.patch
2008-09-22 05:16:01	loewis	set	messages: + msg73556
2008-09-22 00:29:40	ocean-city	set	messages: + msg73545
2008-09-10 13:10:30	ocean-city	set	files: + experimental_mbcstowcs_codec.patch keywords: + patch messages: + msg72960
2008-09-10 12:57:07	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg72959
2008-09-10 12:44:32	ocean-city	set	messages: + msg72957
2008-09-09 21:21:37	loewis	set	nosy: + loewis messages: + msg72918
2008-09-09 20:42:08	ocean-city	create