classification
Title: sys.argv is wrong for unicode strings
Type: behavior Stage:
Components: Interpreter Core, Windows Versions: Python 3.0, Python 2.7, Python 2.6, Python 2.5
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, christian.heimes, davidsarah, giovannibajo, haypo, loewis, mherrmann.at
Priority: high Keywords: patch

Created on 2008-02-16 16:27 by giovannibajo, last changed 2013-01-14 09:15 by haypo. This issue is now closed.

Files
File name Uploaded Description
argv_unicode.patch giovannibajo, 2008-02-17 18:57
wchar.diff loewis, 2008-03-10 14:40
Messages (15)
msg62458 - (view) Author: Giovanni Bajo (giovannibajo) Date: 2008-02-16 16:27
Under Windows, sys.argv is created through the Windows ANSI API.

When a file or directory name can't be represented in the system
encoding (e.g. a Japanese-named file or directory on a Western
Windows), Windows encodes the filename to the system encoding using
what we call the "replace" policy, so sys.argv[] will contain an
entry like "c:\\foo\\??????????????.dat".
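
The lossy conversion described above can be imitated in Python. This is only a sketch: cp1252 stands in for the ANSI code page of a Western Windows, the helper name to_ansi is hypothetical, and Python's "replace" error handler is used as an approximation of what the Windows ANSI API does (the real conversion may also apply best-fit mappings).

```python
# Illustration only: emulate the lossy ANSI conversion with the "replace" handler.
# cp1252 is an assumed stand-in for a Western Windows ANSI code page.
name = "c:\\foo\\\u65e5\u672c\u8a9e.dat"  # a file with a three-character Japanese name


def to_ansi(path, codepage="cp1252"):
    """Encode to the code page and back; unencodable characters become '?'."""
    return path.encode(codepage, "replace").decode(codepage)


lossy = to_ansi(name)
print(lossy)  # c:\foo\???.dat
```

Once the name has been flattened to question marks, the original characters cannot be recovered, which is why the argument has to be read through the wide API in the first place.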

My suggestion is that:

* At the Python level, we still expose a single sys.argv[], which will 
contain unicode strings. I think this exactly matches what Py3k does now. 

* At the C level, I believe it involves using GetCommandLineW() and 
CommandLineToArgvW() in WinMain.c, but should Py_Main/PySys_SetArgv() be 
changed to also accept wchar_t** arguments? Or is it better to allow for 
NULL to be passed (under Windows at least), so that the Windows
code-path in there can use GetCommandLineW()/CommandLineToArgvW() to get
the current process' arguments?
msg62460 - (view) Author: Christian Heimes (christian.heimes) * (Python committer) Date: 2008-02-16 16:54
The issue is related to #1342

Since we have dropped support for older versions of Windows (9x, ME,
NT4), I'd like to get the Python interface to argv, env and files fixed.
msg62499 - (view) Author: Giovanni Bajo (giovannibajo) Date: 2008-02-17 18:57
I'm attaching a simple patch that seems to work under Py3k. The trick is
that Py3k already attempts (not sure how or why) to decode argv using
UTF-8, so it's sufficient to set up argv as UTF-8-encoded strings.

Notice that this changes the output of "python ààààà" from this:

Fatal Python error: no mem for sys.argv
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
invalid data

to this:

TypeError: zipimporter() argument 1 must be string without null bytes,
not str

which is expected since zipimporter_init() doesn't even know to ignore
unicode strings (let alone handle them correctly...).
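
The decoding failure and the effect of supplying UTF-8 bytes can be reproduced in isolation. This is a minimal sketch, assuming cp1252 as the ANSI code page; it mimics only the byte-level behaviour and does not touch the interpreter's real argv handling.

```python
arg = "\u00e0" * 5  # the "ààààà" argument from the example above

# What an ANSI argv would carry on a Western system (assumed code page: cp1252).
ansi_bytes = arg.encode("cp1252")
try:
    ansi_bytes.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    # 0xE0 starts a multi-byte UTF-8 sequence but is not followed by
    # continuation bytes, so strict decoding fails.
    decode_failed = True

# If argv is supplied as UTF-8 bytes instead, the same decode step succeeds.
utf8_bytes = arg.encode("utf-8")
roundtrip = utf8_bytes.decode("utf-8")

print(decode_failed, roundtrip == arg)
```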
msg62659 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-02-21 20:50
I dislike the double decoding, and would prefer if sys.argv would be
created directly from the wide command line.

In addition, I think the patch is incorrect: it ignores the arguments to
Py_Main, which is a documented API function.

One solution might be to declare all these functions (Py_Main,
SetProgramName, GetArgcArgv) to operate on Py_UNICODE*, and then
convert the POSIX callers of Py_Main to use mbstowcs when going
from the command line to Py_Main. WinMain could then be
recompiled for Unicode directly, as could Modules/python.c.
msg62660 - (view) Author: Giovanni Bajo (giovannibajo) Date: 2008-02-21 21:33
mbstowcs uses LC_CTYPE. Is that correct and consistent with the way
default encoding under UNIX is handled by Py3k?

Would a Py_MainW or similar wrapper be easier on the UNIX guys? I'm just
asking, I don't have a definite idea.
msg62664 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-02-21 22:01
> mbstowcs uses LC_CTYPE. Is that correct and consistent with the way
> default encoding under UNIX is handled by Py3k?

It's correct, but it's not consistent with the default encoding - there
isn't really any default encoding in Py3k. More specifically,
PyUnicode_FromString uses UTF-8, but not as a (changeable) default,
but as part of its API specification.
Command line arguments are in the locale's charset, so the LC_CTYPE
must be used to convert them.

> Would a Py_MainW or similar wrapper be easier on the UNIX guys? I'm just
> asking, I don't have a definite idea.

See above. The current POSIX implementation is also incorrect: it should
use the locale's encoding, but doesn't.
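
Why the locale's charset matters can be shown with a small sketch. The helper decode_argv is hypothetical and works on byte strings passed in by hand; it is a Python analogue of the conversion under discussion, not the C code path that calls mbstowcs.

```python
import locale


def decode_argv(raw_args, encoding=None):
    """Decode raw command-line bytes using the locale's charset by default."""
    if encoding is None:
        # Ask for the locale's preferred encoding without changing the locale.
        encoding = locale.getpreferredencoding(False)
    return [raw.decode(encoding) for raw in raw_args]


# The same bytes mean different text depending on the charset in effect.
raw = [b"prog", b"caf\xc3\xa9"]
as_utf8 = decode_argv(raw, "utf-8")      # ['prog', 'café']
as_latin1 = decode_argv(raw, "latin-1")  # ['prog', 'cafÃ©']
```

Decoding with a fixed encoding such as UTF-8 is therefore only right when the locale happens to use that encoding.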
msg63443 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-03-10 14:40
Here is a patch that redoes the entire argv handling, in terms of
wchar_t. As a side effect, it also changes the sys.path handling to use
wchar_t.
msg65005 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-05 20:42
This is now fixed in r62178 for Py3k. For 2.6, I don't think fixing it
is feasible.
msg65045 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-06 16:50
MvL's recent commit creates compiler warnings for Unicode UCS4 for the
same reason as #2388.
msg65061 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-04-07 03:27
What warnings precisely are you seeing? I didn't see anything in the 3k
branch (not even for #2388, as PyErr_Format doesn't have the GCC format
attribute in 3k, unlike 2.x).
msg65073 - (view) Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2008-04-07 11:54
Martin, you are right that they are not for the same reason as that issue.

gcc -c -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk/ 
-fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes 
-I. -IInclude -I./Include   -DPy_BUILD_CORE -o Modules/main.o Modules/main.c
Modules/main.c: In function 'Py_Main':
Modules/main.c:478: warning: passing argument 1 of 'Py_SetProgramName'
from incompatible pointer type
Modules/main.c: In function 'Py_Main':
Modules/main.c:478: warning: passing argument 1 of 'Py_SetProgramName'
from incompatible pointer type
msg125827 - (view) Author: David-Sarah Hopwood (davidsarah) Date: 2011-01-09 07:36
The following code is being used to work around this issue for Python 2.x in Tahoe-LAFS:

    # This works around <http://bugs.python.org/issue2128>.
    GetCommandLineW = WINFUNCTYPE(LPWSTR)(("GetCommandLineW", windll.kernel32))
    CommandLineToArgvW = WINFUNCTYPE(POINTER(LPWSTR), LPCWSTR, POINTER(c_int)) \
                            (("CommandLineToArgvW", windll.shell32))

    argc = c_int(0)
    argv_unicode = CommandLineToArgvW(GetCommandLineW(), byref(argc))

    argv = [argv_unicode[i].encode('utf-8') for i in range(0, argc.value)]

    if not hasattr(sys, 'frozen'):
        # If this is an executable produced by py2exe or bbfreeze, then it will
        # have been invoked directly. Otherwise, argv_unicode[0] is the Python
        # interpreter, so skip that.
        argv = argv[1:]

        # Also skip option arguments to the Python interpreter.
        while len(argv) > 0:
            arg = argv[0]
            if not arg.startswith("-") or arg == "-":
                break
            argv = argv[1:]
            if arg == '-m':
                # sys.argv[0] should really be the absolute path of the module source,
                # but never mind
                break
            if arg == '-c':
                argv[0] = '-c'
                break
msg125829 - (view) Author: David-Sarah Hopwood (davidsarah) Date: 2011-01-09 07:39
Sorry, missed out the imports:

    import sys
    from ctypes import WINFUNCTYPE, windll, POINTER, byref, c_int
    from ctypes.wintypes import LPWSTR, LPCWSTR
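
The option-skipping part of the workaround does not depend on Windows and can be looked at on its own. The following is a sketch that restates that loop as a function; the name strip_interpreter_args and the frozen parameter are hypothetical, introduced here only so the logic can be exercised on any platform.

```python
def strip_interpreter_args(argv, frozen=False):
    """Return argv without the interpreter and its leading options.

    Mirrors the loop in the workaround above; 'frozen' stands in for
    hasattr(sys, 'frozen').
    """
    argv = list(argv)
    if frozen:
        # A frozen executable is invoked directly, so nothing is skipped.
        return argv
    argv = argv[1:]  # drop the interpreter itself
    while argv:
        arg = argv[0]
        if not arg.startswith("-") or arg == "-":
            break
        argv = argv[1:]
        if arg == "-m":
            # The module name is left in place of the module path.
            break
        if arg == "-c":
            # The command string is replaced by '-c', as in sys.argv.
            argv[0] = "-c"
            break
    return argv


print(strip_interpreter_args(["python", "-u", "script.py", "a"]))
# ['script.py', 'a']
```

One limitation of this approach, as written, is that interpreter options taking a separate value are not recognised: `["python", "-W", "ignore", "script.py"]` yields `["ignore", "script.py"]`.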
msg179892 - (view) Author: Michael Herrmann (mherrmann.at) Date: 2013-01-13 20:23
Hi,

Is it correct that this bug no longer appears in Python 2.7.3? I checked the changelogs of 2.7, but couldn't find anything.

Thanks!
Michael
msg179928 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2013-01-14 09:15
> is it correct that this bug no longer appears in Python 2.7.3?

Martin wrote that it cannot be fixed in Python 2: "For 2.6, I don't think fixing it is feasible."

The "fix" is to upgrade your application to Python 3.
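
A quick way to check the Python 3 behaviour is to pass a non-ASCII argument to a child interpreter and compare what it receives. This is a sketch for Python 3.7 or later; the helper name argument_seen_by_child is hypothetical, and on POSIX it assumes the environment can encode the argument (for example a UTF-8 locale).

```python
import ast
import subprocess
import sys


def argument_seen_by_child(arg):
    """Run a child interpreter and return the sys.argv[1] it received."""
    # The child prints an ASCII-only repr so the console encoding is irrelevant.
    code = "import sys; print(ascii(sys.argv[1]))"
    result = subprocess.run(
        [sys.executable, "-c", code, arg],
        capture_output=True,
        text=True,
        check=True,
    )
    return ast.literal_eval(result.stdout.strip())


if __name__ == "__main__":
    name = "\u65e5\u672c\u8a9e.dat"
    print(argument_seen_by_child(name) == name)
```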
History
Date User Action Args
2013-01-14 09:15:31 haypo set messages: + msg179928
2013-01-13 20:23:17 mherrmann.at set nosy: + mherrmann.at; messages: + msg179892
2011-01-14 22:18:04 haypo set nosy: + haypo
2011-01-09 07:39:42 davidsarah set nosy: loewis, christian.heimes, giovannibajo, benjamin.peterson, davidsarah; messages: + msg125829
2011-01-09 07:36:51 davidsarah set nosy: + davidsarah; messages: + msg125827; versions: + Python 2.6, Python 2.5, Python 2.7
2008-04-07 11:54:38 benjamin.peterson set messages: + msg65073
2008-04-07 03:27:27 loewis set messages: + msg65061
2008-04-06 16:50:36 benjamin.peterson set nosy: + benjamin.peterson; messages: + msg65045
2008-04-05 20:42:42 loewis set status: open -> closed; messages: + msg65005; resolution: fixed; versions: - Python 2.6
2008-03-10 14:40:50 loewis set files: + wchar.diff; keywords: + patch; messages: + msg63443
2008-02-21 22:01:33 loewis set messages: + msg62664
2008-02-21 21:33:17 giovannibajo set messages: + msg62660
2008-02-21 20:50:58 loewis set nosy: + loewis; messages: + msg62659
2008-02-17 18:58:00 giovannibajo set files: + argv_unicode.patch; messages: + msg62499
2008-02-16 16:54:06 christian.heimes set priority: high; nosy: + christian.heimes; messages: + msg62460; components: + Windows; versions: + Python 2.6
2008-02-16 16:27:45 giovannibajo create