sys.argv is wrong for unicode strings #46381

Closed
GiovanniBajo mannequin opened this issue Feb 16, 2008 · 15 comments
Labels
interpreter-core (Objects, Python, Grammar, and Parser dirs) OS-windows type-bug An unexpected behavior, bug, or error

Comments

@GiovanniBajo
Mannequin

GiovanniBajo mannequin commented Feb 16, 2008

BPO 2128
Nosy @loewis, @vstinner, @tiran, @benjaminp
Files
  • argv_unicode.patch
  • wchar.diff
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2008-04-05.20:42:42.184>
    created_at = <Date 2008-02-16.16:27:45.539>
    labels = ['interpreter-core', 'type-bug', 'OS-windows']
    title = 'sys.argv is wrong for unicode strings'
    updated_at = <Date 2013-01-14.09:15:31.925>
    user = 'https://bugs.python.org/giovannibajo'

    bugs.python.org fields:

    activity = <Date 2013-01-14.09:15:31.925>
    actor = 'vstinner'
    assignee = 'none'
    closed = True
    closed_date = <Date 2008-04-05.20:42:42.184>
    closer = 'loewis'
    components = ['Interpreter Core', 'Windows']
    creation = <Date 2008-02-16.16:27:45.539>
    creator = 'giovannibajo'
    dependencies = []
    files = ['9449', '9647']
    hgrepos = []
    issue_num = 2128
    keywords = ['patch']
    message_count = 15.0
    messages = ['62458', '62460', '62499', '62659', '62660', '62664', '63443', '65005', '65045', '65061', '65073', '125827', '125829', '179892', '179928']
    nosy_count = 7.0
    nosy_names = ['loewis', 'vstinner', 'christian.heimes', 'giovannibajo', 'benjamin.peterson', 'davidsarah', 'mherrmann.at']
    pr_nums = []
    priority = 'high'
    resolution = 'fixed'
    stage = None
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue2128'
    versions = ['Python 2.6', 'Python 2.5', 'Python 3.0', 'Python 2.7']

    @GiovanniBajo
    Mannequin Author

    GiovanniBajo mannequin commented Feb 16, 2008

    Under Windows, sys.argv is created through the Windows ANSI API.

    When you have a file/directory which can't be represented in the
    system encoding (e.g. a Japanese-named file or directory on a Western
    Windows), Windows will encode the filename to the system encoding using
    what we call the "replace" policy, and thus sys.argv[] will contain an
    entry like "c:\\foo\\??????????????.dat".
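The lossy conversion described above can be approximated in Python. This is only a sketch: it assumes cp1252 as the Western system code page, uses a made-up Japanese filename, and relies on Python's `errors="replace"` handler, which has a similar effect to the conversion Windows performs but is not the same code path.

```python
# Approximate the "replace" policy: characters that the (assumed) system
# code page cannot represent are turned into "?".
original = "c:\\foo\\日本語ファイル.dat"  # hypothetical filename

ansi_bytes = original.encode("cp1252", errors="replace")
lossy = ansi_bytes.decode("cp1252")

print(lossy)
# The original name cannot be recovered from the lossy form.
assert "?" in lossy and lossy != original
```

Once the name has gone through this step, no amount of decoding on the Python side can bring the original characters back, which is why the fix has to happen where argv is first obtained.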

    My suggestion is that:

    • At the Python level, we still expose a single sys.argv[], which will
      contain unicode strings. I think this exactly matches what Py3k does now.

    • At the C level, I believe it involves using GetCommandLineW() and
      CommandLineToArgvW() in WinMain.c, but should Py_Main/PySys_SetArgv() be
      changed to also accept wchar_t** arguments? Or is it better to allow for
      NULL to be passed (under Windows at least), so that the Windows
      code-path in there can use GetCommandLineW()/CommandLineToArgvW() to get
      the current process' arguments?

    @GiovanniBajo GiovanniBajo mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Feb 16, 2008
    @tiran
    Member

    tiran commented Feb 16, 2008

    The issue is related to bpo-1342

    Since we have dropped support for older versions of Windows (9x, ME,
    NT4), I would like to get the Python interface to argv, env and files fixed.

    @GiovanniBajo
    Mannequin Author

    GiovanniBajo mannequin commented Feb 17, 2008

    I'm attaching a simple patch that seems to work under Py3k. The trick is
    that Py3k already attempts (not sure how or why) to decode argv using
    utf-8. So it's sufficient to set up argv as UTF-8-encoded strings.

    Notice that this brings the output of "python ààààà" from this:

    Fatal Python error: no mem for sys.argv
    UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2:
    invalid data

    to this:

    TypeError: zipimporter() argument 1 must be string without null bytes,
    not str

    which is expected since zipimporter_init() doesn't even know to ignore
    unicode strings (let alone handle them correctly...).
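The first error message is what one would expect when argv bytes in an ANSI code page are decoded as UTF-8. A minimal sketch, assuming cp1252 is the ANSI code page in question (the thread does not say which one was in use):

```python
arg = "ààààà"

# Bytes as a cp1252 ANSI command line would carry them (assumed code page).
ansi_bytes = arg.encode("cp1252")

try:
    ansi_bytes.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    # 0xE0 starts a multi-byte UTF-8 sequence but is not followed by
    # valid continuation bytes, so decoding fails.
    decoded_ok = False

# If argv is set up as UTF-8 bytes instead, the same decode step round-trips.
round_tripped = arg.encode("utf-8").decode("utf-8")
```

This only mirrors the encode/decode mismatch; it does not reproduce the interpreter start-up path that printed the fatal error.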

    @loewis
    Mannequin

    loewis mannequin commented Feb 21, 2008

    I dislike the double decoding, and would prefer it if sys.argv were
    created directly from the wide command line.

    In addition, I think the patch is incorrect: it ignores the arguments to
    Py_Main, which is a documented API function.

    One solution might be to declare all these functions (Py_Main,
    SetProgramName, GetArgcArgv) to operate on Py_UNICODE*, and then
    convert the POSIX callers of Py_Main to use mbstowcs when going
    from the command line to Py_Main. WinMain could then be
    recompiled for Unicode directly, and likewise Modules/python.c.

    @GiovanniBajo
    Copy link
    Mannequin Author

    GiovanniBajo mannequin commented Feb 21, 2008

    mbstowcs uses LC_CTYPE. Is that correct and consistent with the way
    default encoding under UNIX is handled by Py3k?

    Would a Py_MainW or similar wrapper be easier on the UNIX guys? I'm just
    asking, I don't have a definite idea.

    @loewis
    Mannequin

    loewis mannequin commented Feb 21, 2008

    > mbstowcs uses LC_CTYPE. Is that correct and consistent with the way
    > default encoding under UNIX is handled by Py3k?

    It's correct, but it's not consistent with the default encoding - there
    isn't really any default encoding in Py3k. More specifically,
    PyUnicode_FromString uses UTF-8, but not as a (changeable) default,
    but as part of its API specification.
    Command line arguments are in the locale's charset, so the LC_CTYPE
    must be used to convert them.

    > Would a Py_MainW or similar wrapper be easier on the UNIX guys? I'm just
    > asking, I don't have a definite idea.

    See above. The current POSIX implementation is incorrect also. It should
    use the locale's encoding, but doesn't.
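To illustrate why the locale's charset matters, the sketch below decodes the same argument bytes with two encodings. `decode_arg` is a hypothetical helper, not a CPython function, and Latin-1 stands in for an arbitrary non-UTF-8 locale.

```python
import locale


def decode_arg(raw, encoding=None):
    """Decode command-line bytes with the given (or the locale's) encoding."""
    if encoding is None:
        # Preferred encoding of the current locale; at the C level the
        # thread proposes mbstowcs, which follows LC_CTYPE.
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


raw = "café".encode("latin-1")  # argument bytes typed under a Latin-1 locale

right = decode_arg(raw, "latin-1")  # matches the locale that produced the bytes
try:
    decode_arg(raw, "utf-8")  # hard-coding UTF-8 rejects these bytes
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

The same bytes are valid or invalid depending only on which charset the decoder assumes, which is the inconsistency pointed out for the POSIX implementation.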

    @loewis
    Mannequin

    loewis mannequin commented Mar 10, 2008

    Here is a patch that redoes the entire argv handling, in terms of
    wchar_t. As a side effect, it also changes the sys.path handling to use
    wchar_t.

    @loewis
    Mannequin

    loewis mannequin commented Apr 5, 2008

    This is now fixed in r62178 for Py3k. For 2.6, I don't think fixing it
    is feasible.

    @loewis loewis mannequin closed this as completed Apr 5, 2008
    @benjaminp
    Contributor

    MvL's recent commit creates compiler warnings for Unicode UCS4 for the
    same reason as bpo-2388.

    @loewis
    Mannequin

    loewis mannequin commented Apr 7, 2008

    What warnings precisely are you seeing? I didn't see anything in the 3k
    branch (not even for bpo-2388, as PyErr_Format doesn't have the GCC format
    attribute in 3k, unlike 2.x).

    @benjaminp
    Contributor

    Martin, you are right that they do not have the same cause as that issue.

    gcc -c -arch ppc -arch i386 -isysroot /Developer/SDKs/MacOSX10.4u.sdk/
    -fno-strict-aliasing -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes
    -I. -IInclude -I./Include -DPy_BUILD_CORE -o Modules/main.o Modules/main.c
    Modules/main.c: In function 'Py_Main':
    Modules/main.c:478: warning: passing argument 1 of 'Py_SetProgramName'
    from incompatible pointer type
    Modules/main.c: In function 'Py_Main':
    Modules/main.c:478: warning: passing argument 1 of 'Py_SetProgramName'
    from incompatible pointer type

    @davidsarah
    Mannequin

    davidsarah mannequin commented Jan 9, 2011

    The following code is being used to work around this issue for Python 2.x in Tahoe-LAFS:

        # This works around <http://bugs.python.org/issue2128>.
        GetCommandLineW = WINFUNCTYPE(LPWSTR)(("GetCommandLineW", windll.kernel32))
        CommandLineToArgvW = WINFUNCTYPE(POINTER(LPWSTR), LPCWSTR, POINTER(c_int)) \
                                (("CommandLineToArgvW", windll.shell32))
    
        argc = c_int(0)
        argv_unicode = CommandLineToArgvW(GetCommandLineW(), byref(argc))
    
        argv = [argv_unicode[i].encode('utf-8') for i in range(0, argc.value)]
    
        if not hasattr(sys, 'frozen'):
            # If this is an executable produced by py2exe or bbfreeze, then it will
            # have been invoked directly. Otherwise, argv_unicode[0] is the Python
            # interpreter, so skip that.
            argv = argv[1:]
    
            # Also skip option arguments to the Python interpreter.
            while len(argv) > 0:
                arg = argv[0]
                if not arg.startswith("-") or arg == "-":
                    break
                argv = argv[1:]
                if arg == '-m':
                    # sys.argv[0] should really be the absolute path of the module source,
                    # but never mind
                    break
                if arg == '-c':
                    argv[0] = '-c'
                    break
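The option-skipping part of the workaround above can be tried without Windows by factoring it into a plain function. This is a sketch only: `strip_interpreter_args` is a hypothetical name, and it operates on text strings rather than the UTF-8 byte strings the Python 2 code produces.

```python
def strip_interpreter_args(full_argv):
    """Drop the interpreter path and its options from a full command line."""
    argv = list(full_argv[1:])  # full_argv[0] is the interpreter itself
    while argv:
        arg = argv[0]
        if not arg.startswith("-") or arg == "-":
            break  # reached the script name (or "-" for stdin)
        argv = argv[1:]
        if arg == "-m":
            break  # what follows is the module name and its arguments
        if arg == "-c":
            argv[0] = "-c"  # the command string is replaced by "-c"
            break
    return argv


script_case = strip_interpreter_args(["python.exe", "-u", "script.py", "x"])
command_case = strip_interpreter_args(["python.exe", "-c", "print(1)", "a"])
```

Like the original, this treats every leading `-` argument as a flag without a value, so an interpreter option that takes a separate value (for example `-W ignore`) would leave that value at the front of the result.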

    @davidsarah
    Mannequin

    davidsarah mannequin commented Jan 9, 2011

    Sorry, missed out the imports:

        import sys
        from ctypes import WINFUNCTYPE, windll, POINTER, byref, c_int
        from ctypes.wintypes import LPWSTR, LPCWSTR

    @mherrmannat
    Mannequin

    mherrmannat mannequin commented Jan 13, 2013

    Hi,

    Is it correct that this bug no longer appears in Python 2.7.3? I checked the changelogs of 2.7, but couldn't find anything.

    Thanks!
    Michael

    @vstinner
    Member

    > is it correct that this bug no longer appears in Python 2.7.3?

    Martin wrote that it cannot be fixed in Python 2: "For 2.6, I don't think fixing it is feasible."

    The "fix" is to upgrade your application to Python 3.
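A quick way to check the Python 3 behaviour is to pass a non-ASCII argument to a child interpreter and compare what it receives. This is a sketch, assuming Python 3.7 or later (for `capture_output`) and a platform whose argument encoding can represent the characters; the argument itself is made up.

```python
import subprocess
import sys

arg = "日本語"  # hypothetical non-ASCII argument

# Ask a child interpreter to echo its first argument in ASCII-escaped form,
# so the check does not depend on the console's output encoding.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(ascii(sys.argv[1]))", arg],
    capture_output=True,
    text=True,
    check=True,
)
echoed = result.stdout.strip()
```

If `echoed` equals `ascii(arg)`, the argument reached `sys.argv` in the child without being replaced by question marks.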

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    abeir pushed a commit to abeir/depot_tools that referenced this issue Apr 24, 2024
    The fix_encoding module within depot_tools was included back in the python2[1] days as a be-all encoding-fix boilerplate that is called across depot_tools scripts.
    
    However, now that depot_tools has officially deprecated support for py2 and supports only >= 3.8[2], the boilerplate is not needed anymore.
    
    * `fix_win_codec()`[3] The 'cp65001' codec issue this fixes is fixed in python 3.3[4].
    * `fix_default_encoding()`[5] python3 defaults to utf8.
    * `fix_win_sys_argv()`[6] sys.argv unicode issue is fixed in python3[7].
    * `fix_win_console()`[8] Fixed[9].
    
    TODO: <Get performance changes in windows>.
    
    [1] https://codereview.chromium.org/6721029
    [2] https://crrev.com/371aa997c04791d21e222ed43a1a0d55b450dd53/README.md
    [3] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=123-132;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [4] python/cpython#57425 (comment)
    [5] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=29-66;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [6] https://crsrc.org/d/fix_encoding.py;l=73-120;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [7] python/cpython#46381 (comment)
    [8] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=315-344;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [9] python/cpython#45943 (comment)
    
    Bug: 1501984
    Change-Id: I1d512a4b1bfe14e680ac0aa08027849b999cc638
    abeir pushed a commit to abeir/depot_tools that referenced this issue Apr 24, 2024
    The fix_encoding module within depot_tools was included back in the python2[1] days as a be-all encoding-fix boilerplate that is called across depot_tools scripts.
    
    However, now that depot_tools has officially deprecated support for py2 and supports only >= 3.8[2], the boilerplate is not needed anymore.
    
    * `fix_win_codec()`[3] The 'cp65001' codec issue this fixes is fixed in python 3.3[4].
    * `fix_default_encoding()`[5] python3 defaults to utf8.
    * `fix_win_sys_argv()`[6] sys.argv unicode issue is fixed in python3[7].
    * `fix_win_console()`[8] Fixed[9].
    
    Benchmarking on windows:
    * Baseline (http://gpaste/6701096112750592): ~1min 41sec.
    
    [1] https://codereview.chromium.org/6721029
    [2] https://crrev.com/371aa997c04791d21e222ed43a1a0d55b450dd53/README.md
    [3] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=123-132;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [4] python/cpython#57425 (comment)
    [5] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=29-66;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [6] https://crsrc.org/d/fix_encoding.py;l=73-120;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [7] python/cpython#46381 (comment)
    [8] https://source.chromium.org/chromium/chromium/tools/depot_tools/+/main:fix_encoding.py;l=315-344;drc=cfa826c9845122d445dce4f51f556381865dbed3
    [9] python/cpython#45943 (comment)
    
    Bug: 1501984
    Change-Id: I1d512a4b1bfe14e680ac0aa08027849b999cc638