Use locale encoding to encode command line arguments (subprocess, os.exec*(), etc.) #53021

Closed
vstinner opened this issue May 20, 2010 · 9 comments

Comments

@vstinner
Member

BPO 8775
Nosy @loewis, @ronaldoussoren, @vstinner, @dvarrazzo
Files
  • cmdline_encoding.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2010-07-24.11:26:52.247>
    created_at = <Date 2010-05-20.12:09:24.080>
    labels = []
    title = 'Use locale encoding to encode command line arguments (subprocess, os.exec*(), etc.)'
    updated_at = <Date 2010-07-24.11:26:52.246>
    user = 'https://github.com/vstinner'

    bugs.python.org fields:

    activity = <Date 2010-07-24.11:26:52.246>
    actor = 'loewis'
    assignee = 'none'
    closed = True
    closed_date = <Date 2010-07-24.11:26:52.247>
    closer = 'loewis'
    components = []
    creation = <Date 2010-05-20.12:09:24.080>
    creator = 'vstinner'
    dependencies = []
    files = ['17716']
    hgrepos = []
    issue_num = 8775
    keywords = ['patch']
    message_count = 9.0
    messages = ['106139', '106150', '106171', '106543', '108151', '108153', '108154', '111432', '111456']
    nosy_count = 5.0
    nosy_names = ['loewis', 'ronaldoussoren', 'vstinner', 'piro', 'Arfrever']
    pr_nums = []
    priority = 'normal'
    resolution = 'wont fix'
    stage = None
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue8775'
    versions = ['Python 3.2']

    @vstinner
    Member Author

    The file system encoding is hardcoded to UTF-8 on Mac OS X, whereas the locale encoding... depends on the locale. See issue bpo-4388 for the details.

    I think that we should use the locale encoding to encode and decode command line arguments. We have to create a new encoding variable used for the command line arguments:

    • Py_CommandLineEncoding
    • sys.getcmdlineencoding()
    • (no sys.setcmdlineencoding() please!)
    • ...

    This encoding should only be used on POSIX: the native Windows type is unicode (wchar_t*). It should be used to decode sys.argv and to encode child process arguments (subprocess, os.exec*(), etc.).

    On Linux, it shouldn't change anything because the file system encoding is the locale encoding. Said differently, Python 3 already uses the locale encoding for the command line arguments on Linux.
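
To check whether the two encodings agree on a given system, a minimal sketch using only existing standard-library calls could look like the following. The proposed sys.getcmdlineencoding() does not exist in released Python, so it is not used here; the names `fs_encoding`, `locale_encoding` and `encodings_match` are illustrative only.

```python
import codecs
import locale
import sys

# Encoding Python uses to convert file names between str and bytes.
fs_encoding = sys.getfilesystemencoding()

# Encoding derived from the current locale (LANG / LC_* variables).
# Passing False avoids calling setlocale() as a side effect.
locale_encoding = locale.getpreferredencoding(False)

# codecs.lookup() maps aliases such as "UTF-8" and "utf8" to one canonical name,
# so the comparison does not depend on spelling.
encodings_match = (
    codecs.lookup(fs_encoding).name == codecs.lookup(locale_encoding).name
)

print("file system encoding:", fs_encoding)
print("locale encoding:     ", locale_encoding)
print("same encoding:       ", encodings_match)
```

The printed values depend on the platform and on the environment the interpreter was started in, so no fixed output is shown.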

    If you pass a filename on the command line and then open it, the filename is decoded with the locale encoding and then encoded with the file system encoding. I fear that it will fail if the two encodings differ...
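
The failure mode described above can be illustrated without touching the real locale. This is a sketch under assumed values: the two encodings are hardcoded to ISO-8859-1 and UTF-8 purely for demonstration, and the file name is hypothetical.

```python
# Assumed, hardcoded encodings for illustration only.
locale_encoding = "iso-8859-1"   # what the locale claims
fs_encoding = "utf-8"            # what the file system uses

# Bytes of a hypothetical file name "é.txt" as stored on a UTF-8 file system
# and as received in argv.
argv_bytes = "\u00e9.txt".encode("utf-8")      # b'\xc3\xa9.txt'

# Step 1: the command line argument is decoded with the locale encoding.
decoded = argv_bytes.decode(locale_encoding)   # 'Ã©.txt' (mojibake)

# Step 2: opening the file re-encodes the name with the file system encoding.
reencoded = decoded.encode(fs_encoding)        # b'\xc3\x83\xc2\xa9.txt'

# The bytes handed to the OS no longer match the name on disk.
round_trip_ok = reencoded == argv_bytes
print(round_trip_ok)  # False
```

With identical encodings on both steps, `reencoded` would equal `argv_bytes` and the file would be found.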

    @vstinner
    Member Author

    Fix the title: sys.argv is already decoded using the locale encoding on Unix, the problem is that it uses a (possibly) different encoding to encode command line arguments: file system encoding.

    @vstinner changed the title from "Use locale encoding to decode sys.argv, not the file system encoding" to "Use locale encoding to encode command line arguments (subprocess, os.exec*(), etc.)" on May 20, 2010
    @loewis
    Mannequin

    loewis mannequin commented May 20, 2010

    > I think that we should use the locale encoding to encode and decode command line arguments.

    I disagree. IIUC, this is only about OSX. Now, we shouldn't take any
    action until either some OSX expert explains to us how command line
    arguments are being passed on OSX, or we find some Apple documentation
    that can be taken as a specification.

    I think the C locale is very poorly supported on OSX, and we shouldn't
    really use it for anything. What may be useful is the terminal encoding
    (which may be different both from UTF-8 and the locale encoding),
    however, it's not possible to find out what the terminal encoding is.
    In addition, programs may be started "directly" (i.e. not from the
    terminal), in which case the terminal encoding would be irrelevant.

    For file name arguments at least, it's very clear that the command line
    arguments also use the file system encoding.

    @loewis (mannequin) changed the title from "Use locale encoding to encode command line arguments (subprocess, os.exec*(), etc.)" to "Use locale encoding to decode sys.argv, not the file system encoding" on May 20, 2010
    @vstinner
    Member Author

    @loewis: You restored the original (wrong) title "Use locale encoding to decode sys.argv, not the file system encoding", replacing the new (correct) title "Use locale encoding to encode command line arguments (subprocess, os.exec*(), etc.)". Was that intentional?

    @vstinner
    Member Author

    Attached patch is a draft adding a new encoding: command line encoding. It is used to encode (subprocess) and decode (python) the command line arguments. It adds sys.getcmdlineencoding().

    @loewis (mannequin) changed the title from "Use locale encoding to decode sys.argv, not the file system encoding" to "Use locale encoding to encode command line arguments (subprocess, os.exec*(), etc.)" on Jun 18, 2010
    @loewis
    Mannequin

    loewis mannequin commented Jun 18, 2010

    I'm still -1, failing to see the problem that is solved.

    @vstinner
    Member Author

    > I'm still -1, failing to see the problem that is solved.

    I know (and I agree), but I don't want to lose the patch :-)

    @ronaldoussoren
    Contributor

    This issue only seems to be relevant for OSX, and then only for OSX releases before 10.5, because in that release Apple made sure that the LANG variable and the similar LC_* ones specify a UTF-8 encoding. That puts us back in the common case where the filesystem encoding matches the locale encoding.

    A system where the filesystem encoding doesn't match the locale encoding is hard to get right. While it would be possible to add sys.cmdlineencoding, that doesn't actually solve the semantic problem, because external tools might not cooperate.

    That is, most system tools seem to work with bytes internally and do not treat arguments as text encoded in the locale encoding that should be re-encoded in the filesystem encoding before passing them to the C APIs.

    That is, when calling "ls somefile" the "ls" command will pass the bytes in argv[1] to the POSIX routines for getting file information without trying to reencode.

    In short, having a filesystem encoding that is different from the command-line encoding only works when all system tools cooperate and are unicode aware.
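
For comparison, Python 3 on POSIX can also carry undecodable argument bytes through str unchanged by using the "surrogateescape" error handler (PEP 383), which is the handler os.fsencode() and os.fsdecode() rely on. The sketch below hardcodes UTF-8 instead of calling those functions so the result does not depend on the local configuration; the file name is hypothetical.

```python
# Hypothetical argument bytes: "café.txt" in Latin-1, which is invalid UTF-8.
raw_arg = b"caf\xe9.txt"

# Decode as UTF-8; the undecodable byte 0xE9 becomes the lone surrogate U+DCE9
# instead of raising UnicodeDecodeError.
as_text = raw_arg.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original byte exactly.
back_to_bytes = as_text.encode("utf-8", errors="surrogateescape")

print(back_to_bytes == raw_arg)  # True
```

This mirrors what byte-oriented tools such as "ls" do implicitly: the bytes handed to the POSIX routines are the bytes that arrived in argv. It only helps when decoding and encoding use the same encoding, which is exactly the condition that fails when the locale and filesystem encodings differ.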

    To be honest, I'd say the behavior of OSX 10.4 is a bug and we might add a workaround on that platform that uses CFStringGetSystemEncoding() to fetch the actual system encoding when LANG=C.

    (And I'm -1 on adding the patch)

    See also: bpo-9167

    @loewis
    Mannequin

    loewis mannequin commented Jul 24, 2010

    It seems that everybody now agrees to close this issue as "won't fix".

    @loewis loewis mannequin closed this as completed Jul 24, 2010
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022