Author lemburg
Recipients lemburg, pjenvey, vstinner
Date 2010-09-30.07:55:18
Message-id <4CA44263.3090509@egenix.com>
In-reply-to <1285799783.26.0.120724027895.issue9992@psf.upfronthosting.co.za>
Content
STINNER Victor wrote:
> 
> New submission from STINNER Victor <victor.stinner@haypocalc.com>:
> 
> On UNIX/BSD systems, Python decodes arguments with the locale encoding, whereas subprocess encodes arguments with the filesystem encoding. If the two encodings are different, we have a problem.
> 
> There was already issue #4388, but it was closed because it was specific to old versions of Mac OS X. With the PYTHONFSENCODING environment variable (added in Python 3.2), it is easy to trigger this issue: run Python with a filesystem encoding different from the locale encoding. The attached script demonstrates the bug.
> 
> --
> 
> I see two possible encodings to encode and decode command line arguments (with surrogateescape error handler):
> 
>  (a) filesystem encoding
>  (b) locale encoding
> 
> Decoding the Python command line arguments is one of the first operations executed when running Python, in the main() function. We don't have the import machinery or the codec API available at this point, so I don't see how we can use the filesystem encoding here. Read issue #9630 to see how complex it is to use the filesystem encoding when initializing Python.
> 
> Using the locale encoding is easier because we already have the _Py_char2wchar() and _Py_wchar2char() functions to decode/encode with the locale encoding and the surrogateescape error handler. These functions use the wchar_t* type, which is less practical than PyUnicodeObject*, but that is an advantage because the wchar_t* type doesn't need Python to be completely initialized (whereas some PyUnicode methods load modules, e.g. encode and decode).
> 
> In #8775, I proposed to create a new variable to store the "command line encoding": sys.getcmdlineencoding(). But this issue was closed because there was only one use case: #4388 (which was closed but not fixed).
> 
> I don't know, or don't really care, how sys.getcmdlineencoding() should be initialized. The important point is that we have to use the same encoding to decode and encode command line arguments.
> 
> --
> 
> I don't really know if using another encoding is the right solution. Maybe the problem is that the filesystem encoding should not be controllable by the user?
> 
> And what about environment variables: should we continue to encode and decode them with the filesystem encoding, or should we use the new "command line encoding"?

The problem with command line arguments is that they don't necessarily
have just one encoding (just like env vars may well use more than
one encoding) on Unix platforms.

When using path and file names on the command line they will likely
use the file system encoding. When passing in configuration variables,
the arguments will likely use the current locale settings.
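
A minimal Python sketch of why this matters, assuming a hypothetical
argument whose bytes are Latin-1 encoded (the sample value and variable
names are illustrative only): the same bytes decode differently under
two encodings, and they only survive a round trip when the same codec
and error handler are used in both directions.

```python
# Hypothetical raw argument: "é" encoded as Latin-1 (the single byte 0xE9).
raw = b"caf\xe9"

# Decoding with the matching encoding yields the intended text.
as_latin1 = raw.decode("iso-8859-1")

# Decoding as UTF-8 fails on 0xE9; surrogateescape keeps the byte
# as a lone surrogate code point instead of raising an error.
as_utf8 = raw.decode("utf-8", "surrogateescape")

# Re-encoding with the same codec and handler restores the original bytes.
roundtrip = as_utf8.encode("utf-8", "surrogateescape")

# Mixing codecs (decode as Latin-1, encode as UTF-8) changes the bytes.
mixed = as_latin1.encode("utf-8")

print(repr(as_latin1), repr(as_utf8), roundtrip, mixed)
```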

The use of wchar C lib functions is not ideal for parsing the
command line arguments, since this always uses the locale
settings.

Creating a copy of argv, as Python 3 does, is also not ideal,
since manipulating argv to change the process's ps output is
common on Unix, and Python 3 currently provides no access (AFAIK)
to the original argv array passed to Python.

I think we should use an approach similar to the one for os.environ
here, where we keep the original bytes buffers around and have
a second copy with str objects which may not necessarily be
complete (e.g. when decoding a string fails).
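
The idea can be sketched in Python as follows. This is only an
illustration of the "bytes original plus possibly incomplete str copy"
pattern; build_views() and the sample data are hypothetical and do not
describe how os.environ is actually implemented.

```python
def build_views(raw_items, encoding="utf-8"):
    """Return (bytes_view, str_view).

    bytes_view keeps every original entry untouched; str_view only
    contains the entries whose key and value both decode cleanly.
    """
    bytes_view = dict(raw_items)
    str_view = {}
    for key, value in bytes_view.items():
        try:
            str_view[key.decode(encoding)] = value.decode(encoding)
        except UnicodeDecodeError:
            # Leave the entry out of the str copy; it stays in bytes_view.
            continue
    return bytes_view, str_view


# Hypothetical data: the second value is not valid UTF-8.
raw_env = {b"HOME": b"/home/user", b"LEGACY": b"caf\xe9"}
env_bytes, env_str = build_views(raw_env)
print(env_str)
print(env_bytes[b"LEGACY"])
```

An application that hits a missing entry in the str view can still fall
back to the bytes view and decode it with whatever encoding it knows to
be right.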

Unfortunately, the use of wchar_t for command line arguments
has already spread throughout the code base, so I see little
chance of fixing this use.

What we could do is at least make the original bytes version
of argv available to Python, so that decoding errors can be worked
around in the application (just like we have for os.environ with
os.environb).
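
For comparison, a sketch of an application-level workaround, assuming
a Unix system and a Python version that provides os.fsencode() and
os.fsdecode() (3.2 or later). It does not expose the original argv
array; it only re-encodes the already decoded str arguments with the
filesystem encoding and its error handler. The helper name
argv_as_bytes() is hypothetical.

```python
import os
import sys


def argv_as_bytes(argv=None):
    """Re-encode str arguments to bytes using the filesystem encoding."""
    if argv is None:
        argv = sys.argv
    return [os.fsencode(arg) for arg in argv]


# Simulate an argument containing a byte that may not be decodable:
# os.fsdecode() applies the same decoding used for file names on Unix.
sample = [os.fsdecode(b"caf\xe9")]
recovered = argv_as_bytes(sample)
print(recovered)
```

Once the bytes are recovered, the application can decode each argument
with whichever encoding is appropriate for it.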
History
Date                 User     Action  Args
2010-09-30 07:55:22  lemburg  set     recipients: + lemburg, vstinner, pjenvey
2010-09-30 07:55:20  lemburg  link    issue9992 messages
2010-09-30 07:55:18  lemburg  create