
Author vstinner
Recipients vstinner
Date 2010-09-29.22:36:20
Message-id <1285799783.26.0.120724027895.issue9992@psf.upfronthosting.co.za>
Content
On UNIX/BSD systems, Python decodes command line arguments with the locale encoding, whereas subprocess encodes arguments with the filesystem encoding. If the two encodings are different, arguments passed to a child Python process are decoded incorrectly.

There was already issue #4388, but it was closed because it was specific to old versions of Mac OS X. With the PYTHONFSENCODING environment variable (added to Python 3.2), it is easy to trigger this issue: run Python with a filesystem encoding different from the locale encoding. The attached script demonstrates the bug.
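
The mismatch can also be simulated in pure Python, without PYTHONFSENCODING. This is only a minimal sketch, not the attached script: the roundtrip() helper and the two encodings chosen (UTF-8 and ISO-8859-1) are assumptions made for illustration. It encodes an argument the way the parent would and decodes it the way the child would.

```python
def roundtrip(arg, encode_with, decode_with):
    """Hypothetical helper: encode like the parent, decode like the child."""
    raw = arg.encode(encode_with, "surrogateescape")
    return raw.decode(decode_with, "surrogateescape")


original = "caf\xe9"  # "cafe" with an acute accent on the last letter

# Same encoding on both sides: the argument survives.
same = roundtrip(original, "utf-8", "utf-8")

# Parent encodes with UTF-8, child decodes with ISO-8859-1: mojibake.
mixed = roundtrip(original, "utf-8", "iso-8859-1")

# Parent encodes with ISO-8859-1, child decodes with UTF-8:
# the undecodable byte 0xE9 becomes a lone surrogate.
escaped = roundtrip(original, "iso-8859-1", "utf-8")

print(ascii(same), ascii(mixed), ascii(escaped))
```

In both mixed cases the child receives a string different from the one the parent passed, which is the behaviour reported here.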

--

I see two possible encodings to encode and decode command line arguments (with surrogateescape error handler):

 (a) filesystem encoding
 (b) locale encoding

Decoding the Python command line arguments is one of the first operations executed when running Python, in the main() function. Neither the import machinery nor the codec API is available at that point, so I don't see how we can use the filesystem encoding here. Read issue #9630 to see how complex it is to use the filesystem encoding when initializing Python.

Using the locale encoding is easier because we already have the _Py_char2wchar() and _Py_wchar2char() functions to decode/encode with the locale encoding and the surrogateescape error handler. These functions use the wchar_t* type, which is less practical than PyUnicodeObject*, but that is an advantage here because wchar_t* doesn't need Python to be completely initialized (whereas some PyUnicode methods load modules, e.g. encode and decode).
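
The surrogateescape behaviour mentioned above can be illustrated at the Python level. This is a sketch only: decode_arg() and encode_arg() are hypothetical names, not the C functions, and UTF-8 stands in for whatever the locale encoding happens to be. It shows that a byte which is invalid in the chosen encoding survives a decode/encode round trip, as long as the same encoding is used in both directions.

```python
def decode_arg(raw: bytes, encoding: str) -> str:
    """Hypothetical stand-in for decoding one argv entry."""
    return raw.decode(encoding, "surrogateescape")


def encode_arg(text: str, encoding: str) -> bytes:
    """Hypothetical stand-in for encoding one argument for a child process."""
    return text.encode(encoding, "surrogateescape")


raw_arg = b"file-\xff.txt"  # 0xFF is never valid in UTF-8

decoded = decode_arg(raw_arg, "utf-8")    # 0xFF is mapped to U+DCFF
restored = encode_arg(decoded, "utf-8")   # U+DCFF is mapped back to 0xFF

print(ascii(decoded), restored)
```

Without the error handler, decoded.encode("utf-8") raises UnicodeEncodeError, because lone surrogates cannot be encoded in strict mode.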

In #8775, I proposed to create a new variable to store the "command line encoding": sys.getcmdlineencoding(). But this issue was closed because there was only one use case: #4388 (which was closed but not fixed).

I don't know, or don't really care, how sys.getcmdlineencoding() should be initialized. The important point is that we have to use the same encoding to decode and encode command line arguments.

--

I don't really know if using another encoding is the right solution. Maybe the real problem is that the filesystem encoding should not be controllable by the user?

And what about environment variables: should we continue to encode and decode them with the filesystem encoding, or should we use the new "command line encoding"?