classification
Title: Command-line arguments are not correctly decoded if locale and filesystem encodings are different
Type: Stage:
Components: Interpreter Core, Unicode Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: vstinner Nosy List: eric.araujo, ixokai, lemburg, loewis, pitrou, pjenvey, ronaldoussoren, vstinner
Priority: normal Keywords: patch

Created on 2010-09-29 22:36 by vstinner, last changed 2010-10-14 10:59 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
locale_fs_encoding.py vstinner, 2010-09-29 22:36
cmdline_encoding-2.patch vstinner, 2010-09-29 23:45 review
unnamed ronaldoussoren, 2010-10-11 14:32
issue9992.patch vstinner, 2010-10-11 21:16
Messages (49)
msg117669 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-29 22:36
On UNIX/BSD systems, Python decodes arguments with the locale encoding, whereas subprocess encodes arguments with the filesystem encoding. If the two encodings differ, we have a problem.

Issue #4388 already covered this, but it was closed because it was specific to old versions of Mac OS X. With the PYTHONFSENCODING environment variable (added to Python 3.2), it is easy to trigger this issue: run Python with a filesystem encoding different from the locale encoding. The attached script demonstrates the bug.

--

I see two possible encodings to encode and decode command line arguments (with surrogateescape error handler):

 (a) filesystem encoding
 (b) locale encoding

Decoding the Python command line arguments is one of the first operations executed when running Python, in the main() function. We don't have the import machinery or the codec API available at this point, so I don't see how we can use the filesystem encoding here. Read issue #9630 to see how complex it is to use the filesystem encoding when initializing Python.

Using the locale encoding is easier because we already have the _Py_char2wchar() and _Py_wchar2char() functions to decode/encode with the locale encoding and the surrogateescape error handler. These functions use the wchar_t* type, which is less practical than PyUnicodeObject*, but that is an advantage because the wchar_t* type doesn't need Python to be completely initialized (whereas some PyUnicode methods load modules, e.g. encode and decode).

In #8775, I proposed to create a new variable to store the "command line encoding": sys.getcmdlineencoding(). But this issue was closed because there was only one use case: #4388 (which was closed but not fixed).

I don't know, or don't really care, how sys.getcmdlineencoding() should be initialized. The important point is that we have to use the same encoding to decode and encode command line arguments.

--

I don't really know if using another encoding is the right solution. Maybe the problem is that the filesystem encoding should not be controllable by the user?

And what about environment variables: should we continue to encode and decode them with the filesystem encoding, or should we use the new "command line encoding"?
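
A minimal sketch of the mismatch described above. The two encodings are assumptions chosen for illustration (they are not taken from the attached script): one stands in for the filesystem encoding used by the parent, the other for the locale encoding used by the child.

```python
# Hypothetical illustration of the decode/encode mismatch.
ENCODE_WITH = "utf-8"       # assumed filesystem encoding (parent side)
DECODE_WITH = "iso-8859-1"  # assumed locale encoding (child side)

original = "xxx\u00e9.txt"

# The parent encodes the argument before passing it to the child.
raw = original.encode(ENCODE_WITH, "surrogateescape")

# The child decodes the same bytes with a different encoding: mojibake.
received = raw.decode(DECODE_WITH, "surrogateescape")

# Decoding with the encoding used for encoding gives the argument back.
same = raw.decode(ENCODE_WITH, "surrogateescape")

print(received == original)
print(same == original)
```

Whatever encoding is picked, the point is only that both sides have to agree on it.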
msg117676 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-29 23:45
[cmdline_encoding-2.patch] Patch to use locale encoding to decode and encode command line arguments. Remarks about the patch:

 - failing to get the locale encoding (very unlikely) is a fatal error
 - TODO: in initfsencoding(), Py_FileSystemDefaultEncoding should reuse Py_CommandLineEncoding instead of calling get_codeset() again
 - subprocess encodes arguments to the command line encoding for _posixsubprocess and Python implementations
 - _posixsubprocess doesn't support unicode command line arguments anymore

The patch is an updated version of the patch attached to #8775.

With the patch, the locale_fs_encoding.py test script passes.
msg117705 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-09-30 07:55
The problem with command line arguments is that they don't necessarily
have just one encoding (just like env vars may well use more than
one encoding) on Unix platforms.

When using path and file names on the command line they will likely
use the file system encoding. When passing in configuration variables,
the arguments will likely use the current locale settings.

The use of wchar C lib functions is not ideal for parsing the
command line arguments, since this always uses the locale
settings.

Having Python 3 create its own copy of argv is also not ideal,
since manipulating argv to change the process's ps output is
common on Unix, and Python 3 currently provides no access (AFAIK)
to the original argv array passed to it.

I think we should use a similar approach as the one for os.environ
here, where we keep the original bytes buffers around and have
a second copy with str objects which may not necessarily be
complete (e.g. when decoding a string fails).

Unfortunately, the use of wchar_t for command line arguments
has already spread throughout the code base, so I see little
chance of fixing this use.

What we could do is at least make the original bytes version
of argv available to Python, so that decoding errors can be worked
around in the application (just like we have for os.environ with
os.environb).
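
A rough sketch of how an application could approximate a bytes view of argv today. It assumes the arguments were decoded with the filesystem encoding and the surrogateescape error handler (which is exactly what this issue is about making consistent); argv_bytes is a hypothetical helper, not an existing API.

```python
import os
import sys


def argv_bytes(args=None):
    """Return the arguments as bytes (hypothetical helper).

    Assumes each argument was decoded with the filesystem encoding,
    so that os.fsencode() can reverse the decoding.
    """
    if args is None:
        args = sys.argv
    return [os.fsencode(arg) for arg in args]


# Simulate an argument that arrived as bytes and was decoded by Python.
decoded = os.fsdecode(b"xxx\xff.txt")
print(argv_bytes([decoded]))
```

If the decoding used a different encoding than os.fsencode(), the recovered bytes would not match the original ones, which is the failure mode reported here.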
msg117709 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-30 08:57
> The problem with command line arguments is that they don't necessarily
> have just one encoding (just like env vars may well use more than
> one encoding) on Unix platforms.

The issue #8776 proposes the creation of sys.argv.

> When using path and file names on the command line they will likely
> use the file system encoding. When passing in configuration variables,
> the arguments will likely use the current locale settings.

Ok, and? We have to pick one and use it. We cannot guess the encoding of 
each argument, nor change sys.argv to use bytes. (And the creation of sys.argvb 
will not solve this issue.)

I still think that using the filesystem encoding is not possible for technical 
reasons (it might be possible, but it will be very hard), whereas I attached a 
working patch to use the locale encoding.
msg117711 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-09-30 09:02
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> The problem with command line arguments is that they don't necessarily
>> have just one encoding (just like env vars may well use more than
>> one encoding) on Unix platforms.
> 
> The issue #8776 proposes the creation of sys.argv.

Right, I think you meant sys.argvb and yes, I think it's a good idea.

>> When using path and file names on the command line they will likely
>> use the file system encoding. When passing in configuration variables,
>> the arguments will likely use the current locale settings.
> 
> Ok, and? We have to pick up one and use it. We cannot guess the encoding of 
> each argument, nor change sys.argv to use bytes. (And the creation sys.argvb 
> will not solve this issue.)

Sure and using the locale setting is fine. The point is that we pick
one, but keep the original data around for the application to use in
case it knows better, so this will solve the problem.

> I still think that using the filesystem encoding is not possible for technical 
> reasons (it might be possible, but it will be very hard), whereas I attached a 
> working patch to use the locale encoding.
msg117716 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-30 10:43
Excerpt from an interesting message (msg111432) of #8775 (issue specific to Mac OS X):

<< A system where the filesystem encoding doesn't match the locale encoding is hard to get right. While it would be possible to add sys.cmdlineencoding that doesn't actually solve the semantic problem because external tools might not cooperate.

That is, most system tools seem to work with bytes internally and do not treat arguments as text encoded in the locale encoding that should be re-encoded in the filesystem encoding before passing them to the C APIs.

That is, when calling "ls somefile" the "ls" command will pass the bytes in argv[1] to the POSIX routines for getting file information without trying to reencode. >>
msg117717 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-09-30 10:53
> A system where the filesystem encoding doesn't match the locale
> encoding is hard to get right.

Mmmh. Maybe the problem is that the new PYTHONFSENCODING environment variable (added by #8622) introduced a horrible inconsistency between Python and other applications. Other applications ignore PYTHONFSENCODING.

The simplest solution to fix this issue is to remove the PYTHONFSENCODING variable. In this case, the user has to set LANG, LC_ALL or LC_CTYPE, instead of PYTHONFSENCODING, to set Python's filesystem encoding.
msg117871 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-02 12:14
See also #10014: sys.path[0] is decoded from the locale encoding instead of the filesystem encoding.
msg118221 - (view) Author: Stephen Hansen (ixokai) (Python triager) Date: 2010-10-08 19:25
This issue seems to be the cause of issue4388 -- and cmdline_encoding-2.patch fixes it, fwiw.
msg118225 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-08 20:30
> The important point is that we have to use the same encoding to decode 
> and encode command line arguments.

I don't think I agree with this. It's only important when you run a Python interpreter using subprocess, but the point of using subprocess is to run something *else* than Python. This something else generally expects filenames in their correct bytes representation, not in a mojibaked version hand-tuned for Python.
msg118257 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-09 09:14
Antoine: Python cannot possibly know whether a command line argument is meant as a file name or as some other text, and what encoding the receiving application will apply to it (if any).

I agree it's best to have all "IO" encodings being the same in Python, but perhaps there are use cases where you have to use a different encoding for file names, so I don't think it is necessary to rip this feature out.

So perhaps it would be best if Python had two external default encodings: the IO one (command line arguments, environment variables, text files), and the file name encoding (defaulting to the IO encoding if not set). If they differ and you get mojibake in subprocesses: bad luck - it's exactly what you asked for. 

The fsname encoding should *only* be used for file names, not for command line arguments in subprocess.

If we have tests that rely on the fsname encoding and the IO encoding being the same, then those tests should get skipped if the encodings are actually different.

The tricky part remains determining the IO encoding. If PYTHONIOENCODING can override the locale's encoding, then the tricky question is how command line arguments should get decoded in the absence of the codec machinery on Unix. They must get decoded for uniformity with Windows (which receives the command line as a Unicode string already).

That problem may be the reason why we need *three* encodings (as it is now), the IOENCODING only applying to file streams.
msg118258 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-09 09:45
> Antoine: Python cannot possibly know whether a command line argument
> is meant as a file name or as some other text, and what encoding the
> receiving application will apply to it (if any).

I understand. But practicality seems to suggest that, most of the time,
non-ASCII arguments on a command line will be filenames. We should
probably try to favour the common case (barring implementation issues,
though, and it seems using the filesystem encoding in the interpreter
bootup phase is not easy).

> So perhaps it would be best if Python had two external default
> encodings: the IO one (command line arguments, environment variables,
> text files), and the file name encoding (defaulting to the IO encoding
> if not set).

Looking at environment variables here, they seem to be either:
- integers (pids, port numbers...)
- conventional variables (such as "fr_FR.utf8")
- usernames
- file paths

The most likely values to be non-ASCII are, therefore, file paths. So it
would make sense to also use the filesystem encoding for environment
variables (so as to satisfy the common case).

As for text files, I agree it's different, and the encoding choice
routine in TextIOWrapper already favours locale.getpreferredencoding()
and ignores the filesystem encoding.

> If we have tests that rely on the fsname encoding and the IO encoding
> being the same, then those tests should get skipped if the encodings
> are actually different.

Agreed, but only when this discussion has come to a conclusion :)
msg118263 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-09 10:47
> The most likely values to be non-ASCII are, therefore, file paths. So it
> would make sense to also use the filesystem encoding for environment
> variables (so as to satisfy the common case).

-1. Environment variables are typically set in a text editor or on
the command line, so they will typically have the locale's encoding.

Applications that wish to support the case that fsencoding != locale
can recode the file names if desired, or use environb in the first
place.

If the mere existence of the fsname encoding leads to that much
confusion, I think I also support its removal.
msg118264 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-09 10:52
> -1. Environment variables are typically set in a text editor or on
> the command line, so they will typically have the locale's encoding.

Fair enough.

> If the mere existence of the fsname encoding leads to that much
> confusion, I think I also support its removal.

Well, the fsname encoding has a hardwired value under OS X (regardless
of the locale), which kind of justifies its existence, no?
msg118268 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-09 11:49
>> If the mere existence of the fsname encoding leads to that much
>> confusion, I think I also support its removal.
> 
> Well, the fsname encoding has a hardwired value under OS X (regardless
> of the locale), which kind of justifies its existence, no?

Perhaps. We could also declare that command line arguments and
environment variables are always UTF-8-encoded on OSX (which I think
would be fairly accurate), and stop relying on the locale to determine
encodings on OSX (which Apple didn't like as a mechanism, anyway).
I think OSX converges faster to UTF-8 than the other Unices.
msg118269 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-09 12:01
> Perhaps. We could also declare that command line arguments and
> environment variables are always UTF-8-encoded on OSX (which I think
> would be fairly accurate)

Python uses the filesystem encoding to encode/decode environment variables, 
and on OSX, the fs encoding is utf-8. For the command line, it would mean that 
we introduce a new encoding: "command line encoding", which will be utf-8 on 
OSX.
msg118270 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-09 12:07
> For the command line, it would mean that we 
> introduced a new encoding: "command line encoding", which will be utf-8 on 
> OSX.

Or more generally "environment encoding", if it's also used for env
vars. This could solve the subprocess issue neatly.
msg118271 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-09 12:28
> So perhaps it would be best if Python had two external default encodings:
> the IO one (command line arguments, environment variables, text files),
> and the file name encoding (defaulting to the IO encoding if not set)

Hum, I prefer to consider the FS encoding as an *internal* encoding. ... But 
it's not completely true: it is used for the environment variables.

Let's consider that the FS encoding is only an internal encoding. We need 3 
encodings:
 - FS encoding: any operation on the filesystem
 - IO encoding: text file contents (including stdin, stdout, stderr which are 
text files)
 - a 3rd encoding (let's call it the "command line encoding"): used for the 
command line arguments and the environment variables

For technical reasons ("bootstrap": Python initialization issues), I would 
like the 3rd encoding to be set from the locale encoding. The user can only 
control it using the classical locale variables (LC_ALL, LC_CTYPE, LANG).
msg118278 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-09 17:11
On 09.10.2010 at 14:07, Antoine Pitrou wrote:
> 
> Antoine Pitrou <pitrou@free.fr> added the comment:
> 
>> For the command line, it would mean that we 
>> introduced a new encoding: "command line encoding", which will be utf-8 on 
>> OSX.
> 
> Or more generally "environment encoding", if it's also used for env
> vars. This could solve the subprocess issue neatly.

Please no. We run into problems because we have two inconsistent
encodings, and now you propose to introduce another one, allowing
for even more inconsistencies???
msg118279 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-09 17:32
> Please no. We run into problems because we have two inconsistent
> encodings, and now you propose to introduce another one, allowing
> for even more inconsistencies???

It would not really be a "third encoding", since it would replace the
locale encoding for all practical purposes, if I understand Victor's
proposal correctly.
msg118336 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-10 15:51
> We run into problems because we have two inconsistent
> encodings, ...

What? No. We have problems because we don't use the same encoding to decode and to encode the same data type. It's not a problem to use a different encoding for each data type (stdout, filenames, environment variables, ...).

--

About the 3rd encoding: it will be just the locale encoding. Using the locale encoding to encode/decode command line arguments and environment variables is completely compatible with Python 3.1, because Python 3.1 initializes the filesystem encoding with the locale encoding. Using the locale encoding helps interoperability because other programs use the same encoding.

Mac OS X is a special case. The filesystem encoding is utf-8 on this OS, whereas the locale encoding depends on the LANG variable. If I understood MvL's proposal correctly, we should not rely on the locale on Mac OS X. So the "3rd encoding" and the filesystem encoding should be hardcoded to utf-8?

--

The "third encoding" is no longer controllable by a special environment variable, only by the classic locale environment variables (LC_ALL, LC_CTYPE, LANG). Is it a problem? I remember a comment from MAL saying that it may be a problem for CGI environment variables, because some (all?) variables are not encoded with the locale encoding (but the HTML encoding?). I don't know if Python should work around CGI specific issues. In Python 3.2, we now have os.environb: it's now possible to use a different encoding for each variable.
msg118337 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-10 16:22
On 10.10.2010 at 17:51, STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> We run into problems because we have two inconsistent encodings,
>> ...
> 
> What? No. We have problems because we don't use the same encoding to
> decode and to encode the same data type. It's not a problem to use a
> different encoding for each data type (stdout, filenames, environment
> variables, ...).

This is exactly the very problem that we face. In particular, the
question is what encoding to use if something is *both* a filename
and an environment variable value, or both a filename and a command
line argument.

> Mac OS X is a special case. Filesystem encoding is utf-8 on this OS,
> whereas the locale encoding depends on LANG variable. If I understood
> MvL proposition correctly, we should not rely on the locale on Mac OS
> X.

"Not rely on" is perhaps a bit harsh. It's not clear (to me) under what
conditions the locale's encoding will be more correct than just assuming
UTF-8 - there may actually be use cases for it.

However, with the surrogate escapes, we could just always decode using
UTF-8, and leave any mojibake problems that may arise from this
to the application. I do think that these problems will be rare,
since a) many OSX installations use UTF-8, anyway, and b) those that
don't likely experience the proper round-tripping of the escape mechanism.
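
As a small sketch of the round-tripping mentioned in b): bytes that are not valid UTF-8 survive a decode/encode cycle when the surrogateescape error handler (PEP 383) is used on both sides. The sample bytes are an assumption chosen for illustration.

```python
# Latin-1 encoded name; the byte 0xE9 is not valid UTF-8 in this position.
raw = b"xxx\xe9.txt"

# The undecodable byte is mapped to the lone surrogate U+DCE9.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original byte.
back = text.encode("utf-8", "surrogateescape")

print(ascii(text))
print(back == raw)
```

The str value is not displayable as intended, but no information is lost as long as the same encoding and handler are used in both directions.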

> So the "3rd encoding" and the filesystem encodings should be
> hardcoded to utf-8?

That's an option to consider, yes - I'd like an OSX expert to
comment.

> The "third encoding" is no more controlable by a special environment
> variable, only by classic locale environment variables (LC_ALL,
> LC_CTYPE, LANG). Is it a problem? I remember a comment from MAL
> saying that it may be a problem for CGI for the environment variables
> because some (all?) variables are not encoded with the locale
> encoding (but the HTML encoding?). I don't know if Python should
> workaround CGI specific issues. In Python 3.2, we have now
> os.environb: it's now possible to use a different encoding for each
> variable.

I think these problems are sufficiently resolved now: either by
PEP 3333, PEP 444, PEP 383, or os.environb.

I think you misunderstood MAL's comment, though: the environment
variables are not encoded in *any* specific encoding. Instead,
they are copied literally from the HTTP request, using whatever
bytes the browser originally put in there - which may or may
not have followed a particular encoding. HTTP is silent on
this most of the time, and HTML is out of scope.
msg118339 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-10 17:59
> > What? No. We have problems because we don't use the same encoding to
> > decode and to encode the same data type. It's not a problem to use a
> > different encoding for each data type (stdout, filenames, environment
> > variables, ...).
> 
> This is exactly the very problem that we face. In particular, the
> question is what encoding to use if something is *both* a filename
> and an environment variable value, or both a filename and a command
> line argument.

The question is: what is the best default encoding for a specific data type? 
There is no perfect answer (well, except maybe using byte strings :-)). Each 
solution has its own use cases and disadvantages.

If an application knows exactly the encoding of some data, and it is not the 
default encoding, it can still redecode the data. Using os.environb, it's a 
little bit better: the application just has to decode (it doesn't have to 
encode, or to know which encoding was initially used to decode the data). For 
sys.argv, I still want to create sys.argvb (bytes version) ;-)
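
A sketch of what "redecode" means here. The helper name redecode and the encodings in the example are hypothetical; the only requirement is that the application knows which encoding was used for the initial decoding.

```python
def redecode(value, decoded_with, actual_encoding):
    """Undo a decoding done with the wrong encoding (hypothetical helper).

    value           -- str produced by decoding with `decoded_with`
    decoded_with    -- encoding that was used for the initial decoding
    actual_encoding -- encoding the application knows to be the right one
    """
    raw = value.encode(decoded_with, "surrogateescape")
    return raw.decode(actual_encoding, "surrogateescape")


# UTF-8 bytes that were decoded as ISO-8859-1 by default.
wrong = "xxx\u00e9.txt".encode("utf-8").decode("iso-8859-1")
print(redecode(wrong, "iso-8859-1", "utf-8") == "xxx\u00e9.txt")
```

With a bytes API such as os.environb, the first step (encoding back to bytes) is not needed, so the application does not have to know decoded_with at all.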

For the command line arguments and environment variables, we don't have a lot 
of choices: locale or filesystem encodings. So Antoine and Martin: which 
encoding do you prefer? We should maybe try to find some use cases.

Here is a dummy script bla.py:
---
import sys
print(sys.argv)
try:
    open(sys.argv[1]).close()
except Exception as err:
    print("open error: %s" % err)
else:
    print("open ok")
---

Locale encoding = FS encoding = utf-8:

$ ./python bla.py xxxé.txt 
['bla.py', 'xxxé.txt']
open ok

Locale encoding = utf8, FS encoding = ascii:

$ PYTHONFSENCODING=ascii ./python bla.py xxxé.txt 
['bla.py', 'xxxé.txt']
open error: 'ascii' codec can't encode character '\xe9' ...

The filename is displayed correctly, but we are unable to open the file if 
PYTHONFSENCODING is used :-/ Should the filename be displayed differently if 
PYTHONFSENCODING is used?

> I think these problems are sufficiently resolved now: either by
> PEP 3333, PEP 444, PEP 383, or os.environb.

Ok, cool :-)

> I think you misunderstood MAL's comment, though: the environment
> variables are not encoded in *any* specific encoding. Instead,
> they are copied literally from the HTTP request, using whatever
> bytes the browser originally put in there - which may or may
> not have followed a particular encoding. HTTP is silent on
> this most of the time, and HTML is out of scope.

Ah yes, thanks for your explanation. I was unable to find his comment.
msg118340 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-10 18:23
> For the command line arguments and environment variables, we don't have a lot 
> of choices: locale or filesystem encodings. So Antoine and Martin: which 
> encoding do you prefer?

I still propose to drop the fsname encoding. Then this question goes away.
msg118341 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-10 18:33
On Sunday 10 October 2010 at 18:23 +0000, Martin v. Löwis wrote:
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
> > For the command line arguments and environment variables, we don't have a lot 
> > of choices: locale or filesystem encodings. So Antoine and Martin: which 
> > encoding do you prefer?
> 
> I still propose to drop the fsname encoding. Then this question goes away.

I don't know what you mean by dropping, since OS X by construction needs
a filesystem encoding (utf-8) different from the locale encoding; and
Windows hardwires the decoding/encoding of bytes filenames using mbcs
regardless of the current codepage, IIRC.

So do you just mean the filesystem encoding should be hidden from the
user? What would be the benefit?
msg118344 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-10 19:44
> I don't know what you mean by dropping, since OS X by construction needs
> a filesystem encoding (utf-8) different from the locale encoding;

See above. I propose to stop using the locale encoding for command line
arguments and environment variables on OSX, and use UTF-8 instead.

> and
> Windows hardwires the decoding/encoding of bytes filenames using mbcs
> regardless of the current codepage, IIRC.

I wish byte-oriented file names could be dropped on Windows. But that
is probably too incompatible.

> So do you just mean the filesystem encoding should be hidden from the
> user? What would be the benefit?

That the very issue that this bug report (re-read the title) is about
would go away.
msg118352 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-11 12:15
> > ... So Antoine and Martin: which encoding do you prefer?
> 
> I still propose to drop the fsname encoding. Then this question goes away.

You mean that we should use the following encodings for the command line 
arguments, environment variables and all filenames/paths:
 - Mac OS X: utf-8
 - Windows: unicode for command line/env, mbcs to decode filenames
 - other OSes: locale encoding

To do that, we have to:
 - "other OSes": delete the PYTHONFSENCODING variable
 - Mac OS X: use utf-8 to decode the command line arguments (we can use 
PyUnicode_DecodeUTF8()+PyUnicode_AsWideCharString() before Python is 
initialized)

On "other OSes", we continue to use the FS encoding to encode command 
line/env vars, because the FS encoding will always be the locale encoding. And 
it's more practical to use sys.getfilesystemencoding() than mbstowcs(), 
wcstombs(), _Py_wchar2char(), _Py_char2wchar(), etc. because the FS encoding 
doesn't depend on the current locale, and it uses Python codecs which support 
more error handlers.
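
The per-platform choice listed above could be summarized by a sketch like the following. This only models the proposal at the Python level; the function name is hypothetical, and returning None for Windows is just a marker used here to mean "native Unicode APIs, no byte encoding involved".

```python
import locale
import sys


def proposed_cmdline_encoding(platform=sys.platform):
    """Model of the proposal (hypothetical helper, not an existing API)."""
    if platform == "darwin":
        # Mac OS X: always utf-8, regardless of the locale.
        return "utf-8"
    if platform.startswith("win"):
        # Windows: command line and environment are already Unicode.
        return None
    # Other OSes: the locale encoding (LC_ALL, LC_CTYPE, LANG).
    return locale.getpreferredencoding(False)


print(proposed_cmdline_encoding("darwin"))
```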

I like this solution because it doesn't change a lot of things. I agree to 
drop PYTHONFSENCODING because it looks like PYTHONFSENCODING introduced more 
inconsistencies than it solved.
msg118358 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-11 13:45
STINNER Victor wrote:
> 
> I like this solution because it doesn't change a lot of things. I agree to 
> drop PYTHONFSENCODING because it looks like PYTHONFSENCODING introduced more 
> inconsistencies than it solved.

If you remove PYTHONFSENCODING, then we have to reconsider
the removal of sys.setfilesystemencoding().

The main argument for removal of the sys function was having
the environment variable.

If you remove both, Python will get very poor grades for OS
interoperability on platforms that often deal with multiple
different encodings for file names.

I am repeating myself, but please keep in mind that the locale
is an application scope setting. It doesn't have anything
to do with what's actually stored in file systems or what the
OS uses internally.

Python therefore has to provide a way to customize the file system
encoding and to override the locale guessing that's currently
happening.

You can't just tell people to go with whatever encoding setup
you prefer to make Python's guessing easier or more correct. Python
has to adapt to what the users actually use, not the other way
around. Where that's not easily possible, there have to be ways
to explicitly tell Python what to use... telling the user to adjust
his or her locale settings just to be able to run Python is not
an option.

The world is still moving towards Unicode - it's not 100% there
yet.
msg118359 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-11 13:54
> You mean that we should use the following encoding for the command line 
> arguments, environment variables and all filenames/paths:
>  - Mac OS X: utf-8
>  - Windows: unicode for command line/env, mbcs to decode filenames

No: unicode for filenames also.

>  - others OSes: locale encoding

Yes, that is my proposal.
msg118360 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-11 13:56
> If you remove both, Python will get very poor grades for OS
> interoperability on platforms that often deal with multiple
> different encodings for file names.

Why that? It will work very well in such a setting, much better
than, say, Java.
msg118365 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-10-11 14:32
On 09 Oct, 2010, at 02:07 PM, Antoine Pitrou wrote:

>> For the command line, it would mean that we
>> introduced a new encoding: "command line encoding", which will be utf-8 on
>> OSX.
>
> Or more generally "environment encoding", if it's also used for env
> vars. This could solve the subprocess issue neatly.

Note that the command-line and environment encoding on OSX is generally UTF-8, even if that is not always reflected in the locale settings.

On recent OSX releases LANG will be set to a UTF-8 aware locale ("en_US.UTF-8" on my machine) when you start a shell using Terminal.app.

The correct locale environment variables are AFAIK not set in two important situations: on OSX 10.4 and when running code from an application bundle. In both cases the environment/command-line encoding should be treated as UTF-8.

There is one reason for not wanting to assume that the encoding is always UTF-8: the user might access the system from a non-UTF8 terminal (such as when logging in with an SSH session from a system not using UTF-8, or using an alternate terminal application). IMHO these are minor enough use-cases that we could just enforce that the encoding is UTF-8 on OSX. 

That would ensure that the filesystem encoding and environment/command-line encoding are consistent and we'd no longer run into the problem that triggered this issue.

Ronald

msg118367 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-11 14:38
> There is one reason for not wanting to assume that the encoding is
> always UTF-8: the user might access the system from a non-UTF8
> terminal (such as when logging in with an SSH session from a system
> not using UTF-8, or using an alternate terminal application). IMHO
> these are minor enough use-cases that we could just enforce that the
> encoding is UTF-8 on OSX.

Ok, that's enough of an expert statement for me to settle the OSX
case: we will always assume that environment data is UTF-8 on OSX
(leaving the rest to the surrogate escape handler).
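
To illustrate what "leaving the rest to the surrogate escape handler" means in practice, here is a minimal Python sketch. The helper names decode_env() and encode_env() are hypothetical and not part of any patch on this issue; the only thing assumed is the standard "surrogateescape" error handler. Bytes that are not valid UTF-8 are mapped to lone surrogates on decoding and restored unchanged on encoding, so no data is lost.

```python
def decode_env(raw: bytes) -> str:
    """Decode environment data as UTF-8, keeping undecodable bytes as surrogates."""
    return raw.decode("utf-8", "surrogateescape")


def encode_env(text: str) -> bytes:
    """Inverse of decode_env(): surrogates turn back into the original bytes."""
    return text.encode("utf-8", "surrogateescape")


valid = "caf\u00e9".encode("utf-8")   # well-formed UTF-8
invalid = b"caf\xe9"                  # Latin-1 bytes, not valid UTF-8

# Well-formed input decodes normally.
assert decode_env(valid) == "caf\u00e9"

# The stray byte 0xE9 becomes the lone surrogate U+DCE9 ...
decoded = decode_env(invalid)
assert decoded == "caf\udce9"

# ... and round-trips to exactly the same bytes.
assert encode_env(decoded) == invalid
```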
msg118368 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-11 14:41
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> If you remove both, Python will get very poor grades for OS
>> interoperability on platforms that often deal with multiple
>> different encodings for file names.
> 
> Why that? It will work very well in such a setting, much better
> than, say, Java.

Well, Java pretty much fails completely in this respect, so being
better than Java is not exactly the benchmark I had in mind :-)

I think the proper benchmark would be a Python2 application that
has no problems with these things, since file names are just
bytes that refer to files on the disk, with no associated encoding -
at least on Unix and related platforms.

Being pedantic about forcing some encoding onto things that don't
have an encoding won't really work out in practice. Dealing with
file names, OS environments, pipes and sockets is dirty work, so
I think we should go with the 80-20 approach in making 80% easy
and 20% harder, but still possible.
msg118374 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-11 16:01
> Being pedantic about forcing some encoding onto things that don't
> have an encoding won't really work out in practice. Dealing with
> file names, OS environments, pipes and sockets is dirty work, so
> I think we should go with the 80-20 approach in making 80% easy
> and 20% harder, but still possible.

Unix applications can always use the byte-oriented file name APIs
if they need to. Then you are back to the state that things have
in Python 2. No need to have a user-tunable file system encoding
there.
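
A minimal sketch of the byte-oriented approach, using only standard library calls (os.listdir() with a bytes argument, os.fsencode()). The helper name list_raw_names() is made up for this illustration. The example uses an ASCII file name so that it runs on any platform; on most Unix file systems an arbitrary byte string would behave the same way.

```python
import os
import tempfile


def list_raw_names(directory: bytes) -> list:
    """Return directory entries as bytes; no decoding step is involved."""
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as tmp:
    raw_dir = os.fsencode(tmp)

    # Create a file through the bytes API.
    path = os.path.join(raw_dir, b"report.txt")
    with open(path, "wb") as f:
        f.write(b"data")

    names = list_raw_names(raw_dir)
    assert names == [b"report.txt"]
    assert all(isinstance(name, bytes) for name in names)
```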

However, I completely fail to see the advantage that the
PYTHONFSENCODING variable has over the LANG variable. If it's
possible to set PYTHONFSENCODING in some application, it surely
is also possible to set LANG (or LC_CTYPE), no? Setting the
latter also gives you the advantage that environment variables
and command line arguments use the same encoding as file names.
msg118375 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-11 16:08
> However, I completely fail to see the advantage that the
> PYTHONFSENCODING variable has over the LANG variable. If it's
> possible to set PYTHONFSENCODING in some application, it surely
> is also possible to set LANG (or LC_CTYPE), no? Setting the
> latter also gives you the advantage that environment variables
> and command line arguments use the same encoding as file names.

I guess LANG and LC_CTYPE can be used for other purposes such as
internationalization.
msg118377 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-11 16:26
Martin v. Löwis wrote:
> 
> Martin v. Löwis <martin@v.loewis.de> added the comment:
> 
>> Being pedantic about forcing some encoding onto things that don't
>> have an encoding won't really work out in practice. Dealing with
>> file names, OS environments, pipes and sockets is dirty work, so
>> I think we should go with the 80-20 approach in making 80% easy
>> and 20% harder, but still possible.
> 
> Unix applications can always use the byte-oriented file name APIs
> if they need to. Then you are back to the state that things have
> in Python 2. No need to have a user-tunable file system encoding
> there.

Right, and if you take the position of refusing to guess,
which we usually do in Python, then interfacing to file names
using bytes would be the appropriate way to handle the situation.

However, since Python3 has chosen to regard file names as
text regardless of platform, we're now in the situation that
we have to come up with some educated guess on the encoding.

> However, I completely fail to see the advantage that the
> PYTHONFSENCODING variable has over the LANG variable. If it's
> possible to set PYTHONFSENCODING in some application, it surely
> is also possible to set LANG (or LC_CTYPE), no? Setting the
> latter also gives you the advantage that environment variables
> and command line arguments use the same encoding as file names.

The advantage is that you can change the Python file system
encoding *without* having to change your locale settings.

You can't possibly expect a user to switch to using UTF-8 for
all his/her applications just because Python needs this to
properly decode file names.

Users of applications written in Python will most likely not
even know how to change the locale encoding.
msg118385 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-11 19:58
MvL> >  - Windows: unicode for command line/env, mbcs to decode filenames
MvL> No: unicode for filenames also.

Yes, I mean unicode for everything, but decode bytes data from the mbcs encoding.
msg118386 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-11 20:03
MAL> If you remove the PYTHONFSENCODING, then we have to reconsider
MAL> removal of sys.setfilesystemencoding().

Pleeeeeeeease, Marc, read my comments. You never consider technical problems, you just propose to ensure that "Python just works", without answering my technical questions. I already explained 2 or 3 times that sys.setfilesystemencoding() was completely buggy and not usable in practice. You proposed PYTHONFSENCODING and I implemented it. But then I explained in an email to python-dev and in this issue that this environment variable introduced many problems.

I don't see how sys.setfilesystemencoding() would solve this issue; it's out of scope.
msg118388 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-10-11 20:07
> You can't possibly expect a user to switch to using UTF-8 for
> all his/her applications just because Python needs this to
> properly decode file names.

If the user hasn't switched to UTF-8, why would Python need that
to properly decode file names?
msg118389 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-11 20:17
MAL> You can't just tell people to go with whatever encoding setup
MAL> you prefer to make Python's guessing easier or more correct.

Python doesn't really *guess* the encoding, it just reads the encoding from the locale.

What do you mean by "more correct"? How can Python know the right encoding better than the user? Python should not guess anything. If the environment is not correctly configured, it's not Python's fault. The user has to fix their environment.
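
As a rough illustration of "reading the encoding from the locale", the following sketch uses only the standard locale and codecs modules. The function name locale_encoding() is hypothetical. Note that the value reported depends on the interpreter version and its configuration, so no particular result is claimed here.

```python
import codecs
import locale


def locale_encoding() -> str:
    """Return the normalized name of the encoding selected by LC_CTYPE."""
    try:
        # Adopt the user's setting from the environment
        # (LC_ALL, then LC_CTYPE, then LANG).
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # Unknown locale name: keep whatever locale is currently active.
        pass
    # False: do not call setlocale() again, just report the current encoding.
    name = locale.getpreferredencoding(False)
    # Normalize aliases such as "UTF-8" or "ANSI_X3.4-1968".
    return codecs.lookup(name).name


print(locale_encoding())
```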
msg118390 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-11 20:19
> I guess LANG and LC_CTYPE can be used for other purposes
> such as internationalization.

That's why there are different environment variables:
 * LC_MESSAGES for i18n (messages)
 * LC_CTYPE for the encoding
 * LC_TIME for time and date
 * etc.
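
A small sketch of how the categories listed above can be set independently when launching a child process. The helper env_with_ctype() is made up for this example; the only facts relied on are the POSIX precedence rules (LC_ALL overrides every LC_* variable, which in turn override LANG).

```python
import os


def env_with_ctype(ctype, base=None):
    """Copy an environment and change only the encoding-related category."""
    env = dict(os.environ if base is None else base)
    # LC_ALL would override every individual LC_* category, so drop it.
    env.pop("LC_ALL", None)
    # LC_CTYPE selects the character encoding; LC_MESSAGES, LC_TIME, etc.
    # are left exactly as the user configured them.
    env["LC_CTYPE"] = ctype
    return env


user_env = {"LANG": "fr_FR.ISO-8859-1", "LC_MESSAGES": "fr_FR.ISO-8859-1"}
child_env = env_with_ctype("fr_FR.UTF-8", user_env)
assert child_env["LC_MESSAGES"] == "fr_FR.ISO-8859-1"   # messages untouched
assert child_env["LC_CTYPE"] == "fr_FR.UTF-8"           # encoding changed
```

The resulting dictionary can be passed as the env argument of subprocess.Popen().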
msg118392 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-11 21:16
issue9992.patch:
 - Remove PYTHONFSENCODING environment variable
 - Mac OS X: Use utf-8 to decode command line arguments
 - Fix issue #9992 (this issue): attached test, locale_fs_encoding.py, passes
 - Fix issue #9988
 - Fix issue #10014
 - Fix issue #10039

$ diffstat issue9992.patch 
 Doc/using/cmdline.rst       |   12 ------------
 Doc/whatsnew/3.2.rst        |    6 ------
 Lib/test/test_os.py         |   30 ------------------------------
 Lib/test/test_subprocess.py |    4 ----
 Lib/test/test_sys.py        |   29 -----------------------------
 Modules/main.c              |    3 ---
 Modules/python.c            |   10 +++++++++-
 Python/pythonrun.c          |   22 ++++++----------------
 8 files changed, 15 insertions(+), 101 deletions(-)

I like this kind of patch: it removes more code than it adds, but it fixes 4 different issues!

I didn't test the part of the patch specific to OSX (use utf8 to decode command line arguments).
msg118394 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-11 21:42
I think that issue9992.patch also fixes #4388 because it uses the same encoding (FS encoding, utf8) on OSX to encode and to decode command line arguments.
msg118591 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-13 22:18
I committed issue9992.patch as r85430 (remove PYTHONFSENCODING) + r85435 (OSX: decode command line arguments from utf-8).

These commits should fix this issue. Reopen the issue if you notice new problems, or if the problem is not fixed yet. I will watch Mac OS X buildbots, especially about r85435 ;-)
msg118607 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-14 00:34
test_undecodable_env() of test_subprocess fails. r85430 removes the following code which was added by Antoine to fix this issue.

# Force surrogate-escaping of \xFF in the child process;
# otherwise it can be decoded as-is if the default locale
# is latin-1.
env['PYTHONFSENCODING'] = 'ascii'

I think that we should accept that b'\xff' can be decoded as '\xff' and that's all.
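
The difference between the two outcomes can be reproduced directly with the codecs, independently of the test suite. In this sketch the function decode_child_env() is a hypothetical stand-in for what the child process does with the raw bytes; the encodings are passed explicitly instead of being taken from the locale.

```python
def decode_child_env(raw: bytes, encoding: str) -> str:
    """Decode raw environment bytes the way a child with this encoding would."""
    return raw.decode(encoding, "surrogateescape")


raw = b"abc\xff"

# With a Latin-1 locale, 0xFF is a valid character (U+00FF) and decodes as-is.
as_latin1 = decode_child_env(raw, "latin-1")
assert as_latin1 == "abc\xff"
assert ascii(as_latin1) == "'abc\\xff'"

# With an ASCII locale, 0xFF is undecodable and is escaped to U+DCFF.
as_ascii = decode_child_env(raw, "ascii")
assert as_ascii == "abc\udcff"
assert ascii(as_ascii) == "'abc\\udcff'"

# Either way, encoding with the same codec and handler gives back the bytes.
assert as_latin1.encode("latin-1", "surrogateescape") == raw
assert as_ascii.encode("ascii", "surrogateescape") == raw
```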
msg118633 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2010-10-14 08:23
> I think that we should accept that b'\xff' can be decoded as '\xff' and 
> that's all.

What do you plan to do to fix this failure?

======================================================================
FAIL: test_undecodable_env (test.test_subprocess.POSIXProcessTestCase)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home2/buildbot2/slave/3.x.loewis-parallel/build/Lib/test/test_subprocess.py", line 892, in test_undecodable_env
    self.assertEquals(stdout.decode('ascii'), ascii(value))
AssertionError: "'abc\\xff'" != "'abc\\udcff'"
- 'abc\xff'
?      ^
+ 'abc\udcff'
?      ^^^

http://www.python.org/dev/buildbot/builders/x86%20debian%20parallel%203.x/builds/502/steps/test/logs/stdio
msg118645 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-14 10:44
With r85466+r85467, test_undecodable_env (of test_subprocess) uses the C locale to get ASCII as the locale encoding (for the first test, on unicode environment variables). It should have the same effect as env['PYTHONFSENCODING'] = 'ascii': get ASCII as the filesystem encoding.
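
A hedged sketch of the idea, not the actual test code: the helpers c_locale_env() and child_fs_encoding() are invented for illustration. The first builds an environment that forces the C locale, the second asks a child interpreter which filesystem encoding it picked. The result is deliberately not asserted, since later Python versions may select UTF-8 even under the C locale.

```python
import os
import subprocess
import sys


def c_locale_env(base=None):
    """Copy an environment and force the C locale for every category."""
    env = dict(os.environ if base is None else base)
    for name in list(env):
        # Drop LANG and all per-category settings so nothing competes with LC_ALL.
        if name == "LANG" or name.startswith("LC_"):
            del env[name]
    env["LC_ALL"] = "C"
    return env


def child_fs_encoding(env):
    """Run a child interpreter and return the filesystem encoding it reports."""
    code = "import sys; print(sys.getfilesystemencoding())"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii").strip()


if __name__ == "__main__":
    print(child_fs_encoding(c_locale_env()))
```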
msg118647 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-14 10:56
Ok, the issue is not completely fixed ;-)

12:55 < py-bb> build #504 of x86 debian parallel 3.x is complete: Success [build successful]  Build details are at 
               http://www.python.org/dev/buildbot/all/builders/x86%20debian%20parallel%203.x/builds/504
msg118648 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-10-14 10:59
I tried... "the issue is *now* completely fixed"
History
Date User Action Args
2010-10-14 10:59:27 vstinner set messages: + msg118648
2010-10-14 10:56:13 vstinner set status: open -> closed; resolution: fixed; messages: + msg118647
2010-10-14 10:44:32 vstinner set messages: + msg118645
2010-10-14 08:23:19 pitrou set assignee: vstinner; messages: + msg118633
2010-10-14 00:34:39 vstinner set messages: + msg118607
2010-10-13 22:18:40 vstinner set messages: + msg118591
2010-10-13 18:03:22 eric.araujo set nosy: + eric.araujo; title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command-line arguments are not correctly decoded if locale and fileystem encodings are different
2010-10-11 21:42:15 vstinner set messages: + msg118394
2010-10-11 21:16:37 vstinner set files: + issue9992.patch; messages: + msg118392
2010-10-11 20:19:29 vstinner set messages: + msg118390
2010-10-11 20:17:26 vstinner set messages: + msg118389
2010-10-11 20:07:45 loewis set messages: + msg118388
2010-10-11 20:03:27 vstinner set messages: + msg118386
2010-10-11 19:58:45 vstinner set messages: + msg118385
2010-10-11 16:26:33 lemburg set messages: + msg118377
2010-10-11 16:08:55 pitrou set messages: + msg118375
2010-10-11 16:01:58 loewis set messages: + msg118374
2010-10-11 14:41:29 lemburg set messages: + msg118368; title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent
2010-10-11 14:38:38 loewis set messages: + msg118367; title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent
2010-10-11 14:32:32 ronaldoussoren set files: + unnamed; messages: + msg118365; title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent
2010-10-11 13:56:27 loewis set messages: + msg118360
2010-10-11 13:54:35 loewis set messages: + msg118359
2010-10-11 13:45:43 lemburg set messages: + msg118358
2010-10-11 12:15:20 vstinner set messages: + msg118352
2010-10-10 19:44:20 loewis set messages: + msg118344
2010-10-10 18:33:12 pitrou set messages: + msg118341
2010-10-10 18:23:20 loewis set messages: + msg118340
2010-10-10 17:59:23 vstinner set messages: + msg118339
2010-10-10 16:22:26 loewis set messages: + msg118337
2010-10-10 15:51:27 vstinner set messages: + msg118336
2010-10-09 17:32:45 pitrou set messages: + msg118279; title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent
2010-10-09 17:11:28 loewis set messages: + msg118278; title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent
2010-10-09 12:28:18 vstinner set messages: + msg118271
2010-10-09 12:07:16 pitrou set messages: + msg118270
2010-10-09 12:01:46 vstinner set messages: + msg118269; title: Command line arguments are not correctly decodedif locale and fileystem encodings aredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent
2010-10-09 11:49:51 loewis set messages: + msg118268
2010-10-09 10:52:24 pitrou set messages: + msg118264
2010-10-09 10:47:47 loewis set messages: + msg118263
2010-10-09 09:45:17 pitrou set messages: + msg118258
2010-10-09 09:14:48 loewis set messages: + msg118257
2010-10-08 20:30:35 pitrou set nosy: + pitrou; messages: + msg118225
2010-10-08 19:25:58 ixokai set nosy: + ixokai; messages: + msg118221
2010-10-02 12:14:56 vstinner set messages: + msg117871
2010-09-30 10:53:37 vstinner set messages: + msg117717
2010-09-30 10:43:17 vstinner set nosy: + loewis, ronaldoussoren; messages: + msg117716
2010-09-30 09:02:03 lemburg set messages: + msg117711; title: Command line arguments are not correctly decodedif locale and fileystem encodings aredifferent -> Command line arguments are not correctly decodedif locale and fileystem encodings aredifferent
2010-09-30 08:57:00 vstinner set messages: + msg117709; title: Command line arguments are not correctly decoded if locale and fileystem encodings are different -> Command line arguments are not correctly decodedif locale and fileystem encodings aredifferent
2010-09-30 07:55:20 lemburg set nosy: + lemburg; title: Command line arguments are not correctly decoded if locale and fileystem encodings are different -> Command line arguments are not correctly decoded if locale and fileystem encodings are different; messages: + msg117705
2010-09-29 23:45:58 vstinner set files: + cmdline_encoding-2.patch; keywords: + patch; messages: + msg117676
2010-09-29 23:11:21 pjenvey set nosy: + pjenvey
2010-09-29 22:36:21 vstinner create