Issue9992
Created on 2010-09-29 22:36 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.
Files
File name | Uploaded
---|---
locale_fs_encoding.py | vstinner, 2010-09-29 22:36
cmdline_encoding-2.patch | vstinner, 2010-09-29 23:45
unnamed | ronaldoussoren, 2010-10-11 14:32
issue9992.patch | vstinner, 2010-10-11 21:16
Messages (49)
msg117669 | Author: STINNER Victor (vstinner) * |
Date: 2010-09-29 22:36 | |
On UNIX/BSD systems, Python decodes arguments with the locale encoding, whereas subprocess encodes arguments with the filesystem encoding. If the two encodings differ, we have a problem.

There was already issue #4388, but it was closed because it was specific to old versions of Mac OS X. With the PYTHONFSENCODING environment variable (added in Python 3.2), it is easy to trigger this issue: run Python with a filesystem encoding different from the locale encoding. The attached script demonstrates the bug.

--

I see two possible encodings to encode and decode command line arguments (with the surrogateescape error handler):

(a) filesystem encoding
(b) locale encoding

Decoding the Python command line arguments is one of the first operations executed when running Python, in the main() function. We don't have the import machinery or the codec API available at this point, so I don't see how we can use the filesystem encoding here. Read issue #9630 to see how complex it is to use the filesystem encoding when initializing Python.

Using the locale encoding is easier because we already have the _Py_char2wchar() and _Py_wchar2char() functions to decode/encode with the locale encoding and the surrogateescape error handler. These functions use the wchar_t* type, which is less practical than PyUnicodeObject*, but it is an advantage because the wchar_t* type doesn't need Python to be completely initialized (whereas some PyUnicode methods load modules, e.g. encode and decode).

In #8775, I proposed creating a new variable to store the "command line encoding": sys.getcmdlineencoding(). But this issue was closed because there was only one use case: #4388 (which was closed but not fixed).

I don't know, or don't really care, how sys.getcmdlineencoding() should be initialized. The important point is that we have to use the same encoding to decode and encode command line arguments.

--

I don't really know if using another encoding is the right solution. The problem is maybe that the filesystem encoding should not be controllable by the user?

And what about environment variables: should we continue to encode and decode them with the filesystem encoding, or should we use the new "command line encoding"? |
|||
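The mismatch described in msg117669 can be sketched in pure Python. This is a minimal illustration, not the attached locale_fs_encoding.py script: the helper names `decode_arg` and `encode_arg` and the sample file name are hypothetical, and the two codecs are passed explicitly to stand in for the locale and filesystem encodings.

```python
def decode_arg(raw: bytes, encoding: str) -> str:
    """Decode one raw argv entry, as done at interpreter startup."""
    return raw.decode(encoding, "surrogateescape")


def encode_arg(text: str, encoding: str) -> bytes:
    """Encode one argument, as done before handing it to a child process."""
    return text.encode(encoding, "surrogateescape")


raw = "xxx\u00e9.txt".encode("utf-8")  # bytes as typed in a UTF-8 terminal

# Same codec on both sides: the bytes survive the round trip.
same = encode_arg(decode_arg(raw, "utf-8"), "utf-8")

# Decoded with the "locale" codec (utf-8) but encoded with a different
# "filesystem" codec (ascii): U+00E9 is a real character, not an escaped
# byte, so the ascii codec cannot encode it.
try:
    encode_arg(decode_arg(raw, "utf-8"), "ascii")
    mismatch_error = None
except UnicodeEncodeError as exc:
    mismatch_error = exc

print(same == raw, type(mismatch_error).__name__)
```

The surrogateescape handler only helps with bytes that could not be decoded in the first place; it does not rescue a character that was decoded correctly and then meets a narrower codec.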
msg117676 | Author: STINNER Victor (vstinner) * |
Date: 2010-09-29 23:45 | |
[cmdline_encoding-2.patch] Patch to use the locale encoding to decode and encode command line arguments. Remarks about the patch:
- failing to get the locale encoding (very unlikely) is a fatal error
- TODO: in initfsencoding(), Py_FileSystemDefaultEncoding should reuse Py_CommandLineEncoding instead of calling get_codeset() again
- subprocess encodes arguments to the command line encoding for _posixsubprocess and Python implementations
- _posixsubprocess doesn't support unicode command line arguments anymore

The patch is an updated version of the patch attached to #8775. With the patch, the locale_fs_encoding.py test script passes. |
|||
msg117705 | Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-09-30 07:55 | |
STINNER Victor wrote:
> I don't really know if using another encoding is the right solution. The problem is maybe that the filesystem encoding should not be controllable by the user?
>
> And what about environment variables: should we continue to encode and decode them with the filesystem encoding, or should we use the new "command line encoding"?

The problem with command line arguments is that they don't necessarily have just one encoding (just like env vars may well use more than one encoding) on Unix platforms.

When using path and file names on the command line they will likely use the file system encoding. When passing in configuration variables, the arguments will likely use the current locale settings.

The use of wchar C lib functions is not ideal for parsing the command line arguments, since this always uses the locale settings. Creating a copy of argv, as Python3 does, is also not ideal, since manipulating argv to change the OS process ps-output is common on Unix, and there is currently no access (AFAIK) provided to the original argv array passed to Python in Python3.

I think we should use a similar approach as the one for os.environ here, where we keep the original bytes buffers around and have a second copy with str objects which may not necessarily be complete (e.g. when decoding a string fails).

Unfortunately, the use of wchar_t for command line arguments has already spread throughout the code base, so I see little chance of fixing this use. What we could do is at least make the original bytes version of argv available to Python, so that decoding errors can be worked around in the application (just like we have for os.environ with os.environb). |
|||
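The "keep the original bytes around" idea from msg117705 might look roughly like the sketch below. Note that sys.argvb is only a proposal in this thread, so nothing here uses it: `build_argv` is a hypothetical helper, and the raw arguments and the startup encoding are made-up inputs.

```python
def build_argv(raw_args, encoding):
    """Return (argvb, argv): the untouched bytes and a decoded str copy."""
    argvb = list(raw_args)
    argv = [arg.decode(encoding, "surrogateescape") for arg in argvb]
    return argvb, argv


# Hypothetical input: a Latin-1 encoded file name, decoded at startup as UTF-8.
argvb, argv = build_argv([b"prog", "caf\u00e9.txt".encode("latin-1")], "utf-8")

# The default decoding leaves an escaped byte (a lone surrogate) in the str...
has_escape = "\udce9" in argv[1]

# ...but an application that knows better can decode the original bytes itself.
fixed = argvb[1].decode("latin-1")

print(has_escape, fixed == "caf\u00e9.txt")
```

This mirrors the os.environ / os.environb pairing: the str view is a convenience, and the bytes view stays authoritative.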
msg117709 | Author: STINNER Victor (vstinner) * |
Date: 2010-09-30 08:57 | |
> The problem with command line arguments is that they don't necessarily
> have just one encoding (just like env vars may well use more than
> one encoding) on Unix platforms.

The issue #8776 proposes the creation of sys.argv.

> When using path and file names on the command line they will likely
> use the file system encoding. When passing in configuration variables,
> the arguments will likely use the current locale settings.

Ok, and? We have to pick one and use it. We cannot guess the encoding of each argument, nor change sys.argv to use bytes. (And the creation of sys.argvb will not solve this issue.)

I still think that using the filesystem encoding is not possible for technical reasons (it might be possible, but it will be very hard), whereas I attached a working patch to use the locale encoding. |
|||
msg117711 | Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-09-30 09:02 | |
STINNER Victor wrote:
>> The problem with command line arguments is that they don't necessarily
>> have just one encoding (just like env vars may well use more than
>> one encoding) on Unix platforms.
>
> The issue #8776 proposes the creation of sys.argv.

Right, I think you meant sys.argvb and yes, I think it's a good idea.

>> When using path and file names on the command line they will likely
>> use the file system encoding. When passing in configuration variables,
>> the arguments will likely use the current locale settings.
>
> Ok, and? We have to pick up one and use it. We cannot guess the encoding of
> each argument, nor change sys.argv to use bytes. (And the creation sys.argvb
> will not solve this issue.)

Sure and using the locale setting is fine. The point is that we pick one, but keep the original data around for the application to use in case it knows better, so this will solve the problem.

> I still think that using the filesystem encoding is not possible for technical
> reasons (it might be possible, but it will be very hard), whereas I attached a
> working patch to use the locale encoding. |
|||
msg117716 | Author: STINNER Victor (vstinner) * |
Date: 2010-09-30 10:43 | |
Extract of an interesting message (msg111432) of #8775 (issue specific to Mac OS X): << A system where the filesystem encoding doesn't match the locale encoding is hard to get right. While it would be possible to add sys.cmdlineencoding that doesn't actually solve the semantic problem because external tools might not cooperate. That is, most system tools seem to work with bytes internally and do not treat arguments as text encoded in the locale encoding that should be re-encoded in the filesystem encoding before passing them to the C APIs. That is, when calling "ls somefile" the "ls" command will pass the bytes in argv[1] to the POSIX routines for getting file information without trying to reencode. >> |
|||
msg117717 | Author: STINNER Victor (vstinner) * |
Date: 2010-09-30 10:53 | |
> A system where the filesystem encoding doesn't match the locale
> encoding is hard to get right.

Mmmh. The problem is maybe that the new PYTHONFSENCODING environment variable (added by #8622) introduced a horrible inconsistency between Python and other applications. Other applications ignore PYTHONFSENCODING.

The simplest solution to fix this issue is to remove the PYTHONFSENCODING variable. In this case, the user has to set LANG, LC_ALL or LC_CTYPE, instead of PYTHONFSENCODING, to set Python's filesystem encoding. |
|||
msg117871 | Author: STINNER Victor (vstinner) * |
Date: 2010-10-02 12:14 | |
See also #10014: sys.path[0] is decoded from the locale encoding instead of the filesystem encoding. |
|||
msg118221 | Author: Stephen Hansen (ixokai) |
Date: 2010-10-08 19:25 | |
This issue seems to be the cause of issue4388 -- and cmdline_encoding-2.patch fixes it, fwiw. |
|||
msg118225 | Author: Antoine Pitrou (pitrou) * |
Date: 2010-10-08 20:30 | |
> The important point is that we have to use the same encoding to decode
> and encode command line arguments.

I don't think I agree with this. It's only important when you run a Python interpreter using subprocess, but the point of using subprocess is to run something *else* than Python. This something else generally expects filenames in their correct bytes representation, not in a mojibaked version hand-tuned for Python. |
|||
msg118257 | Author: Martin v. Löwis (loewis) * |
Date: 2010-10-09 09:14 | |
Antoine: Python cannot possibly know whether a command line argument is meant as a file name or as some other text, and what encoding the receiving application will apply to it (if any).

I agree it's best to have all "IO" encodings being the same in Python, but perhaps there are use cases where you have to use a different encoding for file names, so I don't think it is necessary to rip this feature out.

So perhaps it would be best if Python had two external default encodings: the IO one (command line arguments, environment variables, text files), and the file name encoding (defaulting to the IO encoding if not set). If they differ and you get mojibake in subprocesses: bad luck - it's exactly what you asked for. The fsname encoding should *only* be used for file names, not for command line arguments in subprocess.

If we have tests that rely on the fsname encoding and the IO encoding being the same, then those tests should get skipped if the encodings are actually different.

The tricky part remains determining the IO encoding. If PYTHONIOENCODING can override the locale's encoding, then the tricky question is how command line arguments should get decoded in the absence of the codec machinery on Unix. They must get decoded for uniformity with Windows (which receives the command line as a Unicode string already). That problem may be the reason why we need *three* encodings (as it is now), the IOENCODING only applying to file streams. |
|||
msg118258 | Author: Antoine Pitrou (pitrou) * |
Date: 2010-10-09 09:45 | |
> Antoine: Python cannot possibly know whether a command line argument
> is meant as a file name or as some other text, and what encoding the
> receiving application will apply to it (if any).

I understand. But practicality seems to suggest that, most of the time, non-ASCII arguments on a command line will be filenames. We should probably try to favour the common case (barring implementation issues, though, and it seems using the filesystem encoding in the interpreter bootup phase is not easy).

> So perhaps it would be best if Python had two external default
> encodings: the IO one (command line arguments, environment variables,
> text files), and the file name encoding (defaulting to the IO encoding
> if not set).

Looking at environment variables here, they seem to be either:
- integers (pids, port numbers...)
- conventional variables (such as "fr_FR.utf8")
- usernames
- file paths

The most likely values to be non-ASCII are, therefore, file paths. So it would make sense to also use the filesystem encoding for environment variables (so as to satisfy the common case).

As for text files, I agree it's different, and the encoding choice routine in TextIOWrapper already favours locale.getpreferredencoding() and ignores the filesystem encoding.

> If we have tests that rely on the fsname encoding and the IO encoding
> being the same, then those tests should get skipped if the encodings
> are actually different.

Agreed, but only when this discussion has come to a conclusion :) |
|||
msg118263 | Author: Martin v. Löwis (loewis) * |
Date: 2010-10-09 10:47 | |
> The most likely values to be non-ASCII are, therefore, file paths. So it
> would make sense to also use the filesystem encoding for environment
> variables (so as to satisfy the common case).

-1. Environment variables are typically set in a text editor or on the command line, so they will typically have the locale's encoding. Applications that wish to support the case that fsencoding != locale can recode the file names if desired, or use environb in the first place.

If the mere existence of the fsname encoding leads to that much confusion, I think I also support its removal. |
|||
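The two workarounds mentioned in msg118263 (recoding a value, or reading os.environb directly) can be sketched as follows. The `recode` helper and the sample value are hypothetical; only `os.environb` and `os.supports_bytes_environ` are real standard library names, and the bytes environment is not available on every platform.

```python
import os


def recode(value: str, decoded_with: str, actual: str) -> str:
    """Undo a wrong decoding: recover the bytes, then decode them again."""
    raw = value.encode(decoded_with, "surrogateescape")
    return raw.decode(actual, "surrogateescape")


# Hypothetical case: a UTF-8 file name that was decoded as Latin-1.
mojibake = "caf\u00e9".encode("utf-8").decode("latin-1", "surrogateescape")
fixed = recode(mojibake, "latin-1", "utf-8")
print(fixed == "caf\u00e9")

# Alternative: skip the str copy and read the raw bytes directly.
# os.environb only exists where os.supports_bytes_environ is true.
if os.supports_bytes_environ:
    raw_path = os.environb.get(b"PATH", b"")
    print(type(raw_path).__name__)
```

Recoding requires knowing which codec was used for the first decoding; reading the bytes avoids that dependency entirely.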
msg118264 | Author: Antoine Pitrou (pitrou) * |
Date: 2010-10-09 10:52 | |
> -1. Environment variables are typically set in a text editor or on
> the command line, so they will typically have the locale's encoding.

Fair enough.

> If the mere existence of the fsname encoding leads to that much
> confusion, I think I also support its removal.

Well, the fsname encoding has a hardwired value under OS X (regardless of the locale), which kind of justifies its existence, no? |
|||
msg118268 | Author: Martin v. Löwis (loewis) * |
Date: 2010-10-09 11:49 | |
>> If the mere existence of the fsname encoding leads to that much
>> confusion, I think I also support its removal.
>
> Well, the fsname encoding has a hardwired value under OS X (regardless
> of the locale), which kind of justifies its existence, no?

Perhaps. We could also declare that command line arguments and environment variables are always UTF-8-encoded on OSX (which I think would be fairly accurate), and stop relying on the locale to determine encodings on OSX (which Apple didn't like as a mechanism, anyway).

I think OSX converges faster to UTF-8 than the other Unices. |
|||
msg118269 | Author: STINNER Victor (vstinner) * |
Date: 2010-10-09 12:01 | |
> Perhaps. We could also declare that command line arguments and
> environment variables are always UTF-8-encoded on OSX (which I think
> would be fairly accurate)

Python uses the filesystem encoding to encode/decode environment variables, and on OSX the fs encoding is utf-8. For the command line, it would mean that we introduced a new encoding: "command line encoding", which will be utf-8 on OSX. |
|||
msg118270 | Author: Antoine Pitrou (pitrou) * |
Date: 2010-10-09 12:07 | |
> For the command line, it would mean that we
> introduced a new encoding: "command line encoding", which will be utf-8 on
> OSX.

Or more generally "environment encoding", if it's also used for env vars. This could solve the subprocess issue neatly. |
|||
msg118271 | Author: STINNER Victor (vstinner) * |
Date: 2010-10-09 12:28 | |
> So perhaps it would be best if Python had two external default encodings:
> the IO one (command line arguments, environment variables, text files),
> and the file name encoding (defaulting to the IO encoding if not set)

Hum, I prefer to consider the FS encoding as an *internal* encoding. ... But it's not completely true: it is used for the environment variables.

Let's consider that the FS encoding is only an internal encoding. We need 3 encodings:
- FS encoding: any operation on the filesystem
- IO encoding: text file contents (including stdin, stdout, stderr, which are text files)
- a 3rd encoding (let's call it the "command line encoding"): used for the command line arguments and the environment variables

For technical reasons ("bootstrap": Python initialization issues), I would like the 3rd encoding to be set using the locale encoding. The user can only control it using the classical locale variables (LC_ALL, LC_CTYPE, LANG). |
|||
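To see how the three encodings from msg118271 line up on a given machine, something like the following can be run. It is only an inspection aid: there is no real "command line encoding" API, so the locale's preferred encoding is used as a stand-in for it, following the proposal above, and `report_encodings` is a hypothetical helper name.

```python
import locale
import sys


def report_encodings():
    """Collect the encodings discussed in this thread for the running process."""
    return {
        # Used for file names and other filesystem operations.
        "filesystem": sys.getfilesystemencoding(),
        # Used for text written to stdout (may be missing on a replaced stream).
        "io": getattr(sys.stdout, "encoding", None),
        # Stand-in for the proposed "command line encoding" (an assumption).
        "locale": locale.getpreferredencoding(False),
    }


for name, value in report_encodings().items():
    print(f"{name}: {value}")
```

On most current systems all three report the same codec; the discussion here is about the cases where they do not.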
msg118278 | Author: Martin v. Löwis (loewis) * |
Date: 2010-10-09 17:11 | |
On 09.10.2010 14:07, Antoine Pitrou wrote:
>> For the command line, it would mean that we
>> introduced a new encoding: "command line encoding", which will be utf-8 on
>> OSX.
>
> Or more generally "environment encoding", if it's also used for env
> vars. This could solve the subprocess issue neatly.

Please no. We run into problems because we have two inconsistent encodings, and now you propose to introduce another one, allowing for even more inconsistencies??? |
|||
msg118279 | Author: Antoine Pitrou (pitrou) * |
Date: 2010-10-09 17:32 | |
> Please no. We run into problems because we have two inconsistent
> encodings, and now you propose to introduce another one, allowing
> for even more inconsistencies???

It would not really be a "third encoding", since it would replace the locale encoding for all practical purposes, if I understand Victor's proposal correctly. |
|||
msg118336 | Author: STINNER Victor (vstinner) * |
Date: 2010-10-10 15:51 | |
> We run into problems because we have two inconsistent
> encodings, ...

What? No. We have problems because we don't use the same encoding to decode and to encode the same data type. It's not a problem to use a different encoding for each data type (stdout, filenames, environment variables, ...).

--

About the 3rd encoding: it will be just the locale encoding. Using the locale encoding to encode/decode command line arguments and environment variables is completely compatible with Python 3.1, because Python 3.1 initializes the filesystem encoding with the locale encoding. Using the locale encoding helps interoperability because other programs use the same encoding.

Mac OS X is a special case. The filesystem encoding is utf-8 on this OS, whereas the locale encoding depends on the LANG variable. If I understood MvL's proposal correctly, we should not rely on the locale on Mac OS X. So the "3rd encoding" and the filesystem encodings should be hardcoded to utf-8?

--

The "third encoding" is no longer controllable by a special environment variable, only by the classic locale environment variables (LC_ALL, LC_CTYPE, LANG). Is it a problem? I remember a comment from MAL saying that it may be a problem for CGI for the environment variables because some (all?) variables are not encoded with the locale encoding (but the HTML encoding?). I don't know if Python should work around CGI-specific issues. In Python 3.2, we now have os.environb: it's now possible to use a different encoding for each variable. |
|||
msg118337 | Author: Martin v. Löwis (loewis) * |
Date: 2010-10-10 16:22 | |
On 10.10.2010 17:51, STINNER Victor wrote:
>> We run into problems because we have two inconsistent encodings,
>> ...
>
> What? No. We have problems because we don't use the same encoding to
> decode and to encode the same data type. It's not a problem to use a
> different encoding for each data type (stdout, filenames, environment
> variables, ...).

This is exactly the very problem that we face. In particular, the question is what encoding to use if something is *both* a filename and an environment variable value, or both a filename and a command line argument.

> Mac OS X is a special case. Filesystem encoding is utf-8 on this OS,
> whereas the locale encoding depends on LANG variable. If I understood
> MvL proposition correctly, we should not rely on the locale on Mac OS
> X.

"Not rely on" is perhaps a bit harsh. It's not clear (to me) under what conditions the locale's encoding will be more correct than just assuming UTF-8 - there may actually be use cases for it. However, with the surrogate escapes, we could just always decode using UTF-8, and leave any mojibake problems that may arise from this to the application. I do think that these problems will be rare, since a) many OSX installations use UTF-8, anyway, and b) those that don't likely experience the proper round-tripping of the escape mechanism.

> So the "3rd encoding" and the filesystem encodings should be
> hardcoded to utf-8?

That's an option to consider, yes - I'd like an OSX expert to comment.

> The "third encoding" is no more controllable by a special environment
> variable, only by classic locale environment variables (LC_ALL,
> LC_CTYPE, LANG). Is it a problem? I remember a comment from MAL
> saying that it may be a problem for CGI for the environment variables
> because some (all?) variables are not encoded with the locale
> encoding (but the HTML encoding?). I don't know if Python should
> workaround CGI specific issues. In Python 3.2, we have now
> os.environb: it's now possible to use a different encoding for each
> variable.

I think these problems are sufficiently resolved now: either by PEP 3333, PEP 444, PEP 383, or os.environb.

I think you misunderstood MAL's comment, though: the environment variables are not encoded in *any* specific encoding. Instead, they are copied literally from the HTTP request, using whatever bytes the browser originally put in there - which may or may not have followed a particular encoding. HTTP is silent on this most of the time, and HTML is out of scope. |
|||
msg118339 | Author: STINNER Victor (vstinner) * |
Date: 2010-10-10 17:59 | |
> > What? No. We have problems because we don't use the same encoding to
> > decode and to encode the same data type. It's not a problem to use a
> > different encoding for each data type (stdout, filenames, environment
> > variables, ...).
>
> This is exactly the very problem that we face. In particular, the
> question is what encoding to use if something is *both* a filename
> and an environment variable value, or both a filename and a command
> line argument.

The question is: what is the best default encoding for a specific data type? There is no perfect answer (well, except maybe using byte strings :-)). Each solution has its own use cases and disadvantages.

If an application knows the exact encoding of some data, and it is not the default encoding, it can still re-decode the data. Using os.environb, it's a little bit better: the application just has to decode (it doesn't have to encode, or to know which encoding was used to decode the data initially). For sys.argv, I still want to create sys.argvb (bytes version) ;-)

For the command line arguments and environment variables, we don't have a lot of choices: locale or filesystem encodings. So Antoine and Martin: which encoding do you prefer?

We should maybe try to find some use cases. Here is a dummy script bla.py:
---
import sys
print(sys.argv)
try:
    open(sys.argv[1]).close()
except Exception as err:
    print("open error: %s" % err)
else:
    print("open ok")
---

Locale encoding = FS encoding = utf-8:

$ ./python bla.py xxxé.txt
['bla.py', 'xxxé.txt']
open ok

Locale encoding = utf8, FS encoding = ascii:

$ PYTHONFSENCODING=ascii ./python bla.py xxxé.txt
['bla.py', 'xxxé.txt']
open error: 'ascii' codec can't encode character '\xe9' ...

The filename is displayed correctly, but we are unable to open the file if PYTHONFSENCODING is used :-/ Should the filename be displayed differently if PYTHONFSENCODING is used?

> I think these problems are sufficiently resolved now: either by
> PEP 3333, PEP 444, PEP 383, or os.environb.

Ok, cool :-)

> I think you misunderstood MAL's comment, though: the environment
> variables are not encoded in *any* specific encoding. Instead,
> they are copied literally from the HTTP request, using whatever
> bytes the browser originally put in there - which may or may
> not have followed a particular encoding. HTTP is silent on
> this most of the time, and HTML is out of scope.

Ah yes, thanks for your explanation. I was unable to find his comment. |
|||
msg118340 | Author: Martin v. Löwis (loewis) * |
Date: 2010-10-10 18:23 | |
> For the command line arguments and environment variables, we don't have a lot
> of choices: locale or filesystem encodings. So Antoine and Martin: which
> encoding do you prefer?

I still propose to drop the fsname encoding. Then this question goes away. |
|||
msg118341 | Author: Antoine Pitrou (pitrou) * |
Date: 2010-10-10 18:33 | |
On Sunday 10 October 2010 at 18:23 +0000, Martin v. Löwis wrote:
> > For the command line arguments and environment variables, we don't have a lot
> > of choices: locale or filesystem encodings. So Antoine and Martin: which
> > encoding do you prefer?
>
> I still propose to drop the fsname encoding. Then this question goes away.

I don't know what you mean by dropping, since OS X by construction needs a filesystem encoding (utf-8) different from the locale encoding; and Windows hardwires the decoding/encoding of bytes filenames using mbcs regardless of the current codepage, IIRC.

So do you just mean the filesystem encoding should be hidden from the user? What would be the benefit? |
|||
msg118344 | Author: Martin v. Löwis (loewis) * |
Date: 2010-10-10 19:44 | |
> I don't know what you mean by dropping, since OS X by construction needs
> a filesystem encoding (utf-8) different from the locale encoding;

See above. I propose to stop using the locale encoding for command line arguments and environment variables on OSX, and use UTF-8 instead.

> and
> Windows hardwires the decoding/encoding of bytes filenames using mbcs
> regardless of the current codepage, IIRC.

I wish byte-oriented file names could be dropped on Windows. But that is probably too incompatible.

> So do you just mean the filesystem encoding should be hidden from the
> user? What would be the benefit?

That the very issue that this bug report (re-read the title) is about would go away. |
|||
msg118352 | Author: STINNER Victor (vstinner) * |
Date: 2010-10-11 12:15 | |
> > ... So Antoine and Martin: which encoding do you prefer?
>
> I still propose to drop the fsname encoding. Then this question goes away.

You mean that we should use the following encoding for the command line arguments, environment variables and all filenames/paths:
- Mac OS X: utf-8
- Windows: unicode for command line/env, mbcs to decode filenames
- other OSes: locale encoding

To do that, we have to:
- "other OSes": delete the PYTHONFSENCODING variable
- Mac OS X: use utf-8 to decode the command line arguments (we can use PyUnicode_DecodeUTF8()+PyUnicode_AsWideCharString() before Python is initialized)

On "other OSes", we continue to use the FS encoding to encode command line/env vars, because the FS encoding will always be the locale encoding. And it's more practical to use sys.getfilesystemencoding() than mbstowcs(), wcstombs(), _Py_wchar2char(), _Py_char2wchar(), etc. because the FS encoding doesn't depend on the current locale, and it uses Python codecs which support more error handlers.

I like this solution because it doesn't change a lot of things. I agree to drop PYTHONFSENCODING because it looks like PYTHONFSENCODING introduced more inconsistencies than it solved. |
|||
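The per-platform rule summarized in msg118352 can be written down as a small lookup. This is a sketch of the proposal as stated in that message, not of what any Python release does: `proposed_os_encoding` is a hypothetical function, and the platform strings follow the values `sys.platform` uses ("darwin", "win32").

```python
import locale
import sys


def proposed_os_encoding(platform: str, locale_encoding: str) -> str:
    """Encoding for bytes file names under the proposal in msg118352."""
    if platform == "darwin":
        # Mac OS X: always utf-8, regardless of the locale.
        return "utf-8"
    if platform == "win32":
        # Windows: mbcs for bytes file names (command line/env are unicode).
        return "mbcs"
    # Other OSes: whatever the locale says.
    return locale_encoding


print(proposed_os_encoding(sys.platform, locale.getpreferredencoding(False)))
```

With this rule there is no separate knob for the filesystem encoding, which is why the proposal goes together with dropping PYTHONFSENCODING.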
msg118358 | Author: Marc-Andre Lemburg (lemburg) * |
Date: 2010-10-11 13:45 | |
STINNER Victor wrote:
> I like this solution because it doesn't change a lot of things. I agree to
> drop PYTHONFSENCODING because it looks like PYTHONFSENCODING introduced more
> inconsistencies than it solved.

If you remove the PYTHONFSENCODING, then we have to reconsider removal of sys.setfilesystemencoding(). The main argument for removal of the sys function was having the environment variable.

If you remove both, Python will get very poor grades for OS interoperability on platforms that often deal with multiple different encodings for file names.

I am repeating myself, but please keep in mind that the locale is an application scope setting. It doesn't have anything to do with what's actually stored in file systems or what the OS uses internally. Python therefore has to provide a way to customize the file system encoding and allow to override the locale guessing that's currently happening.

You can't just tell people to go with whatever encoding setup you prefer to make Python's guessing easier or more correct. Python has to adapt to what the users actually use, not the other way around. Where that's not easily possible, there have to be ways to explicitly tell Python what to use... telling the user to adjust his or her locale settings just to be able to run Python is not an option.

The world is still moving towards Unicode - it's not 100% there yet. |
|||
msg118359 | Author: Martin v. Löwis (loewis) * |
Date: 2010-10-11 13:54 | |
> You mean that we should use the following encoding for the command line
> arguments, environment variables and all filenames/paths:
> - Mac OS X: utf-8
> - Windows: unicode for command line/env, mbcs to decode filenames

No: unicode for filenames also.

> - others OSes: locale encoding

Yes, that is my proposal. |
|||
msg118360 | Author: Martin v. Löwis (loewis) * |
Date: 2010-10-11 13:56 | |
> If you remove both, Python will get very poor grades for OS
> interoperability on platforms that often deal with multiple
> different encodings for file names.

Why that? It will work very well in such a setting, much better than, say, Java. |
|||
msg118365 - (view) | Author: Ronald Oussoren (ronaldoussoren) * ![]() |
Date: 2010-10-11 14:32 | |
On 09 Oct, 2010, at 02:07 PM, Antoine Pitrou wrote: > For the command line, it would mean that we > introduced a new encoding: "command line encoding", which will be utf-8 on > OSX. Or more generally "environment encoding", if it's also used for env vars. This could solve the subprocess issue neatly. Note that the command-line and environment encoding on OSX is generally UTF-8, even if that is not always reflected in the locale settings. On recent OSX releases LANG will be set to a UTF-8 aware locale ("en_US.UTF-8" on my machine) when you start a shell using Terminal.app. The correct locale environment variables are AFAIK not set in two important situations: on OSX 10.4 and when running code from an application bundle. In both cases the environment/command-line encoding should be treated as UTF-8. There is one reason for not wanting to assume that the encoding is always UTF-8: the user might access the system from a non-UTF8 terminal (such as when logging in with an SSH session from a system not using UTF-8, or using an alternate terminal application). IMHO these are minor enough use-cases that we could just enforce that the encoding is UTF-8 on OSX. That would ensure that the filesystem encoding and environment/command-line encoding are consistent and we'd no longer run into the problem that triggered this issue. Ronald |
|||
msg118367 - (view) | Author: Martin v. Löwis (loewis) * ![]() |
Date: 2010-10-11 14:38 | |
> There is one reason for not wanting to assume that the encoding is > always UTF-8: the user might access the system from a non-UTF8 > terminal (such as when logging in with an SSH session from a system > not using UTF-8, or using an alternate terminal application). IMHO > these are minor enough use-cases that we could just enforce that the > encoding is UTF-8 on OSX. Ok, that's enough of an expert statement for me to settle the OSX case: we will always assume that environment data is UTF-8 on OSX (leaving the rest to the surrogate escape handler). |
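To illustrate what "leaving the rest to the surrogate escape handler" means: bytes that are not valid UTF-8 are mapped to lone surrogates on decoding and restored on encoding, so the round trip is lossless. A minimal sketch; the helper names are illustrative only.

```python
def decode_os_bytes(raw):
    # Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF.
    return raw.decode("utf-8", "surrogateescape")


def encode_os_str(text):
    # Lone surrogates produced above are turned back into the original bytes.
    return text.encode("utf-8", "surrogateescape")


raw = b"caf\xc3\xa9-\xff"   # valid UTF-8 for "café-" followed by a stray 0xFF
text = decode_os_bytes(raw)
restored = encode_os_str(text)
```

Here `text` is `'caf\xe9-\udcff'` and `restored` equals `raw`; encoding `text` with the strict error handler would raise `UnicodeEncodeError` instead.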
|||
msg118368 - (view) | Author: Marc-Andre Lemburg (lemburg) * ![]() |
Date: 2010-10-11 14:41 | |
Martin v. Löwis wrote: > > Martin v. Löwis <martin@v.loewis.de> added the comment: > >> If you remove both, Python will get very poor grades for OS >> interoperability on platforms that often deal with multiple >> different encodings for file names. > > Why that? It will work very well in such a setting, much better > than, say, Java. Well, Java pretty much fails completely in this respect, so being better than Java is not exactly the benchmark I had in mind :-) I think the proper benchmark would be a Python2 application that has no problems with these things, since file names are just bytes that refer to files on the disk, with no associated encoding - at least on Unix and related platforms. Being pedantic about forcing some encoding onto things that don't have an encoding won't really work out in practice. Dealing with file names, OS environments, pipes and sockets is dirty work, so I think we should go with the 80-20 approach in making 80% easy and 20% harder, but still possible. |
|||
msg118374 - (view) | Author: Martin v. Löwis (loewis) * ![]() |
Date: 2010-10-11 16:01 | |
> Being pedantic about forcing some encoding onto things that don't > have an encoding won't really work out in practice. Dealing with > file names, OS environments, pipes and sockets is dirty work, so > I think we should go with the 80-20 approach in making 80% easy > and 20% harder, but still possible. Unix applications can always use the byte-oriented file name APIs if they need to. Then you are back to the state that things have in Python 2. No need to have a user-tunable file system encoding there. However, I completely fail to see the advantage that the PYTHONFSENCODING variable has over the LANG variable. If it's possible to set PYTHONFSENCODING in some application, it surely is also possible to set LANG (or LC_CTYPE), no? Setting the latter also gives you the advantage that environment variables and command line arguments use the same encoding as file names. |
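As a sketch of the byte-oriented file name APIs mentioned here: passing a bytes path to `os.listdir()` or `open()` makes Python return and accept raw bytes, with no decoding step at all. The file name and the helper `list_names_as_bytes` below are made up for the example.

```python
import os
import tempfile


def list_names_as_bytes(directory):
    # A bytes argument makes os.listdir() return bytes names, undecoded.
    return sorted(os.listdir(os.fsencode(directory)))


with tempfile.TemporaryDirectory() as tmp:
    # open() accepts a bytes path as well.
    with open(os.path.join(os.fsencode(tmp), b"data.txt"), "wb") as f:
        f.write(b"x")
    names = list_names_as_bytes(tmp)
```

`names` is `[b'data.txt']`; an application that needs exact control over file names can stay in bytes throughout.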
|||
msg118375 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2010-10-11 16:08 | |
> However, I completely fail to see the advantage that the > PYTHONFSENCODING variable has over the LANG variable. If it's > possible to set PYTHONFSENCODING in some application, it surely > is also possible to set LANG (or LC_CTYPE), no? Setting the > latter also gives you the advantage that environment variables > and command line arguments use the same encoding as file names. I guess LANG and LC_CTYPE can be used for other purposes such as internationalization. |
|||
msg118377 - (view) | Author: Marc-Andre Lemburg (lemburg) * ![]() |
Date: 2010-10-11 16:26 | |
Martin v. Löwis wrote: > > Martin v. Löwis <martin@v.loewis.de> added the comment: > >> Being pedantic about forcing some encoding onto things that don't >> have an encoding won't really work out in practice. Dealing with >> file names, OS environments, pipes and sockets is dirty work, so >> I think we should go with the 80-20 approach in making 80% easy >> and 20% harder, but still possible. > > Unix applications can always use the byte-oriented file name APIs > if they need to. Then you are back to the state that things have > in Python 2. No need to have a user-tunable file system encoding > there. Right, and if you take the position of refusing to guess, which we usually do in Python, then interfacing to file names using bytes would be the appropriate way to handle the situation. However, since Python3 has chosen to regard file names as text regardless of platform, we're now in the situation that we have to come up with some educated guess on the encoding. > However, I completely fail to see the advantage that the > PYTHONFSENCODING variable has over the LANG variable. If it's > possible to set PYTHONFSENCODING in some application, it surely > is also possible to set LANG (or LC_CTYPE), no? Setting the > latter also gives you the advantage that environment variables > and command line arguments use the same encoding as file names. The advantage is that you can change the Python file system encoding *without* having to change your locale settings. You can't possibly expect a user to switch to using UTF-8 for all his/her applications just because Python needs this to properly decode file names. Users of applications written in Python will most likely not even know how to change the locale encoding. |
|||
msg118385 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-11 19:58 | |
MvL> > - Windows: unicode for command line/env, mbcs to decode filenames MvL> No: unicode for filenames also. Yes, I mean unicode for everything, but decode bytes data from the mbcs encoding. |
|||
msg118386 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-11 20:03 | |
MAL> If you remove the PYTHONFSENCODING, then we have to reconsider MAL> removal of sys.setfilesystemencoding(). Pleeeeeeeease, Marc, read my comments. You never consider technical problems, you just propose to ensure that "Python just works", without answering my technical questions. I already explained 2 or 3 times that sys.setfilesystemencoding() was completely buggy and not usable in practice. You proposed PYTHONFSENCODING and I implemented it. But then I explained in an email to python-dev and in this issue that this environment variable introduced many problems. I don't see how sys.setfilesystemencoding() would solve this issue; it's out of scope. |
|||
msg118388 - (view) | Author: Martin v. Löwis (loewis) * ![]() |
Date: 2010-10-11 20:07 | |
> You can't possibly expect a user to switch to using UTF-8 for > all his/her applications just because Python needs this to > properly decode file names. If the user hasn't switched to UTF-8, why would Python need that to properly decode file names? |
|||
msg118389 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-11 20:17 | |
MAL> You can't just tell people to go with whatever encoding setup MAL> you prefer to make Python's guessing easier or more correct. Python doesn't really *guess* the encoding, it just reads the encoding from the locale. What do you mean by "more correct"? How can Python know the right encoding better than the user? Python should not guess anything. If the environment is not correctly configured, it's not Python's fault. The user has to fix their environment. |
|||
msg118390 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-11 20:19 | |
> I guess LANG and LC_CTYPE can be used for other purposes > such as internationalization. That's why there are different environment variables: * LC_MESSAGES for i18n (messages) * LC_CTYPE for the encoding * LC_TIME for time and date * etc. |
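The separate categories listed above can be inspected independently from Python. A small sketch, assuming a POSIX system (`locale.LC_MESSAGES` is not available on Windows); the function name `current_locale_categories` is hypothetical.

```python
import locale


def current_locale_categories():
    """Return the current setting of a few independent locale categories."""
    categories = {
        "LC_CTYPE": locale.LC_CTYPE,        # character encoding
        "LC_MESSAGES": locale.LC_MESSAGES,  # translated messages (POSIX only)
        "LC_TIME": locale.LC_TIME,          # date and time formatting
    }
    # setlocale() with a single argument only queries the category.
    return {name: locale.setlocale(cat) for name, cat in categories.items()}


settings = current_locale_categories()
```

Each category can be set on its own, so changing LC_CTYPE to pick an encoding does not require changing LC_MESSAGES or LC_TIME.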
|||
msg118392 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-11 21:16 | |
issue9992.patch: - Remove PYTHONFSENCODING environment variable - Mac OS X: Use utf-8 to decode command line arguments - Fix issue #9992 (this issue): attached test, locale_fs_encoding.py, passes - Fix issue #9988 - Fix issue #10014 - Fix issue #10039 $ diffstat issue9992.patch Doc/using/cmdline.rst | 12 ------------ Doc/whatsnew/3.2.rst | 6 ------ Lib/test/test_os.py | 30 ------------------------------ Lib/test/test_subprocess.py | 4 ---- Lib/test/test_sys.py | 29 ----------------------------- Modules/main.c | 3 --- Modules/python.c | 10 +++++++++- Python/pythonrun.c | 22 ++++++---------------- 8 files changed, 15 insertions(+), 101 deletions(-) I like this kind of patch: it removes more code than it adds, but it fixes 4 different issues! I didn't test the OSX-specific part of the patch (use utf8 to decode command line arguments). |
|||
msg118394 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-11 21:42 | |
I think that issue9992.patch also fixes #4388 because it uses the same encoding (FS encoding, utf8) on OSX to encode and to decode command line arguments. |
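Why the encoder and the decoder must agree can be shown without spawning a process. The sketch below simulates a parent that encodes an argument and a child that decodes it; the function `roundtrip` is illustrative and does not correspond to any CPython function.

```python
def roundtrip(arg, encode_with, decode_with):
    """Simulate a parent encoding an argument and a child decoding it."""
    raw = arg.encode(encode_with, "surrogateescape")
    return raw.decode(decode_with, "surrogateescape")


same = roundtrip("\xe9", "utf-8", "utf-8")        # encodings agree
mojibake = roundtrip("\xe9", "utf-8", "latin-1")  # child uses latin-1
escaped = roundtrip("\xe9", "latin-1", "utf-8")   # child uses utf-8
```

With matching encodings the argument survives (`same == 'é'`); with a mismatch it arrives either as mojibake (`'Ã©'`) or as an escaped byte (`'\udce9'`).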
|||
msg118591 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-13 22:18 | |
I committed issue9992.patch as r85430 (remove PYTHONFSENCODING) + r85435 (OSX: decode command line arguments from utf-8). These commits should fix this issue. Reopen the issue if you notice new problems, or if the problem is not fixed yet. I will watch Mac OS X buildbots, especially about r85435 ;-) |
|||
msg118607 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-14 00:34 | |
test_undecodable_env() of test_subprocess fails. r85430 removes the following code which was added by Antoine to fix this issue. # Force surrogate-escaping of \xFF in the child process; # otherwise it can be decoded as-is if the default locale # is latin-1. env['PYTHONFSENCODING'] = 'ascii' I think that we should accept that b'\xff' can be decoded as '\xff' and that's all. |
|||
msg118633 - (view) | Author: Antoine Pitrou (pitrou) * ![]() |
Date: 2010-10-14 08:23 | |
> I think that we should accept that b'\xff' can be decoded as '\xff' and > that's all. What do you plan to do to fix this failure? ====================================================================== FAIL: test_undecodable_env (test.test_subprocess.POSIXProcessTestCase) ---------------------------------------------------------------------- Traceback (most recent call last): File "/home2/buildbot2/slave/3.x.loewis-parallel/build/Lib/test/test_subprocess.py", line 892, in test_undecodable_env self.assertEquals(stdout.decode('ascii'), ascii(value)) AssertionError: "'abc\\xff'" != "'abc\\udcff'" - 'abc\xff' ? ^ + 'abc\udcff' ? ^^^ http://www.python.org/dev/buildbot/builders/x86%20debian%20parallel%203.x/builds/502/steps/test/logs/stdio |
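The two strings in the assertion error above correspond to decoding the same bytes with two different locale encodings. A short sketch reproducing both results in-process:

```python
value = b"abc\xff"

# With a latin-1 locale, 0xFF is a valid character (U+00FF) and is decoded as-is.
as_latin1 = value.decode("latin-1", "surrogateescape")

# With an ASCII locale, 0xFF is undecodable and is escaped as the surrogate U+DCFF.
as_ascii = value.decode("ascii", "surrogateescape")

seen_with_latin1 = ascii(as_latin1)
seen_with_ascii = ascii(as_ascii)
```

`seen_with_latin1` is `"'abc\\xff'"` (what the buildbot produced) and `seen_with_ascii` is `"'abc\\udcff'"` (what the test expected).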
|||
msg118645 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-14 10:44 | |
With r85466+r85467, the test_undecodable_env (of test_subprocess) uses the C locale to get ASCII as the locale encoding (for the first test, on unicode environment variables). It should have the same effect as env['PYTHONFSENCODING'] = 'ascii': get ASCII as the filesystem encoding. |
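The idea of forcing the C locale in the child can be sketched as follows. This is not the code of test_undecodable_env; the variable name `DEMO_VALUE` and the helper `child_sees` are made up, and a POSIX system is assumed.

```python
import os
import subprocess
import sys


def child_sees(value, locale_name="C"):
    """Run a child Python under the given locale and report how it decodes
    the DEMO_VALUE environment variable (hypothetical name)."""
    env = dict(os.environ)
    env["LC_ALL"] = locale_name   # overrides LANG and every LC_* category
    env["DEMO_VALUE"] = value     # encoded by the parent with surrogateescape
    code = "import os; print(ascii(os.environ['DEMO_VALUE']))"
    out = subprocess.check_output([sys.executable, "-c", code], env=env)
    return out.decode("ascii").strip()


result = child_sees("abc\udcff")
```

Under the C locale the byte 0xFF cannot be decoded as a character, so the child reports it as the escaped surrogate rather than as U+00FF.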
|||
msg118647 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-14 10:56 | |
Ok, the issue is not completely fixed ;-) 12:55 < py-bb> build #504 of x86 debian parallel 3.x is complete: Success [build successful] Build details are at http://www.python.org/dev/buildbot/all/builders/x86%20debian%20parallel%203.x/builds/504 |
|||
msg118648 - (view) | Author: STINNER Victor (vstinner) * ![]() |
Date: 2010-10-14 10:59 | |
I tried... "the issue is *now* completely fixed" |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:57:07 | admin | set | github: 54201 |
2010-10-14 10:59:27 | vstinner | set | messages: + msg118648 |
2010-10-14 10:56:13 | vstinner | set | status: open -> closed resolution: fixed messages: + msg118647 |
2010-10-14 10:44:32 | vstinner | set | messages: + msg118645 |
2010-10-14 08:23:19 | pitrou | set | assignee: vstinner messages: + msg118633 |
2010-10-14 00:34:39 | vstinner | set | messages: + msg118607 |
2010-10-13 22:18:40 | vstinner | set | messages: + msg118591 |
2010-10-13 18:03:22 | eric.araujo | set | nosy:
+ eric.araujo title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command-line arguments are not correctly decoded if locale and fileystem encodings are different |
2010-10-11 21:42:15 | vstinner | set | messages: + msg118394 |
2010-10-11 21:16:37 | vstinner | set | files:
+ issue9992.patch messages: + msg118392 |
2010-10-11 20:19:29 | vstinner | set | messages: + msg118390 |
2010-10-11 20:17:26 | vstinner | set | messages: + msg118389 |
2010-10-11 20:07:45 | loewis | set | messages: + msg118388 |
2010-10-11 20:03:27 | vstinner | set | messages: + msg118386 |
2010-10-11 19:58:45 | vstinner | set | messages: + msg118385 |
2010-10-11 16:26:33 | lemburg | set | messages: + msg118377 |
2010-10-11 16:08:55 | pitrou | set | messages: + msg118375 |
2010-10-11 16:01:58 | loewis | set | messages: + msg118374 |
2010-10-11 14:41:29 | lemburg | set | messages:
+ msg118368 title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent |
2010-10-11 14:38:38 | loewis | set | messages:
+ msg118367 title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent |
2010-10-11 14:32:32 | ronaldoussoren | set | files:
+ unnamed messages: + msg118365 title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent |
2010-10-11 13:56:27 | loewis | set | messages: + msg118360 |
2010-10-11 13:54:35 | loewis | set | messages: + msg118359 |
2010-10-11 13:45:43 | lemburg | set | messages: + msg118358 |
2010-10-11 12:15:20 | vstinner | set | messages: + msg118352 |
2010-10-10 19:44:20 | loewis | set | messages: + msg118344 |
2010-10-10 18:33:12 | pitrou | set | messages: + msg118341 |
2010-10-10 18:23:20 | loewis | set | messages: + msg118340 |
2010-10-10 17:59:23 | vstinner | set | messages: + msg118339 |
2010-10-10 16:22:26 | loewis | set | messages: + msg118337 |
2010-10-10 15:51:27 | vstinner | set | messages: + msg118336 |
2010-10-09 17:32:45 | pitrou | set | messages:
+ msg118279 title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent |
2010-10-09 17:11:28 | loewis | set | messages:
+ msg118278 title: Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent |
2010-10-09 12:28:18 | vstinner | set | messages: + msg118271 |
2010-10-09 12:07:16 | pitrou | set | messages: + msg118270 |
2010-10-09 12:01:46 | vstinner | set | messages:
+ msg118269 title: Command line arguments are not correctly decodedif locale and fileystem encodings aredifferent -> Command line arguments are not correctly decodediflocale and fileystem encodingsaredifferent |
2010-10-09 11:49:51 | loewis | set | messages: + msg118268 |
2010-10-09 10:52:24 | pitrou | set | messages: + msg118264 |
2010-10-09 10:47:47 | loewis | set | messages: + msg118263 |
2010-10-09 09:45:17 | pitrou | set | messages: + msg118258 |
2010-10-09 09:14:48 | loewis | set | messages: + msg118257 |
2010-10-08 20:30:35 | pitrou | set | nosy:
+ pitrou messages: + msg118225 |
2010-10-08 19:25:58 | ixokai | set | nosy:
+ ixokai messages: + msg118221 |
2010-10-02 12:14:56 | vstinner | set | messages: + msg117871 |
2010-09-30 10:53:37 | vstinner | set | messages: + msg117717 |
2010-09-30 10:43:17 | vstinner | set | nosy:
+ loewis, ronaldoussoren messages: + msg117716 |
2010-09-30 09:02:03 | lemburg | set | messages:
+ msg117711 title: Command line arguments are not correctly decodedif locale and fileystem encodings aredifferent -> Command line arguments are not correctly decodedif locale and fileystem encodings aredifferent |
2010-09-30 08:57:00 | vstinner | set | messages:
+ msg117709 title: Command line arguments are not correctly decoded if locale and fileystem encodings are different -> Command line arguments are not correctly decodedif locale and fileystem encodings aredifferent |
2010-09-30 07:55:20 | lemburg | set | nosy:
+ lemburg title: Command line arguments are not correctly decoded if locale and fileystem encodings are different -> Command line arguments are not correctly decoded if locale and fileystem encodings are different messages: + msg117705 |
2010-09-29 23:45:58 | vstinner | set | files:
+ cmdline_encoding-2.patch keywords: + patch messages: + msg117676 |
2010-09-29 23:11:21 | pjenvey | set | nosy:
+ pjenvey |
2010-09-29 22:36:21 | vstinner | create |