classification
Title: Bytes version of sys.argv
Type: enhancement
Stage: needs patch
Components: Interpreter Core, Unicode
Versions: Python 3.3
process
Status: closed
Resolution: wont fix
Dependencies:
Superseder:
Assigned To:
Nosy List: Arfrever, amaury.forgeotdarc, ezio.melotti, haypo, lemburg, loewis
Priority: normal
Keywords:

Created on 2010-05-20 12:10 by haypo, last changed 2011-04-12 22:25 by haypo. This issue is now closed.

Files
File name Uploaded Description Edit
argvb.py haypo, 2010-10-24 20:35
Messages (12)
msg106140 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-05-20 12:10
In some situations, the encoding of the command line is incorrect or unknown. sys.argv is decoded with the file system encoding, which can be wrong. E.g., see issue #4388 (OK, it's a bug and it should be fixed).

As with os.environb, it would be useful to have a bytes version of sys.argv to be able to choose the encoding used to decode each argument, or to manipulate bytes if we don't care about the encoding.

See also issue #8775, which proposes adding a new encoding to decode sys.argv.
msg106141 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-05-20 12:29
> sys.argv is decoded with the file system encoding

IIRC this is not exact. The signature of Py_Main is
    Py_Main(int argc, wchar_t **argv)
then PyUnicode_FromWideChar is used, and there is no conversion (except from UCS4 to UCS2).
The wchar_t strings themselves are built with mbstowcs(), the file system encoding is not used.
msg106143 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-05-20 12:39
> The wchar_t strings themselves are built with mbstowcs(), 
> the file system encoding is not used.

Oops sorry, you are right, and it's worse :-) sys.argv is decoded using the locale encoding, but subprocess and co. use the file system encoding for the reverse operation. => It doesn't work if the two encodings differ (#4388, #8775).

The pseudo-code to create sys.argv on Unix is:

 # argv is a bytes list
 encoding = locale.getpreferredencoding()
 sys.argv = [arg.decode(encoding, 'surrogateescape') for arg in argv]
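
A minimal sketch of the mismatch described above. The encodings are assumptions chosen for illustration (a Latin-1 locale with a UTF-8 file system encoding), and the byte strings stand in for the raw command line arguments:

```python
# Raw arguments as the OS would pass them: UTF-8 text, then a byte
# that is invalid in UTF-8.
raw = [b'caf\xc3\xa9', b'\xff']

# Same encoding in both directions: surrogateescape lets even the
# undecodable byte survive the round trip.
decoded = [arg.decode('utf-8', 'surrogateescape') for arg in raw]
same = [arg.encode('utf-8', 'surrogateescape') for arg in decoded]

# Different encodings (assumed: decode with Latin-1, re-encode with UTF-8):
# the bytes handed to a child process no longer match the originals.
decoded = [arg.decode('latin-1', 'surrogateescape') for arg in raw]
mixed = [arg.encode('utf-8', 'surrogateescape') for arg in decoded]

print(same == raw)
print(mixed == raw)
```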
msg106172 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-05-20 17:24
> As with os.environb, it would be useful to have a bytes version of sys.argv
> to be able to choose the encoding used to decode each argument, or
> to manipulate bytes if we don't care about the encoding.

-1. Py_Main expects wchar_t*, so no byte-oriented representation of the
command line is readily available.
msg111754 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-07-28 00:53
"no byte-oriented representation of the command line is readily available."

Why not use the following recipe?

 encoding = locale.getpreferredencoding()
 sys.argvb = [arg.decode(encoding, 'surrogateescape') for arg in argv]
msg111757 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-07-28 01:21
You should read .encode(), not .decode() :-/
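
With that correction applied, the recipe could be sketched as follows. The function name `argv_as_bytes` and its parameters are hypothetical, and the approach assumes a POSIX system where sys.argv was decoded with the locale encoding and the surrogateescape error handler:

```python
import locale
import sys


def argv_as_bytes(argv=None, encoding=None):
    """Re-encode already-decoded arguments back to bytes.

    Hypothetical helper: defaults to sys.argv and the locale encoding.
    """
    if argv is None:
        argv = sys.argv
    if encoding is None:
        encoding = locale.getpreferredencoding()
    # surrogateescape turns lone surrogates (U+DC80..U+DCFF) back into
    # the original undecodable bytes.
    return [arg.encode(encoding, 'surrogateescape') for arg in argv]


# 'caf\udce9' is what decoding b'caf\xe9' as UTF-8 with surrogateescape gives.
print(argv_as_bytes(['prog', 'caf\udce9'], 'utf-8'))
```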
msg111770 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-07-28 05:54
Using that approach would work on POSIX systems.

Another problem I see is synchronizing the two. If some function strips arguments from sys.argv (because it has completed processing), sys.argvb would still keep the arguments. Of course, this could be fixed by having sys.argvb be a dynamic list (i.e. a sequence object) instead.
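
One way such a dynamic sequence might look, as a rough sketch only. The class name `BytesView` and the fixed UTF-8 encoding are assumptions for illustration; nothing like this exists in the sys module:

```python
from collections.abc import Sequence


class BytesView(Sequence):
    """Read-only bytes view over a list of str (hypothetical helper)."""

    def __init__(self, source, encoding='utf-8'):
        self._source = source      # the live list, not a copy
        self._encoding = encoding

    def _encode(self, arg):
        return arg.encode(self._encoding, 'surrogateescape')

    def __len__(self):
        return len(self._source)

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self._encode(arg) for arg in self._source[index]]
        return self._encode(self._source[index])


args = ['prog', '--verbose', 'file.txt']   # stands in for sys.argv
view = BytesView(args)
del args[1]                                # a function strips a processed argument
print(list(view))
```

Because every access encodes from the underlying list, the view cannot go stale, at the cost of re-encoding on each lookup.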
msg111818 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-07-28 14:50
> Using that approach would work on POSIX systems.

As with os.environb, I think that sys.argvb should not exist on Windows.

> Another problem I see is synchronizing the two

os.environ and os.environb are synchronized. It would be possible to do the same with sys.argv and sys.argvb. The implementation would be simpler because it's just a list, not a dict.
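
The existing synchronization between os.environ and os.environb can be observed with a short sketch like this one. The variable name `ARGVB_DEMO` is made up for the example, and the block does nothing on platforms without os.environb:

```python
import os

via_bytes = via_str = None

if os.supports_bytes_environ:              # False on Windows
    # A write through the str mapping is visible through the bytes mapping.
    os.environ['ARGVB_DEMO'] = 'hello'
    via_bytes = os.environb[b'ARGVB_DEMO']

    # A write through the bytes mapping is visible through the str mapping,
    # decoded with the file system encoding and surrogateescape.
    os.environb[b'ARGVB_DEMO'] = b'\xff'
    via_str = os.environ['ARGVB_DEMO']

    # Deleting through one mapping removes the entry from both.
    del os.environ['ARGVB_DEMO']
```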
msg111819 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-07-28 14:54
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> Using that approach would work on POSIX systems.
> 
> As with os.environb, I think that sys.argvb should not exist on Windows.
> 
>> Another problem I see is synchronizing the two
> 
> os.environ and os.environb are synchronized. It would be possible to do the same with sys.argv and sys.argvb. The implementation would be simpler because it's just a list, not a dict.

+1 on adding sys.argvb for systems that use char* in main().
msg119255 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-10-21 00:58
Since r85765 (issue #4388), Python always uses UTF-8 to decode the command line arguments on Mac OS X, not the locale encoding. This means that the pseudo-code becomes:

 import locale, os, sys

 if os.name != 'nt':
     if sys.platform == 'darwin':
         encoding = 'utf-8'
     else:
         encoding = locale.getpreferredencoding()
     # sys.argv holds str, so the bytes version is obtained with .encode()
     sys.argvb = [arg.encode(encoding, 'surrogateescape') for arg in sys.argv]

sys.argvb should be synchronized with sys.argv, as os.environb is with os.environ.
msg119528 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-10-24 20:35
Prototype (in Python) of argvb.py. Try it with: ./python -i argvb.py.

It's not possible to create sys.argvb in Python in a module loaded by Py_Initialize(), because sys.argv is created after Py_Initialize().
msg133608 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-04-12 22:25
One year after opening the issue, I don't have any real use case. There are also technical issues in implementing this feature, so I prefer to just close this issue. Reopen it if you really want it, but please give a use case ;-)
History
Date User Action Args
2011-04-12 22:25:11  haypo  set  status: open -> closed; resolution: wont fix; messages: + msg133608
2010-12-14 18:58:30  r.david.murray  set  stage: needs patch; type: enhancement; versions: + Python 3.3, - Python 3.2
2010-10-24 20:35:57  haypo  set  files: + argvb.py; messages: + msg119528
2010-10-21 00:58:20  haypo  set  messages: + msg119255
2010-07-28 14:54:27  lemburg  set  nosy: + lemburg; messages: + msg111819
2010-07-28 14:50:31  haypo  set  messages: + msg111818
2010-07-28 05:54:44  loewis  set  messages: + msg111770
2010-07-28 03:42:31  ezio.melotti  set  nosy: + ezio.melotti
2010-07-28 01:21:04  haypo  set  messages: + msg111757
2010-07-28 00:53:02  haypo  set  messages: + msg111754
2010-05-20 17:24:32  loewis  set  nosy: + loewis; messages: + msg106172
2010-05-20 16:30:21  Arfrever  set  nosy: + Arfrever
2010-05-20 12:39:43  haypo  set  messages: + msg106143
2010-05-20 12:29:33  amaury.forgeotdarc  set  nosy: + amaury.forgeotdarc; messages: + msg106141
2010-05-20 12:10:54  haypo  create