classification
Title: sys.argv docs should explain how to handle encoding issues
Type: enhancement Stage: resolved
Components: Documentation, Unicode Versions: Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Arfrever, andyma, docs@python, ezio.melotti, inada.naoki, miss-islington, mjacob, ncoghlan, pitrou, sreepriya
Priority: normal Keywords: patch

Created on 2013-02-03 04:01 by ncoghlan, last changed 2020-06-18 11:19 by inada.naoki. This issue is now closed.

Files
File name Uploaded Description
Issue17110.patch sreepriya, 2014-03-17 23:01 Documentation for proper encoding of command line arguments.
Pull Requests
URL Status Linked
PR 12602 merged inada.naoki, 2019-03-28 12:27
PR 12626 merged miss-islington, 2019-03-30 05:32
Messages (11)
msg181239 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2013-02-03 04:01
The sys.argv docs [1] currently do not mention that its entries are Unicode strings decoded from bytes provided by the OS. They also don't explain how to correct a decoding error by reversing Python's implicit conversion and redoing it based on the application's knowledge of the correct encoding, as described at [2].

[1] http://docs.python.org/3/library/sys#sys.argv
[2] http://stackoverflow.com/questions/6981594/sys-argv-as-bytes-in-python-3k/
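
A minimal sketch of the recovery described above, assuming a POSIX system where Python decoded the argument with the filesystem encoding and the surrogateescape error handler. The helper name `redecode_arg` and the choice of Latin-1 as the "known correct" encoding are illustrative assumptions, not something the sys docs define. The startup decoding is simulated with os.fsdecode() so the snippet does not depend on a real command line.

```python
import os


def redecode_arg(arg, encoding):
    """Undo Python's implicit decoding of a command line argument and
    decode the original bytes again with an application-chosen encoding.

    Hypothetical helper, for illustration only.
    """
    raw = os.fsencode(arg)       # recover the bytes the OS passed in
    return raw.decode(encoding)  # redo the decoding with the known codec


# Bytes for "elephant" with accented e's, encoded as Latin-1 (assumed input).
raw_from_os = b"\xe9l\xe9phant"

# Simulate what Python does to argv entries at startup on POSIX.
arg = os.fsdecode(raw_from_os)

fixed = redecode_arg(arg, "latin-1")
print(ascii(fixed))
```

In a real script the same helper would be applied to the entries of sys.argv instead of the simulated value.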
msg213674 - (view) Author: Sreepriya Chalakkal (sreepriya) * Date: 2014-03-15 19:12
I tried running the following code with Python 3.4:

import sys

print(sys.argv[1])
print(b'bytes')

I ran it as follows, passing an argument in a different encoding:
$ python ~/a.py `echo priya|iconv -t latin1`
priya
bytes

No UnicodeEncodeError was raised! Is it because the problem is fixed?
msg213699 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-03-16 01:54
> There was no unicode encode error generated! Is it because the problem 
> is fixed?

No, it's not fixed.
First, it seems you are testing with Python 2 (otherwise you would get "b'bytes'", not "bytes"). Python 2 won't have a problem here, since it treats everything as bytestrings.
Second, to reproduce the issue you must pass a non-ASCII string. For example:

$ ./python a.py `echo éléphant|iconv -t latin1`
Traceback (most recent call last):
  File "a.py", line 4, in <module>
    print(sys.argv[1])
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed
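
The failure in the traceback above can be reproduced without a shell. This is a sketch that assumes the startup decoding is UTF-8 with the surrogateescape error handler, which is what a UTF-8 locale on Unix would use; the variable names are illustrative.

```python
# Latin-1 bytes for the argument used above (0xE9 is the accented e).
raw = b"\xe9l\xe9phant"

# Assumed startup behavior: decode as UTF-8, escaping undecodable bytes.
arg = raw.decode("utf-8", errors="surrogateescape")
print(ascii(arg))  # each 0xE9 byte shows up as the lone surrogate U+DCE9

# A strict UTF-8 encode, as a UTF-8 stdout performs in print(), rejects
# lone surrogates.
error = None
try:
    arg.encode("utf-8")
except UnicodeEncodeError as exc:
    error = exc
    print("strict encode failed:", exc.reason)

# Encoding with the same error handler restores the original bytes.
restored = arg.encode("utf-8", errors="surrogateescape")
print(restored == raw)
```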
msg213911 - (view) Author: Sreepriya Chalakkal (sreepriya) * Date: 2014-03-17 23:01
You are right. Instead of running ./python inside the python directory, I ran the system's default, older Python! Based on the Stack Overflow link given, I tried to write some documentation. I am attaching the patch!
msg214022 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2014-03-18 21:33
Hmm, I'm not sure where those explanations belong, but I don't think they should be in the sys module docs (especially as they are quite lengthy, and they also apply to other data such as os.environ). Perhaps the Unicode HOWTO?
msg339175 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-03-30 05:32
New changeset 38f4e468d4b55551e135c67337c18ae142193ba8 by Inada Naoki in branch 'master':
bpo-17110: doc: add note how to get bytes from sys.argv (GH-12602)
https://github.com/python/cpython/commit/38f4e468d4b55551e135c67337c18ae142193ba8
msg339176 - (view) Author: miss-islington (miss-islington) Date: 2019-03-30 05:38
New changeset 5b80cb5584a72044424f2d82d0ae79c720f24c47 by Miss Islington (bot) in branch '3.7':
bpo-17110: doc: add note how to get bytes from sys.argv (GH-12602)
https://github.com/python/cpython/commit/5b80cb5584a72044424f2d82d0ae79c720f24c47
msg371778 - (view) Author: Manuel Jacob (mjacob) * Date: 2020-06-17 22:03
The actual startup code uses Py_DecodeLocale() for converting argv from bytes to unicode. Since which Python version is it guaranteed that Py_DecodeLocale() and os.fsencode() roundtrip?
msg371788 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2020-06-18 01:42
There is no strict guarantee.

I think ASCII, UTF-8, and latin1 with surrogateescape guarantee a roundtrip.

Other legacy encodings such as cp932 may not roundtrip. That is not a huge problem, because typically only Windows uses them.
On Windows:

* wchar_t is used in most cases, instead of fsencoding
* fsencoding is now UTF-8 by default

In other words, if you are using a legacy encoding on Unix, it may not roundtrip.
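
The roundtrip claim for ASCII, UTF-8 and latin1 can be checked at the codec level. This is a sketch under the assumption that decoding and encoding both use the surrogateescape error handler; the helper name `roundtrips` is hypothetical. It says nothing about legacy encodings such as cp932, which are not tested here.

```python
def roundtrips(raw, encoding):
    """Return True if raw survives decode + encode with surrogateescape.

    Hypothetical helper, for illustration only.
    """
    text = raw.decode(encoding, errors="surrogateescape")
    return text.encode(encoding, errors="surrogateescape") == raw


# Every possible byte value in one string, valid for the codec or not.
every_byte = bytes(range(256))

results = {
    enc: roundtrips(every_byte, enc)
    for enc in ("ascii", "utf-8", "latin-1")
}
print(results)
```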
msg371802 - (view) Author: Manuel Jacob (mjacob) * Date: 2020-06-18 09:48
If the encoding supports it, since which Python version do Py_DecodeLocale() and os.fsencode() roundtrip?

The background of my question is that Mercurial goes some extra rounds to determine the correct encoding to emulate what Py_EncodeLocale() would do: https://www.mercurial-scm.org/repo/hg/file/5.4.1/mercurial/pycompat.py#l157 . If os.fsencode() could be used, it would simplify the code. Mercurial supports Python 3.5+.
msg371806 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2020-06-18 11:19
> If the encoding supports it, since which Python version do
> Py_DecodeLocale() and os.fsencode() roundtrip?

Maybe since Python 3.2. FWIW, fsencode was added by Victor in
https://bugs.python.org/issue8514

> The background of my question is that Mercurial goes some extra rounds to
> determine the correct encoding to emulate what Py_EncodeLocale() would do:
> https://www.mercurial-scm.org/repo/hg/file/5.4.1/mercurial/pycompat.py#l157
> If os.fsencode() could be used, it would simplify the code. Mercurial
> supports Python 3.5+.

I think it is the right approach.
One of the important use cases of os.fsencode is handling a file path from
sys.argv even if it cannot be decoded with the filesystem encoding.
History
Date User Action Args
2020-06-18 11:19:04 inada.naoki set messages: + msg371806
2020-06-18 09:48:29 mjacob set messages: + msg371802
2020-06-18 01:42:25 inada.naoki set messages: + msg371788
2020-06-17 22:03:03 mjacob set nosy: + mjacob; messages: + msg371778
2019-03-30 06:25:04 inada.naoki set status: open -> closed; stage: patch review -> resolved; resolution: fixed; versions: + Python 3.7, Python 3.8, - Python 3.2, Python 3.3, Python 3.4
2019-03-30 05:38:17 miss-islington set nosy: + miss-islington; messages: + msg339176
2019-03-30 05:32:36 miss-islington set pull_requests: + pull_request12559
2019-03-30 05:32:11 inada.naoki set nosy: + inada.naoki; messages: + msg339175
2019-03-28 12:27:55 inada.naoki set stage: needs patch -> patch review; pull_requests: + pull_request12542
2014-03-18 21:33:01 pitrou set messages: + msg214022
2014-03-18 08:56:15 andyma set nosy: + andyma
2014-03-17 23:01:03 sreepriya set files: + Issue17110.patch; keywords: + patch; messages: + msg213911
2014-03-16 01:54:01 pitrou set nosy: + pitrou; messages: + msg213699
2014-03-15 19:12:56 sreepriya set nosy: + sreepriya; messages: + msg213674
2013-02-03 04:27:08 Arfrever set nosy: + Arfrever
2013-02-03 04:01:11 ncoghlan create