classification
Title: Problem with invalidly-encoded command-line arguments (Unix)
Type: behavior
Stage: resolved
Components: Unicode
Versions: Python 3.1
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To:
Nosy List: baikie, dedded, ggenellina, loewis, r.david.murray, vstinner
Priority: normal
Keywords:

Created on 2008-06-01 22:22 by baikie, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (10)
msg67608 - Author: David Watson (baikie) Date: 2008-06-01 22:22
The error message has no newline at the end:

$ LANG=en_GB.UTF-8 python3.0 test.py $'\xff'
Could not convert argument 2 to string$

Seriously, though: is this the intended behaviour?  If the
interpreter just dies when it gets a non-UTF-8 (or whatever)
argument, it creates an opportunity for a denial-of-service if
some admin is running a Python script via find(1) or similar.
And what if you want to run a Python script on some files named
in a mixture of charsets (because, say, you just untarred an
archive created in a foreign charset)?

Could sys.argv not provide bytes objects for those arguments,
like os.listdir()?  Or (better IMHO) have a separate
sys.argv_bytes interface?
msg67610 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-06-01 22:34
That os.listdir still uses bytes should be changed as well. Both file
names and command line arguments are strings, from the viewpoint of
Python. Nothing else is supported.
msg67637 - Author: David Watson (baikie) Date: 2008-06-02 17:59
Hmm, yes, I see that the open() builtin doesn't accept bytes
filenames, though os.open() still does.  When I saw that you
could pass bytes filenames transparently from os.listdir() to
os.open(), I assumed that this was intentional!

So what *is* os.listdir() supposed to do when it finds an
unconvertible filename?  Raise an exception?  Pretend the file
isn't there?  What if someone puts unconvertible strings in the
password database?  I think this is going to cause real problems
for people.
msg67638 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2008-06-02 20:30
The issue with unrepresentable file names hasn't been decided yet. One
option is to include the bytes object in that case, instead, noting that
this can only occur on selected platforms. Another option is indeed to
raise an exception, or exclude the file from the listing (although
errors should never pass silently).
msg78582 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-12-31 00:29
> Hmm, yes, I see that the open() builtin doesn't accept bytes
> filenames, though os.open() still does.

What? The open() builtin, io.open() and os.open() all accept bytes filenames.

> So what *is* os.listdir() supposed to do when it finds an
> unconvertible filename?  Raise an exception?

os.listdir(str)->str raises an exception on an undecodable filename,
whereas os.listdir(bytes)->bytes doesn't raise a Unicode error because
the filename is not decoded!
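
A minimal sketch of the two call forms mentioned above, for illustration only. The helper name `listing_types` is hypothetical, and the example uses an ASCII filename because the handling of undecodable names has differed between Python 3 releases. It only shows that the type of the argument selects the type of the returned names.

```python
import os
import tempfile


def listing_types(directory: str):
    """Return the element types produced by the str and bytes forms of os.listdir()."""
    as_text = os.listdir(directory)                # str in, str out
    as_bytes = os.listdir(os.fsencode(directory))  # bytes in, bytes out (names are not decoded)
    return {type(n) for n in as_text}, {type(n) for n in as_bytes}


with tempfile.TemporaryDirectory() as tmp:
    # Create one file so that both listings are non-empty.
    open(os.path.join(tmp, "example.txt"), "w").close()
    print(listing_types(tmp))
```

Passing the names from the bytes form straight to os.open() avoids any decoding step, which is the pass-through behaviour discussed earlier in this thread.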

> What if someone puts unconvertible strings in the password database?

Which database? It sounds like a different issue. It's always a good
thing to reject undecodable strings, even with Python 2 ;-)
msg78586 - Author: Dan Dever (dedded) Date: 2008-12-31 01:45
>> What if someone puts unconvertible strings in the password database?
>
> Which database? It sounds like a different issue.

It's yet another special case of the more general issue, which is that
Unix strings are strings of bytes that may or may not be encoded text. 
Bytes of any value (save nul) are permitted in any order.  There may be
the occasional additional constraint: '/' is not permitted in filenames
since it's the path element delimiter, for example.  But you can
certainly have non-text strings for file names, environment variables,
command-line arguments, etc.
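
To make the point concrete, here is a small sketch (not taken from the original report) of a byte string that is legal as a Unix filename or argument but is not valid UTF-8. The helper `try_decode` and the sample name are hypothetical.

```python
def try_decode(raw: bytes, encoding: str = "utf-8"):
    """Return the decoded text, or None if raw is not valid in the given encoding."""
    try:
        return raw.decode(encoding)  # strict error handling is the default
    except UnicodeDecodeError:
        return None


# Contains no NUL and no '/', so it is acceptable to the kernel as a filename,
# but the byte 0xff never occurs in valid UTF-8.
name = b"report-\xff.txt"
print(try_decode(name))           # None
print(try_decode(b"report.txt"))  # report.txt
```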

Since Python 3 strings must be text, they cannot generally be used to
represent Unix strings.  David's right, "this is going to cause real
problems."  It has to be solved somehow, but the more obvious solutions
are in some way ugly and introduce platform-to-platform inconsistencies.
I occasionally skim the python-dev mailing list archive, and as far as
I can tell there is not yet any consensus on how to handle this.

My use of Python is chiefly general-purpose scripting on Linux. 
Parameters to these scripts are more likely to be file names than
anything else.  So I can't personally consider moving to version 3 until
this issue is resolved, which is why I added myself to the nosy list.

I'm bothered by Martin's comment:

> That os.listdir still uses bytes should be changed as well. Both
> file names and command line arguments are strings, from the
> viewpoint of Python. Nothing else is supported.

I hope that this is nothing more than his expression of dismay that such
a situation should exist, and that he doesn't mean it literally.
msg78881 - Author: David Watson (baikie) Date: 2009-01-02 21:38
@ Victor Stinner: Yes, the behaviour of those functions is as you
describe - it's been changed since I filed this issue.  I do
consider it an improvement.

By the password database, I mean /etc/passwd or replacements that
are accessible via getpwnam() and friends.  Users are often
allowed to change things like the GECOS field, and can generally
stick any old junk in there, regardless of encoding.  Now that I
come to check, it seems that in the Python 3.0 release, the pwd.*
functions do raise UnicodeDecodeError when the GECOS field can't
be decoded (bizarrely, they try to interpret it as a Python
string literal, and thus choke on invalid backslash escapes).
Unfortunately, this allows a user to change their GECOS field so
that system programs written in Python can't determine the
username corresponding to that user's UID or vice versa.
msg78889 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2009-01-02 22:02
> By the password database, I mean /etc/passwd or replacements that
> are accessible via getpwnam() and friends. 

Please only discuss one issue at a time in the bug tracker. This
issue is about "invalidly-encoded command-line arguments", not about
the password database. If you want to report an issue with the password
database, please do so in a separate report.
msg78906 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2009-01-02 23:58
> By the password database, I mean /etc/passwd or replacements that
> are accessible via getpwnam() and friends.  Users are often
> allowed to change things like the GECOS field, and can generally
> stick any old junk in there, regardless of encoding.

I started to patch the pwd module to return bytes instead of unicode, but I
didn't finish my work and then lost it :-/ Today, most UNIX systems use UTF-8
as the default charset. About GECOS: is it really used? If you have real
problems, open a new issue as proposed by Martin.
msg88032 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-05-18 15:58
I believe the title problem is solved by PEP 383 in py3k trunk.
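
For readers unfamiliar with PEP 383, the following is a minimal sketch of the "surrogateescape" error handler it introduced. The helper names (`escape_decode`, `escape_encode`, `argv_as_bytes`) are hypothetical, and UTF-8 is assumed as the encoding for the demonstration. `argv_as_bytes` relies on os.fsencode(), which was added in Python 3.2, after this issue was closed.

```python
import os


def escape_decode(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, mapping each undecodable byte to a lone surrogate (PEP 383)."""
    return raw.decode(encoding, "surrogateescape")


def escape_encode(text: str, encoding: str = "utf-8") -> bytes:
    """Inverse of escape_decode: turn the lone surrogates back into the original bytes."""
    return text.encode(encoding, "surrogateescape")


def argv_as_bytes(argv):
    """Recover byte strings from sys.argv-style values using the filesystem encoding."""
    return [os.fsencode(arg) for arg in argv]


raw = b"\xff"                      # the argument from the original report
text = escape_decode(raw)
print(ascii(text))                 # '\udcff'
print(escape_encode(text) == raw)  # True
```

With this scheme an undecodable argument reaches the script as a str containing lone surrogates rather than aborting the interpreter, and a script that needs the exact bytes can call something like `argv_as_bytes(sys.argv)`.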
History
Date                 User               Action  Args
2022-04-11 14:56:35  admin              set     github: 47273
2009-05-18 18:34:10  benjamin.peterson  set     status: pending -> closed
2009-05-18 15:58:57  r.david.murray     set     status: open -> pending; nosy: + r.david.murray; messages: + msg88032; resolution: fixed; stage: resolved
2009-05-16 19:40:43  ajaksu2            set     priority: normal; versions: + Python 3.1, - Python 3.0
2009-01-02 23:58:49  vstinner           set     messages: + msg78906
2009-01-02 22:02:30  loewis             set     messages: + msg78889
2009-01-02 21:38:53  baikie             set     messages: + msg78881
2008-12-31 23:48:48  ggenellina         set     nosy: + ggenellina
2008-12-31 01:45:16  dedded             set     messages: + msg78586
2008-12-31 00:29:38  vstinner           set     nosy: + vstinner; messages: + msg78582
2008-12-29 23:29:27  dedded             set     nosy: + dedded
2008-06-02 20:30:47  loewis             set     messages: + msg67638
2008-06-02 17:59:43  baikie             set     messages: + msg67637
2008-06-01 22:34:05  loewis             set     nosy: + loewis; messages: + msg67610
2008-06-01 22:22:15  baikie             create