Problem with invalidly-encoded command-line arguments (Unix) #47273

Closed
baikie mannequin opened this issue Jun 1, 2008 · 10 comments
Labels
topic-unicode type-bug An unexpected behavior, bug, or error

Comments

@baikie
Mannequin

baikie mannequin commented Jun 1, 2008

BPO 3023
Nosy @loewis, @vstinner, @bitdancer

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = <Date 2009-05-18.18:34:10.106>
created_at = <Date 2008-06-01.22:22:15.019>
labels = ['type-bug', 'expert-unicode']
title = 'Problem with invalidly-encoded command-line arguments (Unix)'
updated_at = <Date 2009-05-18.18:34:10.085>
user = 'https://bugs.python.org/baikie'

bugs.python.org fields:

activity = <Date 2009-05-18.18:34:10.085>
actor = 'benjamin.peterson'
assignee = 'none'
closed = True
closed_date = <Date 2009-05-18.18:34:10.106>
closer = 'benjamin.peterson'
components = ['Unicode']
creation = <Date 2008-06-01.22:22:15.019>
creator = 'baikie'
dependencies = []
files = []
hgrepos = []
issue_num = 3023
keywords = []
message_count = 10.0
messages = ['67608', '67610', '67637', '67638', '78582', '78586', '78881', '78889', '78906', '88032']
nosy_count = 6.0
nosy_names = ['loewis', 'ggenellina', 'vstinner', 'baikie', 'r.david.murray', 'dedded']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue3023'
versions = ['Python 3.1']

@baikie
Mannequin Author

baikie mannequin commented Jun 1, 2008

The error message has no newline at the end:

$ LANG=en_GB.UTF-8 python3.0 test.py $'\xff'
Could not convert argument 2 to string$

Seriously, though: is this the intended behaviour? If the
interpreter just dies when it gets a non-UTF-8 (or whatever)
argument, it creates an opportunity for a denial-of-service if
some admin is running a Python script via find(1) or similar.
And what if you want to run a Python script on some files named
in a mixture of charsets (because, say, you just untarred an
archive created in a foreign charset)?

Could sys.argv not provide bytes objects for those arguments,
like os.listdir()? Or (better IMHO) have a separate
sys.argv_bytes interface?

@baikie baikie mannequin added topic-unicode type-bug An unexpected behavior, bug, or error labels Jun 1, 2008
@loewis
Mannequin

loewis mannequin commented Jun 1, 2008

That os.listdir still uses bytes should be changed as well. Both file
names and command line arguments are strings, from the viewpoint of
Python. Nothing else is supported.

@baikie
Mannequin Author

baikie mannequin commented Jun 2, 2008

Hmm, yes, I see that the open() builtin doesn't accept bytes
filenames, though os.open() still does. When I saw that you
could pass bytes filenames transparently from os.listdir() to
os.open(), I assumed that this was intentional!

So what *is* os.listdir() supposed to do when it finds an
unconvertible filename? Raise an exception? Pretend the file
isn't there? What if someone puts unconvertible strings in the
password database? I think this is going to cause real problems
for people.

@loewis
Mannequin

loewis mannequin commented Jun 2, 2008

The issue with unrepresentable file names hasn't been decided yet. One
option is to include the bytes object in that case, instead, noting that
this can only occur on selected platforms. Another option is indeed to
raise an exception, or exclude the file from the listing (although
errors should never pass silently).

@vstinner
Member

> Hmm, yes, I see that the open() builtin doesn't accept bytes
> filenames, though os.open() still does.

What? The open() builtin, io.open() and os.open() all accept bytes filenames.

> So what *is* os.listdir() supposed to do when it finds an
> unconvertible filename? Raise an exception?

os.listdir(str)->str raises an exception on an undecodable filename,
whereas os.listdir(bytes)->bytes doesn't raise a Unicode error because
the filename is not decoded!

> What if someone puts unconvertible strings in the password database?

Which database? It sounds like a different issue. It's always a good
thing to reject undecodable strings, even with python2 ;-)
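
A minimal sketch of the bytes-in, bytes-out listing described above, assuming a POSIX system. `list_names_as_bytes` is a hypothetical helper for illustration, not part of the os module; it relies only on `os.listdir()` and `os.fsencode()`.

```python
import os


def list_names_as_bytes(directory):
    """Return the entries of `directory` as bytes objects.

    Hypothetical helper: because the path is passed as bytes,
    os.listdir() returns bytes and no decoding step takes place,
    so an undecodable filename cannot trigger a Unicode error here.
    """
    # os.fsencode() leaves bytes untouched and encodes str paths
    # with the filesystem encoding.
    return sorted(os.listdir(os.fsencode(directory)))
```

A caller can then decide per name whether to decode it, skip it, or pass the raw bytes on to os.open().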

@dedded
Mannequin

dedded mannequin commented Dec 31, 2008

> > What if someone puts unconvertible strings in the password database?
>
> Which database? It sounds like a different issue.

It's yet another special case of the more general issue, which is that
Unix strings are strings of bytes that may or may not be encoded text.
Bytes of any value (save nul) are permitted in any order. There may be
the occasional additional constraint: '/' is not permitted in filenames
since it's the path element delimiter, for example. But you can
certainly have non-text strings for file names, environment variables,
command-line arguments, etc.
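
The two byte-level constraints named above can be sketched as a check on raw bytes. `is_valid_unix_filename` is a hypothetical name, and the function deliberately tests only the nul and '/' rules from this comment, not length limits or any filesystem-specific restrictions.

```python
def is_valid_unix_filename(name: bytes) -> bool:
    """Check only the two constraints mentioned above (hypothetical helper).

    Any byte value is accepted except nul, which terminates C strings,
    and '/', which separates path elements. Whether the bytes form
    valid text in any encoding is not checked at all.
    """
    return b"\0" not in name and b"/" not in name


# Bytes that are not valid UTF-8 still pass the check.
assert is_valid_unix_filename(b"\xff\xfe.txt")
assert not is_valid_unix_filename(b"a/b")
```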

Since Python 3 strings must be text, they cannot generally be used to
represent Unix strings. David's right, "this is going to cause real
problems." It has to be solved somehow, but the more obvious solutions
are in some way ugly and introduce platform-to-platform inconsistencies.
I occasionally skim the python-dev mailing list archive, and as far as
I can tell there is yet no consensus on how to handle this.

My use of Python is chiefly general-purpose scripting on Linux.
Parameters to these scripts are more likely to be file names than
anything else. So I can't personally consider moving to version 3 until
this issue is resolved, which is why I added myself to the nosy list.

I'm bothered by Martin's comment:

> That os.listdir still uses bytes should be changed as well. Both
> file names and command line arguments are strings, from the
> viewpoint of Python. Nothing else is supported.

I hope that this is nothing more than his expression of dismay that such
a situation should exist, and that he doesn't mean it literally.

@baikie
Mannequin Author

baikie mannequin commented Jan 2, 2009

@ Victor Stinner: Yes, the behaviour of those functions is as you
describe - it's been changed since I filed this issue. I do
consider it an improvement.

By the password database, I mean /etc/passwd or replacements that
are accessible via getpwnam() and friends. Users are often
allowed to change things like the GECOS field, and can generally
stick any old junk in there, regardless of encoding. Now that I
come to check, it seems that in the Python 3.0 release, the pwd.*
functions do raise UnicodeDecodeError when the GECOS field can't
be decoded (bizarrely, they try to interpret it as a Python
string literal, and thus choke on invalid backslash escapes).
Unfortunately, this allows a user to change their GECOS field so
that system programs written in Python can't determine the
username corresponding to that user's UID or vice versa.

@loewis
Mannequin

loewis mannequin commented Jan 2, 2009

> By the password database, I mean /etc/passwd or replacements that
> are accessible via getpwnam() and friends.

Please only discuss one issue at a time in the bug tracker. This
issue is about "invalidly-encoded command-line arguments", not about
the password database. If you want to report an issue with the password
database, please do so in a separate report.

@vstinner
Member

vstinner commented Jan 2, 2009

> By the password database, I mean /etc/passwd or replacements that
> are accessible via getpwnam() and friends. Users are often
> allowed to change things like the GECOS field, and can generally
> stick any old junk in there, regardless of encoding.

I started to patch the pwd module to return bytes instead of unicode, but I
didn't finish my work and then lost it :-/ Today, most UNIX systems use UTF-8
as the default charset. About GECOS: is it really used? If you have real
problems, open a new issue as proposed by Martin.

@bitdancer
Member

I believe the title problem is solved by PEP-383 in py3k trunk.
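
For readers unfamiliar with PEP 383: it introduced the surrogateescape error handler, which maps each undecodable byte to a lone surrogate code point so that the original bytes can be recovered when the string is encoded again. The sketch below illustrates that round trip. The helper names are hypothetical, and UTF-8 is assumed here for determinism, whereas the interpreter uses the locale's encoding for real command-line arguments.

```python
def decode_unix_string(raw, encoding="utf-8"):
    """Decode bytes to str, smuggling undecodable bytes through as
    lone surrogates (hypothetical helper, UTF-8 assumed)."""
    return raw.decode(encoding, errors="surrogateescape")


def encode_unix_string(text, encoding="utf-8"):
    """Reverse decode_unix_string(), restoring the original bytes."""
    return text.encode(encoding, errors="surrogateescape")


raw_arg = b"caf\xff"                    # 0xFF is never valid in UTF-8
text_arg = decode_unix_string(raw_arg)  # 'caf\udcff'
assert encode_unix_string(text_arg) == raw_arg
```

On POSIX systems, os.fsencode(sys.argv[1]) should perform the equivalent reverse step using the filesystem encoding, which gives a script access to the raw bytes of an argument without a separate sys.argv_bytes interface.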

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022