New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add fsencode() functions to os module #52760
Comments
Python3 uses unicode filenames in Windows and bytes filenames (but support also unicode filenames) on other OS. We have to support both types. On POSIX system, bytes filenames can be stored in unicode filenames using sys.getfilesystemencoding() and the surrogateescape error handler (to store undecodable bytes as unicode surrogates, see PEP-383). I would like to create fs_encode() and fs_decode() in os.path to ease the manipulation of filenames in the two bytes (str and bytes).
On Windows, fs_decode() and fs_encode() don't touch the filename, but reject filenames of types different than str (unicode) with a TypeError, especially bytes filename. Mac OS X rejects invalid UTF-8 filenames, and so surrogateescape should maybe not be used on this OS. Attached patch is an implementation of this issue. |
Issue bpo-8513 would benefit from these functions. |
STINNER Victor wrote:
Please follow the naming convention used in os.path. The functions Other than that, I'm +0 on the patch: the sys.filesystemencoding logic |
Ok
Today, most POSIX system uses utf8 by default for all partitions. If you mount an USB key, CD-Rom or network shared directory with the wrong options, you may get filenames in a different encoding. But this issue is not about fixing your OS configuration, but helping the most common case: a system using the same encoding everywhere (for the whole file system). You are still free to use directly the native OS type (unicode on Windows, bytes on other OS), ie. don't use fsencode()/fsdecode(). Python3 prefers unicode, eg. print expects an unicode string, not a byte string. I mean it's more pratical to use unicode everywhere in Python, and so fsencode()/fsdecode() can be really useful on POSIX systems. |
Update path: rename fs_encode/fs_decode to fsencode/fsdecode. |
Oops, "Update path": I mean "Update patch" ;-) |
i'm +0.7 on fsencode/fsdecode going into os.path. My bikeshed 0.7? They're also useful for dealing with environment variables which are not strictly filesystem (fs) related but also suffer from the same issue requiring surrogate escape. But other than just calling these os.encode and os.decode I don't have any brilliant alternate naming suggestions. thoughts? I could easily live with os.path.fsencode/fsdecode, I just wanted to point the other use out. |
Yes, Python3 decodes environment variables using sys.getfilesystemencoding()+surrogateescape. And since my last fix on os.execve(), subprocess (and os.execv(p)e) uses also surrogateescape to encode environment variables. And yes again, I also patched os.getenv() to decode bytes name to unicode using sys.getfilesystemencoding()+surrogateescape.
*fs*encode() and *fs*decode() is a reference to the encoding: sys.get*filesystem*encoding().
See also issue bpo-8513. |
Oh! In Python3, ntpath.expanduser() supports bytes path and uses sys.getfilesystemencoding() to encode an unicode environment variable to a byte string. Should we remove bytes path support in ntpath.expanduser(), or support bytes in ntpath.fsencode()/.fsdecode()? (sys.getfilesystemencoding() is "mbcs" on Windows) |
STINNER Victor wrote:
I don't see what environment variables have to do with the file Those are two different contexts and thus also require two different Command line parameters are another area, where an encoding Also note that "mbcs" on Windows is a meta-encoding. The On Unix, you often have the case that the environment variables See http://www.ietf.org/rfc/rfc3875.txt for details. Environment variables are also commonly used to interface |
STINNER Victor wrote:
Right, but if you start using those new API in standard lib In real life applications, you do run into these problems quite
Sure, but forcing UnicodeDecodeErrors upon Python3 programmers is Thanks,Marc-Andre Lemburg ::: Try our new mxODBC.Connect Python Database Interface for free ! :::: eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 |
Yes, I'm agree 100% with you :-)
I proposed to reject bytes on Windows because Martin (who knows Windows better http://mail.python.org/pipermail/python-dev/2010-April/099556.html Reject byte string on Windows is just a suggestion. To support byte strings on Since unicode is a superset of MBCS and MBCS has subtle bugs, it's preferable -- But on POSIX, it's the opposite: I'm doing my best to support byte string The first goal of fsencode() is to accept byte strings on POSIX systems. |
Le lundi 26 avril 2010 13:06:48, vous avez écrit :
A POSIX system only offers *one* function about the encoding: Are you suggesting that Python3 should support a encoding different for About filenames, Python3 choose the encoding using the locale, but the user
os.getenv() should raise a TypeError on Windows if key is a byte string. os.getenv() didn't support byte string. I patched it to support byte string
Since Python3 choosed to store environment variables as unicode string on My patch should help both cases: people using unicode objects and people using |
Version 3 of the patch: fix also os.getenv() which rejects now bytes on Windows (one of the goals of this issue). |
STINNER Victor wrote:
It's better to let the application decide how to solve this problem By using fsencode() and fsdecode() in stdlib functions, you basically Note that application may well use completely different encodings In the end, this will only lead to the same kind of mess we've
Well, yes, but that's a cludge isn't it ? If you know that e.g. your environment variables are going to have It would be a lot better to have the application provide the |
Le vendredi 30 avril 2010 15:58:28, vous avez écrit :
On POSIX, use byte strings to avoid encoding issues. Examples: subprocess.call(['env'], {b'TEST: b'a\xff-'}) # env
subprocess.call(['echo', b'a\xff-']) # command line
open('a\xff-') # filename
os.getenv(b'a\xff-') # get env (result as unicode) Are you talking about issues on Windows?
Not if you use byte strings. On POSIX, an unicode string is always converted
You mean that os.getenv() should have an optionnal argument? Something like: def getenv(key, default=None, encoding=None):
value = environ.get(key, default)
if encoding:
value = value.encode(sys.getfileystemencoding(), 'surrogateescape')
value = value.decode(encoding, 'surrogateescape')
return value There are many indirect calls to os.getenv() (eg. by using os.environ.get()):
How would you specify the correct encoding in indirect calls? If your application gets variables in *mixed* encoding, I think that your for name, encoding in (('PATH', 'latin1'), ...):
value = os.getenv(name)
value = value.encode(sys.getfileystemencoding(), 'surrogateescape')
value = value.decode(encoding, 'surrogateescape')
os.setenv(name, value) |
STINNER Victor wrote:
The issues normally occur on the way in, not the way out of Python,
Right and that's a problem since the file system encoding
Yes.
No, you store the environment data as bytes and only It would probably also worthwhile adding the encoding
In all of the above cases, the application (in this case the E.g. the cgi module can use the content-type passed in as
Which is a cludge as I mentioned in my previous comment: value = os.getenv(name, encoding=encoding)
my_environ[name] = value reads much better. Also note that os.setenv() won't work since that'll use the The point I want to make is that adding fsencode() and |
Yes, this is the best solution for POSIX. We need maybe also a os.getenvb()->bytes function, maybe only on POSIX. But I think that Windows should continue to use unicode environment variables. Should os.getenv(key, encoding=...) reencode the value on Windows? |
STINNER Victor wrote:
Yes, plus a os.setenvb() function to pass the data back to the C level
Good idea. That would make applications more easily portable between |
Ok, here is a first version of my patch to implement os.environb:
The patch is not done (the documentation should be updated), but it's a new step to help the discussion. I didn't tried it on Windows. I already try twice to write os.environb some months ago, but I failed (it was too complex for me). os.environ and os.environb now share the same "data" dictionary, and their methods converts inputs and outputs if necessary. |
In posixmodule.c, the following snippet doesn't make sense anymore: if (k == NULL) {
PyErr_Clear();
continue;
} If memory allocation of the bytes object fails, we should error out. |
I really, really, REALLY think that it is bad to mix issues. This makes patch review impossible. This specific issue is about introducing an fsdecode and fsencode function; this is what the bug title says, and what the initial patch did. Whether or not byte-oriented access to environment variables is also needed is a *separate* issue. -1 on dealing with that in this report. FWIW, I'm +0 on adding these functions. MAL, please stop messing issue subjects. If you are fundamentally opposed to adding such functions, please request that a PEP be written or something. Otherwise, I accept the original patch. I'm -1 on bpo-8514.patch; it is out-of-scope of the issue. |
I agree with Martin regarding the os.environ changes. Victor, please Martin: As you probably know, these issues are managed as micro- |
loewis> I really, really, REALLY think that it is bad to mix issues. I tried to, but it looks difficult :-) Anyway, I opened bpo-8603.
I know, but the two topics (fs*code() and os.environb) are very close and related. My os.environb implementation uses fsencode()/fsdecode().
I think that we cannot decide correctly about fs*code() until we decided for os.environb. |
Why is that? In msg104063, you claim that you want to create these |
Yes, but my os_path_fs_encode_decode-3.patch uses it in getenv() which is maybe a bad idea: os.environb may avoid this.
bpo-8513 is also related to environment variables: subprocess._execute_child() calls os.get_exec_path() which search the PATH environment variable. It would be nice to support bytes environment variable in the env argument of Popen constructor (bytes key and/or value). |
STINNER Victor wrote:
IIUC, that usage is an equivalent transformation, i.e. the code doesn't So *if* these functions are accepted, this change is a good idea
I still fail to see why this would make this issue block on the |
I changed os.getenv() to accept byte string key (in a previous commit), but I don't like this hack. If we have os.environb, os.getenv() shouldn't support bytes anymore (but use str only, as before). -- I worked a little more on fsencode()/os.environb, trying to fix all issues. fsdecode() is no more needed if we have os.environb, and fsencode() can be simplified to: def fsencode(value):
return value.encode(sys.getfilesystemencoding(), 'surrogateescape') fsdecode() leads to mojibake. |
I think that fsencode() (and fsdecode()) should be specific to POSIX. I don't know any good reason to encode a nice and correctly encoded unicode string to the ugly MBCS "encoding". |
New short, simple and clean path: add os.fsencode() for Unix only. -- Don't create it for Windows to encourage the usage of unicode on Windows (and use MBCS is a bad idea). fsdecode() was a also bad idea: it's better to keep bytes unchanged on Unix, and it's now possible thanks to os.environb and os.getenvb(). |
+.. function:: fsencode(value) I'd word the latter sentence as: Uses :func:`sys.getfilesystemencoding` and ``'surrogateescape'`` error handler for strings and returns bytes unchanged. Otherwise I think this patch looks good. +1 |
Commited: r80971 (py3k), blocked by r80972 (3.1). |
Why does this have no tests? |
The function is trivial. Does it really need tests? What kind of tests? fsencode() is already tested indirectly by test_subprocess, and bpo-8513 will add |
2010/5/8 STINNER Victor <report@bugs.python.org>:
Check that it is equivalent to utf-8 with surrogatesescape then.
Excuses, excuses! |
The file system encoding can be anything, not only utf-8. Anyway: r81014. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: