Python3/POSIX: errors if file system encoding is None #52856

vstinner · 2010-05-04T11:58:57Z

BPO	8610
Nosy	@malemburg, @loewis, @pitrou, @vstinner
Files	initfsencoding-4.patch initfsencoding-5-utf-8.patch

^{Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.}

Show more details

GitHub fields:

assignee = None
closed_at = <Date 2010-05-15.12:39:45.690>
created_at = <Date 2010-05-04.11:58:56.531>
labels = ['interpreter-core', 'expert-unicode']
title = 'Python3/POSIX:  errors if file system encoding is None'
updated_at = <Date 2010-05-15.12:39:45.689>
user = 'https://github.com/vstinner'

bugs.python.org fields:

activity = <Date 2010-05-15.12:39:45.689>
actor = 'vstinner'
assignee = 'none'
closed = True
closed_date = <Date 2010-05-15.12:39:45.690>
closer = 'vstinner'
components = ['Interpreter Core', 'Unicode']
creation = <Date 2010-05-04.11:58:56.531>
creator = 'vstinner'
dependencies = []
files = ['17247', '17252']
hgrepos = []
issue_num = 8610
keywords = ['patch']
message_count = 17.0
messages = ['104924', '104984', '104986', '105002', '105005', '105007', '105008', '105010', '105031', '105175', '105176', '105188', '105189', '105195', '105212', '105247', '105805']
nosy_count = 5.0
nosy_names = ['lemburg', 'loewis', 'pitrou', 'vstinner', 'Arfrever']
pr_nums = []
priority = 'normal'
resolution = 'fixed'
stage = None
status = 'closed'
superseder = None
type = None
url = 'https://bugs.python.org/issue8610'
versions = ['Python 3.2']

vstinner · 2010-05-04T11:58:54Z

On POSIX (but not on Mac OS X), Python3 calls get_codeset() to get the file system encoding. If this function fails, sys.getfilesystemencoding() returns None. PyUnicode_DecodeFSDefaultAndSize() fallbacks to utf-8 whereas subprocess fail:

...
File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 670, in __init__
restore_signals, start_new_session)
File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 1101, in _execute_child
executable_list = (fs_encode(executable),)
File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 1088, in fs_encode
return s.encode(fs_encoding, 'surrogateescape')
TypeError: encode() argument 1 must be string, not None

We have two choices: raise a fatal error if get_codeset() failed, or fallback to utf-8.

On Windows and Mac OS X, get_codeset() shouldn't be called because the result is just dropped. We should call _PyCodec_Lookup(Py_FileSystemDefaultEncoding) instead to ensure that the file system encoding can be loaded.

vstinner · 2010-05-04T23:34:25Z

Here is a patch for the first solution: display a fatal error if we are unable to get the locale encoding. It does always exit with a fatal error if nl_langinfo(CODESET) is not available (and Py_FileSystemDefaultEncoding is not set).

I don't think it's a good idea to display an fatal error at runtime. If nl_langinfo(CODESET) is not available, configure should fail or we should fallback to an hardcoded encoding (ok but which one?).

Extract of the nl_langinfo() manual page (on Linux):

CONFORMING TO
SUSv2, POSIX.1-2001.

vstinner · 2010-05-05T00:15:47Z

Patch for the second solution (fallback to utf-8 on get_codeset() failure):

create a subfunction initfsencoding() (Py_InitializeEx is already very long)
hardcode the encoding to utf-8 if nl_langinfo(CODESET) is missing
don't call get_codeset() on Windows or Mac OS X
call _PyCodec_Lookup(Py_FileSystemDefaultEncoding) if get_codeset() was not called (eg. on Windows) or if get_codeset() failed to ensure that the codec can be (and is) loaded: display a fatal error on failure

Since I wrote patches for both solution, I can now compare correctly advantages and disavantages. I prefer initfsencoding() because it works on all cases and is simpler than no_fsencoding_error.patch.

malemburg · 2010-05-05T07:45:18Z

STINNER Victor wrote:

I don't think it's a good idea to display an fatal error at runtime. If nl_langinfo(CODESET) is not available, configure should fail or we should fallback to an hardcoded encoding (ok but which one?).

If nl_langinfo(CODESET) fails, Python should assume the default
locale, which is "C" on POSIX platforms. The "C" locale uses
ASCII as encoding, so Python should use that as well. Note that the
manpage for nl_langinfo() doesn't mention any errors that could
be raised by it:

"""
RETURN VALUE
If no locale has been selected for the appropriate category, nl_langinfo() returns a
pointer to the corresponding string in the "C" locale.

   If item is not valid, a pointer to an empty string is returned.

   This pointer may point to static data that may be overwritten on the next call to  nl_lang‐
   info() or setlocale(3).

"""

As with all locale APIs, it is not thread-safe, which can become
an issues if Python gets embedded in a multi-threaded application.

There's also another issue: it's possible that nl_langinfo(CODESET)
returns an encoding which is not known to Python.

In such a case, it would be best to issue a warning to the
user and fall back to ASCII as in the "C" locale case.

Terminating Python with a fatal error would provide the worst of
all user experiences. -1 on that.

vstinner · 2010-05-05T08:58:00Z

manpage for nl_langinfo() doesn't mention any errors that could
be raised by it

It's more about get_codeset(). This function can fail for different reasons:

nl_langinfo() result is an empty string: "If item is not valid, a pointer to an empty string is returned." say the manpage
_PyCodec_Lookup() failed: unable to import the encoding codec module, there is no such codec, codec machinery is broken, etc.
the codec has no "name "attribute
strdup() failure (no more memory)

Do you think that you should fallback to ASCII if nl_langinfo() result is an empty string, and UTF-8 otherwise? get_codeset() failure is very unlikely, and I think that fallback to UTF-8 is just fine. A warning is printed to stderr, the user should try to understand why get_codeset() failed.

You can at least reproduce the _PyCodec_Lookup() error with bpo-8611.

My problem is also that the file system encoding is required (encoding != None) by os.environ mapping with my os.environb patch. (bpo-8603)

malemburg · 2010-05-05T09:16:42Z

STINNER Victor wrote:

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

> manpage for nl_langinfo() doesn't mention any errors that could
> be raised by it

It's more about get_codeset(). This function can fail for different reasons:

nl_langinfo() result is an empty string: "If item is not valid, a pointer to an empty string is returned." say the manpage

_PyCodec_Lookup() failed: unable to import the encoding codec module, there is no such codec, codec machinery is broken, etc.

the codec has no "name "attribute

strdup() failure (no more memory)

Do you think that you should fallback to ASCII if nl_langinfo() result is an empty string, and UTF-8 otherwise? get_codeset() failure is very unlikely, and I think that fallback to UTF-8 is just fine. A warning is printed to stderr, the user should try to understand why get_codeset() failed.

I think that using ASCII is a safer choice in case of errors.
Using UTF-8 may be safe for reading file names, but it's not
safe for creating files or directories.

I also think that an application should be able to update the
file system encoding in such an error case (and only in such a case).
The application may have better knowledge about how it's being
used and can provide correct encoding information by other means.

vstinner · 2010-05-05T09:31:03Z

I think that using ASCII is a safer choice in case of errors.

I choosed UTF-8 to keep backward compatibility: PyUnicode_DecodeFSDefaultAndSize() uses utf-8 if Py_FileSystemDefaultEncoding==NULL. If the OS has no nl_langinfo(CODESET) function at all, Python3 uses utf-8.

Using UTF-8 may be safe for reading file names, but it's not
safe for creating files or directories.

Well, I don't know. You are maybe right. And which encoding should be used if nl_langinfo(CODESET) function is missing: ASCII or UTF-8?

UTF-8 is also an optimist choice: I bet that more and more OS will move to UTF-8.

malemburg · 2010-05-05T10:07:00Z

STINNER Victor wrote:

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

> I think that using ASCII is a safer choice in case of errors.

I choosed UTF-8 to keep backward compatibility: PyUnicode_DecodeFSDefaultAndSize() uses utf-8 if Py_FileSystemDefaultEncoding==NULL. If the OS has no nl_langinfo(CODESET) function at all, Python3 uses utf-8.

Ouch, that was a poor choice. In Python we have a tradition to
avoid guessing, if possible. Since we cannot guarantee that the
file system will indeed use UTF-8, it would have been safer to
use ASCII. Not sure why this reasoning wasn't applied for
the file system encoding.

Nothing we can do about now, though.

> Using UTF-8 may be safe for reading file names, but it's not
> safe for creating files or directories.

Well, I don't know. You are maybe right. And which encoding should be used if nl_langinfo(CODESET) function is missing: ASCII or UTF-8?

UTF-8 is also an optimist choice: I bet that more and more OS will move to UTF-8.

I think we should also add a new environment variable to override
the automatic determination of the file system encoding, much like
what we have for the I/O encoding:

PYTHONFSENCODING: Encoding[:errors] used for file system.

(that would need to go on a new ticket, though)

malemburg · 2010-05-05T13:55:56Z

I've opened bpo-8622 for the env. var idea.

vstinner · 2010-05-06T23:50:26Z

New patch:

set encoding to ASCII on nl_langinfo(CODESET) failure
set encoding to ASCII if nl_langinfo(CODESET) is missing
document the changes

NEWS entry: "Issue bpo-8610: Load file system codec at startup, and display a fatal error on failure. Set the file system encoding to ascii if getting the locale encoding failed, or if nl_langinfo(CODESET) function is missing."

vstinner · 2010-05-06T23:52:20Z

"I think that using ASCII is a safer choice in case of errors. (...) Ouch, that was a poor choice."

Ok, you conviced me with your PYTHONFSENCODING suggestion (bpo-8622). Can you review my last patch please?

malemburg · 2010-05-07T09:19:49Z

STINNER Victor wrote:

STINNER Victor <victor.stinner@haypocalc.com> added the comment:

"I think that using ASCII is a safer choice in case of errors. (...) Ouch, that was a poor choice."

Ok, you conviced me with your PYTHONFSENCODING suggestion (bpo-8622). Can you review my last patch please?

I don't think we can change the fallback encoding in 3.2. But you
can start a discussion on python-dev, of course.

The number of Python 3.x users is still small, so perhaps it's still
possible to revert the choice and to use a safer default, which then
results in encoding errors that the user can see in form of tracebacks
rather than just by having actively checking the directory listing
for strange symbols.

Some comments on the patch:

+ fprintf(stderr,
+ "Unable to get the locale encoding: "
+ "fallback to utf-8\n");

This would have to read "... to ASCII"

+ Py_FileSystemDefaultEncoding = "ascii";

+ codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding);
+ if (!codec) {
+ Py_FatalError(
+ "Py_Initialize: unable to load the file system codec");

It's better to use the same approach as above for this situation
as well.

Fatal errors are just not user friendly and will likely cause
bad vibes towards Python3.

E.g.

	fprintf(stderr,
		"Unknown locale encoding: "
		"fallback to ASCII\\n");

You also need to change this line in pythonrun.c:

        /* reset file system default encoding */
        if (!Py_HasFileSystemDefaultEncoding) {
                free((char*)Py_FileSystemDefaultEncoding);
                Py_FileSystemDefaultEncoding = NULL;
        }

I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding
is. If we define a default for the file system encoding,
then why do we bother whether there is one ?

In any case, initfsencoding() would always have to set that
flag to 1.

vstinner · 2010-05-07T10:54:03Z

Le vendredi 07 mai 2010 11:19:52, vous avez écrit :

> Ok, you conviced me with your PYTHONFSENCODING suggestion (bpo-8622). Can
> you review my last patch please?

I don't think we can change the fallback encoding in 3.2. But you
can start a discussion on python-dev, of course.

Ok, I will ask on python-dev.

```
fprintf(stderr,
```
```
	"Unable to get the locale encoding: "
```
```
	"fallback to utf-8\\n");
```

This would have to read "... to ASCII"

Fixed.

Py_FileSystemDefaultEncoding = "ascii";
codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding);

if (!codec) {
Py_FatalError(
    "Py_Initialize: unable to load the file system codec");
It's better to use the same approach as above for this situation
as well.

I choosed to display a fatal error here to give a more revelant error message
to the user. Without the check, _PyCodec_Lookup() will fail anyway, but later
in a random function :-/

The fatal error only occurs in critical situations: no more memory, import
machinery completly broken (eg. bpo-8611), etc. In this case, fallback to ASCII
doesn't help, it will also raise somewhere later.

About nl_langinfo(CODESET): get_codeset() does already reject unknown
encoding. So this call is only done on known encoding names.

You also need to change this line in pythonrun.c:

    /* reset file system default encoding \*/
    if (!Py_HasFileSystemDefaultEncoding) {
            free((char*)Py_FileSystemDefaultEncoding);
            Py_FileSystemDefaultEncoding = NULL;
    }

Fixed. This test only match if get_codeset() is used: I choosed to set the
encoding to ASCII with Py_HasFileSystemDefaultEncoding=0.

I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding
is.

Its name doesn't help. It's just a flag to tell if free() should be called or
not... (see _Py_SetFileSystemEncoding()).

In any case, initfsencoding() would always have to set that
flag to 1.

initfsencoding() is a static function and it's only called by
Py_InitializeEx(). Can Py_InitializeEx() be called multiple times?

malemburg · 2010-05-07T12:09:09Z

STINNER Victor wrote:

> + codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding);
> + if (!codec) {
> + Py_FatalError(
> + "Py_Initialize: unable to load the file system codec");
>
> It's better to use the same approach as above for this situation
> as well.

I choosed to display a fatal error here to give a more revelant error message
to the user. Without the check, _PyCodec_Lookup() will fail anyway, but later
in a random function :-/

The fatal error only occurs in critical situations: no more memory, import
machinery completly broken (eg. bpo-8611), etc. In this case, fallback to ASCII
doesn't help, it will also raise somewhere later.

About nl_langinfo(CODESET): get_codeset() does already reject unknown
encoding. So this call is only done on known encoding names.

Ok, please add a comment to that part explaining why it can only
fail as result of some other serious error and not because
the Py_FileSystemDefaultEncoding was not found to be supported
by Python.

> I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding
> is.

Its name doesn't help. It's just a flag to tell if free() should be called or
not... (see _Py_SetFileSystemEncoding()).

Interesting... I would associate a completely different meaning
with it.

Scratch my comment on that flag then.

> In any case, initfsencoding() would always have to set that
> flag to 1.

initfsencoding() is a static function and it's only called by
Py_InitializeEx(). Can Py_InitializeEx() be called multiple times?

Well, it shouldn't be called multiple times, but then you never
know how embedded Python interpreters are used.

vstinner · 2010-05-07T16:42:12Z

Version 4: I forgot #include <langinfo.h> in bltinmodule.c.

vstinner · 2010-05-07T23:51:02Z

I realized that fallback to ASCII instead of UTF-8 is not possible yet because of bpo-8611: if it fallbacks to ASCII, it's not more possible to run Python in a non-ASCII directory. I have a patch set fixing bpo-8611 but it's huge and complex. I will not be fixed quickly (if it would be possible someday to fix it).

My new patch fallback to utf-8 instead of ascii, even if I agree that it would be better to fallback to ascii. Improve unicode, surrogates & friends is complex, and I prefer to fix bugs step by step. I propose to first ensure that Py_FileSystemEncoding is always set, and later write a new patch to fallback to ASCII instead of UTF-8.

Patch version 5:

fallback to utf-8 instead of ascii
Set Py_FileSystemDefaultEncoding to NULL to Py_Finalize(): thanks to that, it should be possible to call Py_InitializeEx() (initfsencoding()) twice or more
initfsencoding() doesn't call _PyCodec_Lookup() on get_codeset() success because get_codeset() does already call it
explain that the fatal error is very unlikely

vstinner · 2010-05-15T12:39:46Z

I commited the last patch (fall back to UTF-8): r81190 (3.x), blocked in 3.1 (r81191).

I opened a new issue for the UTF-8/ASCII fallback: bpo-8725, because the ASCII fallback is a different issue.

vstinner added interpreter-core (Objects, Python, Grammar, and Parser dirs) topic-unicode labels May 4, 2010

vstinner closed this as completed May 15, 2010

ezio-melotti transferred this issue from another repository Apr 10, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python3/POSIX: errors if file system encoding is None #52856

Python3/POSIX: errors if file system encoding is None #52856

vstinner commented May 4, 2010

vstinner commented May 4, 2010

vstinner commented May 4, 2010

vstinner commented May 5, 2010

malemburg commented May 5, 2010

vstinner commented May 5, 2010

malemburg commented May 5, 2010

vstinner commented May 5, 2010

malemburg commented May 5, 2010

malemburg commented May 5, 2010

vstinner commented May 6, 2010

vstinner commented May 6, 2010

malemburg commented May 7, 2010

vstinner commented May 7, 2010

malemburg commented May 7, 2010

vstinner commented May 7, 2010

vstinner commented May 7, 2010

vstinner commented May 15, 2010

Python3/POSIX: errors if file system encoding is None #52856

Python3/POSIX: errors if file system encoding is None #52856

Comments

vstinner commented May 4, 2010

vstinner commented May 4, 2010

vstinner commented May 4, 2010

vstinner commented May 5, 2010

malemburg commented May 5, 2010

vstinner commented May 5, 2010

malemburg commented May 5, 2010

vstinner commented May 5, 2010

malemburg commented May 5, 2010

malemburg commented May 5, 2010

vstinner commented May 6, 2010

vstinner commented May 6, 2010

malemburg commented May 7, 2010

vstinner commented May 7, 2010

malemburg commented May 7, 2010

vstinner commented May 7, 2010

vstinner commented May 7, 2010

vstinner commented May 15, 2010