classification
Title: Python3/POSIX: errors if file system encoding is None
Type: Stage:
Components: Interpreter Core, Unicode Versions: Python 3.2
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: Arfrever, lemburg, loewis, pitrou, vstinner
Priority: normal Keywords: patch

Created on 2010-05-04 11:58 by vstinner, last changed 2010-05-15 12:39 by vstinner. This issue is now closed.

Files
File name Uploaded Description Edit
initfsencoding-4.patch vstinner, 2010-05-07 16:42
initfsencoding-5-utf-8.patch vstinner, 2010-05-07 23:51
Messages (17)
msg104924 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-04 11:58
On POSIX (but not on Mac OS X), Python3 calls get_codeset() to get the file system encoding. If this function fails, sys.getfilesystemencoding() returns None.  PyUnicode_DecodeFSDefaultAndSize() fallbacks to utf-8 whereas subprocess fail:

  ...
  File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 670, in __init__
    restore_signals, start_new_session)
  File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 1101, in _execute_child
    executable_list = (fs_encode(executable),)
  File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 1088, in fs_encode
    return s.encode(fs_encoding, 'surrogateescape')
TypeError: encode() argument 1 must be string, not None

We have two choices: raise a fatal error if get_codeset() failed, or fallback to utf-8.

On Windows and Mac OS X, get_codeset() shouldn't be called because the result is just dropped. We should call _PyCodec_Lookup(Py_FileSystemDefaultEncoding) instead to ensure that the file system encoding can be loaded.
msg104984 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-04 23:34
Here is a patch for the first solution: display a fatal error if we are unable to get the locale encoding. It does always exit with a fatal error if nl_langinfo(CODESET) is not available (and Py_FileSystemDefaultEncoding is not set).

I don't think it's a good idea to display an fatal error at runtime. If nl_langinfo(CODESET) is not available, configure should fail or we should fallback to an hardcoded encoding (ok but which one?).

Extract of the nl_langinfo() manual page (on Linux):

CONFORMING TO
       SUSv2, POSIX.1-2001.
msg104986 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-05 00:15
Patch for the second solution (fallback to utf-8 on get_codeset() failure):

 - create a subfunction initfsencoding() (Py_InitializeEx is already very long)
 - hardcode the encoding to utf-8 if nl_langinfo(CODESET) is missing
 - don't call get_codeset() on Windows or Mac OS X
 - call _PyCodec_Lookup(Py_FileSystemDefaultEncoding) if get_codeset() was not called (eg. on Windows) or if get_codeset() failed to ensure that the codec can be (and is) loaded: display a fatal error on failure

Since I wrote patches for both solution, I can now compare correctly advantages and disavantages. I prefer initfsencoding() because it works on all cases and is simpler than no_fsencoding_error.patch.
msg105002 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-05 07:45
STINNER Victor wrote:
> 
> I don't think it's a good idea to display an fatal error at runtime. If nl_langinfo(CODESET) is not available, configure should fail or we should fallback to an hardcoded encoding (ok but which one?).

If nl_langinfo(CODESET) fails, Python should assume the default
locale, which is "C" on POSIX platforms. The "C" locale uses
ASCII as encoding, so Python should use that as well. Note that the
manpage for nl_langinfo() doesn't mention any errors that could
be raised by it:

"""
RETURN VALUE
       If  no  locale  has  been  selected  for  the appropriate category, nl_langinfo() returns a
       pointer to the corresponding string in the "C" locale.

       If item is not valid, a pointer to an empty string is returned.

       This pointer may point to static data that may be overwritten on the next call to  nl_lang‐
       info() or setlocale(3).
"""

As with all locale APIs, it is not thread-safe, which can become
an issues if Python gets embedded in a multi-threaded application.

There's also another issue: it's possible that nl_langinfo(CODESET)
returns an encoding which is not known to Python.

In such a case, it would be best to issue a warning to the
user and fall back to ASCII as in the "C" locale case.

Terminating Python with a fatal error would provide the worst of
all user experiences. -1 on that.
msg105005 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-05 08:57
> manpage for nl_langinfo() doesn't mention any errors that could
> be raised by it

It's more about get_codeset(). This function can fail for different reasons:

 - nl_langinfo() result is an empty string: "If item is not valid, a pointer to an empty string is returned." say the manpage
 - _PyCodec_Lookup() failed: unable to import the encoding codec module, there is no such codec, codec machinery is broken, etc.
 - the codec has no "name "attribute
 - strdup() failure (no more memory)

Do you think that you should fallback to ASCII if nl_langinfo() result is an empty string, and UTF-8 otherwise? get_codeset() failure is very unlikely, and I think that fallback to UTF-8 is just fine. A warning is printed to stderr, the user should try to understand why get_codeset() failed.

You can at least reproduce the _PyCodec_Lookup() error with #8611.

My problem is also that the file system encoding is required (encoding != None) by os.environ mapping with my os.environb patch. (#8603)
msg105007 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-05 09:16
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> manpage for nl_langinfo() doesn't mention any errors that could
>> be raised by it
> 
> It's more about get_codeset(). This function can fail for different reasons:
> 
>  - nl_langinfo() result is an empty string: "If item is not valid, a pointer to an empty string is returned." say the manpage
>  - _PyCodec_Lookup() failed: unable to import the encoding codec module, there is no such codec, codec machinery is broken, etc.
>  - the codec has no "name "attribute
>  - strdup() failure (no more memory)
> 
> Do you think that you should fallback to ASCII if nl_langinfo() result is an empty string, and UTF-8 otherwise? get_codeset() failure is very unlikely, and I think that fallback to UTF-8 is just fine. A warning is printed to stderr, the user should try to understand why get_codeset() failed.

I think that using ASCII is a safer choice in case of errors.
Using UTF-8 may be safe for reading file names, but it's not
safe for creating files or directories.

I also think that an application should be able to update the
file system encoding in such an error case (and only in such a case).
The application may have better knowledge about how it's being
used and can provide correct encoding information by other means.
msg105008 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-05 09:31
> I think that using ASCII is a safer choice in case of errors.

I choosed UTF-8 to keep backward compatibility: PyUnicode_DecodeFSDefaultAndSize() uses utf-8 if Py_FileSystemDefaultEncoding==NULL. If the OS has no nl_langinfo(CODESET) function at all, Python3 uses utf-8.

> Using UTF-8 may be safe for reading file names, but it's not
> safe for creating files or directories.

Well, I don't know. You are maybe right. And which encoding should be used if nl_langinfo(CODESET) function is missing: ASCII or UTF-8?

UTF-8 is also an optimist choice: I bet that more and more OS will move to UTF-8.
msg105010 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-05 10:07
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> I think that using ASCII is a safer choice in case of errors.
> 
> I choosed UTF-8 to keep backward compatibility: PyUnicode_DecodeFSDefaultAndSize() uses utf-8 if Py_FileSystemDefaultEncoding==NULL. If the OS has no nl_langinfo(CODESET) function at all, Python3 uses utf-8.

Ouch, that was a poor choice. In Python we have a tradition to
avoid guessing, if possible. Since we cannot guarantee that the
file system will indeed use UTF-8, it would have been safer to
use ASCII. Not sure why this reasoning wasn't applied for
the file system encoding.

Nothing we can do about now, though.

>> Using UTF-8 may be safe for reading file names, but it's not
>> safe for creating files or directories.
> 
> Well, I don't know. You are maybe right. And which encoding should be used if nl_langinfo(CODESET) function is missing: ASCII or UTF-8?
> 
> UTF-8 is also an optimist choice: I bet that more and more OS will move to UTF-8.

I think we should also add a new environment variable to override
the automatic determination of the file system encoding, much like
what we have for the I/O encoding:

PYTHONFSENCODING: Encoding[:errors] used for file system.

(that would need to go on a new ticket, though)
msg105031 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-05 13:55
I've opened issue8622 for the env. var idea.
msg105175 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-06 23:50
New patch:
 - set encoding to ASCII on nl_langinfo(CODESET) failure
 - set encoding to ASCII if nl_langinfo(CODESET) is missing
 - document the changes

NEWS entry: "Issue #8610: Load file system codec at startup, and display a fatal error on failure. Set the file system encoding to ascii if getting the locale encoding failed, or if nl_langinfo(CODESET) function is missing."
msg105176 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-06 23:52
"I think that using ASCII is a safer choice in case of errors. (...) Ouch, that was a poor choice."

Ok, you conviced me with your PYTHONFSENCODING suggestion (#8622). Can you review my last patch please?
msg105188 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-07 09:19
STINNER Victor wrote:
> 
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> "I think that using ASCII is a safer choice in case of errors. (...) Ouch, that was a poor choice."
> 
> Ok, you conviced me with your PYTHONFSENCODING suggestion (#8622). Can you review my last patch please?

I don't think we can change the fallback encoding in 3.2. But you
can start a discussion on python-dev, of course.

The number of Python 3.x users is still small, so perhaps it's still
possible to revert the choice and to use a safer default, which then
results in encoding errors that the user can see in form of tracebacks
rather than just by having actively checking the directory listing
for strange symbols.

Some comments on the patch:

+		fprintf(stderr,
+			"Unable to get the locale encoding: "
+			"fallback to utf-8\n");

This would have to read "... to ASCII"

+		Py_FileSystemDefaultEncoding = "ascii";

+	codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding);
+	if (!codec) {
+		Py_FatalError(
+		    "Py_Initialize: unable to load the file system codec");

It's better to use the same approach as above for this situation
as well.

Fatal errors are just not user friendly and will likely cause
bad vibes towards Python3.

E.g.

		fprintf(stderr,
			"Unknown locale encoding: "
			"fallback to ASCII\n");

You also need to change this line in pythonrun.c:

        /* reset file system default encoding */
        if (!Py_HasFileSystemDefaultEncoding) {
                free((char*)Py_FileSystemDefaultEncoding);
                Py_FileSystemDefaultEncoding = NULL;
        }

I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding
is. If we define a default for the file system encoding,
then why do we bother whether there is one ?

In any case, initfsencoding() would always have to set that
flag to 1.
msg105189 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-07 10:54
Le vendredi 07 mai 2010 11:19:52, vous avez écrit :
> > Ok, you conviced me with your PYTHONFSENCODING suggestion (#8622). Can
> > you review my last patch please?
> 
> I don't think we can change the fallback encoding in 3.2. But you
> can start a discussion on python-dev, of course.

Ok, I will ask on python-dev.

> +		fprintf(stderr,
> +			"Unable to get the locale encoding: "
> +			"fallback to utf-8\n");
> 
> This would have to read "... to ASCII"

Fixed.

> 
> +		Py_FileSystemDefaultEncoding = "ascii";
> 
> +	codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding);
> +	if (!codec) {
> +		Py_FatalError(
> +		    "Py_Initialize: unable to load the file system codec");
> 
> It's better to use the same approach as above for this situation
> as well.

I choosed to display a fatal error here to give a more revelant error message 
to the user. Without the check, _PyCodec_Lookup() will fail anyway, but later 
in a random function :-/

The fatal error only occurs in critical situations: no more memory, import 
machinery completly broken (eg. #8611), etc. In this case, fallback to ASCII 
doesn't help, it will also raise somewhere later.

About nl_langinfo(CODESET): get_codeset() does already reject unknown 
encoding. So this call is only done on known encoding names.

> You also need to change this line in pythonrun.c:
> 
>         /* reset file system default encoding */
>         if (!Py_HasFileSystemDefaultEncoding) {
>                 free((char*)Py_FileSystemDefaultEncoding);
>                 Py_FileSystemDefaultEncoding = NULL;
>         }

Fixed. This test only match if get_codeset() is used: I choosed to set the 
encoding to ASCII with Py_HasFileSystemDefaultEncoding=0.

> I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding
> is.

Its name doesn't help. It's just a flag to tell if free() should be called or 
not... (see _Py_SetFileSystemEncoding()).

> In any case, initfsencoding() would always have to set that
> flag to 1.

initfsencoding() is a static function and it's only called by 
Py_InitializeEx(). Can Py_InitializeEx() be called multiple times?
msg105195 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-05-07 12:09
STINNER Victor wrote:
> 
>> +	codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding);
>> +	if (!codec) {
>> +		Py_FatalError(
>> +		    "Py_Initialize: unable to load the file system codec");
>>
>> It's better to use the same approach as above for this situation
>> as well.
> 
> I choosed to display a fatal error here to give a more revelant error message 
> to the user. Without the check, _PyCodec_Lookup() will fail anyway, but later 
> in a random function :-/
> 
> The fatal error only occurs in critical situations: no more memory, import 
> machinery completly broken (eg. #8611), etc. In this case, fallback to ASCII 
> doesn't help, it will also raise somewhere later.
> 
> About nl_langinfo(CODESET): get_codeset() does already reject unknown 
> encoding. So this call is only done on known encoding names.

Ok, please add a comment to that part explaining why it can only
fail as result of some other serious error and not because
the Py_FileSystemDefaultEncoding was not found to be supported
by Python.

>> I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding
>> is.
> 
> Its name doesn't help. It's just a flag to tell if free() should be called or 
> not... (see _Py_SetFileSystemEncoding()).

Interesting... I would associate a completely different meaning
with it.

Scratch my comment on that flag then.

>> In any case, initfsencoding() would always have to set that
>> flag to 1.
> 
> initfsencoding() is a static function and it's only called by 
> Py_InitializeEx(). Can Py_InitializeEx() be called multiple times?

Well, it shouldn't be called multiple times, but then you never
know how embedded Python interpreters are used.
msg105212 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-07 16:42
Version 4: I forgot #include <langinfo.h> in bltinmodule.c.
msg105247 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-07 23:51
I realized that fallback to ASCII instead of UTF-8 is not possible yet because of #8611: if it fallbacks to ASCII, it's not more possible to run Python in a non-ASCII directory. I have a patch set fixing #8611 but it's huge and complex. I will not be fixed quickly (if it would be possible someday to fix it).

My new patch fallback to utf-8 instead of ascii, even if I agree that it would be better to fallback to ascii. Improve unicode, surrogates & friends is complex, and I prefer to fix bugs step by step. I propose to first ensure that Py_FileSystemEncoding is always set, and later write a new patch to fallback to ASCII instead of UTF-8.

Patch version 5:
 - fallback to utf-8 instead of ascii
 - Set Py_FileSystemDefaultEncoding to NULL to Py_Finalize(): thanks to that, it should be possible to call Py_InitializeEx() (initfsencoding()) twice or more
 - initfsencoding() doesn't call _PyCodec_Lookup() on get_codeset() success because  get_codeset() does already call it
 - explain that the fatal error is very unlikely
msg105805 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-05-15 12:39
I commited the last patch (fall back to UTF-8): r81190 (3.x), blocked in 3.1 (r81191).

I opened a new issue for the UTF-8/ASCII fallback: #8725, because the ASCII fallback is a different issue.
History
Date User Action Args
2010-05-15 12:39:45vstinnersetstatus: open -> closed
resolution: fixed
messages: + msg105805
2010-05-07 23:51:06vstinnersetfiles: + initfsencoding-5-utf-8.patch

messages: + msg105247
2010-05-07 16:42:20vstinnersetfiles: - initfsencoding-3.patch
2010-05-07 16:42:13vstinnersetfiles: + initfsencoding-4.patch

messages: + msg105212
2010-05-07 12:09:09lemburgsetmessages: + msg105195
2010-05-07 10:54:32vstinnersetfiles: - initfsencoding-2.patch
2010-05-07 10:54:22vstinnersetfiles: + initfsencoding-3.patch
2010-05-07 10:54:04vstinnersetmessages: + msg105189
2010-05-07 09:19:51lemburgsetmessages: + msg105188
2010-05-06 23:52:20vstinnersetmessages: + msg105176
2010-05-06 23:50:36vstinnersetfiles: - initfsencoding.patch
2010-05-06 23:50:33vstinnersetfiles: - no_fsencoding_error.patch
2010-05-06 23:50:28vstinnersetfiles: + initfsencoding-2.patch

messages: + msg105175
2010-05-05 13:55:56lemburgsetmessages: + msg105031
2010-05-05 10:07:00lemburgsetmessages: + msg105010
2010-05-05 09:31:03vstinnersetmessages: + msg105008
2010-05-05 09:16:43lemburgsetmessages: + msg105007
2010-05-05 08:58:01vstinnersetmessages: + msg105005
2010-05-05 07:45:19lemburgsetnosy: + lemburg
messages: + msg105002
2010-05-05 00:30:54Arfreversetnosy: + Arfrever
2010-05-05 00:22:10pitrousetnosy: + loewis
2010-05-05 00:20:18vstinnersetnosy: + pitrou
2010-05-05 00:15:51vstinnersetfiles: + initfsencoding.patch

messages: + msg104986
2010-05-04 23:34:25vstinnersetfiles: + no_fsencoding_error.patch
keywords: + patch
messages: + msg104984
2010-05-04 11:58:56vstinnercreate