Issue 8610: Python3/POSIX: errors if file system encoding is None

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/52856

classification

Title:	Python3/POSIX: errors if file system encoding is None
Type:		Stage:
Components:	Interpreter Core, Unicode	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	Arfrever, lemburg, loewis, pitrou, vstinner
Priority:	normal	Keywords:	patch

Created on 2010-05-04 11:58 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
initfsencoding-4.patch	vstinner, 2010-05-07 16:42
initfsencoding-5-utf-8.patch	vstinner, 2010-05-07 23:51

Messages (17)
msg104924 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-04 11:58
On POSIX (but not on Mac OS X), Python3 calls get_codeset() to get the file system encoding. If this function fails, sys.getfilesystemencoding() returns None. PyUnicode_DecodeFSDefaultAndSize() fallbacks to utf-8 whereas subprocess fail: ... File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 670, in __init__ restore_signals, start_new_session) File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 1101, in _execute_child executable_list = (fs_encode(executable),) File "/home/SHARE/SVN/py3k/Lib/subprocess.py", line 1088, in fs_encode return s.encode(fs_encoding, 'surrogateescape') TypeError: encode() argument 1 must be string, not None We have two choices: raise a fatal error if get_codeset() failed, or fallback to utf-8. On Windows and Mac OS X, get_codeset() shouldn't be called because the result is just dropped. We should call _PyCodec_Lookup(Py_FileSystemDefaultEncoding) instead to ensure that the file system encoding can be loaded.
msg104984 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-04 23:34
Here is a patch for the first solution: display a fatal error if we are unable to get the locale encoding. It does always exit with a fatal error if nl_langinfo(CODESET) is not available (and Py_FileSystemDefaultEncoding is not set). I don't think it's a good idea to display an fatal error at runtime. If nl_langinfo(CODESET) is not available, configure should fail or we should fallback to an hardcoded encoding (ok but which one?). Extract of the nl_langinfo() manual page (on Linux): CONFORMING TO SUSv2, POSIX.1-2001.
msg104986 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-05 00:15
Patch for the second solution (fallback to utf-8 on get_codeset() failure): - create a subfunction initfsencoding() (Py_InitializeEx is already very long) - hardcode the encoding to utf-8 if nl_langinfo(CODESET) is missing - don't call get_codeset() on Windows or Mac OS X - call _PyCodec_Lookup(Py_FileSystemDefaultEncoding) if get_codeset() was not called (eg. on Windows) or if get_codeset() failed to ensure that the codec can be (and is) loaded: display a fatal error on failure Since I wrote patches for both solution, I can now compare correctly advantages and disavantages. I prefer initfsencoding() because it works on all cases and is simpler than no_fsencoding_error.patch.
msg105002 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-05 07:45
STINNER Victor wrote: > > I don't think it's a good idea to display an fatal error at runtime. If nl_langinfo(CODESET) is not available, configure should fail or we should fallback to an hardcoded encoding (ok but which one?). If nl_langinfo(CODESET) fails, Python should assume the default locale, which is "C" on POSIX platforms. The "C" locale uses ASCII as encoding, so Python should use that as well. Note that the manpage for nl_langinfo() doesn't mention any errors that could be raised by it: """ RETURN VALUE If no locale has been selected for the appropriate category, nl_langinfo() returns a pointer to the corresponding string in the "C" locale. If item is not valid, a pointer to an empty string is returned. This pointer may point to static data that may be overwritten on the next call to nl_lang‐ info() or setlocale(3). """ As with all locale APIs, it is not thread-safe, which can become an issues if Python gets embedded in a multi-threaded application. There's also another issue: it's possible that nl_langinfo(CODESET) returns an encoding which is not known to Python. In such a case, it would be best to issue a warning to the user and fall back to ASCII as in the "C" locale case. Terminating Python with a fatal error would provide the worst of all user experiences. -1 on that.
msg105005 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-05 08:57
> manpage for nl_langinfo() doesn't mention any errors that could > be raised by it It's more about get_codeset(). This function can fail for different reasons: - nl_langinfo() result is an empty string: "If item is not valid, a pointer to an empty string is returned." say the manpage - _PyCodec_Lookup() failed: unable to import the encoding codec module, there is no such codec, codec machinery is broken, etc. - the codec has no "name "attribute - strdup() failure (no more memory) Do you think that you should fallback to ASCII if nl_langinfo() result is an empty string, and UTF-8 otherwise? get_codeset() failure is very unlikely, and I think that fallback to UTF-8 is just fine. A warning is printed to stderr, the user should try to understand why get_codeset() failed. You can at least reproduce the _PyCodec_Lookup() error with #8611. My problem is also that the file system encoding is required (encoding != None) by os.environ mapping with my os.environb patch. (#8603)
msg105007 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-05 09:16
STINNER Victor wrote: > > STINNER Victor <victor.stinner@haypocalc.com> added the comment: > >> manpage for nl_langinfo() doesn't mention any errors that could >> be raised by it > > It's more about get_codeset(). This function can fail for different reasons: > > - nl_langinfo() result is an empty string: "If item is not valid, a pointer to an empty string is returned." say the manpage > - _PyCodec_Lookup() failed: unable to import the encoding codec module, there is no such codec, codec machinery is broken, etc. > - the codec has no "name "attribute > - strdup() failure (no more memory) > > Do you think that you should fallback to ASCII if nl_langinfo() result is an empty string, and UTF-8 otherwise? get_codeset() failure is very unlikely, and I think that fallback to UTF-8 is just fine. A warning is printed to stderr, the user should try to understand why get_codeset() failed. I think that using ASCII is a safer choice in case of errors. Using UTF-8 may be safe for reading file names, but it's not safe for creating files or directories. I also think that an application should be able to update the file system encoding in such an error case (and only in such a case). The application may have better knowledge about how it's being used and can provide correct encoding information by other means.
msg105008 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-05 09:31
> I think that using ASCII is a safer choice in case of errors. I choosed UTF-8 to keep backward compatibility: PyUnicode_DecodeFSDefaultAndSize() uses utf-8 if Py_FileSystemDefaultEncoding==NULL. If the OS has no nl_langinfo(CODESET) function at all, Python3 uses utf-8. > Using UTF-8 may be safe for reading file names, but it's not > safe for creating files or directories. Well, I don't know. You are maybe right. And which encoding should be used if nl_langinfo(CODESET) function is missing: ASCII or UTF-8? UTF-8 is also an optimist choice: I bet that more and more OS will move to UTF-8.
msg105010 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-05 10:07
STINNER Victor wrote: > > STINNER Victor <victor.stinner@haypocalc.com> added the comment: > >> I think that using ASCII is a safer choice in case of errors. > > I choosed UTF-8 to keep backward compatibility: PyUnicode_DecodeFSDefaultAndSize() uses utf-8 if Py_FileSystemDefaultEncoding==NULL. If the OS has no nl_langinfo(CODESET) function at all, Python3 uses utf-8. Ouch, that was a poor choice. In Python we have a tradition to avoid guessing, if possible. Since we cannot guarantee that the file system will indeed use UTF-8, it would have been safer to use ASCII. Not sure why this reasoning wasn't applied for the file system encoding. Nothing we can do about now, though. >> Using UTF-8 may be safe for reading file names, but it's not >> safe for creating files or directories. > > Well, I don't know. You are maybe right. And which encoding should be used if nl_langinfo(CODESET) function is missing: ASCII or UTF-8? > > UTF-8 is also an optimist choice: I bet that more and more OS will move to UTF-8. I think we should also add a new environment variable to override the automatic determination of the file system encoding, much like what we have for the I/O encoding: PYTHONFSENCODING: Encoding[:errors] used for file system. (that would need to go on a new ticket, though)
msg105031 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-05 13:55
I've opened issue8622 for the env. var idea.
msg105175 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-06 23:50
New patch: - set encoding to ASCII on nl_langinfo(CODESET) failure - set encoding to ASCII if nl_langinfo(CODESET) is missing - document the changes NEWS entry: "Issue #8610: Load file system codec at startup, and display a fatal error on failure. Set the file system encoding to ascii if getting the locale encoding failed, or if nl_langinfo(CODESET) function is missing."
msg105176 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-06 23:52
"I think that using ASCII is a safer choice in case of errors. (...) Ouch, that was a poor choice." Ok, you conviced me with your PYTHONFSENCODING suggestion (#8622). Can you review my last patch please?
msg105188 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-07 09:19
STINNER Victor wrote: > > STINNER Victor <victor.stinner@haypocalc.com> added the comment: > > "I think that using ASCII is a safer choice in case of errors. (...) Ouch, that was a poor choice." > > Ok, you conviced me with your PYTHONFSENCODING suggestion (#8622). Can you review my last patch please? I don't think we can change the fallback encoding in 3.2. But you can start a discussion on python-dev, of course. The number of Python 3.x users is still small, so perhaps it's still possible to revert the choice and to use a safer default, which then results in encoding errors that the user can see in form of tracebacks rather than just by having actively checking the directory listing for strange symbols. Some comments on the patch: + fprintf(stderr, + "Unable to get the locale encoding: " + "fallback to utf-8\n"); This would have to read "... to ASCII" + Py_FileSystemDefaultEncoding = "ascii"; + codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding); + if (!codec) { + Py_FatalError( + "Py_Initialize: unable to load the file system codec"); It's better to use the same approach as above for this situation as well. Fatal errors are just not user friendly and will likely cause bad vibes towards Python3. E.g. fprintf(stderr, "Unknown locale encoding: " "fallback to ASCII\n"); You also need to change this line in pythonrun.c: /* reset file system default encoding / if (!Py_HasFileSystemDefaultEncoding) { free((char)Py_FileSystemDefaultEncoding); Py_FileSystemDefaultEncoding = NULL; } I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding is. If we define a default for the file system encoding, then why do we bother whether there is one ? In any case, initfsencoding() would always have to set that flag to 1.
msg105189 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-07 10:54
Le vendredi 07 mai 2010 11:19:52, vous avez écrit : > > Ok, you conviced me with your PYTHONFSENCODING suggestion (#8622). Can > > you review my last patch please? > > I don't think we can change the fallback encoding in 3.2. But you > can start a discussion on python-dev, of course. Ok, I will ask on python-dev. > + fprintf(stderr, > + "Unable to get the locale encoding: " > + "fallback to utf-8\n"); > > This would have to read "... to ASCII" Fixed. > > + Py_FileSystemDefaultEncoding = "ascii"; > > + codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding); > + if (!codec) { > + Py_FatalError( > + "Py_Initialize: unable to load the file system codec"); > > It's better to use the same approach as above for this situation > as well. I choosed to display a fatal error here to give a more revelant error message to the user. Without the check, _PyCodec_Lookup() will fail anyway, but later in a random function :-/ The fatal error only occurs in critical situations: no more memory, import machinery completly broken (eg. #8611), etc. In this case, fallback to ASCII doesn't help, it will also raise somewhere later. About nl_langinfo(CODESET): get_codeset() does already reject unknown encoding. So this call is only done on known encoding names. > You also need to change this line in pythonrun.c: > > /* reset file system default encoding / > if (!Py_HasFileSystemDefaultEncoding) { > free((char)Py_FileSystemDefaultEncoding); > Py_FileSystemDefaultEncoding = NULL; > } Fixed. This test only match if get_codeset() is used: I choosed to set the encoding to ASCII with Py_HasFileSystemDefaultEncoding=0. > I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding > is. Its name doesn't help. It's just a flag to tell if free() should be called or not... (see _Py_SetFileSystemEncoding()). > In any case, initfsencoding() would always have to set that > flag to 1. initfsencoding() is a static function and it's only called by Py_InitializeEx(). Can Py_InitializeEx() be called multiple times?
msg105195 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2010-05-07 12:09
STINNER Victor wrote: > >> + codec = _PyCodec_Lookup(Py_FileSystemDefaultEncoding); >> + if (!codec) { >> + Py_FatalError( >> + "Py_Initialize: unable to load the file system codec"); >> >> It's better to use the same approach as above for this situation >> as well. > > I choosed to display a fatal error here to give a more revelant error message > to the user. Without the check, _PyCodec_Lookup() will fail anyway, but later > in a random function :-/ > > The fatal error only occurs in critical situations: no more memory, import > machinery completly broken (eg. #8611), etc. In this case, fallback to ASCII > doesn't help, it will also raise somewhere later. > > About nl_langinfo(CODESET): get_codeset() does already reject unknown > encoding. So this call is only done on known encoding names. Ok, please add a comment to that part explaining why it can only fail as result of some other serious error and not because the Py_FileSystemDefaultEncoding was not found to be supported by Python. >> I'm not sure what the purpose of Py_HasFileSystemDefaultEncoding >> is. > > Its name doesn't help. It's just a flag to tell if free() should be called or > not... (see _Py_SetFileSystemEncoding()). Interesting... I would associate a completely different meaning with it. Scratch my comment on that flag then. >> In any case, initfsencoding() would always have to set that >> flag to 1. > > initfsencoding() is a static function and it's only called by > Py_InitializeEx(). Can Py_InitializeEx() be called multiple times? Well, it shouldn't be called multiple times, but then you never know how embedded Python interpreters are used.
msg105212 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-07 16:42
Version 4: I forgot #include <langinfo.h> in bltinmodule.c.
msg105247 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-07 23:51
I realized that fallback to ASCII instead of UTF-8 is not possible yet because of #8611: if it fallbacks to ASCII, it's not more possible to run Python in a non-ASCII directory. I have a patch set fixing #8611 but it's huge and complex. I will not be fixed quickly (if it would be possible someday to fix it). My new patch fallback to utf-8 instead of ascii, even if I agree that it would be better to fallback to ascii. Improve unicode, surrogates & friends is complex, and I prefer to fix bugs step by step. I propose to first ensure that Py_FileSystemEncoding is always set, and later write a new patch to fallback to ASCII instead of UTF-8. Patch version 5: - fallback to utf-8 instead of ascii - Set Py_FileSystemDefaultEncoding to NULL to Py_Finalize(): thanks to that, it should be possible to call Py_InitializeEx() (initfsencoding()) twice or more - initfsencoding() doesn't call _PyCodec_Lookup() on get_codeset() success because get_codeset() does already call it - explain that the fatal error is very unlikely
msg105805 - (view)	Author: STINNER Victor (vstinner) *	Date: 2010-05-15 12:39
I commited the last patch (fall back to UTF-8): r81190 (3.x), blocked in 3.1 (r81191). I opened a new issue for the UTF-8/ASCII fallback: #8725, because the ASCII fallback is a different issue.

History
Date	User	Action	Args
2022-04-11 14:57:00	admin	set	github: 52856
2010-05-15 12:39:45	vstinner	set	status: open -> closed resolution: fixed messages: + msg105805
2010-05-07 23:51:06	vstinner	set	files: + initfsencoding-5-utf-8.patch messages: + msg105247
2010-05-07 16:42:20	vstinner	set	files: - initfsencoding-3.patch
2010-05-07 16:42:13	vstinner	set	files: + initfsencoding-4.patch messages: + msg105212
2010-05-07 12:09:09	lemburg	set	messages: + msg105195
2010-05-07 10:54:32	vstinner	set	files: - initfsencoding-2.patch
2010-05-07 10:54:22	vstinner	set	files: + initfsencoding-3.patch
2010-05-07 10:54:04	vstinner	set	messages: + msg105189
2010-05-07 09:19:51	lemburg	set	messages: + msg105188
2010-05-06 23:52:20	vstinner	set	messages: + msg105176
2010-05-06 23:50:36	vstinner	set	files: - initfsencoding.patch
2010-05-06 23:50:33	vstinner	set	files: - no_fsencoding_error.patch
2010-05-06 23:50:28	vstinner	set	files: + initfsencoding-2.patch messages: + msg105175
2010-05-05 13:55:56	lemburg	set	messages: + msg105031
2010-05-05 10:07:00	lemburg	set	messages: + msg105010
2010-05-05 09:31:03	vstinner	set	messages: + msg105008
2010-05-05 09:16:43	lemburg	set	messages: + msg105007
2010-05-05 08:58:01	vstinner	set	messages: + msg105005
2010-05-05 07:45:19	lemburg	set	nosy: + lemburg messages: + msg105002
2010-05-05 00:30:54	Arfrever	set	nosy: + Arfrever
2010-05-05 00:22:10	pitrou	set	nosy: + loewis
2010-05-05 00:20:18	vstinner	set	nosy: + pitrou
2010-05-05 00:15:51	vstinner	set	files: + initfsencoding.patch messages: + msg104986
2010-05-04 23:34:25	vstinner	set	files: + no_fsencoding_error.patch keywords: + patch messages: + msg104984
2010-05-04 11:58:56	vstinner	create