classification
Title: UnicodeDecodeError during load failure in non-UTF-8 locale
Type: behavior Stage: resolved
Components: Interpreter Core Versions: Python 3.10, Python 3.9, Python 3.8
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: kadler, methane, miss-islington, serhiy.storchaka
Priority: normal Keywords: patch

Created on 2020-09-30 17:26 by kadler, last changed 2020-10-15 03:16 by methane. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 22466 merged kadler, 2020-09-30 17:42
PR 22704 merged miss-islington, 2020-10-15 01:53
PR 22705 merged miss-islington, 2020-10-15 01:53
Messages (15)
msg377713 - (view) Author: Kevin (kadler) * Date: 2020-09-30 17:26
If a native module fails to load, the dynload code will call PyUnicode_FromString on the error message to give back to the user. This can cause a UnicodeDecodeError if the locale is not a UTF-8 locale and the error message contains non-ASCII code points.

While Linux systems almost always use a UTF-8 locale by default nowadays, AIX systems typically use non-UTF-8 locales by default. We encountered an issue where a customer did not have libbz2 installed, causing a load failure when bz2 tried to import _bz2 when running in an Italian locale:

$ LC_ALL=it_IT python3 -c 'import bz2'        
Traceback (most recent call last): 
 File "<string>", line 1, in <module> 
 File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module> 
   from _bz2 import BZ2Compressor, BZ2Decompressor 
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 161: invalid continuation byte
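
The failing byte can be reproduced outside the loader. The sketch below is only an illustration: it assumes the message was encoded in ISO 8859-1 (the issue does not state the codeset of the it_IT locale) and uses a shortened stand-in for the real loader message. In that encoding "è" is the single byte 0xe8, which UTF-8 treats as the start of a multi-byte sequence.

```python
# Stand-in for the loader message; ISO 8859-1 is an assumed locale encoding.
raw = "Il modulo dipendente libbz2.so non è stato caricato.".encode("iso8859-1")

failure = None
try:
    # Strict UTF-8 decoding, the Python-level analogue of PyUnicode_FromString().
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    failure = exc
    print(exc)
```

The byte position differs from the traceback above because the stand-in message is shorter than the real one.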

After switching to a UTF-8 locale, the problem goes away:

$ LC_ALL=IT_IT python3 -c 'import bz2'   
Traceback (most recent call last): 
 File "<string>", line 1, in <module> 
 File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module> 
   from _bz2 import BZ2Compressor, BZ2Decompressor 
ImportError:    0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so. 
       0509-150   Il modulo dipendente libbz2.so non è stato caricato. 
       0509-022 Impossibile caricare il modulo libbz2.so. 
       0509-026 Errore di sistema: Un file o una directory nel nome percorso non esiste. 
       0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so. 
       0509-150   Il modulo dipendente /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so non è stato caricato.


While this conceivably affects any Unix-like platform, the only systems I can recreate it on are AIX and IBM i PASE. As far as I can tell, on Linux you will always get something like "error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory". Even though there seem to be some translations in GLIBC, I have been unable to get them to be used on either Fedora or Ubuntu.
msg378033 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-05 14:30
I succeeded in reproducing it on Ubuntu 20.04.

    $ sudo vi /var/lib/locales/supported.d/ja # add "ja_JP.EUC-JP EUC-JP"
    $ sudo locale-gen ja_JP.EUC-JP
    Generating locales (this might take a while)...
    ja_JP.EUC-JP... done
    Generation complete.
    $ chmod -r ./build/lib.linux-x86_64-3.10/_sha3.cpython-310-x86_64-linux-gnu.so
    $ LC_ALL=ja_JP.eucjp ./python
    Python 3.10.0a0 (heads/master:fbf43f051e, Aug 17 2020, 15:13:52)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, "")
    'ja_JP.eucjp'
    >>> import _sha3
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 101: invalid start byte

The error message contains a file path (a byte string, probably encoded with the filesystem encoding) and a translated error message (encoded with the locale encoding).

I want to use the "backslashreplace" error handler, but neither PyUnicode_DecodeLocale() nor PyUnicode_DecodeFSDefault() supports it.

After thinking about this for several minutes, I now prefer PyUnicode_DecodeUTF8(msg, strlen(msg), "backslashreplace").
It fixes the issue with minimal behavior change, although bytes that are not valid UTF-8 are still shown as backslash escapes.
It might be the best practice for creating a Unicode object from a C error message such as the result of strerror(3).
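
A minimal Python-level sketch of that proposal follows. The helper name decode_error_message and the sample bytes are hypothetical: the message is an ASCII file name followed by EUC-JP encoded text, standing in for a translated loader error. bytes.decode() with "backslashreplace" is used here as the Python analogue of the C call.

```python
def decode_error_message(raw: bytes) -> str:
    """Decode as UTF-8; bytes that are not valid UTF-8 become \\xNN escapes."""
    return raw.decode("utf-8", "backslashreplace")


# Hypothetical message: ASCII file name plus EUC-JP encoded Japanese text.
raw = b"_sha3.so: " + "ありません".encode("euc_jp")

try:
    raw.decode("utf-8")  # strict decoding, as before the fix
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

message = decode_error_message(raw)
print(strict_ok)
print(message)
```

The result is always a valid str, but byte pairs that happen to form valid UTF-8 are decoded as unrelated characters, so the text is not guaranteed to be pure ASCII escapes.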
msg378170 - (view) Author: Kevin (kadler) * Date: 2020-10-07 16:57
Glad you were able to reproduce on Linux.

I have since changed the PR to use PyUnicode_DecodeFSDefault based on review feedback. I was going to say that you will have to fight it out with @methane on GH, but I see that that's you. :D Would have been nice if you would have left the updated feedback there as well so people who aren't familiar would know it's one person adjusting their recommendation vs two different people with conflicting recommendations.


The only issue I see with using backslashreplace is that users of non-UTF-8 locales would see message text that contains non-ASCII characters only as escape codes. eg, the message above would show "Il modulo dipendente libbz2.so non \xe8 stato caricato." instead of "Il modulo dipendente libbz2.so non è stato caricato." By using PyUnicode_DecodeFSDefault instead, the message should be properly decoded but any encoding errors (such as utf-8 paths, etc) would be handled by surrogateescape.

I guess the question comes to: what's more important to be decoded, the message text or the path?
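
The trade-off can be sketched with a made-up mixed message. Everything here is an assumption for illustration: the text is taken to be ISO 8859-1, the path /opt/caffè/lib/libbz2.so is invented and taken to be UTF-8 on disk, and ISO 8859-1 stands in for whatever PyUnicode_DecodeFSDefault would use on such a system.

```python
# Hypothetical mixed message: text in ISO 8859-1 (assumed locale encoding),
# followed by a made-up path whose name is encoded in UTF-8.
text = "Il modulo dipendente libbz2.so non è stato caricato."
path = "/opt/caffè/lib/libbz2.so"
raw = text.encode("iso8859-1") + b" " + path.encode("utf-8")

# Option 1: UTF-8 with backslashreplace keeps the path and escapes the text.
as_utf8 = raw.decode("utf-8", "backslashreplace")

# Option 2: the locale encoding keeps the text and garbles the path.
as_locale = raw.decode("iso8859-1", "surrogateescape")

print(as_utf8)
print(as_locale)
```

ISO 8859-1 maps every byte to a character, so surrogateescape never triggers in this sketch and the path simply turns into mojibake. With an encoding that has undefined byte sequences, the undecodable path bytes would become lone surrogates instead.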
msg378211 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-08 05:09
> I have since changed the PR to use PyUnicode_DecodeFSDefault based on review feedback. I was going to say that you will have to fight it out with @methane on GH, but I see that that's you. :D Would have been nice if you would have left the updated feedback there as well so people who aren't familiar would know it's one person adjusting their recommendation vs two different people with conflicting recommendations.

OK, I changed my b.p.o username.


> The only issue I see with using backslashreplace is that users of non-UTF-8 locales would see message text that contains non-ASCII characters only as escape codes. eg, the message above would show "Il modulo dipendente libbz2.so non \xe8 stato caricato." instead of "Il modulo dipendente libbz2.so non è stato caricato."

The issue is not caused by backslashreplace, but by using UTF-8 instead of the locale encoding. I noticed that, of course, but:

* Using UTF-8 is the status quo. UTF-8:backslashreplace is the simplest fix.
* There is no guarantee that the error message can be decoded with the locale encoding. Backslash escaping is much better than "ignore" or "surrogateescape".


> By using PyUnicode_DecodeFSDefault instead, the message should be properly decoded but any encoding errors (such as utf-8 paths, etc) would be handled by surrogateescape.

There is no guarantee that the message is properly decoded with the filesystem encoding.
And surrogateescape is for round-tripping bytes paths, not for human-readable output.
Error messages should be human readable, so backslashreplace is better than surrogateescape.

Additionally, non-UTF-8 locales are quite rare on Unix systems, and users of such systems would be able to handle backslash-escaped messages, because they probably see such messages often.
msg378220 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-10-08 07:49
I think that it is more correct to use the locale encoding. If error messages are translated for readability, we should not ruin this by outputting \xXX.
msg378223 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-08 08:34
> I think that it is more correct to use the locale encoding. If error messages are translated for readability, we should not ruin this by outputting \xXX.

* PyUnicode_DecodeLocale() doesn't support the "backslashreplace" error handler.
* The error message is usually encoded in the locale encoding, but that is not guaranteed.
* The error message may contain a path, which may not be in the locale encoding either.
* \xXX is far better than UnicodeDecodeError, anyway. We need to fix the UnicodeDecodeError first.
* Non-UTF-8 locales are rare. This code has been in use for a long time, but nobody reported this issue until now.

I am not against adding "backslashreplace" to PyUnicode_DecodeLocale(). But to backport the bugfix for UnicodeDecodeError, the change should be minimal.

So the main problem is: should we allow surrogateescape in error message?

For the record, PyUnicode_DecodeLocale() uses mbstowcs(). I don't know how reliable that function is on various platforms. That is why I suggested PyUnicode_DecodeFSDefault() at first.
msg378224 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-08 08:42
> So the main problem is: should we allow surrogateescape in error message?

Note that the error message may be written to a file, a stream, or a structured log (JSON). Those may be UTF-8:strict, and we cannot write a surrogateescaped string to them.
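
This concern can be sketched in Python with a short sample byte string (the bytes are illustrative, not taken from a real loader message). A string produced by surrogateescape round-trips to the original bytes, but a strict UTF-8 encoder rejects it, whereas the backslash-escaped form is ordinary text.

```python
raw = b"non \xe8 stato caricato"  # 0xe8 is not valid UTF-8 in this position

escaped = raw.decode("utf-8", "surrogateescape")    # 0xe8 becomes U+DCE8
readable = raw.decode("utf-8", "backslashreplace")  # 0xe8 becomes the text \xe8

# surrogateescape round-trips the original bytes...
assert escaped.encode("utf-8", "surrogateescape") == raw

# ...but a strict UTF-8 sink rejects the lone surrogate.
try:
    escaped.encode("utf-8")
    strict_write_ok = True
except UnicodeEncodeError:
    strict_write_ok = False

# The backslash-escaped form is plain text and encodes without error.
encoded_readable = readable.encode("utf-8")
```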
msg378226 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-10-08 09:41
In os.strerror() and PyErr_SetFromErrnoWithFilenameObjects() we use PyUnicode_DecodeLocale(s, "surrogateescape") for decoding the result of strerror().
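
A rough Python-level approximation of that decoding is sketched below. The helper decode_locale is hypothetical, and it uses a codec lookup via locale.getpreferredencoding(False), while the C function is said in this thread to rely on mbstowcs(), so behavior may differ on some platforms.

```python
import errno
import locale
import os


def decode_locale(raw: bytes) -> str:
    """Approximate PyUnicode_DecodeLocale(s, "surrogateescape") with a codec."""
    encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding, "surrogateescape")


# os.strerror() already returns a decoded str.
print(os.strerror(errno.ENOENT))

raw = b"non \xe8 stato caricato"
decoded = decode_locale(raw)

# Bytes that could not be decoded are recoverable with the same handler.
round_tripped = decoded.encode(locale.getpreferredencoding(False), "surrogateescape")
```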
msg378228 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-08 09:55
OK. Let's use PyUnicode_DecodeLocale() with surrogateescape for consistency.
msg378251 - (view) Author: Kevin (kadler) * Date: 2020-10-08 14:37
Ok, so should I switch the PR back from PyUnicode_DecodeFSDefault?
msg378298 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-09 02:19
Yes, please.
msg378656 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-15 01:53
New changeset 2d2af320d94afc6561e8f8adf174c9d3fd9065bc by Kevin Adler in branch 'master':
bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466)
https://github.com/python/cpython/commit/2d2af320d94afc6561e8f8adf174c9d3fd9065bc
msg378657 - (view) Author: miss-islington (miss-islington) Date: 2020-10-15 02:11
New changeset 47ca6799725bb4c40953bb26ebcd726d1d766361 by Miss Skeleton (bot) in branch '3.8':
bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466)
https://github.com/python/cpython/commit/47ca6799725bb4c40953bb26ebcd726d1d766361
msg378658 - (view) Author: miss-islington (miss-islington) Date: 2020-10-15 02:25
New changeset f07448bef48d645c8cee862b1f25a99003a6140e by Miss Skeleton (bot) in branch '3.9':
bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466)
https://github.com/python/cpython/commit/f07448bef48d645c8cee862b1f25a99003a6140e
msg378659 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2020-10-15 03:16
Thank you for finding/fixing.
History
Date User Action Args
2020-10-15 03:16:09 methane set status: open -> closed
resolution: fixed
messages: + msg378659
stage: patch review -> resolved
2020-10-15 02:25:49 miss-islington set messages: + msg378658
2020-10-15 02:11:16 miss-islington set messages: + msg378657
2020-10-15 01:53:50 miss-islington set pull_requests: + pull_request21675
2020-10-15 01:53:43 miss-islington set nosy: + miss-islington
pull_requests: + pull_request21674
2020-10-15 01:53:34 methane set messages: + msg378656
2020-10-09 02:19:53 methane set messages: + msg378298
2020-10-08 14:37:30 kadler set messages: + msg378251
2020-10-08 09:55:22 methane set messages: + msg378228
2020-10-08 09:41:58 serhiy.storchaka set messages: + msg378226
2020-10-08 08:42:42 methane set messages: + msg378224
2020-10-08 08:34:22 methane set messages: + msg378223
2020-10-08 07:49:19 serhiy.storchaka set nosy: + serhiy.storchaka
messages: + msg378220
2020-10-08 05:09:23 methane set messages: + msg378211
2020-10-07 16:57:17 kadler set messages: + msg378170
2020-10-05 14:30:27 methane set nosy: + methane
messages: + msg378033
2020-10-02 14:05:36 taleinat set versions: - Python 3.5, Python 3.6, Python 3.7
2020-09-30 17:42:03 kadler set keywords: + patch
stage: patch review
pull_requests: + pull_request21490
2020-09-30 17:26:23 kadler create