Issue 41894: UnicodeDecodeError during load failure in non-UTF-8 locale


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/86060

classification

Title:	UnicodeDecodeError during load failure in non-UTF-8 locale
Type:	behavior	Stage:	resolved
Components:	Interpreter Core	Versions:	Python 3.10, Python 3.9, Python 3.8

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	kadler, methane, miss-islington, serhiy.storchaka
Priority:	normal	Keywords:	patch

Created on 2020-09-30 17:26 by kadler, last changed 2022-04-11 14:59 by admin. This issue is now closed.

Pull Requests
URL	Status	Linked	Edit
PR 22466	merged	kadler, 2020-09-30 17:42
PR 22704	merged	miss-islington, 2020-10-15 01:53
PR 22705	merged	miss-islington, 2020-10-15 01:53

Messages (15)
msg377713 - (view)	Author: Kevin (kadler) *	Date: 2020-09-30 17:26
If a native module fails to load, the dynload code calls PyUnicode_FromString on the error message to give back to the user. This can cause a UnicodeDecodeError if the locale is not a UTF-8 locale and the error message contains non-ASCII code points. While Linux systems almost always use a UTF-8 locale by default nowadays, AIX systems typically use non-UTF-8 locales by default.

We encountered an issue where a customer did not have libbz2 installed, causing a load failure when bz2 tried to import _bz2 while running in an Italian locale:

    $ LC_ALL=it_IT python3 -c 'import bz2'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module>
        from _bz2 import BZ2Compressor, BZ2Decompressor
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 161: invalid continuation byte

After switching to a UTF-8 locale, the problem goes away:

    $ LC_ALL=IT_IT python3 -c 'import bz2'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/QOpenSys/pkgs/lib/python3.6/bz2.py", line 21, in <module>
        from _bz2 import BZ2Compressor, BZ2Decompressor
    ImportError: 0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so.
        0509-150 Il modulo dipendente libbz2.so non è stato caricato.
        0509-022 Impossibile caricare il modulo libbz2.so.
        0509-026 Errore di sistema: Un file o una directory nel nome percorso non esiste.
        0509-022 Impossibile caricare il modulo /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so.
        0509-150 Il modulo dipendente /QOpenSys/pkgs/lib/python3.6/lib-dynload/_bz2.so non è stato caricato.

While this conceivably affects any Unix-like platform, the only systems I can recreate it on are AIX and IBM i PASE. As far as I can tell, on Linux you will always get something like "error while loading shared libraries: libbz2.so.1.0: cannot open shared object file: No such file or directory". Even though there seem to be some translations in GLIBC, I have been unable to get them to be used on either Fedora or Ubuntu.
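The failure mode described in this report can be sketched in pure Python. This is only an illustration, assuming the Italian locale emits ISO 8859-1 bytes (the report does not name the encoding); strict UTF-8 decoding stands in for what PyUnicode_FromString does in C.

```python
# Bytes as a Latin-1 locale would emit them (the encoding is an assumption).
raw = "Il modulo dipendente libbz2.so non è stato caricato.".encode("iso-8859-1")

error = None
try:
    # Strict UTF-8 decoding, comparable to PyUnicode_FromString in C.
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(exc)
```

The byte 0xe8 is a valid UTF-8 lead byte, but the space that follows it is not a continuation byte, which is why the decoder reports "invalid continuation byte" rather than "invalid start byte".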
msg378033 - (view)	Author: Inada Naoki (methane) *	Date: 2020-10-05 14:30
I managed to reproduce it on Ubuntu 20.04.

    $ sudo vi /var/lib/locales/supported.d/ja  # add "ja_JP.EUC-JP EUC-JP"
    $ sudo locale-gen ja_JP.EUC-JP
    Generating locales (this might take a while)...
      ja_JP.EUC-JP... done
    Generation complete.
    $ chmod -r ./build/lib.linux-x86_64-3.10/_sha3.cpython-310-x86_64-linux-gnu.so
    $ LC_ALL=ja_JP.eucjp ./python
    Python 3.10.0a0 (heads/master:fbf43f051e, Aug 17 2020, 15:13:52)
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale
    >>> locale.setlocale(locale.LC_ALL, "")
    'ja_JP.eucjp'
    >>> import _sha3
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb6 in position 101: invalid start byte

The error message contains a file path (a byte string, probably encoded with the filesystem encoding) and a translated error message (encoded with the locale encoding).

I want to use the "backslashreplace" error handler, but neither PyUnicode_DecodeLocale() nor PyUnicode_DecodeFSDefault() supports it.

After thinking about this for several minutes, I now prefer PyUnicode_DecodeUTF8(msg, strlen(msg), "backslashreplace"). It fixes the issue with minimal behavior change, although the non-ASCII parts of the error message are shown backslash-escaped. It might be the best practice for creating a Unicode object from a C error message such as the result of strerror(3).
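A rough Python equivalent of the PyUnicode_DecodeUTF8(msg, strlen(msg), "backslashreplace") proposal is sketched below. The sample bytes are an assumption (an ISO 8859-1 encoded fragment of the Italian message from the first report), not output captured from a real load failure.

```python
# Locale-encoded message bytes (ISO 8859-1 is assumed for illustration).
raw = "non è stato caricato.".encode("iso-8859-1")

# Decode as UTF-8, replacing undecodable bytes with \xNN escapes
# instead of raising UnicodeDecodeError.
escaped = raw.decode("utf-8", errors="backslashreplace")
print(escaped)
```

Decoding no longer fails, but the accented letter is shown as an escape sequence rather than as the character itself.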
msg378170 - (view)	Author: Kevin (kadler) *	Date: 2020-10-07 16:57
Glad you were able to reproduce on Linux.

I have since changed the PR to use PyUnicode_DecodeFSDefault based on review feedback. I was going to say that you will have to fight it out with @methane on GH, but I see that that's you. :D It would have been nice if you had left the updated feedback there as well, so people who aren't familiar would know it's one person adjusting their recommendation vs two different people with conflicting recommendations.

The only issue I see with using backslashreplace is that users of non-UTF-8 locales would see message text that contains non-ASCII characters only as escape codes. E.g., the message above would show "Il modulo dipendente libbz2.so non \xe8 stato caricato." instead of "Il modulo dipendente libbz2.so non è stato caricato."

By using PyUnicode_DecodeFSDefault instead, the message should be properly decoded, but any encoding errors (such as UTF-8 paths, etc.) would be handled by surrogateescape.

I guess the question comes down to: what's more important to be decoded, the message text or the path?
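The trade-off raised in this message can be illustrated with a small sketch. It assumes an ISO 8859-1 locale for the message text and a hypothetical UTF-8 encoded path (/tmp/caffè/_bz2.so is invented for the example); neither is taken from the report.

```python
# Translated message text in the (assumed) locale encoding.
message = "non è stato caricato: ".encode("iso-8859-1")
# Hypothetical path stored on disk as UTF-8 bytes.
path = "/tmp/caffè/_bz2.so".encode("utf-8")
raw = message + path

# Option A: decode with the locale encoding. The message is readable,
# but the UTF-8 path bytes are misinterpreted.
as_locale = raw.decode("iso-8859-1", errors="surrogateescape")

# Option B: decode as UTF-8 with backslashreplace. The path is readable,
# but the locale-encoded character becomes an escape sequence.
as_utf8 = raw.decode("utf-8", errors="backslashreplace")

print(as_locale)
print(as_utf8)
```

Note that with a single-byte encoding such as ISO 8859-1 every byte decodes to some character, so the surrogateescape handler is never triggered in option A and the path silently turns into mojibake. With a multi-byte locale encoding, undecodable path bytes would instead become lone surrogates.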
msg378211 - (view)	Author: Inada Naoki (methane) *	Date: 2020-10-08 05:09
> I have since changed the PR to use PyUnicode_DecodeFSDefault based on review feedback. I was going to say that you will have to fight it out with @methane on GH, but I see that that's you. :D Would have been nice if you would have left the updated feedback there as well so people who aren't familiar would know it's one person adjusting their recommendation vs two different people with conflicting recommendations.

OK, I changed my b.p.o username.

> The only issue I see with using backslashreplace is that users of non-UTF-8 locales would see message text that contains non-ASCII characters only as escape codes. eg, the message above would show "Il modulo dipendente libbz2.so non \xe8 stato caricato." instead of "Il modulo dipendente libbz2.so non è stato caricato."

The issue is not caused by backslashreplace, but by using UTF-8 instead of the locale encoding. I am aware of it, of course, but:

* Using UTF-8 is the status quo. UTF-8:backslashreplace is the simplest fix.
* There is no guarantee that the error message can be decoded with the locale encoding. Backslash escaping is much better than "ignore" or "surrogateescape".

> By using PyUnicode_DecodeFSDefault instead, the message should be properly decoded but any encoding errors (such as utf-8 paths, etc) would be handled by surrogateescape.

There is no guarantee that the message is properly decoded with the filesystem encoding. And surrogateescape is for round-tripping byte paths, not for human-readable text. Error messages should be human readable, so backslashreplace is better than surrogateescape.

Additionally, non-UTF-8 locales are quite rare on Unix systems, and users of such systems should be able to handle backslash-escaped messages, because they probably see such messages often.
msg378220 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-10-08 07:49
I think that it is more correct to use the locale encoding. If error messages are translated for readability, we should not ruin this by outputting \xXX.
msg378223 - (view)	Author: Inada Naoki (methane) *	Date: 2020-10-08 08:34
> I think that it is more correct to use the locale encoding. If error messages are translated for readability, we should not ruin this by outputting \xXX.

* PyUnicode_DecodeLocale() doesn't support the "backslashreplace" error handler.
* The error message is usually encoded in the locale encoding, but that is not guaranteed.
* The error message may contain a path, which may not be in the locale encoding either.
* \xXX is far better than UnicodeDecodeError, anyway. We need to fix the UnicodeDecodeError first.
* Non-UTF-8 locales are rare. We have used this code for a long time, but this issue was not reported until now.

I'm not against adding "backslashreplace" to PyUnicode_DecodeLocale(). But to backport the bugfix for the UnicodeDecodeError, the change should be minimal.

So the main problem is: should we allow surrogateescape in error messages?

For the record, PyUnicode_DecodeLocale() uses mbstowcs(). I don't know how reliable that function is on various platforms. That is why I had suggested PyUnicode_DecodeFSDefault() at first.
msg378224 - (view)	Author: Inada Naoki (methane) *	Date: 2020-10-08 08:42
> So the main problem is: should we allow surrogateescape in error message?

Note that the error message may be written to a file, a stream, or a structured log (JSON). Those may use UTF-8 with the strict error handler, and we cannot write a surrogateescape-d string to them.
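The concern about strict UTF-8 sinks can be demonstrated with a short sketch. The input bytes are an illustrative assumption; the point is only that a string produced with surrogateescape contains lone surrogates that a strict UTF-8 encoder rejects.

```python
# Bytes that are not valid UTF-8 (0xe8 followed by a space).
raw = b"non \xe8 stato caricato."

# surrogateescape maps the undecodable byte 0xe8 to the lone surrogate U+DCE8.
text = raw.decode("utf-8", errors="surrogateescape")

try:
    # A strict UTF-8 sink (log file, JSON stream, ...) cannot encode it.
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Encoding with the same handler round-trips the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")
print(strict_ok, round_trip == raw)
```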
msg378226 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2020-10-08 09:41
In os.strerror() and PyErr_SetFromErrnoWithFilenameObjects() we use PyUnicode_DecodeLocale(s, "surrogateescape") for decoding the result of strerror().
msg378228 - (view)	Author: Inada Naoki (methane) *	Date: 2020-10-08 09:55
OK. Let's use PyUnicode_DecodeLocale() with surrogateescape for consistency.
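For readers who want to try the approach the thread settled on, PyUnicode_DecodeLocale can be called from Python through ctypes. This is a sketch for experimentation on CPython only, not the code from the merged PR; decode_error_message is a hypothetical helper name, and the result for non-ASCII bytes depends on the locale the interpreter is running under.

```python
import ctypes

# C signature: PyObject *PyUnicode_DecodeLocale(const char *str, const char *errors)
_decode_locale = ctypes.pythonapi.PyUnicode_DecodeLocale
_decode_locale.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
_decode_locale.restype = ctypes.py_object


def decode_error_message(raw: bytes) -> str:
    """Hypothetical helper: decode a C error string with the current
    locale encoding, escaping undecodable bytes as lone surrogates."""
    return _decode_locale(raw, b"surrogateescape")


print(decode_error_message(b"No such file or directory"))
```

With surrogateescape, undecodable bytes do not raise; they come back as lone surrogates, which is the same behaviour Serhiy describes for os.strerror().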
msg378251 - (view)	Author: Kevin (kadler) *	Date: 2020-10-08 14:37
Ok, so should I switch the PR back from PyUnicode_DecodeFSDefault?
msg378298 - (view)	Author: Inada Naoki (methane) *	Date: 2020-10-09 02:19
Yes, please.
msg378656 - (view)	Author: Inada Naoki (methane) *	Date: 2020-10-15 01:53
New changeset 2d2af320d94afc6561e8f8adf174c9d3fd9065bc by Kevin Adler in branch 'master': bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466) https://github.com/python/cpython/commit/2d2af320d94afc6561e8f8adf174c9d3fd9065bc
msg378657 - (view)	Author: miss-islington (miss-islington)	Date: 2020-10-15 02:11
New changeset 47ca6799725bb4c40953bb26ebcd726d1d766361 by Miss Skeleton (bot) in branch '3.8': bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466) https://github.com/python/cpython/commit/47ca6799725bb4c40953bb26ebcd726d1d766361
msg378658 - (view)	Author: miss-islington (miss-islington)	Date: 2020-10-15 02:25
New changeset f07448bef48d645c8cee862b1f25a99003a6140e by Miss Skeleton (bot) in branch '3.9': bpo-41894: Fix UnicodeDecodeError while loading native module (GH-22466) https://github.com/python/cpython/commit/f07448bef48d645c8cee862b1f25a99003a6140e
msg378659 - (view)	Author: Inada Naoki (methane) *	Date: 2020-10-15 03:16
Thank you for finding/fixing.

History
Date	User	Action	Args
2022-04-11 14:59:36	admin	set	github: 86060
2020-10-15 03:16:09	methane	set	status: open -> closed resolution: fixed messages: + msg378659 stage: patch review -> resolved
2020-10-15 02:25:49	miss-islington	set	messages: + msg378658
2020-10-15 02:11:16	miss-islington	set	messages: + msg378657
2020-10-15 01:53:50	miss-islington	set	pull_requests: + pull_request21675
2020-10-15 01:53:43	miss-islington	set	nosy: + miss-islington pull_requests: + pull_request21674
2020-10-15 01:53:34	methane	set	messages: + msg378656
2020-10-09 02:19:53	methane	set	messages: + msg378298
2020-10-08 14:37:30	kadler	set	messages: + msg378251
2020-10-08 09:55:22	methane	set	messages: + msg378228
2020-10-08 09:41:58	serhiy.storchaka	set	messages: + msg378226
2020-10-08 08:42:42	methane	set	messages: + msg378224
2020-10-08 08:34:22	methane	set	messages: + msg378223
2020-10-08 07:49:19	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg378220
2020-10-08 05:09:23	methane	set	messages: + msg378211
2020-10-07 16:57:17	kadler	set	messages: + msg378170
2020-10-05 14:30:27	methane	set	nosy: + methane messages: + msg378033
2020-10-02 14:05:36	taleinat	set	versions: - Python 3.5, Python 3.6, Python 3.7
2020-09-30 17:42:03	kadler	set	keywords: + patch stage: patch review pull_requests: + pull_request21490
2020-09-30 17:26:23	kadler	create