Title: Solaris: Fix broken Unicode encoding in non-UTF locales
Type:
Stage: patch review
Components: Unicode
Versions: Python 3.10, Python 3.9, Python 3.8
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: ezio.melotti, kulikjak, vstinner
Priority: normal
Keywords: patch

Created on 2021-03-30 10:11 by kulikjak, last changed 2021-05-03 14:46 by kulikjak.

Pull Requests
URL Status Linked Edit
PR 25096 merged kulikjak, 2021-03-30 10:12
PR 25847 open kulikjak, 2021-05-03 11:37
Messages (3)
msg389813 - Author: Jakub Kulik (kulikjak) * Date: 2021-03-30 10:11
On Linux, wchar_t values correspond to Unicode code points (UTF-32). The C standard does not require this, however: it allows an arbitrary representation, and Solaris makes use of that freedom.

In Oracle Solaris, the internal form of wchar_t is specific to a locale; in the Unicode locales, wchar_t has the UTF-32 Unicode encoding form, and other locales have different representations [1].

This is an issue because Python expects wchar_t values to be Unicode code points. On Oracle Solaris with a non-UTF locale, that assumption results either in errors (values fall outside the Unicode range) or in output containing the wrong symbols.
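
The first failure mode can be illustrated without a Solaris system. The sketch below only shows that a value such as 0x30000069 (the one from the ValueError quoted later in this issue) lies outside the range Python accepts as a code point; it does not reproduce the locale behavior itself. The helper name `is_valid_code_point` is hypothetical.

```python
import sys


def is_valid_code_point(value: int) -> bool:
    """Return True if `value` can be converted to a one-character str."""
    try:
        chr(value)
    except (ValueError, OverflowError):
        # chr() rejects anything outside range(0x110000).
        return False
    return True


print(hex(sys.maxunicode))              # highest code point Python supports
print(is_valid_code_point(0xE9))        # U+00E9, a valid code point
print(is_valid_code_point(0x30000069))  # value from the reported error
```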

Unicode locales work as expected, but they are not an acceptable workaround for those Oracle Solaris users who cannot use a Unicode encoding for various reasons.

Because of that, we fixed it a few months ago with a patch to `PyUnicode_FromWideChar`, which handles the conversion to Unicode (attached in the PR). The patch has been tested over the last half year, and we have not seen any related issues since.
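
As an illustration of the underlying principle only (this is not the patch in the PR, which is C code inside CPython), text produced under a non-UTF locale has to be converted to Unicode through the locale's declared encoding instead of being reinterpreted directly. A minimal Python sketch, in which the function name `decode_locale_bytes` and the sample bytes are assumptions:

```python
import codecs


def decode_locale_bytes(raw: bytes, encoding: str) -> str:
    """Decode bytes produced under a given locale encoding into a str."""
    codecs.lookup(encoding)  # raises LookupError early for an unknown codec
    return raw.decode(encoding)


# Hypothetical sample: an ISO 8859-1 byte string containing 0xE9.
sample = decode_locale_bytes(b"mi\xe9", "iso8859-1")
print(sample)
print(hex(ord(sample[-1])))
```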

Is something like this acceptable, or should it be fixed in a different place or in a different way? All comments are appreciated.

msg389814 - Author: Jakub Kulik (kulikjak) * Date: 2021-03-30 10:12
I forgot to mention: this affects Oracle Solaris. I tested this on SmartOS and could not reproduce it there; it seems to use the Unicode representation for all locales. Based on the documentation, this might also affect other systems (e.g. the HP-UX documentation specifically says: 'These values may not be compatible with values obtained by specifying other locales that are supported'), but it's hard to tell without testing.

This one-liner fails with ValueError: character U+30000069 is not in range [U+0000; U+10ffff] if the issue is present:
python3.7 -c 'import datetime; import locale; locale.setlocale(locale.LC_ALL,"es_ES.ISO8859-1");, 1, 3).strftime("%a")'
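
The command as quoted here appears to have lost the expression that constructs the date. The script below is a sketch of an equivalent check, not the original command: the year 2018 is an assumption, chosen so that 3 January falls on a Wednesday, and the helper name `abbreviated_weekday` is hypothetical. On a system without the locale installed it returns None; on an affected system the strftime call would be expected to raise the ValueError described above.

```python
import datetime
import locale


def abbreviated_weekday(locale_name: str, day: datetime.date):
    """Format `day` with %a under `locale_name`.

    Returns the formatted string, or None if the locale is not available.
    A ValueError raised by strftime is deliberately left to propagate,
    since that is the symptom being checked for.
    """
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        return None
    try:
        return day.strftime("%a")
    finally:
        locale.setlocale(locale.LC_ALL, saved)  # restore the previous locale


# Assumed date: 3 January 2018 (a Wednesday).
print(abbreviated_weekday("es_ES.ISO8859-1", datetime.date(2018, 1, 3)))
```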
msg392429 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2021-04-30 13:21
New changeset 9032cf5cb1e33c0349089cfb0f6bf11ed3c30e86 by Jakub Kulík in branch 'master':
bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris (GH-25096)
Date                 User              Action  Args
2021-05-03 14:46:29  kulikjak          set     components: + Unicode, - Tests
                                               versions: + Python 3.8, Python 3.9, Python 3.10, - Python 3.11
2021-05-03 12:28:54  sujalpatel67821   set     components: + Tests, - Unicode
                                               versions: + Python 3.11, - Python 3.7, Python 3.8, Python 3.9, Python 3.10
2021-05-03 11:37:13  kulikjak          set     pull_requests: + pull_request24530
2021-04-30 13:21:48  vstinner          set     messages: + msg392429
2021-03-30 10:12:51  kulikjak          set     messages: + msg389814
2021-03-30 10:12:23  kulikjak          set     keywords: + patch
                                               stage: patch review
                                               pull_requests: + pull_request23840
2021-03-30 10:11:34  kulikjak          create