Solaris: Fix broken Unicode encoding in non-UTF locales #87833

Closed
kulikjak mannequin opened this issue Mar 30, 2021 · 13 comments
Labels
3.9 only security fixes 3.10 only security fixes 3.11 only security fixes topic-unicode

Comments

@kulikjak
Mannequin

kulikjak mannequin commented Mar 30, 2021

BPO 43667
Nosy @vstinner, @ezio-melotti, @pablogsal, @miss-islington, @kulikjak
PRs
  • bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris #25096
  • [3.9] bpo-43667: Fix broken Unicode encoding in non-UTF locales on So… #25847
  • bpo-43667: Add news fragment for changes in #25096 #26405
  • [3.10] bpo-43667: Add news fragment for Solaris changes (GH-26405) #26409
  • [3.9] bpo-43667: Add news fragment for Solaris changes (GH-26405) #26410
  • [3.10] bpo-43667: Add news fragment for Solaris changes (GH-26405) #26498
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = <Date 2021-05-25.10:02:40.604>
    created_at = <Date 2021-03-30.10:11:34.418>
    labels = ['3.10', '3.9', 'expert-unicode', '3.11']
    title = 'Solaris: Fix broken Unicode encoding in non-UTF locales'
    updated_at = <Date 2021-06-20.20:12:16.267>
    user = 'https://github.com/kulikjak'

    bugs.python.org fields:

    activity = <Date 2021-06-20.20:12:16.267>
    actor = 'pablogsal'
    assignee = 'none'
    closed = True
    closed_date = <Date 2021-05-25.10:02:40.604>
    closer = 'vstinner'
    components = ['Unicode']
    creation = <Date 2021-03-30.10:11:34.418>
    creator = 'kulikjak'
    dependencies = []
    files = []
    hgrepos = []
    issue_num = 43667
    keywords = ['patch']
    message_count = 13.0
    messages = ['389813', '389814', '392429', '394116', '394117', '394305', '394308', '394309', '394572', '394576', '394577', '394578', '396193']
    nosy_count = 5.0
    nosy_names = ['vstinner', 'ezio.melotti', 'pablogsal', 'miss-islington', 'kulikjak']
    pr_nums = ['25096', '25847', '26405', '26409', '26410', '26498']
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = None
    url = 'https://bugs.python.org/issue43667'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @kulikjak
    Mannequin Author

    kulikjak mannequin commented Mar 30, 2021

    On Linux, wchar_t values correspond directly to Unicode code points (UTF-32); however, that does not have to be the case, as the C standard allows an arbitrary representation, and Solaris makes use of that freedom.

    In Oracle Solaris, the internal form of wchar_t is specific to a locale; in the Unicode locales, wchar_t has the UTF-32 Unicode encoding form, and other locales have different representations [1].

    This is an issue because Python expects wchar_t values to be Unicode code points. On Oracle Solaris with a non-UTF locale, that assumption results either in errors (values outside the Unicode range) or in output with the wrong symbols.
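To make the failure mode concrete, here is a minimal sketch of what happens when a raw wchar_t value is interpreted directly as a code point. The helper names `is_unicode_code_point` and `describe` are hypothetical, and the value `0x30000069` is simply the one from the error message quoted in a follow-up comment in this thread; this is not the code path CPython itself uses.

```python
# Largest valid Unicode code point; CPython rejects anything above it.
MAX_CODE_POINT = 0x10FFFF


def is_unicode_code_point(value: int) -> bool:
    """Return True if value can be interpreted directly as a code point."""
    return 0 <= value <= MAX_CODE_POINT


def describe(value: int) -> str:
    """Show what happens when a raw wchar_t value is treated as Unicode."""
    try:
        return f"{value:#x} -> {chr(value)!r}"
    except ValueError:
        # chr() refuses integers outside range(0x110000).
        return f"{value:#x} -> not a Unicode code point"


print(describe(0x69))        # a plain ASCII letter
print(describe(0x30000069))  # value taken from the error quoted in this thread
```

A value that happens to fall inside the valid range would not raise at all; it would silently produce a different character, which matches the "different symbols" symptom described above.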

    Unicode locales work as expected, but they are not an acceptable workaround for some Oracle Solaris users that cannot use Unicode encoding for various reasons.

    Because of that, we fixed it a few months ago with a patch to PyUnicode_FromWideChar, which handles the conversion to Unicode (attached in the PR). We have been testing it for the last half a year and haven't seen any related issues since.

    Is something like this acceptable, or should it be fixed in a different place or in a different way? All comments are appreciated.

    [1] https://docs.oracle.com/cd/E36784_01/html/E39536/gmwkm.html
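For readers unfamiliar with the function mentioned above, the following sketch calls `PyUnicode_FromWideChar` from Python through `ctypes.pythonapi`. It assumes CPython, and the wrapper name `wide_round_trip` is hypothetical. It only shows where the wchar_t-to-str conversion happens; it does not reproduce the Solaris patch itself.

```python
import ctypes

# Declare the C signature:
#   PyObject *PyUnicode_FromWideChar(const wchar_t *w, Py_ssize_t size)
from_wide = ctypes.pythonapi.PyUnicode_FromWideChar
from_wide.argtypes = [ctypes.c_wchar_p, ctypes.c_ssize_t]
from_wide.restype = ctypes.py_object


def wide_round_trip(text: str) -> str:
    """Pass text through a wchar_t buffer and back via PyUnicode_FromWideChar."""
    # ctypes converts the str into a wchar_t* buffer for the call; a size of -1
    # asks the function to compute the length itself (using wcslen).
    return from_wide(text, -1)


# ascii() keeps the output printable regardless of the terminal encoding.
print(ascii(wide_round_trip("mi\u00e9rcoles")))
```

On a platform where wchar_t holds Unicode code points, the round trip should return the input unchanged.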

    @kulikjak kulikjak mannequin added topic-unicode 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes labels Mar 30, 2021
    @kulikjak
    Mannequin Author

    kulikjak mannequin commented Mar 30, 2021

    I forgot to mention: this affects Oracle Solaris. I tested this on SmartOS and cannot reproduce it there, as it seems to use the Unicode representation for all locales. Based on the documentation, this might also affect other systems (e.g. HP-UX specifically says: 'These values may not be compatible with values obtained by specifying other locales that are supported'), but it's hard to tell without testing.

    If the issue is present, this one-liner fails with "ValueError: character U+30000069 is not in range [U+0000; U+10ffff]":
    python3.7 -c 'import datetime; import locale; locale.setlocale(locale.LC_ALL,"es_ES.ISO8859-1"); datetime.date(2001, 1, 3).strftime("%a")'
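The one-liner above can be expanded into a small script that also copes with systems where the locale is not installed. This is a sketch only: the function name `check_strftime` and its three return values are hypothetical, chosen for illustration.

```python
import datetime
import locale


def check_strftime(locale_name: str = "es_ES.ISO8859-1") -> str:
    """Return 'ok', 'broken', or 'locale-unavailable' for the given locale."""
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        try:
            locale.setlocale(locale.LC_ALL, locale_name)
        except locale.Error:
            # The locale is not installed here, so nothing can be concluded.
            return "locale-unavailable"
        try:
            datetime.date(2001, 1, 3).strftime("%a")
        except ValueError:
            # An affected build raises here ("character U+... is not in range").
            return "broken"
        return "ok"
    finally:
        # Restore the original locale so the check has no side effects.
        locale.setlocale(locale.LC_ALL, saved)


print(check_strftime())
```

Note that "ok" only means no exception was raised. As described earlier in the thread, an affected system may instead return wrong characters, so the result is a hint rather than proof.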

    @vstinner
    Member

    New changeset 9032cf5 by Jakub Kulík in branch 'master':
    bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris (GH-25096)
    9032cf5

    @sujalpatel67821 sujalpatel67821 mannequin added tests Tests in the Lib/test dir 3.11 only security fixes and removed topic-unicode 3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes labels May 3, 2021
    @kulikjak kulikjak mannequin added topic-unicode 3.8 only security fixes 3.9 only security fixes 3.10 only security fixes and removed tests Tests in the Lib/test dir labels May 3, 2021
    @kulikjak kulikjak mannequin added 3.10 only security fixes and removed 3.11 only security fixes labels May 3, 2021
    @vstinner
    Member

    New changeset d3cc689 by Jakub Kulík in branch '3.9':
    [3.9] bpo-43667: Fix broken Unicode encoding in non-UTF locales on Solaris (GH-25096) (GH-25847)
    d3cc689

    @vstinner
    Member

    A backport to 3.8 may be more complicated. It's up to you to decide whether you want to backport it. I merged your 3.9 backport; it looks very close to the change made in the main branch.

    @vstinner
    Member

    Do you want to attempt to backport the fix to 3.8, or can this issue be closed?

    @kulikjak
    Mannequin Author

    kulikjak mannequin commented May 25, 2021

    Sorry for the delayed response.

    Considering that we are not delivering or using 3.8 in any way and this issue doesn't seem to impact anybody else, we can omit the backport to 3.8. I will prepare another PR with a news fragment, and after that, this can be considered solved and closed.

    @kulikjak kulikjak mannequin added 3.11 only security fixes and removed 3.8 only security fixes labels May 25, 2021
    @vstinner
    Member

    I'm closing the issue, but you can still reference the bpo issue number in your PR with the changelog (NEWS) entry.

    @vstinner
    Member

    New changeset 164a4f4 by Jakub Kulík in branch 'main':
    bpo-43667: Add news fragment for Solaris changes (GH-26405)
    164a4f4

    @vstinner
    Member

    New changeset 0574b06 by Miss Islington (bot) in branch '3.10':
    bpo-43667: Add news fragment for Solaris changes (GH-26405) (GH-26409)
    0574b06

    @vstinner
    Member

    New changeset 427232f by Miss Islington (bot) in branch '3.9':
    bpo-43667: Add news fragment for Solaris changes (GH-26405) (GH-26410)
    427232f

    @vstinner
    Member

    I merged your PR and backported it to add a NEWS entry, thanks.

    @pablogsal
    Member

    New changeset f87d203 by Miss Islington (bot) in branch '3.10':
    bpo-43667: Add news fragment for Solaris changes (GH-26405) (GH-26498)
    f87d203

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022