Python does not support the GEORGIAN-PS charset #63658

Open
CaolnMcNamara mannequin opened this issue Oct 31, 2013 · 7 comments
Labels
3.9 (only security fixes), 3.10 (only security fixes), 3.11 (bug and security fixes), topic-unicode, type-crash (a hard crash of the interpreter, possibly with a core dump)

Comments

@CaolnMcNamara
Mannequin

CaolnMcNamara mannequin commented Oct 31, 2013

BPO 19459
Nosy @malemburg, @loewis, @vstinner, @taleinat, @jwilk, @ezio-melotti, @serhiy-storchaka
Files
  • georgian_ps.py

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2013-10-31.10:52:59.224>
    labels = ['3.10', '3.11', 'expert-unicode', 'type-crash', '3.9']
    title = 'Python does not support the GEORGIAN-PS charset'
    updated_at = <Date 2021-12-11.19:13:45.678>
    user = 'https://bugs.python.org/CaolnMcNamara'

    bugs.python.org fields:

    activity = <Date 2021-12-11.19:13:45.678>
    actor = 'iritkatriel'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Unicode']
    creation = <Date 2013-10-31.10:52:59.224>
    creator = 'Caolán.McNamara'
    dependencies = []
    files = ['32431']
    hgrepos = []
    issue_num = 19459
    keywords = []
    message_count = 7.0
    messages = ['201800', '201801', '201802', '404214', '404250', '404275', '404290']
    nosy_count = 8.0
    nosy_names = ['lemburg', 'loewis', 'vstinner', 'taleinat', 'jwilk', 'ezio.melotti', 'serhiy.storchaka', 'Caolán.McNamara']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = 'resolved'
    status = 'open'
    superseder = None
    type = 'crash'
    url = 'https://bugs.python.org/issue19459'
    versions = ['Python 3.9', 'Python 3.10', 'Python 3.11']

    @CaolnMcNamara
    Mannequin Author

    CaolnMcNamara mannequin commented Oct 31, 2013

    LANG=ka_GE.georgianps /usr/bin/python3
    Fatal Python error: Py_Initialize: Unable to get the locale encoding
    LookupError: unknown encoding: GEORGIAN-PS
    Aborted (core dumped)

    but with python-2.7.5 no crash...
    LANG=ka_GE.georgianps /usr/bin/python2
    Python 2.7.5 (default, Oct 8 2013, 12:19:40)
    [GCC 4.8.1 20130603 (Red Hat 4.8.1-1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.

    >>>

    (fedora 19)

    @CaolnMcNamara CaolnMcNamara mannequin added the topic-unicode and type-crash (a hard crash of the interpreter, possibly with a core dump) labels on Oct 31, 2013
    @vstinner
    Member

    This bug was initially reported in LibreOffice:
    https://bugs.freedesktop.org/show_bug.cgi?id=68850

    @vstinner
    Member

    I found three Georgian encodings:

    https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/charmaps/GEORGIAN-PS;h=64615ff4344d74ea0c70cfd7a6c6c8019afb884e;hb=HEAD

    https://sourceware.org/git/?p=glibc.git;a=blob;f=localedata/charmaps/GEORGIAN-ACADEMY;h=9dc1bc9e782e9fe6092a00daf1a75274fd6dd738;hb=HEAD

    http://tools.ietf.org/html/draft-giasher-geostd8-00

    The first one ("GEORGIAN-PS") is probably the most accurate because it is the one included in the GNU libc.

    Could you please try copying the attached georgian_ps.py file into /usr/lib64/python3.3/encodings/ (or /usr/lib/python3.3/encodings/ on 32-bit Linux)?

    Then try to print the Georgian letters using:

    print(bytes(range(0xc0, 0xe6)).decode("GEORGIAN-PS"))

    Please also give me your locale encoding:

       import locale; print(locale.getpreferredencoding())

    @caolán: Do you know the GEORGIAN-ACADEMY encoding? It doesn't appear to be used by any glibc locale.

    On my Fedora 18, I have three Georgian locales:

    • ka_GE.georgianps: locale encoding GEORGIAN-PS
    • ka_GE: locale encoding GEORGIAN-PS
    • ka_GE.utf8: locale encoding UTF-8

    You can work around this issue by switching your locale from ka_GE.georgianps to ka_GE.utf8.
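For readers without the attachment, a charmap codec of the kind that georgian_ps.py presumably provides could be sketched roughly as follows. The attachment is not reproduced here, so this is an illustration under stated assumptions: the codec name `georgian-ps-sketch` is hypothetical, and the table covers only ASCII plus bytes 0xC0-0xE5, which are assumed to map to U+10D0-U+10F5 in order (inferred from the byte range in the print() test above). Every other byte is left undefined; a real codec should be generated from the glibc charmap file.

```python
import codecs

# ASSUMPTION: partial, illustrative table. This is NOT the real GEORGIAN-PS
# charmap; bytes 0x80-0xBF and 0xE6-0xFF are deliberately left undefined.
_UNDEFINED = "\ufffe"  # marker the charmap helpers treat as "no mapping"
_DECODING_TABLE = (
    "".join(map(chr, range(0x80)))                           # ASCII, unchanged
    + _UNDEFINED * (0xC0 - 0x80)                             # not covered here
    + "".join(chr(0x10D0 + i) for i in range(0xE6 - 0xC0))   # assumed letters
    + _UNDEFINED * (0x100 - 0xE6)                            # not covered here
)
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)

_SKETCH_NAME = "georgian_ps_sketch"  # hypothetical name, not a stdlib codec


def _encode(text, errors="strict"):
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)


def _decode(data, errors="strict"):
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)


def _search(name):
    # Normalize ourselves: older Pythons pass hyphens through unchanged.
    if name.lower().replace("-", "_").replace(" ", "_") == _SKETCH_NAME:
        return codecs.CodecInfo(name=_SKETCH_NAME, encode=_encode, decode=_decode)
    return None


codecs.register(_search)

letters = bytes(range(0xC0, 0xE6)).decode("georgian-ps-sketch")
print(ascii(letters[:3]))  # escaped, so it prints safely under any stdout encoding
```

A codec registered this way at run time cannot help with the startup failure itself, because the interpreter looks up the locale encoding before any user code runs. That is why the request above is to place the file inside the encodings package.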

    @vstinner vstinner changed the title from "Fatal Python error: Py_Initialize: Unable to get the locale encoding: GEORGIAN-PS" to "Python does not support the GEORGIAN-PS charset" on Oct 31, 2013
    @taleinat
    Contributor

    With recent versions of Python (e.g. 3.9) this no longer causes a crash. Python apparently falls back to UTF-8, at least on my system:

    $ LANG=ka_GE.georgianps python3.9
    Python 3.9.7 (default, Sep  9 2021, 23:20:13) 
    [GCC 9.3.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import locale; print(locale.getpreferredencoding())
    UTF-8

    I'm marking this as fixed. If someone still has issues with this encoding, please open a new issue with up-to-date information.
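The check above only shows which encoding the running interpreter settled on. A slightly broader diagnostic, sketched below, also asks whether a given encoding name has a Python codec at all. The helper names `has_python_codec` and `locale_encoding_report` are hypothetical; only `codecs.lookup` and `locale.getpreferredencoding` come from the standard library.

```python
import codecs
import locale


def has_python_codec(encoding_name):
    """Return True if this interpreter can look up a codec by that name."""
    try:
        codecs.lookup(encoding_name)
    except LookupError:
        return False
    return True


def locale_encoding_report():
    # do_setlocale=False avoids a temporary setlocale() call.
    name = locale.getpreferredencoding(False)
    return name, has_python_codec(name)


print(locale_encoding_report())
print("GEORGIAN-PS codec available:", has_python_codec("GEORGIAN-PS"))
```

An interpreter that gets far enough to run this has already resolved a usable encoding, so the first line mostly confirms the fallback. The second line is the more direct test of whether the codec exists.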

    @vstinner
    Member

    Python uses UTF-8 if the locale is not supported:

    $ LANG=xxx python3.9 -c "import sys; print(sys.flags.utf8_mode)"
    1

    On Fedora 34, the locale is still supported, and Python 3.11 still fails:

    vstinner@apu$ LANG=ka_GE.georgianps locale
    LANG=ka_GE.georgianps
    LC_CTYPE="ka_GE.georgianps"
    LC_NUMERIC="ka_GE.georgianps"
    LC_TIME="ka_GE.georgianps"
    LC_COLLATE="ka_GE.georgianps"
    LC_MONETARY="ka_GE.georgianps"
    LC_MESSAGES="ka_GE.georgianps"
    LC_PAPER="ka_GE.georgianps"
    LC_NAME="ka_GE.georgianps"
    LC_ADDRESS="ka_GE.georgianps"
    LC_TELEPHONE="ka_GE.georgianps"
    LC_MEASUREMENT="ka_GE.georgianps"
    LC_IDENTIFICATION="ka_GE.georgianps"
    LC_ALL=

    vstinner@apu$ LANG=ka_GE.georgianps python3.11 -c "import sys; print(sys.flags.utf8_mode)"
    Python path configuration:
    PYTHONHOME = (not set)
    PYTHONPATH = (not set)
    program name = './python'
    isolated = 0
    environment = 1
    user site = 1
    import site = 1
    stdlib dir = '/home/vstinner/python/main/Lib'
    sys._base_executable = '/home/vstinner/python/main/python'
    sys.base_prefix = '/usr/local'
    sys.base_exec_prefix = '/usr/local'
    sys.platlibdir = 'lib'
    sys.executable = '/home/vstinner/python/main/python'
    sys.prefix = '/usr/local'
    sys.exec_prefix = '/usr/local'
    sys.path = [
    '/usr/local/lib/python311.zip',
    '/home/vstinner/python/main/Lib',
    '/home/vstinner/python/main/build/lib.linux-x86_64-3.11-pydebug',
    ]
    Fatal Python error: init_fs_encoding: failed to get the Python codec of the filesystem encoding
    Python runtime state: core initialized
    LookupError: unknown encoding: GEORGIAN-PS

    Current thread 0x00007ff89b81d2c0 (most recent call first):
    <no Python frame>

    @vstinner vstinner reopened this Oct 18, 2021
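The reproduction above aborts the interpreter, so it is easier to script from a separate process. The sketch below runs the same one-liner in a child interpreter with a chosen LANG and reports the outcome instead of crashing. `probe_startup` is a hypothetical helper name; it clears the LC_* variables because LC_ALL and LC_CTYPE would otherwise override LANG. Whether the child actually fails depends on the locale being installed on the system.

```python
import os
import subprocess
import sys


def probe_startup(lang, python=sys.executable):
    """Start a child interpreter under LANG=lang and capture what happens."""
    # Drop LANG and every LC_* variable so that only our LANG value applies.
    env = {
        key: value
        for key, value in os.environ.items()
        if key != "LANG" and not key.startswith("LC_")
    }
    env["LANG"] = lang
    return subprocess.run(
        [python, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
        capture_output=True,
        text=True,
        timeout=60,
    )


result = probe_startup("ka_GE.georgianps")
if result.returncode == 0:
    print("started; utf8_mode =", result.stdout.strip())
else:
    print("startup failed with return code", result.returncode)
    print(result.stderr.strip())
```

Passing a different `python` path makes it possible to compare several installed versions on the same machine.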
    @serhiy-storchaka
    Member

    Possible solutions (they can be combined):

    1. Add support for the GEORGIAN-PS charset and all other encodings used in libc (bpo-22679). The problem is that it is difficult to get the official information about these encodings.

    2. Fall back to utf-8 or ascii+surrogateescape if the locale encoding is unsupported. The drawback is that typos in the locale setting can then slip by unnoticed.

    @malemburg
    Member

    On 19.10.2021 10:44, Serhiy Storchaka wrote:

    Possible solutions (they can be combined):

    1. Add support for the GEORGIAN-PS charset and all other encodings used in libc (bpo-22679). The problem is that it is difficult to get the official information about these encodings.

    As with all encodings we add: there has to be a real need to support
    them natively in Python (as opposed to installing codecs via PyPI)
    and we need a definite source for the encoding, e.g. a standards
    document from an official body.

    IMO, we should not really add more encodings to the stdlib, but instead
    point people to e.g. the iconv package:

    https://pypi.org/project/python-iconv/

    Perhaps we ought to make it easier for such packages to provide
    additional codecs even during the startup phase, e.g. via a special
    env var which points Python to a list of codec packages to load
    prior to initializing the I/O encoding... not sure whether this is
    possible, though.

    2. Fall back to utf-8 or ascii+surrogateescape if the locale encoding is unsupported. The drawback is that typos in the locale setting can then slip by unnoticed.

    I think this would be a more general solution to such cases, provided
    the startup logic issues a visible warning about the fallback.
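The fallback-with-warning idea can be illustrated in pure Python. This is only a sketch of the behaviour being proposed: the real change would live in the interpreter's C startup code, and `resolve_encoding` is a hypothetical name. The last lines show why ascii+surrogateescape is a candidate fallback: undecodable bytes survive a round trip unchanged.

```python
import codecs
import warnings


def resolve_encoding(requested, fallback="utf-8"):
    """Return a usable codec name, warning visibly if we had to fall back."""
    try:
        return codecs.lookup(requested).name
    except LookupError:
        warnings.warn(
            f"unsupported locale encoding {requested!r}; "
            f"falling back to {fallback!r}",
            RuntimeWarning,
            stacklevel=2,
        )
        return codecs.lookup(fallback).name


# Bytes that plain ASCII cannot decode are preserved by surrogateescape.
raw = bytes(range(0xC0, 0xE6))
text = raw.decode("ascii", errors="surrogateescape")
assert text.encode("ascii", errors="surrogateescape") == raw
```

The warning is what addresses the concern about typos: a misspelled locale still starts the interpreter, but not silently.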

    @iritkatriel iritkatriel added the 3.9 (only security fixes), 3.10 (only security fixes), and 3.11 (bug and security fixes) labels on Dec 11, 2021
    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    GerHobbelt added a commit to GerHobbelt/uchardet that referenced this issue Jul 6, 2023
    Without this unicode range spec every character is deferred through an external `iconv` call, slowing down the BuildModel code tremendously!