Codec lookup failing under turkish locale #46138

Closed
arnimar mannequin opened this issue Jan 12, 2008 · 31 comments
Assignees
Labels
stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@arnimar
Mannequin

arnimar mannequin commented Jan 12, 2008

BPO 1813
Nosy @malemburg, @pitrou, @vstinner, @jwilk, @djc, @bitdancer, @skrah
Files
  • verify_locale.py: Program to verify bug/fix
  • turklocale.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = 'https://github.com/malemburg'
    closed_at = <Date 2012-02-04.04:04:36.033>
    created_at = <Date 2008-01-12.15:00:02.935>
    labels = ['type-bug', 'library']
    title = 'Codec lookup failing under turkish locale'
    updated_at = <Date 2012-02-04.04:04:36.032>
    user = 'https://bugs.python.org/arnimar'

    bugs.python.org fields:

    activity = <Date 2012-02-04.04:04:36.032>
    actor = 'Arfrever'
    assignee = 'lemburg'
    closed = True
    closed_date = <Date 2012-02-04.04:04:36.033>
    closer = 'Arfrever'
    components = ['Library (Lib)']
    creation = <Date 2008-01-12.15:00:02.935>
    creator = 'arnimar'
    dependencies = []
    files = ['9140', '9440']
    hgrepos = []
    issue_num = 1813
    keywords = ['patch']
    message_count = 31.0
    messages = ['59821', '62386', '62433', '62463', '62464', '62466', '62472', '64109', '64162', '111605', '111765', '119686', '119692', '140399', '141028', '141029', '141030', '141190', '141191', '141193', '141196', '141262', '141322', '141550', '141551', '141559', '141561', '141562', '143954', '152461', '152462']
    nosy_count = 13.0
    nosy_names = ['lemburg', 'jafo', 'pitrou', 'vstinner', 'arnimar', 'jwilk', 'djc', 'Arfrever', 'r.david.murray', 'skrah', 'BreamoreBoy', 'python-dev', 'gkcn']
    pr_nums = []
    priority = 'normal'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue1813'
    versions = ['Python 2.7', 'Python 3.2', 'Python 3.3']

    @arnimar
    Mannequin Author

    arnimar mannequin commented Jan 12, 2008

    When switching to a Turkish locale, the codecs registry fails on a codec
    lookup that worked before the locale change.

    This happens when the codec name contains an uppercase 'I'. Just before
    doing a cache lookup, the string is normalized, which includes a call to
    <ctype.h>'s tolower. tolower is locale dependent, and the Turkish locale
    handles 'I' differently from other locales. Thus, the lookup fails, since
    the normalization behaves differently than it did before.

    Replacing the tolower() call with this made the lookup work:

    /* ASCII-only lowercase: does not depend on the current locale. */
    int my_tolower(char c)
    {
    	if ('A' <= c && c <= 'Z')
    		c += 'a' - 'A';
    
    	return c;
    }
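
To make the failure mode concrete, here is a minimal Python sketch; it is not CPython's actual registry code. `turkish_style_lower` is a hypothetical stand-in for a locale-aware `tolower()` under a Turkish locale, simulated with Unicode strings so that it runs without the locale installed. `ascii_lower` mirrors the `my_tolower` replacement above, and the `cache` dictionary is an assumed simplification of the registry's lookup cache.

```python
def ascii_lower(name):
    """Lowercase only ASCII 'A'-'Z', like my_tolower(); never locale dependent."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in name
    )


def turkish_style_lower(name):
    """Hypothetical stand-in for tolower() under tr_TR: 'I' becomes dotless 'ı'."""
    return ascii_lower(name.replace("I", "\u0131"))


# A cache keyed by the normalized name, filled before the locale change.
cache = {ascii_lower("ISO8859-9"): "<codec entry>"}

# Same normalization afterwards: the key matches and the lookup hits.
hit = cache.get(ascii_lower("ISO8859-9"))

# Turkish-style normalization afterwards: the key differs, so the lookup misses.
miss = cache.get(turkish_style_lower("ISO8859-9"))
```

With the ASCII-only normalization the key stays the same whatever the locale is, which is why the replacement makes the lookup work.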

    PS: If the Turkish locale is not supported, the following will enable it
    on an Ubuntu system:

    a) sudo cp /usr/share/i18n/SUPPORTED /var/lib/locales/supported.d/local
    (or just copy the lines with "tr" in them)
    b) sudo dpkg-reconfigure locales

    @arnimar arnimar mannequin added interpreter-core (Objects, Python, Grammar, and Parser dirs) type-bug An unexpected behavior, bug, or error labels Jan 12, 2008
    @arnimar arnimar mannequin added stdlib Python modules in the Lib dir and removed interpreter-core (Objects, Python, Grammar, and Parser dirs) labels Feb 13, 2008
    @pitrou
    Member

    pitrou commented Feb 14, 2008

    I can confirm this on SVN trunk on a Mandriva system.

    @arnimar
    Mannequin Author

    arnimar mannequin commented Feb 15, 2008

    There is more to this bug than appears. I'm guessing that the name
    mangling code in locale (i.e. the normalizing code) is locale dependent.

    See this example:

    #!/usr/bin/python2.5

    import locale

    print 'TR', locale.normalize('tr')

    print locale.setlocale(locale.LC_ALL, ('tr_TR', 'ISO8859-9'))

    # first issue, not quite the same coming out, as came in
    print locale.getlocale()

    # and this fails
    print locale.setlocale(locale.LC_ALL, ('tr_TR', 'ISO8859-9'))

    First, the value returned from getlocale is ('tr_TR', 'so8859-9'), not
    ('tr_TR', 'ISO8859-9'), and the second setlocale fails.
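
The script above is Python 2. A rough Python 3 equivalent of the same round trip, guarded so that it reports nothing when the requested locale is not installed, might look like the sketch below. `roundtrip_locale` is a hypothetical helper name, and the locale name passed to it is an assumption about what the system provides.

```python
import locale


def roundtrip_locale(name):
    """Set LC_CTYPE to `name`, read it back, then set it again from that value.

    Returns None if `name` cannot be set on this system. Otherwise returns a
    (getlocale_result, second_setlocale_result) tuple; an element is None if
    the corresponding step failed.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query only, changes nothing
    try:
        try:
            locale.setlocale(locale.LC_CTYPE, name)
        except locale.Error:
            return None  # locale not available
        seen = again = None
        try:
            seen = locale.getlocale(locale.LC_CTYPE)
            again = locale.setlocale(locale.LC_CTYPE, seen)
        except (locale.Error, ValueError):
            pass  # leave the failed step as None
        return (seen, again)
    finally:
        # Always restore the locale that was active on entry.
        locale.setlocale(locale.LC_CTYPE, saved)
```

Calling `roundtrip_locale("tr_TR.ISO8859-9")` on a system with that locale installed would exercise the same two `setlocale` calls as the report; a `None` in the second slot of the result corresponds to the failing second call.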

    @pitrou
    Member

    pitrou commented Feb 16, 2008

    The C library's tolower() and toupper() are used in a handful of source
    files. It might make sense to replace some of those calls with
    ascii-only versions of the corresponding functions.

    Modules/_sre.c: return ((ch) < 256 ? (unsigned int)tolower((ch)) : ch);
    Modules/_sqlite/cursor.c: *dst++ = tolower(*src++);
    Modules/stropmodule.c: *s_new = tolower(c);
    Modules/stropmodule.c: *s_new = toupper(c);
    Modules/stropmodule.c: *s_new = toupper(c);
    Modules/stropmodule.c: *s_new = tolower(c);
    Modules/stropmodule.c: *s_new = toupper(c);
    Modules/stropmodule.c: *s_new = tolower(c);
    Modules/unicodedata.c: h = (h * scale) + (unsigned char) toupper(Py_CHARMASK(s[i]));
    Modules/unicodedata.c: if (toupper(Py_CHARMASK(name[i])) != buffer[i])
    Modules/_tkinter.c: argv0[0] = tolower(Py_CHARMASK(argv0[0]));
    Modules/binascii.c: c = tolower(c);
    Objects/stringobject.c: s[i] = _tolower(c);
    Objects/stringobject.c: s[i] = _toupper(c);
    Objects/stringobject.c: c = toupper(c);
    Objects/stringobject.c: c = tolower(c);
    Objects/stringobject.c: *s_new = toupper(c);
    Objects/stringobject.c: *s_new = tolower(c);
    Objects/stringobject.c: *s_new = toupper(c);
    Objects/stringobject.c: *s_new = tolower(c);
    Parser/tokenizer.c: else buf[i] = tolower(c);
    Python/codecs.c: ch = tolower(Py_CHARMASK(ch));
    Python/dynload_win.c: first = tolower(*string1);
    Python/dynload_win.c: second = tolower(*string2);
    Python/pystrcmp.c: while ((--size > 0) && (tolower(*s1) == tolower(*s2))) {
    Python/pystrcmp.c: return tolower(*s1) - tolower(*s2);
    Python/pystrcmp.c: while (*s1 && (tolower(*s1++) == tolower(*s2++))) {
    Python/pystrcmp.c: return (tolower(*s1) - tolower(*s2));

    @pitrou
    Member

    pitrou commented Feb 16, 2008

    As for the .upper() and .lower() methods, they are used in quite a bunch
    of standard library modules :-/...

    Lib/base64.py
    Lib/BaseHTTPServer.py
    Lib/bsddb/test/test_compare.py
    Lib/bsddb/test/test_dbobj.py
    Lib/CGIHTTPServer.py
    Lib/cgi.py
    Lib/compiler/ast.py
    Lib/ConfigParser.py
    Lib/cookielib.py
    Lib/Cookie.py
    Lib/csv.py
    Lib/ctypes/test/test_byteswap.py
    Lib/ctypes/util.py
    Lib/decimal.py
    Lib/distutils/command/bdist_rpm.py
    Lib/distutils/command/bdist_wininst.py
    Lib/distutils/command/register.py
    Lib/distutils/msvc9compiler.py
    Lib/distutils/msvccompiler.py
    Lib/distutils/sysconfig.py
    Lib/distutils/tests/test_dist.py
    Lib/distutils/util.py
    Lib/email/charset.py
    Lib/email/encoders.py
    Lib/email/header.py
    Lib/email/__init__.py
    Lib/email/message.py
    Lib/email/_parseaddr.py
    Lib/email/test/test_email.py
    Lib/email/test/test_email_renamed.py
    Lib/encodings/idna.py
    Lib/encodings/punycode.py
    Lib/formatter.py
    Lib/ftplib.py
    Lib/gettext.py
    Lib/htmllib.py
    Lib/HTMLParser.py
    Lib/httplib.py
    Lib/idlelib/configDialog.py
    Lib/idlelib/EditorWindow.py
    Lib/idlelib/IOBinding.py
    Lib/idlelib/keybindingDialog.py
    Lib/idlelib/PyShell.py
    Lib/idlelib/SearchDialogBase.py
    Lib/idlelib/tabbedpages.py
    Lib/idlelib/TreeWidget.py
    Lib/imaplib.py
    Lib/inspect.py
    Lib/lib-tk/turtle.py
    Lib/locale.py
    Lib/logging/handlers.py
    Lib/logging/__init__.py
    Lib/_LWPCookieJar.py
    Lib/macpath.py
    Lib/mailcap.py
    Lib/markupbase.py
    Lib/mhlib.py
    Lib/mimetools.py
    Lib/mimetypes.py
    Lib/mimify.py
    Lib/msilib/__init__.py
    Lib/nntplib.py
    Lib/ntpath.py
    Lib/nturl2path.py
    Lib/optparse.py
    Lib/os2emxpath.py
    Lib/os.py
    Lib/pdb.py
    Lib/plat-irix5/flp.py
    Lib/plat-irix6/flp.py
    Lib/plat-mac/buildtools.py
    Lib/plat-mac/gensuitemodule.py
    Lib/plat-riscos/riscospath.py
    Lib/pyclbr.py
    Lib/rfc822.py
    Lib/robotparser.py
    Lib/sgmllib.py
    Lib/SimpleHTTPServer.py
    Lib/smtpd.py
    Lib/smtplib.py
    Lib/socket.py
    Lib/sqlite3/test/hooks.py
    Lib/sre_constants.py
    Lib/stringold.py
    Lib/stringprep.py
    Lib/string.py
    Lib/_strptime.py
    Lib/subprocess.py
    Lib/test/regrtest.py
    Lib/test/test_bigmem.py
    Lib/test/test_codeccallbacks.py
    Lib/test/test_codecs.py
    Lib/test/test_cookielib.py
    Lib/test/test_datetime.py
    Lib/test/test_decimal.py
    Lib/test/test_deque.py
    Lib/test/test_descr.py
    Lib/test/test_fileinput.py
    Lib/test/test_grp.py
    Lib/test/test_hmac.py
    Lib/test/test_httplib.py
    Lib/test/test_os.py
    Lib/test/test_smtplib.py
    Lib/test/test_sort.py
    Lib/test/test_ssl.py
    Lib/test/test_strop.py
    Lib/test/test_strptime.py
    Lib/test/test_support.py
    Lib/test/test_ucn.py
    Lib/test/test_unicodedata.py
    Lib/test/test_urllib2.py
    Lib/test/test_urllib.py
    Lib/test/test_wsgiref.py
    Lib/test/test_xmlrpc.py
    Lib/urllib2.py
    Lib/urllib.py
    Lib/urlparse.py
    Lib/UserString.py
    Lib/uuid.py
    Lib/warnings.py
    Lib/webbrowser.py
    Lib/wsgiref/handlers.py
    Lib/wsgiref/headers.py
    Lib/wsgiref/simple_server.py
    Lib/wsgiref/util.py
    Lib/wsgiref/validate.py
    Lib/xml/dom/minidom.py
    Lib/xml/dom/xmlbuilder.py
    Lib/xmllib.py

    @pitrou
    Member

    pitrou commented Feb 16, 2008

    Even if we don't fix all uses of (?to)(lower|upper) in the source tree,
    I think it's important that codec and locale lookup work properly when
    the current locale defines non-latin case folding for latin characters.
    Here is a patch.

    Perhaps also the str type should grow ascii_lower() and ascii_upper()
    methods, since many cases of using lower() and upper() actually assume
    ascii semantics (e.g. for parsing of HTTP or SMTP headers).
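
As a rough illustration of what such ASCII-only helpers could do, here is a sketch built on `str.translate`. The names `ascii_lower` and `ascii_upper` come from the suggestion above; no such `str` methods exist. `header_is` is a made-up helper standing in for the header-parsing case.

```python
import string

# Translation tables that touch only the 26 ASCII letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def ascii_lower(text):
    """Lowercase ASCII letters only; every other character is left alone."""
    return text.translate(_ASCII_LOWER)


def ascii_upper(text):
    """Uppercase ASCII letters only; every other character is left alone."""
    return text.translate(_ASCII_UPPER)


def header_is(line, name):
    """Hypothetical helper: is `line` the header `name`, ignoring ASCII case?"""
    field, sep, _value = line.partition(":")
    return bool(sep) and ascii_lower(field.strip()) == ascii_lower(name)
```

Because the tables are fixed at import time, the result cannot change when the process later switches locale.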

    @malemburg
    Member

    I agree that it's a bit unfortunate that the 8-bit string APIs in Python
    use the locale-aware C functions by default (this should really be
    reversed: there should be locale-aware .upper() and .lower() methods and
    the standard ones should work just like the Unicode ones, without
    dependency on the locale, using ASCII mappings), but for historical
    reasons this cannot easily be changed.

    .lower() and .upper() for 8-bit strings were always locale dependent and
    before the addition of Unicode, setting the locale was the most common
    way to make an application understand different character sets.

    In Python 3k the problem will probably go away, since .lower() and
    .upper() will then no longer depend on the locale.

    Perhaps we should just convert a few of the cases you found to using
    Unicode strings instead of 8-bit strings in 2.6 ?! That would both make
    the code more portable and also provide a clear statement of "this is a
    text string", making porting to Py3k easier.
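
The Python 3 behaviour mentioned above can be checked on any Python 3 interpreter. The snippet below is only a small sanity check, not part of any fix: `str.lower()` follows Unicode case mappings and `bytes.lower()` converts ASCII letters only, so neither consults the C locale.

```python
name = "ISO8859-9"

# str.lower() uses Unicode case mappings, independent of setlocale().
lowered_text = name.lower()

# bytes.lower() converts only ASCII letters; other byte values are unchanged.
lowered_bytes = b"ISO8859-9 \xdd".lower()
```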

    @jafo
    Mannequin

    jafo mannequin commented Mar 19, 2008

    Marc-Andre: How should we proceed with this bug? Discuss on python-dev
    or c.l.python?

    @jafo jafo mannequin assigned malemburg Mar 19, 2008
    @malemburg
    Member

    Sean: I'd suggest discussing this on python-dev.

    Note that even if we do use Unicode for the cases in question, the
    Turkish locale will still pose a problem - see bpo-1528802 for a discussion.

    @BreamoreBoy
    Mannequin

    BreamoreBoy mannequin commented Jul 26, 2010

    Does anyone know if this was discussed on python-dev? I've tried searching the archives and didn't find anything, but that's not to say it isn't there.

    @vstinner
    Member

    There is also a locale normalization function in unicodeobject.c: normalize_encoding(). This function uses "if (ISUPPER(*e)) *l++ = TOLOWER(*e++);" which uses Python's locale-independent implementation of ctype.

    We should maybe use the ISUPPER / TOLOWER in codecs.c.

    Anyway, a function should be fixed, but I don't know which one :-)

    @djc
    Member

    djc commented Oct 27, 2010

    We've included this patch in Gentoo for about two years now. Can we get some discussion going on doing something like this?

    @malemburg
    Member

    Looking at this again, I think we should change the codec registry C code to use Py_TOLOWER() and the encoding search function code to use the .translate() approach that Antoine suggested.

    @vstinner
    Member

    The decimal module has been fixed in Python 2.7, 3.2 and 3.3 for the Turkish locale: issue bpo-11830.

    @python-dev
    Mannequin

    python-dev mannequin commented Jul 24, 2011

    New changeset 92d02de91cc9 by Antoine Pitrou in branch '3.2':
    Issue bpo-1813: Fix codec lookup under Turkish locales.
    http://hg.python.org/cpython/rev/92d02de91cc9

    New changeset a77a4df54b95 by Antoine Pitrou in branch '3.2':
    Add a test for issue bpo-1813: getlocale() failing under a Turkish locale
    http://hg.python.org/cpython/rev/a77a4df54b95

    New changeset fe0caf8c48d2 by Antoine Pitrou in branch 'default':
    Add a test for issue bpo-1813: getlocale() failing under a Turkish locale
    http://hg.python.org/cpython/rev/fe0caf8c48d2

    @python-dev
    Mannequin

    python-dev mannequin commented Jul 24, 2011

    New changeset 739958134fe5 by Antoine Pitrou in branch '2.7':
    Issue bpo-1813: Fix codec lookup and setting/getting locales under Turkish locales.
    http://hg.python.org/cpython/rev/739958134fe5

    @pitrou
    Member

    pitrou commented Jul 24, 2011

    Finally fixed in 2.7, 3.2, 3.3!

    @pitrou pitrou closed this as completed Jul 24, 2011
    @skrah
    Mannequin

    skrah mannequin commented Jul 26, 2011

    The Fedora bot fails because here ...

      locale.setlocale(locale.LC_CTYPE, loc)
    
    loc = ('tr_TR', 'ISO8859-9'), and apparently setlocale can only
    handle "tr_TR", but not "tr_TR.ISO8859-9":

    144 if (locale) {
    145 /* set locale */
    146 result = setlocale(category, locale);
    147 if (!result) {
    148 /* operation failed, no setting was changed */
    149 PyErr_SetString(Error, "unsupported locale setting");
    150 return NULL;
    (gdb) p result = setlocale(category, "tr_TR.ISO8859-9")
    $8 = 0x0
    (gdb) p result = setlocale(category, "tr_TR")
    $9 = 0x96d770 "tr_TR"
    (gdb) p locale
    $10 = 0x7ffff0f6a5b0 "tr_TR.ISO8859-9"
    (gdb)

    @skrah
    Mannequin

    skrah mannequin commented Jul 26, 2011

    Stefan Krah <report@bugs.python.org> wrote:

    > (gdb) p result = setlocale(category, "tr_TR.ISO8859-9")
    > $8 = 0x0
    > (gdb) p result = setlocale(category, "tr_TR")
    > $9 = 0x96d770 "tr_TR"
    > (gdb) p locale
    > $10 = 0x7ffff0f6a5b0 "tr_TR.ISO8859-9"
    > (gdb)

    Perhaps this is a bug in Fedora's setlocale that can't handle the turkish 'I'
    in 'ISO' when CTYPE is turkish.

    @pitrou
    Member

    pitrou commented Jul 26, 2011

    Stefan Krah <report@bugs.python.org> wrote:
    > (gdb) p result = setlocale(category, "tr_TR.ISO8859-9")
    > $8 = 0x0
    > (gdb) p result = setlocale(category, "tr_TR")
    > $9 = 0x96d770 "tr_TR"
    > (gdb) p locale
    > $10 = 0x7ffff0f6a5b0 "tr_TR.ISO8859-9"
    > (gdb)

    > Perhaps this is a bug in Fedora's setlocale that can't handle the turkish 'I'
    > in 'ISO' when CTYPE is turkish.

    Perhaps indeed. Maybe you should try to report it.
    It does look like an OS bug in any case.
    (fortunately that buildbot is in the "unstable" bunch :-))

    @skrah
    Mannequin

    skrah mannequin commented Jul 26, 2011

    Yes, it's a bug. This works:

    #include <stdio.h>
    #include <locale.h>
    int
    main(void)
    {
        char *s;
        printf("%s\n", setlocale(LC_CTYPE, "tr_TR.ISO8859-9"));
        printf("%s\n", setlocale(LC_CTYPE, NULL));
        s = setlocale(LC_CTYPE, "tr_TR.ISO8859-9");
        printf("%s\n", s ? s : "null");
        return 0;
    }

    But when I change the first setlocale call to "tr_TR", the result of
    the last call is NULL.

    @bitdancer
    Member

    I'm seeing this test failure in Gentoo, as well.

    @skrah
    Mannequin

    skrah mannequin commented Jul 28, 2011

    Fedora bug report:

    https://bugzilla.redhat.com/show_bug.cgi?id=726536

    @skrah
    Mannequin

    skrah mannequin commented Aug 2, 2011

    Unrelated to the Fedora issue: The test is currently skipped on the
    FreeBSD bot, but completes successfully with:

    diff -r 0b52b6f1bfab Lib/test/test_locale.py
    --- a/Lib/test/test_locale.py   Tue Aug 02 10:16:45 2011 +0200
    +++ b/Lib/test/test_locale.py   Tue Aug 02 11:37:39 2011 +0200
    @@ -399,7 +399,7 @@
             oldlocale = locale.setlocale(locale.LC_CTYPE)
             self.addCleanup(locale.setlocale, locale.LC_CTYPE, oldlocale)
             try:
    -            locale.setlocale(locale.LC_CTYPE, 'tr_TR')
    +            locale.setlocale(locale.LC_CTYPE, 'tr_TR.UTF-8')
             except locale.Error:
                 # Unsupported locale on this system
                 self.skipTest('test needs Turkish locale')

    @skrah
    Mannequin

    skrah mannequin commented Aug 2, 2011

    As I wrote on python-dev, this test also fails on Debian lenny, which has
    the same setlocale() bug as Fedora.

    So, indeed the test should be skipped on a multitude of platforms.

    @bitdancer
    Member

    On Tue, 02 Aug 2011 12:12:37 +0200, Stefan Krah <stefan@bytereef.org> wrote:

    > I suspect many buildbots are green because they don't have tr_TR and
    > tr_TR.iso8859-9 installed.

    This is true for my Gentoo buildbots. Once we've figured out the
    best way to handle this, I'll fix that (install the other locales) for
    my two.

    When I run the C test program I get null as the final output of that
    regardless of whether I use 'tr_TR' or 'tr_TR.utf8'.

    This is with glibc-2.13-r2 (the r2 is Gentoo's mod number).

    As someone pointed out on python-dev, if this isn't fixable then it should be an expected failure, not a skip.

    One question is, is there any platform on which the turkish locale is installed where this test actually works?

    @skrah
    Mannequin

    skrah mannequin commented Aug 2, 2011

    [Re-opening to fix the skips]

    Yes, the test works on:

    Ubuntu Lucid (libc-2.11.1), OpenSUSE (libc-2.11.1), FreeBSD-8.2

    Failure:

    Fedora 14 (libc-2.13), Debian lenny (libc-2.7), Gentoo (libc-2.13-r2)

    So perhaps this test should be marked as expected failure on Linux
    altogether (unless we test for the libc version).

    @skrah skrah mannequin reopened this Aug 2, 2011
    @pitrou
    Member

    pitrou commented Aug 2, 2011

    > As someone pointed out on python-dev, if this isn't fixable then it
    > should be an expected failure, not a skip.

    The Python bug is fixed, the problem is apparently some libcs have the
    same bug as we did...

    > One question is, is there any platform on which the turkish locale is
    > installed where this test actually works?

    Well, it works here (Mageia).

    @skrah
    Mannequin

    skrah mannequin commented Sep 13, 2011

    https://bugzilla.redhat.com/show_bug.cgi?id=726536 claims that the
    glibc issue (which is relevant for skipping the test case) is fixed
    in glibc-2.14.90-8.

    I suspect the only way of running the test case reliably is whitelisting
    a couple of known good glibc versions.
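
One way such a whitelist could be sketched is with `platform.libc_ver()`. The version data below is taken only from the reports in this thread (2.11.1 reported working, the fix claimed for 2.14.90); treating every later version as good is an assumption. `glibc_known_good` and the other names are hypothetical helpers, not what the test suite does.

```python
import platform
import re

# Versions reported in this thread as passing the test (assumption: incomplete).
KNOWN_GOOD = {(2, 11, 1)}
# The Fedora bug report claims the fix landed in glibc-2.14.90-8.
FIXED_SINCE = (2, 14, 90)


def parse_version(text):
    """Turn '2.14.90' into (2, 14, 90), ignoring any suffix such as '-r2'."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def glibc_known_good(lib, version):
    """Return True only for glibc versions believed to be free of the bug."""
    if lib != "glibc":
        return False  # unknown C library: assume nothing
    parsed = parse_version(version)
    return parsed in KNOWN_GOOD or parsed >= FIXED_SINCE


def current_libc_known_good():
    """Apply the check to the C library this interpreter is linked against."""
    lib, version = platform.libc_ver()
    return glibc_known_good(lib, version)
```

A test could then call `self.skipTest(...)` when `current_libc_known_good()` returns false, at the cost of having to maintain the version data by hand.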

    @python-dev
    Mannequin

    python-dev mannequin commented Feb 2, 2012

    New changeset a55ffb6c1993 by Stefan Krah in branch '3.2':
    Issue bpo-1813: Revert workaround for a glibc bug on the Fedora buildbot.
    http://hg.python.org/cpython/rev/a55ffb6c1993

    New changeset 4244e4348362 by Stefan Krah in branch 'default':
    Issue bpo-1813: merge changeset that reverts a glibc workaround for the
    http://hg.python.org/cpython/rev/4244e4348362

    New changeset 0b8917fc6db5 by Stefan Krah in branch '2.7':
    Issue bpo-1813: backport changeset that reverts a glibc workaround for the
    http://hg.python.org/cpython/rev/0b8917fc6db5

    @skrah
    Mannequin

    skrah mannequin commented Feb 2, 2012

    I've upgraded the Fedora buildbot to Fedora-16. The specific glibc
    workaround should not be necessary any more.

    So the test will now fail again on all systems that a) have the bug
    and b) have the tr_TR locale installed.
