classification
Title: Codec lookup failing under turkish locale
Type: behavior
Stage: committed/rejected
Components: Library (Lib)
Versions: Python 3.3, Python 3.2, Python 2.7
process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: lemburg
Nosy List: Arfrever, BreamoreBoy, arnimar, djc, gkcn, haypo, jafo, jwilk, lemburg, pitrou, python-dev, r.david.murray, skrah
Priority: normal
Keywords: patch

Created on 2008-01-12 15:00 by arnimar, last changed 2012-02-04 04:04 by Arfrever. This issue is now closed.

Files
File name Uploaded Description
verify_locale.py arnimar, 2008-01-12 15:00 Program to verify bug/fix
turklocale.patch pitrou, 2008-02-16 20:04
Messages (31)
msg59821 - (view) Author: Árni Már Jónsson (arnimar) Date: 2008-01-12 15:00
When switching to a Turkish locale, the codecs registry fails on a codec
lookup which worked before the locale change.

This happens when the codec name contains an uppercase 'I'. Just before
doing a cache lookup, the string is normalized, which includes a call to
<ctype.h>'s tolower. tolower is locale dependent, and the Turkish locale
handles 'I's differently from other locales. Thus, the lookup fails, since
the normalization behaves differently than it did before.

Replacing the tolower() call with this made the lookup work:

int my_tolower(char c)
{
	if ('A' <= c && c <= 'Z')
		c += 32;

	return c;
}

PS: If the Turkish locale is not supported, the following will enable it
on an Ubuntu system:

a) sudo cp /usr/share/i18n/SUPPORTED /var/lib/locales/supported.d/local
   (or just copy the lines with "tr" in them)
b) sudo dpkg-reconfigure locales
msg62386 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-14 10:52
I can confirm this on SVN trunk on a Mandriva system.
msg62433 - (view) Author: Árni Már Jónsson (arnimar) Date: 2008-02-15 16:36
There is more to this bug than appears. I'm guessing that the name
mangling code in locale (e.g. the normalizing code) is locale dependent. 

See this example:

#!/usr/bin/python2.5

import locale

print 'TR', locale.normalize('tr')

print locale.setlocale(locale.LC_ALL, ('tr_TR', 'ISO8859-9'))

# first issue, not quite the same coming out, as came in
print locale.getlocale()

# and this fails
print locale.setlocale(locale.LC_ALL, ('tr_TR', 'ISO8859-9'))



First, the value returned from getlocale is ('tr_TR', 'so8859-9'), not
('tr_TR', 'ISO8859-9'), and the second setlocale fails.
msg62463 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-16 19:34
The C library's tolower() and toupper() are used in a handful of source
files. It might make sense to replace some of those calls with
ascii-only versions of the corresponding functions.

Modules/_sre.c:    return ((ch) < 256 ? (unsigned int)tolower((ch)) : ch);
Modules/_sqlite/cursor.c:        *dst++ = tolower(*src++);
Modules/stropmodule.c:			*s_new = tolower(c);
Modules/stropmodule.c:			*s_new = toupper(c);
Modules/stropmodule.c:			*s_new = toupper(c);
Modules/stropmodule.c:			*s_new = tolower(c);
Modules/stropmodule.c:			*s_new = toupper(c);
Modules/stropmodule.c:			*s_new = tolower(c);
Modules/unicodedata.c:        h = (h * scale) + (unsigned char) toupper(Py_CHARMASK(s[i]));
Modules/unicodedata.c:        if (toupper(Py_CHARMASK(name[i])) != buffer[i])
Modules/_tkinter.c:		argv0[0] = tolower(Py_CHARMASK(argv0[0]));
Modules/binascii.c:			c = tolower(c);
Objects/stringobject.c:			s[i] = _tolower(c);
Objects/stringobject.c:			s[i] = _toupper(c);
Objects/stringobject.c:			    c = toupper(c);
Objects/stringobject.c:			    c = tolower(c);
Objects/stringobject.c:			*s_new = toupper(c);
Objects/stringobject.c:			*s_new = tolower(c);
Objects/stringobject.c:			*s_new = toupper(c);
Objects/stringobject.c:			*s_new = tolower(c);
Parser/tokenizer.c:		else buf[i] = tolower(c);
Python/codecs.c:            ch = tolower(Py_CHARMASK(ch));
Python/dynload_win.c:		first  = tolower(*string1);
Python/dynload_win.c:		second = tolower(*string2);
Python/pystrcmp.c:	while ((--size > 0) && (tolower(*s1) == tolower(*s2))) {
Python/pystrcmp.c:	return tolower(*s1) - tolower(*s2);
Python/pystrcmp.c:	while (*s1 && (tolower(*s1++) == tolower(*s2++))) {
Python/pystrcmp.c:	return (tolower(*s1) - tolower(*s2));
msg62464 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-16 19:58
As for the .upper() and .lower() methods, they are used in quite a bunch
of standard library modules :-/...

Lib/base64.py
Lib/BaseHTTPServer.py
Lib/bsddb/test/test_compare.py
Lib/bsddb/test/test_dbobj.py
Lib/CGIHTTPServer.py
Lib/cgi.py
Lib/compiler/ast.py
Lib/ConfigParser.py
Lib/cookielib.py
Lib/Cookie.py
Lib/csv.py
Lib/ctypes/test/test_byteswap.py
Lib/ctypes/util.py
Lib/decimal.py
Lib/distutils/command/bdist_rpm.py
Lib/distutils/command/bdist_wininst.py
Lib/distutils/command/register.py
Lib/distutils/msvc9compiler.py
Lib/distutils/msvccompiler.py
Lib/distutils/sysconfig.py
Lib/distutils/tests/test_dist.py
Lib/distutils/util.py
Lib/email/charset.py
Lib/email/encoders.py
Lib/email/header.py
Lib/email/__init__.py
Lib/email/message.py
Lib/email/_parseaddr.py
Lib/email/test/test_email.py
Lib/email/test/test_email_renamed.py
Lib/encodings/idna.py
Lib/encodings/punycode.py
Lib/formatter.py
Lib/ftplib.py
Lib/gettext.py
Lib/htmllib.py
Lib/HTMLParser.py
Lib/httplib.py
Lib/idlelib/configDialog.py
Lib/idlelib/EditorWindow.py
Lib/idlelib/IOBinding.py
Lib/idlelib/keybindingDialog.py
Lib/idlelib/PyShell.py
Lib/idlelib/SearchDialogBase.py
Lib/idlelib/tabbedpages.py
Lib/idlelib/TreeWidget.py
Lib/imaplib.py
Lib/inspect.py
Lib/lib-tk/turtle.py
Lib/locale.py
Lib/logging/handlers.py
Lib/logging/__init__.py
Lib/_LWPCookieJar.py
Lib/macpath.py
Lib/mailcap.py
Lib/markupbase.py
Lib/mhlib.py
Lib/mimetools.py
Lib/mimetypes.py
Lib/mimify.py
Lib/msilib/__init__.py
Lib/nntplib.py
Lib/ntpath.py
Lib/nturl2path.py
Lib/optparse.py
Lib/os2emxpath.py
Lib/os.py
Lib/pdb.py
Lib/plat-irix5/flp.py
Lib/plat-irix6/flp.py
Lib/plat-mac/buildtools.py
Lib/plat-mac/gensuitemodule.py
Lib/plat-riscos/riscospath.py
Lib/pyclbr.py
Lib/rfc822.py
Lib/robotparser.py
Lib/sgmllib.py
Lib/SimpleHTTPServer.py
Lib/smtpd.py
Lib/smtplib.py
Lib/socket.py
Lib/sqlite3/test/hooks.py
Lib/sre_constants.py
Lib/stringold.py
Lib/stringprep.py
Lib/string.py
Lib/_strptime.py
Lib/subprocess.py
Lib/test/regrtest.py
Lib/test/test_bigmem.py
Lib/test/test_codeccallbacks.py
Lib/test/test_codecs.py
Lib/test/test_cookielib.py
Lib/test/test_datetime.py
Lib/test/test_decimal.py
Lib/test/test_deque.py
Lib/test/test_descr.py
Lib/test/test_fileinput.py
Lib/test/test_grp.py
Lib/test/test_hmac.py
Lib/test/test_httplib.py
Lib/test/test_os.py
Lib/test/test_smtplib.py
Lib/test/test_sort.py
Lib/test/test_ssl.py
Lib/test/test_strop.py
Lib/test/test_strptime.py
Lib/test/test_support.py
Lib/test/test_ucn.py
Lib/test/test_unicodedata.py
Lib/test/test_urllib2.py
Lib/test/test_urllib.py
Lib/test/test_wsgiref.py
Lib/test/test_xmlrpc.py
Lib/urllib2.py
Lib/urllib.py
Lib/urlparse.py
Lib/UserString.py
Lib/uuid.py
Lib/warnings.py
Lib/webbrowser.py
Lib/wsgiref/handlers.py
Lib/wsgiref/headers.py
Lib/wsgiref/simple_server.py
Lib/wsgiref/util.py
Lib/wsgiref/validate.py
Lib/xml/dom/minidom.py
Lib/xml/dom/xmlbuilder.py
Lib/xmllib.py
msg62466 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2008-02-16 20:04
Even if we don't fix all uses of (to)?(lower|upper) in the source tree,
I think it's important that codec and locale lookup work properly when
the current locale defines non-Latin case folding for Latin characters.
Here is a patch.

Perhaps also the str type should grow ascii_lower() and ascii_upper()
methods, since many cases of using lower() and upper() actually assume
ascii semantics (e.g. for parsing of HTTP or SMTP headers).
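
To illustrate the ascii_lower() / ascii_upper() idea, here is a minimal Python 3 sketch written as free functions rather than str methods. The names follow the suggestion above; the implementation is only an assumption, mirroring the arithmetic of my_tolower() from the original report, and is not taken from the attached patch.

```python
def ascii_lower(text):
    """Lowercase only A-Z; every other character is left untouched."""
    return "".join(
        chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch
        for ch in text
    )


def ascii_upper(text):
    """Uppercase only a-z; every other character is left untouched."""
    return "".join(
        chr(ord(ch) - 32) if "a" <= ch <= "z" else ch
        for ch in text
    )


# Protocol tokens such as header names are ASCII by definition, so the
# result does not depend on the current locale.
print(ascii_lower("Content-Type"))
print(ascii_upper("iso8859-9"))
```

Characters outside A-Z and a-z, including the Turkish dotted and dotless i, pass through unchanged.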
msg62472 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-02-16 22:20
I agree that it's a bit unfortunate that the 8-bit string APIs in Python
use the locale-aware C functions by default (this should really be
reversed: there should be locale-aware .upper() and .lower() methods and
the standard ones should work just like the Unicode ones - without
dependency on the locale, using ASCII mappings), but for historical
reasons this cannot easily be changed.

.lower() and .upper() for 8-bit strings were always locale dependent and
before the addition of Unicode, setting the locale was the most common
way to make an application understand different character sets.

In Python 3k the problem will probably go away, since .lower() and
.upper() will then no longer depend on the locale.

Perhaps we should just convert a few of the cases you found to using
Unicode strings instead of 8-bit strings in 2.6 ?! That would both make
the code more portable and also provide a clear statement of "this is a
text string", making porting to Py3k easier.
msg64109 - (view) Author: Sean Reifschneider (jafo) * (Python committer) Date: 2008-03-19 21:44
Marc-Andre: How should we proceed with this bug?  Discuss on python-dev
or c.l.python?
msg64162 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-03-20 10:20
Sean: I'd suggest discussing this on python-dev.

Note that even if we do use Unicode for the cases in question, the
Turkish locale will still pose a problem - see #1528802 for a discussion.
msg111605 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-07-26 12:24
Does anyone know if this was discussed on python-dev?  I've tried searching the archives and didn't find anything, but that's not to say it isn't there.
msg111765 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2010-07-28 02:16
There is also a locale normalization function in unicodeobject.c: normalize_encoding(). This function uses "if (ISUPPER(*e)) *l++ = TOLOWER(*e++);" which uses the Python, *locale-independent*, implementation of ctype.

We should maybe use the ISUPPER / TOLOWER in codecs.c.

Anyway, a function should be fixed, but I don't know which one :-)
msg119686 - (view) Author: Dirkjan Ochtman (djc) * (Python committer) Date: 2010-10-27 10:30
We've included this patch in Gentoo for about two years now. Can we get some discussion going on doing something like this?
msg119692 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2010-10-27 11:27
Looking at this again, I think we should change the codec registry C code to use Py_TOLOWER() and the encoding search function code to use the .translate() approach that Antoine suggested.
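
A rough Python 3 sketch of the .translate() idea, under stated assumptions: normalize_name() and turkish_lower() are hypothetical helpers written for illustration, the cache is a plain dict standing in for the codec registry cache, and the Turkish case rule is simulated in code rather than obtained by calling setlocale().

```python
import string

# Translation table that maps only ASCII A-Z to a-z.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def normalize_name(name):
    """Hypothetical cache key: ASCII-lowercase, spaces become underscores."""
    return name.translate(_ASCII_LOWER).replace(" ", "_")


def turkish_lower(name):
    """Simulated Turkish lowercasing: 'I' becomes dotless i (U+0131)."""
    return name.replace("I", "\u0131").lower()


# Entry stored before the locale switch, keyed by the normalized name.
cache = {normalize_name("ISO8859-9"): "codec-entry"}

print(turkish_lower("ISO8859-9") in cache)   # False: the key no longer matches
print(normalize_name("ISO8859-9") in cache)  # True: table-based key is stable
```

The table-based key stays the same whatever the process locale is, which is the property the lookup needs.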
msg140399 - (view) Author: STINNER Victor (haypo) * (Python committer) Date: 2011-07-15 09:14
The decimal module has been fixed in Python 2.7, 3.2 and 3.3 for the Turkish locale: issue #11830.
msg141028 - (view) Author: Roundup Robot (python-dev) Date: 2011-07-24 00:43
New changeset 92d02de91cc9 by Antoine Pitrou in branch '3.2':
Issue #1813: Fix codec lookup under Turkish locales.
http://hg.python.org/cpython/rev/92d02de91cc9

New changeset a77a4df54b95 by Antoine Pitrou in branch '3.2':
Add a test for issue #1813: getlocale() failing under a Turkish locale
http://hg.python.org/cpython/rev/a77a4df54b95

New changeset fe0caf8c48d2 by Antoine Pitrou in branch 'default':
Add a test for issue #1813: getlocale() failing under a Turkish locale
http://hg.python.org/cpython/rev/fe0caf8c48d2
msg141029 - (view) Author: Roundup Robot (python-dev) Date: 2011-07-24 00:52
New changeset 739958134fe5 by Antoine Pitrou in branch '2.7':
Issue #1813: Fix codec lookup and setting/getting locales under Turkish locales.
http://hg.python.org/cpython/rev/739958134fe5
msg141030 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-07-24 00:53
Finally fixed in 2.7, 3.2, 3.3!
msg141190 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-07-26 22:50
The Fedora bot fails because here ...

  locale.setlocale(locale.LC_CTYPE, loc)

loc = ('tr_TR', 'ISO8859-9'), and apparently setlocale can only
handle "tr_TR", but not "tr_TR.ISO8859-9":



144         if (locale) {
145             /* set locale */
146             result = setlocale(category, locale);
147             if (!result) {
148                 /* operation failed, no setting was changed */
149                 PyErr_SetString(Error, "unsupported locale setting");
150                 return NULL;
(gdb) p result = setlocale(category, "tr_TR.ISO8859-9")
$8 = 0x0
(gdb) p result = setlocale(category, "tr_TR")
$9 = 0x96d770 "tr_TR"
(gdb) p locale
$10 = 0x7ffff0f6a5b0 "tr_TR.ISO8859-9"
(gdb)
msg141191 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-07-26 23:01
Stefan Krah <report@bugs.python.org> wrote:
> (gdb) p result = setlocale(category, "tr_TR.ISO8859-9")
> $8 = 0x0
> (gdb) p result = setlocale(category, "tr_TR")
> $9 = 0x96d770 "tr_TR"
> (gdb) p locale
> $10 = 0x7ffff0f6a5b0 "tr_TR.ISO8859-9"
> (gdb)

Perhaps this is a bug in Fedora's setlocale that can't handle the Turkish 'I'
in 'ISO' when CTYPE is Turkish.
msg141193 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-07-26 23:02
> Stefan Krah <report@bugs.python.org> wrote:
> > (gdb) p result = setlocale(category, "tr_TR.ISO8859-9")
> > $8 = 0x0
> > (gdb) p result = setlocale(category, "tr_TR")
> > $9 = 0x96d770 "tr_TR"
> > (gdb) p locale
> > $10 = 0x7ffff0f6a5b0 "tr_TR.ISO8859-9"
> > (gdb)
> 
> Perhaps this is a bug in Fedora's setlocale that can't handle the Turkish 'I'
> in 'ISO' when CTYPE is Turkish.

Perhaps indeed. Maybe you should try to report it.
It does look like an OS bug in any case.
(fortunately that buildbot is in the "unstable" bunch :-))
msg141196 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-07-26 23:34
Yes, it's a bug. This works:

#include <stdio.h>
#include <locale.h>
int
main(void)
{
    char *s;
    printf("%s\n", setlocale(LC_CTYPE, "tr_TR.ISO8859-9"));
    printf("%s\n", setlocale(LC_CTYPE, NULL));
    s = setlocale(LC_CTYPE, "tr_TR.ISO8859-9");
    printf("%s\n", s ? s : "null");
    return 0;
}

But when I change the first setlocale call to "tr_TR", the result of
the last call is NULL.
msg141262 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-07-27 18:42
I'm seeing this test failure in Gentoo, as well.
msg141322 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-07-28 23:10
Fedora bug report:

https://bugzilla.redhat.com/show_bug.cgi?id=726536
msg141550 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-08-02 09:41
Unrelated to the Fedora issue: The test is currently skipped on the
FreeBSD bot, but completes successfully with:

diff -r 0b52b6f1bfab Lib/test/test_locale.py
--- a/Lib/test/test_locale.py   Tue Aug 02 10:16:45 2011 +0200
+++ b/Lib/test/test_locale.py   Tue Aug 02 11:37:39 2011 +0200
@@ -399,7 +399,7 @@
         oldlocale = locale.setlocale(locale.LC_CTYPE)
         self.addCleanup(locale.setlocale, locale.LC_CTYPE, oldlocale)
         try:
-            locale.setlocale(locale.LC_CTYPE, 'tr_TR')
+            locale.setlocale(locale.LC_CTYPE, 'tr_TR.UTF-8')
         except locale.Error:
             # Unsupported locale on this system
             self.skipTest('test needs Turkish locale')
msg141551 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-08-02 10:21
As I wrote on python-dev, this test also fails on Debian lenny, which has
the same setlocale() bug as Fedora.

So, indeed the test should be skipped on a multitude of platforms.
msg141559 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2011-08-02 11:34
On Tue, 02 Aug 2011 12:12:37 +0200, Stefan Krah <stefan@bytereef.org> wrote:
> I suspect many buildbots are green because they don't have tr_TR and
> tr_TR.iso8859-9 installed.

This is true for my Gentoo buildbots.  Once we've figured out the
best way to handle this, I'll fix that (install the other locales) for
my two.

When I run the C test program I get null as the final output of that
regardless of whether I use 'tr_TR' or 'tr_TR.utf8'.

This is with glibc-2.13-r2 (the r2 is Gentoo's mod number).

As someone pointed out on python-dev, if this isn't fixable then it should be an expected failure, not a skip.

One question is, is there any platform on which the Turkish locale is installed where this test actually works?
msg141561 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-08-02 12:01
[Re-opening to fix the skips]

Yes, the test works on:

  Ubuntu Lucid (libc-2.11.1), OpenSUSE (libc-2.11.1), FreeBSD-8.2


Failure:

  Fedora 14 (libc-2.13), Debian lenny (libc-2.7), Gentoo (libc-2.13-r2)


So perhaps this test should be marked as expected failure on Linux
altogether (unless we test for the libc version).
msg141562 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2011-08-02 12:06
> As someone pointed out on python-dev, if this isn't fixable then it
> should be an expected failure, not a skip.

The Python bug is fixed, the problem is apparently some libcs have the
same bug as we did...

> One question is, is there any platform on which the turkish locale is
> installed where this test actually works?

Well, it works here (Mageia).
msg143954 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2011-09-13 11:39
https://bugzilla.redhat.com/show_bug.cgi?id=726536 claims that the
glibc issue (which is relevant for skipping the test case) is fixed
in glibc-2.14.90-8.

I suspect the only way of running the test case reliably is whitelisting
a couple of known good glibc versions.
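
A whitelist check along these lines could be sketched as follows. This is only an illustration: glibc_known_good() and KNOWN_GOOD_GLIBC are hypothetical names, the single whitelisted version is the one reported as working earlier in this thread, and distribution patches may make the upstream version string an unreliable indicator.

```python
import platform

# Hypothetical whitelist. 2.11.1 is the version reported as working on
# Ubuntu Lucid and OpenSUSE in this thread; extend it as reports come in.
KNOWN_GOOD_GLIBC = {"2.11.1"}


def glibc_known_good(libc_name, libc_version, known_good=KNOWN_GOOD_GLIBC):
    """Return True only for glibc versions explicitly listed as good."""
    return libc_name == "glibc" and libc_version in known_good


# platform.libc_ver() returns a (name, version) pair, with empty strings
# when the C library cannot be determined.
lib, version = platform.libc_ver()
print(lib, version, glibc_known_good(lib, version))
```

The helper only answers for glibc; other C libraries, such as the one on FreeBSD, would need their own rule.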
msg152461 - (view) Author: Roundup Robot (python-dev) Date: 2012-02-02 15:59
New changeset a55ffb6c1993 by Stefan Krah in branch '3.2':
Issue #1813: Revert workaround for a glibc bug on the Fedora buildbot.
http://hg.python.org/cpython/rev/a55ffb6c1993

New changeset 4244e4348362 by Stefan Krah in branch 'default':
Issue #1813: merge changeset that reverts a glibc workaround for the
http://hg.python.org/cpython/rev/4244e4348362

New changeset 0b8917fc6db5 by Stefan Krah in branch '2.7':
Issue #1813: backport changeset that reverts a glibc workaround for the
http://hg.python.org/cpython/rev/0b8917fc6db5
msg152462 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-02-02 16:06
I've upgraded the Fedora buildbot to Fedora-16. The specific glibc
workaround should not be necessary any more.


So the test will now fail again on all systems that a) have the bug
and b) have the tr_TR locale.
History
Date User Action Args
2012-02-04 04:04:36 Arfrever set status: open -> closed
2012-02-02 16:06:37 skrah set messages: + msg152462
2012-02-02 16:00:00 python-dev set messages: + msg152461
2011-09-13 11:39:30 skrah set messages: + msg143954
2011-08-02 12:06:31 pitrou set messages: + msg141562
2011-08-02 12:01:12 skrah set status: closed -> open; messages: + msg141561
2011-08-02 11:34:29 r.david.murray set messages: + msg141559
2011-08-02 10:21:35 skrah set messages: + msg141551
2011-08-02 09:41:45 skrah set messages: + msg141550
2011-07-28 23:10:01 skrah set messages: + msg141322
2011-07-27 18:42:42 r.david.murray set nosy: + r.david.murray; messages: + msg141262
2011-07-26 23:34:59 skrah set messages: + msg141196
2011-07-26 23:02:52 pitrou set messages: + msg141193
2011-07-26 23:01:52 skrah set messages: + msg141191
2011-07-26 22:50:18 skrah set nosy: + skrah; messages: + msg141190
2011-07-24 00:53:04 pitrou set status: open -> closed; versions: + Python 3.3, - Python 2.6, Python 3.1; messages: + msg141030; resolution: fixed; stage: committed/rejected
2011-07-24 00:52:27 python-dev set messages: + msg141029
2011-07-24 00:43:24 python-dev set nosy: + python-dev; messages: + msg141028
2011-07-15 15:35:32 Arfrever set nosy: + Arfrever
2011-07-15 09:14:54 haypo set messages: + msg140399
2011-05-23 20:22:14 gkcn set nosy: + gkcn
2010-10-27 11:27:26 lemburg set messages: + msg119692
2010-10-27 10:30:19 djc set nosy: + djc; messages: + msg119686
2010-07-28 02:16:52 haypo set messages: + msg111765
2010-07-26 12:24:23 BreamoreBoy set nosy: + BreamoreBoy; messages: + msg111605
2010-05-14 12:38:50 pitrou set nosy: + haypo; versions: + Python 3.1, Python 2.7, Python 3.2
2010-04-28 14:12:28 jwilk set nosy: + jwilk
2008-03-20 10:20:48 lemburg set messages: + msg64162
2008-03-19 21:44:49 jafo set priority: normal; assignee: lemburg; messages: + msg64109; keywords: + patch; nosy: + jafo
2008-02-16 22:20:15 lemburg set nosy: + lemburg; messages: + msg62472
2008-02-16 20:04:38 pitrou set versions: + Python 2.6, - Python 2.5
2008-02-16 20:04:33 pitrou set files: + turklocale.patch; messages: + msg62466
2008-02-16 19:58:26 pitrou set messages: + msg62464
2008-02-16 19:34:21 pitrou set messages: + msg62463
2008-02-15 16:36:35 arnimar set messages: + msg62433
2008-02-14 10:52:10 pitrou set nosy: + pitrou; messages: + msg62386
2008-02-13 23:03:06 arnimar set components: + Library (Lib), - Interpreter Core
2008-01-12 15:00:02 arnimar create