Issue1813
Created on 2008-01-12 15:00 by arnimar, last changed 2008-03-20 10:20 by lemburg.
| Files | |||
|---|---|---|
| File name | Uploaded | Description |
| verify_locale.py | arnimar, 2008-01-12 15:00 | Program to verify bug/fix |
| turklocale.patch | pitrou, 2008-02-16 20:04 | |
| Messages | |||
|---|---|---|---|
| msg59821 (view) | Author: Árni Már Jónsson (arnimar) | Date: 2008-01-12 15:00 | |
When switching to a Turkish locale, the codecs registry fails on a codec
lookup which worked before the locale change.
This happens when the codec name contains an uppercase 'I'. Just before
doing a cache lookup, the string is normalized, which includes a call to
<ctype.h>'s tolower. tolower is locale dependent, and the Turkish locale
handles 'I' differently from other locales. Thus, the lookup fails, since
the normalization behaves differently than it did before.
Replacing the tolower() call with this made the lookup work:
/* ASCII-only replacement for tolower(): ignores the current locale. */
int my_tolower(char c)
{
    if ('A' <= c && c <= 'Z')
        c += 32;
    return c;
}
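To illustrate why the replacement above restores the lookup, here is a rough Python sketch of the failure mode. It does not touch the real codecs registry or the C locale: `turkish_style_lower`, `ascii_only_lower`, `lookup`, and the `cache` dictionary are all hypothetical stand-ins, and the Turkish behaviour is emulated by mapping 'I' to the dotless 'ı' (U+0131).

```python
# Hypothetical stand-ins for illustration; none of these names exist in CPython.

def turkish_style_lower(name):
    """Emulate a tolower() that follows Turkish rules: 'I' becomes dotless 'ı'."""
    return "".join("\u0131" if ch == "I" else ch.lower() for ch in name)


def ascii_only_lower(name):
    """Port of my_tolower(): shift only 'A'..'Z', leave everything else alone."""
    return "".join(chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch for ch in name)


def lookup(cache, name, normalize):
    """Normalize the requested name, then consult the cache."""
    return cache.get(normalize(name))


# The cache was filled before the locale change, when 'I' still lowered to 'i'.
cache = {ascii_only_lower("ISO8859-9"): "<codec entry>"}

print(lookup(cache, "ISO8859-9", turkish_style_lower))  # miss: key is 'ıso8859-9'
print(lookup(cache, "ISO8859-9", ascii_only_lower))     # hit: key is 'iso8859-9'
```

The point of the sketch is only that the normalized key must be computed the same way before and after the locale change, which an ASCII-only mapping guarantees.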
PS: If the Turkish locale is not supported, the following will enable it on
an Ubuntu system:
a) sudo cp /usr/share/i18n/SUPPORTED /var/lib/locales/supported.d/local
(or just copy the lines with "tr" in them)
b) sudo dpkg-reconfigure locales
|
|||
| msg62386 (view) | Author: Antoine Pitrou (pitrou) | Date: 2008-02-14 10:52 | |
I can confirm this on SVN trunk on a Mandriva system. |
|||
| msg62433 (view) | Author: Árni Már Jónsson (arnimar) | Date: 2008-02-15 16:36 | |
There is more to this bug than appears. I'm guessing that the name
mangling code in locale (i.e. the normalizing code) is locale dependent.
See this example:
#!/usr/bin/python2.5
import locale
print 'TR', locale.normalize('tr')
print locale.setlocale(locale.LC_ALL, ('tr_TR', 'ISO8859-9'))
# first issue: what comes out is not quite what went in
print locale.getlocale()
# second issue: this call fails
print locale.setlocale(locale.LC_ALL, ('tr_TR', 'ISO8859-9'))
First, the value returned from getlocale is ('tr_TR', 'so8859-9'), not
('tr_TR', 'ISO8859-9'), and the second setlocale fails.
|
|||
| msg62463 (view) | Author: Antoine Pitrou (pitrou) | Date: 2008-02-16 19:34 | |
The C library's tolower() and toupper() are used in a handful of source files. It might make sense to replace some of those calls with ascii-only versions of the corresponding functions.
Modules/_sre.c: return ((ch) < 256 ? (unsigned int)tolower((ch)) : ch);
Modules/_sqlite/cursor.c: *dst++ = tolower(*src++);
Modules/stropmodule.c: *s_new = tolower(c);
Modules/stropmodule.c: *s_new = toupper(c);
Modules/stropmodule.c: *s_new = toupper(c);
Modules/stropmodule.c: *s_new = tolower(c);
Modules/stropmodule.c: *s_new = toupper(c);
Modules/stropmodule.c: *s_new = tolower(c);
Modules/unicodedata.c: h = (h * scale) + (unsigned char) toupper(Py_CHARMASK(s[i]));
Modules/unicodedata.c: if (toupper(Py_CHARMASK(name[i])) != buffer[i])
Modules/_tkinter.c: argv0[0] = tolower(Py_CHARMASK(argv0[0]));
Modules/binascii.c: c = tolower(c);
Objects/stringobject.c: s[i] = _tolower(c);
Objects/stringobject.c: s[i] = _toupper(c);
Objects/stringobject.c: c = toupper(c);
Objects/stringobject.c: c = tolower(c);
Objects/stringobject.c: *s_new = toupper(c);
Objects/stringobject.c: *s_new = tolower(c);
Objects/stringobject.c: *s_new = toupper(c);
Objects/stringobject.c: *s_new = tolower(c);
Parser/tokenizer.c: else buf[i] = tolower(c);
Python/codecs.c: ch = tolower(Py_CHARMASK(ch));
Python/dynload_win.c: first = tolower(*string1);
Python/dynload_win.c: second = tolower(*string2);
Python/pystrcmp.c: while ((--size > 0) && (tolower(*s1) == tolower(*s2))) {
Python/pystrcmp.c: return tolower(*s1) - tolower(*s2);
Python/pystrcmp.c: while (*s1 && (tolower(*s1++) == tolower(*s2++))) {
Python/pystrcmp.c: return (tolower(*s1) - tolower(*s2)); |
|||
| msg62464 (view) | Author: Antoine Pitrou (pitrou) | Date: 2008-02-16 19:58 | |
As for the .upper() and .lower() methods, they are used in quite a bunch of standard library modules :-/...
Lib/base64.py
Lib/BaseHTTPServer.py
Lib/bsddb/test/test_compare.py
Lib/bsddb/test/test_dbobj.py
Lib/CGIHTTPServer.py
Lib/cgi.py
Lib/compiler/ast.py
Lib/ConfigParser.py
Lib/cookielib.py
Lib/Cookie.py
Lib/csv.py
Lib/ctypes/test/test_byteswap.py
Lib/ctypes/util.py
Lib/decimal.py
Lib/distutils/command/bdist_rpm.py
Lib/distutils/command/bdist_wininst.py
Lib/distutils/command/register.py
Lib/distutils/msvc9compiler.py
Lib/distutils/msvccompiler.py
Lib/distutils/sysconfig.py
Lib/distutils/tests/test_dist.py
Lib/distutils/util.py
Lib/email/charset.py
Lib/email/encoders.py
Lib/email/header.py
Lib/email/__init__.py
Lib/email/message.py
Lib/email/_parseaddr.py
Lib/email/test/test_email.py
Lib/email/test/test_email_renamed.py
Lib/encodings/idna.py
Lib/encodings/punycode.py
Lib/formatter.py
Lib/ftplib.py
Lib/gettext.py
Lib/htmllib.py
Lib/HTMLParser.py
Lib/httplib.py
Lib/idlelib/configDialog.py
Lib/idlelib/EditorWindow.py
Lib/idlelib/IOBinding.py
Lib/idlelib/keybindingDialog.py
Lib/idlelib/PyShell.py
Lib/idlelib/SearchDialogBase.py
Lib/idlelib/tabbedpages.py
Lib/idlelib/TreeWidget.py
Lib/imaplib.py
Lib/inspect.py
Lib/lib-tk/turtle.py
Lib/locale.py
Lib/logging/handlers.py
Lib/logging/__init__.py
Lib/_LWPCookieJar.py
Lib/macpath.py
Lib/mailcap.py
Lib/markupbase.py
Lib/mhlib.py
Lib/mimetools.py
Lib/mimetypes.py
Lib/mimify.py
Lib/msilib/__init__.py
Lib/nntplib.py
Lib/ntpath.py
Lib/nturl2path.py
Lib/optparse.py
Lib/os2emxpath.py
Lib/os.py
Lib/pdb.py
Lib/plat-irix5/flp.py
Lib/plat-irix6/flp.py
Lib/plat-mac/buildtools.py
Lib/plat-mac/gensuitemodule.py
Lib/plat-riscos/riscospath.py
Lib/pyclbr.py
Lib/rfc822.py
Lib/robotparser.py
Lib/sgmllib.py
Lib/SimpleHTTPServer.py
Lib/smtpd.py
Lib/smtplib.py
Lib/socket.py
Lib/sqlite3/test/hooks.py
Lib/sre_constants.py
Lib/stringold.py
Lib/stringprep.py
Lib/string.py
Lib/_strptime.py
Lib/subprocess.py
Lib/test/regrtest.py
Lib/test/test_bigmem.py
Lib/test/test_codeccallbacks.py
Lib/test/test_codecs.py
Lib/test/test_cookielib.py
Lib/test/test_datetime.py
Lib/test/test_decimal.py
Lib/test/test_deque.py
Lib/test/test_descr.py
Lib/test/test_fileinput.py
Lib/test/test_grp.py
Lib/test/test_hmac.py
Lib/test/test_httplib.py
Lib/test/test_os.py
Lib/test/test_smtplib.py
Lib/test/test_sort.py
Lib/test/test_ssl.py
Lib/test/test_strop.py
Lib/test/test_strptime.py
Lib/test/test_support.py
Lib/test/test_ucn.py
Lib/test/test_unicodedata.py
Lib/test/test_urllib2.py
Lib/test/test_urllib.py
Lib/test/test_wsgiref.py
Lib/test/test_xmlrpc.py
Lib/urllib2.py
Lib/urllib.py
Lib/urlparse.py
Lib/UserString.py
Lib/uuid.py
Lib/warnings.py
Lib/webbrowser.py
Lib/wsgiref/handlers.py
Lib/wsgiref/headers.py
Lib/wsgiref/simple_server.py
Lib/wsgiref/util.py
Lib/wsgiref/validate.py
Lib/xml/dom/minidom.py
Lib/xml/dom/xmlbuilder.py
Lib/xmllib.py |
|||
| msg62466 (view) | Author: Antoine Pitrou (pitrou) | Date: 2008-02-16 20:04 | |
Even if we don't fix all uses of (?to)(lower|upper) in the source tree, I think it's important that codec and locale lookup work properly when the current locale defines non-Latin case folding for Latin characters. Here is a patch.
Perhaps the str type should also grow ascii_lower() and ascii_upper() methods, since many uses of lower() and upper() actually assume ASCII semantics (e.g. for parsing of HTTP or SMTP headers). |
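As a rough idea of what the proposed ascii_lower() and ascii_upper() could look like, here is a sketch written as free functions in Python 3 syntax. These are hypothetical helpers, not existing str methods and not the contents of turklocale.patch; `same_header_name` is likewise an invented example of the header-parsing use case mentioned above.

```python
# Hypothetical helpers: str has no ascii_lower()/ascii_upper() methods.
import string

# Translation tables that touch only the 26 ASCII letters.
_ASCII_LOWER = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)
_ASCII_UPPER = str.maketrans(string.ascii_lowercase, string.ascii_uppercase)


def ascii_lower(text):
    """Lowercase 'A'..'Z' only; every other character passes through unchanged."""
    return text.translate(_ASCII_LOWER)


def ascii_upper(text):
    """Uppercase 'a'..'z' only; every other character passes through unchanged."""
    return text.translate(_ASCII_UPPER)


def same_header_name(a, b):
    """Compare protocol header names case-insensitively with ASCII semantics."""
    return ascii_lower(a) == ascii_lower(b)


print(same_header_name("Content-ID", "content-id"))
print(ascii_upper("mail from"))
```

Because the tables are fixed at import time, the result cannot change when the process locale does.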
|||
| msg62472 (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-02-16 22:20 | |
I agree that it's a bit unfortunate that the 8-bit string APIs in Python use the locale-aware C functions by default (this should really be reversed: there should be locale-aware .upper() and .lower() methods and the standard ones should work just like the Unicode ones - without dependency on the locale, using ASCII mappings), but for historical reasons this cannot easily be changed. .lower() and .upper() for 8-bit strings were always locale dependent and before the addition of Unicode, setting the locale was the most common way to make an application understand different character sets.
In Python 3k the problem will probably go away, since .lower() and .upper() will then no longer depend on the locale.
Perhaps we should just convert a few of the cases you found to using Unicode strings instead of 8-bit strings in 2.6 ?! That would both make the code more portable and also provide a clear statement of "this is a text string", making porting to Py3k easier. |
|||
| msg64109 (view) | Author: Sean Reifschneider (jafo) | Date: 2008-03-19 21:44 | |
Marc-Andre: How should we proceed with this bug? Discuss on python-dev or c.l.python? |
|||
| msg64162 (view) | Author: Marc-Andre Lemburg (lemburg) | Date: 2008-03-20 10:20 | |
Sean: I'd suggest discussing this on python-dev. Note that even if we do use Unicode for the cases in question, the Turkish locale will still pose a problem - see #1528802 for a discussion. |
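One way the Turkish alphabet can still clash with locale-independent Unicode case mapping is sketched below in Python 3 syntax. This is not necessarily the specific point discussed in #1528802. `turkish_lower` is a hypothetical, simplified helper that handles only the two Turkish 'I' letters before falling back to the default mapping.

```python
# Unicode code points for the two extra Turkish letters.
DOTTED_CAPITAL_I = "\u0130"  # İ
DOTLESS_SMALL_I = "\u0131"   # ı


def turkish_lower(text):
    """Hypothetical helper: apply the Turkish I/ı and İ/i pairing, then lower()."""
    text = text.replace("I", DOTLESS_SMALL_I).replace(DOTTED_CAPITAL_I, "i")
    return text.lower()


# The default mapping pairs 'I' with 'i', whatever the process locale is.
print("I".lower() == "i")

# The dotted capital lowers to 'i' plus a combining dot above: two code points.
print([hex(ord(ch)) for ch in DOTTED_CAPITAL_I.lower()])

# Turkish text therefore needs an explicit, language-aware step.
print(turkish_lower("ISPARTA"))
print(turkish_lower(DOTTED_CAPITAL_I + "STANBUL"))
```

So switching to Unicode keeps lookups of ASCII names stable, but it does not by itself produce correct Turkish casing for user-facing text.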
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2008-03-20 10:20:48 | lemburg | set | messages: + msg64162 |
| 2008-03-19 21:44:49 | jafo | set | priority: normal assignee: lemburg messages: + msg64109 keywords: + patch nosy: + jafo |
| 2008-02-16 22:20:15 | lemburg | set | nosy: + lemburg messages: + msg62472 |
| 2008-02-16 20:04:38 | pitrou | set | versions: + Python 2.6, - Python 2.5 |
| 2008-02-16 20:04:33 | pitrou | set | files: + turklocale.patch messages: + msg62466 |
| 2008-02-16 19:58:26 | pitrou | set | messages: + msg62464 |
| 2008-02-16 19:34:21 | pitrou | set | messages: + msg62463 |
| 2008-02-15 16:36:35 | arnimar | set | messages: + msg62433 |
| 2008-02-14 10:52:10 | pitrou | set | nosy: + pitrou messages: + msg62386 |
| 2008-02-13 23:03:06 | arnimar | set | components: + Library (Lib), - Interpreter Core |
| 2008-01-12 15:00:02 | arnimar | create | |