classification
Title: Python and Turkish Locale
Type:
Stage:
Components: Unicode
Versions:
process
Status: closed
Resolution: duplicate
Dependencies:
Superseder: Turkish Character (View: 1528802)
Assigned To: lemburg
Nosy List: caglar, exa, georg.brandl, lemburg, usta
Priority: high
Keywords:

Created on 2005-04-30 17:37 by caglar, last changed 2007-08-30 10:14 by georg.brandl. This issue is now closed.

Messages (6)
msg25185 - (view) Author: S.Çağlar Onur (caglar) Date: 2005-04-30 17:37
On behalf of this thread:

http://mail.python.org/pipermail/python-dev/2005-April/052968.html

As described in
http://www.i18nguy.com/unicode/turkish-i18n.html [How
Applications Fail With Turkish Language], Turkish has four
"i" letters in its alphabet: dotted and dotless, each in
lower and upper case.

Without --with-wctype-functions support, Python converts
these characters in a locale-independent manner, even in
the tr_TR.UTF-8 locale. So all conversions map to "i" or
"I", which is wrong in the Turkish locale.

So if the Python developers remove the wctype
functions from Python, then there must be a
locale-dependent upper/lower function to handle these
characters properly.
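
As a rough illustration of the behaviour described above, the following sketch prints the default case mappings for the four Turkish "i" letters. It assumes a modern Python 3 interpreter (which postdates this report), where the str methods always use the built-in Unicode database and ignore the process locale. The variable names are only for illustration.

```python
# Assumption: Python 3.3+, where str.upper()/str.lower() are locale independent.
samples = ["i", "I", "\u0131", "\u0130"]  # i, I, dotless ı, dotted İ

for ch in samples:
    # Show each letter with its default (non-Turkic) upper and lower mapping.
    print(f"U+{ord(ch):04X} {ch!r}: upper={ch.upper()!r} lower={ch.lower()!r}")

# Turkish pairs i with İ and ı with I; the default mappings pair i with I.
default_upper_of_i = "i".upper()
default_lower_of_I = "I".lower()
default_upper_of_dotless = "\u0131".upper()
```

In Turkish text the expected results would be "İ" for the first value and "ı" for the second, which is the mismatch this report is about.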
msg25186 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2005-05-02 08:00
I'm not sure I understand: are you saying that the Unicode
mappings for upper and lower case are wrong in the standard ?

Note that removing the wctype functions will only remove the
possibility to use these functions for case mapping of
Unicode characters instead of using the builtin Unicode
character database. This was originally meant as an
optimization to avoid having to load the Unicode database -
nowadays the database is always included, so the
optimization is no longer needed. Even worse: the wctype
functions sometimes behave differently than the mappings in
the Unicode database (due to differences in the Unicode
database version or implementations).

Now, since the string .lower() and .upper() methods are
locale dependent (due to their reliance on the C functions
toupper() and tolower() - not by intent), while the Unicode
versions are not, we have a rather annoying situation where
switching from strings to Unicode causes semantic differences.

Ideally, both string and Unicode methods should do case
mapping in a locale-independent way. The support for
differences in locale-dependent case mapping, collation,
etc. should be moved to an external module, e.g. the locale
module.
msg25187 - (view) Author: S.Çağlar Onur (caglar) Date: 2005-05-02 08:45
No, I'm not. These rules are defined in
http://www.unicode.org/Public/UNIDATA/CaseFolding.txt and
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt.
Note that there is a comment that says:

# T: special case for uppercase I and dotted uppercase I
#    - For non-Turkic languages, this mapping is normally not used.
#    - For Turkic languages (tr, az), this mapping can be used instead of the normal mapping for these characters.
#      Note that the Turkic mappings do not maintain canonical equivalence without additional processing.
#      See the discussions of case mapping in the Unicode Standard for more information.

So without wctype function support, Python can't convert
these. This _is_ the problem. As a side effect, another
huge problem occurs: keywords can't be locale dependent.
If Python is compiled with wctype function support, every
"i".upper() turns into "İ", which is wrong for keyword
comparison (like quit vs. QUİT).

So I suggest implementing two new functions like
localeAwareLower()/localeAwareUpper() for Python and leaving
lower()/upper() locale independent. And as you wrote, the
locale module may be a perfect home for these :)

msg25188 - (view) Author: Eray Ozkural (exa) Date: 2005-10-11 21:36
The better solution is to use an optional locale argument for 
upper/lower functions and other language-dependent text 
processing functions. 
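
One possible shape for such an optional argument is sketched below for a modern Python 3. The functions upper() and lower(), the locale_name parameter, and the set of language codes are all assumptions made for illustration; nothing like this exists in the standard library, and only the Turkic dotted/dotless "i" rule is handled.

```python
from typing import Optional

# Assumption: only Turkish and Azerbaijani need the special "i" handling here.
TURKIC_LANGUAGES = {"tr", "az"}

# Applied before the default mapping so the Turkic pairs win.
_TURKIC_PRE_UPPER = str.maketrans({"i": "\u0130"})                 # i -> İ
_TURKIC_PRE_LOWER = str.maketrans({"I": "\u0131", "\u0130": "i"})  # I -> ı, İ -> i


def _is_turkic(locale_name: Optional[str]) -> bool:
    """Return True for names such as 'tr', 'tr_TR' or 'tr_TR.UTF-8'."""
    if not locale_name:
        return False
    language = locale_name.replace("-", "_").split("_")[0].lower()
    return language in TURKIC_LANGUAGES


def upper(text: str, locale_name: Optional[str] = None) -> str:
    """Hypothetical locale-aware upper(); default stays locale independent."""
    if _is_turkic(locale_name):
        text = text.translate(_TURKIC_PRE_UPPER)
    return text.upper()


def lower(text: str, locale_name: Optional[str] = None) -> str:
    """Hypothetical locale-aware lower(); default stays locale independent."""
    if _is_turkic(locale_name):
        text = text.translate(_TURKIC_PRE_LOWER)
    return text.lower()


print(upper("quit"), upper("quit", "tr_TR.UTF-8"))
print(lower("ISPARTA"), lower("ISPARTA", "tr"))
```

Passing the locale explicitly avoids depending on global process state such as setlocale(), so the same program can handle Turkish and non-Turkish text side by side.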
 
msg25189 - (view) Author: Ömer FADIL USTA (usta) Date: 2006-09-30 15:58
http://img147.imageshack.us/img147/3717/pythonte4.jpg
I think this photo summarizes the bug, which is related to
upper() in Turkish encoding.
msg55471 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-08-30 10:14
Dupe of #1528802.
History
Date                 User          Action  Args
2007-08-30 10:14:40  georg.brandl  set     status: open -> closed
                                           resolution: duplicate
                                           superseder: Turkish Character
                                           messages: + msg55471
                                           nosy: + georg.brandl
2005-04-30 17:37:22  caglar        create