classification
Title: Turkish Character
Type:
Stage:
Components: Unicode
Versions: Python 2.5
process
Status: closed
Resolution: not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List: ahmetbiskinler, cartman, georg.brandl, lemburg, loewis, sgala
Priority: high
Keywords:

Created on 2006-07-26 07:05 by ahmetbiskinler, last changed 2007-08-30 19:03 by georg.brandl. This issue is now closed.

Messages (20)
msg29283 - (view) Author: Ahmet Bişkinler (ahmetbiskinler) Date: 2006-07-26 07:05
>>> print "Mayıs".upper()
MAYıS
>>> import locale
>>> locale.setlocale(locale.LC_ALL,'Turkish_Turkey.1254')
>>> print "Mayıs".upper()
MAYıS

>>> print "ğüşiöçı".upper()
ğüşIöçı


MAYıS     should be MAYIS
ğüşIöçı   should be ĞÜŞİÖÇI

However,
>>> "Mayıs".upper()
"MAYIS"

is right.



msg29284 - (view) Author: Ahmet Bişkinler (ahmetbiskinler) Date: 2006-08-11 08:10

What happened?
Is it solved?
How is it going?
What is the final step?
...?
...?

Could you please give me some information about the status of this bug?
msg29285 - (view) Author: Santiago Gala (sgala) Date: 2006-08-17 14:53

The behaviour of Python in this area is confusing. Here is a
session with my Spanish keyboard:

>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
á
>>> print len(u"á")
1
>>> print u"á".upper()
Á
>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)


I guess this is what is happening to the reporter.

This violates the principle of least surprise in so
many different ways that it hurts. Can anybody make sense of it?
msg29286 - (view) Author: Santiago Gala (sgala) Date: 2006-08-17 14:59

(I tested this in 2.5rc1.) Python 2.4 gives

>>> str(u"á")
'\xc3\xa1'

instead of the exception.
msg29287 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-08-17 15:03

sgala: it looks like your console sends UTF-8 encoded text.

>>> print "á"
á

print is just printing out a byte string consisting of two
bytes, which your console displays as accent-a.

>>> print len("á")
2

A UTF-8-encoded string containing an accented a has two bytes.

>>> print "á".upper()
á

str.upper() doesn't take locale into account, so the
accented a has no uppercase version defined.

>>> str("á")
'\xc3\xa1'

str() applied to a byte string returns that byte string.
Since return values from functions are printed by the
interactive interpreter using repr() first, you get this
representation (which you could also get from
"print repr('á')").

>>> print u"á"
á

That's also okay. Python knows the terminal encoding and
properly translates the byte string to a unicode string of
one character. On printout, it converts it to a UTF-8 string
again, which your terminal displays correctly.

>>> print len(u"á")
1

Since your two-byte-UTF-8 sequence is converted to a unicode
character, the length of this unicode string is 1.

>>> print u"á".upper()
Á

There are comprehensive capitalization tables available for
unicode.

>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

Applying str() to a unicode string must convert it to a byte
string. If you don't specify an encoding, the default
encoding is "ascii", which can't encode the accented a. Use
u"á".encode("utf-8") instead.
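The byte-string versus Unicode distinction walked through above can be reproduced in a short sketch. It is written in Python 3 syntax, which is an assumption on my part (the thread itself uses Python 2): there, `bytes` plays the role of the old 8-bit `str`, and `str` plays the role of `unicode`.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
text = "\u00e1"                 # LATIN SMALL LETTER A WITH ACUTE
raw = text.encode("utf-8")      # the two bytes a UTF-8 console would send

byte_length = len(raw)          # counts bytes
char_length = len(text)         # counts code points

# bytes.upper() only changes ASCII letters, so the UTF-8 bytes stay as they are.
raw_upper = raw.upper()
# str.upper() uses the Unicode case mapping tables.
text_upper = text.upper()

# Encoding with the "ascii" codec fails for the accented character.
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# ascii() keeps the output printable on any console encoding.
print(byte_length, char_length, ascii(raw_upper), ascii(text_upper), ascii_failed)
```

The explicit `encode("utf-8")` call is the step that the implicit ASCII default cannot do for you.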
msg29288 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-08-17 15:04

String upper and lower conversion are locale dependent and
implemented by the underlying libc, whereas Unicode
upper/lower conversion is not and only depends on the
Unicode character database.

OTOH, there are special cases where the standard Unicode
upper/lower mapping is not what you might expect, since the
database only provides a single mapping and is not context
aware.

There's nothing we can do if the libc is broken in some
respect. As for the extended case mapping support in
Unicode: patches are welcome.
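To illustrate the single, context-free mapping mentioned above, here is a small sketch in Python 3 syntax (an assumption; the thread uses Python 2). The helper name `describe_upper` is hypothetical and only exists for this example. It reports the Unicode names of a character and of whatever `upper()` maps it to.

```python
import unicodedata

def describe_upper(ch):
    """Return the Unicode name of ch and the names of the characters in ch.upper()."""
    upper = ch.upper()
    return unicodedata.name(ch), [unicodedata.name(c) for c in upper]

# Both the dotted and the dotless small i map to the same capital letter.
dotted = describe_upper("i")
dotless = describe_upper("\u0131")

print(dotted)
print(dotless)
```

Both results name LATIN CAPITAL LETTER I as the uppercase form, which is why the default mapping cannot express the Turkish convention.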
msg29289 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-08-17 15:08

Using Unicode strings, the OP's example works.
msg29290 - (view) Author: Santiago Gala (sgala) Date: 2006-08-17 18:58

IDLE from 2.5rc1 (svn as of today) produces a different result
than the console (with my default encoding, UTF-8):

IDLE 1.2c1      
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
á
>>> print len(u"á")
2
>>> print u"á".upper()
á
>>> str(u"á")

Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    str(u"á")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Again, IDLE 1.1.3 (python 2.4.3) produces a different result:

IDLE 1.1.3      
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
á
>>> print len(u"á")
2
>>> print u"á".upper()
á
>>> str(u"á")
'\xc3\x83\xc2\xa1'
>>> 


I'd say IDLE is broken, as it does not respect UTF-8
for print (or even len) of unicode strings.

OTOH, with some tricks I can manage to get an accented a into
a unicode string in IDLE:

>>> import unicodedata
>>> print unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE")
á
>>> print len(unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE"))
1

msg29291 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2006-08-17 19:08

Please submit that as a separate IDLE bug.
msg29292 - (view) Author: Santiago Gala (sgala) Date: 2006-08-18 14:37

Done: Bug #1542677
msg29293 - (view) Author: Ahmet Bişkinler (ahmetbiskinler) Date: 2006-08-21 07:55

There are still some problems with it. As in the image.
http://img205.imageshack.us/img205/3998/turkishcharpythonyu5.jpg
upper() works fine in IDLE (except for uppercasing ı and i),
whereas in the plain interpreter upper() doesn't work at all.

Another problem is the uppercase of ı (dotless) and i (dotted).
ı (dotless) should become I (dotless)
i (dotted)  should become İ (dotted)
ı -> I
i -> İ

For more information:
http://www.i18nguy.com/unicode/turkish-i18n.html
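The Turkish rule described above (ı becomes I, i becomes İ) can be sketched as a small workaround in Python 3 syntax (an assumption; the thread uses Python 2). `turkish_upper` is a hypothetical helper, not something Python provides: it applies the two Turkish-specific mappings first and leaves every other character to the default Unicode mapping.

```python
# Hypothetical helper, not part of Python or its standard library.
# Translate the two Turkish-specific letters before the generic upper().
_TURKISH_UPPER = str.maketrans({
    "i": "\u0130",      # i (dotted)  -> LATIN CAPITAL LETTER I WITH DOT ABOVE
    "\u0131": "I",      # dotless i   -> LATIN CAPITAL LETTER I
})

def turkish_upper(text):
    """Uppercase text using the Turkish convention for dotted and dotless i."""
    return text.translate(_TURKISH_UPPER).upper()

# ascii() keeps the output printable on any console encoding.
print(ascii(turkish_upper("May\u0131s")))
```

A helper like this is only appropriate when the text is known to be Turkish; applied to other languages it would turn every "i" into "İ".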
msg29294 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-08-21 10:01

Could we please get some things straight first:

1. if you're working with IDLE and it doesn't do what you
expect it to, please file an IDLE bug report, not a Python
one; the same is true for any other Python IDE you are using

2. the string .lower() and .upper() methods rely 100% on the
platform's C lib implementation of these functions; there's
nothing Python can do about bugs in these implementations

3. if you want reproducible behavior across platforms,
please always use Unicode, *not* 8-bit strings, for text data.

I see that #1 has already been done, so the IDLE specific
discussion should continue there.

If #2 is the cause of the problem, then all we can do is point
you to #3.

If #3 fails for some reason, then we should investigate
this. However, be aware that the Unicode database has a
fixed set of case mappings and we currently don't support
extended case mapping which is locale and context sensitive.
Again, patches are welcome.

Please provide your examples using the repr() of the string
or Unicode objects in question. This makes it a lot easier
to test your examples on other platforms.

Thanks.
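As a sketch of what such an encoding-independent example could look like, the snippet below prints the strings from the original report in escaped form. It uses Python 3 syntax, which is an assumption (the thread uses Python 2, where repr() of a unicode object gives a similar escaped form); in Python 3 the built-in ascii() plays that role.

```python
# The two strings from the original report, written with escapes so the
# example does not depend on the source file or console encoding.
samples = [
    "May\u0131s",
    "\u011f\u00fc\u015fi\u00f6\u00e7\u0131",
]

# ascii() returns a repr() in which every non-ASCII character is escaped.
escaped = [(ascii(s), ascii(s.upper())) for s in samples]

for before, after in escaped:
    print(before, "->", after)
```

Output in this form can be pasted into a report and checked on a system without Turkish fonts or locale support.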
msg29295 - (view) Author: Ahmet Bişkinler (ahmetbiskinler) Date: 2006-08-28 13:57

As you saw in the picture, IDLE does its job. It is the
one that is working correctly.
The Python interpreter (C:\Python25\Python.exe) is the one with
the problem. Does the interpreter generate bug reports
if there is no crash or the like? And I don't know how to
file an IDLE bug report from the
interpreter (C:\Python25\Python.exe).
msg29296 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2006-08-29 17:43

Could you test this with Unicode strings, i.e. u"...".upper()?

It would also help if you'd provide the repr()-version of
the strings - makes testing on non-Turkish systems easier.

Thanks.
msg55347 - (view) Author: Ismail Donmez (cartman) Date: 2007-08-28 01:58
This works fine with Python 2.4:

>>> import locale
>>> locale.setlocale(locale.LC_ALL,"tr_TR.UTF-8")
'tr_TR.UTF-8'
>>> print u"Mayıs".upper()
MAYIS
msg55472 - (view) Author: Georg Brandl (georg.brandl) * (Python committer) Date: 2007-08-30 10:16
If I'm not mistaken, "i".upper() will never be LATIN CAPITAL LETTER I
WITH DOT ABOVE, regardless of the locale?
msg55476 - (view) Author: Ismail Donmez (cartman) Date: 2007-08-30 11:46
@Georg,

"i".upper() WILL be I-with-a-dot-above in Turkish.
msg55478 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2007-08-30 13:21
Unassigning this.

Unless someone provides a patch to add context sensitivity to the
Unicode upper/lower conversions, I don't think anything will change.

The mapping you see in Python (for Unicode) is taken straight from the
Unicode database and there's nothing we can or want to do to change
those predefined mappings.

The 8-bit string mappings OTOH are taken from the underlying C library -
again nothing we can change.
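The point that the Unicode mapping comes from the database rather than from the locale can be checked with a sketch like the following. It assumes Python 3 syntax (the thread uses Python 2), and `upper_under_locale` is a hypothetical helper written for this example. Locale names such as "tr_TR.UTF-8" are platform dependent, so the helper returns None when the locale is not installed.

```python
import locale

def upper_under_locale(text, locale_name):
    """Return text.upper() evaluated while LC_CTYPE is set to locale_name.

    Returns None if the locale is not available on this system.
    """
    previous = locale.setlocale(locale.LC_CTYPE)   # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None
    try:
        return text.upper()
    finally:
        locale.setlocale(locale.LC_CTYPE, previous)  # always restore

# ascii() keeps the output printable on any console encoding.
print(ascii(upper_under_locale("i", "C")))
print(ascii(upper_under_locale("i", "tr_TR.UTF-8")))
```

Where the Turkish locale is installed, both calls should return a plain "I", since the Unicode string method does not consult the locale.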
msg55479 - (view) Author: Ismail Donmez (cartman) Date: 2007-08-30 13:43
There is no need to unassign this, the bug is invalid.
msg55501 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-08-30 18:46
I agree with cartman: Python behaves as designed in all cases discussed
here. Closing this report as invalid.
History
Date                 User            Action  Args
2007-08-30 19:03:07  georg.brandl    set     status: open -> closed
                                             resolution: not a bug
2007-08-30 18:46:31  loewis          set     nosy: + loewis
                                             messages: + msg55501
2007-08-30 13:43:41  cartman         set     messages: + msg55479
2007-08-30 13:21:54  lemburg         set     assignee: lemburg ->
                                             messages: + msg55478
2007-08-30 11:46:01  cartman         set     messages: + msg55476
2007-08-30 10:16:38  georg.brandl    set     messages: + msg55472
2007-08-30 10:14:40  georg.brandl    link    issue1193061 superseder
2007-08-28 01:58:09  cartman         set     nosy: + cartman
                                             messages: + msg55347
2006-07-26 07:05:07  ahmetbiskinler  create