Issue 1528802: Turkish Character

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/43722

classification

Title:	Turkish Character
Type:		Stage:
Components:	Unicode	Versions:	Python 2.5

process

Status:	closed	Resolution:	not a bug
Dependencies:		Superseder:
Assigned To:		Nosy List:	ahmetbiskinler, donmez, georg.brandl, lemburg, loewis, sgala
Priority:	high	Keywords:

Created on 2006-07-26 07:05 by ahmetbiskinler, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Messages (20)
msg29283 - (view)	Author: Ahmet Bişkinler (ahmetbiskinler)	Date: 2006-07-26 07:05
>>> print "Mayıs".upper()
MAYıS
>>> import locale
>>> locale.setlocale(locale.LC_ALL,'Turkish_Turkey.1254')
>>> print "Mayıs".upper()
MAYıS
>>> print "ğüşiöçı".upper()
ğüşIöçı

MAYıS should be MAYIS
ğüşIöçı should be ĞÜŞİÖÇI

but

>>> "Mayıs".upper()
"MAYIS"

is right
msg29284 - (view)	Author: Ahmet Bişkinler (ahmetbiskinler)	Date: 2006-08-11 08:10
What happened? Has it been solved? How is it going, and what is the final step? Could you please give me some information about the bug?
msg29285 - (view)	Author: Santiago Gala (sgala)	Date: 2006-08-17 14:53
The behaviour of python in this area is confusing. See a session with my Spanish keyboard:

>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
á
>>> print len(u"á")
1
>>> print u"á".upper()
Á
>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

I guess this is what is happening to the reporter. This violates the least surprising behavior principle in so many different ways that it hurts. Can anybody make sense of it?
msg29286 - (view)	Author: Santiago Gala (sgala)	Date: 2006-08-17 14:59
(I tested it in 2.5rc1.) 2.4 gives

>>> str(u"á")
'\xc3\xa1'

instead of the exception.
msg29287 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2006-08-17 15:03
sgala: it looks like your console sends UTF-8 encoded text.

>>> print "á"
á

print is just printing out a byte string consisting of two bytes, which your console displays as accent-a.

>>> print len("á")
2

A UTF-8-encoded string containing an accented a has two bytes.

>>> print "á".upper()
á

str.upper() doesn't take locale into account, so the accented a has no uppercase version defined.

>>> str("á")
'\xc3\xa1'

str() applied to a byte string returns that byte string. Since return values from functions are printed by the interactive interpreter using repr() first, you get this representation (which you could also get from "print repr('á')").

>>> print u"á"
á

That's also okay. Python knows the terminal encoding and properly translates the byte string to a unicode string of one character. On printout, it converts it to a UTF-8 string again, which your terminal displays correctly.

>>> print len(u"á")
1

Since your two-byte UTF-8 sequence is converted to a unicode character, the length of this unicode string is 1.

>>> print u"á".upper()
Á

There are comprehensive capitalization tables available for unicode.

>>> str(u"á")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
__builtin__.UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

Applying str() to a unicode string must convert it to a byte string. If you don't specify an encoding, the default encoding is "ascii", which can't encode the accented a. Use u"á".encode("utf-8").
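The byte-string versus Unicode distinction explained in this message can be sketched in Python 3, where `bytes` roughly plays the role of the Python 2 `str` type and `str` plays the role of `unicode`. This is an illustrative translation under that assumption, not the session from the report; the names `text` and `raw` are made up for the example.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
text = "\u00e1"                  # LATIN SMALL LETTER A WITH ACUTE
raw = text.encode("utf-8")       # the two bytes a UTF-8 console would send

print(len(raw))                  # number of bytes
print(len(text))                 # number of code points
print(repr(raw))                 # escaped form of the byte string
print(raw.upper() == raw)        # bytes.upper() only maps ASCII letters
print(text.upper() == "\u00c1")  # str.upper() uses the Unicode database

# Encoding to ASCII fails for the same reason str(u"...") failed above.
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print("ascii cannot encode it:", exc.reason)
```

The point of the sketch is only that length and case conversion give different answers depending on whether the data is held as encoded bytes or as decoded text.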
msg29288 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-08-17 15:04
String upper and lower conversion are locale dependent and implemented by the underlying libc, whereas Unicode upper/lower conversion is not and only depends on the Unicode character database. OTOH, there are special cases where the standard Unicode upper/lower mapping is not what you might expect, since the database only provides a single mapping and is not context aware. There's nothing we can do if the libc is broken in some respect. As for the extended case mapping support in Unicode: patches are welcome.
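The locale-independent nature of the Unicode mappings can be seen in a small sketch. It assumes Python 3 (whose `str` is the Unicode type discussed here), so details may differ from the Python 2 versions in this report; the variable names are illustrative only.

```python
import unicodedata

dotted_small = "i"          # U+0069
dotless_small = "\u0131"    # LATIN SMALL LETTER DOTLESS I
dotted_capital = "\u0130"   # LATIN CAPITAL LETTER I WITH DOT ABOVE

# Both lowercase letters map to the same plain capital I,
# whatever locale is active.
for ch in (dotted_small, dotless_small):
    up = ch.upper()
    print(unicodedata.name(ch), "->", unicodedata.name(up))

# Lowercasing the dotted capital does not yield a plain "i" in Python 3;
# it yields "i" followed by a combining dot above.
lowered = dotted_capital.lower()
print([unicodedata.name(c) for c in lowered])
```

Neither result matches Turkish expectations, which is the "single mapping, not context aware" limitation described in the message.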
msg29289 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2006-08-17 15:08
Using Unicode strings, the OP's example works.
msg29290 - (view)	Author: Santiago Gala (sgala)	Date: 2006-08-17 18:58
Idle from 2.5rc1 (svn today) produces a different result than console (with my default, utf-8, encoding):

IDLE 1.2c1
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
Ã¡
>>> print len(u"á")
2
>>> print u"á".upper()
Ã¡
>>> str(u"á")
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    str(u"á")
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

Again, IDLE 1.1.3 (python 2.4.3) produces a different result:

IDLE 1.1.3
>>> print "á"
á
>>> print len("á")
2
>>> print "á".upper()
á
>>> str("á")
'\xc3\xa1'
>>> print u"á"
Ã¡
>>> print len(u"á")
2
>>> print u"á".upper()
Ã¡
>>> str(u"á")
'\xc3\x83\xc2\xa1'
>>>

I'd say idle is broken, as it is not able to respect utf-8 for print (or even len) of unicode strings. OTOH, with some tricks I can manage to get an accented a in a unicode in idle:

>>> import unicodedata
>>> print unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE")
á
>>> print len(unicodedata.lookup("LATIN SMALL LETTER A WITH ACUTE"))
1
msg29291 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2006-08-17 19:08
Please submit that as a separate IDLE bug.
msg29292 - (view)	Author: Santiago Gala (sgala)	Date: 2006-08-18 14:37
Done: Bug #1542677
msg29293 - (view)	Author: Ahmet Bişkinler (ahmetbiskinler)	Date: 2006-08-21 07:55
There are still some problems with it, as in the image: http://img205.imageshack.us/img205/3998/turkishcharpythonyu5.jpg

upper() works fine in IDLE (except for the uppercase of ı and i), whereas in the interpreter upper() doesn't even work.

Another problem is with the uppercase of ı (dotless) and i (dotted):

ı (dotless) should be I (dotless)
i (dotted) should be İ (dotted)

ı = I
i = İ

For more information: http://www.i18nguy.com/unicode/turkish-i18n.html
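The Turkish rules stated in this message (ı to I, i to İ) can be applied by hand before falling back to the default mapping. The following is a minimal sketch in Python 3; `turkish_upper` and `turkish_lower` are hypothetical helper names, not functions from Python or any library, and the sketch ignores every other locale-specific rule.

```python
DOTLESS_SMALL = "\u0131"    # ı
DOTTED_CAPITAL = "\u0130"   # İ

def turkish_upper(text):
    """Uppercase text using the Turkish i/ı pairing (hypothetical helper)."""
    # Map the two Turkish-specific letters first, then use the default rules.
    return text.replace("i", DOTTED_CAPITAL).replace(DOTLESS_SMALL, "I").upper()

def turkish_lower(text):
    """Lowercase text using the Turkish I/İ pairing (hypothetical helper)."""
    return text.replace(DOTTED_CAPITAL, "i").replace("I", DOTLESS_SMALL).lower()

print(turkish_upper("May\u0131s"))
print(turkish_lower("MAYIS"))
```

Doing the replacements before calling `upper()` or `lower()` matters: once the default mapping has run, dotted and dotless letters can no longer be told apart.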
msg29294 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-08-21 10:01
Could we please get some things straight first:

1. if you're working with IDLE and it doesn't do what you expect it to, please file an IDLE bug report, not a Python one; the same is true for any other Python IDE you are using
2. string's .lower() and .upper() methods rely 100% on the platform's C lib implementation of these functions; there's nothing Python can do about bugs in these implementations
3. if you want reproducible behavior across platforms, please always use Unicode, not 8-bit strings, for text data.

I see that #1 has already been done, so the IDLE specific discussion should continue there. If #2 is the cause of the problem, then all we can do is point you to #3. If #3 fails for some reason, then we should investigate this. However, be aware that the Unicode database has a fixed set of case mappings and we currently don't support extended case mapping which is locale and context sensitive. Again, patches are welcome.

Please provide your examples using the repr() of the string or Unicode objects in question. This makes it a lot easier to test your examples on other platforms. Thanks.
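The request for repr() output can be illustrated with a short sketch. It assumes Python 3, where `ascii()` produces the escaped form that `repr()` gave for non-ASCII text in Python 2; the variable `word` is just an example value.

```python
word = "May\u0131s"                  # "Mayıs", written with an escape

# Escaped forms survive copy and paste through any console or tracker.
print(ascii(word))
print(ascii(word.upper()))

# The same text as bytes in two encodings relevant to this report.
print(repr(word.encode("utf-8")))
print(repr(word.encode("cp1254")))   # Windows Turkish code page
```

Posting output in this form removes the guesswork about which characters and which bytes were actually involved.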
msg29295 - (view)	Author: Ahmet Bişkinler (ahmetbiskinler)	Date: 2006-08-28 13:57
As you saw in the picture, IDLE does its work. It is the one that is working right. The Python interpreter (C:\Python25\Python.exe) has the problem. Does the interpreter generate bug reports if there is no crash or anything else? And I don't know how to file an IDLE bug report from the interpreter (C:\Python25\Python.exe).
msg29296 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2006-08-29 17:43
Could you test this with Unicode strings, i.e. u"...".upper()? It would also help if you'd provide the repr() version of the strings; that makes testing on non-Turkish systems easier. Thanks.
msg55347 - (view)	Author: Ismail Donmez (donmez) *	Date: 2007-08-28 01:58
This works fine with python 2.4:

>>> import locale
>>> locale.setlocale(locale.LC_ALL,"tr_TR.UTF-8")
'tr_TR.UTF-8'
>>> print u"Mayıs".upper()
MAYIS
msg55472 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2007-08-30 10:16
If I'm not mistaken, "i".upper() will never be LATIN CAPITAL LETTER I WITH DOT ABOVE, regardless of the locale?
msg55476 - (view)	Author: Ismail Donmez (donmez) *	Date: 2007-08-30 11:46
@Georg, "i".upper() WILL be I-with-a-dot-above in Turkish.
msg55478 - (view)	Author: Marc-Andre Lemburg (lemburg) *	Date: 2007-08-30 13:21
Unassigning this. Unless someone provides a patch to add context sensitivity to the Unicode upper/lower conversions, I don't think anything will change. The mapping you see in Python (for Unicode) is taken straight from the Unicode database and there's nothing we can or want to do to change those predefined mappings. The 8-bit string mappings OTOH are taken from the underlying C library - again nothing we can change.
msg55479 - (view)	Author: Ismail Donmez (donmez) *	Date: 2007-08-30 13:43
There is no need to unassign this, the bug is invalid.
msg55501 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2007-08-30 18:46
I agree with cartman: Python behaves as designed in all cases discussed here. Closing this report as invalid.

History
Date	User	Action	Args
2022-04-11 14:56:19	admin	set	github: 43722
2007-08-30 19:03:07	georg.brandl	set	status: open -> closed resolution: not a bug
2007-08-30 18:46:31	loewis	set	nosy: + loewis messages: + msg55501
2007-08-30 13:43:41	donmez	set	messages: + msg55479
2007-08-30 13:21:54	lemburg	set	assignee: lemburg -> messages: + msg55478
2007-08-30 11:46:01	donmez	set	messages: + msg55476
2007-08-30 10:16:38	georg.brandl	set	messages: + msg55472
2007-08-30 10:14:40	georg.brandl	link	issue1193061 superseder
2007-08-28 01:58:09	donmez	set	nosy: + donmez messages: + msg55347
2006-07-26 07:05:07	ahmetbiskinler	create