Issue 14587: Certain diacritical marks can and should be capitalized... e.g. ü --> Ü



This issue has been migrated to GitHub: https://github.com/python/cpython/issues/58792

classification

Title:	Certain diacritical marks can and should be capitalized... e.g. ü --> Ü
Type:
Stage:	resolved
Components:	Unicode
Versions:	Python 2.7

process

Status:	closed
Resolution:	not a bug
Dependencies:
Superseder:
Assigned To:
Nosy List:	Christian.Clauss, ezio.melotti, loewis, r.david.murray, vstinner
Priority:	normal
Keywords:

Created on 2012-04-15 14:17 by Christian.Clauss, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (6)
msg158332 - (view)	Author: Christian Clauss (Christian.Clauss) *	Date: 2012-04-15 14:17
BUGS: certain diacritical marks can and should be capitalized...

str.upper() does not .replace('à', 'À').replace('ä', 'Ä').replace('è', 'È').replace('é', 'É').replace('ö', 'Ö').replace('ü', 'Ü'), etc.
str.lower() does not .replace('À', 'à').replace('Ä', 'ä').replace('È', 'è').replace('É', 'é').replace('Ö', 'ö').replace('Ü', 'ü'), etc.
str.title() has the same problems plus it capitalizes the letter _after_ a diacritic, e.g. 'lüsai'.title() --> 'LÜSai' with a capital 'S'.

myUpper(), myLower(), myTitle() exhibit the correct behavior with a handful of diacritic marks.

def myUpper(inString):
    return inString.upper().replace('à', 'À').replace('ä', 'Ä').replace('è', 'È').replace('é', 'É').replace('ö', 'Ö').replace('ü', 'Ü')

def myLower(inString):
    return inString.lower().replace('À', 'à').replace('Ä', 'ä').replace('È', 'è').replace('É', 'é').replace('Ö', 'ö').replace('Ü', 'ü')

def myTitle(inString):  # WARNING: converts all whitespace to a single space
    returnValue = []
    for theWord in inString.split():
        returnValue.append(myUpper(theWord[:1]) + myLower(theWord[1:]))
    return ' '.join(returnValue)
msg158336 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-04-15 14:43
It works fine if you use unicode.
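As an illustration of what "use unicode" means here, the following is a minimal Python 3 sketch, not code from this thread. It assumes that Python 3 `bytes` is a fair stand-in for the Python 2 byte string used in the report, and the variable names are made up for the example. It contrasts case conversion on UTF-8 encoded bytes with the same conversion after decoding to text.

```python
# Python 3 sketch; bytes plays the role of a Python 2 byte string (assumption).
raw = 'lüsai'.encode('utf-8')    # b'l\xc3\xbcsai': 'ü' is two non-ASCII bytes

# Byte-level case methods only know about ASCII letters.
bytes_upper = raw.upper()        # the two bytes of 'ü' are left untouched
bytes_title = raw.title()        # the non-ASCII bytes act as a word break

# "Use unicode": decode to text first, then convert case.
text = raw.decode('utf-8')
text_upper = text.upper()        # 'ü' is uppercased as a letter
text_title = text.title()        # 'ü' no longer splits the word
```

With bytes, the 's' that follows the two bytes of 'ü' is treated as the start of a new word, which matches the stray capital 'S' described in the report. After decoding, the word is handled as a single run of letters.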
msg158339 - (view)	Author: Christian Clauss (Christian.Clauss) *	Date: 2012-04-15 15:10
On Apr 15, 2012, at 4:43 PM, R. David Murray wrote:

>
> R. David Murray <rdmurray@bitdance.com> added the comment:
>
> It works fine if you use unicode.
>
> ----------
> nosy: +r.david.murray
> resolution: -> invalid
> stage: -> committed/rejected
> status: open -> closed
>
> _______________________________________
> Python tracker <report@bugs.python.org>
> <http://bugs.python.org/issue14587>
> _______________________________________

What does it mean in this context to "use unicode"??

===============================================
In Idle...
===============================================
Python 2.7.3 (v2.7.3:70274d53c1dd, Apr 9 2012, 20:52:43)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "copyright", "credits" or "license()" for more information.
>>> lusai = u'lüsai'
Unsupported characters in input
>>> lusai = 'lüsai'
Unsupported characters in input
>>> print "ŠČŽ"
Unsupported characters in input

===============================================
In a script...
Every time that I try to "use unicode" an exception is thrown.
All try blocks in the following code trigger an exception
===============================================
#/bin/bash/env python
# -- coding: utf-8 --

print '=========='
import sys
# sys.version_info = sys.version_info(major=2, minor=7, micro=1, releaselevel='final', serial=0)
print 'sys.version_info = {}.{}.{} {} {}'.format(sys.version_info[0], sys.version_info[1], sys.version_info[2], sys.version_info[3], sys.version_info[4])
import commands, os
print 'os.name = {}'.format(os.name)
print 'os.uname = {}'.format(os.uname())
print '=========='

def myUpper(inString):
    return inString.upper().replace('à', 'À').replace('ä', 'Ä').replace('è', 'È').replace('é', 'É').replace('ö', 'Ö').replace('ü', 'Ü').replace('ẞ', 'ß')

def myLower(inString):
    return inString.lower().replace('À', 'à').replace('Ä', 'ä').replace('È', 'è').replace('É', 'é').replace('Ö', 'ö').replace('Ü', 'ü').replace('ß', 'ẞ')

def myTitle(inString):
    returnValue = []
    for theWord in inString.split():
        returnValue.append(myUpper(theWord[:1]) + myLower(theWord[1:]))
    return ' '.join(returnValue)

def formatted(inValue, inSep = ' '):
    s = str(inValue)
    print ' s={}{}su={}{}sl={}{}st={}...'.format(s, inSep, s.upper(), inSep, s.lower(), inSep, s.title())
    print ' s={}{}mu={}{}ml={}{}mt={}...'.format(s, inSep, myUpper(s), inSep, myLower(s), inSep, myTitle(s))
    u = unicode(inValue, 'utf8')
    try:
        print ' u={}{}uu={}{}ul={}{}ut={}...'.format(u, inSep, u.upper(), inSep, u.lower(), inSep, u.title())
    except:
        print "=== Exception thrown trying to print unicode({}, 'utf8')".format(repr(s))

kolnUpperUnspecified = str('KÖLN')
kolnUpperAsString = str('KÖLN')
kolnUpperAsUnicode = unicode('KÖLN', 'utf8')
kolnLowerUnspecified = str('köln')
kolnLowerAsString = str('köln')
kolnLowerAsUnicode = unicode('köln', 'utf8')

formatted(kolnUpperUnspecified)
formatted(kolnUpperAsString)
try:
    formatted(kolnUpperAsUnicode)
except:
    pass
formatted(kolnLowerUnspecified)
formatted(kolnLowerAsString)
try:
    formatted(kolnLowerAsUnicode)
except:
    pass
formatted('Ötto Clauß lives in the hamlet of Lüsai in the village of Lü in the valley of Val Müstair in the Canton of Graubünden', '\n')
formatted('ZÜRICH is the largest city in Switzerland and the geographic center of the country is in Älggi-Alp which can be reached via the Lötschberg Tunnel', '\n')
formatted('20% of Swiss people speak Französisch but only 0.5% speak Rätoromanisch', '\n')
formatted('LÜSAI, lüsai, München, Neuchâtel, Ny-Ålesund, Tromsø, ZÜRICH', '\n')

print """BUGS: certain diacritical marks can and should be capitalized...
str.upper() does not .replace('à', 'À').replace('ä', 'Ä').replace('è', 'È').replace('é', 'É').replace('ö', 'Ö').replace('ü', 'Ü'), etc.
str.lower() does not .replace('À', 'à').replace('Ä', 'ä').replace('È', 'è').replace('É', 'é').replace('Ö', 'ö').replace('Ü', 'ü'), etc.
str.title() has the same problems plus it capitalizes the letter _after_ a diacritic, e.g. 'lüsai'.title() --> 'LÜSai' with a capital 'S'.
myUpper(), myLower(), myTitle() exhibit the correct behavior with a handful of diacritic marks."""
msg158341 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2012-04-15 15:50
In addition to R. David's remark, it also works fine in a German locale. In general, you cannot know whether the byte '\xe4' denotes 'ä' or some other letter. For example, in KOI8-R, it denotes Д, instead, which already is an upper-case letter. So either do setlocale at the start of your program, or (better) switch to Unicode strings.

Python 2.6.6 (r266:84292, Dec 27 2010, 00:02:40)
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> print u'ä'.upper()
Ä
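The point about the byte '\xe4' can be checked with a short Python 3 sketch. This is an illustration rather than code from the thread; the variable names are assumptions, and `latin-1` is used here as one example of an encoding in which the byte means 'ä'.

```python
# The same single byte, interpreted under two different encodings.
raw = b'\xe4'

as_latin1 = raw.decode('latin-1')   # 'ä' (lower-case a with diaeresis)
as_koi8r = raw.decode('koi8_r')     # 'Д' (Cyrillic capital De)

# Once the bytes are decoded, the case mapping is unambiguous.
latin1_upper = as_latin1.upper()    # 'Ä'
koi8r_upper = as_koi8r.upper()      # unchanged, it is already upper case
```

A byte string carries no record of its encoding, so a case conversion on raw bytes has no way to choose between these two readings. Decoding makes the choice explicit.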
msg158344 - (view)	Author: STINNER Victor (vstinner) *	Date: 2012-04-15 17:43
Or you can port your program to Python 3 to avoid such issues :-)
msg158346 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2012-04-15 18:02
Indeed, this type of confusion is a large part of the motivation behind Python3. You might try posting to the python-list mailing list asking for help if for some reason you are required to use python2 for your program.
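For comparison, here is a hedged Python 3 sketch of the behavior the last two comments refer to. It reuses place names from the reporter's test data; the list and variable names are made up for the example. In Python 3, `str` is Unicode text, so the built-in case methods apply without any `.replace()` workarounds.

```python
# Python 3: str is Unicode text, so case methods handle non-ASCII letters.
names = ['LÜSAI', 'lüsai', 'München', 'Neuchâtel', 'Ny-Ålesund', 'Tromsø', 'ZÜRICH']

upper_names = [name.upper() for name in names]
lower_names = [name.lower() for name in names]
title_names = [name.title() for name in names]

# 'ß' has no single-character upper-case mapping in str.upper(); it becomes 'SS'.
clauss_upper = 'Clauß'.upper()
```

Note that `str.title()` still treats the hyphen in 'Ny-Ålesund' as a word boundary, so both parts are capitalized. That is by design and unrelated to the diacritics.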

History
Date	User	Action	Args
2022-04-11 14:57:29	admin	set	github: 58792
2012-04-15 18:02:18	r.david.murray	set	messages: + msg158346
2012-04-15 17:43:24	vstinner	set	nosy: + vstinner messages: + msg158344
2012-04-15 15:50:55	loewis	set	nosy: + loewis messages: + msg158341
2012-04-15 15:10:08	Christian.Clauss	set	messages: + msg158339
2012-04-15 14:43:22	r.david.murray	set	status: open -> closed nosy: + r.david.murray messages: + msg158336 resolution: not a bug stage: resolved
2012-04-15 14:17:27	Christian.Clauss	create