classification
Title: locale 1251 does not convert to upper case properly
Type: behavior Stage: needs patch
Components: Library (Lib) Versions: Python 3.0, Python 2.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: BreamoreBoy, ajaksu2, amaury.forgeotdarc, dobrokot, loewis
Priority: normal Keywords:

Created on 2007-01-13 17:30 by dobrokot, last changed 2010-09-27 17:36 by amaury.forgeotdarc.

Files
File name Uploaded Description
yo.py dobrokot, 2007-01-13 17:30 source code
toupper.zip dobrokot, 2007-01-18 21:18 _toupper.c and toupper.c files from VC++7.1 CRT
Messages (10)
msg31021 - (view) Author: Ivan Dobrokotov (dobrokot) Date: 2007-01-13 17:30
<pre>
 # -*- coding: 1251 -*-

import locale

locale.setlocale(locale.LC_ALL, ".1251") #locale name may be Windows specific?

#-----------------------------------------------
print chr(184), chr(168)
assert  chr(255).upper() == chr(223) #OK
assert  chr(184).upper() == chr(168) #fail
#-----------------------------------------------
assert  'q'.upper() == 'Q' #OK 
assert  'ж'.upper() == 'Ж' #OK
assert  'Ж'.upper() == 'Ж' #OK
assert  'я'.upper() == 'Я' #OK
assert  u'ё'.upper() == u'Ё' #OK (locale independent)
assert  'ё'.upper() == 'Ё' #fail
</pre>

I suspect upper() is implemented incorrectly, something like:

<pre>
if ('a' <= c && c <= 'я')
  return c+'Я'-'я'
</pre>

The character 'ё' (184 in cp1251) is not in the range 'a'-'я'.
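
To illustrate the code page layout described above, here is a minimal Python 3 sketch. It is not the reporter's Python 2.5 setup: it assumes Python's built-in cp1251 codec as a stand-in for the C locale, and the helper name `upper_cp1251_byte` is hypothetical. It shows that 'ё' sits at byte 184, outside the contiguous block 224-255 that holds the rest of the lowercase Cyrillic alphabet, so a fixed-offset conversion cannot map it to 'Ё', while a conversion that goes through Unicode can.

```python
# Illustration only: Python 3, using the built-in cp1251 codec, not the C locale.

def upper_cp1251_byte(b):
    """Uppercase one cp1251 byte value via Unicode (hypothetical helper)."""
    return bytes([b]).decode("cp1251").upper().encode("cp1251")[0]

# Lowercase Cyrillic a..ya occupy bytes 224..255; capitals sit exactly 32 lower.
lower_block = bytes(range(224, 256)).decode("cp1251")
upper_block = bytes(range(192, 224)).decode("cp1251")
assert lower_block.upper() == upper_block

# U+0451 (lowercase yo) and U+0401 (capital yo) lie outside both blocks,
# and they are 16 apart rather than 32.
yo_lower = "\u0451".encode("cp1251")[0]
yo_upper = "\u0401".encode("cp1251")[0]
print(yo_lower, yo_upper, upper_cp1251_byte(yo_lower))
```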
msg31022 - (view) Author: Ivan Dobrokotov (dobrokot) Date: 2007-01-13 17:49
The C CRT library function toupper('ё') works properly if I call setlocale(LC_ALL, ".1251")
msg31023 - (view) Author: Ivan Dobrokotov (dobrokot) Date: 2007-01-13 17:51
Sorry, I meant
toupper((int)(unsigned char)'ё')
not just toupper('ё')
msg31024 - (view) Author: Ivan Dobrokotov (dobrokot) Date: 2007-01-13 21:08
I forgot to mention the Python version used: http://www.python.org/ftp/python/2.5/python-2.5.msi
msg31025 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2007-01-18 20:08
You can see the implementation of .upper in

http://svn.python.org/projects/python/tags/r25/Objects/stringobject.c
(function string_upper)

Off-hand, I cannot see anything wrong in that code. It definitely does *not* use c+'Я'-'я'.
msg31026 - (view) Author: Ivan Dobrokotov (dobrokot) Date: 2007-01-18 21:18
well, C:
----------------------------

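#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <assert.h>

int main()
{
  int i = 184;  /* 'ё' in cp1251 */
  char *old = setlocale(LC_CTYPE, ".1251");  /* NULL if the locale is unavailable */
  assert(old);
  printf("%d -> %d\n", i, _toupper(i));  /* macro form */
  printf("%d -> %d\n", i, toupper(i));   /* function form */
  return 0;
}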

----------------------------
C output:
184 -> 152
184 -> 168

So _toupper and toupper are different functions. MSDN does not mention any difference, except that 'toupper' is "ANSI compatible" :(



File Added: toupper.zip
msg31027 - (view) Author: Ivan Dobrokotov (dobrokot) Date: 2007-01-18 21:59

----------------------------------------------
standard header ctype.h:

#define _toupper(_c)    ( (_c)-'a'+'A' )


----------------------------------------------
CRT file toupper.c:



/* define function-like macro equivalent to _toupper()
 */
#define mkupper(c)  ( (c)-'a'+'A' )



int __cdecl _toupper (
        int c
        )
{
        return(mkupper(c));
}

( http://www.everfall.com/paste/id.php?j13ernl40i9e )

Suggestion: replace _toupper with toupper. Performance may degrade (a lot of thread locks, MultiByteToWideChar calls, and other code for every non-ASCII lowercase character). Suggestion for optimization: set up "int toupper_table[256]" (and other tables) on every call to setlocale.
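
The lookup-table idea above could be sketched roughly as follows. This is a Python 3 illustration, not a patch for stringobject.c: it assumes the cp1251 codec as a stand-in for the C locale, and the names `build_toupper_table`, `TOUPPER_1251` and `upper_bytes` are hypothetical. The table is built once (in the proposal, on each setlocale call), after which uppercasing costs one lookup per byte.

```python
# Sketch only: Python 3 stand-in for the proposed "int toupper_table[256]".
# Assumption: the table would be rebuilt whenever the locale changes; here
# the cp1251 codec plays the role of the C locale.

def build_toupper_table(encoding):
    """Return a 256-entry list mapping each byte to its uppercase byte."""
    table = list(range(256))  # default: every byte maps to itself
    for b in range(256):
        try:
            upper = bytes([b]).decode(encoding).upper().encode(encoding)
        except UnicodeError:
            continue  # byte is unassigned, or its capital is not in this code page
        if len(upper) == 1:
            table[b] = upper[0]
    return table

TOUPPER_1251 = build_toupper_table("cp1251")

def upper_bytes(data, table=TOUPPER_1251):
    """Uppercase a byte string with one table lookup per byte."""
    return bytes(table[b] for b in data)

print(upper_bytes(bytes([184, 255, ord("q")])))
```

Bytes with no single-byte capital in the code page are left unchanged, which keeps the table total over all 256 values.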


msg84614 - (view) Author: Daniel Diniz (ajaksu2) Date: 2009-03-30 19:04
May be related to issue 1633600.
msg116585 - (view) Author: Mark Lawrence (BreamoreBoy) Date: 2010-09-16 18:01
I've tried to see if this is still an issue but frankly can't make head nor tail out of it :(  Any locale gurus up for this?
msg117452 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2010-09-27 17:36
The OP is right: str.upper is supposed to be locale-dependent:
http://docs.python.org/library/stdtypes.html#str.upper

But the implementation uses _toupper() which is a macro with Visual Studio, and obviously not locale-dependent:

#define _toupper(_Char)    ( (_Char)-'a'+'A' )
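
As a quick check of what that macro computes, here is a minimal Python 3 re-statement of its arithmetic (the function name `macro_toupper` is hypothetical). It reproduces the "184 -> 152" result from the C test earlier in this thread, and would also be consistent with chr(255).upper() == chr(223) passing in the original report, since those two bytes happen to be exactly 32 apart.

```python
# Illustration only: Python 3 re-statement of the macro's arithmetic.

def macro_toupper(c):
    """Same arithmetic as (c) - 'a' + 'A': always subtracts 32, no locale lookup."""
    return c - ord("a") + ord("A")

print(macro_toupper(ord("q")) == ord("Q"))  # ASCII lowercase maps correctly
print(macro_toupper(184))                   # cp1251 byte 184: not the 168 that toupper() returned
print(macro_toupper(255))                   # cp1251 byte 255: offset happens to be 32
```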
History
Date User Action Args
2010-09-27 17:36:29 amaury.forgeotdarc set nosy: + amaury.forgeotdarc
messages: + msg117452
stage: test needed -> needs patch
2010-09-16 18:01:57 BreamoreBoy set nosy: + BreamoreBoy
messages: + msg116585
2009-03-30 19:04:16 ajaksu2 set versions: + Python 2.6, Python 3.0
nosy: + ajaksu2
messages: + msg84614
type: behavior
stage: test needed
2007-01-13 17:30:16 dobrokot create