classification
Title: LC_CTYPE system setting not respected by setlocale()
Type: behavior
Stage: resolved
Components: Interpreter Core, macOS, Tkinter
Versions: Python 2.5
process
Status: closed
Resolution: out of date
Dependencies:
Superseder:
Assigned To: ronaldoussoren
Nosy List: antlong, belopolsky, eric.araujo, eric.smith, jkloth, loewis, mark.dickinson, ronaldoussoren, vstinner
Priority: normal
Keywords:

Created on 2010-07-23 02:38 by antlong, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name           Uploaded
issue9335-test.py   belopolsky, 2010-07-23 14:13
isalpha.c           vstinner, 2010-07-25 23:27
Messages (29)
msg111233 - (view) Author: Anthony Long (antlong) Date: 2010-07-23 02:38
On Mac OS X 10.5 with Python 2.6.4 (via MacPorts), evaluating

len(string.letters) will produce 117 instead of 52.

from terminal:
along-mb:~ along$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=

This appears to be related to:

locale.setlocale(locale.LC_CTYPE) not being respected.

len(string.letters) should produce 52.
msg111234 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 02:59
I can reproduce this in Apple's IDLE, but not in the trunk or 2.7 versions.  I'll leave it open in case Ronald is interested.  Antlong also reports that this happens on Windows, but I cannot verify that.

Here is my session copied from idle:


Python 2.5.3c1 (release25-maint, Dec 17 2008, 21:50:37) 
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "copyright", "credits" or "license()" for more information.

    ****************************************************************
    Personal firewall software may warn about the connection IDLE
    makes to its subprocess using this computer's internal loopback
    interface.  This connection is not visible on any external
    interface and no data is sent to or received from the Internet.
    ****************************************************************
    
IDLE 1.2.3c1      
>>> from string import letters
>>> letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> len(letters)
117
>>> letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> print _
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzᆰᄉᄎÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
>>> letters.isalpha()
True
>>> import locale
>>> locale.getlocale()
('en_US', 'UTF8')
>>> locale.setlocale(locale.LC_CTYPE)
'en_US.UTF-8'
>>>
msg111235 - (view) Author: Anthony Long (antlong) Date: 2010-07-23 03:16
Also: Windows x64, Python 2.7

Python 2.7 (r27:82525, Jul  4 2010, 07:43:08) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import string
>>> string.letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\x9a\x9c\x9e\x9f\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'cp1252')
>>>
msg111236 - (view) Author: Jeremy Kloth (jkloth) * Date: 2010-07-23 03:20
Note that this behavior is only present when running IDLE.  Python command-line does not show this oddity.
msg111237 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 03:23
Here is a simpler test: in idle2.6,

>>> '\xff'.isalpha()
True

but in idle2.7 and plain python prompt, it is False.
msg111239 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 03:42
Here is a way to reproduce this from command line:


$ python2.6
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xff'.isalpha()
False
>>> import idlelib.run
>>> '\xff'.isalpha()
True
msg111240 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 03:47
Or even simpler:


$ python2.6
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import Tkinter
>>> '\xff'.isalpha()
True
msg111242 - (view) Author: Anthony Long (antlong) Date: 2010-07-23 04:02
Windows 64-bit, Python 2.7:
>>> '\xff'.isalpha()
False
>>> import idlelib.run
>>> '\xff'.isalpha()
False

And on Windows 32-bit, Python 2.6: both False.
msg111243 - (view) Author: Anthony Long (antlong) Date: 2010-07-23 04:02
Mac 10.5.6: py 2.6.4 - broken

Python 2.6.4 (r264:75706, Mar 18 2010, 14:58:13) 
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xff'.isalpha()
False
>>> import Tkinter
>>> '\xff'.isalpha()
True
>>>
msg111244 - (view) Author: Anthony Long (antlong) Date: 2010-07-23 04:17
Python 2.6.4, Mac 10.5:

>>> from string import letters
>>> letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF8')
>>>
msg111246 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 04:18
This is clearly a Tkinter rather than Mac issue, so I am unassigning this from Ronald.  This appears to be the same problem as the one Mark described in msg102301.

>>> import locale
>>> locale.nl_langinfo(locale.CODESET)
'US-ASCII'
>>> import _tkinter
>>> locale.nl_langinfo(locale.CODESET)
'UTF-8'

This happens in both 2.6 and 2.7, but seems to be deliberate.  As Mark wrote in msg102328:

"""
There's still the issue of the Tkinter import changing the locale, but that seems to be out of Python's control.  As far as I can tell, it happens when the module initialization calls Tcl_FindExecutable, which is part of the Tcl library itself.  This may well be deliberate:  see

http://www.tcl.tk/cgi-bin/tct/tip/66.html
"""

What is still unclear to me is why, after CODESET changes to 'UTF-8', 2.6 thinks that '\xff' is a letter, but 2.7 does not.

Of course, '\xff' makes little sense in 'UTF-8', but why does the answer change between versions?
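
A minimal sketch of how one might check whether importing a given module changes the process locale, as observed above with _tkinter. The helper name `ctype_locale_around_import` is hypothetical and not part of the locale module; the code is written for Python 3 (the module discussed here is named "tkinter" there, "Tkinter" in 2.x), and a module that is already imported will not re-run its initialization, so the check is only meaningful on a first import.

```python
import importlib
import locale


def ctype_locale_around_import(module_name):
    """Return the LC_CTYPE setting before and after importing module_name.

    Hypothetical helper for illustration only.
    """
    # Calling setlocale() with only a category queries the current
    # setting without changing it.
    before = locale.setlocale(locale.LC_CTYPE)
    importlib.import_module(module_name)
    after = locale.setlocale(locale.LC_CTYPE)
    return before, after


# "json" is used here only because it is always available; substitute the
# module under suspicion.
before, after = ctype_locale_around_import("json")
print("changed" if before != after else "unchanged", before, after)
```
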
msg111247 - (view) Author: Anthony Long (antlong) Date: 2010-07-23 04:26
After import _tkinter, I wound up getting this, which is totally different from before:

>>> letters
'abcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xffABCDEFGHIJKLMNOPQRSTUVWXYZ\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde'
msg111248 - (view) Author: Anthony Long (antlong) Date: 2010-07-23 04:40
A bit more info:

Python 2.6.4 (r264:75706, Mar 18 2010, 14:58:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.nl_langinfo(locale.CODESET)
'US-ASCII'
>>>
along-mb:~ along$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
along-mb:~ along$
msg111249 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 04:50
In 3.x, it is different:


>>> locale.nl_langinfo(locale.CODESET)
'UTF-8'

Victor,

This looks like your cup of tea.
msg111251 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-07-23 08:03
I fail to see the bug in this report. '\xff' is a letter because the C library says it is. If you think the result is wrong, file a bug report with the OS vendor.
msg111326 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 14:13
On Fri, Jul 23, 2010 at 4:03 AM, Martin v. Löwis <report@bugs.python.org> wrote:
..
> I fail to see the bug in this report. '\xff' is a letter because the C library says it is.

This does not explain the difference between 2.6 and 2.7.  With
attached issue9335-test.py,

$ cat issue9335-test.py
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
print(chr(255).isalpha())

$ python2.7 issue9335-test.py
False
$ python2.6 issue9335-test.py
True
$ python2.5 issue9335-test.py
True

Since chr(255) = '\xff' is not a valid UTF-8 byte sequence, it makes
little sense to ask whether it is a letter or not in a locale that
uses UTF-8 encoding.   Nevertheless the behavior changed between
revisions and it is not mentioned in "what's new in 2.7".  (I suspect
this was introduced in issue5793 (r72040), but I have not verified.)

There are two possible action items here:

1. New behavior needs to be documented.   I believe 2.7 is correct
because when isalpha is used to sanitize untrusted input, it is better
to reject in the case of uncertainty.

2. Arguably, this is a security issue and thus eligible for backporting to 2.6.
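
A note for readers who try the test script on Python 3: there `chr(255)` is a Unicode string rather than a byte string, so the script no longer exercises the C library's isalpha() at all. The sketch below (Python 3 only, not part of the original report) shows the two separate questions that 3.x distinguishes.

```python
# Python 3 only: str and bytes are distinct types here.
byte_ff = bytes([255])   # b'\xff', the single byte the 2.x script tested
char_ff = chr(255)       # the code point U+00FF, a str of length 1

# bytes.isalpha() considers only ASCII letters, regardless of locale.
print(byte_ff.isalpha())   # False

# str.isalpha() uses Unicode character properties; U+00FF is a letter.
print(char_ff.isalpha())   # True
```
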
msg111329 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 14:20
Another issue that may be worth revisiting is whether or not it is OK for _tkinter to set the locale.

"""
21.2.2. For extension writers and programs that embed Python

Extension modules should never call setlocale(), except to find out what the current locale is. 
""" 

http://docs.python.org/dev/library/locale.html#for-extension-writers-and-programs-that-embed-python
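
For application code that has to call something known to change the locale, one possible defensive pattern is to record LC_CTYPE and restore it afterwards. This is only a sketch: `preserve_lc_ctype` is a hypothetical helper name, not a locale module API, and because setlocale() affects the whole process it is not safe to use while other threads depend on the locale.

```python
import contextlib
import locale


@contextlib.contextmanager
def preserve_lc_ctype():
    """Restore the LC_CTYPE setting on exit (hypothetical helper)."""
    # The one-argument form of setlocale() only queries the setting.
    saved = locale.setlocale(locale.LC_CTYPE)
    try:
        yield saved
    finally:
        # Put back whatever was active when the block was entered.
        locale.setlocale(locale.LC_CTYPE, saved)


# Usage: the explicit setlocale() call stands in for an import or library
# call that changes the locale as a side effect.
original = locale.setlocale(locale.LC_CTYPE)
with preserve_lc_ctype():
    locale.setlocale(locale.LC_CTYPE, "C")
restored = locale.setlocale(locale.LC_CTYPE)
print(original == restored)
```
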
msg111332 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-07-23 14:28
This might be caused by the fix for issue7072 (which is mentioned in the NEWS file).
msg111345 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 15:27
> This might be caused by the fix for issue7072.

Ronald,

You are absolutely right.  Reverting r80178 in the trunk restores the old behavior.

> msg103494:
> Fixed in r80178 (trunk), r80180 (2.6), r80182 (3.2), r80183 (3.1)

I think this can be closed as out of date, but I am giving it back to you to decide whether security implications are important enough to backport to 2.5.


Anthony,

Please open a separate issue for Tkinter if you want it considered.  It was rejected once already [msg102328],  but even if Tkinter behavior is deemed appropriate,  I think it should at least be documented.
msg111348 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-07-23 15:45
Why do you think this may have security implications?

I'm closing this as out of date because the issue is fixed and the fix is imho inappropriate for a backport to 2.6 due to the change in behaviour.
msg111350 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-23 15:53
Accepting binary input where only letters are expected by an application is a very common source of security holes.   An application that relies on s.isalpha() to guarantee that s does not contain non-ASCII characters when a UTF-8 locale is in use may have a security hole if it is run with Python 2.5.
msg111447 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-07-24 10:33
If an application uses .isalpha for a security-relevant check, this is a security issue in the application, not in Python.
msg111448 - (view) Author: Anthony Long (antlong) Date: 2010-07-24 10:35
I disagree. It's expected that the function will return valid data. This doesn't return valid data so isalpha() is compromised.
msg111449 - (view) Author: Ronald Oussoren (ronaldoussoren) * (Python committer) Date: 2010-07-24 10:41
I agree with Martin that the security problem would be in the application, not python itself.

Testing with isalpha is generally not the right thing to do anyway; it is much better to restrict input to a known-good set of data, such as by using regular expressions.  For multi-byte encodings like UTF-8 you cannot rely on per-byte calls to isalpha anyway.  The situation is even worse for an encoding like Shift-JIS where you need context to know if a byte is part of a multi-byte value.
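
A minimal sketch of the "known-good set" approach suggested above, assuming the application wants ASCII letters only. The function name `is_ascii_letters` is hypothetical; the explicit character class does not depend on the locale, unlike isalpha() on 2.x byte strings.

```python
import re

# Explicit ranges: nothing outside A-Z and a-z can match, whatever the locale.
_ASCII_LETTERS = re.compile(r"[A-Za-z]+")


def is_ascii_letters(text):
    """Return True only if text is a non-empty run of ASCII letters."""
    # fullmatch() requires the whole string to match, so a trailing
    # newline or any other extra character is rejected.
    return _ASCII_LETTERS.fullmatch(text) is not None


print(is_ascii_letters("abcXYZ"))
print(is_ascii_letters("abc\xff"))
```
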
msg111457 - (view) Author: Martin v. Löwis (loewis) * (Python committer) Date: 2010-07-24 11:37
> I disagree. It's expected that the function will return valid data. This doesn't return valid data so isalpha() is compromised.

What is "valid data"? The function (isalpha) should return a boolean, 
and it does. So the result is certainly "valid".

The documentation says "For 8-bit strings, this method is 
locale-dependent." So it is correct if it returns what the OS vendor says
to return.
msg111467 - (view) Author: Anthony Long (antlong) Date: 2010-07-24 12:23
The locale is set incorrectly though - so it is not valid data. Valid data is a-Z. nothing more nothing less, and the locale and the alphabet should not be changed.
msg111574 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-25 23:27
> Victor, This looks like your cup of tea.

Unicode is my cup of tea, but not programs that treat bytes as characters.

<a byte string>.isalpha() doesn't mean anything to me :-)

This issue is more a question about the C library than about Python :-) So try the attached program "isalpha.c" if you would like to test your libc.

Results on my Linux box (Debian Sid, eglibc 2.11.2):
----------------
$ ./isalpha C
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz (52)

$ ./isalpha fr_FR.UTF-8
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz (52)

$ ./isalpha fr_FR.iso88591
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff (117)

$ ./isalpha fr_FR.iso885915@euro
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xa6\xa8\xaa\xb4\xb5\xb8\xba\xbc\xbd\xbe\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff (124)
----------------

If your libc considers that \xff is a valid UTF-8 character, you should change your OS for a better one :-)
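
The contents of isalpha.c are not reproduced in this thread, so the following is only a guess at an equivalent check, written in Python with ctypes. It assumes a POSIX system where `ctypes.CDLL(None)` exposes the C library's isalpha(); the function name `libc_alpha_bytes` is hypothetical.

```python
import ctypes
import locale


def libc_alpha_bytes(locale_name):
    """Return the bytes 0-255 that the C library's isalpha() accepts.

    Assumes POSIX: CDLL(None) gives access to the symbols already loaded
    into the process, including libc's isalpha().
    """
    libc = ctypes.CDLL(None)
    libc.isalpha.argtypes = [ctypes.c_int]
    libc.isalpha.restype = ctypes.c_int

    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        # Raises locale.Error if the named locale is not installed.
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return bytes(b for b in range(256) if libc.isalpha(b))
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


letters = libc_alpha_bytes("C")
print(letters, "(%d)" % len(letters))
```

Passing another installed locale name, such as the ones used in the results above, should show what the local C library reports for that locale.
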

--

> >>> len(letters)
> 117
> ...
> >>> locale.setlocale(locale.LC_CTYPE)
> 'en_US.UTF-8'

It looks like Mac OS X uses ISO-8859-1 instead of UTF-8.

--

string.letters is built using strop.lowercase + strop.uppercase, which are built using the C functions islower() and isupper(). locale.setlocale() regenerates strop/string.lowercase, strop/string.uppercase and string.letters for the LC_CTYPE and LC_ALL categories.

--

You don't need to run IDLE or import Tkinter to set the locale:

   import locale; locale.setlocale(locale.LC_ALL, '')

is enough.

--

A library should not change the locale (only the application).

$ python2.6
>>> import locale
>>> locale.getlocale()
(None, None)
>>> import Tkinter
>>> locale.getlocale()
('fr_FR', 'UTF8')

=> Tkinter is a horrible library! (The bug is in the C library, not in the Python wrapper)

Use a better one like Gtk or Qt ;-)

$ python
>>> import locale
>>> import pygtk
>>> locale.getlocale()
(None, None)
>>> import PyQt4
>>> locale.getlocale()
(None, None)

(IDLE is based on Tkinter)

--

I don't understand why Alexander gets different results on Python 2.6 and Python 2.7.

@belopolsky: Are both programs linked to (built with?) the same C library? (same library version)
msg111587 - (view) Author: Alexander Belopolsky (belopolsky) * (Python committer) Date: 2010-07-26 00:20
On Sun, Jul 25, 2010 at 7:27 PM, STINNER Victor <report@bugs.python.org> wrote:
..
> Unicode is my cup of tea, but not programs considering that bytes are characters.
>

What I called "your cup of tea" was 3.x returning 'UTF-8' from
locale.nl_langinfo(locale.CODESET) where 2.x returned 'US-ASCII'.  (In
both cases this was the first call to locale module functions.)

> I don't understand why Alexander gets different results on Python 2.6 and Python 2.7.
>

It looks like you have missed most of the discussion under this issue.
 Sorry that you had to reinvestigate.  Ronald explained the difference
in msg111332.  He introduced a workaround for broken OSX C library
isalpha in r80178.
msg111588 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2010-07-26 00:26
Oops, the issue is already closed /o\
History
Date User Action Args
2022-04-11 14:57:04 admin set github: 53581
2010-07-26 00:26:23 vstinner set messages: + msg111588
2010-07-26 00:20:55 belopolsky set messages: + msg111587
2010-07-25 23:27:20 vstinner set files: + isalpha.c
    messages: + msg111574
2010-07-24 12:23:24 antlong set messages: + msg111467
2010-07-24 11:37:47 loewis set messages: + msg111457
2010-07-24 10:41:11 ronaldoussoren set messages: + msg111449
2010-07-24 10:35:29 antlong set messages: + msg111448
2010-07-24 10:33:54 loewis set messages: + msg111447
2010-07-23 15:53:12 belopolsky set messages: + msg111350
2010-07-23 15:45:08 ronaldoussoren set status: pending -> closed
    messages: + msg111348
2010-07-23 15:27:01 belopolsky set status: open -> pending
    assignee: belopolsky -> ronaldoussoren
    components: + macOS
    versions: - Python 2.6
    nosy: loewis, ronaldoussoren, mark.dickinson, belopolsky, vstinner, eric.smith, jkloth, eric.araujo, antlong
    messages: + msg111345
    resolution: out of date
    stage: resolved
2010-07-23 14:28:00 ronaldoussoren set messages: + msg111332
2010-07-23 14:27:09 eric.araujo set nosy: + eric.araujo
2010-07-23 14:20:21 belopolsky set nosy: + eric.smith
    messages: + msg111329
    components: + Interpreter Core, - macOS
    type: behavior
2010-07-23 14:13:26 belopolsky set files: + issue9335-test.py
    messages: + msg111326
2010-07-23 08:03:38 loewis set nosy: + loewis
    messages: + msg111251
2010-07-23 04:50:34 belopolsky set nosy: + vstinner
    messages: + msg111249
2010-07-23 04:40:35 antlong set messages: + msg111248
2010-07-23 04:26:50 antlong set messages: + msg111247
2010-07-23 04:20:29 belopolsky set assignee: ronaldoussoren -> belopolsky
2010-07-23 04:20:10 belopolsky set nosy: + mark.dickinson
2010-07-23 04:18:52 belopolsky set messages: + msg111246
2010-07-23 04:17:47 antlong set messages: + msg111244
2010-07-23 04:02:35 antlong set messages: + msg111243
2010-07-23 04:02:09 antlong set messages: + msg111242
2010-07-23 03:47:01 belopolsky set nosy: ronaldoussoren, belopolsky, jkloth, antlong
    messages: + msg111240
    components: + Tkinter
2010-07-23 03:42:15 belopolsky set messages: + msg111239
2010-07-23 03:23:01 belopolsky set messages: + msg111237
2010-07-23 03:20:25 jkloth set nosy: + jkloth
    messages: + msg111236
2010-07-23 03:16:12 antlong set nosy: ronaldoussoren, belopolsky, antlong
    type: behavior -> (no value)
    messages: + msg111235
    components: - IDLE
2010-07-23 02:59:44 belopolsky set nosy: + belopolsky
    messages: + msg111234
    components: + IDLE
    type: behavior
2010-07-23 02:38:29 antlong create