msg111233 - (view) |
Author: Anthony Long (antlong) |
Date: 2010-07-23 02:38 |
On Mac OS X 10.5 with Python 2.6.4 (via MacPorts), evaluating
len(string.letters) produces 117 instead of 52.
From the terminal:
along-mb:~ along$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
This appears to be related to:
locale.setlocale(locale.LC_CTYPE) not being respected.
len(string.letters) should produce 52.
|
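For comparison, here is a minimal sketch assuming Python 3, where string.letters no longer exists. The constant string.ascii_letters is documented as locale-independent, so its length stays at 52 even after the process adopts the user's locale. The helper name ascii_letter_count is illustrative only.

```python
import locale
import string


def ascii_letter_count():
    """Return the length of the locale-independent letter constant."""
    # Adopt the user's default locale, roughly what importing Tkinter did in
    # this report. Ignore the error if the environment names an unknown locale.
    try:
        locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        pass
    return len(string.ascii_letters)


print(ascii_letter_count())  # 52
```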
msg111234 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 02:59 |
I can reproduce this in Apple's idle, but not in trunk or 2.7 versions. I'll leave it open in case Ronald is interested. Antlong also reports that this happens on windows, but I cannot verify that.
Here is my session copied from idle:
Python 2.5.3c1 (release25-maint, Dec 17 2008, 21:50:37)
[GCC 4.0.1 (Apple Computer, Inc. build 5363)] on darwin
Type "copyright", "credits" or "license()" for more information.
****************************************************************
Personal firewall software may warn about the connection IDLE
makes to its subprocess using this computer's internal loopback
interface. This connection is not visible on any external
interface and no data is sent to or received from the Internet.
****************************************************************
IDLE 1.2.3c1
>>> from string import letters
>>> letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> len(letters)
117
>>> letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> print _
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzªµºÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ
>>> letters.isalpha()
True
>>> import locale
>>> locale.getlocale()
('en_US', 'UTF8')
>>> locale.setlocale(locale.LC_CTYPE)
'en_US.UTF-8'
>>>
|
msg111235 - (view) |
Author: Anthony Long (antlong) |
Date: 2010-07-23 03:16 |
Also: Windows x64, Python 2.7
Python 2.7 (r27:82525, Jul 4 2010, 07:43:08) [MSC v.1500 64 bit (AMD64)] on win32
Type "copyright", "credits" or "license()" for more information.
>>> import string
>>> string.letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\x83\x8a\x8c\x8e\x9a\x9c\x9e\x9f\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'cp1252')
>>>
|
msg111236 - (view) |
Author: Jeremy Kloth (jkloth) * |
Date: 2010-07-23 03:20 |
Note that this behavior is only present when running IDLE. Python command-line does not show this oddity.
|
msg111237 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 03:23 |
Here is a simpler test: in idle2.6,
>>> '\xff'.isalpha()
True
but in idle2.7 and plain python prompt, it is False.
|
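The question underneath this test is whether '\xff' is a byte or a character. As a point of comparison, and assuming Python 3 rather than the 2.x interpreters in this report, the two cases are separate types: bytes.isalpha() considers only ASCII letters regardless of locale, while str.isalpha() consults the Unicode character database. The function name below is made up for illustration.

```python
def classify_0xff():
    """Return (bytes result, str result) of isalpha() for the value 0xFF."""
    as_byte = b"\xff".isalpha()  # bytes: only ASCII a-z and A-Z count
    as_text = "\xff".isalpha()   # str: U+00FF is a lowercase letter
    return as_byte, as_text


print(classify_0xff())  # (False, True)
```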
msg111239 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 03:42 |
Here is a way to reproduce this from command line:
$ python2.6
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xff'.isalpha()
False
>>> import idlelib.run
>>> '\xff'.isalpha()
True
|
msg111240 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 03:47 |
Or even simpler:
$ python2.6
Python 2.6.5 (r265:79359, Mar 24 2010, 01:32:55)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import Tkinter
>>> '\xff'.isalpha()
True
|
msg111242 - (view) |
Author: Anthony Long (antlong) |
Date: 2010-07-23 04:02 |
Windows 64-bit, Python 2.7:
>>> '\xff'.isalpha()
False
>>> import idlelib.run
>>> '\xff'.isalpha()
False
And on Windows 32-bit, Python 2.6: both False.
|
msg111243 - (view) |
Author: Anthony Long (antlong) |
Date: 2010-07-23 04:02 |
Mac 10.5.6: py 2.6.4 - broken
Python 2.6.4 (r264:75706, Mar 18 2010, 14:58:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '\xff'.isalpha()
False
>>> import Tkinter
>>> '\xff'.isalpha()
True
>>>
|
msg111244 - (view) |
Author: Anthony Long (antlong) |
Date: 2010-07-23 04:17 |
Python 2.6.4, Mac 10.5:
>>> from string import letters
>>> letters
'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'
>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF8')
>>>
|
msg111246 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 04:18 |
This is clearly a Tkinter rather than Mac issue, so I am unassigning this from Ronald. This appears to be the same problem as the one Mark described in msg102301.
>>> import locale
>>> locale.nl_langinfo(locale.CODESET)
'US-ASCII'
>>> import _tkinter
>>> locale.nl_langinfo(locale.CODESET)
'UTF-8'
This happens in both 2.6 and 2.7, but seems to be deliberate. As Mark wrote in msg102328:
"""
There's still the issue of the Tkinter import changing the locale, but that seems to be out of Python's control. As far as I can tell, it happens when the module initialization calls Tcl_FindExecutable, which is part of the Tcl library itself. This may well be deliberate: see
http://www.tcl.tk/cgi-bin/tct/tip/66.html
"""
What is still unclear to me, is why after CODESET changes to 'UTF-8', 2.6 thinks that '\xff' is a letter, but 2.7 does not.
Of course, '\xff' makes little sense in 'UTF-8', but why does the answer change between versions?
|
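One way to check whether an import changes the process locale is to snapshot the LC_CTYPE setting before and after it. The sketch below assumes Python 3 (module name tkinter rather than Tkinter); the helper names are hypothetical, and whether the setting actually changes depends on the Tcl/Tk build and the environment.

```python
import importlib
import locale


def ctype_locale_around(action):
    """Return the LC_CTYPE setting before and after calling `action`."""
    # Calling setlocale() without a second argument only queries the setting.
    before = locale.setlocale(locale.LC_CTYPE)
    action()
    after = locale.setlocale(locale.LC_CTYPE)
    return before, after


def import_tkinter():
    """Import tkinter if it is available; do nothing otherwise."""
    try:
        importlib.import_module("tkinter")
    except ImportError:
        pass


if __name__ == "__main__":
    before, after = ctype_locale_around(import_tkinter)
    print("changed" if before != after else "unchanged", before, after)
```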
msg111247 - (view) |
Author: Anthony Long (antlong) |
Date: 2010-07-23 04:26 |
After import _tkinter, I wound up getting this, which is totally different from before:
>>> letters
'abcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xffABCDEFGHIJKLMNOPQRSTUVWXYZ\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde'
|
msg111248 - (view) |
Author: Anthony Long (antlong) |
Date: 2010-07-23 04:40 |
A bit more info:
Python 2.6.4 (r264:75706, Mar 18 2010, 14:58:13)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.nl_langinfo(locale.CODESET)
'US-ASCII'
>>>
along-mb:~ along$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
along-mb:~ along$
|
msg111249 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 04:50 |
In 3.x, it is different:
>>> locale.nl_langinfo(locale.CODESET)
'UTF-8'
Victor,
This looks like your cup of tea.
|
msg111251 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2010-07-23 08:03 |
I fail to see the bug in this report. '\xff' is a letter because the C library says it is. If you think the result is wrong, file a bug report with the OS vendor.
|
msg111326 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 14:13 |
On Fri, Jul 23, 2010 at 4:03 AM, Martin v. Löwis <report@bugs.python.org> wrote:
..
> I fail to see the bug in this report. '\xff' is a letter because the C library says it is.
This does not explain the difference between 2.6 and 2.7. With
attached issue9335-test.py,
$ cat issue9335-test.py
import locale
locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
print(chr(255).isalpha())
$ python2.7 issue9335-test.py
False
$ python2.6 issue9335-test.py
True
$ python2.5 issue9335-test.py
True
Since chr(255) = '\xff' is not a valid UTF-8 byte sequence, it makes
little sense to ask whether it is a letter or not in a locale that
uses UTF-8 encoding. Nevertheless, the behavior changed between
revisions and it is not mentioned in "What's New in 2.7". (I suspect
this was introduced in issue5793 (r72040), but I have not verified.)
There are two possible action items here:
1. New behavior needs to be documented. I believe 2.7 is correct
because when isalpha is used to sanitize untrusted input, it is better
to reject in the case of uncertainty.
2. Arguably, this is a security issue and thus eligible for backporting to 2.6.
|
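The claim that '\xff' is not a valid UTF-8 byte sequence can be checked directly. This is a small sketch assuming Python 3 bytes objects; is_valid_utf8 is an illustrative helper, not something from the attached script.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"\xff"))      # False: 0xFF never appears in UTF-8
print(is_valid_utf8(b"\xc3\xbf"))  # True: the UTF-8 encoding of U+00FF
```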
msg111329 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 14:20 |
Another issue that may be worth revisiting is whether or not it is OK for _tkinter to set the locale.
"""
21.2.2. For extension writers and programs that embed Python
Extension modules should never call setlocale(), except to find out what the current locale is.
"""
http://docs.python.org/dev/library/locale.html#for-extension-writers-and-programs-that-embed-python
|
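An application that wants to shield itself from a library changing the locale could save and restore LC_CTYPE around the offending call. The following is only a sketch of that idea, assuming Python 3; the context manager name is hypothetical, it is not what this issue adopted, and setlocale() affects the whole process, so it is not safe to use while other threads depend on the locale.

```python
import contextlib
import locale


@contextlib.contextmanager
def preserved_ctype_locale():
    """Restore the LC_CTYPE setting when the with-block exits."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        yield
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


with preserved_ctype_locale():
    # Stand-in for code that changes the locale behind the caller's back.
    locale.setlocale(locale.LC_CTYPE, "C")
```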
msg111332 - (view) |
Author: Ronald Oussoren (ronaldoussoren) * |
Date: 2010-07-23 14:28 |
This might be caused by the fix for issue7072 (which is mentioned in the NEWS file).
|
msg111345 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 15:27 |
> This might be caused by the fix for issue7072.
Ronald,
You are absolutely right. Reverting r80178 in the trunk restores the old behavior.
> msg103494:
> Fixed in r80178 (trunk), r80180 (2.6), r80182 (3.2), r80183 (3.1)
I think this can be closed as out of date, but I am giving it back to you to decide whether security implications are important enough to backport to 2.5.
Anthony,
Please open a separate issue for Tkinter if you want it considered. It was rejected once already [msg102328], but even if Tkinter behavior is deemed appropriate, I think it should at least be documented.
|
msg111348 - (view) |
Author: Ronald Oussoren (ronaldoussoren) * |
Date: 2010-07-23 15:45 |
Why do you think this may have security implications?
I'm closing this as out of date because the issue is fixed and the fix is imho inappropriate for a backport to 2.6 due to the change in behaviour.
|
msg111350 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-23 15:53 |
Accepting binary input where only letters are expected by an application is a very common source of security holes. An application that relies on s.isalpha() to guarantee that s does not contain non-ASCII characters when a UTF-8 locale is in use may have a security hole if it is run with Python 2.5.
|
msg111447 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2010-07-24 10:33 |
If an application uses .isalpha for a security-relevant check, this is a security issue in the application, not in Python.
|
msg111448 - (view) |
Author: Anthony Long (antlong) |
Date: 2010-07-24 10:35 |
I disagree. It's expected that the function will return valid data. This doesn't return valid data so isalpha() is compromised.
|
msg111449 - (view) |
Author: Ronald Oussoren (ronaldoussoren) * |
Date: 2010-07-24 10:41 |
I agree with Martin that the security problem would be in the application, not python itself.
Testing with isalpha is generally not the right thing to do anyway; it is much better to restrict input to a known-good set of data, such as by using regular expressions. For multi-byte encodings like UTF-8 you cannot rely on per-byte calls to isalpha anyway. The situation is even worse for an encoding like Shift-JIS, where you need context to know if a byte is part of a multi-byte value.
|
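Restricting input to a known-good set with a regular expression might look like the sketch below. It assumes Python 3 and assumes the accepted set is exactly the 52 ASCII letters; the function name is illustrative.

```python
import re

# An explicit character class does not depend on the locale.
_ASCII_LETTERS = re.compile(r"[A-Za-z]+")


def is_ascii_letters(text: str) -> bool:
    """Return True if `text` is non-empty and contains only A-Z or a-z."""
    return _ASCII_LETTERS.fullmatch(text) is not None


print(is_ascii_letters("Python"))   # True
print(is_ascii_letters("ab\xff"))   # False
```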
msg111457 - (view) |
Author: Martin v. Löwis (loewis) * |
Date: 2010-07-24 11:37 |
> I disagree. It's expected that the function will return valid data. This doesn't return valid data so isalpha() is compromised.
What is "valid data"? The function (isalpha) should return a boolean,
and it does. So the result is certainly "valid".
The documentation says "For 8-bit strings, this method is
locale-dependent." So it is correct if it returns what the OS vendor says
to return.
|
msg111467 - (view) |
Author: Anthony Long (antlong) |
Date: 2010-07-24 12:23 |
The locale is set incorrectly though, so it is not valid data. Valid data is a-Z, nothing more, nothing less, and the locale and the alphabet should not be changed.
|
msg111574 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-07-25 23:27 |
> Victor, This looks like your cup of tea.
Unicode is my cup of tea, but not programs considering that bytes are characters.
<a byte string>.isalpha() doesn't mean anything to me :-)
This issue is more a question about the C library than about Python :-) So try the attached program "isalpha.c" if you would like to test your libc.
Results on my Linux box (Debian Sid, eglibc 2.11.2):
----------------
$ ./isalpha C
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz (52)
$ ./isalpha fr_FR.UTF-8
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz (52)
$ ./isalpha fr_FR.iso88591
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xaa\xb5\xba\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff (117)
$ ./isalpha fr_FR.iso885915@euro
ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz\xa6\xa8\xaa\xb4\xb5\xb8\xba\xbc\xbd\xbe\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff (124)
----------------
If your libc considers that \xff is a valid UTF-8 character, you should change your OS for a better one :-)
--
> >>> len(letters)
> 117
> ...
> >>> locale.setlocale(locale.LC_CTYPE)
> 'en_US.UTF-8'
It looks like Mac OS X uses ISO-8859-1 instead of UTF-8.
--
string.letters is built using strop.lowercase + strop.uppercase, which are built using the C functions islower() and isupper(). locale.setlocale() regenerates strop/string.lowercase, strop/string.uppercase and string.letters for the LC_CTYPE and LC_ALL categories.
--
You don't need to run IDLE or import Tkinter to set the locale:
import locale; locale.setlocale(locale.LC_ALL, '')
is enough.
--
A library should not change the locale (only the application).
$ python2.6
>>> import locale
>>> locale.getlocale()
(None, None)
>>> import Tkinter
>>> locale.getlocale()
('fr_FR', 'UTF8')
=> Tkinter is a horrible library! (The bug is in the C library, not in the Python wrapper)
Use a better one like Gtk or Qt ;-)
$ python
>>> import locale
>>> import pygtk
>>> locale.getlocale()
(None, None)
>>> import PyQt4
>>> locale.getlocale()
(None, None)
(IDLE is based on Tkinter)
--
I don't understand why Alexander gets different results on Python 2.6 and Python 2.7.
@belopolsky: Are both programs linked to (built with?) the same C library? (same library version)
|
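The attached isalpha.c is not reproduced in this thread. A rough Python 3 counterpart, under the assumption of a POSIX system where ctypes.CDLL(None) exposes the C library's isalpha(), could enumerate the byte values that libc treats as letters under a given locale. The function name is hypothetical, and results for non-C locales depend on the platform's libc and on which locales are installed.

```python
import ctypes
import locale


def libc_alpha_bytes(locale_name):
    """Return the byte values libc's isalpha() accepts under `locale_name`."""
    # Assumption: POSIX. Loading the running process gives access to libc.
    libc = ctypes.CDLL(None)
    saved = locale.setlocale(locale.LC_CTYPE)  # remember the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
        return bytes(c for c in range(256) if libc.isalpha(c))
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)


letters = libc_alpha_bytes("C")
print(letters, "(%d)" % len(letters))
```

Passing a name such as "fr_FR.iso88591" instead of "C" raises locale.Error if that locale is not installed.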
msg111587 - (view) |
Author: Alexander Belopolsky (belopolsky) * |
Date: 2010-07-26 00:20 |
On Sun, Jul 25, 2010 at 7:27 PM, STINNER Victor <report@bugs.python.org> wrote:
..
> Unicode is my cup of tea, but not programs considering that bytes are characters.
>
What I called "your cup of tea" was 3.x returning 'UTF-8' from
locale.nl_langinfo(locale.CODESET) where 2.x returned 'US-ASCII'. (In
both cases this was the first call to locale module functions.)
> I don't understand why Alexander gets different results on Python 2.6 and Python 2.7.
>
It looks like you have missed most of the discussion under this issue.
Sorry that you had to reinvestigate. Ronald explained the difference
in msg111332. He introduced a workaround for the broken OS X C library
isalpha in r80178.
|
msg111588 - (view) |
Author: STINNER Victor (vstinner) * |
Date: 2010-07-26 00:26 |
Oops, the issue is already closed /o\
|
|
Date | User | Action | Args |
2022-04-11 14:57:04 | admin | set | github: 53581 |
2010-07-26 00:26:23 | vstinner | set | messages: + msg111588 |
2010-07-26 00:20:55 | belopolsky | set | messages: + msg111587 |
2010-07-25 23:27:20 | vstinner | set | files: + isalpha.c; messages: + msg111574 |
2010-07-24 12:23:24 | antlong | set | messages: + msg111467 |
2010-07-24 11:37:47 | loewis | set | messages: + msg111457 |
2010-07-24 10:41:11 | ronaldoussoren | set | messages: + msg111449 |
2010-07-24 10:35:29 | antlong | set | messages: + msg111448 |
2010-07-24 10:33:54 | loewis | set | messages: + msg111447 |
2010-07-23 15:53:12 | belopolsky | set | messages: + msg111350 |
2010-07-23 15:45:08 | ronaldoussoren | set | status: pending -> closed; messages: + msg111348 |
2010-07-23 15:27:01 | belopolsky | set | status: open -> pending; assignee: belopolsky -> ronaldoussoren; components: + macOS; versions: - Python 2.6; nosy: loewis, ronaldoussoren, mark.dickinson, belopolsky, vstinner, eric.smith, jkloth, eric.araujo, antlong; messages: + msg111345; resolution: out of date; stage: resolved |
2010-07-23 14:28:00 | ronaldoussoren | set | messages: + msg111332 |
2010-07-23 14:27:09 | eric.araujo | set | nosy: + eric.araujo |
2010-07-23 14:20:21 | belopolsky | set | nosy: + eric.smith; messages: + msg111329; components: + Interpreter Core, - macOS; type: behavior |
2010-07-23 14:13:26 | belopolsky | set | files: + issue9335-test.py; messages: + msg111326 |
2010-07-23 08:03:38 | loewis | set | nosy: + loewis; messages: + msg111251 |
2010-07-23 04:50:34 | belopolsky | set | nosy: + vstinner; messages: + msg111249 |
2010-07-23 04:40:35 | antlong | set | messages: + msg111248 |
2010-07-23 04:26:50 | antlong | set | messages: + msg111247 |
2010-07-23 04:20:29 | belopolsky | set | assignee: ronaldoussoren -> belopolsky |
2010-07-23 04:20:10 | belopolsky | set | nosy: + mark.dickinson |
2010-07-23 04:18:52 | belopolsky | set | messages: + msg111246 |
2010-07-23 04:17:47 | antlong | set | messages: + msg111244 |
2010-07-23 04:02:35 | antlong | set | messages: + msg111243 |
2010-07-23 04:02:09 | antlong | set | messages: + msg111242 |
2010-07-23 03:47:01 | belopolsky | set | nosy: ronaldoussoren, belopolsky, jkloth, antlong; messages: + msg111240; components: + Tkinter |
2010-07-23 03:42:15 | belopolsky | set | messages: + msg111239 |
2010-07-23 03:23:01 | belopolsky | set | messages: + msg111237 |
2010-07-23 03:20:25 | jkloth | set | nosy: + jkloth; messages: + msg111236 |
2010-07-23 03:16:12 | antlong | set | nosy: ronaldoussoren, belopolsky, antlong; type: behavior -> (no value); messages: + msg111235; components: - IDLE |
2010-07-23 02:59:44 | belopolsky | set | nosy: + belopolsky; messages: + msg111234; components: + IDLE; type: behavior |
2010-07-23 02:38:29 | antlong | create | |