classification
Title: isalpha bug
Type: behavior
Stage:
Components: Unicode
Versions: Python 2.5
process
Status: closed
Resolution: works for me
Dependencies:
Superseder:
Assigned To:
Nosy List: ZooKeeper, lemburg, vstinner
Priority: normal
Keywords:

Created on 2008-11-13 14:39 by ZooKeeper, last changed 2008-11-13 15:57 by vstinner. This issue is now closed.

Messages (7)
msg75820 - Author: ZooKeeper (ZooKeeper) Date: 2008-11-13 14:39
This may be a little tricky to recreate but here it is:

q = u'абвгде'
q.isalpha()
True

foo = u'ч'
foo.isalpha()
False

So the Russian characters u'ч' and u'ё', as well as a bunch of others, are
not recognized by isalpha as alphabetic characters, which they in fact
are.
This applies to both the capital and lowercase versions of the letters.

http://en.wikipedia.org/wiki/%D0%81
http://en.wikipedia.org/wiki/Che_(Cyrillic)

Using: Python 2.5 (r25:51908, Sep 19 2006, 09:52:17) [MSC v.1310 32 bit (Intel)] on win32
msg75821 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-11-13 14:46
Are you sure that you've used the right source code encoding for writing
these characters?

Note that the Unicode .isalpha() method relies entirely on what the
Unicode database provides as code point information. If a character is
marked as not being alphabetic (i.e. it is not in one of the categories
'Ll', 'Lu', 'Lt', 'Lo' or 'Lm'), it will return False.
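
The category check described above can be sketched with the standard unicodedata module. This is only an illustration in Python 3 syntax, not the interpreter's actual implementation; the names LETTER_CATEGORIES and is_letter_by_category are made up for this example.

```python
import unicodedata

# General categories treated as letters, as listed in the message above.
LETTER_CATEGORIES = {"Ll", "Lu", "Lt", "Lo", "Lm"}


def is_letter_by_category(ch):
    """Hypothetical helper: True if ch's general category is a letter category."""
    return unicodedata.category(ch) in LETTER_CATEGORIES


# Compare the category-based check with str.isalpha() for a few code points:
# U+0447 (che), U+0451 (io) and U+0482 (Cyrillic thousands sign).
for ch in "\u0447\u0451\u0482":
    print("U+%04X" % ord(ch), unicodedata.category(ch),
          is_letter_by_category(ch), ch.isalpha())
```

If the two columns of booleans ever disagreed for a character, that would point at the Unicode database rather than at the string method itself.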
msg75822 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-11-13 14:48
FWIW: I get the following in Python 2.5:

>>> print u'\u0401'
Ё
>>> print u'\u0451'
ё
>>> print u'\u0401'.isalpha()
True
>>> print u'\u0451'.isalpha()
True
msg75823 - Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008-11-13 14:49
... and for the other character:

>>> print u'\u0427'
Ч
>>> print u'\u0447'
ч
>>> print u'\u0427'.isalpha()
True
>>> print u'\u0447'.isalpha()
True

Looks fine.
msg75824 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-13 14:52
Results on Linux:

With Python 2.7 trunk:
>>> print(', '.join('%s:%s' % (c, c.isalpha()) for c in u'абвгдеч'))
а:True, б:True, в:True, г:True, д:True, е:True, ч:True

With Python 2.5.1:
>>> print(', '.join('%s:%s' % (c, c.isalpha()) for c in u'абвгдеч'))
а:True, б:True, в:True, г:True, д:True, е:True, ч:True

With Python 3.0 trunk:
>>> print(', '.join('%s:%s' % (c, c.isalpha()) for c in 'абвгдеч'))
а:True, б:True, в:True, г:True, д:True, е:True, ч:True

Are you sure that you really typed the character "ч"? Can you retry 
using unichr(0x447).isalpha()?

Test with Python3:
>>> print(' - '.join((r"\u%04x" % x) for x in range(0x400, 0x4ff+1) if not chr(x).isalpha()))
\u0482 - \u0483 - \u0484 - \u0485 - \u0486 - \u0487 - \u0488 - \u0489

This means that Python considers every Unicode character in the range
U+0400..U+04ff to be a letter, except those in the range U+0482..U+0489
(thousands sign ҂ to millions sign ҉).
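
To see why those code points are excluded, the same scan can print each exception's general category and name. This is a rough Python 3 sketch; the variable name non_letters is invented here, and the exact result depends on the Unicode database version bundled with the interpreter.

```python
import unicodedata

# Collect the code points in the Cyrillic block that str.isalpha() rejects.
non_letters = [cp for cp in range(0x400, 0x4ff + 1) if not chr(cp).isalpha()]

# Show the general category and character name for each exception.
for cp in non_letters:
    ch = chr(cp)
    print("U+%04X" % cp, unicodedata.category(ch),
          unicodedata.name(ch, "<unnamed>"))
```

None of the printed categories should start with 'L', which is consistent with isalpha() returning False for them.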
msg75826 - Author: ZooKeeper (ZooKeeper) Date: 2008-11-13 15:55
I'll investigate this further shortly, but for now I am replicating the test:
print u'\u0451'
¸
print u'\u0427'
×

Something must be going on here. Running Win XP.
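
One possible explanation, offered as an assumption rather than a confirmed diagnosis: somewhere between the keyboard, the interpreter and the console, the characters are handled as Windows-1251 (cp1251) bytes on one side and as Latin-1 on the other. Under that assumption ё turns into ¸ and Ч into ×, as in the output above, and ч turns into ÷, which is not a letter, while а through е land on accented Latin letters, which are. A small Python 3 sketch (the helper name misdecode is invented for illustration):

```python
def misdecode(text, actual="cp1251", assumed="latin-1"):
    """Hypothetical helper: encode text with one codec, decode with another."""
    return text.encode(actual).decode(assumed)


# Show what each Cyrillic letter looks like after the assumed mix-up,
# and whether the resulting character still counts as alphabetic.
for ch in "абвгдечёЧЁ":
    seen = misdecode(ch)
    print("%s -> %s  isalpha=%s" % (ch, seen, seen.isalpha()))
```

If this is what happens, testing with explicit escapes such as u'\u0447'.isalpha() sidesteps the problem, because no byte-level decoding of the typed character is involved.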
msg75827 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2008-11-13 15:57
$ python
Python 2.5.1 (r251:54863, Jul 31 2008, 23:17:40)
>>> print u'\u0451'
ё
>>> print u'\u0427'
Ч

@ZooKeeper: Try Python 2.6; I guess that your bug is already fixed there.
History
Date                 User       Action  Args
2008-11-13 15:57:45  vstinner   set     messages: + msg75827
2008-11-13 15:55:27  ZooKeeper  set     messages: + msg75826
2008-11-13 14:52:32  vstinner   set     nosy: + vstinner; messages: + msg75824
2008-11-13 14:49:53  lemburg    set     status: open -> closed; resolution: works for me; messages: + msg75823
2008-11-13 14:48:18  lemburg    set     messages: + msg75822
2008-11-13 14:46:08  lemburg    set     nosy: + lemburg; messages: + msg75821; components: - Extension Modules
2008-11-13 14:39:37  ZooKeeper  create