Message90878
In Python 3, or in Python 2 with the re.UNICODE flag, it appears that
the regex r'\d' matches all Unicode characters with category either 'Nd'
(Number, Decimal Digit) or 'No' (Number, Other), but not characters in
category 'Nl' (Number, Letter):
Python 3.2a0 (py3k:74188, Jul 23 2009, 16:01:29)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata
>>> x = '\u2781'
>>> unicodedata.category(x)
'No'
>>> unicodedata.name(x)
'DINGBAT CIRCLED SANS-SERIF DIGIT TWO'
>>> re.match(r'\d', '\u2781')
<_sre.SRE_Match object at 0x3d5d08>
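To see the behaviour across the whole code space rather than for a single character, something like the following rough sketch could be used. It tallies, per general category, the code points that r'\d' matches in whatever interpreter runs it. The helper name categories_matched_by_digit_class is made up for illustration, Python 3 is assumed, and the output depends on the Python version and its bundled Unicode database, so no expected output is given.

```python
import re
import sys
import unicodedata
from collections import Counter


def categories_matched_by_digit_class():
    # Hypothetical helper (not part of any library): count, per Unicode
    # general category, the code points that the digit class matches.
    digit = re.compile(r'\d')
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if digit.match(ch):
            counts[unicodedata.category(ch)] += 1
    return counts


counts = categories_matched_by_digit_class()
for category, n in sorted(counts.items()):
    print(category, n)
```

If 'No' shows up in the output, the interpreter has the behaviour described above; if only 'Nd' is listed, it does not.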
I believe (but am not 100% sure) that r'\d' should only match characters
in category 'Nd'. To back up this belief:
(1) int and float currently accept characters in category 'Nd' but not
'No'; it would seem useful for '\d' to match those characters that are
accepted by int, so that e.g., something matched with '\d+' could be
directly passed to int. (This came up in a #python-dev discussion
about whether the Decimal type should accept other unicode digits;
that's a separate issue, though.)
(2) In Perl 5.10 (and possibly some earlier versions too), '\d' matches
only characters in category 'Nd'.
(3) Unicode Technical Standard #18 ("Unicode Regular Expressions") at
http://unicode.org/unicode/reports/tr18/ recommends that '\d' should
correspond to \p{gc=Decimal_Number}.
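Point (1) can be checked with a small sketch like the one below, which feeds one sample character from each of the three number categories to int. The helper try_int and the choice of sample characters are only for illustration; Python 3 is assumed.

```python
import unicodedata


def try_int(ch):
    # Illustrative helper: return int(ch), or None if int() rejects it.
    try:
        return int(ch)
    except ValueError:
        return None


# One ASCII digit, one non-ASCII 'Nd' digit, one 'No' and one 'Nl' character.
samples = ['7', '\u0662', '\u2781', '\u2161']
for ch in samples:
    print('U+%04X' % ord(ch), unicodedata.category(ch),
          unicodedata.name(ch), try_int(ch))
```

The two 'Nd' characters convert, while the 'No' and 'Nl' characters are rejected, so a string matched by '\d+' is only guaranteed to be safe to pass to int if '\d' is restricted to 'Nd'.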
Marc-André, do you have any opinion on this?
It's probably slightly dangerous to change this in 2.6 or 3.1; I'm
proposing that '\d' should be modified to accept only characters of
category 'Nd' in 2.7 and 3.2.
(Thanks Ezio Melotti for finding all the references above and doing Perl
testing!)
Date | User | Action | Args
2009-07-24 10:48:02 | mark.dickinson | set | recipients: + mark.dickinson, lemburg, ezio.melotti
2009-07-24 10:48:01 | mark.dickinson | set | messageid: <1248432481.83.0.138486397936.issue6561@psf.upfronthosting.co.za>
2009-07-24 10:48:00 | mark.dickinson | link | issue6561 messages
2009-07-24 10:47:58 | mark.dickinson | create |