Author mark.dickinson
Recipients ezio.melotti, lemburg, mark.dickinson
Date 2009-07-24.10:47:57
Message-id <1248432481.83.0.138486397936.issue6561@psf.upfronthosting.co.za>
Content
In Python 3, or in Python 2 with the re.UNICODE flag, the regex r'\d' 
appears to match every Unicode character in category 'Nd' (Number, 
Decimal Digit) or 'No' (Number, Other), but not characters in category 
'Nl' (Number, Letter):

Python 3.2a0 (py3k:74188, Jul 23 2009, 16:01:29) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata
>>> x = '\u2781'
>>> unicodedata.category(x)
'No'
>>> unicodedata.name(x)
'DINGBAT CIRCLED SANS-SERIF DIGIT TWO'
>>> re.match(r'\d', '\u2781')
<_sre.SRE_Match object at 0x3d5d08>

I believe (but am not 100% sure) that r'\d' should match only characters 
in category 'Nd'.  To back up this belief:

(1) int and float currently accept characters in category 'Nd' but not 
'No'; it would be useful for r'\d' to match exactly those characters that 
int accepts, so that, e.g., anything matched by r'\d+' could be passed 
directly to int.  (This came up in a #python-dev discussion about 
whether the Decimal type should accept other Unicode digits; that's a 
separate issue, though.)

(2) In Perl 5.10 (and possibly some earlier versions too), '\d' matches 
only characters in category 'Nd'.

(3) Unicode Technical Standard #18 ("Unicode Regular Expressions") at 
http://unicode.org/unicode/reports/tr18/ recommends that '\d' should 
correspond to \p{gc=Decimal_Number}.
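
To make point (1) concrete, here is a minimal sketch that compares each 
character's Unicode category with whether int accepts it.  The sample 
characters other than U+2781 and the helper name int_accepts are 
illustrative choices, not part of the original report.

```python
import unicodedata

# Illustrative samples; only U+2781 comes from the report above.
samples = {
    "ASCII digit": "\u0032",         # DIGIT TWO, category Nd
    "Arabic-Indic digit": "\u0662",  # ARABIC-INDIC DIGIT TWO, category Nd
    "Dingbat digit": "\u2781",       # DINGBAT CIRCLED SANS-SERIF DIGIT TWO, category No
    "Roman numeral": "\u2161",       # ROMAN NUMERAL TWO, category Nl
}


def int_accepts(ch):
    """Return True if int() parses the single character ch (hypothetical helper)."""
    try:
        int(ch)
    except ValueError:
        return False
    return True


# Print label, general category, and whether int() accepts the character.
for label, ch in samples.items():
    print(label, unicodedata.category(ch), int_accepts(ch))
```

If r'\d' were restricted to category 'Nd', the characters it matches 
would line up with the rows for which int_accepts returns True.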

Marc-André, do you have any opinion on this?

Changing this in 2.6 or 3.1 is probably slightly dangerous, so I'm 
proposing that r'\d' be modified to match only characters of category 
'Nd' in 2.7 and 3.2.
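
For code that needs int-safe digit runs regardless of what r'\d' matches 
in a given release, one possible approach is to filter on the Unicode 
category directly.  This is only a sketch; the function name nd_runs and 
the sample string are made up for illustration.

```python
import unicodedata
from itertools import groupby


def nd_runs(text):
    """Return the maximal runs of category-'Nd' characters in text."""
    def is_nd(ch):
        return unicodedata.category(ch) == "Nd"

    # groupby splits text into alternating Nd / non-Nd runs; keep the Nd ones.
    return ["".join(group) for keep, group in groupby(text, key=is_nd) if keep]


# U+2781 (category No) splits the digits; U+0663 and U+0664 are Arabic-Indic
# digits three and four (category Nd).
runs = nd_runs("a12\u2781\u0663\u0664b")
values = [int(run) for run in runs]
```

Because every character in each run is in category 'Nd', each run can be 
passed to int without raising ValueError.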

(Thanks to Ezio Melotti for finding all the references above and doing 
the Perl testing!)
History
Date                 User            Action  Args
2009-07-24 10:48:02  mark.dickinson  set     recipients: + mark.dickinson, lemburg, ezio.melotti
2009-07-24 10:48:01  mark.dickinson  set     messageid: <1248432481.83.0.138486397936.issue6561@psf.upfronthosting.co.za>
2009-07-24 10:48:00  mark.dickinson  link    issue6561 messages
2009-07-24 10:47:58  mark.dickinson  create