classification
Title: Regex '\d' should not match unicode category 'No'.
Type: behavior
Stage: resolved
Components: Extension Modules
Versions: Python 2.7
process
Status: closed
Resolution:
Dependencies:
Superseder:
Assigned To:
Nosy List: eric.smith, ezio.melotti, lemburg, mark.dickinson, pitrou, r.david.murray
Priority: normal
Keywords: needs review, patch

Created on 2009-07-24 10:48 by mark.dickinson, last changed 2009-07-28 21:24 by mark.dickinson. This issue is now closed.

Files
File name Uploaded Description
issue6561.patch mark.dickinson, 2009-07-24 16:36
Messages (8)
msg90878 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-07-24 10:47
In Python 3, or in Python 2 with the re.UNICODE flag, it appears that 
the regex r'\d' matches all unicode characters with category either 'Nd' 
(Number, Decimal Digit) or 'No' (Number, Other), but not characters in 
category 'Nl' (Number, Letter):

Python 3.2a0 (py3k:74188, Jul 23 2009, 16:01:29) 
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata
>>> x = '\u2781'
>>> unicodedata.category(x)
'No'
>>> unicodedata.name(x)
'DINGBAT CIRCLED SANS-SERIF DIGIT TWO'
>>> re.match(r'\d', '\u2781')
<_sre.SRE_Match object at 0x3d5d08>

I believe (but am not 100% sure) that r'\d' should only match characters 
in category 'Nd'.  To back up this belief:

(1) int and float currently accept characters in category 'Nd' but not 
'No'; it would seem useful for '\d' to match those characters that are 
accepted by int, so that e.g., something matched with '\d+' could be 
directly passed to int.  (This came up in a #python-dev discussion
about whether the Decimal type should accept other unicode digits;  
that's a separate issue, though.)

(2) In Perl 5.10 (and possibly some earlier versions too), '\d' matches 
only characters in category 'Nd'.

(3) Unicode Technical Standard #18 ("Unicode Regular Expressions") at 
http://unicode.org/unicode/reports/tr18/ recommends that '\d' should 
correspond to \p{gc=Decimal_Number}.

Marc-André, do you have any opinion on this?

It's probably slightly dangerous to change this in 2.6 or 3.1;  I'm 
proposing that '\d' should be modified to accept only characters of 
category 'Nd' in 2.7 and 3.2.

(Thanks Ezio Melotti for finding all the references above and doing Perl 
testing!)
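
A minimal sketch of the behavior proposed above, assuming an interpreter that already includes the change (Python 3.6 or later is assumed here because of the f-string and re.fullmatch). The sample characters and the parse_digits helper are illustrative choices and are not part of the patch.

```python
import re
import unicodedata

# Sample characters chosen for illustration: two in category Nd,
# one in No, and one in Nl.
samples = ["7", "\u0667", "\u2781", "\u2167"]

for ch in samples:
    category = unicodedata.category(ch)
    matched = re.fullmatch(r"\d", ch) is not None
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"category={category}, matches \\d: {matched}")


def parse_digits(text):
    """Hypothetical helper: return int(text) if text is all \\d, else None."""
    if re.fullmatch(r"\d+", text):
        # Everything matched by \d+ is in category Nd, which int() accepts.
        return int(text)
    return None
```

With '\d' restricted to category 'Nd', parse_digits('\u0664\u0662') should return 42 (the two characters are ARABIC-INDIC DIGIT FOUR and TWO), while parse_digits('\u2781') returns None instead of raising ValueError inside int().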
msg90885 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-07-24 14:51
Patch against py3k.
msg90888 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-07-24 16:36
New patch;  same as before, but includes clarification to the 
documentation.
msg90927 - Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2009-07-25 17:23
This sounds reasonable to me.
msg90929 - Author: Ezio Melotti (ezio.melotti) * (Python committer) Date: 2009-07-25 18:01
This seems to me quite redundant:
+      Matches any Unicode decimal digit; more specifically, matches
+      any character in Unicode category [Nd] (Number, Decimal Digit).
+      This includes ``[0-9]``, and also many other digit characters.
I suggest something like:
Matches the decimal digits ``[0-9]`` and all the characters that belong
to the Unicode category Nd (Number, Decimal Digit).

Two more minor details: instead of '\d', I'd use '^\d$', and instead of
self.assertEqual(re.match('\d', x), None)
I'd use
self.assertIsNone(re.match('\d', x)).
msg90971 - Author: R. David Murray (r.david.murray) * (Python committer) Date: 2009-07-27 02:23
It may be redundant, but it is also more technically accurate.  I'm -0
on your proposed rephrasing, and trust Mark to make the right decision :)
msg91012 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-07-28 17:23
[ezio.melotti]
> I suggest something like:
> Matches the decimal digits ``[0-9]`` and all the characters that belong
> to the Unicode category Nd (Number, Decimal Digit).

Hmm.  I don't like this because it suggests (to me) that the characters 
[0-9] don't belong to category [Nd].  I agree the previous version was 
clunky, though.  I've shortened it some;  if anyone else wants to work on 
the wording please feel free.  It might be nice to annotate each of these 
character classes (\w, \s) with the Unicode character categories that they 
correspond to.

> Two more minor details: instead of '\d', I'd use '^\d$', and instead of
> self.assertEqual(re.match('\d', x), None)
> I'd use
> self.assertIsNone(re.match('\d', x)).

Thanks.  Changes applied.

Committed to py3k, r74237.   Leaving open for backport to trunk.
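
For reference, a rough sketch of the test style discussed above, using the anchored pattern and assertIsNone. The class name, method names, and sample characters are hypothetical and are not taken from the committed patch.

```python
import re
import unittest


class DigitCategoryTest(unittest.TestCase):
    """Hypothetical test case; not the test from the committed patch."""

    def test_category_no_is_not_matched(self):
        # U+2781 DINGBAT CIRCLED SANS-SERIF DIGIT TWO is in category 'No'.
        x = "\u2781"
        self.assertIsNone(re.match(r"^\d$", x))

    def test_category_nd_is_matched(self):
        # U+0661 ARABIC-INDIC DIGIT ONE is in category 'Nd'.
        x = "\u0661"
        self.assertIsNotNone(re.match(r"^\d$", x))
```

The case can be run with the standard unittest runner, for example by loading it with unittest.defaultTestLoader.loadTestsFromTestCase(DigitCategoryTest).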
msg91018 - Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009-07-28 21:24
Backported to trunk in r74240.
History
Date User Action Args
2009-07-28 21:24:48 mark.dickinson set status: open -> closed; messages: + msg91018
2009-07-28 17:23:36 mark.dickinson set stage: patch review -> resolved; messages: + msg91012; versions: - Python 3.2
2009-07-27 02:23:07 r.david.murray set nosy: + r.david.murray; messages: + msg90971
2009-07-25 18:01:50 ezio.melotti set priority: normal; keywords: + needs review; messages: + msg90929; stage: test needed -> patch review
2009-07-25 17:23:37 pitrou set nosy: + pitrou; messages: + msg90927
2009-07-24 16:36:43 mark.dickinson set files: - issue6561.patch
2009-07-24 16:36:30 mark.dickinson set files: + issue6561.patch; messages: + msg90888
2009-07-24 14:51:50 mark.dickinson set files: + issue6561.patch; keywords: + patch; messages: + msg90885
2009-07-24 11:58:04 eric.smith set nosy: + eric.smith
2009-07-24 10:48:00 mark.dickinson create