Issue 6561: Regex '\d' should not match unicode category 'No'.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/50810

classification

Title:	Regex '\d' should not match unicode category 'No'.
Type:	behavior	Stage:	resolved
Components:	Extension Modules	Versions:	Python 2.7

process

Status:	closed	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	eric.smith, ezio.melotti, lemburg, mark.dickinson, pitrou, r.david.murray
Priority:	normal	Keywords:	needs review, patch

Created on 2009-07-24 10:48 by mark.dickinson, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
issue6561.patch	mark.dickinson, 2009-07-24 16:36

Messages (8)
msg90878 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-07-24 10:47
In Python 3, or in Python 2 with the re.UNICODE flag, it appears that the regex r'\d' matches all unicode characters with category either 'Nd' (Number, Decimal Digit) or 'No' (Number, Other), but not characters in category 'Nl' (Number, Letter):

Python 3.2a0 (py3k:74188, Jul 23 2009, 16:01:29)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> import unicodedata
>>> x = '\u2781'
>>> unicodedata.category(x)
'No'
>>> unicodedata.name(x)
'DINGBAT CIRCLED SANS-SERIF DIGIT TWO'
>>> re.match(r'\d', '\u2781')
<_sre.SRE_Match object at 0x3d5d08>

I believe (but am not 100% sure) that r'\d' should only match characters in category 'Nd'. To back up this belief:

(1) int and float currently accept characters in category 'Nd' but not 'No'; it would seem useful for '\d' to match those characters that are accepted by int, so that e.g., something matched with '\d+' could be directly passed to int. (This came up in a #python-dev discussion about whether the Decimal type should accept other unicode digits; that's a separate issue, though.)

(2) In Perl 5.10 (and possibly some earlier versions too), '\d' matches only characters in category 'Nd'.

(3) Unicode Technical Standard #18 ("Unicode Regular Expressions") at http://unicode.org/unicode/reports/tr18/ recommends that '\d' should correspond to \p{gc=Decimal_Number}.

Marc-André, do you have any opinion on this?

It's probably slightly dangerous to change this in 2.6 or 3.1; I'm proposing that '\d' should be modified to accept only characters of category 'Nd' in 2.7 and 3.2.

(Thanks Ezio Melotti for finding all the references above and doing Perl testing!)
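As a rough illustration of point (1) in the report, the sketch below checks how r'\d' and int() treat one sample character from each of the three numeric categories. It assumes a current Python 3 interpreter, where the change proposed in this issue is in effect. Only U+2781 comes from the report; the other sample characters and the helper name `describe` are illustrative choices.

```python
import re
import unicodedata

# One sample character per numeric general category. Only U+2781 is taken
# from the report; the other two are illustrative choices.
SAMPLES = {
    "Nd": "\u0661",  # ARABIC-INDIC DIGIT ONE
    "No": "\u2781",  # DINGBAT CIRCLED SANS-SERIF DIGIT TWO
    "Nl": "\u2160",  # ROMAN NUMERAL ONE
}


def describe(ch):
    """Return (category, matched by the digit class, accepted by int)."""
    matched = re.match(r"\d", ch) is not None
    try:
        int(ch)
        accepted = True
    except ValueError:
        accepted = False
    return unicodedata.category(ch), matched, accepted


for expected_category, ch in SAMPLES.items():
    print(expected_category, describe(ch))

# A string matched by r'\d+' can be handed straight to int().
text = "\u0661\u0662\u0663"  # Arabic-Indic digits one, two, three
if re.fullmatch(r"\d+", text):
    print(int(text))
```

On an interpreter with the fix, only the 'Nd' sample should both match r'\d' and be accepted by int; the 'No' and 'Nl' samples should do neither.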
msg90885 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-07-24 14:51
Patch against py3k.
msg90888 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-07-24 16:36
New patch; same as before, but includes clarification to the documentation.
msg90927 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2009-07-25 17:23
This sounds reasonable to me.
msg90929 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2009-07-25 18:01
This seems to me quite redundant:

+   Matches any Unicode decimal digit; more specifically, matches
+   any character in Unicode category [Nd] (Number, Decimal Digit).
+   This includes ``[0-9]``, and also many other digit characters.

I suggest something like:

Matches the decimal digits ``[0-9]`` and all the characters that belong to the Unicode category Nd (Number, Decimal Digit).

Two more minor details: instead of '\d', I'd use '^\d$' and instead of
self.assertEqual(re.match('\d', x), None)
self.assertIsNone(re.match('\d', x)).
msg90971 - (view)	Author: R. David Murray (r.david.murray) *	Date: 2009-07-27 02:23
It may be redundant, but it is also more technically accurate. I'm -0 on your proposed rephrasing, and trust Mark to make the right decision :)
msg91012 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-07-28 17:23
[ezio.melotti]
> I suggest something like:
> Matches the decimal digits ``[0-9]`` and all the characters that belong
> to the Unicode category Nd (Number, Decimal Digit).

Hmm. I don't like this because it suggests (to me) that the characters [0-9] don't belong to category [Nd]. I agree the previous version was clunky, though. I've shortened it some; if anyone else wants to work on the wording please feel free. It might be nice to annotate each of these character classes (\w, \s) with the Unicode character categories that they correspond to.

> Two more minor details: instead of '\d', I'd use '^\d$' and instead of
> self.assertEqual(re.match('\d', x), None)
> self.assertIsNone(re.match('\d', x)).

Thanks. Changes applied.

Committed to py3k, r74237. Leaving open for backport to trunk.
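The two review points discussed above (anchoring the pattern as '^\d$' and using assertIsNone) can be sketched as a small test case. This is a hypothetical example, not the test committed in r74237; the class name, method names, the U+2160 sample, and the `run_digit_class_tests` helper are assumptions made for illustration. It also checks the observation that the ASCII digits themselves belong to category Nd.

```python
import re
import unicodedata
import unittest


class DigitClassTest(unittest.TestCase):
    """Hypothetical test case; not the test committed in r74237."""

    def test_ascii_digits_are_nd(self):
        # The ASCII digits are themselves members of category Nd.
        for ch in "0123456789":
            self.assertEqual(unicodedata.category(ch), "Nd")
            self.assertIsNotNone(re.match(r"^\d$", ch))

    def test_no_and_nl_do_not_match(self):
        # U+2781 is in category No; U+2160 (ROMAN NUMERAL ONE) is in Nl.
        for ch in ("\u2781", "\u2160"):
            self.assertIsNone(re.match(r"^\d$", ch))


def run_digit_class_tests():
    """Run the test case above and return True if every test passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DigitClassTest)
    result = unittest.TextTestRunner(verbosity=0).run(suite)
    return result.wasSuccessful()
```

Calling run_digit_class_tests() runs both tests. The patterns are written as raw strings so that the backslash in '\d' reaches the regex engine unchanged.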
msg91018 - (view)	Author: Mark Dickinson (mark.dickinson) *	Date: 2009-07-28 21:24
Backported to trunk in r74240.

History
Date	User	Action	Args
2022-04-11 14:56:51	admin	set	github: 50810
2009-07-28 21:24:48	mark.dickinson	set	status: open -> closed messages: + msg91018
2009-07-28 17:23:36	mark.dickinson	set	stage: patch review -> resolved messages: + msg91012 versions: - Python 3.2
2009-07-27 02:23:07	r.david.murray	set	nosy: + r.david.murray messages: + msg90971
2009-07-25 18:01:50	ezio.melotti	set	priority: normal keywords: + needs review messages: + msg90929 stage: test needed -> patch review
2009-07-25 17:23:37	pitrou	set	nosy: + pitrou messages: + msg90927
2009-07-24 16:36:43	mark.dickinson	set	files: - issue6561.patch
2009-07-24 16:36:30	mark.dickinson	set	files: + issue6561.patch messages: + msg90888
2009-07-24 14:51:50	mark.dickinson	set	files: + issue6561.patch keywords: + patch messages: + msg90885
2009-07-24 11:58:04	eric.smith	set	nosy: + eric.smith
2009-07-24 10:48:00	mark.dickinson	create