Author terry.reedy
Recipients ezio.melotti, mrabarnett, reuven, rhettinger, serhiy.storchaka, terry.reedy
Date 2020-11-28.11:33:01
Message-id <1606563181.61.0.0522674819635.issue42469@roundup.psfhosted.org>
In-reply-to
Content
Pending further information, I believe that expecting '{1, 4}' to be interpreted as an re quantifier involves two mental errors: a) thinking that the domain specific re language allows optional whitespace and b) thinking that '{' is a special character only used for quantifier patterns.  I propose changes directed at both errors.

In Python code, a space ' ' is always special (indent, required separator, optional separator) except in comments and strings.  In the latter, a space is ordinary unless whatever reads the string, human or program, treats it otherwise.  Functions that read a string as Python code make spaces special again.  AFAIK, re functions always see spaces as ordinary unless the re.VERBOSE compile flag is passed, and even then spaces are only sometimes ignored.
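
As a small illustration of the point about spaces (the sample strings are made up for this sketch), the same pattern is read differently with and without re.VERBOSE:

```python
import re

# Without re.VERBOSE, the space in the pattern is an ordinary character
# and must match a literal space in the text.
plain = re.findall('a b', 'a b ab')

# With re.VERBOSE (re.X), that same space is ignored, so the pattern
# behaves like 'ab'.
verbose = re.findall('a b', 'a b ab', re.X)

print(plain)    # ['a b']
print(verbose)  # ['ab']
```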

Suggestion 1. Because this is contrary to human habit, put a special disclaimer at the end of the 2-sentence paragraph beginning "Some characters, like '|' or '(', are special."  Add something like "Space ' ' is always ordinary, like 'c' is.  Do not put one where a 'c' could not go."

"The special characters are:" is misleading because the special bracketed quantifier patterns that follow are composed of ordinary characters. (See below.)

Suggestion 2.  Add 'and patterns' after 'characters'.  Or put the quantifier patterns after a separate header.

'[' is special in that it always begins a set pattern.  ']' is always special when preceded by '['.  It is an error if the set is not closed with ']'. In particular, compile('[') raises.
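
A minimal check of the '[' behavior described above; the variable name is only for illustration:

```python
import re

# An unclosed set is an error at compile time.
try:
    re.compile('[')
    unclosed_set_error = None
except re.error as exc:
    unclosed_set_error = exc

print(unclosed_set_error is not None)  # True

# A ']' with no preceding '[' is ordinary and matches itself.
lone_bracket = re.findall(']', 'a]b')
print(lone_bracket)  # [']']
```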

'{' and '}' are different.  They are not special by themselves any more than digits and ',' are, whereas within quantifier patterns, all characters, not just '{' and '}', have special interpretations. In particular, compile('{') succeeds and the result matches a literal '{', instead of raising as compile('[') does.
>>> re.findall('{', 'a{b')
['{']

Only a complete quantifier pattern is special.  As near as I can tell, re.compile acts at least as if it tries to match '{\d*(,\d*)?}' (with re.ASCII flag to only allow ascii digits?) when it encounters '{'.  If this fails, it continues with '{' treated as ordinary.  So '{1, 4}', '{1,c4}', '{ 1,4}', '{1,4x}', and '{0x33}' are all compiled as sequences of 6 ordinary characters that match themselves.
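
The claim about the five near-miss patterns can be checked directly.  This is only a sketch of the observable behavior, not of how the parser is implemented; the names near_misses, self_matches, and repeats_d are made up here:

```python
import re

near_misses = ['{1, 4}', '{1,c4}', '{ 1,4}', '{1,4x}', '{0x33}']

# Used as a pattern, each near miss matches its own text character for
# character, i.e. it is a sequence of ordinary characters.
self_matches = {p: re.fullmatch(p, p) is not None for p in near_misses}

# Appended to 'd', none of them acts as a quantifier on the 'd'.
repeats_d = {p: re.fullmatch('d' + p, 'dddd') is not None for p in near_misses}

# A complete quantifier, by contrast, does repeat the 'd'.
real_quantifier = re.fullmatch('d{1,4}', 'dddd') is not None

print(all(self_matches.values()))  # True
print(any(repeats_d.values()))     # False
print(real_quantifier)             # True
```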

Suggestion 3.  Somewhere say that '{' and '}' are ordinary unless they enclose a complete quantifier: nothing but digits[,digits] between the braces, with nothing else added, including ' '.
---

Turning '{' into a special character by making it an error when it does not begin a quantifier would break all intentional uses of '{' as ordinary.  I don't see making '{' sort-of special, by sometimes raising and sometimes not under a new special rule, to be much better.  In particular, I don't like having only extra ' 's or other whitespace raise.  AFAIK, re is unforgiving of any extra chars, not just spaces.

The re.VERBOSE flag causes whitespace to sometimes be ignored.  The list of situations in which it is not ignored does not, to me, obviously include quantifiers.  But it should.

>>> re.findall("""d{1,4}""", 'dddd', re.X)
['dddd']
>>> re.findall("""d{1, 4}""", 'dddd', re.X)
[]

Suggestion 4. Say so, by adding 'or quantifier' after 'character class'.
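
To see what the verbose pattern with the extra space compiles to, one can test it against text containing literal braces.  This sketch assumes the current behavior, in which the failed quantifier falls back to ordinary characters and the space is then dropped by re.X; the text strings are made up for illustration:

```python
import re

# With re.X, 'd{1, 4}' is not a quantifier; after the space is ignored
# it behaves like the ordinary characters 'd{1,4}'.
literal_hit = re.findall("""d{1, 4}""", 'd{1,4}', re.X)

# The space itself is not part of the compiled pattern, so text that
# contains the space does not match.
with_space = re.findall("""d{1, 4}""", 'd{1, 4}', re.X)

print(literal_hit)  # ['d{1,4}']
print(with_space)   # []
```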