Author vbr
Recipients akitada, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, giampaolo.rodola, gregory.p.smith, jaylogan, jhalcrow, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, timehorse, vbr
Date 2010-09-12.23:34:26
SpamBayes Score 1.00919e-12
Marked as misclassified No
Message-id <1284334468.68.0.357269745287.issue2636@psf.upfronthosting.co.za>
In-reply-to
Content
Just another rather marginal findings; differences between regex and re:

>>> regex.findall(r"[\B]", "aBc")
['B']
>>> re.findall(r"[\B]", "aBc")
[]

(Python 2.7 ... on win32; regex - issue2636-20100912.zip)
I believe, regex is more correct here, as uppercase \B doesn't have a special meaning within a set (unlike backspace \b), hence it should be treated as B, but I wanted to mention it as a difference, just in case it would matter.

I also noticed another case, where regex is more permissive:

>>> regex.findall(r"[\d-h]", "ab12c-h")
['1', '2', '-', 'h']
>>> re.findall(r"[\d-h]", "ab12c-h")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "re.pyc", line 177, in findall
  File "re.pyc", line 245, in _compile
error: bad character range
>>> 

howewer, there might be an issue in negated sets, where the negation seem to apply for the first shorthand literal only; the rest is taken positively

>>> regex.findall(r"[^\d-h]", "a^b12c-h")
['-', 'h']

cf. also a simplified pattern, where re seems to work correctly:

>>> regex.findall(r"[^\dh]", "a^b12c-h")
['h']
>>> re.findall(r"[^\dh]", "a^b12c-h")
['a', '^', 'b', 'c', '-']
>>> 

or maybe regardless the order - in presence of shorthand literals and normal characters in negated sets, these normal characters are matched positively

>>> regex.findall(r"[^h\s\db]", "a^b 12c-h")
['b', 'h']
>>> re.findall(r"[^h\s\db]", "a^b 12c-h")
['a', '^', 'c', '-']
>>> 

also related to character sets but possibly different - maybe adding a (reduntant) character also belonging to the shorthand in a negated set seem to somehow confuse the parser:

regex.findall(r"[^b\w]", "a b")
[]
re.findall(r"[^b\w]", "a b")
[' ']

regex.findall(r"[^b\S]", "a b")
[]
re.findall(r"[^b\S]", "a b")
[' ']

>>> regex.findall(r"[^8\d]", "a 1b2")
[]
>>> re.findall(r"[^8\d]", "a 1b2")
['a', ' ', 'b']
>>> 

I didn't find any relevant tracker issues, sorry if I missed some...
I initially wanted to provide test code additions, but as I am not sure about the intended output in all cases, I am leaving it in this form;

vbr
History
Date User Action Args
2010-09-12 23:34:28vbrsetrecipients: + vbr, loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jhalcrow
2010-09-12 23:34:28vbrsetmessageid: <1284334468.68.0.357269745287.issue2636@psf.upfronthosting.co.za>
2010-09-12 23:34:27vbrlinkissue2636 messages
2010-09-12 23:34:26vbrcreate