Message 116252 - Python tracker



Author	vbr
Recipients	akitada, amaury.forgeotdarc, collinwinter, ezio.melotti, georg.brandl, giampaolo.rodola, gregory.p.smith, jaylogan, jhalcrow, jimjjewett, loewis, mark, moreati, mrabarnett, nneonneo, pitrou, r.david.murray, rsc, sjmachin, timehorse, vbr
Date	2010-09-12.23:34:26
Message-id	<1284334468.68.0.357269745287.issue2636@psf.upfronthosting.co.za>
In-reply-to

Content

Just some more rather marginal findings: differences between regex and re.

>>> regex.findall(r"[\B]", "aBc")
['B']
>>> re.findall(r"[\B]", "aBc")
[]

(Python 2.7 ... on win32; regex - issue2636-20100912.zip)
I believe regex is more correct here, as uppercase \B doesn't have a special meaning within a set (unlike \b, which means backspace there), hence it should be treated as a literal B. I wanted to mention it as a difference anyway, just in case it matters.
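
To illustrate the contrast with lowercase \b, here is a small sketch that uses only the standard re module; the helper names find_backspaces and word_boundary_offsets are made up for this example. Inside a set, \b stands for the backspace character (\x08), while outside a set it is a zero-width word-boundary assertion.

```python
import re

def find_backspaces(text):
    """Return every backspace character matched by the set [\\b]."""
    # Inside a character set, \b denotes the backspace character (\x08).
    return re.findall(r"[\b]", text)

def word_boundary_offsets(text):
    """Return the offsets at which \\b (outside a set) matches."""
    # Outside a set, \b is a zero-width assertion, so each match is empty.
    return [m.start() for m in re.finditer(r"\b", text)]

print(find_backspaces("a\x08c"))       # the backspace is matched
print(find_backspaces("abc"))          # no backspace in the text
print(word_boundary_offsets("ab cd"))  # boundaries around both words
```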

I also noticed another case where regex is more permissive:

>>> regex.findall(r"[\d-h]", "ab12c-h")
['1', '2', '-', 'h']
>>> re.findall(r"[\d-h]", "ab12c-h")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "re.pyc", line 177, in findall
  File "re.pyc", line 245, in _compile
error: bad character range
>>> 
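
If the hyphen is meant literally, the set can be written unambiguously, so that both modules should agree. A minimal sketch with the standard re module only (the function names are hypothetical): escaping the hyphen or moving it to the end of the set avoids reading \d-h as a range.

```python
import re

TEXT = "ab12c-h"

def digits_hyphen_h_escaped(text):
    # An escaped hyphen is always a literal, never a range operator.
    return re.findall(r"[\d\-h]", text)

def digits_hyphen_h_trailing(text):
    # A hyphen placed last in the set is also taken literally.
    return re.findall(r"[\dh-]", text)

def compiles(pattern):
    """Report whether re accepts the pattern at all."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(digits_hyphen_h_escaped(TEXT))
print(digits_hyphen_h_trailing(TEXT))
print(compiles(r"[\d-h]"))  # False for the re version quoted above
```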

However, there might be an issue in negated sets, where the negation seems to apply to the first shorthand class only; the remaining items are matched positively:

>>> regex.findall(r"[^\d-h]", "a^b12c-h")
['-', 'h']

cf. also a simplified pattern, where re seems to work correctly:

>>> regex.findall(r"[^\dh]", "a^b12c-h")
['h']
>>> re.findall(r"[^\dh]", "a^b12c-h")
['a', '^', 'b', 'c', '-']
>>> 

Or maybe, regardless of the order: when shorthand classes and ordinary characters are combined in a negated set, the ordinary characters are matched positively:

>>> regex.findall(r"[^h\s\db]", "a^b 12c-h")
['b', 'h']
>>> re.findall(r"[^h\s\db]", "a^b 12c-h")
['a', '^', 'c', '-']
>>> 
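
The semantics I would expect from a negated set can be spelled out without any regex engine: a character matches only if it belongs to none of the listed items. The sketch below is just a reference model for ASCII input (negated_set_reference is a hypothetical helper, not part of re or regex), compared with re for the pattern above.

```python
import re

def negated_set_reference(text, literals="", predicates=()):
    """Return the characters of text that are in none of the set's items.

    literals   -- plain characters listed in the set
    predicates -- callables standing in for shorthand classes such as \\s or \\d
    """
    return [
        ch for ch in text
        if ch not in literals and not any(p(ch) for p in predicates)
    ]

TEXT = "a^b 12c-h"

# Model of [^h\s\db]: not 'h', not 'b', not whitespace, not a decimal digit.
expected = negated_set_reference(
    TEXT, literals="hb", predicates=(str.isspace, str.isdecimal)
)

print(expected)
print(re.findall(r"[^h\s\db]", TEXT) == expected)
```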

Also related to character sets, but possibly a different problem: adding a (redundant) character that already belongs to the shorthand class in a negated set seems to confuse the parser somehow:

>>> regex.findall(r"[^b\w]", "a b")
[]
>>> re.findall(r"[^b\w]", "a b")
[' ']

>>> regex.findall(r"[^b\S]", "a b")
[]
>>> re.findall(r"[^b\S]", "a b")
[' ']

>>> regex.findall(r"[^8\d]", "a 1b2")
[]
>>> re.findall(r"[^8\d]", "a 1b2")
['a', ' ', 'b']
>>> 
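
These redundant-character cases could be captured as a small table-driven check. The sketch below assumes that the re results quoted above are the intended ones, which I am not certain of; REDUNDANT_CASES and check_cases are names made up for this example, and the function can be pointed at either module.

```python
import re

# (pattern, text, expected) triples, taking the re output quoted above as
# the assumed reference behaviour.
REDUNDANT_CASES = [
    (r"[^b\w]", "a b", [" "]),
    (r"[^b\S]", "a b", [" "]),
    (r"[^8\d]", "a 1b2", ["a", " ", "b"]),
]

def check_cases(module, cases=REDUNDANT_CASES):
    """Return the cases for which module.findall disagrees with expected."""
    failures = []
    for pattern, text, expected in cases:
        actual = module.findall(pattern, text)
        if actual != expected:
            failures.append((pattern, text, expected, actual))
    return failures

print(check_cases(re))  # an empty list means every case agrees
```

Passing the regex module instead of re would list each disagreeing case together with the output actually produced.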

I didn't find any relevant tracker issues; sorry if I missed some.
I initially wanted to provide test code additions, but as I am not sure about the intended output in all cases, I am leaving it in this form.

vbr

History
Date	User	Action	Args
2010-09-12 23:34:28	vbr	set	recipients: + vbr, loewis, georg.brandl, collinwinter, gregory.p.smith, jimjjewett, sjmachin, amaury.forgeotdarc, pitrou, nneonneo, giampaolo.rodola, rsc, timehorse, mark, ezio.melotti, mrabarnett, jaylogan, akitada, moreati, r.david.murray, jhalcrow
2010-09-12 23:34:28	vbr	set	messageid: <1284334468.68.0.357269745287.issue2636@psf.upfronthosting.co.za>
2010-09-12 23:34:27	vbr	link	issue2636 messages
2010-09-12 23:34:26	vbr	create