Message 152237 - Python tracker


Author	ezio.melotti
Recipients	docs@python, ezio.melotti, georg.brandl, mrabarnett, sjmachin
Date	2012-01-29.15:32:26
SpamBayes Score	3.0792036e-13
Marked as misclassified	No
Message-id	<1327851147.32.0.752780075223.issue13899@psf.upfronthosting.co.za>
In-reply-to

Content

[\w] should definitely work, but [\B] doesn't seem to match anything useful, and it just fails silently because it's neither equivalent to \B nor to [B]:
>>> re.match(r'foo\B', 'foobar')  # on a non-word-boundary -- matches fine
<_sre.SRE_Match object at 0xb76dd3a0>
>>> re.match(r'foo[B]', 'fooBar')  # same as r'fooB'
<_sre.SRE_Match object at 0xb76dd1e0>
>>> re.match(r'foo[\B]', 'foobar')  # not equivalent to \B
>>> re.match(r'foo[\B]', 'fooBar')  # not equivalent to [B]

The same is true for \Z and \A:
>>> re.match(r'foo\Z', 'foo')  # end of the string -- matches fine
<_sre.SRE_Match object at 0xb76dd3a0>
>>> re.match(r'foo[Z]', 'fooZ')  # same as r'fooZ'
<_sre.SRE_Match object at 0xb76dd1e0>
>>> re.match(r'foo[\Z]', 'foo')  # not equivalent to \Z
>>> re.match(r'foo[\Z]', 'fooZ')  # not equivalent to [Z]
>>>
>>> re.match(r'\Afoo', 'foo')  # beginning of the string -- matches fine
<_sre.SRE_Match object at 0xb76dd1e0>
>>> re.match(r'[A]foo', 'Afoo')  # same as r'Afoo'
<_sre.SRE_Match object at 0xb76dd3a0>
>>> re.match(r'[\A]foo', 'foo')  # not equivalent to \A
>>> re.match(r'[\A]foo', 'Afoo')  # not equivalent to [A]
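The sessions above can be re-run on any interpreter with a small probe. This is only a sketch: `probe` is a hypothetical helper, not part of `re`, and it assumes that some Python versions may reject `[\A]`, `[\B]`, `[\Z]` with `re.error` rather than silently failing to match, so it reports that case separately instead of guessing.

```python
import re

def probe(pattern, text):
    """Classify re.match(pattern, text) as 'match', 'no match', or 'error'."""
    try:
        result = re.match(pattern, text)
    except re.error:
        # Assumption: some versions refuse to compile the pattern at all.
        return 'error'
    return 'match' if result else 'no match'

cases = [
    (r'foo\B', 'foobar'),    # non-word-boundary outside a class
    (r'foo[B]', 'fooBar'),   # literal B inside a class
    (r'foo[\B]', 'foobar'),  # \B inside a class, no B in the text
    (r'foo\Z', 'foo'),       # end of string outside a class
    (r'foo[\Z]', 'foo'),     # \Z inside a class, nothing left to consume
    (r'\Afoo', 'foo'),       # start of string outside a class
    (r'[\A]foo', 'foo'),     # \A inside a class, no A in the text
]
for pattern, text in cases:
    print('%-10s %-8r %s' % (pattern, text, probe(pattern, text)))
```

The cases where the text contains the literal letter (for example `r'foo[\B]'` against `'fooBar'`) are left out on purpose, since their result depends on how the running version interprets the escape.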

Inside [], \b switches from word boundary to backspace:
>>> re.match(r'foo\b', 'foobar')  # not on a word boundary -- no matches
>>> re.match(r'foo\b', 'foo bar')  # on a word boundary  -- matches fine
<_sre.SRE_Match object at 0xb74a4ec8>
>>> re.match(r'foo[\b]', 'foo bar')  # not equivalent to \b
>>> re.match(r'foo[\b]', 'foo\bbar')  # matches backspace
<_sre.SRE_Match object at 0xb76dd3d8>
>>> re.match(r'foo([\b])', 'foo\bbar').group(1)
'\x08'
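For reference, the two meanings of \b can be checked side by side in a script. This is just a minimal sketch of the behaviour shown in the session above; the variable names are mine.

```python
import re

# Outside a character class, \b is a zero-width word-boundary assertion.
assert re.match(r'foo\b', 'foobar') is None      # no boundary after "foo"
boundary = re.match(r'foo\b', 'foo bar')         # boundary before the space

# Inside a character class, \b is the backspace character (0x08),
# so it consumes one character instead of asserting a position.
assert re.match(r'foo[\b]', 'foo bar') is None   # a space is not a backspace
backspace = re.match(r'foo([\b])', 'foo\bbar')   # 'foo\bbar' contains 0x08

print(repr(boundary.group(0)))   # 'foo'
print(repr(backspace.group(1)))  # '\x08'
```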

Given that \b doesn't keep its word boundary meaning inside the [], \B (and \A and \Z) shouldn't keep it either (also because I can't see how having these inside [] would be of any use).
On the other hand I'm not sure they should be equivalent to B, A, Z either.  There are several escape sequences in the form \X (where X is an upper- or lower-case letter) that are not equivalent to X (\a\b\d\f\s\x\w\D\S\W...).
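A quick way to see that \X and X differ for some of these escapes is sketched below. The selection of escapes and sample characters is mine; I left out \w, \D and \S because they happen to match their own letter even though they mean something else.

```python
import re

# (escape, the bare letter, a character the escape is meant to match)
cases = [
    (r'\a', 'a', '\x07'),  # bell
    (r'\f', 'f', '\x0c'),  # form feed
    (r'\d', 'd', '7'),     # any digit
    (r'\s', 's', ' '),     # any whitespace
    (r'\W', 'W', '!'),     # any non-word character
]

for escape, letter, sample in cases:
    matches_letter = re.match(escape, letter) is not None
    matches_sample = re.match(escape, sample) is not None
    print('%s  matches %r: %s  matches %r: %s'
          % (escape, letter, matches_letter, sample, matches_sample))
```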
Raising an error that says something like "I don't think [\A] does what you think it does, use [A] instead." might be a better option (and in case anyone is wondering about re.escape, I just checked and it doesn't escape letters).  Even if this is technically backward incompatible, any string that has \A, \B, \Z inside [] can be considered buggy IMHO (unless someone can come up with a valid use case where they do something useful).
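To make the suggestion concrete, here is a rough sketch of the kind of check I have in mind, together with the re.escape observation. `anchors_in_class` is a hypothetical helper written for illustration, not how the re module parses patterns; it only tracks whether the scanner is inside [...] and ignores things like verbose-mode comments.

```python
import re

# re.escape leaves ASCII letters alone, so it never produces \A, \B or \Z.
print(re.escape('ABZ'))

def anchors_in_class(pattern):
    """Return the \\A, \\B, \\Z escapes that appear inside [...] in pattern."""
    found = []
    in_class = False
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == '\\' and i + 1 < len(pattern):
            if in_class and pattern[i + 1] in 'ABZ':
                found.append(pattern[i:i + 2])
            i += 2  # skip the escaped character in either case
            continue
        if ch == '[' and not in_class:
            in_class = True
            # A leading ^ and a leading ] belong to the class itself.
            if pattern[i + 1:i + 2] == '^':
                i += 1
            if pattern[i + 1:i + 2] == ']':
                i += 1
        elif ch == ']' and in_class:
            in_class = False
        i += 1
    return found

for pattern in (r'foo\B', r'foo[\B]', r'[\A]foo[^\Z]'):
    suspicious = anchors_in_class(pattern)
    if suspicious:
        print('%s: %s inside [] probably does not do what you think'
              % (pattern, ', '.join(suspicious)))
```

A real fix would live in the pattern parser and raise at compile time; this scanner only shows which patterns such an error would flag.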

History
Date	User	Action	Args
2012-01-29 15:32:27	ezio.melotti	set	recipients: + ezio.melotti, georg.brandl, sjmachin, mrabarnett, docs@python
2012-01-29 15:32:27	ezio.melotti	set	messageid: <1327851147.32.0.752780075223.issue13899@psf.upfronthosting.co.za>
2012-01-29 15:32:26	ezio.melotti	link	issue13899 messages
2012-01-29 15:32:26	ezio.melotti	create