Issue 13899: re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/58107

classification

Title:	re pattern r"[\A]" should work like "A" but matches nothing. Ditto B and Z.
Type:	behavior	Stage:	resolved
Components:	Documentation	Versions:	Python 3.2, Python 3.3, Python 3.4, Python 2.7

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	ezio.melotti	Nosy List:	docs@python, ezio.melotti, georg.brandl, jcea, mrabarnett, python-dev, sjmachin, terry.reedy
Priority:	normal	Keywords:	patch

Created on 2012-01-28 22:28 by sjmachin, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue13899.patch	mrabarnett, 2013-01-07 18:05		review

Messages (16)
msg152194 - (view)	Author: John Machin (sjmachin)	Date: 2012-01-28 22:28
Expected behaviour illustrated using "C":

>>> import re
>>> re.findall(r'[\C]', 'CCC')
['C', 'C', 'C']
>>> re.compile(r'[\C]', 128)
literal 67
<_sre.SRE_Pattern object at 0x01FC6E78>
>>> re.compile(r'C', 128)
literal 67
<_sre.SRE_Pattern object at 0x01FC6F08>

Incorrect behaviour exhibited by "A" (and by "B" and "Z"):

>>> re.findall(r'[\A]', 'AAA')
[]
>>> re.compile(r'A', 128)
literal 65
<_sre.SRE_Pattern object at 0x01FC6F98>
>>> re.compile(r'[\A]', 128)
in
  at at_beginning_string #### FAIL ####
<_sre.SRE_Pattern object at 0x01FDF0B0>
>>>

Also there is no self-checking at runtime; the switch default has a comment to the effect that nothing can be done, so pretend that the unknown opcode matched nothing. Zen?
msg152195 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-01-28 22:42
This happens because \A, \B and \Z are valid escape sequences[0]. If what you mean is that they shouldn't be recognized as such inside a character class, then I can agree with that. ^ and $ are similar to \A and \Z but they are considered as literals inside []. I think the same could also be applied to \b and \B, unless you expect r'[\b]' to match the same as r'\b'.

(On an unrelated note, it's preferable to avoid using ints as flags -- using re.DEBUG is better.)

[0]: http://docs.python.org/library/re.html#regular-expression-syntax
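A minimal sketch of the two points in the message above. It shows that `$` and `^` are plain characters inside a character class (provided `^` is not the first character, where it negates the class) and that `re.DEBUG` is the named form of the integer 128 passed in the original report. The sample strings and variable names are made up for the illustration.

```python
import re

# Inside [], '$' and '^' are ordinary characters. '^' is placed second
# here because a leading '^' would negate the class instead.
literal_anchors = re.findall(r'[$^]', 'a^b$c')
print(literal_anchors)

# Outside [], the same characters act as zero-width anchors.
anchored = re.findall(r'^a|c$', 'a^b$c')
print(anchored)

# re.DEBUG is the symbolic name for the flag passed as 128 in the report.
debug_flag_value = int(re.DEBUG)
print(debug_flag_value)
```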
msg152198 - (view)	Author: John Machin (sjmachin)	Date: 2012-01-28 23:11
@ezio: Of course the context is "inside a character class". I expect r'[\b]' to act like r'\b' aka r'\x08' aka backspace because (1) that is the treatment applied to all other C-like control char escapes (2) the docs say so explicitly: "Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals."
msg152232 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2012-01-29 14:35
r'[\w]' also matches word chars. I find that a very useful property, since you can easily build classes like '[\w.]' It's also impossible to change this without breaking lots of regexes. It's also explicitly documented, although IMO it's not clear it extends to \A and \Z, since it talks about "character classes". So this is a docs issue.
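As a small illustration of the property described above, the following sketch builds a class from `\w` plus a literal dot. The input string is an arbitrary example, not taken from the report.

```python
import re

# \w keeps its category meaning inside [], so it can be combined
# with literal characters such as '.'.
tokens = re.findall(r'[\w.]+', 'foo.bar baz-1.5')
print(tokens)
```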
msg152237 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-01-29 15:32
[\w] should definitely work, but [\B] doesn't seem to match anything useful, and it just fails silently because it's neither equivalent to \B nor to [B]:

>>> re.match(r'foo\B', 'foobar') # on a non-word-boundary -- matches fine
<_sre.SRE_Match object at 0xb76dd3a0>
>>> re.match(r'foo[B]', 'fooBar') # same as r'fooB'
<_sre.SRE_Match object at 0xb76dd1e0>
>>> re.match(r'foo[\B]', 'foobar') # not equivalent to \B
>>> re.match(r'foo[\B]', 'fooBar') # not equivalent to [B]

The same is true for \Z and \A:

>>> re.match(r'foo\Z', 'foo') # end of the string -- matches fine
<_sre.SRE_Match object at 0xb76dd3a0>
>>> re.match(r'foo[Z]', 'fooZ') # same as r'fooZ'
<_sre.SRE_Match object at 0xb76dd1e0>
>>> re.match(r'foo[\Z]', 'foo') # not equivalent to \Z
>>> re.match(r'foo[\Z]', 'fooZ') # not equivalent to [Z]
>>>
>>> re.match(r'\Afoo', 'foo') # beginning of the string -- matches fine
<_sre.SRE_Match object at 0xb76dd1e0>
>>> re.match(r'[A]foo', 'Afoo') # same as r'Afoo'
<_sre.SRE_Match object at 0xb76dd3a0>
>>> re.match(r'[\A]foo', 'foo') # not equivalent to \A
>>> re.match(r'[\A]foo', 'Afoo') # not equivalent to [A]

Inside [], \b switches from word boundary to backspace:

>>> re.match(r'foo\b', 'foobar') # not on a word boundary -- no matches
>>> re.match(r'foo\b', 'foo bar') # on a word boundary -- matches fine
<_sre.SRE_Match object at 0xb74a4ec8>
>>> re.match(r'foo[\b]', 'foo bar') # not equivalent to \b
>>> re.match(r'foo[\b]', 'foo\bbar') # matches backspace
<_sre.SRE_Match object at 0xb76dd3d8>
>>> re.match(r'foo([\b])', 'foo\bbar').group(1)
'\x08'

Given that \b doesn't keep its word boundary meaning inside the [], \B (and \A and \Z) shouldn't keep it either (also because I can't see how having these inside [] would be of any use). On the other hand I'm not sure they should be equivalent to B, A, Z either. There are several escape sequences in the form \X (where X is an upper- or lower-case letter) that are not equivalent to X (\a\b\d\f\s\x\w\D\S\W...).
Raising an error that says something like "I don't think [\A] does what you think it does, use [A] instead." might be a better option (and in case anyone is wondering about re.escape, I just checked and it doesn't escape letters). Even if this is technically backward incompatible, any string that has \A, \B, \Z inside [] can be considered buggy IMHO (unless someone can come up with a valid use case where they do something useful).
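To check how a given interpreter treats these patterns, a version-tolerant probe along the following lines may help. It is only a sketch: `probe` is a hypothetical helper, not part of `re`, and the outcome depends on the Python version in use. An affected version returns an empty list for `[\A]`, a fixed one may return the literal match, and some versions reject the pattern with `re.error`.

```python
import re

def probe(pattern, text):
    """Return re.findall() results, or the re.error raised by the pattern.

    probe() is a hypothetical helper written for this illustration only.
    """
    try:
        return re.findall(pattern, text)
    except re.error as exc:
        return exc

# Compare each plain class with its backslash-escaped counterpart.
for pattern in (r'[A]', r'[\A]', r'[B]', r'[\B]', r'[Z]', r'[\Z]'):
    print(pattern, '->', probe(pattern, 'ABZ'))
```

Returning the exception instead of raising it keeps the loop going, so all six patterns can be compared in one run.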
msg152238 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2012-01-29 15:35
Interesting. That shifts the issue, since the current behavior is neither of the two that make sense. Then it would indeed make the most sense to raise in these cases. (I wonder what these patterns actually would match, but I have no time to look in the sre sources right now...)
msg152263 - (view)	Author: John Machin (sjmachin)	Date: 2012-01-29 21:41
@Ezio: Comparison of the behaviour of \letter inside/outside character classes is irrelevant. The rules for inside can be expressed simply as:

1. Letters dDsSwW are special; they represent categories as documented, and do in fact have a similar meaning outside character classes.
2. Otherwise normal Python rules for backslash escapes in string literals should be followed. This means automatically that \a -> \x07, \A -> A, \b -> backspace, \B -> B, \z -> z and \Z -> Z.

@Georg: No need to read the source, just read my initial posting: It's compiled as a zero-length matcher ("at") inside a character class ("in") i.e. a nonsense, then at runtime the illegality is deliberately ignored.
msg152264 - (view)	Author: John Machin (sjmachin)	Date: 2012-01-29 21:50
Whoops: "normal Python rules for backslash escapes" should have had a note "but revert to the C behaviour of stripping the \ from unrecognised escapes" which is what re appears to do in its own \ handling.
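The rules proposed in the two messages above can be sketched as a small lookup. This is an illustration of the proposal only, not code from the re module: the names `CATEGORY_LETTERS`, `CONTROL_ESCAPES` and `proposed_class_escape` are hypothetical, and numeric escapes such as \x and octal sequences are left out for brevity.

```python
# Hypothetical sketch of the proposed rules; not taken from the re module.

# Rule 1: these letters name character categories inside [].
CATEGORY_LETTERS = set('dDsSwW')

# Rule 2: C-like control escapes shared with Python string literals.
CONTROL_ESCAPES = {
    'a': '\a', 'b': '\b', 'f': '\f', 'n': '\n',
    'r': '\r', 't': '\t', 'v': '\v',
}

def proposed_class_escape(letter):
    """Describe what backslash+letter would mean inside [] under the proposal."""
    if letter in CATEGORY_LETTERS:
        return ('category', '\\' + letter)
    if letter in CONTROL_ESCAPES:
        return ('literal', CONTROL_ESCAPES[letter])
    # Unrecognised escape: strip the backslash and keep the letter.
    return ('literal', letter)

for letter in 'aAbBwzZ':
    print(letter, proposed_class_escape(letter))
```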
msg152384 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2012-01-31 13:13
The rule 1 makes sense, but it's not entirely obvious (people might consider bBaAzZ special too). The "normal Python rules for backslash escapes but revert to the C behaviour of stripping the \ from unrecognised escapes" is not obvious either, and from r'[\A]' people might expect:

1) same as \A (beginning of the string);
2) a letter 'A';
3) a '\' or a letter 'A' (especially if they write it as '[\\A]').

This is why I suggested to raise an error (and refuse the temptation to guess), but on the other hand, if you consider 'A' a "normal" letter like 'C', having an error for \A would be incoherent. It would have been better if \C raised an error too (I don't see why that would appear in a regex, since re.escape doesn't escape C and the user has no reason to add the \), but now it's too late for that.
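The third expectation depends on whether the pattern is written as a raw string. A short sketch of the difference, using made-up variable names:

```python
import re

# In an ordinary (non-raw) string literal, '\\' is a single backslash,
# so these two spellings produce the same pattern text.
same_pattern = '[\\A]' == r'[\A]'
print(same_pattern)

# In a raw string, r'[\\A]' is a class holding an escaped backslash and 'A',
# so it matches either character.
backslash_or_a = re.findall(r'[\\A]', r'\A')
print(backslash_or_a)
```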
msg152577 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2012-02-04 01:58
Does anyone have regex installed, to see what it does?
msg152586 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2012-02-04 03:08
This should answer that question:

>>> re.findall(r"[\A\C]", r"\AC")
['C']
>>> regex.findall(r"[\A\C]", r"\AC")
['A', 'C']

The behaviour of regex is intended to match that of re for backwards compatibility.
msg152591 - (view)	Author: Terry J. Reedy (terry.reedy) *	Date: 2012-02-04 04:56
I presume you intend regex to match the spec rather than bugs. So if re has a bug in an obscure corner case and the spec is ambiguous, as I have the impression is the case here, using the interpretation embodied in regex would avoid creating a conflict.
msg152639 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2012-02-04 18:37
In re, "\A" within a character set should be similar to "\C", but instead it's still interpreted as meaning the start of the string. That's definitely a bug. If it doesn't do what it's supposed to do, then it's a bug. regex tries to be backwards compatible with re but fix such bugs. The only buggy behaviour which it retains in its version 0 (compatible) behaviour is not splitting on a zero-width match, and that's only because GvR believes that some existing code which uses re may rely on that behaviour. In its version 1 (extended) behaviour it does split on a zero-width match.
msg179274 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2013-01-07 18:05
I've attached a patch.
msg179637 - (view)	Author: Roundup Robot (python-dev)	Date: 2013-01-11 06:44
New changeset 2bc04449fd8c by Ezio Melotti in branch '2.7':
#13899: \A, \Z, and \B now correctly match the A, Z, and B literals when used inside character classes (e.g. [A]). Patch by Matthew Barnett.
http://hg.python.org/cpython/rev/2bc04449fd8c

New changeset 081db241ccda by Ezio Melotti in branch '3.2':
#13899: \A, \Z, and \B now correctly match the A, Z, and B literals when used inside character classes (e.g. [A]). Patch by Matthew Barnett.
http://hg.python.org/cpython/rev/081db241ccda

New changeset 17b1eb4a8144 by Ezio Melotti in branch '3.3':
#13899: merge with 3.2.
http://hg.python.org/cpython/rev/17b1eb4a8144

New changeset 35ece2465936 by Ezio Melotti in branch 'default':
#13899: merge with 3.3.
http://hg.python.org/cpython/rev/35ece2465936
msg179638 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2013-01-11 06:46
Fixed, thanks for the report John, and for the patch Matthew!

History
Date	User	Action	Args
2022-04-11 14:57:26	admin	set	github: 58107
2013-01-11 06:46:43	ezio.melotti	set	status: open -> closed resolution: fixed messages: + msg179638 stage: patch review -> resolved
2013-01-11 06:44:49	python-dev	set	nosy: + python-dev messages: + msg179637
2013-01-09 06:03:20	ezio.melotti	set	assignee: ezio.melotti stage: needs patch -> patch review
2013-01-07 18:05:23	mrabarnett	set	files: + issue13899.patch keywords: + patch messages: + msg179274
2013-01-07 06:09:28	ezio.melotti	set	stage: needs patch versions: + Python 3.3, Python 3.4
2012-03-10 06:55:20	georg.brandl	link	issue14237 superseder
2012-02-04 18:37:23	mrabarnett	set	messages: + msg152639
2012-02-04 04:56:08	terry.reedy	set	messages: + msg152591
2012-02-04 03:08:51	mrabarnett	set	messages: + msg152586
2012-02-04 01:58:12	terry.reedy	set	nosy: + terry.reedy messages: + msg152577
2012-01-31 16:39:18	jcea	set	nosy: + jcea
2012-01-31 13:13:10	ezio.melotti	set	messages: + msg152384
2012-01-29 21:50:46	sjmachin	set	messages: + msg152264
2012-01-29 21:41:10	sjmachin	set	messages: + msg152263
2012-01-29 15:35:23	georg.brandl	set	messages: + msg152238
2012-01-29 15:32:26	ezio.melotti	set	assignee: docs@python -> (no value) messages: + msg152237
2012-01-29 14:35:57	georg.brandl	set	nosy: + georg.brandl, docs@python messages: + msg152232 assignee: docs@python components: + Documentation
2012-01-28 23:11:12	sjmachin	set	messages: + msg152198
2012-01-28 22:42:33	ezio.melotti	set	nosy: + ezio.melotti, mrabarnett messages: + msg152195
2012-01-28 22:28:53	sjmachin	create