Issue 2650: re.escape should not escape underscore


This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/46902

classification

Title:	re.escape should not escape underscore
Type:	behavior	Stage:	resolved
Components:	Regular Expressions	Versions:	Python 3.2

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:	ezio.melotti	Nosy List:	SilentGhost, amaury.forgeotdarc, belopolsky, benjamin.peterson, bjourne, donlorenzo, ezio.melotti, foom, georg.brandl, mortenlj, mrabarnett, pitrou, python-dev, rsc, swamiyeswanth, timehorse, zanella
Priority:	normal	Keywords:	easy, patch

Created on 2008-04-17 14:14 by rsc, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description
re.patch	rsc, 2008-04-23 23:39	patch to python svn trunk (2.6a2+)
re.patch	rsc, 2008-04-24 12:38	
re.patch	donlorenzo, 2008-04-28 17:24	patch using enumerate and frozenset
re_patch.diff	zanella, 2008-05-08 01:35	Compilation of previous patches, testing for more characters included on the ASCII table
test_re.diff	SilentGhost, 2011-02-23 12:30	
test_re.diff	SilentGhost, 2011-03-13 03:01	
issue2650.diff	ezio.melotti, 2011-03-25 13:25	Patch to add '_' to the non-escaped chars.

Messages (43)
msg65585 - (view)	Author: Russ Cox (rsc)	Date: 2008-04-17 14:14
import re
print re.escape("_")

Prints \_ but should be _. This behavior differs from Perl and other systems: _ is an identifier character and as such does not need to be escaped.
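For readers on a current interpreter, the report above can be checked with a short sketch. It assumes Python 3.4 or later (for `re.fullmatch`); the underscore has been left unescaped since Python 3.3, which is where the final changeset in this thread landed.

```python
import re

# On Python 3.3 and later the underscore is returned unchanged.
escaped = re.escape("_")
print(repr(escaped))

# Escaped or not, the resulting pattern still matches a literal underscore.
assert re.fullmatch(escaped, "_") is not None
```

On the 2.x interpreters discussed at the time, the same call returned a backslash followed by the underscore.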
msg65590 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2008-04-17 20:44
It seems that escape is pretty dumb. The documentation says that re.escape escapes all non-alphanumeric characters, and it does that faithfully. It would seem more useful to have a list of meta-characters and just escape those. This is more true in Py3k when str can have thousands of possible characters that could be considered alphanumeric.
msg65599 - (view)	Author: Russ Cox (rsc)	Date: 2008-04-18 01:08
> It seems that escape is pretty dumb. The documentation says that
> re.escape escapes all non-alphanumeric characters, and it does that
> faithfully. It would seem more useful to have a list of meta-characters
> and just escape those. This is more true in Py3k when str can have
> thousands of possible characters that could be considered alphanumeric.

The usual convention is to escape everything that is ASCII and not A-Za-z0-9_, in case other punctuation becomes special in the future. But I agree -- escaping just the actual special characters makes the most sense.

Russ
msg65600 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2008-04-18 01:19
Would you like to work on a patch?
msg65708 - (view)	Author: Benjamin Peterson (benjamin.peterson) *	Date: 2008-04-24 01:30
Thanks. The loop in escape should really use enumerate instead of "for i in range(len(pattern))". Instead of using a loop, can't the test just use "self.assertEqual(re.escape(same), same)"? Also, please add tests for what re.escape should escape.
msg65721 - (view)	Author: Russ Cox (rsc)	Date: 2008-04-24 12:38
> The loop in escape should really use enumerate
> instead of "for i in range(len(pattern))".

It needs i to edit s[i].

> Instead of using a loop, can't the test just
> use "self.assertEqual(re.escape(same), same)"?

Done.

> Also, please add tests for what re.escape should escape.

That's handled in the existing test over all bytes 0-255.
msg65923 - (view)	Author: Lorenz Quack (donlorenzo) *	Date: 2008-04-28 17:23
>> The loop in escape should really use enumerate
>> instead of "for i in range(len(pattern))".
>
> It needs i to edit s[i].

enumerate(iterable) returns a tuple for each element in iterable containing the index and the element itself. I attached a patch using enumerate. The patch also uses a frozenset rather than a dict for the special characters.
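The enumerate point can be illustrated with a minimal sketch. This is not the attached patch: the function name `escape_special` and the contents of `_SPECIAL` are assumptions made for illustration, using the metacharacter list mentioned elsewhere in this thread.

```python
# Hypothetical metacharacter set; the actual patch may have used a different list.
_SPECIAL = frozenset(".^$*+?{}[]\\|()")

def escape_special(pattern):
    # enumerate() yields (index, character) pairs, so the index is still
    # available for editing the working copy in place.
    chars = list(pattern)
    for i, c in enumerate(pattern):
        if c in _SPECIAL:
            chars[i] = "\\" + c
        elif c == "\000":
            chars[i] = "\\000"
    return "".join(chars)

print(escape_special("a.b_c"))
```

The frozenset gives constant-time membership tests without the dummy values a dict would need.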
msg66386 - (view)	Author: Rafael Zanella (zanella)	Date: 2008-05-08 01:35
AFAIK the lookup on dictionaries is faster than on lists. Patch added, mainly a compilation of the previous patches with an expanded test.
msg66416 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2008-05-08 14:08
Lorenz's patch uses a set, not a list for special characters. Set lookup is as fast as dict lookup, but a set takes less memory because it does not have to store dummy values. More importantly, use of frozenset instead of dict makes the code clearer. On the other hand, I would simply use a string. For a dozen entries, hash lookup does not buy you much.

Another nit: why use "\\%c" % (c) instead of obvious "\\" + c?

Finally, you can eliminate use of index and a temporary list altogether by using a generator expression:

    ''.join(("\\" + c if c in _special else '\\000' if c == "\000" else c) for c in pattern)
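A runnable form of the generator-expression suggestion might look like the following sketch. The contents of `_special` are an assumption (a plain string, as suggested above), and the function is a stand-in, not the code from any attached patch.

```python
import re

# Assumed metacharacter list, kept as a plain string as suggested above.
_special = ".^$*+?{}[]\\|()"

def escape(pattern):
    # One pass, no index and no temporary list: each character is either
    # backslashed, replaced by the \000 escape, or passed through.
    return "".join(
        "\\" + c if c in _special else "\\000" if c == "\000" else c
        for c in pattern
    )

# The escaped text should match the original string literally.
sample = "1+1=2?"
assert re.fullmatch(escape(sample), sample) is not None
```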
msg66418 - (view)	Author: Russ Cox (rsc)	Date: 2008-05-08 14:36
> Lorenz's patch uses a set, not a list for special characters. Set
> lookup is as fast as dict lookup, but a set takes less memory because it
> does not have to store dummy values. More importantly, use of frozenset
> instead of dict makes the code clearer. On the other hand, I would
> simply use a string. For a dozen entries, hash lookup does not buy you
> much.
>
> Another nit: why use "\\%c" % (c) instead of obvious "\\" + c?
>
> Finally, you can eliminate use of index and a temporary list altogether
> by using a generator expression:
>
> ''.join(("\\" + c if c in _special else '\\000' if c == "\000" else c)
> for c in pattern)

The title of this issue (#2650) is "re.escape should not escape underscore", not "re.escape is too slow and too easy to read". If you have an actual, measured performance problem with re.escape, please open a new issue with numbers to back it up. That's not what this one is about.

Thanks.
Russ
msg66419 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2008-05-08 15:08
On Thu, May 8, 2008 at 10:36 AM, Russ Cox <report@bugs.python.org> wrote:
..
> The title of this issue (#2650) is "re.escape should not escape underscore",
> not "re.escape is too slow and too easy to read".
>

Neither does the title say "re.escape should only escape .^$*+?{}[]\|()". I reviewed the patch rather than its conformance with the title. (BTW, the patch does not update documentation in Doc/library/re.rst.)

> If you have an actual, measured performance problem with re.escape,
> please open a new issue with numbers to back it up.
> That's not what this one is about.

You don't need to get so defensive. I did not raise a performance problem, I was simply responding to Rafael's "AFAIK the lookup on dictionaries is faster than on lists" comment. I did not say that you *should* rewrite your patch the way I suggested, only that you can use new language features to simplify the code.

In any case, I am -0 on the patch. The current documentation says:

"""
escape(string)
    Return string with all non-alphanumerics backslashed; this is useful if you want to match an arbitrary literal string that may have regular expression metacharacters in it.
"""

and the current implementation serves the intended use case well. I did not see a compelling use case presented for the change. On the downside, since there is no mechanism to assure that _special indeed contains all re metacharacters, it may present a maintenance problem if additional metacharacters are added in the future.
msg66420 - (view)	Author: Russ Cox (rsc)	Date: 2008-05-08 15:44
> You don't need to get so defensive. I did not raise a performance
> problem, I was simply responding to Rafael's "AFAIK the lookup on
> dictionaries is faster than on lists" comment. I did not say that you
> should rewrite your patch the way I suggested, only that you can
> use new language features to simplify the code.

I was responding to the entire thread more than your mail. I'm frustrated because the only substantial discussion has focused on details of how to implement set lookup the fastest in a function that likely doesn't matter for speed.

> In any case, I am -0 on the patch. The current documentation says:

Now these are the kinds of comments I was hoping for. Thank you.

> Return string with all non-alphanumerics backslashed; this is useful if you
> want to match an arbitrary literal string that may have regular expression
> metacharacters in it.

Sure; the documentation is wrong too.

> I did not see a compelling use case presented for the change.

The usual convention in regular expressions is that escaping a word character means you intend a special meaning, and underscore is a word character. Even though the current re module does accept \_ as synonymous with _ (just as it accepts \q as synonymous with q), it is no more correct to escape _ than to escape q.

I think it is fine to escape all non-word characters, but someone else suggested that it would be easier when moving to larger character sets to escape just the special ones. I'm happy with either version. My argument is only that Python should behave the same in this respect as other systems that use substantially the same regular expressions.

> since there is no mechanism to assure that _special indeed
> contains all re metacharacters, it may present a maintenance problem
> if additional metacharacters are added in the future.

The test suite will catch these easily, since it checks that re.escape(c) matches c for all characters c. But again, I'm happy with escaping all ASCII non-word characters.
Russ
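The safety net described in the message above (every character, once escaped, must still match itself) can be sketched as follows. The helper name `check_round_trip` and the 0-255 range are illustrative assumptions, not the code in Lib/test/test_re.py; `re.fullmatch` requires Python 3.4 or later.

```python
import re

def check_round_trip(limit=256):
    """Return the characters whose escaped form fails to match them."""
    failures = []
    for code in range(limit):
        c = chr(code)
        # A missing metacharacter in the escape list would show up here,
        # either as a failed match or as a pattern compilation error.
        if re.fullmatch(re.escape(c), c) is None:
            failures.append(c)
    return failures

print(check_round_trip())
```

An empty list means no character in the range was mis-escaped.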
msg66421 - (view)	Author: Alexander Belopolsky (belopolsky) *	Date: 2008-05-08 16:12
On Thu, May 8, 2008 at 11:45 AM, Russ Cox <report@bugs.python.org> wrote:
..
> My argument is only that Python should behave the same in
> this respect as other systems that use substantially the same
> regular expressions.
>

This is not enough to justify the change in my view. After all, "A Foolish Consistency is the Hobgoblin of Little Minds" <http://www.python.org/dev/peps/pep-0008/>.

I don't know if there is much code out there that relies on the current behavior, but technically speaking, this is an incompatible change. A backward compatible way to add your desired functionality would be to add the "escape_special" function, but not every useful 3-line function belongs to stdlib.

This said, I would prefer simply adding '_' to _alphanum over _special approach, but still -1 on the whole idea.
msg66422 - (view)	Author: Russ Cox (rsc)	Date: 2008-05-08 16:19
On Thu, May 8, 2008 at 12:12 PM, Alexander Belopolsky <report@bugs.python.org> wrote:
>
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> On Thu, May 8, 2008 at 11:45 AM, Russ Cox <report@bugs.python.org> wrote:
> ..
>> My argument is only that Python should behave the same in
>> this respect as other systems that use substantially the same
>> regular expressions.
>>
>
> This is not enough to justify the change in my view. After all, "A
> Foolish Consistency is the Hobgoblin of Little Minds"
> <http://www.python.org/dev/peps/pep-0008/>.
>
> I don't know if there is much code out there that relies on the
> current behavior, but technically speaking, this is an incompatible
> change. A backward compatible way to add your desired functionality
> would be to add the "escape_special" function, but not every useful
> 3-line function belongs to stdlib.

In my mind, arguing that re.escape can't possibly be changed due to imagined backward incompatibilities is the foolish consistency.

> This said, I would prefer simply adding '_' to _alphanum over _special
> approach, but still -1 on the whole idea.

I don't use Python enough to care one way or the other. I noticed a bug, I reported it. Y'all are welcome to do as you see fit.

Russ
msg66423 - (view)	Author: A.M. Kuchling (akuchling) *	Date: 2008-05-08 17:00
I haven't assessed the patch, but wouldn't mind to see it applied to an alpha release or to 3.0; +0 from me. Given that the next 2.6 release is planned to be a beta, though, the release manager would have to rule.

Note that I don't think this change is actually backwards-incompatible and is actually fairly low-risk. It does change what re.escape will return, but the common use-case is escaping some user- or data-supplied string so that it can be passed to re.compile() without triggering a syntax error or very long loop. In that use-case, whether it returns _ or \_ is immaterial; the result is the same. Doing a Google code search for re.escape confirms that this is the general usage.

Interestingly, SCons defines its own re_escape, with a comment saying '# re.escape escapes too much'. But their function doesn't escape \ or $ at all, so I don't understand why they bothered.

On the other hand, if this patch doesn't affect the usage of the function, why bother? Matching Perl or other systems probably won't improve interoperability very much, so the release manager might decide to leave well enough alone.
msg68208 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-06-14 15:28
Talking about performance, why use a loop to escape special characters when you could use a regular expression to escape them all at once?
msg68785 - (view)	Author: Morten Lied Johansen (mortenlj)	Date: 2008-06-26 14:45
One issue that the current implementation has, which I can't see has been commented on here, is that it kills utf8 characters (and probably every other character encoding that is multi-byte). An é character in a utf8 encoded string will be represented by two bytes. When passed through re.escape, those two bytes are checked individually, and both are considered non-alphanumeric, and are consequently escaped, breaking the utf8 string into complete gibberish instead.
msg68786 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-06-26 15:18
The escaped regexp is not utf-8 (why should it be?), but it still matches the same bytes in the searched text, which has to be utf-8 encoded anyway:

>>> text = u"été".encode('utf-8')
>>> regexp = u"é".encode('utf-8')
>>> re.findall(regexp, text)
['\xc3\xa9', '\xc3\xa9']
>>> escaped_regexp = re.escape(regexp)
>>> re.findall(escaped_regexp, text)
['\xc3\xa9', '\xc3\xa9']
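The session above is from Python 2. A rough Python 3 equivalent, where encoded text is a `bytes` object, could look like this sketch; the variable names are illustrative.

```python
import re

text = "été".encode("utf-8")
pattern = "é".encode("utf-8")

# Search with the raw bytes pattern and with its escaped form.
plain = re.findall(pattern, text)
escaped = re.findall(re.escape(pattern), text)

# Whether or not the individual bytes get backslashed, both patterns
# find the same two occurrences of the two-byte sequence.
print(plain == escaped, len(plain))
```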
msg68895 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2008-06-28 19:20
> The escaped regexp is not utf-8 (why should it be?)

I suppose it is annoying if you want to print the escaped regexp for debugging purposes.

Anyway, I suppose someone should really decide if improving re.escape is worth it, and if not, close the bug.
msg68910 - (view)	Author: Morten Lied Johansen (mortenlj)	Date: 2008-06-28 20:51
In my particular case, we were passing the regex on to a database which has regex support syntactically equal to Python, so it seemed natural to use re.escape to make sure we weren't matching against the pattern we really wanted.

The documentation of re.escape also states that it will only escape non-alphanumeric characters, which is apparently only true if you are using a single byte encoding (i.e. not utf-8, or any other encoding using more than a single byte per character). At the very least, that's probably worth mentioning in the docs.
msg92548 - (view)	Author: Björn Lindqvist (bjourne)	Date: 2009-09-12 16:21
In my app, I need to transform the regexp created from user input so that it matches unicode characters with their ascii equivalents. For example, if someone searches for "el nino", that should match the string "el ñino". Similarly, searching for "el ñino" should match "el nino".

The code to transform the regexp looks like this:

    s = re.escape(user_input)
    s = re.sub(u'n|ñ', u'[n|ñ]', s)
    matches = list(re.findall(s, data, re.IGNORECASE|re.UNICODE))

It doesn't work because the ñ in the user_input is escaped with a backslash. My workaround, to compensate for re.escape's too eager escaping, is to escape the re.sub pattern:

    s = re.sub(u'\\\\n|\\\\ñ', u'[\\\\n|\\\\ñ]', s)

It works but is not very nice. It would have been much better if re.escape worked like one could expect in the first place.
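On a current interpreter the use case above no longer needs the workaround. The sketch below assumes Python 3.7 or later, where `re.escape` only backslashes characters that can be special in a pattern and so leaves ñ alone; the function name `accent_insensitive` is hypothetical.

```python
import re

def accent_insensitive(user_input):
    # Escape first, then widen every n or ñ into a two-character class.
    escaped = re.escape(user_input)
    return re.sub("[nñ]", "[nñ]", escaped)

pattern = accent_insensitive("el nino")
print(pattern)
print(re.findall(pattern, "El Niño and el nino", re.IGNORECASE))
```

This relies on the escaped text never containing a backslash directly before n or ñ, which holds under the stated version assumption but not on the interpreters discussed in this thread.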
msg122408 - (view)	Author: Matthew Barnett (mrabarnett) *	Date: 2010-11-25 20:51
Re the regex module (issue #2636), would a good compromise be: regex.escape(user_input, special_only=True) to maintain compatibility?
msg125746 - (view)	Author: James Y Knight (foom)	Date: 2011-01-08 03:25
I just ran into the impl of escape after being surprised that '/' was being escaped, and then was completely amazed that it wasn't just implemented as a one-line re.subn. Come on, a loop for string replacement? This is in the freaking re module for pete's sake!

The extra special \\000 behavior seems entirely superfluous, as well. re works just fine with nul bytes in the pattern; there's no need to special case that.

So:

    return re.subn('([^abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ01234567890])', '\\\\\\1', pattern)[0]

or, for the new proposed list of special chars:

    return re.subn('([][.^$*+?{}\\|()])', '\\\\\\1', pattern)[0]

(pre-compilation of pattern left as an exercise to the reader)
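A self-contained version of the one-line substitution idea, with the pattern precompiled and written using raw strings, might look like the sketch below. The names `_SPECIAL_RE` and `escape_with_sub` are made up for illustration, and the brackets inside the character class are backslashed explicitly to keep the class easy to read.

```python
import re

# Assumed metacharacter set, taken from the list proposed in this thread.
_SPECIAL_RE = re.compile(r"([.^$*+?{}\[\]\\|()])")

def escape_with_sub(pattern):
    # In the replacement template, \\ is a literal backslash and \1 is
    # the matched metacharacter.
    return _SPECIAL_RE.sub(r"\\\1", pattern)

print(escape_with_sub("a.b|c"))
```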
msg125770 - (view)	Author: Georg Brandl (georg.brandl) *	Date: 2011-01-08 09:34
The loop looks strange to me too, not to mention inefficient compared with a regex replacement done in C.
msg125778 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-01-08 11:29
James, could you propose a proper patch? Even better if you also give a couple of timing results, just for the record?
msg126138 - (view)	Author: yeswanth (swamiyeswanth)	Date: 2011-01-12 21:09
As James said, I have written the patch using only regular expressions. This is going to be my first patch. I need help writing the test for it.
msg126141 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2011-01-12 21:21
> As James said I have written the patch using only regular expressions .
> This is going to be my first patch . I need help writing the test for it

You will find the current tests in Lib/test/test_re.py. To execute them, run:

$ ./python -m test.regrtest -v test_re

In this case, there are probably already some tests for re.escape. So you have to check that they are sufficient, and that your patch doesn't make them fail.
msg126168 - (view)	Author: SilentGhost (SilentGhost) *	Date: 2011-01-13 13:49
Here is the patch, including adjustment to the test.
msg126176 - (view)	Author: SilentGhost (SilentGhost) *	Date: 2011-01-13 15:48
The naïve version of the code proposed was about 3 times slower than existing version. However, the test, I think, is valuable enough. So, I'm reinstating it.
msg126177 - (view)	Author: James Y Knight (foom)	Date: 2011-01-13 16:09
Show your speed test? Looks 2.5x faster to me. But I'm running this on python 2.6, so I guess it's possible that the re module's speed was decimated in Py3k.

python -m timeit -s "$(printf "import re\ndef escape(s):\n return re.sub('([][.^$*+?{}\\|()])', '\\\1', s)")" 'escape("!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()")'
100000 loops, best of 3: 18.4 usec per loop

python -m timeit -s "import re" 're.escape("!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()")'
10000 loops, best of 3: 45.7 usec per loop
msg126181 - (view)	Author: SilentGhost (SilentGhost) *	Date: 2011-01-13 16:48
James, I think the setup statement should have been:

"import re\ndef escape(s):\n return re.sub(r'([][.^$*+?{}\\|()])', r'\\\1', s)")"

note the raw string literals. The timings that I got after applying file20388 (http://bugs.python.org/file20388/issue2650.diff) were:

>PCbuild\python.exe -m timeit -s "import re, string" "re.escape(string.printable)"
10000 loops, best of 3: 63.3 usec per loop

>python.exe -m timeit -s "import re, string" "re.escape(string.printable)"
100000 loops, best of 3: 19.3 usec per loop
msg126184 - (view)	Author: James Y Knight (foom)	Date: 2011-01-13 17:12
Right you are, it seems that python's regexp implementation is terribly slow when doing replacements with a substitution in them. (fixing the broken test, as you pointed out changed the timing to 97.6 usec vs the in-error-reported 18.3usec.) Oh well. I still think it's crazy not to use re for this in its own module. Someone just needs to fix re to not be horrifically slow, too. :)
msg126185 - (view)	Author: yeswanth (swamiyeswanth)	Date: 2011-01-13 17:13
@James test results for py3k

python -m timeit -s "$(printf "import re\ndef escape(s):\n return re.sub('([][.^$*+?{}\\|()])', '\\\1', s)")" 'escape("!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()")'
100000 loops, best of 3: 17.1 usec per loop

python -m timeit -s "import re" 're.escape("!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()!@#$%^&*()")'
10000 loops, best of 3: 102 usec per loop
msg130719 - (view)	Author: SilentGhost (SilentGhost) *	Date: 2011-03-13 03:01
Here is the latest patch for test_re incorporating review suggestions by Ezio and some improvements along the way.
msg130783 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-03-14 01:38
I took a look at what other languages do, and it turned out that:
perl escapes [^A-Za-z_0-9] [0];
.net escapes the metachars and whitespace [1];
java escapes the metachars or escape sequences [2];
ruby escapes the metachars [3];

It might be OK to exclude _ from the escaped chars, but I would keep escaping all the other non-alnum chars too (i.e. match perl behavior).

(FWIW, I don't think re.escape() is used in performance-critical situation, so readability should probably be preferred over speed.)

[0]: http://perldoc.perl.org/functions/quotemeta.html
[1]: http://msdn.microsoft.com/en-us/library/system.text.regularexpressions.regex.escape.aspx
[2]: http://download.oracle.com/javase/1.5.0/docs/api/java/util/regex/Pattern.html
[3]: http://www.ruby-doc.org/core/classes/Regexp.html
msg130821 - (view)	Author: SilentGhost (SilentGhost) *	Date: 2011-03-14 14:46
I think these are two different questions:
1. What to escape
2. What to do about poor performance of the re.escape when re.sub is used

In my opinion, there isn't any justifiable reason to escape non-meta characters: it doesn't affect matching; escaped strings are typically just re-used in regex.

I would favour simpler and cleaner code with re.sub. I don't think that re.escape could be a performance bottleneck in any application. I did some profiling with python3.2 and it seems that the reason for this poor performance is many abstraction layers when using re.sub. However, we need to bear in mind that we're only talking about 40 usec difference for a 100-char string (string.printable): I'd think that strings being escaped are typically shorter.

As a compromise, I tested this code:

    _mp = {ord(i): '\\' + i for i in '][.^$*+?{}\\|()'}

    def escape(pattern):
        if isinstance(pattern, str):
            return pattern.translate(_mp)
        return sub(br'([][.^$*+?{}\\|()])', br'\\\1', pattern)

which is fast (faster than existing code) for str and slow for bytes patterns. I don't particularly like it, because of the difference between str and bytes handling, but I do think that it will be much easier to "fix" once/when/if re module is improved.
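The compromise described above can be written as a standalone module-level sketch. The names `_SPECIAL`, `_STR_MAP` and `_BYTES_RE` are illustrative, the metacharacter list is the one assumed throughout this thread, and this is not the code that was eventually committed.

```python
import re

# Assumed metacharacter list.
_SPECIAL = ".^$*+?{}[]\\|()"

# str patterns: a translation table maps each code point to its escaped form.
_STR_MAP = {ord(c): "\\" + c for c in _SPECIAL}

# bytes patterns: bytes.translate() cannot map one byte to two,
# so fall back to a regex substitution.
_BYTES_RE = re.compile(rb"([.^$*+?{}\[\]\\|()])")

def escape(pattern):
    if isinstance(pattern, str):
        return pattern.translate(_STR_MAP)
    return _BYTES_RE.sub(rb"\\\1", pattern)

print(escape("a*b_c"), escape(b"a*b_c"))
```

The asymmetry between the two branches is the drawback the message itself points out.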
msg130930 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-03-14 23:23
re.escape and its tests can be refactored in 2.7/3.1, the '_' can be added to the list of chars that are not escaped in 3.3. I'll put together a patch and fix this unless someone thinks that the '_' should be escaped in 3.3 too.
msg132082 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-03-25 12:27
New changeset 1402c719b7cf by Ezio Melotti in branch '3.1':
#2650: Refactor the tests for re.escape.
http://hg.python.org/cpython/rev/1402c719b7cf

New changeset 9147f7ed75b3 by Ezio Melotti in branch '3.1':
#2650: Add tests with non-ascii chars for re.escape.
http://hg.python.org/cpython/rev/9147f7ed75b3

New changeset ed02db9921ac by Ezio Melotti in branch '3.1':
#2650: Refactor re.escape to use enumerate().
http://hg.python.org/cpython/rev/ed02db9921ac

New changeset 42ab3ebb8c2c by Ezio Melotti in branch '3.2':
#2650: Merge with 3.1.
http://hg.python.org/cpython/rev/42ab3ebb8c2c

New changeset 9da300ad8255 by Ezio Melotti in branch 'default':
#2650: Merge with 3.2.
http://hg.python.org/cpython/rev/9da300ad8255
msg132084 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-03-25 12:51
New changeset d52b1faa7b11 by Ezio Melotti in branch '2.7':
#2650: Refactor re.escape and its tests.
http://hg.python.org/cpython/rev/d52b1faa7b11
msg132085 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-03-25 12:57
I did a few more tests and using a re.sub seems indeed slower (the implementation is just 4 lines though, and it's more readable):

wolf@hp:~/dev/py/3.1$ ./python -m timeit -s 'import re,string; escape_pattern = re.compile("([^\x00a-zA-Z0-9])")' 'escape_pattern.sub(r"\\\1", string.printable).replace("\x00", "\\000")'
1000 loops, best of 3: 219 usec per loop
wolf@hp:~/dev/py/3.1$ ./python -m timeit -s 'import re,string' 're.escape(string.printable)'
10000 loops, best of 3: 59.3 usec per loop
wolf@hp:~/dev/py/3.1$ ./python -c 'import re,string; escape_pattern = re.compile("([^\x00a-zA-Z0-9])"); print(escape_pattern.sub(r"\\\1", string.printable).replace("\x00", "\\000") == re.escape(string.printable))'
True

wolf@hp:~/dev/py/3.1$ ./python -m timeit -s 'import re,string; escape_pattern = re.compile(b"([^\x00a-zA-Z0-9])"); s = string.printable.encode("ascii")' 'escape_pattern.sub(br"\\\1", s).replace(b"\x00", b"\\000")'
1000 loops, best of 3: 231 usec per loop
wolf@hp:~/dev/py/3.1$ ./python -m timeit -s 'import re,string; s = string.printable.encode("ascii")' 're.escape(s)'
10000 loops, best of 3: 73.2 usec per loop
wolf@hp:~/dev/py/3.1$ ./python -c 'import re,string; escape_pattern = re.compile(b"([^\x00a-zA-Z0-9])"); s = string.printable.encode("ascii"); print(escape_pattern.sub(br"\\\1", s).replace(b"\x00", b"\\000") == re.escape(s))'
True

The .replace() doesn't seem to affect the speed in any significant way.

I also did a few more tests:
1) using enumerate();
2) like 1) but also moving \x00 in the set of alnum chars, removing the "if c == '\000'" from the loop and using .replace("\x00", "\\000") on the joined string;
3) like 2) but also moving the loop in a genexp inside the join();

1) is the fastest (10-15% faster than the original), 2) is pretty much the same speed of 1), and 3) is slower, so I just changed re.escape to use enumerate() and refactored its tests in 2.7/3.1/3.2/3.3.
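The comparison above can be repeated from a script instead of the shell. The sketch below uses the `timeit` module directly; the helper names `escape_sub` and `best_usec` and the loop counts are arbitrary choices for illustration. Timings depend on the interpreter version and machine, so no figures are claimed here.

```python
import re
import string
import timeit

# Substitution-based variant: backslash every ASCII non-alphanumeric.
_NON_ALNUM = re.compile(r"([^a-zA-Z0-9])")

def escape_sub(s):
    return _NON_ALNUM.sub(r"\\\1", s)

def best_usec(func, arg, number=2000, repeat=3):
    # Best of `repeat` runs, reported in microseconds per call.
    timer = timeit.Timer(lambda: func(arg))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1e6

sample = string.printable
results = {
    "re.escape": best_usec(re.escape, sample),
    "re.sub": best_usec(escape_sub, sample),
}
for name, usec in results.items():
    print(f"{name}: {usec:.1f} usec per call")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that the command-line tool prints.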
msg132087 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-03-25 13:25
The attached patch (issue2650.diff) adds '_' to the list of chars that are not escaped.
msg132838 - (view)	Author: Ezio Melotti (ezio.melotti) *	Date: 2011-04-03 14:45
Georg, do you think a versionchanged note should be added for this? The change is minor and the patch updates the documentation to reflect the change.
msg133461 - (view)	Author: Roundup Robot (python-dev)	Date: 2011-04-10 09:59
New changeset dda33191f7f5 by Ezio Melotti in branch 'default': #2650: re.escape() no longer escapes the "_". http://hg.python.org/cpython/rev/dda33191f7f5

History
Date	User	Action	Args
2022-04-11 14:56:33	admin	set	github: 46902
2011-04-10 10:00:23	ezio.melotti	set	status: open -> closed resolution: fixed stage: needs patch -> resolved
2011-04-10 09:59:35	python-dev	set	messages: + msg133461
2011-04-03 14:45:44	ezio.melotti	set	messages: + msg132838
2011-03-25 13:25:56	ezio.melotti	set	files: + issue2650.diff keywords: + patch messages: + msg132087
2011-03-25 12:57:26	ezio.melotti	set	messages: + msg132085
2011-03-25 12:51:07	python-dev	set	messages: + msg132084
2011-03-25 12:27:14	python-dev	set	nosy: + python-dev messages: + msg132082
2011-03-14 23:23:56	ezio.melotti	set	assignee: ezio.melotti messages: + msg130930 nosy: georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth
2011-03-14 14:46:47	SilentGhost	set	keywords: - patch nosy: georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth messages: + msg130821
2011-03-14 01:38:11	ezio.melotti	set	nosy: georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth messages: + msg130783
2011-03-13 03:01:35	SilentGhost	set	files: + test_re.diff nosy: georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth messages: + msg130719
2011-02-23 12:30:07	SilentGhost	set	files: - test_re.diff nosy: georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth
2011-02-23 12:30:02	SilentGhost	set	files: + test_re.diff nosy: georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth
2011-01-13 17:29:37	akuchling	set	nosy: - akuchling
2011-01-13 17:13:40	swamiyeswanth	set	nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth messages: + msg126185
2011-01-13 17:12:03	foom	set	nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth messages: + msg126184
2011-01-13 16:48:32	SilentGhost	set	nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth messages: + msg126181
2011-01-13 16:09:02	foom	set	nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth messages: + msg126177
2011-01-13 15:48:07	SilentGhost	set	files: + test_re.diff nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth messages: + msg126176
2011-01-13 15:44:42	SilentGhost	set	files: - issue2650.diff nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth
2011-01-13 13:49:24	SilentGhost	set	files: + issue2650.diff nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, SilentGhost, swamiyeswanth messages: + msg126168
2011-01-13 00:46:31	SilentGhost	set	nosy: + SilentGhost
2011-01-12 21:21:03	pitrou	set	nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, swamiyeswanth messages: + msg126141 stage: needs patch
2011-01-12 21:09:51	swamiyeswanth	set	nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett, swamiyeswanth messages: + msg126138
2011-01-11 18:42:30	swamiyeswanth	set	nosy: + swamiyeswanth
2011-01-08 11:29:56	pitrou	set	nosy: akuchling, georg.brandl, amaury.forgeotdarc, belopolsky, foom, pitrou, rsc, timehorse, benjamin.peterson, zanella, donlorenzo, ezio.melotti, bjourne, mortenlj, mrabarnett messages: + msg125778 versions: + Python 3.2, - Python 3.1, Python 2.7
2011-01-08 09:34:05	georg.brandl	set	nosy: + georg.brandl messages: + msg125770
2011-01-08 03:25:39	foom	set	nosy: + foom messages: + msg125746
2010-11-25 20:51:07	mrabarnett	set	nosy: + mrabarnett messages: + msg122408
2009-09-12 16:21:12	bjourne	set	nosy: + bjourne messages: + msg92548
2009-04-29 11:03:57	ezio.melotti	set	nosy: + ezio.melotti
2008-09-28 18:55:25	timehorse	set	nosy: + timehorse
2008-09-28 18:53:53	timehorse	set	versions: + Python 3.1, Python 2.7, - Python 2.6, Python 3.0
2008-06-28 20:51:59	mortenlj	set	messages: + msg68910
2008-06-28 19:20:19	pitrou	set	messages: + msg68895
2008-06-26 15:18:38	amaury.forgeotdarc	set	nosy: + amaury.forgeotdarc messages: + msg68786
2008-06-26 14:45:08	mortenlj	set	nosy: + mortenlj messages: + msg68785
2008-06-14 15:28:52	pitrou	set	nosy: + pitrou messages: + msg68208
2008-05-08 17:01:02	akuchling	set	nosy: + akuchling messages: + msg66423
2008-05-08 16:19:13	rsc	set	messages: + msg66422
2008-05-08 16:12:27	belopolsky	set	messages: + msg66421
2008-05-08 15:45:38	rsc	set	messages: + msg66420
2008-05-08 15:08:51	belopolsky	set	messages: + msg66419
2008-05-08 14:36:09	rsc	set	messages: + msg66418
2008-05-08 14:08:28	belopolsky	set	nosy: + belopolsky messages: + msg66416
2008-05-08 01:35:58	zanella	set	files: + re_patch.diff nosy: + zanella messages: + msg66386
2008-04-28 17:24:25	donlorenzo	set	files: + re.patch nosy: + donlorenzo messages: + msg65923
2008-04-24 12:38:29	rsc	set	files: + re.patch messages: + msg65721
2008-04-24 01:30:04	benjamin.peterson	set	messages: + msg65708
2008-04-23 23:39:05	rsc	set	files: + re.patch keywords: + patch
2008-04-18 01:19:36	benjamin.peterson	set	messages: + msg65600
2008-04-18 01:08:16	rsc	set	messages: + msg65599
2008-04-17 20:44:15	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg65590
2008-04-17 14:33:38	gvanrossum	set	versions: + Python 2.6, Python 3.0, - Python 2.5
2008-04-17 14:33:24	gvanrossum	set	keywords: + easy
2008-04-17 14:14:25	rsc	set	components: + Regular Expressions
2008-04-17 14:14:08	rsc	create