classification
Title: re.escape should not escape underscore
Type: behavior
Components: Regular Expressions Versions: Python 3.0, Python 2.6
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: akuchling, belopolsky, benjamin.peterson, donlorenzo, rsc, zanella
Priority: Keywords: easy, patch

Created on 2008-04-17 14:14 by rsc, last changed 2008-05-08 17:01 by akuchling.

Files
File name Uploaded Description Edit Remove
re.patch rsc, 2008-04-23 23:39 patch to python svn trunk (2.6a2+)
re.patch rsc, 2008-04-24 12:38
re.patch donlorenzo, 2008-04-28 17:24 patch using enumerate and frozenset
re_patch.diff zanella, 2008-05-08 01:35 Compilation of previous patches, testing for more characters included on the ASCII table
Messages
msg65585 (view) Author: Russ Cox (rsc) Date: 2008-04-17 14:14
import re
print re.escape("_")

Prints \_ but should be _.

This behavior differs from Perl and other systems: _ is an identifier
character and as such does not need to be escaped.
msg65590 (view) Author: Benjamin Peterson (benjamin.peterson) Date: 2008-04-17 20:44
It seems that escape is pretty dumb. The documentations says that
re.escape escapes all non-alphanumeric characters, and it does that
faithfully. It would seem more useful to have a list of meta-characters
and just escape those. This is more true in Py3k when str can have
thousands of possible characters that could be considered alphanumeric.
msg65599 (view) Author: Russ Cox (rsc) Date: 2008-04-18 01:08
> It seems that escape is pretty dumb. The documentations says that
> re.escape escapes all non-alphanumeric characters, and it does that
> faithfully. It would seem more useful to have a list of meta-characters
> and just escape those. This is more true in Py3k when str can have
> thousands of possible characters that could be considered alphanumeric.

The usual convention is to escape everything that is
ASCII and not A-Za-z0-9_, in case other punctuation
becomes special in the future.  But I agree -- escaping
just the actual special characters makes the most sense.

Russ
msg65600 (view) Author: Benjamin Peterson (benjamin.peterson) Date: 2008-04-18 01:19
Would you like to work on a patch?
msg65708 (view) Author: Benjamin Peterson (benjamin.peterson) Date: 2008-04-24 01:30
Thanks.

The loop in escape should really use enumerate instead of "for i in
range(len(pattern))".

Instead of using a loop, can't the test just use
"self.assertEqual(re.esacpe(same), same)?" Also, please add tests for
what re.escape should escape.
msg65721 (view) Author: Russ Cox (rsc) Date: 2008-04-24 12:38
> The loop in escape should really use enumerate 
> instead of "for i in range(len(pattern))".

It needs i to edit s[i].

> Instead of using a loop, can't the test just
> use "self.assertEqual(re.esacpe(same), same)?" 

Done.

> Also, please add tests for what re.escape should escape.

That's handled in the existing test over all bytes 0-255.
msg65923 (view) Author: Lorenz Quack (donlorenzo) Date: 2008-04-28 17:23
>> The loop in escape should really use enumerate 
>> instead of "for i in range(len(pattern))".
>
>It needs i to edit s[i].

enumerate(iterable) returns a tuple for each element in iterable
containing the index and the element itself.

I attached a patch using enumerate. The patch also uses a frozenset
rather than a dict for the special characters.
msg66386 (view) Author: Rafael Zanella (zanella) Date: 2008-05-08 01:35
AFAIK the lookup on dictionaries is faster than on lists.

Patch added, mainly a compilation of the previous patches with an
expanded test.
msg66416 (view) Author: Alexander Belopolsky (belopolsky) Date: 2008-05-08 14:08
Lorenz's patch uses a set, not a list for special characters.  Set 
lookup is as fast as dict lookup, but a set takes less memory because it 
does not have to store dummy values.  More importantly, use of frozenset 
instead of dict makes the code clearer.  On the other hand, I would 
simply use a string.  For a dozen entries, hash lookup does not buy you 
much.

Another nit: why use "\\%c" % (c) instead of obvious "\\" + c?

Finally, you can eliminate use of index and a temporary list altogether 
by using a generator expression:

''.join(("\\" + c if c in _special else '\\000' if c == "\000" else c),
        for c in pattern)
msg66418 (view) Author: Russ Cox (rsc) Date: 2008-05-08 14:36
> Lorenz's patch uses a set, not a list for special characters.  Set 
> lookup is as fast as dict lookup, but a set takes less memory because it 
> does not have to store dummy values.  More importantly, use of frozenset 
> instead of dict makes the code clearer.  On the other hand, I would 
> simply use a string.  For a dozen entries, hash lookup does not buy you 
> much.
> 
> Another nit: why use "\\%c" % (c) instead of obvious "\\" + c?
> 
> Finally, you can eliminate use of index and a temporary list altogether 
> by using a generator expression:
> 
> ''.join(("\\" + c if c in _special else '\\000' if c == "\000" else c),
>         for c in pattern)

The title of this issue (#2650) is "re.escape should not escape underscore",
not "re.escape is too slow and too easy to read".

If you have an actual, measured performance problem with re.escape,
please open a new issue with numbers to back it up. 
That's not what this one is about.

Thanks.
Russ
msg66419 (view) Author: Alexander Belopolsky (belopolsky) Date: 2008-05-08 15:08
On Thu, May 8, 2008 at 10:36 AM, Russ Cox <report@bugs.python.org> wrote:
..
>  The title of this issue (#2650) is "re.escape should not escape underscore",
>  not "re.escape is too slow and too easy to read".
>

Neither does the title say "re.escape should only escape
.^$*+?{}[]\|()".  I reviewed the patch rather than its conformance
with the title.  (BTW, the patch does not update documentation in
Doc/library/re.rst.)

>  If you have an actual, measured performance problem with re.escape,
>  please open a new issue with numbers to back it up.
>  That's not what this one is about.

You don't need to get so defensive.  I did not raise a performance
problem, I was simply responding to Rafael's "AFAIK the lookup on
dictionaries is faster than on lists" comment.  I did not say that you
*should* rewrite your patch the way I suggested, only that you *can*
use new language features to simplify the code.

In any case, I am -0 on the patch.  The current documentation says:

"""
escape(string)

   Return *string* with all non-alphanumerics backslashed; this is useful if you
   want to match an arbitrary literal string that may have regular expression
   metacharacters in it.

"""

and the current implementation serves the intended use case well.  I
did not see a compelling use case presented for the change.  On the
downside, since there is no mechanism to assure that _special indeed
contains all re metacharacters, it may present a maintenance problem
if additional metacharacters are added in the future.
msg66420 (view) Author: Russ Cox (rsc) Date: 2008-05-08 15:44
> You don't need to get so defensive.  I did not raise a performance
> problem, I was simply responding to Rafael's "AFAIK the lookup on
> dictionaries is faster than on lists" comment.  I did not say that you
> *should* rewrite your patch the way I suggested, only that you *can*
> use new language features to simplify the code.

I was responding to the entire thread more than your mail.
I'm frustrated because the only substantial discussion has
focused on details of how to implement set lookup the fastest
in a function that likely doesn't matter for speed.

> In any case, I am -0 on the patch.  The current documentation says:

Now these are the kinds of comments I was hoping for.
Thank you.

>    Return *string* with all non-alphanumerics backslashed; this is useful if you
>    want to match an arbitrary literal string that may have regular expression
>    metacharacters in it.

Sure; the documentation is wrong too.

> I did not see a compelling use case presented for the change.  

The usual convention in regular expressions is that escaping
a word character means you intend a special meaning, and
underscore is a word character.  Even though the current re
module does accept \_ as synonymous with _ (just as it accepts
\q as synonymous with q), it is no more correct to escape _ than
to escape q.

I think it is fine to escape all non-word characters, but someone
else suggested that it would be easier when moving to larger
character sets to escape just the special ones.  I'm happy with
either version.

My argument is only that Python should behave the same in 
this respect as other systems that use substantially the same
regular expressions.

> since there is no mechanism to assure that _special indeed
> contains all re metacharacters, it may present a maintenance problem
> if additional metacharacters are added in the future.

The test suite will catch these easily, since it checks that 
re.escape(c) matches c for all characters c.  But again, I'm happy
with escaping all ASCII non-word characters.

Russ
msg66421 (view) Author: Alexander Belopolsky (belopolsky) Date: 2008-05-08 16:12
On Thu, May 8, 2008 at 11:45 AM, Russ Cox <report@bugs.python.org> wrote:
..
>  My argument is only that Python should behave the same in
>  this respect as other systems that use substantially the same
>  regular expressions.
>

This is not enough to justify the change in my view.  After all, "A
Foolish Consistency is the Hobgoblin of Little Minds"
<http://www.python.org/dev/peps/pep-0008/>;.

I don't know if there is much code out there that relies on the
current behavior, but technically speaking, this is an incompatible
change.  A backward compatible way to add your desired functionality
would be to add the "escape_special" function, but not every useful
3-line function belongs to stdlib.

This said, I would prefer simply adding '_' to _alphanum over _special
approach, but still -1 on the whole idea.
msg66422 (view) Author: Russ Cox (rsc) Date: 2008-05-08 16:19
On Thu, May 8, 2008 at 12:12 PM, Alexander Belopolsky
<report@bugs.python.org> wrote:
>
> Alexander Belopolsky <belopolsky@users.sourceforge.net> added the comment:
>
> On Thu, May 8, 2008 at 11:45 AM, Russ Cox <report@bugs.python.org> wrote:
> ..
>>  My argument is only that Python should behave the same in
>>  this respect as other systems that use substantially the same
>>  regular expressions.
>>
>
> This is not enough to justify the change in my view.  After all, "A
> Foolish Consistency is the Hobgoblin of Little Minds"
> <http://www.python.org/dev/peps/pep-0008/>;.
>
> I don't know if there is much code out there that relies on the
> current behavior, but technically speaking, this is an incompatible
> change.  A backward compatible way to add your desired functionality
> would be to add the "escape_special" function, but not every useful
> 3-line function belongs to stdlib.

In my mind, arguing that re.escape can't possibly be changed
due to imagined backward incompatibilities is the foolish consistency.

> This said, I would prefer simply adding '_' to _alphanum over _special
> approach, but still -1 on the whole idea.

I don't use Python enough to care one way or the other.
I noticed a bug, I reported it.  Y'all are welcome to do
as you see fit.

Russ
msg66423 (view) Author: A.M. Kuchling (akuchling) Date: 2008-05-08 17:00
I haven't assessed the patch, but wouldn't mind to see it applied to 
an alpha release or to 3.0; +0 from me.  Given that the next 2.6 release
is planned to be a beta, though, the release manager would have to rule.

Note that I don't think this change is actually backwards-incompatible
and is actually fairly low-risk.  It does change what re.escape will
return, but the common use-case is escaping some user- or data-supplied
string so that it can be passed to re.compile()without triggering a
syntax error or very long loop.  In that use-case, whether it returns _
or \_ is immaterial; the result is the same.  Doing a Google code search
for re.escape confirms that this is the general usage.

Interestingly, SCons defines its own re_escape, with a comment saying '#
re.escape escapes too much'.  But their function doesn't escape \ or $
at all, so I don't understand why they bothered.

On the other hand, if this patch doesn't affect the usage of the
function, why bother?  Matching Perl or other systems probably won't
improve interoperability very much, so the release manager might decide
to leave well enough alone.
History
Date User Action Args
2008-05-08 17:01:02akuchlingsetnosy: + akuchling
messages: + msg66423
2008-05-08 16:19:13rscsetmessages: + msg66422
2008-05-08 16:12:27belopolskysetmessages: + msg66421
2008-05-08 15:45:38rscsetmessages: + msg66420
2008-05-08 15:08:51belopolskysetmessages: + msg66419
2008-05-08 14:36:09rscsetmessages: + msg66418
2008-05-08 14:08:28belopolskysetnosy: + belopolsky
messages: + msg66416
2008-05-08 01:35:58zanellasetfiles: + re_patch.diff
nosy: + zanella
messages: + msg66386
2008-04-28 17:24:25donlorenzosetfiles: + re.patch
nosy: + donlorenzo
messages: + msg65923
2008-04-24 12:38:29rscsetfiles: + re.patch
messages: + msg65721
2008-04-24 01:30:04benjamin.petersonsetmessages: + msg65708
2008-04-23 23:39:05rscsetfiles: + re.patch
keywords: + patch
2008-04-18 01:19:36benjamin.petersonsetmessages: + msg65600
2008-04-18 01:08:16rscsetmessages: + msg65599
2008-04-17 20:44:15benjamin.petersonsetnosy: + benjamin.peterson
messages: + msg65590
2008-04-17 14:33:38gvanrossumsetversions: + Python 2.6, Python 3.0, - Python 2.5
2008-04-17 14:33:24gvanrossumsetkeywords: + easy
2008-04-17 14:14:25rscsetcomponents: + Regular Expressions
2008-04-17 14:14:08rsccreate