This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: with should be as fast as try/finally
Type: enhancement Stage:
Components: Interpreter Core Versions: Python 2.6
process
Status: closed Resolution: accepted
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, belopolsky, benjamin.peterson, christian.heimes, collinwinter, jcea, jyasskin, ncoghlan, nnorwitz, rhettinger
Priority: normal Keywords: patch

Created on 2008-02-24 23:10 by jyasskin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
faster_with.patch jyasskin, 2008-03-03 00:29 proof of concept
faster_with.patch jyasskin, 2008-03-03 02:34 complete
Messages (10)
msg62951 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2008-02-24 23:10
Currently, 'with' costs about .2us over try/finally:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'lock.acquire()' 'try: pass' 'finally:
lock.release()'
1000000 loops, best of 3: 0.617 usec per loop
$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'with lock: pass'
1000000 loops, best of 3: 0.774 usec per loop

Since it's doing the same thing (and calling the same C functions to do
the lock manipulation), it shouldn't take more time. The bytecodes
associated with the two options look like:

2)    with lock:
3)      pass

 2           0 LOAD_GLOBAL              0 (lock)
              3 DUP_TOP             
              4 LOAD_ATTR                1 (__exit__)
              7 STORE_FAST               0 (_[1])
             10 LOAD_ATTR                2 (__enter__)
             13 CALL_FUNCTION            0
             16 POP_TOP             
             17 SETUP_FINALLY            4 (to 24)

  3          20 POP_BLOCK           
             21 LOAD_CONST               0 (None)
        >>   24 LOAD_FAST                0 (_[1])
             27 DELETE_FAST              0 (_[1])
             30 WITH_CLEANUP        
             31 END_FINALLY         
             32 LOAD_CONST               0 (None)
             35 RETURN_VALUE        


6)    lock.acquire()
7)    try:
8)      pass
9)    finally:
10)     lock.release()

  6           0 LOAD_GLOBAL              0 (lock)
              3 LOAD_ATTR                1 (acquire)
              6 CALL_FUNCTION            0
              9 POP_TOP             

  7          10 SETUP_FINALLY            4 (to 17)

  8          13 POP_BLOCK           
             14 LOAD_CONST               0 (None)

 10     >>   17 LOAD_GLOBAL              0 (lock)
             20 LOAD_ATTR                2 (release)
             23 CALL_FUNCTION            0
             26 POP_TOP             
             27 END_FINALLY         
             28 LOAD_CONST               0 (None)
             31 RETURN_VALUE


The major difference I see is the extra local variable (_[1]) used by
with. It looks like the compiler ought to be able to save that on the
stack instead, and save 3 opcodes. Neal Norwitz suggested that, if that
optimization is impossible, WITH_CLEANUP could be modified to take the
variable as a parameter, which would let it absorb the LOAD_FAST and
DELETE_FAST instructions.

I've added everyone on the previous bug to the nosy list. Sorry if you
don't care. :)
msg63165 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-03-01 16:24
A closer approximation of what the with statement is doing would be:

exit = lock.release()
lock.acquire()
try:
    pass
finally:
    exit()

The problem with trying to store the result of the retrieval of __exit__
on the stack is that we need to leave the context manager itself on top
of the stack for the next LOAD_ATTR opcode (when we're retrieving the
__enter__ method).

However, changing WITH_CLEANUP to take an argument indicating which
local variable holds the bound __exit__ method sounds like it might work.
msg63167 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-03-01 16:34
Scratch the parentheses on that first line of sample code in my previous
comment (isn't copy and paste wonderful?)
msg63197 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2008-03-03 00:29
Here's a proof-of-concept patch that keeps the __exit__ method on the
stack. It uses ROT_TWO to stuff it under the context object instead of
storing it into a temporary. (Thanks Nick for pointing out that problem
before I had to waste time on it.) test_with passes, although I need to
update several more things and maybe fix a refleak.

The patch changes the compilation of:

    def with_(l):
        with l:
            pass

from

  4           0 LOAD_FAST                0 (l)
              3 DUP_TOP             
              4 LOAD_ATTR                0 (__exit__)
              7 STORE_FAST               1 (_[1])
             10 LOAD_ATTR                1 (__enter__)
             13 CALL_FUNCTION            0
             16 POP_TOP             
             17 SETUP_FINALLY            4 (to 24)

  5          20 POP_BLOCK           
             21 LOAD_CONST               0 (None)
        >>   24 LOAD_FAST                1 (_[1])
             27 DELETE_FAST              1 (_[1])
             30 WITH_CLEANUP        
             31 END_FINALLY         
             32 LOAD_CONST               0 (None)
             35 RETURN_VALUE        

to

  4           0 LOAD_FAST                0 (l)
              3 DUP_TOP             
              4 LOAD_ATTR                0 (__exit__)
              7 ROT_TWO             
              8 LOAD_ATTR                1 (__enter__)
             11 CALL_FUNCTION            0
             14 POP_TOP             
             15 SETUP_FINALLY            4 (to 22)

  5          18 POP_BLOCK           
             19 LOAD_CONST               0 (None)
        >>   22 WITH_CLEANUP        
             23 END_FINALLY         
             24 LOAD_CONST               0 (None)
             27 RETURN_VALUE        


And speeds it up from:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'with lock: pass'
1000000 loops, best of 3: 0.832 usec per loop

to:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'with lock: pass'
1000000 loops, best of 3: 0.762 usec per loop


That's only half of the way to parity with try/finally:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'lock.acquire()' 'try: pass' 'finally:
lock.release()'
1000000 loops, best of 3: 0.638 usec per loop

What's strange is that calling __enter__ and __exit__ in a try/finally
block brings the speed back to the faster 'with' speed, even though they
call the same C functions:

$ ./python.exe -m timeit -s 'import thread; lock =
thread.allocate_lock()' 'lock.__enter__()' 'try: pass' 'finally:
lock.__exit__()'
1000000 loops, best of 3: 0.754 usec per loop

Any ideas?
msg63200 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2008-03-03 02:34
Now with documentation, a working test_compile, and one less refleak.
msg63323 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-06 15:53
> What's strange is that calling __enter__ and __exit__ in a 
> try/finally block brings the speed back to the faster 'with' speed, 
> even though they call the same C functions

Looking carefully at the code, there are two reasons for this:
- LockType has no methods! try "dir(thread.LockType)". Instead, LockType
defines a tp_getattr which does a *linear* search in a PyMethodDef
array. First items are served faster...
- After converting this to use the usual tp_methods slot, there is still
a performance difference, due to the fact that release() is a
METH_NOARGS method, while __exit__() uses METH_VARARGS.

Phew.
msg63353 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-03-07 12:01
Patch applied cleanly for me and all tests pass.

It also looked good on a visual scan over the diff text.
msg63355 - (view) Author: Nick Coghlan (ncoghlan) * (Python committer) Date: 2008-03-07 14:17
I went ahead and committed the change to the bytecode generation as r61290.

The deficiencies in the lock implementation should probably be raised as
a separate issue, so I'm closing this one.
msg63356 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2008-03-07 14:43
Hm, my tests do not see any speedup with this patch.
I used VS2005 on win2K, and VS2008 on winXP.
Timings are very similar before and after this patch.

Maybe the optimization is only useful with gcc?
msg63425 - (view) Author: Jeffrey Yasskin (jyasskin) * (Python committer) Date: 2008-03-09 17:03
Thanks Nick and Amaury!

Amaury, what times are you seeing? It could be a just-gcc speedup, but I
wouldn't have thought so since the patch eliminates 2 times around the
eval loop. It could be that the overhead of WITH_CLEANUP is high enough
to drown out those iterations. You had mentioned optimizing the
PyObject_CallFunctionObjArgs() call before. If you're still seeing with
significantly slower than try, that's probably the right place to look.

Here are my current timings. To avoid the lock issues, I wrote
simple_cm.py containing:

class CM(object):
    def __enter__(self):
        pass
    def __exit__(self, *args):
        pass

$ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()'
'with cm: pass'
1000000 loops, best of 3: 0.885 usec per loop
$ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()'
'cm.__enter__()' 'try: pass' 'finally: cm.__exit__()'
1000000 loops, best of 3: 0.858 usec per loop

If __exit__ doesn't contain *args (making it not a context manager), the
try/finally time goes down to:
$ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()'
'cm.__enter__()' 'try: pass' 'finally: cm.__exit__()'
1000000 loops, best of 3: 0.619 usec per loop

I think in theory, with could be slightly faster than finally with the
same argument list, but it's pretty close now.
History
Date User Action Args
2022-04-11 14:56:31adminsetgithub: 46432
2008-03-09 17:03:40jyasskinsetmessages: + msg63425
2008-03-08 00:08:03jceasetnosy: + jcea
2008-03-07 14:43:02amaury.forgeotdarcsetmessages: + msg63356
2008-03-07 14:17:51ncoghlansetstatus: open -> closed
resolution: accepted
messages: + msg63355
2008-03-07 12:01:48ncoghlansetmessages: + msg63353
2008-03-06 15:54:00amaury.forgeotdarcsetmessages: + msg63323
2008-03-03 15:27:57belopolskysetnosy: + belopolsky
2008-03-03 02:34:12jyasskinsetfiles: + faster_with.patch
messages: + msg63200
2008-03-03 00:29:46jyasskinsetfiles: + faster_with.patch
keywords: + patch
messages: + msg63197
2008-03-01 16:34:03ncoghlansetmessages: + msg63167
2008-03-01 16:24:17ncoghlansetnosy: + ncoghlan
messages: + msg63165
2008-02-25 00:56:30collinwintersetnosy: + collinwinter
2008-02-24 23:10:40jyasskincreate