Issue 2179: with should be as fast as try/finally

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/46432

classification

Title:	with should be as fast as try/finally
Type:	enhancement	Stage:
Components:	Interpreter Core	Versions:	Python 2.6

process

Status:	closed	Resolution:	accepted
Dependencies:		Superseder:
Assigned To:		Nosy List:	amaury.forgeotdarc, belopolsky, benjamin.peterson, christian.heimes, collinwinter, jcea, jyasskin, ncoghlan, nnorwitz, rhettinger
Priority:	normal	Keywords:	patch

Created on 2008-02-24 23:10 by jyasskin, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
faster_with.patch	jyasskin, 2008-03-03 00:29	proof of concept
faster_with.patch	jyasskin, 2008-03-03 02:34	complete

Messages (10)
msg62951 - (view)	Author: Jeffrey Yasskin (jyasskin) *	Date: 2008-02-24 23:10
Currently, 'with' costs about .2us over try/finally: $ ./python.exe -m timeit -s 'import thread; lock = thread.allocate_lock()' 'lock.acquire()' 'try: pass' 'finally: lock.release()' 1000000 loops, best of 3: 0.617 usec per loop $ ./python.exe -m timeit -s 'import thread; lock = thread.allocate_lock()' 'with lock: pass' 1000000 loops, best of 3: 0.774 usec per loop Since it's doing the same thing (and calling the same C functions to do the lock manipulation), it shouldn't take more time. The bytecodes associated with the two options look like: 2) with lock: 3) pass 2 0 LOAD_GLOBAL 0 (lock) 3 DUP_TOP 4 LOAD_ATTR 1 (__exit__) 7 STORE_FAST 0 (_[1]) 10 LOAD_ATTR 2 (__enter__) 13 CALL_FUNCTION 0 16 POP_TOP 17 SETUP_FINALLY 4 (to 24) 3 20 POP_BLOCK 21 LOAD_CONST 0 (None) >> 24 LOAD_FAST 0 (_[1]) 27 DELETE_FAST 0 (_[1]) 30 WITH_CLEANUP 31 END_FINALLY 32 LOAD_CONST 0 (None) 35 RETURN_VALUE 6) lock.acquire() 7) try: 8) pass 9) finally: 10) lock.release() 6 0 LOAD_GLOBAL 0 (lock) 3 LOAD_ATTR 1 (acquire) 6 CALL_FUNCTION 0 9 POP_TOP 7 10 SETUP_FINALLY 4 (to 17) 8 13 POP_BLOCK 14 LOAD_CONST 0 (None) 10 >> 17 LOAD_GLOBAL 0 (lock) 20 LOAD_ATTR 2 (release) 23 CALL_FUNCTION 0 26 POP_TOP 27 END_FINALLY 28 LOAD_CONST 0 (None) 31 RETURN_VALUE The major difference I see is the extra local variable (_[1]) used by with. It looks like the compiler ought to be able to save that on the stack instead, and save 3 opcodes. Neal Norwitz suggested that, if that optimization is impossible, WITH_CLEANUP could be modified to take the variable as a parameter, which would let it absorb the LOAD_FAST and DELETE_FAST instructions. I've added everyone on the previous bug to the nosy list. Sorry if you don't care. :)
msg63165 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-03-01 16:24
A closer approximation of what the with statement is doing would be: exit = lock.release() lock.acquire() try: pass finally: exit() The problem with trying to store the result of the retrieval of __exit__ on the stack is that we need to leave the context manager itself on top of the stack for the next LOAD_ATTR opcode (when we're retrieving the __enter__ method). However, changing WITH_CLEANUP to take an argument indicating which local variable holds the bound __exit__ method sounds like it might work.
msg63167 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-03-01 16:34
Scratch the parentheses on that first line of sample code in my previous comment (isn't copy and paste wonderful?)
msg63197 - (view)	Author: Jeffrey Yasskin (jyasskin) *	Date: 2008-03-03 00:29
Here's a proof-of-concept patch that keeps the __exit__ method on the stack. It uses ROT_TWO to stuff it under the context object instead of storing it into a temporary. (Thanks Nick for pointing out that problem before I had to waste time on it.) test_with passes, although I need to update several more things and maybe fix a refleak. The patch changes the compilation of: def with_(l): with l: pass from 4 0 LOAD_FAST 0 (l) 3 DUP_TOP 4 LOAD_ATTR 0 (__exit__) 7 STORE_FAST 1 (_[1]) 10 LOAD_ATTR 1 (__enter__) 13 CALL_FUNCTION 0 16 POP_TOP 17 SETUP_FINALLY 4 (to 24) 5 20 POP_BLOCK 21 LOAD_CONST 0 (None) >> 24 LOAD_FAST 1 (_[1]) 27 DELETE_FAST 1 (_[1]) 30 WITH_CLEANUP 31 END_FINALLY 32 LOAD_CONST 0 (None) 35 RETURN_VALUE to 4 0 LOAD_FAST 0 (l) 3 DUP_TOP 4 LOAD_ATTR 0 (__exit__) 7 ROT_TWO 8 LOAD_ATTR 1 (__enter__) 11 CALL_FUNCTION 0 14 POP_TOP 15 SETUP_FINALLY 4 (to 22) 5 18 POP_BLOCK 19 LOAD_CONST 0 (None) >> 22 WITH_CLEANUP 23 END_FINALLY 24 LOAD_CONST 0 (None) 27 RETURN_VALUE And speeds it up from: $ ./python.exe -m timeit -s 'import thread; lock = thread.allocate_lock()' 'with lock: pass' 1000000 loops, best of 3: 0.832 usec per loop to: $ ./python.exe -m timeit -s 'import thread; lock = thread.allocate_lock()' 'with lock: pass' 1000000 loops, best of 3: 0.762 usec per loop That's only half of the way to parity with try/finally: $ ./python.exe -m timeit -s 'import thread; lock = thread.allocate_lock()' 'lock.acquire()' 'try: pass' 'finally: lock.release()' 1000000 loops, best of 3: 0.638 usec per loop What's strange is that calling __enter__ and __exit__ in a try/finally block brings the speed back to the faster 'with' speed, even though they call the same C functions: $ ./python.exe -m timeit -s 'import thread; lock = thread.allocate_lock()' 'lock.__enter__()' 'try: pass' 'finally: lock.__exit__()' 1000000 loops, best of 3: 0.754 usec per loop Any ideas?
msg63200 - (view)	Author: Jeffrey Yasskin (jyasskin) *	Date: 2008-03-03 02:34
Now with documentation, a working test_compile, and one less refleak.
msg63323 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-03-06 15:53
> What's strange is that calling __enter__ and __exit__ in a > try/finally block brings the speed back to the faster 'with' speed, > even though they call the same C functions Looking carefully at the code, there are two reasons for this: - LockType has no methods! try "dir(thread.LockType)". Instead, LockType defines a tp_getattr which does a linear search in a PyMethodDef array. First items are served faster... - After converting this to use the usual tp_methods slot, there is still a performance difference, due to the fact that release() is a METH_NOARGS method, while __exit__() uses METH_VARARGS. Phew.
msg63353 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-03-07 12:01
Patch applied cleanly for me and all tests pass. It also looked good on a visual scan over the diff text.
msg63355 - (view)	Author: Nick Coghlan (ncoghlan) *	Date: 2008-03-07 14:17
I went ahead and committed the change to the bytecode generation as r61290. The deficiencies in the lock implementation should probably be raised as a separate issue, so I'm closing this one.
msg63356 - (view)	Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) *	Date: 2008-03-07 14:43
Hm, my tests do not see any speedup with this patch. I used VS2005 on win2K, and VS2008 on winXP. Timings are very similar before and after this patch. Maybe the optimization is only useful with gcc?
msg63425 - (view)	Author: Jeffrey Yasskin (jyasskin) *	Date: 2008-03-09 17:03
Thanks Nick and Amaury! Amaury, what times are you seeing? It could be a just-gcc speedup, but I wouldn't have thought so since the patch eliminates 2 times around the eval loop. It could be that the overhead of WITH_CLEANUP is high enough to drown out those iterations. You had mentioned optimizing the PyObject_CallFunctionObjArgs() call before. If you're still seeing with significantly slower than try, that's probably the right place to look. Here are my current timings. To avoid the lock issues, I wrote simple_cm.py containing: class CM(object): def __enter__(self): pass def __exit__(self, args): pass $ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()' 'with cm: pass' 1000000 loops, best of 3: 0.885 usec per loop $ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()' 'cm.__enter__()' 'try: pass' 'finally: cm.__exit__()' 1000000 loops, best of 3: 0.858 usec per loop If __exit__ doesn't contain args (making it not a context manager), the try/finally time goes down to: $ ./python.exe -m timeit -s 'import simple_cm; cm = simple_cm.CM()' 'cm.__enter__()' 'try: pass' 'finally: cm.__exit__()' 1000000 loops, best of 3: 0.619 usec per loop I think in theory, with could be slightly faster than finally with the same argument list, but it's pretty close now.

History
Date	User	Action	Args
2022-04-11 14:56:31	admin	set	github: 46432
2008-03-09 17:03:40	jyasskin	set	messages: + msg63425
2008-03-08 00:08:03	jcea	set	nosy: + jcea
2008-03-07 14:43:02	amaury.forgeotdarc	set	messages: + msg63356
2008-03-07 14:17:51	ncoghlan	set	status: open -> closed resolution: accepted messages: + msg63355
2008-03-07 12:01:48	ncoghlan	set	messages: + msg63353
2008-03-06 15:54:00	amaury.forgeotdarc	set	messages: + msg63323
2008-03-03 15:27:57	belopolsky	set	nosy: + belopolsky
2008-03-03 02:34:12	jyasskin	set	files: + faster_with.patch messages: + msg63200
2008-03-03 00:29:46	jyasskin	set	files: + faster_with.patch keywords: + patch messages: + msg63197
2008-03-01 16:34:03	ncoghlan	set	messages: + msg63167
2008-03-01 16:24:17	ncoghlan	set	nosy: + ncoghlan messages: + msg63165
2008-02-25 00:56:30	collinwinter	set	nosy: + collinwinter
2008-02-24 23:10:40	jyasskin	create