Classification
Title: Support the buffer protocol in code objects
Type: enhancement
Stage: patch review
Components: Interpreter Core
Versions: Python 3.8

Process
Status: open
Resolution:
Dependencies:
Superseder:
Assigned To: dino.viehland
Nosy List: brett.cannon, dino.viehland, eric.snow, inada.naoki, serhiy.storchaka
Priority: normal
Keywords: patch

Created on 2019-05-07 19:56 by dino.viehland, last changed 2019-06-05 17:26 by brett.cannon.

Pull Requests
URL: PR 13177
Status: open
Linked: dino.viehland, 2019-05-07 20:56
Messages (29)
msg341808 - (view) Author: Dino Viehland (dino.viehland) * (Python committer) Date: 2019-05-07 19:56
Many Python deployments involve large code bases that are used in scenarios where processes are fork-and-exec'd. When running in these environments, code objects can end up occupying a lot of memory. Because the code objects sit on arbitrary pages and are reference counted, the ref-count updates can prevent any of that code from being shared across processes.

If code objects supported the buffer protocol it would be possible to load code from a memory mapped file which is shared across all of the processes.
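A minimal sketch of what the proposal would enable. The mmap/memoryview half works today; the final step (passing a buffer where CodeType requires bytes) is the hypothetical part, and the file name and contents are illustrative assumptions, not part of the actual PR:

```python
import mmap
import os
import tempfile

# Stand-in for a shared, memory-mapped byte-code store.
fd, path = tempfile.mkstemp()
os.write(fd, b'd\x00S\x00')              # four bytes standing in for wordcode
os.close(fd)

with open(path, 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

view = memoryview(mm)         # zero-copy view into the (shareable) mapping
was_readonly = view.readonly  # ACCESS_READ mappings export read-only buffers
print(was_readonly)           # True

# Hypothetical, per the proposal: pass `view` wherever CodeType currently
# requires a bytes object for co_code.

view.release()                # exported buffers must be released before close
mm.close()
os.unlink(path)
```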
msg343731 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-05-28 05:00
I don't like this.
It removes the guarantee that code objects are constant/immutable.

> Because the code objects are on random pages and are ref counted the ref counts can cause all of the code to not be shared across processes.

Typically these ref counts are not incremented or decremented until shutdown.
If you want to avoid copy-on-write at shutdown, you can use os._exit.
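The os._exit trick mentioned above can be sketched as follows (POSIX-only; the work the child does is a placeholder):

```python
import os

# In a fork-based worker, exiting the child with os._exit() skips interpreter
# finalization, so the shutdown-time DECREFs never dirty the copy-on-write
# pages shared with the parent.
pid = os.fork()
if pid == 0:
    # ... child serves requests using the shared pages ...
    os._exit(0)     # exit immediately: no atexit hooks, no finalization DECREFs
_, status = os.waitpid(pid, 0)
print(os.WEXITSTATUS(status))   # 0
```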
msg343824 - (view) Author: Dino Viehland (dino.viehland) * (Python committer) Date: 2019-05-28 22:58
The PR actually checks that the buffer is read-only (this was also a concern that Mark Shannon had). And the Python buffer protocol says that you need to consistently hand out read-only buffers. So while someone could create a buffer and mutate it outside of the buffer protocol, it should really be read-only.

As far as the ref counts go, it's not just the ref counts for the byte-code strings that are potentially problematic. It's also the ref counts on all of the random other objects that share pages with the code objects, as well as other random read-write data on those pages. There's an additional benefit too: code objects not loaded before forking can continue to share their memory as well.
msg343837 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-05-28 23:58
Read-only is slightly different from const/immutable.

Const/immutable means that no one can modify the memory.

Read-only means only that the memory must not be modified through
the buffer. The underlying memory block can still be modified by
the owner who created the buffer.

For example, you can create a read-only buffer from a bytearray,
or even from a raw C char array. That doesn't violate the semantics
of read-only.
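Inada's point can be demonstrated with memoryview.toreadonly() (available since Python 3.8; the byte values here are just illustrative):

```python
ba = bytearray(b'LOAD_CONST')
ro = memoryview(ba).toreadonly()   # read-only view over mutable memory

print(ro.readonly)     # True: writes *through* the view are rejected
ba[0:4] = b'JUMP'      # ...but the owner can still mutate the underlying block
print(bytes(ro[:4]))   # b'JUMP': the read-only view observes the change
```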
msg343851 - (view) Author: Dino Viehland (dino.viehland) * (Python committer) Date: 2019-05-29 01:55
Sure, but immutable/const is almost always a language-level guarantee. The only case where that's not true is when you have OS/hardware-level memory protection, and that doesn't apply to any of Python's existing byte code.

So from a Python perspective, code objects remain immutable - they can only be created from objects which expose a read-only buffer. So, for example, passing in a memoryview(b'abc') will work here while a memoryview(bytearray(b'abc')) will fail. And because a buffer implementer needs to hand out consistent views to all callers when a read-only view is requested (https://docs.python.org/3/c-api/buffer.html#c.PyBUF_WRITABLE), this invariant should hold for all objects being passed in.
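The accept/reject distinction described here can be checked from Python today; the accept/reject behavior is the PR's, but the .readonly flags are standard:

```python
mv_bytes = memoryview(b'abc')            # view over immutable bytes
mv_ba = memoryview(bytearray(b'abc'))    # view over a mutable bytearray

print(mv_bytes.readonly)   # True  -> would pass the PR's read-only check
print(mv_ba.readonly)      # False -> would be rejected
```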

Could someone create a buffer object which still allows the underlying memory to be written? Sure. But I can already use ctypes to modify byte code today, with something like "ctypes.cast(id(f.__code__.co_code) + 32, ctypes.POINTER(ctypes.c_char))[0] = 101".

So people will still be able to do nasty things, but there are certainly
guards in place to strongly discourage them from doing so.
msg343859 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-05-29 03:15
> Could someone create a buffer object which still allows the underlying memory to be written?  Sure.  But I can use ctypes to modify byte code today as well with something like "ctypes.cast(id(f.__code__.co_code) + 32, ctypes.POINTER(ctypes.c_char)) [0] = 101" 

You are comparing apples and oranges.
Breaking the memory inside an immutable object with ctypes is far different from mutating mutable memory.

It introduces more weakness and complexity into the code object.

At the very least, you need to demonstrate the benefit.

When a module is imported, many objects are created. Why would avoiding the decref only for co_code make much of a difference?
msg343992 - (view) Author: Dino Viehland (dino.viehland) * (Python committer) Date: 2019-05-30 18:03
In the Instagram case there is about 20 MB of byte code in total, and 3-4 dozen worker processes run per server. The byte code also represents the second-largest section of memory as far as serialized code objects are concerned; the only larger one is strings (about 25 MB).

Anyway, that leads to around 1 GB of memory savings per server, which is why this was an interesting optimization to pursue when applied across many servers.
msg343993 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-05-30 19:07
I concur with Inada. This would add complexity to the code object.

Note also that the code object is not just a sequence of bytes. It is a complex object which contains references to a dozen other objects: bytes objects, strings, integers, tuples, dicts. I suppose that the byte code itself is only a tiny part of the memory consumed by code objects in a large program.
msg343996 - (view) Author: Eric Snow (eric.snow) * (Python committer) Date: 2019-05-30 19:43
FWIW, I don't see the problem with supporting any read-only "buffer" object, rather than just bytes objects, for the string of bytes in a code object.  That's all that Dino is proposing.  The change is not invasive and solves a real need.  The additional maintenance burden is negligible.  Furthermore, the accommodation makes sense conceptually.
msg343999 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2019-05-30 20:12
I agree with Eric. While I understand what Serhiy is saying about code objects being more than a bytearray of bytecode, the buffer protocol is still a view on an object, and here that view happens to cover a subset. I think that's acceptable, since I can't think of what a buffer over a code object would be other than the bytecode.

I also think read-only is a good enough guarantee. We're talking low-level stuff here and if anyone were to muck with this stuff they are already playing with fire.
msg344009 - (view) Author: Dino Viehland (dino.viehland) * (Python committer) Date: 2019-05-30 22:46
Well, given that we're one day away from 3.8 Beta 1, I'm not going to rush this in :) In particular, the discussion has made me wonder if I can also do more with sharing strings using legacy unicode strings (which I don't believe will require any runtime changes). So I'll keep pushing on this and hopefully be able to demonstrate an even larger win from the concept.
msg344013 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-05-30 23:27
> In the Instagram case there's about 20mb of byte code total and there are 3-4 dozen worker processes running per server.  The byte code also represents the second largest section of memory as far as serialized code objects are concerned, the only larger one is strings (which are about 25mb).  

How did you measure it? What percentage of the process's RAM usage is that? What does "as far as serialized code objects are concerned" mean? Aren't you concerned with in-process RAM usage?


> FWIW, I don't see the problem with supporting any read-only "buffer" object, rather than just bytes objects, for the string of bytes in a code object.  That's all that Dino is proposing.

It means the byte code can be changed at any time by anything other than the code object;
the code object is not immutable anymore. "Read-only" means only that the code object itself doesn't change co_code. Anything else can modify it without breaking Python (and Python/C API) semantics.

For example, a per-opcode cache can't assume the opcodes are unchanged.
I think there are other possible optimizations that follow from co_code being constant.

Another complexity is GC. We allow only bytes here (no subclass allowed), and that allows the code object to be a non-GC object.

If we allow any object implementing the buffer protocol, we must make the code object a GC object. We could untrack the code object when co_code is bytes (as tuples do), but then all code objects would need a 16-byte GC header (as tuples require).

Of course, we could break that rule and allow Python to create circular references that are never collected.

But please don't think that changing from bytes to arbitrary objects is a simple change.

I think it needs significant benefits for typical users, not only for Instagram.
If only Instagram gets a benefit from this, keep it as Instagram's internal patch.
msg344097 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2019-05-31 17:47
RE: "I think it needs significant benefits for typical users, not only for Instagram. If only Instagram get benefit from this, keep it as Instagram's internal patch."

But who's typical in this case? You? Me? We're talking about code objects, something the typical Python user doesn't even know exists, if you take "typical" to mean "common" or "majority". I suspect that if we asked Python developers what lnotab is an abbreviation of, they wouldn't know, let alone how it works.

And just because Instagram did this work and is making it public doesn't mean (a) others wouldn't use it or (b) others are already doing it privately. There are plenty of other companies with massive server installs of Python beyond Instagram.

My point is we have to be very careful when we start to claim we know what is typical or useful to the wider community considering how huge and diverse the Python community is.
msg344168 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-06-01 11:38
On Sat, Jun 1, 2019 at 2:47 AM Brett Cannon <report@bugs.python.org> wrote:
>
> My point is we have to be very careful when we start to claim we know what is typical or useful to the wider community considering how huge and diverse the Python community is.
>

I'm sorry for the bad word choice. I didn't mean "majority" or "common".
For example, I was +1 on gc.freeze(), even though the majority of users don't use it.

My point is that (a) it's very hard for others to utilize it, and (b) it's very hard for others to confirm the benefit.

And I suspect that Dino's estimation is wrong, but no one can verify the estimation because he didn't show enough concrete information.

Elaborating:

* They need to use a custom importer and serializer other than marshal to import code objects.
* They need to use a lightweight object for the buffer. At the least, a memoryview object is large (192 bytes on Python 3.7.3 amd64).

Even if Instagram cannot open source their import system, they can show more concrete data. For example, their report [1] is very good for considering gc.freeze().

[1] https://instagram-engineering.com/dismissing-python-garbage-collection-at-instagram-4dca40b29172

While I agree that mmapping pyc files is an interesting idea, I feel that allowing a buffer object for co_code is the wrong way to achieve it (*). But there is not enough data to discuss the benefit.

The current status is that Dino claims this idea has a benefit at Instagram, but no one can confirm that benefit. I think we shouldn't discuss this type of proposal without reliable or reproducible data.

(*) One possible idea: the code object holds just a pointer and length for co_code, plus a reference to the object that owns the data (the mmapped file). The code object could load co_lnotab and the docstring lazily too.
msg344172 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-06-01 13:12
> * They need to use a lightweight object for the buffer. At the least,
> a memoryview object is large (192 bytes on Python 3.7.3 amd64).

Actually it is larger, because you should add the size of internal objects. In 3.8:

>>> sys.getsizeof(memoryview(b''))
184
>>> sys.getsizeof(gc.get_referents(memoryview(b''))[0])
128

312 bytes total, not counting padding. The average size of co_code in Lib/*.py is 85 bytes. Unless Instagram uses gigantic functions and methods (and no comprehensions or lambdas), the net benefit will be negative.
msg344173 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-06-01 13:29
The size of the code object is at least 144 bytes.

>>> def f(): pass
... 
>>> sys.getsizeof(f.__code__)
144

If the function uses parameters, variables or constants (and you cannot do much useful without using them), their size should be added too. It overwhelms the size of the raw bytecode.
msg344200 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2019-06-01 19:09
The managed buffer can be shared:

>>> b = b'12345'
>>> m1 = memoryview(b)
>>> m2 = memoryview(m1)
>>> gc.get_referents(m1)[0]
<managedbuffer object at 0x1216fd8>
>>> gc.get_referents(m2)[0]
<managedbuffer object at 0x1216fd8>


And I understood that Dino proposed to share one code instance as a memory mapped file for *all* processes.
msg344253 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-06-02 02:38
> And I understood that Dino proposed to share one code instance as a memory mapped file for *all* processes.

What does "instance" mean? The code object? co_code?
Anyway, Dino didn't propose such a thing. He only proposed accepting a buffer object in the code constructor.
He didn't describe how the buffer object would be used, and Python doesn't provide a way to share buffer objects between processes.

There is not enough concrete information to estimate the benefits.
We're discussing imagination, not a concrete idea.

Let's stop discussing and wait for Dino to provide a concrete example that can be used to estimate the benefits.
msg344256 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-06-02 06:15
The managed buffer can be shared if it refers to the same bytes. But different code objects in the same process refer to different bytes, therefore they should have distinct managed buffers.
msg344262 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2019-06-02 09:32
On Sun, Jun 02, 2019 at 02:38:21AM +0000, Inada Naoki wrote:
> What instance means?  code object? co_code?
> Anyway, Dino didn't propose such thing.  He only proposed accepting buffer object for code constructor.
> He didn't describe how to use buffer object.  Python doesn't provide share buffer object inter processes.

He did, at a high level, in the original message:

"If code objects supported the buffer protocol it would be possible to load
 code from a memory mapped file which is shared across all of the processes."


It is not very detailed, but it gives the rationale. I assumed that the new shared memory support would be used, but it would be nice to hear more details.
msg344494 - (view) Author: Dino Viehland (dino.viehland) * (Python committer) Date: 2019-06-03 23:45
The 20 MB of savings is actually the amount of byte code that exists in the IG code base. I was just measuring the web site code, not the other various Python code in the process (e.g. no stdlib code, no 3rd-party libraries, etc.). The IG code base is pretty monolithic, and starting up the site requires about half of the code to be imported. So I think 20 MB per process is a pretty realistic number.

I've also created a C extension and the object implementing the buffer protocol looks like:

typedef struct {
    PyObject_HEAD
    const char* data;
    size_t size;
    Py_ssize_t hash;
    CIceBreaker *breaker;
    size_t exports;
    PyObject* code_obj; /* borrowed reference, the code object keeps us alive */
} CIceBreakerCode;

All of the modules currently get compiled into a single memory-mapped file, and then these objects, which implement the buffer protocol, get created for each function. The overhead is such that it just takes byte code with 16 opcodes before it breaks even, so it is significantly lighter weight than using a memoryview object.
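The break-even figure above can be sanity-checked with rough arithmetic. The 64-byte struct size is an assumption (PyObject_HEAD plus six pointer-sized fields on a 64-bit build), not a measured number:

```python
import math
import sys

struct_size = 16 + 6 * 8              # assumed sizeof(CIceBreakerCode) on 64-bit: 64
bytes_overhead = sys.getsizeof(b'')   # header cost of an empty bytes object (33 on 64-bit CPython)

# Sharing co_code drops one bytes object (header + payload) per function but
# adds one CIceBreakerCode, so the break-even payload size is:
break_even_bytes = struct_size - bytes_overhead       # 31 bytes
break_even_opcodes = math.ceil(break_even_bytes / 2)  # wordcode: 2 bytes per opcode
print(break_even_opcodes)                             # 16
```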

It's certainly true that the byte code isn't the #1 source of memory here (the code objects themselves are pretty big), but in the serialized state it ends up representing 25% of the serialized data. I would expect that when you add in ref counts and typing information it's not quite as good, but reducing the overhead of code by 20% is still a pretty nice win.

I can't make any promises about open sourcing the import system, but I can certainly look into that as well.
msg344506 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-06-04 00:30
On Tue, Jun 4, 2019 at 8:45 AM Dino Viehland <report@bugs.python.org> wrote:
>
> The 20MB of savings is actually the amount of byte code that exists in the IG code base.

Wait, do you expect a 100% saving on the co_code objects?
That co_code takes 0 bytes?
You said "the overhead it just takes a byte code w/ 16 opcodes before it breaks even" later, so I believe you understand there is overhead.

>  I was just measuring the web site code, and not the other various Python code in the process (e.g. no std lib code, no 3rd party libraries, etc...).  The IG code base is pretty monolithic and starting up the site requires about half of the code to get imported.  So I think the 20MB per process is a pretty realistic number.

How many MB does your application take? 20 MB out of 200 MB is attractive; 20 MB out of 20 GB is not.
The absolute number is not important here. You should use a ratio.

>
> It's certainly true that the byte code isn't the #1 source of memory here (the code objects themselves are pretty big), but in the serialized state it ends up representing 25% of the serialized data.

"25% of the serialized data" is misleading information.
You are trying to reduce memory usage, not serialized data size.
You should use ratio of total RAM usage, not serialized data size.

> I can't make any promises about open sourcing the import system, but I can certainly look into that as well.

Open sourcing is not necessary. Try it yourself and share the real numbers, instead of estimates.
How much RSS is used by Python, and how much of it can you actually reduce with this system?

Regards,
-- 
Inada Naoki  <songofacandy@gmail.com>
msg344529 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-06-04 06:16
One more consideration. The bytecode is actually wordcode, a sequence of 16-bit words, and the start of the array must be properly aligned. There is no problem if it is the data of a bytes object, but if you have a mapping of the pyc file, the alignment is not guaranteed. You would need to change the marshal protocol too, or save the bytecode in a separate file. That is a large change.

So you need to make several large changes (changes to the import system, changes to the pyc file) to use this feature. Supporting the buffer protocol in code objects is a minor change compared to those. If you need to patch other parts of Python, it is not hard to also patch the code object in your custom build. Other Python users will not benefit from this.
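Serhiy's point that the bytecode is a sequence of 16-bit words can be seen directly; since CPython 3.6 each instruction is a fixed 2-byte (opcode, oparg) unit:

```python
import dis

def f(x):
    return x + 1

code = f.__code__.co_code
# Wordcode is always a whole number of 2-byte units, which is why a mapped
# co_code would need at least 2-byte alignment.
print(len(code) % 2)          # 0
print(dis.opname[code[0]])    # name of the first opcode (version-dependent)
```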
msg344538 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2019-06-04 10:16
720MB <= "3-4 dozen" * 20 MB <= 960MB.  Per server.

It has all been said. :-)


I don't understand the objections about alignment. malloc() and obmalloc()
are at least 8-byte aligned, and mmap() with NULL as the first parameter
is page-aligned.
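The mmap alignment claim is easy to check from Python; using ctypes.addressof(c_char.from_buffer(...)) is just one way to recover the mapping's start address:

```python
import ctypes
import mmap

mm = mmap.mmap(-1, mmap.PAGESIZE)   # anonymous, writable mapping
addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))
print(addr % mmap.PAGESIZE)         # 0: the mapping starts on a page boundary
```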
msg344543 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-06-04 10:42
> Stefan Krah <stefan@bytereef.org> added the comment:
>
> 720MB <= "3-4 dozen" * 20 MB <= 960MB.  Per server.
>
> It has all been said. :-)

I don't understand which message you are replying to.
I'm not interested in that number. Who asked about MB per server?

The absolute number is not important when optimizing.

"I can reduce 3 sec" is not meaningful unless the total time is given.
3.5 s -> 0.5 s is great. But 60 hours -> 59h59m57s is not.

He didn't write anything about how much memory the process uses.
Additionally, he measured serialized size, not memory usage.
That doesn't make sense.

I don't even trust the 20 MB/process saving. It is only in his mind.
This proposal is based on fantasy, not on a pragmatic survey.

I think we should not spend more time discussing this until he proves the idea by actually saving 20 MB per process and provides a detailed report.

>
> I don't understand the objections about alignment. malloc() and obmalloc()
> are at least 8-byte aligned, mmap() with NULL as the first parameter
> is page-aligned.
>

I think Serhiy meant alignment within the pyc file. A bytes object in a pyc file is not aligned, so it's a bad idea to mmap pyc files.

But Instagram may use its own serialization format instead of pyc files, and it may align the bytes objects, so that may not be a problem.
No one other than an Instagrammer knows, because there is no detailed information.
msg344544 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2019-06-04 10:47
There's no point discussing unless people develop reading skills.
msg344622 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2019-06-04 18:43
Let's please keep this respectful. Saying people are basing things "on fantasy" or that "people need to develop reading skills" is not helpful.
msg344660 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-06-05 00:54
I'm sorry, I thought "fantasy" was a good metaphor.
I just meant "the estimation seems too optimistic and rough; the discussion should not be based on it".
msg344752 - (view) Author: Brett Cannon (brett.cannon) * (Python committer) Date: 2019-06-05 17:26
"I'm sorry, I thought "fantasy" was good metaphor."

No problem! Saying something is "fantasy", as in "it's based on fantasy", is -- in North America at least -- like saying "you're fooling yourself" or "you're delusional".
History
Date User Action Args
2019-06-05 17:26:01  brett.cannon      set  messages: + msg344752
2019-06-05 00:54:41  inada.naoki       set  messages: + msg344660
2019-06-04 18:43:37  brett.cannon      set  messages: + msg344622
2019-06-04 10:48:02  skrah             set  nosy: - skrah
2019-06-04 10:47:48  skrah             set  messages: + msg344544
2019-06-04 10:42:22  inada.naoki       set  messages: + msg344543
2019-06-04 10:16:19  skrah             set  messages: + msg344538
2019-06-04 06:16:04  serhiy.storchaka  set  messages: + msg344529
2019-06-04 00:31:00  inada.naoki       set  messages: + msg344506
2019-06-03 23:45:51  dino.viehland     set  messages: + msg344494
2019-06-02 09:32:08  skrah             set  messages: + msg344262
2019-06-02 06:15:24  serhiy.storchaka  set  messages: + msg344256
2019-06-02 02:38:21  inada.naoki       set  messages: + msg344253
2019-06-01 19:09:20  skrah             set  nosy: + skrah; messages: + msg344200
2019-06-01 13:29:51  serhiy.storchaka  set  messages: + msg344173
2019-06-01 13:12:58  serhiy.storchaka  set  messages: + msg344172
2019-06-01 11:38:56  inada.naoki       set  messages: + msg344168
2019-05-31 17:47:19  brett.cannon      set  messages: + msg344097
2019-05-30 23:27:20  inada.naoki       set  messages: + msg344013
2019-05-30 22:46:10  dino.viehland     set  messages: + msg344009
2019-05-30 20:12:34  brett.cannon      set  nosy: + brett.cannon; messages: + msg343999
2019-05-30 19:43:18  eric.snow         set  messages: + msg343996
2019-05-30 19:07:32  serhiy.storchaka  set  nosy: + serhiy.storchaka; messages: + msg343993
2019-05-30 18:03:30  dino.viehland     set  messages: + msg343992
2019-05-29 03:15:14  inada.naoki       set  messages: + msg343859
2019-05-29 01:55:27  dino.viehland     set  messages: + msg343851
2019-05-28 23:58:32  inada.naoki       set  messages: + msg343837
2019-05-28 22:58:33  dino.viehland     set  messages: + msg343824
2019-05-28 05:00:22  inada.naoki       set  nosy: + inada.naoki; messages: + msg343731
2019-05-07 21:26:56  dino.viehland     set  nosy: + eric.snow
2019-05-07 20:56:33  dino.viehland     set  keywords: + patch; stage: needs patch -> patch review; pull_requests: + pull_request13093
2019-05-07 19:56:26  dino.viehland     create