New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BytesIO copy-on-write #66202
Comments
This is a followup to the thread at https://mail.python.org/pipermail/python-dev/2014-July/135543.html , discussing the existing behaviour of BytesIO copying its source object, and how this regresses compared to cStringIO.StringI. The goal of posting the patch on list was to try and stimulate discussion around the approach. The patch itself obviously isn't ready for review, and I'm not in a position to dedicate time to it just now (although in a few weeks I'd love to give it full attention!). Ignoring this quick implementation, are there any general comments around the approach? My only concern is that it might keep large objects alive in a non-intuitive way in certain circumstances, though I can't think of any obvious ones immediately. Also interested in comments on the second half of that thread: "a natural extension of this is to do something very similar on the write side: instead of generating a temporary private heap allocation, generate (and freely resize) a private PyBytes object until it is exposed to the user, at which point, _getvalue() returns it, and converts its into an IO_SHARED buffer." There are quite a few interactions with making that work correctly, in particular:
Could also add a private _PyBytes_SetSize() API to allow truncation to the final size during getvalue() without informing the allocator. Then we'd simply overallocate by up to 10% or 1-2kb, and write off the loss of the slack space. Notably, this approach completely differs from the one documented in http://bugs.python.org/issue15381 .. it's not clear to me which is better. |
Submitted contributor agreement. Please consider the demo patch licensed under the Apache 2 licence. |
Be careful what happens when the original object is mutable: >>> b = bytearray(b"abc")
>>> bio = io.BytesIO(b)
>>> b[:] = b"defghi"
>>> bio.getvalue()
b'abc' I don't know what your patch does in this case. |
Good catch :( There doesn't seem to be way a to ask for an immutable buffer, so perhaps it could just be a little more selective. I think the majority of use cases would still be covered if the sharing behaviour was restricted only to BytesType. In that case "Py_buffer initialdata" could become a PyObject*, saving a small amount of memory, and allowing reuse of the struct member if BytesIO was also modified to directly write into a private BytesObject |
Even if there is no way to explicitly request a RO buffer, the Py_buffer struct that you get back actually tells you if it's read-only or not. Shouldn't that be enough to enable this optimisation? Whether or not implementors of the buffer protocol set this flag correctly is another question, but if not then they need fixing on their side anyway. (And in the vast majority of cases, the implementor will be either CPython or NumPy.) Also, generally speaking, I think such an optimisation would be nice, even if it only catches some common cases (and doesn't break the others :). It could still copy data if necessary, but try to avoid it if possible. |
This version is tidied up enough that I think it could be reviewed. Changes are:
Outstanding issues:
|
New patch also calls unshare() during getbuffer() |
I'm not sure the "read only buffer" test is strong enough: having a readonly view is not a guarantee that the data in the view cannot be changed through some other means, i.e. it is read-only, not immutable. Pretty sure this approach is broken. What about the alternative approach of specializing for Bytes? |
That would certainly sound good enough, to optimize the common case. Also, it would be nice if you could add some tests to the patch (e.g. to stress the bytearray case). Thank you! |
As for whether the "checking for a readonly view" approach is broken, I don't know: that part of the buffer API is still mysterious to me. Stefan, would you have some insight? |
I think checking for a readonly view is fine. The protocol is this:
It is not possible to ask for a readonly buffer explicitly, but the It is hard to guess the original intention of the PEP-3118 authors, but |
The original wording in the PEP is this: readonly To me this means that a hypothetical compiler that could figure |
Stefan, Thanks for digging here. As much as I'd love to follow this interpretation, it simply doesn't match existing buffer implementations, including within the standard library. For example, mmap.mmap(..., flags=mmap.MAP_SHARED, prot=mmap.PROT_READ) will produce a read-only buffer, yet mutability is entirely at the whim of the operating system. In this case, "immutability" may be apparent for years, until some machine has memory pressure, causing the shared mapping to be be flushed, and refreshed from (say, incoherent NFS storage) on next access. I thought it would be worth auditing some of the most popular types of buffer just to check your interpretation, and this was the first, most obvious candidate. Any thoughts? I'm leaning heavily toward the Bytes specialization approach |
I'm not sure how much work it would be, or even if it could be made sufficient to solve our problem, but what about extending the buffers interface to include a "int stable" flag, defaulting to 0? It seems though, that it would just be making the already arcane buffers interface even more arcane simply for the benefit of our specific use case |
I'm sure many exporters aren't setting the right flags; on the other hand |
Hi Stefan, How does this approach in reinit() look? We first ask for a writable buffer, and if the object obliges, immediately copy it. Otherwise if it refused, ask for a read-only buffer, and this time expect that it will never change. This still does not catch the case of mmap.mmap. I am not sure how do deal with mmap.mmap. There is no way for it to export PROT_READ as a read-only buffer without permitted mutation, so the only options seem to either be a) remove buffer support from mmap, or b) blacklist it in bytesio(!). Antoine, I have padded out the unit tests a little. test_memoryio.py seems the best place for them. Also modified test_sizeof(), although to the way this test is designed seems inherently brittle to begin with. Now it is also sensitive to changes in Py_buffer struct. Various other changes:
Probably the patch guts could be rearranged again, since the definition of the functions is no longer as clear as it was in cow3.patch. |
See also bpo-15381. |
There's also the following code in numpy's getbuffer method: /*
* If a read-only buffer is requested on a read-write array, we return a
* read-write buffer, which is dubious behavior. But that's why this call
* is guarded by PyArray_ISWRITEABLE rather than (flags &
* PyBUF_WRITEABLE).
*/
if (PyArray_ISWRITEABLE(self)) {
if (array_might_be_written(self) < 0) {
goto fail;
}
} ... which seems to imply that mmap is not the only one with "dubious behaviour" (?). |
Actually we have an extra safety net in memory_hash() apart from This might be applicable here, too. Unfortunately mmap objects >>> import mmap
>>> with open("hello.txt", "wb") as f:
... f.write(b"xxxxx\n")
...
6
>>> f = open("hello.txt", "r+b")
>>> mm = mmap.mmap(f.fileno(), 0, prot=mmap.PROT_READ)
>>> x = memoryview(mm)
>>> hash(mm)
-9223363309538046107
>>> hash(x)
-3925142568057840789
>>> x.tolist()
[120, 120, 120, 120, 120, 10]
>>>
>>> with open("hello.txt", "wb") as g:
... g.write(b"yyy\n")
...
4
>>> hash(mm)
-9223363309538046107
>>> hash(x)
-3925142568057840789
>>> x.tolist()
[121, 121, 121, 10, 0, 0] memoryview (rightfully) assumes that hashable objects are immutable I'm not sure why mmap objects are hashable, it looks like a bug |
I think the mmap behavior is probably worse than the NumPy example. I assume that in the example the exporter sets view.readonly=0. |
Stefan, I like your new idea. If there isn't some backwards compatibility argument about mmap.mmap being hashable, then it could be considered a bug, and fixed in the same hypothetical future release that includes this BytesIO change. The only cost now is that to test for hashability, we must hash the object, which causes every byte in it to be touched (aka. almost 50% the cost of a copy) If however we can't fix mmap.mmap due to the interface change (I think that's a silly idea -- Python has never been about letting the user shoot themselves in the foot), then the specialized-for-Bytes approach is almost as good (and perhaps even better, since the resulting concept and structure layout is more aligned with Serhiy's patch in bpo-15381). tl;dr: a) mmap.mmap can be fixed - use hashability as strong test for immutability (instead of ugly heuristic involving buffer blags)
b) mmap.mmap can't be fixed - use the Bytes specialization approach. |
I don't like the idea of trying to hash the object. It may be a time-consuming operation, while the result will be thrown away. I think restricting the optimization to bytes objects is fine. We can whitelist other types, such as memoryview. |
This new patch abandons the buffer interface and specializes for Bytes per the comments on this issue. Anyone care to glance at least at the general structure? Tests could probably use a little more work. Microbenchmark seems fine, at least for construction. It doesn't seem likely this patch would introduce severe performance troubles elsewhere, but I'd like to trying it out with some example heavy BytesIO consumers (any suggestions? Some popular template engine?) cpython] ./python.exe -m timeit -s 'import i' 'i.readlines()' [23:52:55 eldil!58 cpython] ./python-nocow -m timeit -s 'import i' 'i.readlines()' [23:52:59 eldil!59 cpython] cat i.py |
I don't have any specific suggestions, but you could try the benchmark suite here: |
Hey Antoine, Thanks for the link. I'm having trouble getting reproducible results at present, and running out of ideas as to what might be causing it. Even after totally isolating a CPU for e.g. django_v2 and with frequency scaling disabled, numbers still jump around for the same binary by as much as 3%. I could not detect any significant change between runs of the old and new binary that could not be described as noise, given the isolation issues above. |
That's expected. If the difference doesn't go above 5-10%, then you IMO can pretty much consider your patch didn't have any impact on those benchmarks. |
Newest patch incorporates Antoine's review comments. The final benchmark results are below. Just curious, what causes e.g. telco to differ up to 7% between runs? That's really huge Report on Linux k2 3.14-1-amd64 #1 SMP Debian 3.14.9-1 (2014-06-30) x86_64 ### call_method_slots ### ### call_method_unknown ### ### call_simple ### ### etree_generate ### ### etree_iterparse ### ### etree_process ### ### fastpickle ### ### nqueens ### ### silent_logging ### ### spectral_norm ### ### telco ### The following not significant results are hidden, use -v to show them: |
telco.py always varies a lot between runs (up to 10%), even in the http://bytereef.org/mpdecimal/quickstart.html#telco-benchmark Using the average of 10 runs, I can't really see a slowdown. |
So I wonder why the benchmark suite says that the telco slowdown is significant. :) |
I suspect it's all covered now, but is there anything else I can help with to get this patch pushed along its merry way? |
New changeset 79a5fbe2c78f by Antoine Pitrou in branch 'default': |
The latest patch is good indeed. Thank you very much! |
Shouldn't this fix be mentioned in https://docs.python.org/3.5/whatsnew/3.5.html#optimizations ? |
Attached trivial patch for whatsnew.rst. |
New changeset 7ae156f07a90 by Berker Peksag in branch 'default': |
Why does it abandon buffer interface? Because of the following?
Shouldn't existing buffer implementations be fixed then and this feature made to use buffer interface instead of specialize for Bytes? If so is there at least any information on this in the comments so that one wouldn't wonder why there is specialization instead of relaying on buffer interface? |
Hi Piotr, There wasn't an obvious fix that didn't involve changing the buffer interface itself. There is presently ambiguity in the interface regarding the difference between a "read only" buffer and an "immutable" buffer, which is crucial for its use in this case. Fixing the interface, followed by every buffer interface user, is a significantly more complicated task than simply optimizing for the most common case, as done here. FWIW I still think this work is worth doing, though I personally don't have time to approach it just now. We could have (and possibly should) approach fixing e.g. mmap.mmap() hashability, possibly causing user code regressions, but even if such cases were fixed it still wouldn't be a enough to rely on for the optimization implemented here. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: