classification
Title: Make rawiobase_read() read directly to bytes object
Type: performance Stage: patch review
Components: IO, Library (Lib) Versions: Python 3.4, Python 3.5
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: amaury.forgeotdarc, dwight.guth, jcea, methane, pitrou, sbt, vstinner
Priority: normal Keywords: patch

Created on 2012-09-10 12:23 by sbt, last changed 2019-04-08 13:06 by BreamoreBoy.

Files
File name          Uploaded
iobase_read.patch  sbt, 2012-09-10 12:23
iobase_read.patch  sbt, 2012-09-17 20:17
Messages (20)
msg170183 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-09-10 12:23
Currently rawiobase_read() reads to a bytearray object and then copies the data to a bytes object.

There is a TODO comment saying that the bytes object should be created directly.  The attached patch does that.
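
For readers unfamiliar with the C code, the copy-based approach described above can be sketched in pure Python. This is only an illustration, not the actual C implementation: the helper name read_via_bytearray and the ThreeBytes stream are hypothetical and exist only for this example.

```python
import io


def read_via_bytearray(raw, size):
    """Pure-Python sketch of the copy-based read (hypothetical helper)."""
    buf = bytearray(size)      # scratch buffer; bytearray(size) is zero-filled
    n = raw.readinto(buf)      # the raw stream fills the start of the buffer
    if n is None:              # non-blocking stream with no data available
        return None
    del buf[n:]                # drop the unused tail
    return bytes(buf)          # extra copy: bytearray -> bytes


class ThreeBytes(io.RawIOBase):
    """Hypothetical raw stream that supplies at most 3 bytes per call."""

    def readable(self):
        return True

    def readinto(self, b):
        n = min(3, len(b))
        b[:n] = b"abc"[:n]
        return n


result = read_via_bytearray(ThreeBytes(), 10)
```

The final bytes(buf) call is the copy that the patch tries to avoid by having readinto() write straight into the memory of the bytes object that is returned.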
msg170226 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2012-09-10 20:01
It works as long as the bytes object cannot leak to Python code (imagine a custom readinto() method which plays with gc.get_referrers, then calls hash(b)...).
This is OK with this patch.
msg170624 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-09-17 20:17
New patch which checks the refcount of the memoryview and bytes object after calling readinto().

If either refcount is larger than the expected value of 1, then the data is copied rather than resized.
msg170627 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-09-17 22:30
> If either refcount is larger than the expected value of 1, then the
> data is copied rather than resized.

I think that's a useless precaution. The bytes object cannot "leak" since you are using PyMemoryView_FromMemory(), which doesn't know about the original object.

Out of curiosity, have you done any benchmarks?
msg170628 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-09-17 23:04
> I think that's a useless precaution. The bytes object cannot "leak" 
> since you are using PyMemoryView_FromMemory(), which doesn't know about 
> the original object.

The bytes object cannot "leak" so, as you say, checking that refcount is pointless.  But the view might "leak", and since it does not own a reference to the base object we have a problem: we can't deallocate the bytes object for fear of breaking the view.

It looks like objects returned by PyMemoryView_FromMemory() must never be allowed to "leak", so I am not sure there are many circumstances in which PyMemoryView_FromMemory() is safe to use.

Perhaps using PyBuffer_FillInfo() and PyMemoryView_FromBuffer() would keep alive the bytes object while the view is alive, without letting the bytes object "leak".

> Out of curiosity, have you done any benchmarks?

No.
msg170629 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-09-17 23:14
> The bytes object cannot "leak" so, as you say, checking that refcount
> is pointless.  But the view might "leak", and since it does not own a
> reference to the base object we have a problem: we can't deallocate the 
> bytes object for fear of breaking the view.

Indeed, that's a problem (but your patch does deallocate the bytes object).
It's quite fishy, I'm not sure how to solve the issue cleanly. Stefan, do you have an idea?
msg170638 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-09-18 08:33
So the problem is that readinto(view) might result in several references
to view? I don't think that can be solved on the memoryview side.

One could do:

   view = PyMemoryView_FromObject(b);
   // Lie about writability
   ((PyMemoryViewObject *)view)->view.readonly = 0;

   [...]

Then the view owns a reference to the bytes object. But that does not
solve the problem that writable memoryviews based on a readonly object
might be hanging around.
msg170639 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-09-18 10:47
> Then the view owns a reference to the bytes object. But that does not
> solve the problem that writable memoryviews based on a readonly object
> might be hanging around.

How about doing

    PyObject_GetBuffer(b, &buf, PyBUF_WRITABLE);
    view = PyMemoryView_FromBuffer(&buf);
    // readinto view
    PyBuffer_Release(&buf);

Would attempts to access a "leaked" reference to view now result in ValueError("operation forbidden on released memoryview object")?  If so then I think this would be safe.
msg170642 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-09-18 12:49
The current non-test uses of PyMemoryView_FromBuffer() are in _io.BufferedReader.read(), _io.BufferedWriter.write(), PyUnicode_Decode().

It looks like they can each be made to leak a memoryview that references a deallocated buffer.  (Maybe the answer is Don't Do That.)


import codecs, sys

def decode(buf):
    global view
    view = buf
    return codecs.latin_1_decode(buf)

def getregentry():
    return codecs.CodecInfo(name='foobar', decode=decode,
                            encode=codecs.latin_1_encode)

@codecs.register
def search_function(encoding):
    if encoding == 'foobar':
        return codecs.CodecInfo(*getregentry())

b = b'hello'.upper()
b.decode('foobar')
print(view.tobytes())                   # => b'HELLO'
del b
x = b'dump'.upper()
print(view.tobytes())                   # => b'DUMP\x00'



import io, sys

class File(io.RawIOBase):
    def readinto(self, buf):
        global view
        view = buf
        n = len(buf)
        buf[:] = b'x'*n
        return n

    def readable(self):
        return True

f = io.BufferedReader(File())
f.read(1)
print(view[:5].tobytes())       # => b'xxxxx'
del f
print(view[:5].tobytes())       # => b'\xdd\xdd\xdd\xdd\xdd'
msg170658 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-09-18 18:19
I am rather confused about the ownership semantics when one uses PyMemoryView_FromBuffer().

It looks as though PyMemoryView_FromBuffer() "steals" ownership of the buffer since, when the associated _PyManagedBufferObject is garbage collected, PyBuffer_Release() is called on its copy of the buffer info.  However, the _PyManagedBufferObject does not own a reference to the base object, so one still needs to decref the base object (at some time when it is safe to do so).

So am I right in thinking that

  PyObject_GetBuffer(obj, &buf, ...);
  view = PyMemoryView_FromBuffer(&buf);     // view->master owns the buffer, but view->master->obj == NULL
  ...
  Py_DECREF(view);                          // releases buffer (assuming no other exports)
  Py_XDECREF(buf.obj);

has balanced refcounting and is more or less equivalent to

  view = PyMemoryView_FromObject(obj);
  ...
  Py_DECREF(view);

The documentation is not very helpful.  It just says that calls to PyObject_GetBuffer() must be matched with calls to PyBuffer_Release().
msg170659 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-09-18 18:20
Richard Oudkerk <report@bugs.python.org> wrote:
>     PyObject_GetBuffer(b, &buf, PyBUF_WRITABLE);
>     view = PyMemoryView_FromBuffer(&buf);
>     // readinto view
>     PyBuffer_Release(&buf);
> 
> Would attempts to access a "leaked" reference to view now result in ValueError("operation forbidden on released memoryview object")?  If so then I think this would be safe.

You would need to call memory_release(). Perhaps we can just expose it on the
C-API level as PyMemoryView_Release().

IMO the use of PyObject_GetBuffer() should be discouraged. The semantics
aren't clear (see #15821). I'd suggest using:

  1) A buffer provider is involved (the case here):

        PyMemoryView_FromObject()

  2) A piece of memory needs to be wrapped temporarily and no references
     to the memoryview are "leaked" on the Python level:

        PyMemoryView_FromMemory()

  3) A piece of memory needs to be packaged as a memoryview with automatic
     cleanup in mbuf_dealloc():

        PyMemoryView_FromBufferWithCleanup() (proposed in msg169613)

So I think the combination of PyMemoryView_FromObject() with a call to
PyMemoryView_Release() should indeed work here.
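
The effect of releasing the view after readinto() can be illustrated at the Python level, where memoryview.release() already exists. This is a rough sketch under assumptions: LeakyRaw and read_with_release are hypothetical names, and a bytearray stands in for the bytes object because Python code cannot write into bytes.

```python
leaked = None  # module-level slot where the hypothetical stream stashes the view


class LeakyRaw:
    """Hypothetical stream whose readinto() keeps a reference to its argument."""

    def readinto(self, view):
        global leaked
        leaked = view              # the view "leaks" to outside code
        n = min(2, len(view))
        view[:n] = b"hi"[:n]
        return n


def read_with_release(raw, size):
    buf = bytearray(size)
    view = memoryview(buf)         # rough analogue of PyMemoryView_FromObject()
    try:
        n = raw.readinto(view)
    finally:
        view.release()             # rough analogue of the proposed PyMemoryView_Release()
    return bytes(buf[:n])


data = read_with_release(LeakyRaw(), 8)

# The leaked reference is the same (now released) view object.
try:
    leaked.tobytes()
    leaked_error = None
except ValueError as exc:
    leaked_error = str(exc)
```

Note that release() raises BufferError if something still holds a buffer export obtained from the view, so a real implementation would have to decide how to handle that case.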
msg170660 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-09-18 18:22
Richard Oudkerk <report@bugs.python.org> wrote:
> The documentation is not very helpful.  It just says that calls
> to PyObject_GetBuffer() must be matched with calls to PyBuffer_Release().

Yes, we need to sort that out, see #15821.
msg170675 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-09-18 20:47
> You would need to call memory_release(). Perhaps we can just expose it on the
> C-API level as PyMemoryView_Release().

Should PyMemoryView_Release() release the _PyManagedBufferObject by doing mbuf_release(view->mbuf) even if view->mbuf->exports > 0?

Doing

  Py_TYPE(view->mbuf)->tp_clear((PyObject *)view->mbuf);

seems to have the desired effect of causing ValueError when I try to access any associated memoryview.

>  3) A piece of memory needs to be packaged as a memoryview with automatic
>     cleanup in mbuf_dealloc():
>
>        PyMemoryView_FromBufferWithCleanup() (proposed in msg169613)

Maybe this should also handle decrefing the base object (given a flag PyManagedBuffer_FreeObj).  I do worry about creating memoryviews that survive deallocation of the base object.
msg170676 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-09-18 20:56
> So I think the combination of PyMemoryView_FromObject() with a call to
> PyMemoryView_Release() should indeed work here.

I don't think we want to expose a mutable bytes object to outside code, so IMO PyMemoryView_FromMemory() is preferable.
msg170678 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-09-18 21:19
Richard Oudkerk <report@bugs.python.org> wrote:
> Should PyMemoryView_Release() release the _PyManagedBufferObject by doing mbuf_release(view->mbuf) even if view->mbuf->exports > 0?

No, I think it should really just be a wrapper:

diff --git a/Objects/memoryobject.c b/Objects/memoryobject.c
--- a/Objects/memoryobject.c
+++ b/Objects/memoryobject.c
@@ -1093,6 +1093,12 @@
     return memory_release((PyMemoryViewObject *)self, NULL);
 }

+PyObject *
+PyMemoryView_Release(PyObject *m)
+{
+    return memory_release((PyMemoryViewObject *)m, NULL);
+}
+

We decided in #10181 not to allow releasing a view with exports, since the
logic is already quite complex. Is there a reasonable expectation that
existing code creates memoryviews of the readinto() argument?
msg170679 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2012-09-18 21:41
Antoine Pitrou <report@bugs.python.org> wrote:
> I don't think we want to expose a mutable bytes object to outside code,
> so IMO PyMemoryView_FromMemory() is preferable.

I agree, but PyMemoryView_FromMemory(PyBytes_AS_STRING(b), n, PyBUF_WRITE)
just hides the fact that a mutable bytes object is exposed.

Are we talking about a big speedup here or could we perhaps just keep
the existing code?
msg170680 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2012-09-18 21:51
> Antoine Pitrou <report@bugs.python.org> wrote:
> > I don't think we want to expose a mutable bytes object to outside code,
> > so IMO PyMemoryView_FromMemory() is preferable.
> 
> I agree, but PyMemoryView_FromMemory(PyBytes_AS_STRING(b), n, PyBUF_WRITE)
> just hides the fact that a mutable bytes object is exposed.

Except that the mutable bytes object is not exposed to any outside code,
so that weird behaviour can't be observed.
msg170681 - (view) Author: Richard Oudkerk (sbt) * (Python committer) Date: 2012-09-18 22:02
> Are we talking about a big speedup here or could we perhaps just keep
> the existing code?

I doubt it is worth the hassle.  But I did want to know if there was a clean way to do what I wanted.
msg190457 - (view) Author: Dwight Guth (dwight.guth) Date: 2013-06-01 23:23
I was programming something today and thought I should let you know I came across a situation where the current behavior of this function is able to expose what seems to be raw memory to the user.

import io
class A(io.RawIOBase):
  def readinto(self, b):
    return len(b)

A().read(100)
msg223225 - (view) Author: Mark Lawrence (BreamoreBoy) * Date: 2014-07-16 15:48
Please note this is also referred to from #15994.
History
Date                 User                Action  Args
2019-04-08 13:06:30  BreamoreBoy         set     nosy: - BreamoreBoy
2019-04-08 13:02:20  methane             set     nosy: + methane
2014-10-14 18:01:58  skrah               set     nosy: - skrah
2014-07-16 15:48:19  BreamoreBoy         set     nosy: + BreamoreBoy
                                                 messages: + msg223225
                                                 versions: + Python 3.5
2013-06-01 23:23:23  dwight.guth         set     nosy: + dwight.guth
                                                 messages: + msg190457
2012-09-18 22:02:18  sbt                 set     messages: + msg170681
2012-09-18 21:51:38  pitrou              set     messages: + msg170680
2012-09-18 21:41:23  skrah               set     messages: + msg170679
2012-09-18 21:19:27  skrah               set     messages: + msg170678
2012-09-18 20:56:43  pitrou              set     messages: + msg170676
2012-09-18 20:50:10  vstinner            set     nosy: + vstinner
2012-09-18 20:47:55  sbt                 set     messages: + msg170675
2012-09-18 18:22:55  skrah               set     messages: + msg170660
2012-09-18 18:20:41  skrah               set     messages: + msg170659
2012-09-18 18:19:43  sbt                 set     messages: + msg170658
2012-09-18 12:49:29  sbt                 set     messages: + msg170642
2012-09-18 10:47:49  sbt                 set     messages: + msg170639
2012-09-18 08:33:54  skrah               set     messages: + msg170638
2012-09-17 23:14:26  pitrou              set     nosy: + skrah
                                                 messages: + msg170629
2012-09-17 23:04:56  sbt                 set     messages: + msg170628
2012-09-17 22:30:01  pitrou              set     nosy: + pitrou
                                                 messages: + msg170627
2012-09-17 20:59:07  serhiy.storchaka    set     type: performance
                                                 components: + Library (Lib), IO
2012-09-17 20:17:09  sbt                 set     files: + iobase_read.patch
                                                 messages: + msg170624
2012-09-10 20:01:10  amaury.forgeotdarc  set     nosy: + amaury.forgeotdarc
                                                 messages: + msg170226
2012-09-10 18:01:46  jcea                set     nosy: + jcea
                                                 stage: patch review
                                                 versions: + Python 3.4
2012-09-10 12:23:03  sbt                 create