classification
Title: io.BytesIO: no way to get the length of the underlying buffer without copying data
Type: enhancement Stage: resolved
Components: IO Versions:
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: martin.panter, r.david.murray, rthr
Priority: normal Keywords:

Created on 2017-07-25 12:03 by rthr, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (6)
msg299054 - Author: Arthur Darcet (rthr) Date: 2017-07-25 12:03
If I'm not mistaken, a BytesIO buffer can be in three states:

 (1) `b = BytesIO(b'data')` -> free of any constraints
 (2) `d = b'data'; b = BytesIO(d)` -> cannot modify the underlying bytes without copying them
 (3) `b = BytesIO(b'data'); d = b.getbuffer()` -> cannot return a "bytes" representation of the data without copying it (the underlying buffer might change)


My use-case is "how to get the length of the data currently in the BytesIO object".
And right now, there are two solutions:
 (a) `len(b.getvalue())`
 (b) `len(b.getbuffer())`

But solution (a) copies the data if the buffer is in state (3), and solution (b) copies the data if it is in state (2).

And I don't see any way to distinguish between the three states from Python code.
So as far as I understand it, there is no way to get the size of the buffer from Python code that reliably avoids copying data.


Should I open a PR to add a `size()` method on the BytesIO class (simply returning `PyLong_FromSsize_t(self->string_size)`)?
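
For reference, the two approaches (a) and (b) above can be sketched as follows. This only illustrates the calls involved; whether either one copies data depends on the internal state of the object, which is not observable from Python code.

```python
import io

b = io.BytesIO(b"data")

# (a) getvalue() returns a bytes object holding the whole contents.
size_a = len(b.getvalue())

# (b) getbuffer() returns a writable memoryview over the contents.
# Using it as a context manager releases the view on exit; while a
# view is exported, the BytesIO object cannot be resized or closed.
with b.getbuffer() as view:
    size_b = len(view)

print(size_a, size_b)
```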
msg299056 - Author: Martin Panter (martin.panter) (Python committer) Date: 2017-07-25 12:14
Can’t you use b.seek(0, SEEK_END)?
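
A minimal sketch of this suggestion, assuming the caller wants the current position left untouched. The helper name `stream_size` is hypothetical and not part of the `io` module; it relies only on `tell()` and on `seek()` returning the new absolute position.

```python
import io


def stream_size(stream):
    """Return the total size in bytes of a seekable binary stream.

    The current position is saved first and restored afterwards.
    """
    pos = stream.tell()
    try:
        # seek() returns the new absolute position, i.e. the size.
        return stream.seek(0, io.SEEK_END)
    finally:
        stream.seek(pos)


b = io.BytesIO(b"data")
b.read(2)                # move the position away from the start
print(stream_size(b))    # total size of the contents
print(b.tell())          # position is back where it was
```

No bytes object or memoryview is created here, so the result does not depend on which of the three states the buffer is in.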
msg299060 - Author: Arthur Darcet (rthr) Date: 2017-07-25 12:21
It's a tiny bit slow, but that works, thank you. I guess we can close this.



% python -m timeit -s "import io; b = io.BytesIO(b'0' * 2 ** 30)" "p = b.tell(); b.seek(0, 2); b.tell(); b.seek(p)"
1000000 loops, best of 3: 0.615 usec per loop

% python -m timeit -s "import io; b = io.BytesIO(b'0' * 2 ** 30)" "len(b.getvalue())"
10000000 loops, best of 3: 0.174 usec per loop
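
The same comparison can be run from within Python using the `timeit` module. This is a rough sketch: the buffer is 1 MiB rather than the 1 GiB used above (an assumption, to keep memory use modest), the function names are made up for the example, and the measured times include function-call overhead, so the absolute numbers will differ from those above.

```python
import io
import timeit

# Assumption: a 1 MiB buffer instead of the 1 GiB one used above.
b = io.BytesIO(b"0" * 2**20)


def size_by_seek():
    # Save the position, seek to the end to read the size, then restore.
    p = b.tell()
    size = b.seek(0, io.SEEK_END)
    b.seek(p)
    return size


def size_by_getvalue():
    return len(b.getvalue())


loops = 100_000
for fn in (size_by_seek, size_by_getvalue):
    seconds = timeit.timeit(fn, number=loops)
    print(f"{fn.__name__}: {seconds / loops * 1e6:.3f} usec per call")
```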
msg299125 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2017-07-25 17:43
I'm confused, I don't see how there can be any difference between (1) and (2).
msg299215 - Author: Arthur Darcet (rthr) Date: 2017-07-26 08:15
BytesIO is heavily optimised to avoid copying bytes when it can.
For case (1), if you want to modify the data, there is no need to copy it before overwriting it, because no one else is using it.

For case (2), if you want to change something, you need to copy the data first; otherwise the original bytes object would get modified.


Case (1):
% python -m timeit -s "import io; b = io.BytesIO(b'0' * 2 ** 30)" "b.getbuffer()"
1000000 loops, best of 3: 0.201 usec per loop


Case (2):
% python -m timeit -s "import io; a = b'0' * 2 ** 30; b = io.BytesIO(a)" "b.getbuffer()"
10 loops, best of 3: 54.5 msec per loop
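
The observable behaviour behind cases (2) and (3) can be sketched as below. This shows only the semantics (the original bytes object never changes, and an exported buffer blocks resizing). It does not show when a copy actually happens internally, which is an implementation detail.

```python
import io

# Case (2): the BytesIO was built from a bytes object that is still referenced.
a = b"data"
b = io.BytesIO(a)
b.write(b"D")              # overwrite the first byte inside the stream
print(a)                   # the original bytes object is unchanged
print(b.getvalue())

# Case (3): a buffer has been exported with getbuffer().
c = io.BytesIO(b"data")
view = c.getbuffer()
view[0:1] = b"D"           # writes through the view are visible in the stream
try:
    c.write(b"more data")  # this write would have to resize the buffer
except BufferError as exc:
    print("write refused while the view is exported:", exc)
finally:
    view.release()
print(c.getvalue())
```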
msg299233 - Author: R. David Murray (r.david.murray) (Python committer) Date: 2017-07-26 14:09
So you are saying that BytesIO has code that checks that its argument only has a single reference and modifies the string in place when it can if so?  You can't depend on that in any other implementation of Python, and shouldn't depend on it in CPython either.  Even in CPython you can't guarantee that case 1 is case 1, since the argument could conceivably be an interned string.

So the seek approach is the only one that makes semantic sense, I think.
History
Date                 User            Action  Args
2022-04-11 14:58:49  admin           set     github: 75208
2017-07-26 14:09:58  r.david.murray  set     messages: + msg299233
2017-07-26 08:15:34  rthr            set     messages: + msg299215; versions: - Python 3.7
2017-07-25 17:43:02  r.david.murray  set     nosy: + r.david.murray; messages: + msg299125
2017-07-25 12:21:58  rthr            set     status: open -> closed; messages: + msg299060; stage: resolved
2017-07-25 12:14:58  martin.panter   set     nosy: + martin.panter; messages: + msg299056; versions: - Python 2.7, Python 3.3, Python 3.4, Python 3.5, Python 3.6
2017-07-25 12:03:57  rthr            create