Issue 21057: TextIOWrapper does not support reading bytearrays or memoryviews

➜

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/65256

classification

Title:	TextIOWrapper does not support reading bytearrays or memoryviews
Type:	enhancement	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.5

process

Status:	closed	Resolution:	fixed
Dependencies:		Superseder:
Assigned To:		Nosy List:	loewis, nikratio, pitrou, python-dev, serhiy.storchaka, vstinner
Priority:	normal	Keywords:	patch

Created on 2014-03-25 04:14 by nikratio, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
issue21057.diff	nikratio, 2014-03-28 03:46		review
issue21057.diff	nikratio, 2014-04-01 04:19		review
issue_21057_r3.diff	nikratio, 2014-04-29 03:00		review

Messages (20)
msg214776 - (view)	Author: Nikolaus Rath (nikratio) *	Date: 2014-03-25 04:14
In Python 3.4, TextIOWrapper can not read from streams that return bytearrays or memoryviews: from io import TextIOWrapper, BytesIO class MyByteStream(BytesIO): def read1(self, len_): return memoryview(super().read(len_)) bs = MyByteStream(b'some data in ascii\n') ss = TextIOWrapper(bs, encoding='ascii') print(ss.read(200)) results in: TypeError: underlying read1() should have returned a bytes object, not 'memoryview' I don't think there's any good reason for TextIOWrapper not accepting any bytes-like object (especially now that b''.join finally accepts bytes-like objects as well).
msg214777 - (view)	Author: Nikolaus Rath (nikratio) *	Date: 2014-03-25 04:15
If someone is willing to review, I'd be happy to write a patch for this.
msg214787 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-03-25 08:26
I'm interested to review a patch, but I'm not sure that read1() can return a type different than bytes. You can use the Py_buffer API for your patch.
msg214852 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-03-25 20:39
read1() should return bytes. MyByteStream doesn't implement the io.BufferedIOBase interface.
msg214853 - (view)	Author: Nikolaus Rath (nikratio) *	Date: 2014-03-25 21:00
On 03/25/2014 01:39 PM, Serhiy Storchaka wrote: > read1() should return bytes. MyByteStream doesn't implement the io.BufferedIOBase interface. Indeed, this is what this issue is about :-). The question is: is there a good reason to require io.BufferedIOBase implementors to return bytes rather than any bytes-like object from read1? I'd also argue that the current documentation of io.BufferedIOBase actually does not clearly require read1 to return bytes. The exact formulation is "Read and return up to size bytes" (note that "bytes" is not interpreted text). This can just as easily read as "return binary data of up to size bytes using one of the customary Python types". So if it turns out that there is a good reason for this requirement, the documentation should at least make this requirement more explicit. Best, -Nikolaus -- Encrypted emails preferred. PGP fingerprint: 5B93 61F8 4EA2 E279 ABF6 02CF A9AD B7F8 AE4E 425C »Time flies like an arrow, fruit flies like a Banana.«
msg214871 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-03-25 23:35
bytes objects have two things going for them: - they have the full bytes API (all the startswith(), etc. methods) - not only buffer access - they are immutable: you can keep an internal reference to a bytes object and be sure it won't change under your feet This is especially handy when writing C code.
msg214874 - (view)	Author: Nikolaus Rath (nikratio) *	Date: 2014-03-26 00:37
Yes, bytes objects have some advantages. But if a bytes object is desired, it can always be created from bytes-like object. If a BufferedIOBase instance is required to only provide bytes objects, this conversion is forced even when it may not be necessary. If someone is willing to do the work (and I am), is there a reason not to allow TextIOWrapper to accept bytes-like objects?
msg214886 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-03-26 10:07
> If someone is willing to do the work (and I am), is there a reason not to > allow TextIOWrapper to accept bytes-like objects? Yes, there are. The code which works only with bytes is much simpler. Not only TextIOWrapper, but many other classes in the stdlib (GzipFile, BZ2File, LZMAFile, ZipFile) expect that the read method of underlied file object returns bytes, so your MyByteStream is just broken for such cases.
msg214891 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-03-26 10:38
2014-03-26 0:35 GMT+01:00 Antoine Pitrou <report@bugs.python.org>: > - they are immutable: you can keep an internal reference to a bytes object and be sure it won't change under your feet Is it possible to request this feature using PyObject_GetBuffer()? I don't see such flag. If it's not possible, a new flag is maybe needed?
msg214893 - (view)	Author: STINNER Victor (vstinner) *	Date: 2014-03-26 10:43
> class MyByteStream(BytesIO): > def read1(self, len_): > return memoryview(super().read(len_)) > bs = MyByteStream(b'some data in ascii\n') I guess that you are trying to implement a zero-copy I/O. The problem is that BytesIO does copy data. Example: >>> data=b'abcdef' >>> x=io.BytesIO(data) >>> x.read() is data False Before trying to avoid copies in the "buffered" layer, something should be done for the "raw" layer (BytesIO in this case).
msg214899 - (view)	Author: Nikolaus Rath (nikratio) *	Date: 2014-03-26 15:17
On 03/26/2014 03:43 AM, STINNER Victor wrote: >> class MyByteStream(BytesIO): >> def read1(self, len_): >> return memoryview(super().read(len_)) >> bs = MyByteStream(b'some data in ascii\n') > > I guess that you are trying to implement a zero-copy I/O. The problem is that BytesIO does copy data. Right on the first count, but wrong on the second. The class I'm concerned with wants to do zero-copy I/O, but is not related to BytesIO. I only picked that to produce a minimal example.
msg215015 - (view)	Author: Nikolaus Rath (nikratio) *	Date: 2014-03-28 03:45
I'm attaching a patch that enables TextIOWrapper to work with bytes-like objects from the underlying file descriptor. The code changes are pretty small, without introducing any significant additional complexity. For streams providing bytes objects, this patch does not change anything. For streams that are able to provide bytearrays or memoryviews, this patch removes the overhead that would be incurred by explicitly converting to bytes.
msg215293 - (view)	Author: Nikolaus Rath (nikratio) *	Date: 2014-04-01 04:19
Thanks for the feedback! I have attached an updated patch. I did not include any testcase because the patch did not create any new code paths, so I was assuming it'd be covered by the existing test case. But of course I was wrong. In the revised patch, I added a testcase based on your example of a more complex memoryview. (Note, however, that even with the previous implementation using nbytes = PySequence_Size(input_chunk) this test does not fail, because nbytes is used only to estimate the size of the text string).
msg215985 - (view)	Author: Martin v. Löwis (loewis) *	Date: 2014-04-12 22:23
Nikolaus, can you please restate from scratch what the issue is? If the issue is still the one you posed in msg214776, I think the issue should be closed as "invalid" - it's not the case that there is no good reason for TextIOWrapper not accepting any bytes-like object. Or, to drop the double negation: as Serhiy states, read1() should return bytes, and it's perfectly fine to rely on that. Your MyByteStream ought to fail, so it is correct behaviour that it does fail.
msg215987 - (view)	Author: Nikolaus Rath (nikratio) *	Date: 2014-04-12 23:03
My usecase is that I have a binary stream class that internally uses memoryviews. I would like to read text data from this stream and thus encapsulate it in a TextIOWrapper. Currently, TextIOWrapper (correctly) expects read() to return bytes and fails if it receives any other bytes-like object. I can change my custom class to internally convert to bytes, but this means that the data is needlessly copied around and affects every other consumer of the class as well. Changing TextIOWrapper to work with any bytes-like object is (as far as I can see) rather simple. It does not introduce any new branches in the code, and it does not change the behavior for bytes objects at all. It does, however, eliminate unnecessary memcopies for classes that do not internally work with bytes objects. Therefore, I was hoping this patch could be considered for inclusion. The MyByteStream example that I gave in the first message is useless. I merely included it as the smallest possible code fragment that currently does not work, but would work after the patch in an attempt to illustrate what I meant - but apparently it had the opposite effect. Thanks for considering!
msg217405 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-04-28 19:34
> My usecase is that I have a binary stream class that internally uses memoryviews. Ok, I think it is a reasonable use case. I'm gonna look at your patch and give it a review.
msg217490 - (view)	Author: Roundup Robot (python-dev)	Date: 2014-04-29 08:27
New changeset 2a1d63f09560 by Antoine Pitrou in branch 'default': Issue #21057: TextIOWrapper now allows the underlying binary stream's read() or read1() method to return an arbitrary bytes-like object (such as a memoryview). http://hg.python.org/cpython/rev/2a1d63f09560
msg217491 - (view)	Author: Antoine Pitrou (pitrou) *	Date: 2014-04-29 08:27
Thank you, I've committed tha patch to 3.5.
msg218480 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2014-05-13 19:41
Note that this is not work with the punycode encoding (and may be some third-party encodings).
msg218495 - (view)	Author: Nikolaus Rath (nikratio) *	Date: 2014-05-14 00:33
On 05/13/2014 12:41 PM, Serhiy Storchaka wrote: > Note that this is not work with the punycode encoding (and may be some third-party encodings). I do not understand. Could you elaborate?

History
Date	User	Action	Args
2022-04-11 14:58:00	admin	set	github: 65256
2014-05-14 00:33:29	nikratio	set	messages: + msg218495
2014-05-13 19:41:04	serhiy.storchaka	set	messages: + msg218480
2014-04-29 08:27:55	pitrou	set	status: open -> closed resolution: fixed messages: + msg217491 stage: patch review -> resolved
2014-04-29 08:27:25	python-dev	set	nosy: + python-dev messages: + msg217490
2014-04-29 03:00:21	nikratio	set	files: + issue_21057_r3.diff
2014-04-28 19:34:13	pitrou	set	messages: + msg217405 stage: patch review
2014-04-12 23:03:43	nikratio	set	type: behavior -> enhancement messages: + msg215987 versions: - Python 3.4
2014-04-12 22:23:05	loewis	set	nosy: + loewis messages: + msg215985
2014-04-01 04:19:11	nikratio	set	files: + issue21057.diff messages: + msg215293
2014-03-28 03:46:26	nikratio	set	files: + issue21057.diff
2014-03-28 03:46:18	nikratio	set	files: - issue21057.diff
2014-03-28 03:45:36	nikratio	set	files: + issue21057.diff keywords: + patch messages: + msg215015
2014-03-26 15:17:27	nikratio	set	messages: + msg214899
2014-03-26 10:43:14	vstinner	set	messages: + msg214893
2014-03-26 10:38:24	vstinner	set	messages: + msg214891
2014-03-26 10:07:25	serhiy.storchaka	set	messages: + msg214886
2014-03-26 00:37:27	nikratio	set	messages: + msg214874
2014-03-25 23:35:21	pitrou	set	messages: + msg214871
2014-03-25 21:00:17	nikratio	set	messages: + msg214853
2014-03-25 20:39:58	serhiy.storchaka	set	nosy: + serhiy.storchaka messages: + msg214852
2014-03-25 08:26:53	vstinner	set	nosy: + vstinner, pitrou messages: + msg214787 versions: + Python 3.4
2014-03-25 04:15:16	nikratio	set	messages: + msg214777
2014-03-25 04:14:29	nikratio	create