Issue 25190: Define StringIO seek offset as code point offset

This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/69377

classification

Title:	Define StringIO seek offset as code point offset
Type:	enhancement	Stage:
Components:	IO	Versions:	Python 3.6

process

Status:	open	Resolution:
Dependencies:		Superseder:
Assigned To:		Nosy List:	Eli_B, benjamin.peterson, martin.panter, pitrou, serhiy.storchaka, socketpair, stutzbach
Priority:	normal	Keywords:	patch

Created on 2015-09-20 06:15 by martin.panter, last changed 2022-04-11 14:58 by admin.

Files
File name	Uploaded	Description
stringio-seek.patch	martin.panter, 2015-12-15 06:59

Messages (6)
msg251149 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-09-20 06:15
This follows from Issue 12922. When no newline translation is being done, it would be useful to define the seek() offset as the code point offset into the underlying string, allowing code like:

    s = StringIO()
    print("line", file=s)  # Some inflexible API with an unwanted newline
    s.seek(-1, SEEK_CUR)  # Undo the trailing newline
    s.truncate()

In general, relative seeks are not allowed for text streams, and absolute offsets have arbitrary values. But when no encoding is actually going on, these restrictions are annoying.

I guess the biggest problem is what to do when newline translation is enabled. But I think this is a rarely used feature of StringIO. I suggest saying that offsets in that case remain arbitrary, and letting the code do whatever it happens to do (probably jumping to the wrong character, chopping CRLFs in half, etc., as long as it won’t crash).
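
For comparison, here is a minimal sketch of how the same effect can be approximated without a relative seek. It assumes that tell() on a StringIO returns a code point index, which is how CPython's C implementation happens to behave but is not a documented guarantee (defining it is what this issue asks for). The helper names relative_seek_supported and strip_trailing_newline are hypothetical and not part of the io module.

```python
import io


def relative_seek_supported() -> bool:
    """Probe whether StringIO accepts a nonzero SEEK_CUR offset."""
    s = io.StringIO()
    print("line", file=s)
    try:
        s.seek(-1, io.SEEK_CUR)
    except OSError:
        # io.UnsupportedOperation is a subclass of OSError, so this covers
        # both the C and the pure-Python implementations.
        return False
    return True


def strip_trailing_newline(s: io.StringIO) -> None:
    """Drop one trailing newline, assuming the position is at the end.

    Assumption: tell() is a code point offset (CPython implementation
    behaviour, not a documented guarantee).
    """
    end = s.tell()
    if end > 0 and s.getvalue().endswith("\n"):
        s.seek(end - 1)  # absolute seek to one code point before the end
        s.truncate()     # cut the buffer at the current position
```

The probe lets calling code fall back to the absolute-seek workaround on interpreters where the relative seek is rejected.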
msg251152 - (view)	Author: Serhiy Storchaka (serhiy.storchaka) *	Date: 2015-09-20 06:47
I suspect this would not be easy to do in the Python implementation.
msg251206 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-09-21 06:52
I see the _pyio implementation wraps BytesIO with UTF-8 encoding. Perhaps it would be okay to change to UTF-32 encoding (a fixed-length Unicode encoding). That would use more memory, but the C implementation seems to use a Py_UCS4 buffer already. Then you could reimplement seek(), tell(), and truncate() by detaching and rebuilding the TextIOWrapper over the top. Not super efficient, but perhaps that does not matter for the _pyio implementation. The fact that it is so hard to do this (random write access to a large Unicode buffer) in native Python could be another argument to support this in the default StringIO implementation :)
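
The fixed-width idea can be illustrated with a small sketch. This is not the _pyio code or the attached patch; it only shows why UTF-32 makes code point offsets trivial: every code point occupies exactly 4 bytes, so code point n starts at byte 4 * n. The "utf-32-le" codec is used so that no byte order mark is written, and the helper names are hypothetical.

```python
import io

CODEC = "utf-32-le"  # fixed width, explicit byte order, no BOM
WIDTH = 4            # bytes per code point in UTF-32


def seek_code_point(raw: io.BytesIO, index: int) -> None:
    """Move to the given code point by scaling to a byte offset."""
    raw.seek(index * WIDTH)


def tell_code_point(raw: io.BytesIO) -> int:
    """Return the current position as a code point index."""
    return raw.tell() // WIDTH


def text_of(raw: io.BytesIO) -> str:
    """Decode the whole buffer back to a str."""
    return raw.getvalue().decode(CODEC)


raw = io.BytesIO("héllo\n".encode(CODEC))
seek_code_point(raw, 1)
raw.write("E".encode(CODEC))  # overwrite the non-ASCII 'é' in place
seek_code_point(raw, 5)
raw.truncate()                # cut off the trailing newline
```

With UTF-8 the same overwrite would change the buffer length, because 'é' takes two bytes and 'E' takes one. That is why byte offsets cannot be used as code point offsets there.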
msg256292 - (view)	Author: Марк Коренберг (socketpair) *	Date: 2015-12-12 20:20
#25849 ?
msg256302 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-12-12 23:10
Mark: This issue is about StringIO only. I am not proposing any change to TextIOBase or how on-disk text files are handled. I intend to propose a patch to make StringIO more liberal, but haven’t got around to it yet. Do you think it would be worthwhile? IMO it would make StringIO a fairly efficient mutable text buffer. The alternatives [list(str), array("u")] are slower and/or use more than 4 bytes per character.
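
As a rough illustration of the list(str) alternative mentioned above, the sketch below edits text through a list of one-character strings and estimates the per-character cost of the list itself. The figure is only a lower bound: sys.getsizeof counts the list's pointer array, not the string objects it refers to, and the exact number depends on the platform and interpreter build.

```python
import sys

text = "line\n" * 1000

# list(str) as a mutable text buffer: one list slot per code point.
chars = list(text)
list_bytes_per_char = sys.getsizeof(chars) / len(chars)

chars.pop()        # drop the trailing newline
chars[0] = "L"     # overwrite a character in place
result = "".join(chars)
```

Each slot is a pointer (8 bytes on a 64-bit build), so the list alone already exceeds the 4 bytes per character of a UCS4 buffer.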
msg256441 - (view)	Author: Martin Panter (martin.panter) *	Date: 2015-12-15 06:59
There were a few tricky bits doing this with _pyio.StringIO, but I think I was successful. Here is a patch with both implementations and some tests. If people think this should go ahead, I can add documentation. In the process I may have discovered a bug with the TextIOWrapper implementations. Is calling truncate() meant to truncate the internal read buffer? At the moment you can read back truncated data, although the underlying byte stream is actually truncated.

History
Date	User	Action	Args
2022-04-11 14:58:21	admin	set	github: 69377
2021-05-19 09:44:48	Eli_B	set	nosy: + Eli_B
2020-01-20 08:22:33	serhiy.storchaka	link	issue39365 superseder
2015-12-15 06:59:59	martin.panter	set	files: + stringio-seek.patch keywords: + patch messages: + msg256441
2015-12-12 23:10:02	martin.panter	set	messages: + msg256302
2015-12-12 20:20:10	socketpair	set	nosy: + socketpair messages: + msg256292
2015-09-21 06:52:13	martin.panter	set	messages: + msg251206
2015-09-20 06:47:42	serhiy.storchaka	set	nosy: + pitrou, benjamin.peterson, stutzbach, serhiy.storchaka messages: + msg251152
2015-09-20 06:15:12	martin.panter	create