Title: Define StringIO seek offset as code point offset
Type: enhancement Stage:
Components: IO Versions: Python 3.6
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: benjamin.peterson, martin.panter, pitrou, serhiy.storchaka, socketpair, stutzbach
Priority: normal Keywords: patch

Created on 2015-09-20 06:15 by martin.panter, last changed 2015-12-15 06:59 by martin.panter.

File name Uploaded Description Edit
stringio-seek.patch martin.panter, 2015-12-15 06:59 review
Messages (6)
msg251149 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-09-20 06:15
This follows from Issue 12922. When no newline translation is being done, it would be useful to define the seek() offset as the code point offset into the underlying string, allowing stuff like:

from io import SEEK_CUR, StringIO
s = StringIO()
print("line", file=s)  # Some inflexible API with an unwanted newline
s.seek(-1, SEEK_CUR)  # Undo the trailing newline

In general, relative seeks are not allowed for text streams, and absolute offsets have arbitrary values. But when no encoding is actually going on, these restrictions are annoying.
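
For illustration, here is a minimal sketch of the restriction and of a workaround that seems to work today. It assumes CPython's io.StringIO with the default newline setting, where tell() happens to return a code point offset; the documentation only promises an opaque number, so the arithmetic on tell() is an assumption, not a guarantee. The names relative_seek_ok and result are just for this example.

```python
from io import SEEK_CUR, StringIO

s = StringIO()
print("line", file=s)  # writes "line\n"

try:
    s.seek(-1, SEEK_CUR)  # nonzero relative seek on a text stream
    relative_seek_ok = True
except OSError:
    # io.UnsupportedOperation is a subclass of OSError, so this covers both
    relative_seek_ok = False

if not relative_seek_ok:
    # Assumption: tell() is a plain code point offset here, so
    # subtracting one steps back over the trailing newline.
    s.seek(s.tell() - 1)

s.truncate()  # drop everything from the current position onwards
result = s.getvalue()
```

Either way the buffer should end up holding the text without its trailing newline. The proposal is essentially to make the first branch succeed and to document the offset arithmetic in the second.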

I guess the biggest problem is what to do when newline translation is enabled. But I think this is a rarely-used feature of StringIO. I suggest saying that offsets in that case remain arbitrary, and letting the code do whatever it happens to do (probably jumping to the wrong character, chopping CRLFs in half, etc.), as long as it won’t crash.
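
A small sketch of why translation muddies the offsets, assuming the current io.StringIO behaviour where "\n" is expanded on write when newline="\r\n" is given. The variable names are only for this example.

```python
from io import StringIO

# Default newline="\n": what is written is what is stored.
plain = StringIO()
plain.write("a\nb")

# newline="\r\n": each "\n" written is stored as "\r\n".
translated = StringIO(newline="\r\n")
translated.write("a\nb")

plain_value = plain.getvalue()
translated_value = translated.getvalue()
```

The caller wrote three characters in both cases, but the translated buffer holds four, so an offset counted in written characters no longer matches an offset into the stored string.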
msg251152 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015-09-20 06:47
I suspect this would not be easy to do in the Python implementation.
msg251206 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-09-21 06:52
I see the _pyio implementation wraps BytesIO with UTF-8 encoding. Perhaps it would be okay to change to UTF-32 encoding (a fixed-length Unicode encoding). That would use more memory, but the C implementation seems to use a Py_UCS4 buffer already. Then you could reimplement seek(), tell(), and truncate() by detaching and rebuilding the TextIOWrapper over the top. Not super efficient, but perhaps that does not matter for the _pyio implementation.
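
To illustrate why a fixed-length encoding helps, here is a rough sketch (not the approach of any actual patch) that maps a code point index straight to a byte offset in a UTF-32 BytesIO. The helper overwrite_at and the constant CODE_UNIT are hypothetical names, and "utf-32-le" is chosen only to avoid a BOM at the start of the buffer.

```python
import io

CODE_UNIT = 4  # bytes per code point in UTF-32


def overwrite_at(buf: io.BytesIO, index: int, text: str) -> None:
    """Overwrite code points starting at `index` (hypothetical helper)."""
    buf.seek(index * CODE_UNIT)  # code point offset -> byte offset
    buf.write(text.encode("utf-32-le"))


# Mix of ASCII, a Latin-1 character and a non-BMP character.
original = "na\u00efve \U0001F40D text"
buf = io.BytesIO(original.encode("utf-32-le"))

overwrite_at(buf, 6, "X")  # replace the snake at index 6
result = buf.getvalue().decode("utf-32-le")
```

With UTF-8 the same index would need a scan from the start of the buffer, because each code point takes between one and four bytes.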

The fact that it is so hard to do this (random write access to a large Unicode buffer) in native Python could be another argument to support this in the default StringIO implementation :)
msg256292 - (view) Author: Марк Коренберг (socketpair) * Date: 2015-12-12 20:20
#25849 ?
msg256302 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-12 23:10
Mark: This issue is about StringIO only. I am not proposing any change to TextIOBase or how on-disk text files are handled.

I intend to propose a patch to make StringIO more liberal, but haven’t got around to it yet. Do you think it would be worthwhile? IMO it would make StringIO a fairly efficient mutable text buffer. The alternatives [list(str), array("u")] are slower and/or use more than 4 bytes per character.
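
A quick sketch of the kind of use I have in mind, next to the list(str) alternative. It assumes the current CPython io.StringIO with no newline translation, where an absolute seek to a small integer already lands on that code point in practice; that is exactly the behaviour this issue would like to see defined.

```python
from io import StringIO

text = "hello world"

# Alternative: a list of one-character strings, joined back at the end.
chars = list(text)
chars[6:11] = "WORLD"
from_list = "".join(chars)

# StringIO as a mutable text buffer: seek to an offset and overwrite.
buf = StringIO(text)
buf.seek(6)  # assumed to mean "code point 6" (not documented as such)
buf.write("WORLD")
from_stringio = buf.getvalue()
```

Both give the same string, but the StringIO version edits its internal buffer in place instead of holding one object reference per character.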
msg256441 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-15 06:59
There were a few tricky bits doing this with _pyio.StringIO, but I think I was successful. Here is a patch with both implementations and some tests. If people think this should go ahead, I can add documentation.

In the process I may have discovered a bug with the TextIOWrapper implementations. Is calling truncate() meant to truncate the internal read buffer? At the moment you can read back truncated data, although the underlying byte stream is actually truncated.
Date User Action Args
2020-01-20 08:22:33 serhiy.storchaka link issue39365 superseder
2015-12-15 06:59:59 martin.panter set files: + stringio-seek.patch; keywords: + patch; messages: + msg256441
2015-12-12 23:10:02 martin.panter set messages: + msg256302
2015-12-12 20:20:10 socketpair set nosy: + socketpair; messages: + msg256292
2015-09-21 06:52:13 martin.panter set messages: + msg251206
2015-09-20 06:47:42 serhiy.storchaka set nosy: + pitrou, benjamin.peterson, stutzbach, serhiy.storchaka; messages: + msg251152
2015-09-20 06:15:12 martin.panter create