Define StringIO seek offset as code point offset #69377

Open
vadmium opened this issue Sep 20, 2015 · 7 comments
Labels
topic-IO type-feature A feature request or enhancement

Comments

@vadmium
Member

vadmium commented Sep 20, 2015

BPO 25190
Nosy @pitrou, @benjaminp, @socketpair, @eli-b, @vadmium, @serhiy-storchaka
Files
  • stringio-seek.patch

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    GitHub fields:

    assignee = None
    closed_at = None
    created_at = <Date 2015-09-20.06:15:12.941>
    labels = ['type-feature', 'expert-IO']
    title = 'Define StringIO seek offset as code point offset'
    updated_at = <Date 2021-05-19.09:44:48.912>
    user = 'https://github.com/vadmium'

    bugs.python.org fields:

    activity = <Date 2021-05-19.09:44:48.912>
    actor = 'Eli_B'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['IO']
    creation = <Date 2015-09-20.06:15:12.941>
    creator = 'martin.panter'
    dependencies = []
    files = ['41313']
    hgrepos = []
    issue_num = 25190
    keywords = ['patch']
    message_count = 6.0
    messages = ['251149', '251152', '251206', '256292', '256302', '256441']
    nosy_count = 7.0
    nosy_names = ['pitrou', 'benjamin.peterson', 'stutzbach', 'socketpair', 'Eli_B', 'martin.panter', 'serhiy.storchaka']
    pr_nums = []
    priority = 'normal'
    resolution = None
    stage = None
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue25190'
    versions = ['Python 3.6']

    @vadmium
    Member Author

    vadmium commented Sep 20, 2015

    This follows from bpo-12922. When no newline translation is being done, it would be useful to define the seek() offset as the code point offset into the underlying string, allowing stuff like:

    from io import SEEK_CUR, StringIO

    s = StringIO()
    print("line", file=s)  # Some inflexible API with an unwanted newline
    s.seek(-1, SEEK_CUR)  # Undo the trailing newline (currently raises an error)
    s.truncate()

    In general, relative seeks are not allowed for text streams, and absolute offsets have arbitrary values. But when no encoding is actually going on, these restrictions are annoying.
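To make the restriction concrete, here is a minimal sketch of what happens today and one possible workaround. It assumes CPython's current StringIO with the default newline handling, where tell() happens to return a code point index. That is an implementation detail rather than documented behaviour, and the helper name `drop_last_char` is hypothetical.

```python
import io


def drop_last_char(stream: io.StringIO) -> None:
    """Remove the character just before the current position (hypothetical helper)."""
    try:
        # The form proposed in this issue; rejected by current implementations.
        stream.seek(-1, io.SEEK_CUR)
    except OSError:  # io.UnsupportedOperation is a subclass of OSError
        # Fallback: an absolute seek computed from tell(). This assumes the
        # offset is a code point index, which is not guaranteed by the docs.
        stream.seek(stream.tell() - 1)
    stream.truncate()


s = io.StringIO()
print("line", file=s)  # writes "line\n"
drop_last_char(s)
```

After the call, `s.getvalue()` should contain the text without its trailing newline.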

    I guess the biggest problem is what to do when newline translation is enabled. But I think this is a rarely-used feature of StringIO. I suggest saying that offsets in that case remain arbitrary, and letting the code do whatever it happens to do (probably jumping to the wrong character, chopping CRLFs in half, etc., as long as it won’t crash).

    @vadmium vadmium added topic-IO type-feature A feature request or enhancement labels Sep 20, 2015
    @serhiy-storchaka
    Member

    I suspect this would not be easy to do in the Python implementation.

    @vadmium
    Member Author

    vadmium commented Sep 21, 2015

    I see the _pyio implementation wraps BytesIO with UTF-8 encoding. Perhaps it would be okay to change to UTF-32 encoding (a fixed-length Unicode encoding). That would use more memory, but the C implementation seems to use a Py_UCS4 buffer already. Then you could reimplement seek(), tell(), and truncate() by detaching and rebuilding the TextIOWrapper over the top. Not super efficient, but perhaps that does not matter for the _pyio implementation.
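The reason a fixed-length encoding helps can be sketched as follows. This is only an illustration of the idea, not the approach taken in any patch: with UTF-32 every code point occupies four bytes, so a code point index maps directly to a byte offset in the underlying BytesIO. The names `WIDTH` and `overwrite_code_point` are hypothetical.

```python
import io

WIDTH = 4  # bytes per code point in UTF-32


def overwrite_code_point(raw: io.BytesIO, index: int, char: str) -> None:
    """Overwrite the code point at `index` in a UTF-32-LE encoded buffer."""
    raw.seek(index * WIDTH)  # code point index -> byte offset
    raw.write(char.encode("utf-32-le"))  # "utf-32-le" emits no byte order mark


# A non-BMP character still occupies exactly one 4-byte slot.
raw = io.BytesIO("a\U0001F600c".encode("utf-32-le"))
overwrite_code_point(raw, 2, "z")
result = raw.getvalue().decode("utf-32-le")
```

With UTF-8, by contrast, the byte offset of the third code point would depend on how many bytes the preceding characters need.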

    The fact that it is so hard to do this (random write access to a large Unicode buffer) in native Python could be another argument to support this in the default StringIO implementation :)

    @socketpair
    Mannequin

    socketpair mannequin commented Dec 12, 2015

    bpo-25849 ?

    @vadmium
    Member Author

    vadmium commented Dec 12, 2015

    Mark: This issue is about StringIO only. I am not proposing any change to TextIOBase or how on-disk text files are handled.

    I intend to propose a patch to make StringIO more liberal, but haven’t got around to it yet. Do you think it would be worthwhile? IMO it would make StringIO a fairly efficient mutable text buffer. The alternatives [list(str), array("u")] are slower and/or use more than 4 bytes per character.
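For comparison, a rough sketch of the list(str) alternative next to an in-place overwrite on StringIO. The StringIO part uses an absolute seek to a small integer offset, which works in current CPython but relies on offsets being code point indexes; treat that as an assumption rather than documented behaviour.

```python
import io

# Alternative 1: a list of one-character strings as a mutable text buffer.
chars = list("line\n")
del chars[-1]  # drop the trailing newline
chars[0] = "L"  # overwrite a single character in place
text_from_list = "".join(chars)

# Alternative 2: StringIO, overwriting at an absolute offset.
s = io.StringIO("line\n")  # the position starts at 0
s.write("L")  # overwrites the first character
text_from_stringio = s.getvalue()
```

The list version is simple, but it stores a reference per character rather than a packed buffer, which is the overhead mentioned above.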

    @vadmium
    Member Author

    vadmium commented Dec 15, 2015

    There were a few tricky bits doing this with _pyio.StringIO, but I think I was successful. Here is a patch with both implementations and some tests. If people think this should go ahead, I can add documentation.

    In the process I may have discovered a bug with the TextIOWrapper implementations. Is calling truncate() meant to truncate the internal read buffer? At the moment you can read back truncated data, although the underlying byte stream is actually truncated.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    @marscher

    After 7 years, I'd like to use this "feature", but it still raises the confusing error message described earlier. IMO this should be possible.
