This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: files, opened in unicode (text): write() returns symbols count, but seek() expect offset in bytes
Type: Stage: resolved
Components: IO, Library (Lib), Unicode Versions: Python 3.6, Python 3.3, Python 3.4, Python 3.5
process
Status: closed Resolution: not a bug
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, martin.panter, pitrou, r.david.murray, socketpair, vstinner
Priority: normal Keywords:

Created on 2015-12-12 20:19 by socketpair, last changed 2022-04-11 14:58 by admin. This issue is now closed.

Messages (15)
msg256291 - (view) Author: Марк Коренберг (socketpair) * Date: 2015-12-12 20:19
It seems, that we should deprecate .seek() on files, opened in text mode.

Since it is not possible to seek to position between symbols. Yes, it is possible to decode UTF-8 (or other charset) starting from beginning of the file and count symbols, but it is EXTREMELY SLOW, and is not what user expect. If so, seeking from end of file back to begin may be implemented in even more hard and error-prone way.

Moreover, I consider that we should disallow seek in text files except seek() to begin of the file (position 0) or end of file (seek(0, SEEK_END)).

Seel also issue25190 #25190 about something related for that.
msg256293 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-12 20:33
As mentioned in those issues, currently the peek/seek token is a black box.  That doesn't mean it isn't useful.  Those issues are talking about potential ways to make it more useful, so any discussion should occur there.
msg256324 - (view) Author: Марк Коренберг (socketpair) * Date: 2015-12-13 17:33
https://docs.python.org/3.5/library/io.html?highlight=stringio#id3 :

Also, TextIOWrapper.tell() and TextIOWrapper.seek() are both quite slow due to the reconstruction algorithm used.

What is reconstruction algorightm ? Experiments show, that seek() and tell() returns values of count of bytes (not letters).


#!/usr/bin/python3.5
import tempfile

with tempfile.TemporaryFile(mode='r+t') as f:
    l = f.write('привет')
    print(l, f.tell()) # "6 12"
    f.seek(3)
    f.write('прекол42')
    f.seek(0)
    print(f.read()) # raise UnicodeDecodeError

So, please reopen. Issue is still here.
msg256329 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-13 19:01
I'm still not seeing a bug.

If you have a performance enhancement or functional enhancement you'd like us to consider, please attach a patch, with benchmark results.

Since you say "are quite slow because of the reconstruction algorithm", what makes you say this?  I'd think the "algorithm" was just using the underlying bytes tell/seek value, which then becomes a black box token because it does not have a one to one releationship to the character count.
msg256415 - (view) Author: Марк Коренберг (socketpair) * Date: 2015-12-14 22:12
First, it seems that there are no real "reconstruction algorithm" at all. Seek is allowed to point to any byte position, even to place "inside" characters for multibyte encodings, such as UTF-8.

Second, about performance:  I talk about implementation mentioned in first message. If it is not used (and will not be used), we may forget about that sentence.

Next, once again:

I consider it is a bug in allowing to seek to invalid byte offsets for text files. Since we cannot easily calculate what offset will be valid (for example, seek past the end of file, or places inside character), just disallow seek. In real applications, no one will seek/peek to places other than

* beginning of the file
* current byte offset
* seeking to the end of file.

so this seeks/peeks must be allowed.

This is applicable only to variable multibyte encodings (such as UTF-8).
msg256416 - (view) Author: Марк Коренберг (socketpair) * Date: 2015-12-14 22:14
s/peek/tell/
msg256417 - (view) Author: Марк Коренберг (socketpair) * Date: 2015-12-14 22:17
Also, can you provide the case, where such random seeks can be used on text files? It would be programmer error to seek to places other I mention. Does not it ?
msg256418 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2015-12-14 22:42
I think you haven't quite gotten what "opaque token" means in this context.  The way you use tell/seek with text files is: you have read to some certain point in the file.  You call 'tell' and get back an opqaue token.  Later you can call seek with that token to get back to the place in the file that you "bookmarked".  It will never be between characters, because tell won't return such a poitner.  If you decide to call seek with something (other than 0) that you didn't get from tell, then you are on your own.
msg256433 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-15 03:20
You might be right about the “reconstruction algorithm”. This text was added in revision 0bba533c0959; maybe Antoine can comment whether we should clarify or remove it.

I think the text added for TextIOBase.seek() in revision 03e61104f7a2 (Issue 12922) is closer to the truth. Though I would probably drop the bit about tell() not usually returning a byte position; for many codecs it does seem to.

This illustrates the only four cases of seeking I understand are allowed for text streams:

>>> text = TextIOWrapper(BytesIO(), "utf-7")
>>> text.write("привет")
6
>>> text.seek(0)  # 1: Rewind to start
0
>>> text.read(1)
'п'
>>> saved = text.tell()
>>> text.read()
'ривет'
>>> text.seek(saved)  # 2: Seek to saved offset
340282368347045388720132684115559317504
>>> text.read(1)
'р'
>>> text.seek(0, SEEK_CUR)  # 3: No movement
680564735267983852183507291547327528960
>>> text.read(1)
'и'
>>> text.seek(0, SEEK_END)  # 4: Seek to end
18
>>> text.read()  # EOF
''

If the “slow reconstruction algorithm” was clarified or removed, and the documentation explained that you cannot seek to arbitrary characters without having previously called tell(), would that work?
msg256440 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-15 06:48
I’m starting to understand that there might be a “reconstruction algorithm” needed. When reading, TextIOWrapper buffers decoded characters. If you call tell() and there is unread but decoded data, it is not enough to return the incremental decoder state. You have to handle the unread buffered data as well. Looking at the _pyio tell() implementation, it tries to wind the decoder backwards to minimize the state.
msg256447 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015-12-15 08:33
I don't understand what the complaint is. If you think seek()/tell() are not useful, just don't use them.
msg256448 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-12-15 10:05
> If the “slow reconstruction algorithm” was clarified or removed, ...

I wrote this algorithm, or I helpd to write it, I don't recall.

The problem is readahead: TextIOWrapper read more bytes than requested for performances. But when tell() is called, the user expects to get the current file position, not the "read ahead" file position. So we have to go backward. Problem: TextIOWrapper uses text (Unicode) whereas all files are bytes on the disk. We need to compute the size of the readahead buffer in bytes from a buffer in characters.

The bad performances comes from multibyte codecs which requires heuristic to first guess the number of bytes and then really encode back bytes to find the exact size.

See _pyio.TextIOWrapper.tell() for the Python implementation.

# Fast search for an acceptable start point, close to our
# current pos.
# Rationale: calling decoder.decode() has a large overhead
# regardless of chunk size; we want the number of such calls to
# be O(1) in most situations (common decoders, non-crazy input).
# Actually, it will be exactly 1 for fixed-size codecs (all
# 8-bit codecs, also UTF-16 and UTF-32).

(Incomplete) history of the Python implementation of the tell() method:

* changeset 7c6972f37fe3 (2007)
* changeset 28bc7ed26574: More efficient implementation
* changeset b5a2e753b682: use the new getstate/setstate decoder API
* changeset 04050373d799 (2008): fix for stateful decoders
* changeset 39a4f4393ef1: additional fixes to the handling of 'limit'
* (Lib/io.py moved to Lib/_pyio.py)
* changeset 4b6052320e98 (Issue #11114): optimization
msg256551 - (view) Author: Марк Коренберг (socketpair) * Date: 2015-12-16 22:40
Well,  03e61104f7a2 adds good description, why not to enforce checks instead of saying that some values are unsupported ?

Also, idea in returning special object instance from tell(), this object should incapsulate byte offset. And allow for the seek() either such objects or zero.
msg256559 - (view) Author: Марк Коренберг (socketpair) * Date: 2015-12-16 22:59
Seems this object should also incapsulate weakref to opened file... (to prevent using it on another text file)

* after writing, all objects, that incapsulate offset > current (during start of write()) should become invalid.
* truncate(arg) should also make that objects invalid for offsets > arg


One bad thing: seeking to the end of file is not safe:
If someone opens logfile, last byte in that text file may belong to middle of multibyte sequence. This really may happen, since Linux (libc) writes files page-by-page (if not line-buffered), so if page boundary is inside character, this bad thing may happen. Real case - is implementing something like `tail -f` in python (to, say, monitor log files).

So, seeking to the end of file is also wrong, and should be disallowed. If so, how to append text to text files ? very upset :( :( I don't know decision of that problem.
msg256563 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2015-12-16 23:27
I think changing the TextIOBase API would be hard to do if you want to keep compatibility with existing code. I agree that encoding the position to a number and back seems like a bad design, but I doubt it is worth changing it at this point.
History
Date User Action Args
2022-04-11 14:58:24adminsetgithub: 70036
2017-11-09 17:58:10serhiy.storchakasetstatus: pending -> closed
2017-09-20 13:39:12serhiy.storchakasetstatus: open -> pending
2015-12-16 23:27:16martin.pantersetmessages: + msg256563
2015-12-16 22:59:26socketpairsetmessages: + msg256559
2015-12-16 22:40:14socketpairsetmessages: + msg256551
2015-12-15 10:05:01vstinnersetmessages: + msg256448
2015-12-15 08:33:49pitrousetmessages: + msg256447
2015-12-15 06:48:42martin.pantersetmessages: + msg256440
2015-12-15 03:20:44martin.pantersetnosy: + pitrou, martin.panter
messages: + msg256433
2015-12-14 22:42:40r.david.murraysetmessages: + msg256418
2015-12-14 22:17:19socketpairsetmessages: + msg256417
2015-12-14 22:14:41socketpairsetstatus: closed -> open

messages: + msg256416
2015-12-14 22:12:36socketpairsetmessages: + msg256415
2015-12-13 19:01:53r.david.murraysetmessages: + msg256329
2015-12-13 17:33:15socketpairsetmessages: + msg256324
2015-12-12 20:33:21r.david.murraysetstatus: open -> closed

nosy: + r.david.murray
messages: + msg256293

resolution: not a bug
stage: resolved
2015-12-12 20:19:49socketpaircreate