classification
Title: File truncate() not defaulting to current position as documented
Type: Stage: needs patch
Components: Documentation, IO Versions: Python 3.6, Python 3.5, Python 2.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: docs@python Nosy List: Zero, benjamin.peterson, docs@python, eryksun, fornax, martin.panter, pitrou, random832, serhiy.storchaka, socketpair, steve.dower, stutzbach
Priority: normal Keywords:

Created on 2016-01-19 15:19 by fornax, last changed 2016-01-20 04:31 by random832.

Messages (15)
msg258600 - (view) Author: Fornax (fornax) Date: 2016-01-19 15:19
io.IOBase.truncate() documentation says:
"Resize the stream to the given size in bytes (or the current position if size is not specified). The current stream position isn’t changed. This resizing can extend or reduce the current file size. In case of extension, the contents of the new file area depend on the platform (on most systems, additional bytes are zero-filled). The new file size is returned."

However:
>>> open('temp.txt', 'w').write('ABCDE\nFGHIJ\nKLMNO\nPQRST\nUVWXY\nZ\n')
32
>>> f = open('temp.txt', 'r+')
>>> f.readline()
'ABCDE\n'
>>> f.tell()
6                   # As expected, current position is 6 after the readline
>>> f.truncate()
32                  # ?!

Verified that the document does not get truncated to 6 bytes as expected. Adding an explicit f.seek(6) before the truncate causes it to work properly (truncate to 6). It also works as expected using a StringIO rather than a file, or in Python 2 (used 2.7.9).

Tested in 3.4.3/Windows, 3.4.1/Linux, 3.5.1/Linux.
msg258611 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-19 17:24
This is because the file is buffered.

>>> open('temp.txt', 'w').write('ABCDE\nFGHIJ\nKLMNO\nPQRST\nUVWXY\nZ\n')
32
>>> f = open('temp.txt', 'r+')
>>> f.readline()
'ABCDE\n'
>>> f.tell()
6
>>> f.buffer.tell()
32
>>> f.buffer.raw.tell()
32

The documentation needs a clarification.
msg258612 - (view) Author: Fornax (fornax) Date: 2016-01-19 17:46
To clarify... the intended behavior is for truncate to default to the current position of the buffer, rather than the current position as reported directly from the stream by tell?

That seems... surprising.
msg258613 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-01-19 17:46
Serhiy, why doesn't truncate do a seek(0, SEEK_CUR) to synchronize the buffer's file pointer before calling its truncate method? This also affects writing in "+" modes when the two file pointers are out of sync.
msg258614 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-19 18:04
This is not always possible. Consider following example:

>>> open('temp.txt', 'wb').write(b'+BDAEMQQyBDMENA-')
16
>>> f = open('temp.txt', 'r+', encoding='utf-7')
>>> f.read(2)
'аб'

What should be the result of truncating?

I think it would be better to not implement truncate() for text files at all.
msg258615 - (view) Author: Fornax (fornax) Date: 2016-01-19 18:23
Heh... building on Serhiy's example:

>>> f.tell()
680564735109527527154978616360239628288

I'm way out of my depth here. The results seem surprising, but I lack the experience to be able to say what they "should" look like. So I guess if it's working as intended and just needs clarification in the documentation, so be it.

Thanks for pointing out what's actually causing the behavior.
msg258617 - (view) Author: Fornax (fornax) Date: 2016-01-19 19:24
After taking a little time to let this sink in, I'm going to play Devil's Advocate just a little more.

It sounds like you're basically saying that any read-write text-based modes (e.g. r+, w+) should be used at your own peril. While I understand your UTF-7 counterexample, and it's a fair point, is it out of line to expect that for encodings that operate on full bytes, file positioning should work a bit more intuitively? (Which is to say, a write/truncate after a read should take place in the position immediately following the end of the read.)
msg258618 - (view) Author: Fornax (fornax) Date: 2016-01-19 19:38
Another surprising result:

>>> open('temp.txt', 'w').write('ABCDE\nFGHIJ\nKLMNO\nPQRST\nUVWXY\nZ\n')
32
>>> f = open('temp.txt', 'r+')
>>> f.write('test')
4
>>> f.close()
>>> open('temp.txt').read()
'testE\nFGHIJ\nKLMNO\nPQRST\nUVWXY\nZ\n'

>>> open('temp.txt', 'w').write('ABCDE\nFGHIJ\nKLMNO\nPQRST\nUVWXY\nZ\n')
32
>>> f = open('temp.txt', 'r+')
>>> f.write('test')
4
>>> f.read(1)
'A'
>>> f.close()
>>> open('temp.txt').read()
'ABCDE\nFGHIJ\nKLMNO\nPQRST\nUVWXY\nZ\ntest'

The position of the write in the file depends on whether or not there is a subsequent read.
msg258619 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-01-19 20:00
May be. Looking at the code, both Python and C implementations of TextIOWrapper look incorrect.

Python implementation:

    def truncate(self, pos=None):
        self.flush()
        if pos is None:
            pos = self.tell()
        return self.buffer.truncate(pos)

If pos is not specified, self.tell() is used as truncating position for underlying binary file. But self.tell() is not an offset in bytes, as seen from UTF-7 example. This is complex cookie that includes starting position in binary file, a number of bytes that should be read and feed to the decoder, and other decoder flags. Needed at least unpack the cookie returned by self.tell(), and raise an exception if it doesn't ambiguously point to binary file position.

C implementation is equivalent to:

    def truncate(self, pos=None):
        self.flush()
        return self.buffer.truncate(pos)

It just ignores decoder buffer.
msg258622 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-19 21:38
In theory, TextIOWrapper could rewrite the last bit of the file (or the whole file) to have the requested number of characters. But I wonder if it is worth it; maybe deprecation is better. Do you have a use case for any of these bugs, or are you just playing around to see what the methods do?

In Issue 12922, seek() and tell() were (re-)defined for TextIOBase, but the situation with truncate() was apparently not considered.

Perhaps the write()–read() bug is related to Issue 12215.
msg258624 - (view) Author: Fornax (fornax) Date: 2016-01-19 21:56
I don't have a specific use case. This spawned from a tangentially-related StackOverflow question (http://stackoverflow.com/questions/34858088), where in the answers a behavior difference between Python 2 and 3 was noted. I couldn't find any documentation to explain it, so I opened a follow-up question (http://stackoverflow.com/questions/34879318), and based on some feedback I got there, I opened up this issue.

Just to be sure I understand, you're suggesting deprecating truncate on text-mode file objects? And the interleaved read-writes are likely related to an issue that will be dealt with elsewhere?
msg258626 - (view) Author: Марк Коренберг (socketpair) * Date: 2016-01-19 22:11
text files and seek() offset: issue25849
msg258634 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2016-01-19 23:34
FYI, you can parse the cookie using struct or ctypes. For example:

    class Cookie(ctypes.Structure):
        _fields_ = (('start_pos',     ctypes.c_longlong),
                    ('dec_flags',     ctypes.c_int),
                    ('bytes_to_feed', ctypes.c_int),
                    ('chars_to_skip', ctypes.c_int),
                    ('need_eof',      ctypes.c_byte))

In the simple case only the buffer start_pos is non-zero, and the result of tell() is just the 64-bit file pointer. In Serhiy's UTF-7 example it needs to also convey the bytes_to_feed and chars_to_skip values:

    >>> f.tell()
    680564735109527527154978616360239628288
    >>> cookie_bytes = f.tell().to_bytes(ctypes.sizeof(Cookie), sys.byteorder)
    >>> state = Cookie.from_buffer_copy(cookie_bytes)
    >>> state.start_pos
    0
    >>> state.dec_flags
    0
    >>> state.bytes_to_feed
    16
    >>> state.chars_to_skip
    2
    >>> state.need_eof
    0

So a seek(0, SEEK_CUR) in this case has to seek the buffer to 0, read and decode 16 bytes, and skip 2 characters. 

Isn't this solvable at least for the case of truncating, Martin? It could do a tell(), seek to the start_pos, read and decode the bytes_to_feed, re-encode the chars_to_skip, seek back to the start_pos, write the encoded characters, and then truncate.

    >>> f = open('temp.txt', 'w+', encoding='utf-7')
    >>> f.write(b'+BDAEMQQyBDMENA-'.decode('utf-7'))
    5
    >>> _ = f.seek(0); f.read(2)
    'аб'
    >>> cookie_bytes = f.tell().to_bytes(sizeof(Cookie), byteorder)
    >>> state = Cookie.from_buffer_copy(cookie_bytes)
    >>> f.buffer.seek(state.start_pos)
    0
    >>> buf = f.buffer.read(state.bytes_to_feed)
    >>> s = buf.decode(f.encoding)[:state.chars_to_skip]
    >>> f.buffer.seek(state.start_pos)
    0
    >>> f.buffer.write(s.encode(f.encoding))
    8
    >>> f.buffer.truncate()
    8
    >>> f.close()
    >>> open('temp.txt', encoding='utf-7').read()
    'аб'

Rewriting the encoded bytes is necessary to properly terminate the UTF-7 sequence, which makes me doubt whether this simple approach will work for all codecs. But something like this is possible, no?
msg258637 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-01-20 03:04
Fornax: Yes, I was suggesting the idea of deprecating truncate() for text files! Although a blanket deprecation of all cases may not be realistic. Quickly reading the Stack Overflow pages, it seems like there is demand for this to work in some cases. Deprecating it in the more awkward situations, such as after after reading, and with specific kinds of codecs, might be an option though.

Now I think Issue 12215 (read then write) is more closely related to the read-then-truncate problem. For the write-then-read bug, it might be a separate problem with an easy fix: call flush() before changing to reader mode.

Eryk: If there is no decoder state, and the file data hasn’t changed, maybe it is solvable. But I realize now it won’t work in general. We would have to construct the encoder state from the decoder state. The same problem as Issue 12215.
msg258639 - (view) Author: (random832) Date: 2016-01-20 04:31
In the analogous C operations, ftell (analogous to .tell) actually causes the underlying file descriptor's position (analogous to the raw stream's position) to be reset to be at the same value that ftell has returned. Which means, yes, that you lose the benefits of buffering if you're so foolish as to call ftell after every read. But in this case the sequence "read / tell / truncate" would be analogous to "fread(f) / ftell(f) / ftruncate(fileno(f))

Though, the fact that fread operates on the FILE * whereas truncate operates on a file descriptor serves as a red flag to C programmers... arguably since this is not the case with Python, truncate on a buffered stream should implicitly include this same "reset underlying position" operation before actually performing the truncate.
History
Date User Action Args
2019-03-13 03:41:28josh.rlinkissue36278 superseder
2018-11-24 01:11:06martin.panterlinkissue35304 superseder
2016-01-20 04:31:30random832setnosy: + random832
messages: + msg258639
2016-01-20 03:04:51martin.pantersetmessages: + msg258637
2016-01-19 23:34:46eryksunsetmessages: + msg258634
2016-01-19 22:11:11socketpairsetnosy: + socketpair
messages: + msg258626
2016-01-19 22:07:34vstinnersetnosy: - vstinner
2016-01-19 21:56:22fornaxsettype: behavior ->
messages: + msg258624
2016-01-19 21:38:29martin.pantersetnosy: + martin.panter
messages: + msg258622
2016-01-19 20:00:14serhiy.storchakasettype: behavior
messages: + msg258619
stage: needs patch
2016-01-19 19:38:15fornaxsetmessages: + msg258618
2016-01-19 19:24:28fornaxsetmessages: + msg258617
2016-01-19 18:23:00fornaxsetmessages: + msg258615
2016-01-19 18:04:26serhiy.storchakasetmessages: + msg258614
2016-01-19 17:46:56eryksunsetnosy: + eryksun
messages: + msg258613
2016-01-19 17:46:14fornaxsetmessages: + msg258612
2016-01-19 17:24:43serhiy.storchakasetversions: + Python 2.7, Python 3.6, - Python 3.4
nosy: + stutzbach, benjamin.peterson, pitrou, docs@python, vstinner, serhiy.storchaka, steve.dower

messages: + msg258611

assignee: docs@python
components: + Documentation
2016-01-19 16:41:17Zerosetnosy: + Zero
2016-01-19 15:19:44fornaxcreate