This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Python 3 f.tell() gets out of sync with file pointer in binary append+read mode
Type: behavior Stage: patch review
Components: IO Versions: Python 3.8, Python 3.7
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: Jeffrey.Kintscher, Windson Yang, benjamin.peterson, martin.panter, pewscorner, serhiy.storchaka, terry.reedy
Priority: normal Keywords: patch

Created on 2019-03-24 00:33 by pewscorner, last changed 2022-04-11 14:59 by admin.

Files
File name Uploaded Description Edit
bpo-36411.py Jeffrey.Kintscher, 2019-05-28 07:24 Python test file
bpo-36411.c Jeffrey.Kintscher, 2019-05-28 07:29 C test file
Pull Requests
URL Status Linked Edit
PR 13717 open Jeffrey.Kintscher, 2019-06-01 05:24
Messages (13)
msg338709 - (view) Author: PEW's Corner (pewscorner) * Date: 2019-03-24 00:33
When a file is opened in binary append+read mode, i.e. open('file', 'a+b'), and a write (i.e. append) operation is performed while the file pointer is not at the end of the file (e.g. after a seek(0)), tell() will subsequently return the wrong value in Python 3 (but not in Python 2).

See this SO question:
https://stackoverflow.com/questions/51010322/python-3-tell-gets-out-of-sync-with-file-pointer-in-appendread-mode
msg338775 - (view) Author: Windson Yang (Windson Yang) * Date: 2019-03-25 00:51
This looks interesting. Let me try to fix it.
msg339171 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2019-03-30 04:56
'Python 3.9' does not exist yet.  This choice is for patches that should only be applied to 3.9 when available (like future deprecations or removals).
msg343340 - (view) Author: Windson Yang (Windson Yang) * Date: 2019-05-24 03:10
I'm not sure it's a bug. When you write binary data to file (use BufferedIOBase by default). It actually writes the data to a buffer. That is why tell() gets out of sync. You can follow the instrument belolw. For instance, call flush() after writing to get the `correct answer.`

> When writing to this object, data is normally placed into an internal buffer. The buffer will be written out to the underlying RawIOBase object under various conditions, including:

> when the buffer gets too small for all pending data;
> when flush() is called;
> when a seek() is requested (for BufferedRandom objects);
> when the BufferedWriter object is closed or destroyed.

1. https://docs.python.org/3/library/io.html#io.BufferedWriter
msg343341 - (view) Author: Windson Yang (Windson Yang) * Date: 2019-05-24 03:12
I think we should mention it at the document, like in the tell() function.
msg343345 - (view) Author: PEW's Corner (pewscorner) * Date: 2019-05-24 06:06
But buffers are used - and update tell() correctly - in all other cases than binary append+read mode (and even this case works in Python 2). Also, the implementation clearly tries to keep tell() updated irrespective of the buffer, so isn't it just a matter of reminding it to move the pointer to the end of the file when doing an append?

If that isn't feasible, then of course it would be good to document the problem in the description of tell().
msg343742 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-05-28 07:24
I adapted the test code on StackOverflow to show the expected and actual values.

I get this output using Python 2.7.16:


opening myfile with wb
writing 3 bytes to file
('f.tell(): expecting 3, got', 3)
closing myfile
opening myfile with a+b
('f.tell(): expecting 3, got', 3)
writing 3 bytes to file
('f.tell(): expecting 6, got', 6)
('f.seek(0): expecting 0, got', None)
('f.tell(): expecint 0, got', 0)
writing 3 bytes to file
('f.tell(): expecting 9, got', 9)
writing 3 bytes to file
('f.tell(): expecting 12, got', 12)
('f.seek(0, io.SEEK_CUR): expecting 12, got', None)
('f.tell(): expecting 12, got', 12)
('f.seek(0): expecting 0, got', None)
("f.read(): expecting b'abcdefghijkl', got", 'abcdefghijkl')
closing file
removing file


As described, I get different results using Python 3.7.3 as well as the master branch (Python 3.8.0a4+):


closing myfile
opening myfile with a+b
f.tell(): expecting 3, got 3
writing 3 bytes to file
f.tell(): expecting 6, got 6
f.seek(0): expecting 0, got 0
f.tell(): expecint 0, got 0
writing 3 bytes to file
f.tell(): expecting 9, got 3
writing 3 bytes to file
f.tell(): expecting 12, got 6
f.seek(0, io.SEEK_CUR): expecting 12, got 12
f.tell(): expecting 12, got 12
f.seek(0): expecting 0, got 0
f.read(): expecting b'abcdefghijkl', got b'abcdefghijkl'
closing file
removing file


I wrote an equivalent C program using the stream I/O functions and get the same results as Python 2.7.16:


opening file with wb
writing 3 bytes to file
ftell(f): expecting 3, got 3
closing file
opening file with a+b
ftell(f): expecting 3, got 3
writing 3 bytes to filftell(f): expecting 6, got 6
fseek(f, 0, SEEK_SET): expecting 0, got 0
ftell(f): expecting 0, got 0
writing 3 bytes to file
ftell(f): expecting 9, got 9
writing 3 bytes to file
ftell(f): expecting 12, got 12
fseek(f, 0, SEEK_CUR): expecting 0, got 0
ftell(f): expecting 12, got 12
fseek(f, 0, SEEK_SET): expecting 0, got 0
fread(buf, 1, 256, f): expecting 12, got 12
expecting 'abcdefghijkl', got 'abcdefghijkl'
closing file
removing file


I consider this behavior to be a bug because Python 2.7 and C agree, but Python 3.7 behaves differently. Requiring that developers know and work around non-standard quirky behavior leads to extra work and error-prone code.
msg343744 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-05-28 07:29
I saw the bug in the C output after I hit submit. Here is the correct output:

opening file with wb
writing 3 bytes to file
ftell(f): expecting 3, got 3
closing file
opening file with a+b
ftell(f): expecting 3, got 3
writing 3 bytes to file
ftell(f): expecting 6, got 6
fseek(f, 0, SEEK_SET): expecting 0, got 0
ftell(f): expecting 0, got 0
writing 3 bytes to file
ftell(f): expecting 9, got 9
writing 3 bytes to file
ftell(f): expecting 12, got 12
fseek(f, 0, SEEK_CUR): expecting 0, got 0
ftell(f): expecting 12, got 12
fseek(f, 0, SEEK_SET): expecting 0, got 0
fread(buf, 1, 256, f): expecting 12, got 12
expecting 'abcdefghijkl', got 'abcdefghijkl'
closing file
removing file
msg343955 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-05-30 07:53
I did some tracing through bpo-36411.py using pdb and lldb.

The problem is that the file position variables stored in the buffered file object get out of sync with the kernel in the write operation after the seek(0) call. It thinks it is writing to the buffer as if it were relative to the last seek position. This is reflected in the incorrect values returned by tell (3 instead of 9, 6 instead of 12). The second seek operation flushes the buffer to disk and synchronizes the position variables with the kernel. The data is always written correctly to disk because the kernel maintains its own file position and handles the append operations internally.

Now that I have a handle on what is happening, I will dig into the buffered writes to find the root cause and hopefully a solution.

A workaround is to call f.seek(0, io.SEEK_CUR) before f.tell() to force the file object to synchronize its position variables with the kernel. There is a performance hit as f.seek() will flush dirty buffers to disk each time it is called, though it probably won't be noticed on a file system buffered by the kernel.

Here is what I found line by line (only relevant details are included):

1. The first f.open() creates the file and initializes the internal position variables to reflect the beginning of the file.
2. The first f.write() stores three bytes in the buffer and updates the internal position variables accordingly.
3. f.tell() correctly returns 3.
4. f.close() flushes the buffer to disk and closes the file.
5. When f.open() is called a second time with append mode, it calls lseek() to get the file position from the kernel and sets the internal position variables accordingly.
6. f.tell() correctly returns 3 to reflect the 3 bytes already in the file.
7. f.write() stores three bytes into the beginning of the buffer and updates the internal position variables accordingly.
8. f.tell() returns 6 as the correct position.
9. f.seek(0) flushes the dirty buffer to disk, calls lseek() so that the kernel updates the file position, and resets the internal position variables to reflect the beginning of the file.
10. f.write() stores three bytes into the beginning of the buffer, but the position variables are updated as if the write is relative to the beginning of the file. The position variables are now out of sync with the kernel.
11. f.tell() reflects this discrepancy by returning 3 instead of 9.
12. f.write() stores three bytes into the buffer starting at offset 3 and updates the position variables accordingly.
13. f.tell() returns 6 instead of 9.
14. f.seek(0, io.SEEK_CUR) flushes the 6 bytes in the dirty buffer to disk. The flush operation correctly appends the 6 bytes to the disk file regardless of the position variables because the kernel handles the append operation internally based upon its own file position variable. It then calls lseek() to determine the new position and updates the position variables accordingly. 
15. f.tell() returns the correct position.
16. The final f.seek(0) calls lseek() to move the file position back to the beginning of the file and updates the position variables, but doesn't flush anything to disk because the buffer is clean.
17. f.read() reads the entire file from the beginning.
18. f.close() closes the file. No flush operation is performed because the buffer is clean.
msg343969 - (view) Author: PEW's Corner (pewscorner) * Date: 2019-05-30 10:56
Great analysis, Jeffrey. Does the kernel position actually move to the end of the file on the f.write() in step 10, or on the flush in step 14? If it's the latter, then f.write() should probably call lseek() to set both the kernel position and internal position to the end of the file before the write. This would have the added benefit of making the Python append behavior independent of the kernel behavior (as some kernels apparently handle appending differently).
msg344017 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-05-31 04:14
I ran another C test where I called lseek(fd, 0, SEEK_CUR) in-between each of the stream I/O functions to see how it reports the buffered and kernel file positions:

opening file with wb
writing 3 bytes to file
lseek: 0
ftell(f): expecting 3, got 3
lseek: 3
closing file
opening file with a+b
lseek: 3
ftell(f): expecting 3, got 3
lseek: 3
writing 3 bytes to file
lseek: 3
ftell(f): expecting 6, got 6
lseek: 6
fseek(f, 0, SEEK_SET): expecting 0, got 0
lseek: 0
ftell(f): expecting 0, got 0
lseek: 0
writing 3 bytes to file
lseek: 0
ftell(f): expecting 9, got 9
lseek: 9
writing 3 bytes to file
lseek: 9
ftell(f): expecting 12, got 12
lseek: 12
fseek(f, 0, SEEK_CUR): expecting 0, got 0
lseek: 12
ftell(f): expecting 12, got 12
lseek: 12
fseek(f, 0, SEEK_SET): expecting 0, got 0
lseek: 0
fread(buf, 1, 256, f): expecting 12, got 12
lseek: 12
expecting 'abcdefghijkl', got 'abcdefghijkl'
closing file
removing file

The C library fseek() and ftell() functions update the kernel file position, but fwrite() doesn't (because it hasn't overflowed its write buffer). The kernel doesn't allow seeking past the end of the file, so it looks like fseek() and ftell() are flushing the C library's write buffer because the subsequent lseek() calls return the same value.

My observation of Python 3 and C library behavior with read/write append mode:

Bufferred write:
C - write to memory buffer, no kernel position updates unless the buffer is flushed to the kernel by buffer overflow
Python - same as C

Buffered seek:
C - write buffer is flushed to the kernel to update the kernel position, then a kernel seek operation is performed and the new location reported by the kernel is returned
Python - same as C

Buffered tell:
C - identical behavior as buffered seek; semantically a wrapper for fseek(f, 0, SEEK_CUR)
Python - the position is calculated using the last queried kernel position and the dirty buffer length, the kernel position is queried only if a kernel position hasn't been previously queried since the file was opened


Having the buffered I/O library keep track of the file position is attractive in that it saves time by skipping a few kernel calls, but, as demonstrated, it can easily get out of sync. I'm sure I can come up with a way to fix the accounting to work properly without querying the kernel, but I don't think that is a good solution.

I see dangers with trying to track the file position in the library independent of the kernel. For one, when the file is opened with the append flag, the kernel changes the position to the end of the file with every write. Yes, that can be tracked with careful code in the library, except when the file is opened two or more times and written to using different file descriptors. This is a common scenario for log files, with appends being performed by multiple processes. The only mitigation is to use file locking to guarantee exclusive file access, but the semantics vary from system to system and it is not universally supported.

My preferred solution is to follow the C library semantics and turn buffered tell() into a wrapper for buffered seek(0, io.SEEK_CUR). Please, let me know if there are some semantics I haven't considered.
msg344049 - (view) Author: PEW's Corner (pewscorner) * Date: 2019-05-31 09:42
Your proposal sounds good to me, Jeffrey.
msg344151 - (view) Author: Jeffrey Kintscher (Jeffrey.Kintscher) * Date: 2019-06-01 06:16
Further testing revealed the problem and fix also apply to text mode. I submitted a pull request to fix this issue with tests for both binary and text mode.
History
Date User Action Args
2022-04-11 14:59:12adminsetgithub: 80592
2019-06-01 11:04:40cheryl.sabellasetnosy: + benjamin.peterson
2019-06-01 06:16:00Jeffrey.Kintschersetmessages: + msg344151
2019-06-01 05:24:36Jeffrey.Kintschersetkeywords: + patch
stage: patch review
pull_requests: + pull_request13604
2019-05-31 09:42:48pewscornersetmessages: + msg344049
2019-05-31 04:14:36Jeffrey.Kintschersetmessages: + msg344017
2019-05-30 10:56:32pewscornersetmessages: + msg343969
2019-05-30 07:53:26Jeffrey.Kintschersetmessages: + msg343955
2019-05-28 07:29:30Jeffrey.Kintschersetfiles: + bpo-36411.c

messages: + msg343744
2019-05-28 07:29:04Jeffrey.Kintschersetfiles: - bpo-36411.c
2019-05-28 07:24:34Jeffrey.Kintschersetfiles: + bpo-36411.c
2019-05-28 07:24:11Jeffrey.Kintschersetfiles: + bpo-36411.py
nosy: + Jeffrey.Kintscher
messages: + msg343742

2019-05-24 06:06:41pewscornersetmessages: + msg343345
2019-05-24 03:12:07Windson Yangsetmessages: + msg343341
2019-05-24 03:10:20Windson Yangsetmessages: + msg343340
2019-03-30 04:56:11terry.reedysetnosy: + terry.reedy

messages: + msg339171
versions: - Python 3.9
2019-03-25 00:51:44Windson Yangsetnosy: + Windson Yang

messages: + msg338775
versions: + Python 3.8, Python 3.9, - Python 3.6
2019-03-24 07:56:25xtreaksetnosy: + martin.panter, serhiy.storchaka
2019-03-24 00:33:45pewscornercreate