Issue36411
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2019-03-24 00:33 by pewscorner, last changed 2022-04-11 14:59 by admin.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
bpo-36411.py | Jeffrey.Kintscher, 2019-05-28 07:24 | Python test file | ||
bpo-36411.c | Jeffrey.Kintscher, 2019-05-28 07:29 | C test file |
Pull Requests | |||
---|---|---|---|
URL | Status | Linked | Edit |
PR 13717 | open | Jeffrey.Kintscher, 2019-06-01 05:24 |
Messages (13) | |||
---|---|---|---|
msg338709 - (view) | Author: PEW's Corner (pewscorner) * | Date: 2019-03-24 00:33 | |
When a file is opened in binary append+read mode, i.e. open('file', 'a+b'), and a write (i.e. append) operation is performed while the file pointer is not at the end of the file (e.g. after a seek(0)), tell() will subsequently return the wrong value in Python 3 (but not in Python 2). See this SO question: https://stackoverflow.com/questions/51010322/python-3-tell-gets-out-of-sync-with-file-pointer-in-appendread-mode |
|||
msg338775 - (view) | Author: Windson Yang (Windson Yang) * | Date: 2019-03-25 00:51 | |
This looks interesting. Let me try to fix it. |
|||
msg339171 - (view) | Author: Terry J. Reedy (terry.reedy) * | Date: 2019-03-30 04:56 | |
'Python 3.9' does not exist yet. This choice is for patches that should only be applied to 3.9 when available (like future deprecations or removals). |
|||
msg343340 - (view) | Author: Windson Yang (Windson Yang) * | Date: 2019-05-24 03:10 | |
I'm not sure it's a bug. When you write binary data to file (use BufferedIOBase by default). It actually writes the data to a buffer. That is why tell() gets out of sync. You can follow the instrument belolw. For instance, call flush() after writing to get the `correct answer.` > When writing to this object, data is normally placed into an internal buffer. The buffer will be written out to the underlying RawIOBase object under various conditions, including: > when the buffer gets too small for all pending data; > when flush() is called; > when a seek() is requested (for BufferedRandom objects); > when the BufferedWriter object is closed or destroyed. 1. https://docs.python.org/3/library/io.html#io.BufferedWriter |
|||
msg343341 - (view) | Author: Windson Yang (Windson Yang) * | Date: 2019-05-24 03:12 | |
I think we should mention it at the document, like in the tell() function. |
|||
msg343345 - (view) | Author: PEW's Corner (pewscorner) * | Date: 2019-05-24 06:06 | |
But buffers are used - and update tell() correctly - in all other cases than binary append+read mode (and even this case works in Python 2). Also, the implementation clearly tries to keep tell() updated irrespective of the buffer, so isn't it just a matter of reminding it to move the pointer to the end of the file when doing an append? If that isn't feasible, then of course it would be good to document the problem in the description of tell(). |
|||
msg343742 - (view) | Author: Jeffrey Kintscher (Jeffrey.Kintscher) * | Date: 2019-05-28 07:24 | |
I adapted the test code on StackOverflow to show the expected and actual values. I get this output using Python 2.7.16: opening myfile with wb writing 3 bytes to file ('f.tell(): expecting 3, got', 3) closing myfile opening myfile with a+b ('f.tell(): expecting 3, got', 3) writing 3 bytes to file ('f.tell(): expecting 6, got', 6) ('f.seek(0): expecting 0, got', None) ('f.tell(): expecint 0, got', 0) writing 3 bytes to file ('f.tell(): expecting 9, got', 9) writing 3 bytes to file ('f.tell(): expecting 12, got', 12) ('f.seek(0, io.SEEK_CUR): expecting 12, got', None) ('f.tell(): expecting 12, got', 12) ('f.seek(0): expecting 0, got', None) ("f.read(): expecting b'abcdefghijkl', got", 'abcdefghijkl') closing file removing file As described, I get different results using Python 3.7.3 as well as the master branch (Python 3.8.0a4+): closing myfile opening myfile with a+b f.tell(): expecting 3, got 3 writing 3 bytes to file f.tell(): expecting 6, got 6 f.seek(0): expecting 0, got 0 f.tell(): expecint 0, got 0 writing 3 bytes to file f.tell(): expecting 9, got 3 writing 3 bytes to file f.tell(): expecting 12, got 6 f.seek(0, io.SEEK_CUR): expecting 12, got 12 f.tell(): expecting 12, got 12 f.seek(0): expecting 0, got 0 f.read(): expecting b'abcdefghijkl', got b'abcdefghijkl' closing file removing file I wrote an equivalent C program using the stream I/O functions and get the same results as Python 2.7.16: opening file with wb writing 3 bytes to file ftell(f): expecting 3, got 3 closing file opening file with a+b ftell(f): expecting 3, got 3 writing 3 bytes to filftell(f): expecting 6, got 6 fseek(f, 0, SEEK_SET): expecting 0, got 0 ftell(f): expecting 0, got 0 writing 3 bytes to file ftell(f): expecting 9, got 9 writing 3 bytes to file ftell(f): expecting 12, got 12 fseek(f, 0, SEEK_CUR): expecting 0, got 0 ftell(f): expecting 12, got 12 fseek(f, 0, SEEK_SET): expecting 0, got 0 fread(buf, 1, 256, f): expecting 12, got 12 expecting 'abcdefghijkl', got 'abcdefghijkl' closing file removing file I consider this behavior to be a bug because Python 2.7 and C agree, but Python 3.7 behaves differently. Requiring that developers know and work around non-standard quirky behavior leads to extra work and error-prone code. |
|||
msg343744 - (view) | Author: Jeffrey Kintscher (Jeffrey.Kintscher) * | Date: 2019-05-28 07:29 | |
I saw the bug in the C output after I hit submit. Here is the correct output: opening file with wb writing 3 bytes to file ftell(f): expecting 3, got 3 closing file opening file with a+b ftell(f): expecting 3, got 3 writing 3 bytes to file ftell(f): expecting 6, got 6 fseek(f, 0, SEEK_SET): expecting 0, got 0 ftell(f): expecting 0, got 0 writing 3 bytes to file ftell(f): expecting 9, got 9 writing 3 bytes to file ftell(f): expecting 12, got 12 fseek(f, 0, SEEK_CUR): expecting 0, got 0 ftell(f): expecting 12, got 12 fseek(f, 0, SEEK_SET): expecting 0, got 0 fread(buf, 1, 256, f): expecting 12, got 12 expecting 'abcdefghijkl', got 'abcdefghijkl' closing file removing file |
|||
msg343955 - (view) | Author: Jeffrey Kintscher (Jeffrey.Kintscher) * | Date: 2019-05-30 07:53 | |
I did some tracing through bpo-36411.py using pdb and lldb. The problem is that the file position variables stored in the buffered file object get out of sync with the kernel in the write operation after the seek(0) call. It thinks it is writing to the buffer as if it were relative to the last seek position. This is reflected in the incorrect values returned by tell (3 instead of 9, 6 instead of 12). The second seek operation flushes the buffer to disk and synchronizes the position variables with the kernel. The data is always written correctly to disk because the kernel maintains its own file position and handles the append operations internally. Now that I have a handle on what is happening, I will dig into the buffered writes to find the root cause and hopefully a solution. A workaround is to call f.seek(0, io.SEEK_CUR) before f.tell() to force the file object to synchronize its position variables with the kernel. There is a performance hit as f.seek() will flush dirty buffers to disk each time it is called, though it probably won't be noticed on a file system buffered by the kernel. Here is what I found line by line (only relevant details are included): 1. The first f.open() creates the file and initializes the internal position variables to reflect the beginning of the file. 2. The first f.write() stores three bytes in the buffer and updates the internal position variables accordingly. 3. f.tell() correctly returns 3. 4. f.close() flushes the buffer to disk and closes the file. 5. When f.open() is called a second time with append mode, it calls lseek() to get the file position from the kernel and sets the internal position variables accordingly. 6. f.tell() correctly returns 3 to reflect the 3 bytes already in the file. 7. f.write() stores three bytes into the beginning of the buffer and updates the internal position variables accordingly. 8. f.tell() returns 6 as the correct position. 9. f.seek(0) flushes the dirty buffer to disk, calls lseek() so that the kernel updates the file position, and resets the internal position variables to reflect the beginning of the file. 10. f.write() stores three bytes into the beginning of the buffer, but the position variables are updated as if the write is relative to the beginning of the file. The position variables are now out of sync with the kernel. 11. f.tell() reflects this discrepancy by returning 3 instead of 9. 12. f.write() stores three bytes into the buffer starting at offset 3 and updates the position variables accordingly. 13. f.tell() returns 6 instead of 9. 14. f.seek(0, io.SEEK_CUR) flushes the 6 bytes in the dirty buffer to disk. The flush operation correctly appends the 6 bytes to the disk file regardless of the position variables because the kernel handles the append operation internally based upon its own file position variable. It then calls lseek() to determine the new position and updates the position variables accordingly. 15. f.tell() returns the correct position. 16. The final f.seek(0) calls lseek() to move the file position back to the beginning of the file and updates the position variables, but doesn't flush anything to disk because the buffer is clean. 17. f.read() reads the entire file from the beginning. 18. f.close() closes the file. No flush operation is performed because the buffer is clean. |
|||
msg343969 - (view) | Author: PEW's Corner (pewscorner) * | Date: 2019-05-30 10:56 | |
Great analysis, Jeffrey. Does the kernel position actually move to the end of the file on the f.write() in step 10, or on the flush in step 14? If it's the latter, then f.write() should probably call lseek() to set both the kernel position and internal position to the end of the file before the write. This would have the added benefit of making the Python append behavior independent of the kernel behavior (as some kernels apparently handle appending differently). |
|||
msg344017 - (view) | Author: Jeffrey Kintscher (Jeffrey.Kintscher) * | Date: 2019-05-31 04:14 | |
I ran another C test where I called lseek(fd, 0, SEEK_CUR) in-between each of the stream I/O functions to see how it reports the buffered and kernel file positions: opening file with wb writing 3 bytes to file lseek: 0 ftell(f): expecting 3, got 3 lseek: 3 closing file opening file with a+b lseek: 3 ftell(f): expecting 3, got 3 lseek: 3 writing 3 bytes to file lseek: 3 ftell(f): expecting 6, got 6 lseek: 6 fseek(f, 0, SEEK_SET): expecting 0, got 0 lseek: 0 ftell(f): expecting 0, got 0 lseek: 0 writing 3 bytes to file lseek: 0 ftell(f): expecting 9, got 9 lseek: 9 writing 3 bytes to file lseek: 9 ftell(f): expecting 12, got 12 lseek: 12 fseek(f, 0, SEEK_CUR): expecting 0, got 0 lseek: 12 ftell(f): expecting 12, got 12 lseek: 12 fseek(f, 0, SEEK_SET): expecting 0, got 0 lseek: 0 fread(buf, 1, 256, f): expecting 12, got 12 lseek: 12 expecting 'abcdefghijkl', got 'abcdefghijkl' closing file removing file The C library fseek() and ftell() functions update the kernel file position, but fwrite() doesn't (because it hasn't overflowed its write buffer). The kernel doesn't allow seeking past the end of the file, so it looks like fseek() and ftell() are flushing the C library's write buffer because the subsequent lseek() calls return the same value. My observation of Python 3 and C library behavior with read/write append mode: Bufferred write: C - write to memory buffer, no kernel position updates unless the buffer is flushed to the kernel by buffer overflow Python - same as C Buffered seek: C - write buffer is flushed to the kernel to update the kernel position, then a kernel seek operation is performed and the new location reported by the kernel is returned Python - same as C Buffered tell: C - identical behavior as buffered seek; semantically a wrapper for fseek(f, 0, SEEK_CUR) Python - the position is calculated using the last queried kernel position and the dirty buffer length, the kernel position is queried only if a kernel position hasn't been previously queried since the file was opened Having the buffered I/O library keep track of the file position is attractive in that it saves time by skipping a few kernel calls, but, as demonstrated, it can easily get out of sync. I'm sure I can come up with a way to fix the accounting to work properly without querying the kernel, but I don't think that is a good solution. I see dangers with trying to track the file position in the library independent of the kernel. For one, when the file is opened with the append flag, the kernel changes the position to the end of the file with every write. Yes, that can be tracked with careful code in the library, except when the file is opened two or more times and written to using different file descriptors. This is a common scenario for log files, with appends being performed by multiple processes. The only mitigation is to use file locking to guarantee exclusive file access, but the semantics vary from system to system and it is not universally supported. My preferred solution is to follow the C library semantics and turn buffered tell() into a wrapper for buffered seek(0, io.SEEK_CUR). Please, let me know if there are some semantics I haven't considered. |
|||
msg344049 - (view) | Author: PEW's Corner (pewscorner) * | Date: 2019-05-31 09:42 | |
Your proposal sounds good to me, Jeffrey. |
|||
msg344151 - (view) | Author: Jeffrey Kintscher (Jeffrey.Kintscher) * | Date: 2019-06-01 06:16 | |
Further testing revealed the problem and fix also apply to text mode. I submitted a pull request to fix this issue with tests for both binary and text mode. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:59:12 | admin | set | github: 80592 |
2019-06-01 11:04:40 | cheryl.sabella | set | nosy:
+ benjamin.peterson |
2019-06-01 06:16:00 | Jeffrey.Kintscher | set | messages: + msg344151 |
2019-06-01 05:24:36 | Jeffrey.Kintscher | set | keywords:
+ patch stage: patch review pull_requests: + pull_request13604 |
2019-05-31 09:42:48 | pewscorner | set | messages: + msg344049 |
2019-05-31 04:14:36 | Jeffrey.Kintscher | set | messages: + msg344017 |
2019-05-30 10:56:32 | pewscorner | set | messages: + msg343969 |
2019-05-30 07:53:26 | Jeffrey.Kintscher | set | messages: + msg343955 |
2019-05-28 07:29:30 | Jeffrey.Kintscher | set | files:
+ bpo-36411.c messages: + msg343744 |
2019-05-28 07:29:04 | Jeffrey.Kintscher | set | files: - bpo-36411.c |
2019-05-28 07:24:34 | Jeffrey.Kintscher | set | files: + bpo-36411.c |
2019-05-28 07:24:11 | Jeffrey.Kintscher | set | files:
+ bpo-36411.py nosy: + Jeffrey.Kintscher messages: + msg343742 |
2019-05-24 06:06:41 | pewscorner | set | messages: + msg343345 |
2019-05-24 03:12:07 | Windson Yang | set | messages: + msg343341 |
2019-05-24 03:10:20 | Windson Yang | set | messages: + msg343340 |
2019-03-30 04:56:11 | terry.reedy | set | nosy:
+ terry.reedy messages: + msg339171 versions: - Python 3.9 |
2019-03-25 00:51:44 | Windson Yang | set | nosy:
+ Windson Yang messages: + msg338775 versions: + Python 3.8, Python 3.9, - Python 3.6 |
2019-03-24 07:56:25 | xtreak | set | nosy:
+ martin.panter, serhiy.storchaka |
2019-03-24 00:33:45 | pewscorner | create |