New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SpooledTemporaryFile rollover corrupts data silently when Unicode characters are written #70917
Comments
The tempfile.SpooledTemporaryFile class doesn't correctly preserve data for text (non-binary) SpooledTemporaryFile objects when Unicode characters are written. The attached program demonstrates the failure. It creates a SpooledTemporaryFile object, writes 20 string characters to it, and then tries to read them back. If the SpooledTemporaryFile has rolled over to disk, as it does in the demonstration program, then the data is not read back correctly. Instead, an exception is recognized due to the data in the SpooledTemporaryFile being corrupted. The problem is this statement in tempfile.py, in the rollover() method: The related problem is the inconsistency of the behavior of tell() and seek() for text (non-binary) SpooledTemporaryFile objects. If the data hasn't yet rolled over to a TemporaryFile, they count in string characters. If the data has rolled over, they count in bytes. A quick fix for this is to remove the seek() in the rollover() method. I presume it's there to preserve the stream position if an explicit call to rollover() is made, since for an implicit call, the position would be at the end of the stream already. This quick fix, therefore, would introduce an external incompatibility in the behavior of rollover(). Another possibility is to never use a StringIO object, but to always buffer data in a BytesIO object, as is done for binary SpooledTemporaryFile objects. This has the advantage of "fixing" the tell() and seek() inconsistency, making them count bytes all the time. The downside, of course, is that data that doesn't end up being rolled over to a TemporaryFile gets encoded and decoded, a round trip that could otherwise be avoided. This problem can be circumvented by a user of SpooledTemporaryFile by explicitly seeking to the end of the stream after every write to the SpooledTemporaryFile object: spool.seek(0, io.SEEK_END) |
I can't see why it even does the seek. The existing tests pass without it. Does your example? Either way, the first step here is for someone to turn this into a unit test for the tempfile module (in test_tempfile). |
Here is a patch that fixes SpooledTemporaryFile by getting rid of StringIO. |
David: I guess the seek() is to support rollover() when you are positioned halfway through the file. Serhiy’s patch seems to be about the best we can do, although it does break the documented promise that the “ ‘_file’ attribute is either an io.BytesIO or io.StringIO”. |
I noticed another possible problem with using TextIOWrapper. We call tell() to see if a rollover is needed. But I think TextIOWrapper streams can return complicated values from tell(), encoding stuff like codec state and buffered data in an integer. At the minimum, I think the meaning of max_size has changed (code points vs bytes). And perhaps it won’t work very well with some codecs that record state in the tell() value. |
According to bpo-12215, the TextIOWrapper.tell() position only includes decoder state, not encoder state. So we may be lucky that tell() is proportional to the file size. If so my concern about max_size isn’t so important. But it still feels like a bad hack. Maybe another quick fix for this would be to do two writes of the StringIO data, and then seek back to the end of the first write: pos = file.tell() # StringIO position is in code points
file.seek(0)
newfile.write(file.read(pos))
pos = newfile.tell() # TextIOWrapper position is opaque
newfile.write(file.read())
newfile.seek(pos, 0) This still depends on StringIO.tell() corresponding to the number of code points, which is undocumented (we already disable newline translation in the StringIO). See also bpo-25190. |
And doesn't work with Python implementation of StringIO. More general solution needs the code like: pos = file.tell()
tail = file.read()
file.seek(0)
head = file.read()
if tail:
head = head[:-len(tail)]
newfile.write(head)
pos = newfile.tell() # TextIOWrapper position is opaque
newfile.write(tail)
newfile.seek(pos, 0) Yet one bad thing with using independent StringIO instead of TextIOWrapper, is that saved tell() position becomes not valid after rollover(). I'm going to write few tests for cases when file position points not at the end of the stream. |
Getting rid of StringIO seems better approach to me. |
I think this bug is critical, because it may break user's data silently. If we can not fix this bug soon, how about adding note in document and raise warning? |
This may be a silly question, however, does SpooledTemporaryFile need to exist at all? From some testing on macOS, SpooledTemporaryFile appeared to never have a performance advantage over OS file caching, but with max_size greater than 4GB, it was a significant disadvantage. So, if the purpose of SpooledTemporaryFile is to increase performance, it may not work. I found that the macOS built-in file cache was increasing in size as I wrote bigger TemporaryFile files, up to some limit the OS had decided. So, it seems the OS is automatically doing the same job as SpooledTemporaryFile. Once the OS decided to write to disk, there was no sudden hit to performance, it just slowed down. However, when SpooledTemporaryFile rolled-over large max_size files, there was a temporary big hit to performance, which then became a consistent slow down the same as a TemporaryFile that had exceeded the OS file cache. A big issue came with very large SpooledTemporaryFile max_sizes hogging RAM and causing the OS to start swapping all processes. This caused a huge performance hit to my program and the system as a whole. Once my program did finish, it took the system considerable time to reclaim swap. I’m guessing SpooledTemporaryFile may have benefits on light weight embedded OSes that have no, or poor, file caching. However, tuning the max_size to work with embedded systems’ limited RAM could be tricky for developers and would be hardware dependent. So, perhaps leaving file caching to the underlying operating systems is actually a better, and safer, option than offering it in Python? If there are no benefits to SpooledTemporaryFile, should it be deprecated? If so, as it is phasesd out, could it be patched to be a TemporaryFile wrapper, with no rollover functionality? |
Creating files is slow on Windows too. But I think we should fix the data corruption ASAP. While Serhiy's patch looks good to me, there is a more quick and safe way to fix the data corruption. Use TemporaryFile at first if it is text mode. |
I have to disagree with the idea that SpooledTemporaryFile is not useful. Although on some systems, the file system may appear as fast as memory, that cannot be assumed to be universally true. I think the idea behind SpooledTemporaryFile is completely valid. |
I don't like the idea of using a TemporaryFile right from the beginning in text mode. You might as well remove text mode support altogether if that's the approach you want to take, since it undoes any potential performance benefit of using SpooledTemporaryFile in the first place. |
But poor performance is better than silent data corruption. |
The quickest fix for the data corruption problem is to delete the line This doesn't fix the inconsistency of tell() and seek(), but it's very low risk. It's technically a change to the API, that rollover() no longer preserves the seek position, but unless the user was writing only characters from the ISO-8859-1 character set, it wasn't working properly before anyway. |
I cannot creàte a pr now. Please take this issue yourself Inada+san. |
Thank you Inada-san. |
Do we need to explicitly document the return value change of _file which is documented with a separate versionchanged directive for 3.7 and 3.8? Code like below could fail in them since TextIOWrapper doesn't have getvalue attribute as the minor version upgrade is done. import tempfile
with tempfile.SpooledTemporaryFile(mode='wt') as f:
f.write('abc')
assert f._file.getvalue() == 'abc' |
I feel it is too detailed. Note that the _file attribute may be TextIOWrapper after rollover() anyway and the rollover() may be called implicitly. |
Okay, thanks for the detail. |
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields:
bugs.python.org fields:
The text was updated successfully, but these errors were encountered: