classification
Title: SpooledTemporaryFile rollover corrupts data silently when Unicode characters are written
Type: behavior Stage: resolved
Components: Library (Lib) Versions: Python 3.9, Python 3.8, Python 3.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: James Hennessy, graham.coster, inada.naoki, martin.panter, miss-islington, r.david.murray, serhiy.storchaka, xtreak
Priority: critical Keywords: patch

Created on 2016-04-10 19:11 by James Hennessy, last changed 2019-11-28 08:27 by xtreak. This issue is now closed.

Files
File name Uploaded Description Edit
showbug.py James Hennessy, 2016-04-10 19:11 A program to demonstrate the bug
spooled_text_tempfile.patch serhiy.storchaka, 2016-04-10 20:40 review
Pull Requests
URL Status Linked Edit
PR 17400 merged inada.naoki, 2019-11-27 07:31
PR 17405 merged miss-islington, 2019-11-27 13:22
PR 17407 merged inada.naoki, 2019-11-27 13:29
Messages (24)
msg263147 - (view) Author: James Hennessy (James Hennessy) Date: 2016-04-10 19:11
The tempfile.SpooledTemporaryFile class doesn't correctly preserve data for text (non-binary) SpooledTemporaryFile objects when Unicode characters are written.  The attached program demonstrates the failure.  It creates a SpooledTemporaryFile object, writes 20 string characters to it, and then tries to read them back.  If the SpooledTemporaryFile has rolled over to disk, as it does in the demonstration program, then the data is not read back correctly.  Instead, an exception is recognized due to the data in the SpooledTemporaryFile being corrupted.

The problem is this statement in tempfile.py, in the rollover() method:
        newfile.seek(file.tell(), 0)
The "file" variable references a StringIO object, whose tell() and seek() methods count in characters, not bytes, yet this value is applied to a TemporaryFile object, whose tell() and seek() methods deal in bytes, not characters.  The demonstration program writes 10 characters to the SpooledTemporaryFile.  Since 10 exceeds the rollover size of 5, the implementation writes the 10 characters to the TemporaryFile and then seeks to position 10 in the TemporaryFile, which it thinks is the end of the stream.  But those 10 characters got encoded to 30 bytes, and seek position 10 is in the middle of the UTF-8 sequence for the fourth character.  The next write to the SpooledTemporaryFile starts overlaying bytes from there.  The attempt to read back the data fails because the byte stream no longer represents a valid UTF-8 stream of data.

The related problem is the inconsistency of the behavior of tell() and seek() for text (non-binary) SpooledTemporaryFile objects.  If the data hasn't yet rolled over to a TemporaryFile, they count in string characters.  If the data has rolled over, they count in bytes.

A quick fix for this is to remove the seek() in the rollover() method.  I presume it's there to preserve the stream position if an explicit call to rollover() is made, since for an implicit call, the position would be at the end of the stream already.  This quick fix, therefore, would introduce an external incompatibility in the behavior of rollover().

Another possibility is to never use a StringIO object, but to always buffer data in a BytesIO object, as is done for binary SpooledTemporaryFile objects.  This has the advantage of "fixing" the tell() and seek() inconsistency, making them count bytes all the time.  The downside, of course, is that data that doesn't end up being rolled over to a TemporaryFile gets encoded and decoded, a round trip that could otherwise be avoided.

This problem can be circumvented by a user of SpooledTemporaryFile by explicitly seeking to the end of the stream after every write to the SpooledTemporaryFile object:  spool.seek(0, io.SEEK_END)
msg263148 - (view) Author: R. David Murray (r.david.murray) * (Python committer) Date: 2016-04-10 19:25
I can't see why it even does the seek.  The existing tests pass without it.  Does your example?

Either way, the first step here is for someone to turn this into a unit test for the tempfile module (in test_tempfile).
msg263150 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-10 20:40
Here is a patch that fixes SpooledTemporaryFile by getting rid of StringIO.
msg263157 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-10 23:52
David: I guess the seek() is to support rollover() when you are positioned halfway through the file.

Serhiy’s patch seems to be about the best we can do, although it does break the documented promise that the “ ‘_file’ attribute is either an io.BytesIO or io.StringIO”.
msg263698 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-18 22:29
I noticed another possible problem with using TextIOWrapper. We call tell() to see if a rollover is needed. But I think TextIOWrapper streams can return complicated values from tell(), encoding stuff like codec state and buffered data in an integer. At the minimum, I think the meaning of max_size has changed (code points vs bytes). And perhaps it won’t work very well with some codecs that record state in the tell() value.
msg263707 - (view) Author: Martin Panter (martin.panter) * (Python committer) Date: 2016-04-19 03:54
According to Issue 12215, the TextIOWrapper.tell() position only includes decoder state, not encoder state. So we may be lucky that tell() is proportional to the file size. If so my concern about max_size isn’t so important. But it still feels like a bad hack.

Maybe another quick fix for this would be to do two writes of the StringIO data, and then seek back to the end of the first write:

pos = file.tell()  # StringIO position is in code points
file.seek(0)
newfile.write(file.read(pos))
pos = newfile.tell()  # TextIOWrapper position is opaque
newfile.write(file.read())
newfile.seek(pos, 0)

This still depends on StringIO.tell() corresponding to the number of code points, which is undocumented (we already disable newline translation in the StringIO). See also Issue 25190.
msg263708 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016-04-19 05:05
> This still depends on StringIO.tell() corresponding to the number of code points, which is undocumented (we already disable newline translation in the StringIO). See also Issue 25190.

And doesn't work with Python implementation of StringIO. More general solution needs the code like:

pos = file.tell()
tail = file.read()
file.seek(0)
head = file.read()
if tail:
    head = head[:-len(tail)]
newfile.write(head)
pos = newfile.tell()  # TextIOWrapper position is opaque
newfile.write(tail)
newfile.seek(pos, 0)

Yet one bad thing with using independent StringIO instead of TextIOWrapper, is that saved tell() position becomes not valid after rollover(). I'm going to write few tests for cases when file position points not at the end of the stream.
msg339029 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-03-28 11:11
Getting rid of StringIO seems better approach to me.
.tell(), .seek(), and .truncate() behaviors are very different between
StringIO and TextIOWrapper.
msg339590 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-04-08 05:44
I think this bug is critical, because it may break user's data silently.

If we can not fix this bug soon, how about adding note in document and raise warning?
msg357474 - (view) Author: Graham Coster (graham.coster) * Date: 2019-11-26 02:00
This may be a silly question, however, does SpooledTemporaryFile need to exist at all?

From some testing on macOS, SpooledTemporaryFile appeared to never have a performance advantage over OS file caching, but with max_size greater than 4GB, it was a significant disadvantage. So, if the purpose of SpooledTemporaryFile is to increase performance, it may not work.

I found that the macOS built-in file cache was increasing in size as I wrote bigger TemporaryFile files, up to some limit the OS had decided. So, it seems the OS is automatically doing the same job as SpooledTemporaryFile.  Once the OS decided to write to disk, there was no sudden hit to performance, it just slowed down.

However, when SpooledTemporaryFile rolled-over large max_size files, there was a temporary big hit to performance, which then became a consistent slow down the same as a TemporaryFile that had exceeded the OS file cache.

A big issue came with very large SpooledTemporaryFile  max_sizes hogging RAM and causing the OS to start swapping all processes.  This caused a huge performance hit to my program and the system as a whole. Once my program did finish, it took the system considerable time to reclaim swap.

I’m guessing SpooledTemporaryFile may have benefits on light weight embedded OSes that have no, or poor, file caching.  However, tuning the max_size to work with embedded systems’ limited RAM could be tricky for developers and would be hardware dependent. So, perhaps leaving file caching to the underlying operating systems is actually a better, and safer, option than offering it in Python?

If there are no benefits to SpooledTemporaryFile, should it be deprecated? If so, as it is phasesd out, could it be patched to be a TemporaryFile wrapper, with no rollover functionality?
msg357491 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-11-26 10:17
Creating files is slow on Windows too.

But I think we should fix the data corruption ASAP.

While Serhiy's patch looks good to me, there is a more quick and safe way to fix the data corruption.  Use TemporaryFile at first if it is text mode.
msg357505 - (view) Author: James Hennessy (James Hennessy) Date: 2019-11-26 16:14
I have to disagree with the idea that SpooledTemporaryFile is not useful.  Although on some systems, the file system may appear as fast as memory, that cannot be assumed to be universally true.  I think the idea behind SpooledTemporaryFile is completely valid.
msg357506 - (view) Author: James Hennessy (James Hennessy) Date: 2019-11-26 16:16
I don't like the idea of using a TemporaryFile right from the beginning in text mode.  You might as well remove text mode support altogether if that's the approach you want to take, since it undoes any potential performance benefit of using SpooledTemporaryFile in the first place.
msg357510 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-11-26 16:50
But poor performance is better than silent data corruption.
If we can not fix the rollover right now, stop the spooling is a considerable option for next bugfix release.
msg357521 - (view) Author: James Hennessy (James Hennessy) Date: 2019-11-26 18:28
The quickest fix for the data corruption problem is to delete the line
    newfile.seek(file.tell(), 0)
from the rollover() method.

This doesn't fix the inconsistency of tell() and seek(), but it's very low risk.  It's technically a change to the API, that rollover() no longer preserves the seek position, but unless the user was writing only characters from the ISO-8859-1 character set, it wasn't working properly before anyway.
msg357550 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-11-27 02:59
@Serhiy, would you create a pull request based on your patch?  Or may I?

@James it doesn't make sense at all.  It breaks the file even when only ASCII characters are used.

f = SpooledTemporaryFile(mode="w+")
f.write("foobar")
f.seek(3)
f.rollover()
f.write("baz")
msg357558 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-27 05:50
I cannot creàte a pr now. Please take this issue yourself Inada+san.
msg357570 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-11-27 13:22
New changeset ea9835c5d154ab6a54eed627958473b6768b28cc by Inada Naoki in branch 'master':
bpo-26730: Fix SpooledTemporaryFile data corruption (GH-17400)
https://github.com/python/cpython/commit/ea9835c5d154ab6a54eed627958473b6768b28cc
msg357595 - (view) Author: miss-islington (miss-islington) Date: 2019-11-28 05:23
New changeset d21b8e82dda3caf0989bb39bc0e0ce2bfef97081 by Miss Islington (bot) in branch '3.8':
bpo-26730: Fix SpooledTemporaryFile data corruption (GH-17400)
https://github.com/python/cpython/commit/d21b8e82dda3caf0989bb39bc0e0ce2bfef97081
msg357596 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-11-28 05:24
New changeset e65b3fa9f16537d20f5f37c25673ac899fcd7099 by Inada Naoki in branch '3.7':
bpo-26730: Fix SpooledTemporaryFile data corruption (GH-17400)
https://github.com/python/cpython/commit/e65b3fa9f16537d20f5f37c25673ac899fcd7099
msg357602 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2019-11-28 05:44
Thank you Inada-san.
msg357603 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2019-11-28 05:44
Do we need to explicitly document the return value change of _file which is documented with a separate versionchanged directive for 3.7 and 3.8? Code like below could fail in them since TextIOWrapper doesn't have getvalue attribute as the minor version upgrade is done.

import tempfile

with tempfile.SpooledTemporaryFile(mode='wt') as f:
    f.write('abc')
    assert f._file.getvalue() == 'abc'
msg357604 - (view) Author: Inada Naoki (inada.naoki) * (Python committer) Date: 2019-11-28 06:04
> Do we need to explicitly document the return value change of _file which is documented with a separate versionchanged directive for 3.7 and 3.8?

I feel it is too detailed.

Note that the _file attribute may be TextIOWrapper after rollover() anyway and the rollover() may be called implicitly.
The code depending on the specific type of the _file is fragile at first.
msg357613 - (view) Author: Karthikeyan Singaravelan (xtreak) * (Python triager) Date: 2019-11-28 08:27
> Note that the _file attribute may be TextIOWrapper after rollover() anyway and the rollover() may be called implicitly.
The code depending on the specific type of the _file is fragile at first.

Okay, thanks for the detail.
History
Date User Action Args
2019-11-28 08:27:28xtreaksetmessages: + msg357613
2019-11-28 06:04:24inada.naokisetmessages: + msg357604
2019-11-28 05:44:16xtreaksetnosy: + xtreak
messages: + msg357603
2019-11-28 05:44:06serhiy.storchakasetmessages: + msg357602
2019-11-28 05:38:48inada.naokisetstatus: open -> closed
resolution: fixed
stage: patch review -> resolved
2019-11-28 05:24:02inada.naokisetmessages: + msg357596
2019-11-28 05:23:17miss-islingtonsetnosy: + miss-islington
messages: + msg357595
2019-11-27 13:29:41inada.naokisetpull_requests: + pull_request16887
2019-11-27 13:22:20miss-islingtonsetpull_requests: + pull_request16885
2019-11-27 13:22:10inada.naokisetmessages: + msg357570
2019-11-27 07:31:25inada.naokisetpull_requests: + pull_request16880
2019-11-27 05:50:49serhiy.storchakasetassignee: serhiy.storchaka ->
messages: + msg357558
2019-11-27 02:59:03inada.naokisetmessages: + msg357550
title: SpooledTemporaryFile doesn't correctly preserve data for text (non-binary) SpooledTemporaryFile objects when Unicode characters are written -> SpooledTemporaryFile rollover corrupts data silently when Unicode characters are written
2019-11-26 18:28:11James Hennessysetmessages: + msg357521
2019-11-26 16:50:01inada.naokisetmessages: + msg357510
2019-11-26 16:16:53James Hennessysetmessages: + msg357506
2019-11-26 16:14:49James Hennessysetmessages: + msg357505
2019-11-26 10:17:05inada.naokisetmessages: + msg357491
versions: + Python 3.9
2019-11-26 10:07:17inada.naokisetpriority: normal -> critical
2019-11-26 02:00:42graham.costersetnosy: + graham.coster
messages: + msg357474
2019-04-08 06:15:37serhiy.storchakasetassignee: serhiy.storchaka
versions: + Python 3.7, Python 3.8, - Python 3.5, Python 3.6
2019-04-08 05:44:43inada.naokisetmessages: + msg339590
2019-03-28 11:11:44inada.naokisetnosy: + inada.naoki
messages: + msg339029
2016-04-19 05:05:44serhiy.storchakasetmessages: + msg263708
2016-04-19 03:54:12martin.pantersetmessages: + msg263707
2016-04-18 22:29:20martin.pantersetmessages: + msg263698
2016-04-10 23:52:25martin.pantersetnosy: + martin.panter
messages: + msg263157
2016-04-10 20:40:07serhiy.storchakasetfiles: + spooled_text_tempfile.patch

nosy: + serhiy.storchaka
messages: + msg263150

keywords: + patch
stage: needs patch -> patch review
2016-04-10 19:25:33r.david.murraysetversions: + Python 3.5, Python 3.6, - Python 3.4
nosy: + r.david.murray

messages: + msg263148

stage: needs patch
2016-04-10 19:11:43James Hennessycreate