Author James Hennessy
Recipients James Hennessy
Date 2016-04-10.19:11:43
SpamBayes Score -1.0
Marked as misclassified Yes
Message-id <1460315503.46.0.361192457442.issue26730@psf.upfronthosting.co.za>
In-reply-to
Content
The tempfile.SpooledTemporaryFile class doesn't correctly preserve data for text (non-binary) SpooledTemporaryFile objects when Unicode characters are written.  The attached program demonstrates the failure.  It creates a SpooledTemporaryFile object, writes 20 string characters to it, and then tries to read them back.  If the SpooledTemporaryFile has rolled over to disk, as it does in the demonstration program, then the data is not read back correctly.  Instead, an exception is recognized due to the data in the SpooledTemporaryFile being corrupted.

The problem is this statement in tempfile.py, in the rollover() method:
        newfile.seek(file.tell(), 0)
The "file" variable references a StringIO object, whose tell() and seek() methods count in characters, not bytes, yet this value is applied to a TemporaryFile object, whose tell() and seek() methods deal in bytes, not characters.  The demonstration program writes 10 characters to the SpooledTemporaryFile.  Since 10 exceeds the rollover size of 5, the implementation writes the 10 characters to the TemporaryFile and then seeks to position 10 in the TemporaryFile, which it thinks is the end of the stream.  But those 10 characters got encoded to 30 bytes, and seek position 10 is in the middle of the UTF-8 sequence for the fourth character.  The next write to the SpooledTemporaryFile starts overlaying bytes from there.  The attempt to read back the data fails because the byte stream no longer represents a valid UTF-8 stream of data.

The related problem is the inconsistency of the behavior of tell() and seek() for text (non-binary) SpooledTemporaryFile objects.  If the data hasn't yet rolled over to a TemporaryFile, they count in string characters.  If the data has rolled over, they count in bytes.

A quick fix for this is to remove the seek() in the rollover() method.  I presume it's there to preserve the stream position if an explicit call to rollover() is made, since for an implicit call, the position would be at the end of the stream already.  This quick fix, therefore, would introduce an external incompatibility in the behavior of rollover().

Another possibility is to never use a StringIO object, but to always buffer data in a BytesIO object, as is done for binary SpooledTemporaryFile objects.  This has the advantage of "fixing" the tell() and seek() inconsistency, making them count bytes all the time.  The downside, of course, is that data that doesn't end up being rolled over to a TemporaryFile gets encoded and decoded, a round trip that could otherwise be avoided.

This problem can be circumvented by a user of SpooledTemporaryFile by explicitly seeking to the end of the stream after every write to the SpooledTemporaryFile object:  spool.seek(0, io.SEEK_END)
History
Date User Action Args
2016-04-10 19:11:43James Hennessysetrecipients: + James Hennessy
2016-04-10 19:11:43James Hennessysetmessageid: <1460315503.46.0.361192457442.issue26730@psf.upfronthosting.co.za>
2016-04-10 19:11:43James Hennessylinkissue26730 messages
2016-04-10 19:11:43James Hennessycreate