Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SpooledTemporaryFile rollover corrupts data silently when Unicode characters are written #70917

Closed
JamesHennessy mannequin opened this issue Apr 10, 2016 · 24 comments
Closed
Labels
3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error

Comments

@JamesHennessy
Copy link
Mannequin

JamesHennessy mannequin commented Apr 10, 2016

BPO 26730
Nosy @bitdancer, @methane, @vadmium, @serhiy-storchaka, @miss-islington, @tirkarthi, @grumBit
PRs
  • bpo-26730: Fix SpooledTemporaryFile data corruption #17400
  • [3.8] bpo-26730: Fix SpooledTemporaryFile data corruption (GH-17400) #17405
  • [3.7] bpo-26730: Fix SpooledTemporaryFile data corruption (GH-17400) #17407
  • Files
  • showbug.py: A program to demonstrate the bug
  • spooled_text_tempfile.patch
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields:

    assignee = None
    closed_at = <Date 2019-11-28.05:38:48.497>
    created_at = <Date 2016-04-10.19:11:43.436>
    labels = ['3.7', '3.8', 'type-bug', 'library', '3.9']
    title = 'SpooledTemporaryFile rollover corrupts data silently when Unicode characters are written'
    updated_at = <Date 2019-11-28.08:27:28.265>
    user = 'https://bugs.python.org/JamesHennessy'

    bugs.python.org fields:

    activity = <Date 2019-11-28.08:27:28.265>
    actor = 'xtreak'
    assignee = 'none'
    closed = True
    closed_date = <Date 2019-11-28.05:38:48.497>
    closer = 'methane'
    components = ['Library (Lib)']
    creation = <Date 2016-04-10.19:11:43.436>
    creator = 'James Hennessy'
    dependencies = []
    files = ['42423', '42424']
    hgrepos = []
    issue_num = 26730
    keywords = ['patch']
    message_count = 24.0
    messages = ['263147', '263148', '263150', '263157', '263698', '263707', '263708', '339029', '339590', '357474', '357491', '357505', '357506', '357510', '357521', '357550', '357558', '357570', '357595', '357596', '357602', '357603', '357604', '357613']
    nosy_count = 8.0
    nosy_names = ['r.david.murray', 'methane', 'martin.panter', 'serhiy.storchaka', 'James Hennessy', 'miss-islington', 'xtreak', 'graham.coster']
    pr_nums = ['17400', '17405', '17407']
    priority = 'critical'
    resolution = 'fixed'
    stage = 'resolved'
    status = 'closed'
    superseder = None
    type = 'behavior'
    url = 'https://bugs.python.org/issue26730'
    versions = ['Python 3.7', 'Python 3.8', 'Python 3.9']

    @JamesHennessy
    Copy link
    Mannequin Author

    JamesHennessy mannequin commented Apr 10, 2016

    The tempfile.SpooledTemporaryFile class doesn't correctly preserve data for text (non-binary) SpooledTemporaryFile objects when Unicode characters are written. The attached program demonstrates the failure. It creates a SpooledTemporaryFile object, writes 20 string characters to it, and then tries to read them back. If the SpooledTemporaryFile has rolled over to disk, as it does in the demonstration program, then the data is not read back correctly. Instead, an exception is recognized due to the data in the SpooledTemporaryFile being corrupted.

    The problem is this statement in tempfile.py, in the rollover() method:
    newfile.seek(file.tell(), 0)
    The "file" variable references a StringIO object, whose tell() and seek() methods count in characters, not bytes, yet this value is applied to a TemporaryFile object, whose tell() and seek() methods deal in bytes, not characters. The demonstration program writes 10 characters to the SpooledTemporaryFile. Since 10 exceeds the rollover size of 5, the implementation writes the 10 characters to the TemporaryFile and then seeks to position 10 in the TemporaryFile, which it thinks is the end of the stream. But those 10 characters got encoded to 30 bytes, and seek position 10 is in the middle of the UTF-8 sequence for the fourth character. The next write to the SpooledTemporaryFile starts overlaying bytes from there. The attempt to read back the data fails because the byte stream no longer represents a valid UTF-8 stream of data.

    The related problem is the inconsistency of the behavior of tell() and seek() for text (non-binary) SpooledTemporaryFile objects. If the data hasn't yet rolled over to a TemporaryFile, they count in string characters. If the data has rolled over, they count in bytes.

    A quick fix for this is to remove the seek() in the rollover() method. I presume it's there to preserve the stream position if an explicit call to rollover() is made, since for an implicit call, the position would be at the end of the stream already. This quick fix, therefore, would introduce an external incompatibility in the behavior of rollover().

    Another possibility is to never use a StringIO object, but to always buffer data in a BytesIO object, as is done for binary SpooledTemporaryFile objects. This has the advantage of "fixing" the tell() and seek() inconsistency, making them count bytes all the time. The downside, of course, is that data that doesn't end up being rolled over to a TemporaryFile gets encoded and decoded, a round trip that could otherwise be avoided.

    This problem can be circumvented by a user of SpooledTemporaryFile by explicitly seeking to the end of the stream after every write to the SpooledTemporaryFile object: spool.seek(0, io.SEEK_END)

    @JamesHennessy JamesHennessy mannequin added stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error labels Apr 10, 2016
    @bitdancer
    Copy link
    Member

    I can't see why it even does the seek. The existing tests pass without it. Does your example?

    Either way, the first step here is for someone to turn this into a unit test for the tempfile module (in test_tempfile).

    @serhiy-storchaka
    Copy link
    Member

    Here is a patch that fixes SpooledTemporaryFile by getting rid of StringIO.

    @vadmium
    Copy link
    Member

    vadmium commented Apr 10, 2016

    David: I guess the seek() is to support rollover() when you are positioned halfway through the file.

    Serhiy’s patch seems to be about the best we can do, although it does break the documented promise that the “ ‘_file’ attribute is either an io.BytesIO or io.StringIO”.

    @vadmium
    Copy link
    Member

    vadmium commented Apr 18, 2016

    I noticed another possible problem with using TextIOWrapper. We call tell() to see if a rollover is needed. But I think TextIOWrapper streams can return complicated values from tell(), encoding stuff like codec state and buffered data in an integer. At the minimum, I think the meaning of max_size has changed (code points vs bytes). And perhaps it won’t work very well with some codecs that record state in the tell() value.

    @vadmium
    Copy link
    Member

    vadmium commented Apr 19, 2016

    According to bpo-12215, the TextIOWrapper.tell() position only includes decoder state, not encoder state. So we may be lucky that tell() is proportional to the file size. If so my concern about max_size isn’t so important. But it still feels like a bad hack.

    Maybe another quick fix for this would be to do two writes of the StringIO data, and then seek back to the end of the first write:

    pos = file.tell()  # StringIO position is in code points
    file.seek(0)
    newfile.write(file.read(pos))
    pos = newfile.tell()  # TextIOWrapper position is opaque
    newfile.write(file.read())
    newfile.seek(pos, 0)

    This still depends on StringIO.tell() corresponding to the number of code points, which is undocumented (we already disable newline translation in the StringIO). See also bpo-25190.

    @serhiy-storchaka
    Copy link
    Member

    This still depends on StringIO.tell() corresponding to the number of code points, which is undocumented (we already disable newline translation in the StringIO). See also bpo-25190.

    And doesn't work with Python implementation of StringIO. More general solution needs the code like:

    pos = file.tell()
    tail = file.read()
    file.seek(0)
    head = file.read()
    if tail:
        head = head[:-len(tail)]
    newfile.write(head)
    pos = newfile.tell()  # TextIOWrapper position is opaque
    newfile.write(tail)
    newfile.seek(pos, 0)

    Yet one bad thing with using independent StringIO instead of TextIOWrapper, is that saved tell() position becomes not valid after rollover(). I'm going to write few tests for cases when file position points not at the end of the stream.

    @methane
    Copy link
    Member

    methane commented Mar 28, 2019

    Getting rid of StringIO seems better approach to me.
    .tell(), .seek(), and .truncate() behaviors are very different between
    StringIO and TextIOWrapper.

    @methane
    Copy link
    Member

    methane commented Apr 8, 2019

    I think this bug is critical, because it may break user's data silently.

    If we can not fix this bug soon, how about adding note in document and raise warning?

    @serhiy-storchaka serhiy-storchaka added 3.7 (EOL) end of life 3.8 only security fixes labels Apr 8, 2019
    @serhiy-storchaka serhiy-storchaka self-assigned this Apr 8, 2019
    @grumBit
    Copy link
    Mannequin

    grumBit mannequin commented Nov 26, 2019

    This may be a silly question, however, does SpooledTemporaryFile need to exist at all?

    From some testing on macOS, SpooledTemporaryFile appeared to never have a performance advantage over OS file caching, but with max_size greater than 4GB, it was a significant disadvantage. So, if the purpose of SpooledTemporaryFile is to increase performance, it may not work.

    I found that the macOS built-in file cache was increasing in size as I wrote bigger TemporaryFile files, up to some limit the OS had decided. So, it seems the OS is automatically doing the same job as SpooledTemporaryFile. Once the OS decided to write to disk, there was no sudden hit to performance, it just slowed down.

    However, when SpooledTemporaryFile rolled-over large max_size files, there was a temporary big hit to performance, which then became a consistent slow down the same as a TemporaryFile that had exceeded the OS file cache.

    A big issue came with very large SpooledTemporaryFile max_sizes hogging RAM and causing the OS to start swapping all processes. This caused a huge performance hit to my program and the system as a whole. Once my program did finish, it took the system considerable time to reclaim swap.

    I’m guessing SpooledTemporaryFile may have benefits on light weight embedded OSes that have no, or poor, file caching. However, tuning the max_size to work with embedded systems’ limited RAM could be tricky for developers and would be hardware dependent. So, perhaps leaving file caching to the underlying operating systems is actually a better, and safer, option than offering it in Python?

    If there are no benefits to SpooledTemporaryFile, should it be deprecated? If so, as it is phasesd out, could it be patched to be a TemporaryFile wrapper, with no rollover functionality?

    @methane
    Copy link
    Member

    methane commented Nov 26, 2019

    Creating files is slow on Windows too.

    But I think we should fix the data corruption ASAP.

    While Serhiy's patch looks good to me, there is a more quick and safe way to fix the data corruption. Use TemporaryFile at first if it is text mode.

    @methane methane added the 3.9 only security fixes label Nov 26, 2019
    @JamesHennessy
    Copy link
    Mannequin Author

    JamesHennessy mannequin commented Nov 26, 2019

    I have to disagree with the idea that SpooledTemporaryFile is not useful. Although on some systems, the file system may appear as fast as memory, that cannot be assumed to be universally true. I think the idea behind SpooledTemporaryFile is completely valid.

    @JamesHennessy
    Copy link
    Mannequin Author

    JamesHennessy mannequin commented Nov 26, 2019

    I don't like the idea of using a TemporaryFile right from the beginning in text mode. You might as well remove text mode support altogether if that's the approach you want to take, since it undoes any potential performance benefit of using SpooledTemporaryFile in the first place.

    @methane
    Copy link
    Member

    methane commented Nov 26, 2019

    But poor performance is better than silent data corruption.
    If we can not fix the rollover right now, stop the spooling is a considerable option for next bugfix release.

    @JamesHennessy
    Copy link
    Mannequin Author

    JamesHennessy mannequin commented Nov 26, 2019

    The quickest fix for the data corruption problem is to delete the line
    newfile.seek(file.tell(), 0)
    from the rollover() method.

    This doesn't fix the inconsistency of tell() and seek(), but it's very low risk. It's technically a change to the API, that rollover() no longer preserves the seek position, but unless the user was writing only characters from the ISO-8859-1 character set, it wasn't working properly before anyway.

    @methane
    Copy link
    Member

    methane commented Nov 27, 2019

    @serhiy, would you create a pull request based on your patch? Or may I?

    @james it doesn't make sense at all. It breaks the file even when only ASCII characters are used.

    f = SpooledTemporaryFile(mode="w+")
    f.write("foobar")
    f.seek(3)
    f.rollover()
    f.write("baz")

    @methane methane changed the title SpooledTemporaryFile doesn't correctly preserve data for text (non-binary) SpooledTemporaryFile objects when Unicode characters are written SpooledTemporaryFile rollover corrupts data silently when Unicode characters are written Nov 27, 2019
    @serhiy-storchaka
    Copy link
    Member

    I cannot creàte a pr now. Please take this issue yourself Inada+san.

    @serhiy-storchaka serhiy-storchaka removed their assignment Nov 27, 2019
    @methane
    Copy link
    Member

    methane commented Nov 27, 2019

    New changeset ea9835c by Inada Naoki in branch 'master':
    bpo-26730: Fix SpooledTemporaryFile data corruption (GH-17400)
    ea9835c

    @miss-islington
    Copy link
    Contributor

    New changeset d21b8e8 by Miss Islington (bot) in branch '3.8':
    bpo-26730: Fix SpooledTemporaryFile data corruption (GH-17400)
    d21b8e8

    @methane
    Copy link
    Member

    methane commented Nov 28, 2019

    New changeset e65b3fa by Inada Naoki in branch '3.7':
    bpo-26730: Fix SpooledTemporaryFile data corruption (GH-17400)
    e65b3fa

    @methane methane closed this as completed Nov 28, 2019
    @serhiy-storchaka
    Copy link
    Member

    Thank you Inada-san.

    @tirkarthi
    Copy link
    Member

    Do we need to explicitly document the return value change of _file which is documented with a separate versionchanged directive for 3.7 and 3.8? Code like below could fail in them since TextIOWrapper doesn't have getvalue attribute as the minor version upgrade is done.

    import tempfile
    
    with tempfile.SpooledTemporaryFile(mode='wt') as f:
        f.write('abc')
        assert f._file.getvalue() == 'abc'

    @methane
    Copy link
    Member

    methane commented Nov 28, 2019

    Do we need to explicitly document the return value change of _file which is documented with a separate versionchanged directive for 3.7 and 3.8?

    I feel it is too detailed.

    Note that the _file attribute may be TextIOWrapper after rollover() anyway and the rollover() may be called implicitly.
    The code depending on the specific type of the _file is fragile at first.

    @tirkarthi
    Copy link
    Member

    Note that the _file attribute may be TextIOWrapper after rollover() anyway and the rollover() may be called implicitly.
    The code depending on the specific type of the _file is fragile at first.

    Okay, thanks for the detail.

    @ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    3.7 (EOL) end of life 3.8 only security fixes 3.9 only security fixes stdlib Python modules in the Lib dir type-bug An unexpected behavior, bug, or error
    Projects
    None yet
    Development

    No branches or pull requests

    6 participants