Built-in module _io can lose data from buffered files in reference cycles #62052
Comments
In Python 2, a buffered file opened for writing is flushed by the C library when the process exits. In Python 3, the _pyio and _io modules don't do this reliably: they rely on __del__ being called, which is not necessarily the case. The attached example ends with the file 'foo' empty (tested with Python 3.3 and the current trunk). The same example in Python 2, using the built-in open(), cannot end with 'foo' empty.
There are actually several issues here:
Code relying on garbage collection or a shutdown hook to flush files is borked.
Furthermore, I think it's inherently unsafe: in a multi-threaded program, flushing requires locking, and doing this at shutdown is a recipe for deadlock or arbitrary delay (imagine you have a daemon thread currently holding the lock, e.g. for stdout). In short, that's an application bug to me.
It used to be consistently reliable behavior in Python 2 (and we made it so in PyPy too), provided of course that the process exits normally; but it no longer is in Python 3. Well, I can see the reasons for not flushing files, if it's clearly documented somewhere as a change of behavior from Python 2. However, I'm complaining about the current behavior: files are flushed *most of the time*. That's behavior that is clearly misleading, or so I would think. I'm rather sure that there are many small scripts and large programs out there relying on automatic flushing, and one day they'll hit a case where the file is not flushed and get the worst kind of error: a file unexpectedly truncated at 99% of its length, in a way that cannot be reproduced by small examples. Feel free to close anyway as not-a-bug; I won't fight the Python 3 behavior, because Python 2 works as expected.
When you say Python 2, I assume you mean CPython 2, right?
That's the problem with implementation-defined behavior ;-)
> "In Python 2, a buffered file opened for writing is flushed by the C library when the process exits."

> "When you say Python 2, I assume you mean CPython 2, right?"

It looks to be a feature of the standard C library, at least the GNU libc: its libio library installs an exit handler that flushes all open files. You can see it if you set a breakpoint on write() using gdb:

```
Breakpoint 5, write () at ../sysdeps/unix/syscall-template.S:82
(gdb) where
```

See the source of _IO_flush_all_lockp() in the GNU libc: the glibc maintains a list of open FILE* streams, _IO_list_all, which is protected by the list_all_lock lock.
Yes, it's guaranteed by POSIX/ANSI (see man exit).
Hum, POSIX (2004) is not so strict: 2013/4/29 Charles-François Natali <report@bugs.python.org>:
You're looking at the wrong section: "The exit() function shall then flush all open streams with unwritten [...]" Then a couple of lines below: "The _Exit() [CX] and _exit() functions shall not call functions [...]" It's guaranteed for exit(), not _Exit()/_exit().
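The same distinction can be observed from Python: a normal interpreter exit flushes what it can, while os._exit() (the analogue of _exit()) terminates immediately. A minimal sketch demonstrating the data loss (file name and layout are invented for illustration):

```python
import os
import subprocess
import sys
import tempfile

# A child process writes to a buffered file, then calls os._exit(),
# which terminates immediately without flushing any buffers.
child_code = r'''
import os, sys
f = open(sys.argv[1], "w")   # buffered text file
f.write("bar")               # sits in the buffer, not yet on disk
os._exit(0)                  # no flushing, no finalizers
'''

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "foo")
    subprocess.run([sys.executable, "-c", child_code, path], check=True)
    with open(path) as f:
        contents = f.read()

print(repr(contents))
```

The file exists (open() created it) but is empty, because the three bytes never left the TextIOWrapper's buffer.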
Oops ok, thanks. 2013/4/30 Charles-François Natali <report@bugs.python.org>:
In 3.4+ the example script always writes the string "bar" to the file "foo". Tested by running it in a loop hundreds of times. Cleaning up at shutdown was enhanced in 3.4.
Is anyone interested in working on the "maintain a list of open file objects" idea?

I consider that Python 3 does its best to flush data at exit, but it's not good practice to rely on destructors to flush data. I mean, there is no guarantee that files will be written in the correct order and that files will stay consistent. It's much safer to explicitly call the close() method.

By the way, Python 3 now emits ResourceWarning warnings when files are destroyed without being explicitly closed. I enhanced the code handling warnings during Python shutdown to show these warnings in more cases, but it is still not perfect. I opened issue bpo-21788 as a reminder that the code can still be enhanced.

For the "maintain a list of objects" idea, I expect race conditions with threads, signals and processes (fork). We already have known issues with locks + fork; I expect even more issues if we maintain a list of open files.

I propose to close the issue as "wontfix". Python 3 does its best, but please call the close() method explicitly! Maybe the undefined behaviour should be documented (with a warning?) for the buffered writer in the io module, and maybe also for the open() function?
I'm +1 on closing this. I agree with Charles-François that it's never been guaranteed by the Python specification. The Pythonic way to work with files is to use the "with" statement or, if you need a long-living file stream, to carefully close files in a "finally" block or in the __exit__ method of some class used as a context manager. Non-observance of this rule can be considered a bug (see bpo-22831).
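For reference, a minimal sketch of the recommended pattern (the path is arbitrary):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "foo.txt")

    # The with statement guarantees flush and close when the block exits,
    # independently of garbage collection or interpreter shutdown behaviour.
    with open(path, "w") as f:
        f.write("bar")

    with open(path) as f:
        data = f.read()

print(data)
```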
Victor: there is the GIL, you don't need any locking.
Serhiy, I believe this still happens in Python 3.4, but it is harder to reproduce. I couldn't get Armin's script to produce the problem either, but I'm pretty sure that this is what causes e.g. https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=771452#60.
Maybe, as it's been said several times, it's an application bug: files [...]
I hate to repeat myself, but if the C standard says "files are flushed at exit"; if Python 2 follows this standard; and if Python 3 follows it most of the time but *not always*... then it seems to me that something is very, very buggy in the worst possible way (rare data loss).
Do you have a working example in 3.4+?
If I understood correctly, Python 3.4 tries harder to find cycles and call destructors at the end of the program, but that's not a full guarantee. For example you can have a reference from a random C extension module. While trying to come up with an example, I found one that I don't fully understand, but the point is that it shows how easy it is to defeat it:

```python
import sys

f = open('foo.txt', 'w')
f.write('abc')

def func(*args):
    return func

sys.settrace(func)
```
(Ah, it's probably a reference from the trace function -> func_globals -> f).
Here is an example.
Note the example I posted doesn't involve the shutdown sequence. It calls gc.collect() explicitly.
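The attachments were not migrated with this issue, but a sketch of the kind of program being discussed (file name and cycle shape invented here) looks like this; whether the data survives depends on which finalizer the GC happens to run first:

```python
import gc
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "foo")

f = open(path, "w")
f.write("bar")

holder = {}
holder["self"] = holder   # a reference cycle...
holder["file"] = f        # ...that keeps the buffered file alive

del f, holder
gc.collect()   # wrapper, buffer and FileIO are finalized in arbitrary order

size = os.path.getsize(path)
print(size)    # 3 if the buffer was flushed before the fd was closed, else 0
```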
To add to the confusion: Antoine's example produces an empty file on the current trunk cd282dd0cfe8. When I first tried it on a slightly older trunk (157 changesets ago), it correctly emitted a file with "barXXX ", but only if "gc.collect()" was present. Without "gc.collect()" it didn't. This seems at odds with the explanation that "gc.collect()" should be what occurs at shutdown now.
The problem is that the order of tp_finalize calls is arbitrary when there is a reference cycle (the same is true, of course, for tp_clear). So depending on the exact layout of the garbage list, the TextIOWrapper could be collected before the BufferedWriter and the FileIO (good), or after (bad). I don't see an easy way to solve this. Either _io should provide hints to the GC (which kind of hints?), or tp_finalize should be customized to somehow call a dependent wrapper's tp_finalize (how? by keeping a weakref?).
Maybe accepting the fact that relying on finalizers is a bad idea here?
You mean encourage people to explicitly use "with" or call close()? Yes, that's the simplest answer. It's why we added ResourceWarning.
If we want to guarantee that all files are properly flushed at exit, the best option is to maintain a list of open files. I'm not interested in implementing that; it sounds too complex. For example, open("text.txt", "w") creates 3 objects: FileIO, BufferedWriter, TextIOWrapper. You would have to register these 3 objects and flush them in the right order. Python is dynamic: you are free to modify objects, for example using the detach() method or by modifying the buffer attribute of the TextIOWrapper. How do you decide which object should be flushed? In which order? We also have to take care of signals, threads and forks, stay portable, etc.
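The three stacked objects are easy to see directly (the file name here is arbitrary):

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "t.txt")

f = open(path, "w")             # a buffered text file is three objects:
text_layer = type(f)            # io.TextIOWrapper (encoding/decoding)
buffer_layer = type(f.buffer)   # io.BufferedWriter (the write buffer)
raw_layer = type(f.buffer.raw)  # io.FileIO (the OS file descriptor)
f.close()
```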
On 04/12/2014 12:05, STINNER Victor wrote:
I am not talking about flushing files at exit.
I created a new PR which uses the atexit module instead of _Py_PyAtExit. I think registering in io.py is okay. I see that atexit is now implemented in C. Rather than registering in io.py, we could create a C API to register callbacks (i.e. atexit_register). That would perhaps be a bit cleaner.
Attached is a script that triggers the non-flushing behaviour for me. I don't think it is reliable, since it depends on the order in which FileIO and BufferedWriter are finalized when the gc finds them in a reference cycle. BTW, it is arguable that the root cause of this bug is that we no longer do topological ordering when calling finalizers. Originally, the cycle GC would refuse to call __del__ methods because once objects are part of a cycle, there is no defined topological ordering. So, we linked that garbage to gc.garbage rather than calling __del__ with potentially undefined results. Now the GC has been changed to call __del__ anyhow. I haven't studied how this was changed, but this non-flushing bug is a result. Having the buffer added to gc.garbage would also result in the data not being flushed, but arguably it would be more understandable what's going on. I'm not arguing that we should go back to that, just that the current behaviour can be subtle and confusing.
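That behaviour change is observable directly: since Python 3.4 (PEP 442), an object with __del__ in a cycle is finalized instead of being parked in gc.garbage:

```python
import gc

finalized = []

class WithDel:
    def __del__(self):
        finalized.append(True)

a = WithDel()
a.self = a     # a cycle involving an object that defines __del__
del a
gc.collect()

# Before 3.4 the cycle would have ended up in gc.garbage, uncollected;
# now __del__ runs and gc.garbage stays empty.
print(gc.garbage, finalized)
```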
The reason Python 2 did well here is simply that Python 2 had a single Python object (the file object) for each actual file. Python 3 has several of them (the raw IO object, the buffered IO object, possibly the text IO wrapper), and so suddenly the finalization order matters.
In the process of trying to write a test for this, I now realize that PR 4847 is not really a fix. If the underlying file gets closed during an earlier gc.collect() and not during shutdown, the extra flushing step is not going to help. So, using atexit is not really a fix. Maybe there is something else we can do; I need to think about it more.
I'm not well versed in GC, but if it were possible to call destructors in reverse chronological order, that could help with this issue.
Using reverse chronological order would probably work in 99% of cases, but then you would have that rare case where it didn't. So, I'm not too keen on that approach. I think this is a classic problem with finalization and GC; I should probably do some actual reading of the literature. One idea that occurs to me is that the underlying file object could keep a weakref to the higher-level (e.g. buffer) objects. Then, if the underlying file gets closed, it could make sure to call flush on the higher-level objects first. The weakref would ensure that a reference cycle is not created.
Given the apparent difficulties getting this right, how about not attempting to implicitly flush buffers in the finalizer at all? This means scripts relying on this will break, but in contrast to the current behavior they will break consistently, so it's easy to find and fix the problem.
Welp, another day, another attempt. As mentioned in PR 4847, atexit is not the answer. If the raw/buffered file pair are part of a reference cycle and the GC cleans it up before atexit runs, then the buffered data can get lost. I attempted to implement my weakref idea (i.e. the raw file keeps a weakref to the buffered file and calls flush before the raw file gets closed). That doesn't work either, because the GC clears the weakref before calling __del__. The attached patch "buffer_register_flush.txt" does seem to work. The downside is that it creates a reference cycle between the raw and buffered file objects. Perhaps that is not a problem, since unless you call close() on the raw file you will be leaking resources anyhow. In the patch, calling close() removes the reference cycle. I still feel like this is worth fixing, as ugly as the implementation is.
I think it would be quite disruptive to create a reference cycle each time open() is called. It may also break user scripts.
Yeah, I think you are correct. Currently, files not part of reference cycles get properly flushed and closed by the reference counter. Implementing my "buffer_register_flush" patch would cause files to be closed only by the cyclic garbage collector (if not explicitly closed). That would mean a script that opens a large number of files could run out of file descriptors. Basically, we lose one of the main advantages of reference counting. Probably the GC should keep track of how many files are open and call collect() when a threshold is reached. Still, forcing every file to be collected only by the cyclic GC seems too ugly a solution to this issue.
Here's another idea: have a new type field (a tp_something) that exposes the GC priority for this type. It could be an integer between -8 and 7, for example (0 being the default). Then tp_finalize would be called in sorted priority order. The tricky part of course is to make this acceptable performance-wise (i.e. avoid walking the collectable list 16 times, once for each priority level), for example by creating bucket lists for each priority level (only for objects with a non-NULL tp_finalize).
Since the issue seems to have been active lately, may I suggest my view on solving it. One solution that comes to my mind is to keep a weak reference to the file, and to register an atexit function per file (and to unregister it when the file is closed). Example concept-illustrating Python code for the _pyio module:

```python
import atexit, operator, weakref
# ...

class TextIOWrapper(TextIOBase):
    def __init__(self):
        # ...
        self._weakclose = operator.methodcaller('close')
        atexit.register(self._weakclose, weakref.proxy(self))

    def close(self):
        atexit.unregister(self._weakclose)
        # and now actually close the file
```

There is a possibility of a negative impact arising from the use of operator.methodcaller, because it may return something cached in future versions of Python. There is also the issue of unregistering an atexit function during atexit processing. But the general idea seems to be simple and worth considering.
Using atexit is not the solution, because the data can be lost even while the program is running, not just at interpreter shutdown. The problem occurs if the buffered file object and the underlying file object are both part of a reference cycle. Then, when the cycle is destroyed by the GC module, if the file __del__ is called before the buffer __del__, data is lost. So far, the best solution I've come up with is as follows: split the buffered file object into two objects, a wrapper object that will be returned from open() and an underlying object that holds the actual buffer. Make the underlying file object keep references to all the buffers that are using the file. When the file __del__ gets called, first flush all of the buffers and then close the file. Splitting the buffered file object is required so that we don't create reference cycles.
This may be a bit of a left-field or too-big-a-hammer suggestion, but as far as I can tell from this thread [1] it probably is technically possible to modify the GC to clear weakrefs after calling __del__. Nick wasn't a fan (he likes the invariant that weakrefs can't trigger a resurrection), but if all else fails it might be worth re-raising. You could add a secondary reference count on FileIO recording how many BufferedIO wrappers there are around it; then its __del__ would skip calling close() if there are still BufferedIO wrappers, and BufferedIO.__del__ would decrement the reference count and close the underlying file if it hits zero and FileIO.__del__ had already been called. [1] https://mail.python.org/pipermail/python-dev/2016-October/146747.html
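A pure-Python sketch of that secondary-refcount idea (class names and bookkeeping are invented here; the real _io types are implemented in C, and __del__ is called by hand below to simulate one possible GC order):

```python
class RawFile:
    """Stand-in for FileIO, with a count of live buffered wrappers."""
    def __init__(self):
        self.wrapper_count = 0
        self.del_seen = False
        self.closed = False

    def close(self):
        self.closed = True

    def __del__(self):
        self.del_seen = True
        if self.wrapper_count == 0:    # no wrappers left: safe to close now
            self.close()

class BufferedWrapper:
    """Stand-in for BufferedWriter: flush, then maybe close the raw file."""
    def __init__(self, raw):
        self.raw = raw
        raw.wrapper_count += 1

    def __del__(self):
        # ... flush the buffer to self.raw here ...
        self.raw.wrapper_count -= 1
        if self.raw.wrapper_count == 0 and self.raw.del_seen:
            self.raw.close()           # raw.__del__ ran first; close now

# Simulate the "bad" GC order: the raw file is finalized first.
raw = RawFile()
buf = BufferedWrapper(raw)
raw.__del__()              # close is deferred: a wrapper is still alive
deferred = not raw.closed
buf.__del__()              # wrapper flushes, then closes the raw file
```

Whichever of the two finalizers runs last performs the close, so the flush always happens before the descriptor goes away.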
I marked bpo-38548 as a duplicate of this issue. Copy of msg355062:

Consider the following program:

```python
f = open("out.txt", "w")
f.write("abc\n")
exit(0)
```

Please note the absence of f.close().

```python
from sympy.core import AtomicExpr

class MyWeirdClass(AtomicExpr):
    def __init__(self):
        pass

f = open("out.txt", "w")
f.write("abc\n")
exit(0)
```

Note: sys.version is: "3.7.3 (default, Oct 7 2019, 12:56:13) \n[GCC 8.3.0]"
The PR below fixed it:
Is there still an issue to be fixed here? If so, re-open and explain, or open a new issue. The documentation was updated to highlight that unclosed files can't be guaranteed to flush buffered data.
I think there is still an issue here, even though it is a low-priority bug. It's good that we now document that unclosed files can lose buffered data. However, it is bad, IMHO, that it works 99%+ of the time but sometimes doesn't. I think that's worth fixing, if we can do it without too much pain. I created GH-105171 as a different approach to fix this.
Over the last few years, I made many enhancements to be able to log ResourceWarning as late as possible. But Python finalization is not deterministic, and so some warnings cannot be logged when emitted "too late". My latest failed attempt to make finalization more deterministic: #86837
The root cause is the GC algorithm for breaking reference cycles. If we have a cycle:

```mermaid
flowchart LR;
    A==>B;
    B==>C;
    C==>A;
    C-->TextIOWrapper;
    TextIOWrapper-->BufferedWriter;
    BufferedWriter-->FileIO;
```

then it will try to break it by breaking any reference, including the references C-->TextIOWrapper, TextIOWrapper-->BufferedWriter and BufferedWriter-->FileIO, even though they do not lead to breaking the cycle. The only references whose removal advances reference cycle elimination are A-->B, B-->C and C-->A. If the algorithm were smarter, this would solve most of the problems, not only with IO.
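The arbitrariness is easy to demonstrate: with two objects in a cycle, the GC runs both finalizers, but the order is an implementation detail:

```python
import gc

order = []

class Node:
    def __init__(self, name):
        self.name = name
    def __del__(self):
        order.append(self.name)

a, b = Node("A"), Node("B")
a.other, b.other = b, a    # A <-> B reference cycle
del a, b
gc.collect()

print(order)   # both names appear, in whatever order the GC picked
```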