Issue4944
This issue tracker has been migrated to GitHub,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2009-01-14 04:42 by javen72, last changed 2022-04-11 14:56 by admin. This issue is now closed.
Files | ||||
---|---|---|---|---|
File name | Uploaded | Description | Edit | |
test_file_flush.py | ggenellina, 2009-01-14 06:41 | main script | ||
checkfile.c | ggenellina, 2009-01-14 06:42 | check file existence like nmake would do | ||
test_file_flush.py | ggenellina, 2009-01-15 03:29 | 2nd version using threads and a queue | ||
checkfile.c | javen72, 2009-01-20 09:03 | 2nd version to check file existence using CreateFile | ||
test_file_flush.py | javen72, 2009-01-20 09:05 | 3rd version to emulate the real application |
Messages (24) | |||
---|---|---|---|
msg79829 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-14 04:42 | |
I encountered a very strange issue in file flush operation in Windows. Here's the scenario of my application: 1. The main thread of my application will create makefiles sequentially. 2. Once a makefile is generated, launch a separate process calling nmake.exe to run it in parallell. The main thread then create another makefile until no more makefiles to create. 3. The number of new processes is limited by command line options. My application has been running for almost a year without any problem. But, after I made some changes recently to the content of makefile generated, "nmake.exe" in separate process sometimes complains that the makefile was not found. But when I went into the directory, the makefile was there. Because I didn't change any thing related to file creation and the new makefiles are a little bit less than before, I guessed that the makefile just created hasn't been flushed to disk because of size change so the new process could not see it in a short time. So I decided add code to force flush file buffer after writing the file content (I didn't flush the file explicitly before). I did it like below: Fd = open(File, "w") Fd.write(Content) Fd.flush() os.fsync(Fd.fileno()) Fd.close() The strangest thing happened. The "makefile" missing happened more frequently than no flush operation. I searched the web but no answer there. Finally I tried to use Windows file API to create the file via pywin32 extension. The problem's gone. import win32file Fd = win32file.CreateFile( File, win32file.GENERIC_WRITE, 0, None, win32file.CREATE_ALWAYS, win32file.FILE_ATTRIBUTE_NORMAL, None ) win32file.WriteFile(Fd, str(Content), None) win32file.FlushFileBuffers(Fd) win32file.CloseHandle(Fd) I tried writing small python extension in C to make use Windows API to create file like above. It also works well, even I removed the FlushFileBuffers() calling. I think that there's a bug in Python file buffer mechanism. |
|||
msg79838 - (view) | Author: Gabriel Genellina (ggenellina) | Date: 2009-01-14 06:41 | |
I tried to reproduce the issue, interpreting your description, but failed. It worked fine in my setup. Perhaps you can add more elements until it fails. |
|||
msg79841 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * | Date: 2009-01-14 07:48 | |
python2.5 uses the functions of the fopen() family: fwrite(), fclose(). Does the problem reproduce if you use these functions in your extension module? Are your files located on the local hard drive, or on a network storage? |
|||
msg79854 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-14 15:31 | |
gagenellina, My application is a little bit different from your test code. It won't wait for the exit of new process and there're file writing operations during makefile running. I changed the test code to be real multi-process and tried many file sizes. But I cannot reproduce it. Maybe my application is more complicated situation. |
|||
msg79856 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-14 15:57 | |
The created files are on local drive. I saw the problem on the laptop (XP-SP2), desktop(XP-SP3) and server (Win2003). But there's no such problem on the Linux and Mac boxes. I tried to use fopen/fwrite in my extension according to your suggestion. The problem wasn't be reproduced. It seems the bug is more alike in Python part. My application is a build system and is also an open source project. Is it possible for you to download it and try it in your box? I created temporary user (py123, password: 123py123) for you (just in case) and here's steps of how to reproduce it. 1. Checkout the build system source code in, for example, c:\test C:\test> svn co --username py123 --password 123py123 https://buildtools.tianocore.org/svn/buildtools/trunk/BaseTools tools 2. Checkout the source code to be built against in c:\test C:\test> svn co --username py123 --password 123py123 https://edk2.tianocore.org/svn/edk2/trunk/edk2 edk2 3. Change the source code between line 222 and line 229 of c:\test\tools\Source\Python\Common\Misc.py (SaveFileOnChange function) like below: Fd = open(File, "wb") Fd.write(Content) Fd.flush() os.fsync(Fd.fileno()) Fd.close() 4. In c:\test\edk2, run C:\test\edk2> edksetup.bat C:\test\edk2> set PYTHONPATH=C:\test\tools\Source\Python C:\test\edk2> python.exe C:\test\tools\Source\Python\build\build.py -n 2 -p MdeModulePkg\MdeModulePkg.dsc -a IA32 -s 5. If the application stops with message like "makefile not found" or "AutoGen.h not found" message, that means the problem happened. Visual Studio 2005 is needed to reproduce it. |
|||
msg79857 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-14 16:16 | |
There would be more chances to see the problem by doing this: C:\test\edk2> python.exe C:\test\tools\Source\Python\build\build.py -n 2 -p IntelFrameworkModulePkg\IntelFrameworkModulePkg.dsc -a IA32 -s |
|||
msg79858 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * | Date: 2009-01-14 16:33 | |
I really tried, but I can't run your script. first PYTHONPATH must be set to some value (I tried with tools\Source\Python); Also the tool is complaining about missing WORKSPACE and EDK_TOOLS_PATH env vars, and a missing file "Conf\target.txt". This seems very complicated for a test case... furthermore it comes with binaries already linked with a specific version of python... what if I want to use a debug build? Another question: is there a running anti-virus? if yes, can you try to temporarily disable it? |
|||
msg79884 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-15 02:05 | |
Thank you very much for the trying. You might miss the step 4 in my previous message. The step is: C:\test\edk2> edksetup.bat C:\test\edk2> set PYTHONPATH=C:\test\tools\Source\Python C:\test\edk2> python.exe C:\test\tools\Source\Python\build\build.py -n 2 -p IntelFrameworkModulePkg\IntelFrameworkModulePkg.dsc -a IA32 -s The Visual Studio 2005 must be in the standard installation directory. Otherwise the C:\test\edk2\Conf\tools_def.txt needs to be changed to reflect the real path. And I tried to disabled all antivirus services and the problem is still there. Don't worry about the binary version of build in edk2\BaseTools\Bin\Win32 (linked against Python 2.5.2). The step I told you is to execute my application from Python script directly. And I tried to execute from script source against Python 2.5.4 and the problem is the same. And no matter running the build from script source or the freeze-ed binary, the results are the same either. If it's hard or inconvenient for you to reproduce it, could you please give me any advice or suggestion on how to debug it (in the interpreter) and where's most possible place the root cause would be in the Python interpreter's code? I can try to change something in the Python interpreter's code, rebuild it and try it on my machine. Although I have work around for this problem, I'd like to root cause it to avoid further possible build break of our project. |
|||
msg79885 - (view) | Author: Gabriel Genellina (ggenellina) | Date: 2009-01-15 03:29 | |
I patiently waited for all those 150MB to download, modified Misc.py, run the specified commands and got this error: build.py... : error 7000: Failed to start command C:\Program Files\Microsoft Visual Studio 8\Vc\bin\nmake.exe / nologo -s t build [C:\test\edk2\Build\MdeModule\DEBUG_MYTOOLS\IA32\MdePkg\Library \BasePrintL ib\BasePrintLib] That's right - VS is installed in another place. "C:\Program Files" doesn't even exist in my Spanish version of Windows. edksetup.bat didn't report any error, and I have nmake.exe in my PATH. Anyway, trying to hunt a bug in the middle of 150 MB of code is way too much. You should try to reduce it to the smallest piece that still shows the problem. (As a side note, a project so big SHOULD have plenty of unit tests, but I see they're almost inexistant. Having to write tests forces people to use a more modular design. In this case, probably it would have been easier to test this issue, in a more isolated way, without having to bring the whole framework in). Multithreaded programs may be tricky. Looking at the _MultiThreadBuildPlatform method (we're talking of it, right?) it isn't obvious that there are no race conditions in the code. For a producer-consumer process like this, I would use a pool of worker threads, all waiting for work to do from a Queue, and a main (producer) thread that puts tasks to be done into the Queue. The synchronization is already done for you, it's almost automatic. I've modified my previous example to show this usage. Unless you can bring more evidence that this is a real Python issue and not a race condition in your code or something, I'd recommend to close this as invalid. |
|||
msg79886 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-15 04:36 | |
I agree multithread programming is kind of difficult and tricky. But I don't think there's a race condition in _MultiThreadBuildPlatform method, because I do have only one producer. And the consumer consumes the product only when it's done (i.e. the file is created and closed). The only race condition, I think, it's in the Python or Windows file system, in which the file created might not be seen by other process due to the file buffer mechanism. I think the flush() method of file object and os.fsync() are to remove the race condition. But it seems that they are not working as they're supposed to be. What I know is that os.fsync() calls the _commit() which calls FlushFileBuffers(). Why no problem if I call the FlushFileBuffers() directly? That's why I think the most possible race condition is in Python file buffer operation which is out of the control of my Python code. I'm sorry that I didn't realize there's 150M code to checkout. Thanks for your patience. Actually they are not code of my application itself. They are the code used to test my application because my application is a build system which needs source code to build. The real code of my application is in the, for my example, directory of C:\test\tools\Source\Python with just about 3M source files :-) And I think I have narrowed down the issue in the file creation in SaveFileOnChange function in C:\test\tools\Source\Python\Common\Misc.py. I know it's very hard to reproduce issue in multi-thread context. And I cannot say absolutely there's no bug in my code. It's OK for you to close this tracker. But it would be better to let it open for a few days so that I can do more investigation then. Anyway, thanks for the trying again. |
|||
msg79940 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-16 08:37 | |
I narrowed down the root cause in the GIL of Python. I read the source code of implementing os.fsync and found it's using Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS to enclose the calling of _commit() in order to be thread safe. I tried to add them in my C extension and then the issue was reproduced. It looks like the thread state operation or GIL caused a race condition and let os.fsync() not flush the makefile in buffer to disk before it returns. The multi-thread scenario in my application is a. the main thread produces makefile b. once a makefile is generated, the main thread launches a new thread in which a separated process is launched to run the makefile until it exits. I think there's no issue in this multi-thread scenario. And the Python 2.5.2 Maunal, Python/C API Ch8.1, says GIL must be taken care when doing some blocking I/O operations, but it caused thread unsafe. So I still think there's a bug in the Python. |
|||
msg79943 - (view) | Author: Gabriel Genellina (ggenellina) | Date: 2009-01-16 09:25 | |
If you start the new thread *after* the file is closed, no race condition should exist. I'm sorry but I'm unable to follow the code you use to create threads and write the makefile; but you should verify that your original assertion "Once a makefile is generated, launch a separate process..." is true. Alternatively, when using a Queue like in my last example, make sure the new job enters the queue *after* closing the file. |
|||
msg80171 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-19 15:00 | |
I'm quite certain that the makefile is generated before the "make" program is launched in separated process. Follow is the original code where the makefile is created (line 14) and a task is put in queue (line 19). It's executed in the main thread (no parallel execution here). There's no way that the process is launched before the makefile generation. 01 for Module in Pa.Platform.Modules: 02 Ma = ModuleAutoGen(Wa, Module, BuildTarget, ToolChain, Arch, self.PlatformFile) 03 if Ma == None: 04 continue 05 # generate AutoGen files and Makefile here 06 if self.Target not in ['clean', 'cleanlib', 'cleanall', 'run', 'fds']: 07 # for target which must generate AutoGen code and makefile 08 if not self.SkipAutoGen or self.Target == 'genc': 09 Ma.CreateCodeFile(True) 10 if self.Target == "genc": 11 continue 12 13 if not self.SkipAutoGen or self.Target == 'genmake': 14 Ma.CreateMakeFile(True) 15 if self.Target == "genmake": 16 continue 17 18 # Generate build task for the module which will be launched separately 19 Bt = BuildTask.New(ModuleMakeUnit(Ma, self.Target)) I think the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS should not be used for fsync in the Python core code (in posixmodule.c, right?). Because fsync is mostly used to solve file operation issue in parallel execution context, it should not give out CPU time to other threads or processes when it's running. |
|||
msg80194 - (view) | Author: Gabriel Genellina (ggenellina) | Date: 2009-01-19 17:03 | |
> I'm quite certain that the makefile is generated before > the "make" > program is launched in separated process. Follow is the > original code > where the makefile is created (line 14) and a task is put > in queue (line > 19). It's executed in the main thread (no parallel > execution here). > There's no way that the process is launched before the > makefile generation. It isn't obvious that this is the case, at least to me, due to those "if" and "continue" in the code. Try instrumenting it a bit and see what happens; a few "print" around lines 14 and 19 should suffice. Also, try to come with a *short* example that shows the problem. Remove all irrelevant details: if you write a "constant" makefile and still fails, omit all dependences for makefile generation; if you invoke a simple .exe (like the one I posted) and it still fails, replace the call to nmake.exe. And so on, until you can't remove anything. It may happen that you find yourself the problem doing this. Another strategy is to start with a simple example that works (like the one I posted) and adding more and more things until it doesn't work anymore. With the code as it is now, it's difficult to say whether this is a Python bug or not; there are tons of other factors involved. > I think the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS > should not > be used for fsync in the Python core code (in > posixmodule.c, right?). > Because fsync is mostly used to solve file operation issue > in parallel > execution context, it should not give out CPU time to other > threads or > processes when it's running. Why not? All I/O operations release the GIL, and that's a good thing. In this case, if (as you assert) the other thread that accesses the file hasn't started yet, it doesn't matter if the GIL is released or not. Yahoo! Cocina Recetas prácticas y comida saludable http://ar.mujer.yahoo.com/cocina/ |
|||
msg80237 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-20 09:03 | |
I did trace the order of file creation and process launch. It shows the file is indeed created before the process launch. I did another investigation. I added a new process, "cmd.exe /c copy Makefile NewMakefile", to copy the the Makefile created, just before launching the "nmake.exe" process. The strangest thing happened again: each makefile was copied successful but there's still "makefile not found" reported by "nmake.exe" process. I tried standalone copy application "xcopy.exe" and the result was the same. So I guess that "cmd.exe", "xcopy.exe" and "nmake.exe" might use different families of API or different mode (sync vs async) to access file. I decided to try the code you provided. In checkfile.c, fopen is used to test the file existence. I changed it to the Windows native API CreateFile and I also added a file write operation in order to make it more similar to the real case. Eventually, the problem was reproduced in your code. Following are the successful number of creating 1000 files 5 times in different file sizes: Create file in 403 bytes, 985, 992, 984, 989, 992 (no flush after creation) 883, 886, 907, 909, 915 (flush after creation) Create file in 4061 bytes 983, 976, 982, 977, 983 (no flush after creation) 654, 672, 684, 686, 648 (flush after creation) Create file in 16461 bytes: 970, 967, 963, 963, 971 (no flush after creation) 598, 664, 711, 653, 623 (flush after creation) In summary: a. Using fopen to test a file existence in check_file.c will never report failure, no matter buffer is flushed or not. b. Using CreateFile (read mode) to test a file existence in check_file.c will always report failure. The bigger the file size will cause more failure reported; the flush operation after file creation in test_file_flush.py will cause more failure reported; the flush operation after new file creation in check_file.c will cause more failure reported; no flush operation in both check_file.c and test_file_flush.py almost cause no failure. I don't know what's root cause: GIL, Python thread state switch, Python file buffer or Windows FileCreate API. I'm just certain there's race condition between Python and Windows. |
|||
msg80238 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-20 09:05 | |
The test code and script which can reproduce the problem has been uploaded. |
|||
msg80319 - (view) | Author: Gabriel Genellina (ggenellina) | Date: 2009-01-21 07:36 | |
Thanks for adapting the smaller example. I think I figured out what's the problem. The error reported by checkfile.c is 0x20 = ERROR_SHARING_VIOLATION "The process cannot access the file because it is being used by another process." I modified the subprocess call, adding the parameter close_fds=True -- and now I see no errors. Please do a similar change in your application and see if it works. |
|||
msg80358 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-22 02:58 | |
I'm using Python2.5 in which close_fds is not available in Windows. And I cannot use Python2.6 because I need to redirect the stdout and stderr and there's performance concern. I have questions on the root cause: a. why doesn't fopen() has sharing issue? b. why don't os.close() and FileObject.close() really close the file? |
|||
msg80359 - (view) | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * | Date: 2009-01-22 08:20 | |
python2.6 is not so different from python2.5. Which performance concerns do you have? |
|||
msg80371 - (view) | Author: Javen Wang (javen72) | Date: 2009-01-22 16:29 | |
I don't have Python2.6 installed. I just have ever read a bench article comparing the performance between different version of Python, including 2.5, 2.6 and 3.0. That article shows, for the same script, Python2.6 has longer running time than Python2.5. My application is a build system and the users are very care about the time spent in build. That's why I have the performance concern. What concerns me more is the stdout/stderr redirection. Python2.6 manual says they cannot be redirected if close_fds is set to True. My application relies on the I/O redirection to control the screen output from subprocess. Actually I have had a work-around for this issue. It works very well so far. I reported a bug here just because I want the Python to be better. I learned it one year ago but I love this language. I just hope nobody else encounter such problem again. If you guys think it won't be fixed in Python 2.5 or has been fixed in Python 2.6, it's OK to close this tracker. Thanks for your time. |
|||
msg112743 - (view) | Author: Terry J. Reedy (terry.reedy) * | Date: 2010-08-04 03:23 | |
To be a tracker bug, as opposed to feature request, behavior must disagree with the manual or doc string, not personal expectation. In any case, OP declined to test on 2.6 and suggested closing, so I am. This still remains on tracker to be searched. Anyone who considers reopening should verify a discrepancy with 2.7 or later current version. |
|||
msg239306 - (view) | Author: Emil Styrke (Emil.Styrke) | Date: 2015-03-26 06:40 | |
I have experienced this issue with Python 2.7.8 and 2.7.9. It is almost the same issue as the OP experiences as far as I can tell: spawn some Python threads that each create a file, flush, fsync, close, then start a subprocess which uses the file through the Windows API. One thing that may differ is that I create the file and spawn the child from the same thread. I tried with close_fds=True, and it indeed works then, but like the OP, in my production code I need to get the output from the process, so it is not a usable workaround for me. Test script and child program code is available at https://gist.github.com/anonymous/5d7d198cd3c22b31fa2a (the file upload button doesn't work for me it seems). Running the script on my machine will print at least one failure most of the time, but not always. |
|||
msg239374 - (view) | Author: Eryk Sun (eryksun) * | Date: 2015-03-27 04:15 | |
Emil, Your example child process opens the file with only read sharing, which fails with a sharing violation if some other process inherits the file handle with write access. The `with` block only prevents this in a single-threaded environment. When you spawn 10 children, each from a separate thread, there's a good chance that one child will inherit a handle that triggers a sharing violation in another child. Using close_fds is a blunt solution since it prevents inheriting all inheritable handles. What you really need here has actually already been done for you. Just use the file descriptor that mkstemp returns, i.e. use os.fdopen(infd, 'wb'). mkstemp opens the file with O_NOINHERIT set in the flags. |
|||
msg239375 - (view) | Author: Emil Styrke (Emil.Styrke) | Date: 2015-03-27 06:01 | |
eryksun, thank you for the explanation, the penny finally dropped for me. It's the *other* child processes that keep the file open, not the parent. Actually this bug started hitting me after I *stopped* doing exactly what you suggest (fdopen on the temp file), and instead started using file names generated by my program (opened with regular "open") and now I see why that is. If I use os.fdopen with the O_NOINHERIT flag, it works as expected. |
History | |||
---|---|---|---|
Date | User | Action | Args |
2022-04-11 14:56:44 | admin | set | github: 49194 |
2015-03-27 06:01:42 | Emil.Styrke | set | messages: + msg239375 |
2015-03-27 04:15:18 | eryksun | set | nosy:
+ eryksun messages: + msg239374 |
2015-03-26 06:40:11 | Emil.Styrke | set | nosy:
+ Emil.Styrke messages: + msg239306 versions: + Python 2.7 |
2010-08-04 03:23:06 | terry.reedy | set | status: open -> closed nosy: + terry.reedy messages: + msg112743 resolution: postponed |
2010-01-29 01:13:10 | brian.curtin | set | priority: normal nosy: + brian.curtin stage: test needed |
2009-01-22 16:29:10 | javen72 | set | messages: + msg80371 |
2009-01-22 08:20:17 | amaury.forgeotdarc | set | messages: + msg80359 |
2009-01-22 02:58:29 | javen72 | set | messages: + msg80358 |
2009-01-21 07:36:02 | ggenellina | set | messages: + msg80319 |
2009-01-20 09:05:39 | javen72 | set | files:
+ test_file_flush.py messages: + msg80238 |
2009-01-20 09:03:18 | javen72 | set | files:
+ checkfile.c messages: + msg80237 |
2009-01-19 17:03:57 | ggenellina | set | messages: + msg80194 |
2009-01-19 15:00:51 | javen72 | set | messages: + msg80171 |
2009-01-16 09:25:18 | ggenellina | set | messages: + msg79943 |
2009-01-16 08:37:32 | javen72 | set | messages: + msg79940 |
2009-01-15 04:36:28 | javen72 | set | messages: + msg79886 |
2009-01-15 03:29:22 | ggenellina | set | files:
+ test_file_flush.py messages: + msg79885 |
2009-01-15 02:05:08 | javen72 | set | messages: + msg79884 |
2009-01-14 16:33:15 | amaury.forgeotdarc | set | messages: + msg79858 |
2009-01-14 16:16:04 | javen72 | set | messages: + msg79857 |
2009-01-14 15:57:12 | javen72 | set | messages: + msg79856 |
2009-01-14 15:31:02 | javen72 | set | messages: + msg79854 |
2009-01-14 07:48:05 | amaury.forgeotdarc | set | nosy:
+ amaury.forgeotdarc messages: + msg79841 |
2009-01-14 06:42:37 | ggenellina | set | files: + checkfile.c |
2009-01-14 06:41:28 | ggenellina | set | files:
+ test_file_flush.py nosy: + ggenellina messages: + msg79838 |
2009-01-14 06:08:06 | eckhardt | set | nosy: + eckhardt |
2009-01-14 04:42:59 | javen72 | create |