classification
Title: os.fsync() doesn't work as expect in Windows
Type: behavior Stage: test needed
Components: Windows Versions: Python 2.7, Python 2.5
process
Status: closed Resolution: postponed
Dependencies: Superseder:
Assigned To: Nosy List: Emil.Styrke, amaury.forgeotdarc, brian.curtin, eckhardt, eryksun, ggenellina, javen72, terry.reedy
Priority: normal Keywords:

Created on 2009-01-14 04:42 by javen72, last changed 2015-03-27 06:01 by Emil.Styrke. This issue is now closed.

Files
File name Uploaded Description Edit
test_file_flush.py ggenellina, 2009-01-14 06:41 main script
checkfile.c ggenellina, 2009-01-14 06:42 check file existence like nmake would do
test_file_flush.py ggenellina, 2009-01-15 03:29 2nd version using threads and a queue
checkfile.c javen72, 2009-01-20 09:03 2nd version to check file existence using CreateFile
test_file_flush.py javen72, 2009-01-20 09:05 3rd version to emulate the real application
Messages (24)
msg79829 - (view) Author: Javen Wang (javen72) Date: 2009-01-14 04:42
I encountered a very strange issue in file flush operation in Windows.
Here's the scenario of my application:

    1. The main thread of my application will create makefiles sequentially.
    2. Once a makefile is generated, launch a separate process calling
nmake.exe to run it in parallell. The main thread then create another
makefile until no more makefiles to create.
    3. The number of new processes is limited by command line options.

My application has been running for almost a year without any problem.
But, after I made some changes recently to the content of makefile
generated, "nmake.exe" in separate process sometimes complains that
the makefile was not found. But when I went into the directory, the
makefile was there.

Because I didn't change any thing related to file creation and the new
makefiles are a little bit less than before, I guessed that the
makefile just created hasn't been flushed to disk because of size
change so the new process could not see it in a short time.

So I decided add code to force flush file buffer after writing the
file content (I didn't flush the file explicitly before). I did it
like below:

    Fd = open(File, "w")
    Fd.write(Content)
    Fd.flush()
    os.fsync(Fd.fileno())
    Fd.close()

The strangest thing happened. The "makefile" missing happened more
frequently than no flush operation. I searched the web but no answer
there.

Finally I tried to use Windows file API to create the file via pywin32
extension. The problem's gone.

        import win32file
        Fd = win32file.CreateFile(
                        File,
                        win32file.GENERIC_WRITE,
                        0,
                        None,
                        win32file.CREATE_ALWAYS,
                        win32file.FILE_ATTRIBUTE_NORMAL,
                        None
                        )
        win32file.WriteFile(Fd, str(Content), None)
        win32file.FlushFileBuffers(Fd)
        win32file.CloseHandle(Fd)

I tried writing small python extension in C to make use Windows API to
create file like above. It also works well, even I removed the
FlushFileBuffers() calling.

I think that there's a bug in Python file buffer mechanism.
msg79838 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-14 06:41
I tried to reproduce the issue, interpreting your description, but 
failed. It worked fine in my setup. Perhaps you can add more elements 
until it fails.
msg79841 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-01-14 07:48
python2.5 uses the functions of the fopen() family: fwrite(), fclose(). 
Does the problem reproduce if you use these functions in your extension 
module?

Are your files located on the local hard drive, or on a network storage?
msg79854 - (view) Author: Javen Wang (javen72) Date: 2009-01-14 15:31
gagenellina,

My application is a little bit different from your test code. It won't
wait for the exit of new process and there're file writing operations
during makefile running. I changed the test code to be real
multi-process and tried many file sizes. But I cannot reproduce it.
Maybe my application is more complicated situation.
msg79856 - (view) Author: Javen Wang (javen72) Date: 2009-01-14 15:57
The created files are on local drive. I saw the problem on the laptop
(XP-SP2), desktop(XP-SP3) and server (Win2003). But there's no such
problem on the Linux and Mac boxes.

I tried to use fopen/fwrite in my extension according to your
suggestion. The problem wasn't be reproduced. It seems the bug is more
alike in Python part.

My application is a build system and is also an open source project. Is
it possible for you to download it and try it in your box?

I created temporary user (py123, password: 123py123) for you (just in
case) and here's steps of how to reproduce it.

  1. Checkout the build system source code in, for example, c:\test
     C:\test> svn co --username py123 --password 123py123
https://buildtools.tianocore.org/svn/buildtools/trunk/BaseTools tools

  2. Checkout the source code to be built against in c:\test
     C:\test> svn co --username py123 --password 123py123
https://edk2.tianocore.org/svn/edk2/trunk/edk2 edk2

  3. Change the source code between line 222 and line 229 of
c:\test\tools\Source\Python\Common\Misc.py (SaveFileOnChange function)
like below:

     Fd = open(File, "wb")
     Fd.write(Content)
     Fd.flush()
     os.fsync(Fd.fileno())
     Fd.close()

  4. In c:\test\edk2, run
     C:\test\edk2> edksetup.bat
     C:\test\edk2> set PYTHONPATH=C:\test\tools\Source\Python
     C:\test\edk2> python.exe C:\test\tools\Source\Python\build\build.py
-n 2 -p MdeModulePkg\MdeModulePkg.dsc -a IA32 -s

  5. If the application stops with message like "makefile not found" or
"AutoGen.h not found" message, that means the problem happened.

Visual Studio 2005 is needed to reproduce it.
msg79857 - (view) Author: Javen Wang (javen72) Date: 2009-01-14 16:16
There would be more chances to see the problem by doing this:

C:\test\edk2> python.exe C:\test\tools\Source\Python\build\build.py
-n 2 -p IntelFrameworkModulePkg\IntelFrameworkModulePkg.dsc -a IA32 -s
msg79858 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-01-14 16:33
I really tried, but I can't run your script. first PYTHONPATH must be
set to some value (I tried with tools\Source\Python); Also the tool is
complaining about missing WORKSPACE and EDK_TOOLS_PATH env vars, and a
missing file "Conf\target.txt".
This seems very complicated for a test case... furthermore it comes with
binaries already linked with a specific version of python... what if I
want to use a debug build?

Another question: is there a running anti-virus? if yes, can you try to
temporarily disable it?
msg79884 - (view) Author: Javen Wang (javen72) Date: 2009-01-15 02:05
Thank you very much for the trying. You might miss the step 4 in my
previous message. The step is:

  C:\test\edk2> edksetup.bat
  C:\test\edk2> set PYTHONPATH=C:\test\tools\Source\Python
  C:\test\edk2> python.exe C:\test\tools\Source\Python\build\build.py
-n 2 -p IntelFrameworkModulePkg\IntelFrameworkModulePkg.dsc -a IA32 -s

The Visual Studio 2005 must be in the standard installation directory.
Otherwise the C:\test\edk2\Conf\tools_def.txt needs to be changed to
reflect the real path.

And I tried to disabled all antivirus services and the problem is still
there. 

Don't worry about the binary version of build in
edk2\BaseTools\Bin\Win32 (linked against Python 2.5.2). The step I told
you is to execute my application from Python script directly. And I
tried to execute from script source against Python 2.5.4 and the problem
is the same. And no matter running the build from script source or the
freeze-ed binary, the results are the same either.

If it's hard or inconvenient for you to reproduce it, could you please
give me any advice or suggestion on how to debug it (in the interpreter)
and where's most possible place the root cause would be in the Python
interpreter's code? I can try to change something in the Python
interpreter's code, rebuild it and try it on my machine. Although I have
work around for this problem, I'd like to root cause it to avoid further
possible build break of our project.
msg79885 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-15 03:29
I patiently waited for all those 150MB to download, modified Misc.py, 
run the specified commands and got this error:

build.py...
 : error 7000: Failed to start command
        C:\Program Files\Microsoft Visual Studio 8\Vc\bin\nmake.exe /
nologo -s t
build [C:\test\edk2\Build\MdeModule\DEBUG_MYTOOLS\IA32\MdePkg\Library
\BasePrintL
ib\BasePrintLib]

That's right - VS is installed in another place. "C:\Program Files" 
doesn't even exist in my Spanish version of Windows. edksetup.bat 
didn't report any error, and I have nmake.exe in my PATH.

Anyway, trying to hunt a bug in the middle of 150 MB of code is way too 
much. You should try to reduce it to the smallest piece that still 
shows the problem. 

(As a side note, a project so big SHOULD have plenty of unit tests, but 
I see they're almost inexistant. Having to write tests forces people to 
use a more modular design. In this case, probably it would have been 
easier to test this issue, in a more isolated way, without having to 
bring the whole framework in).

Multithreaded programs may be tricky. Looking at the 
_MultiThreadBuildPlatform method (we're talking of it, right?) it isn't 
obvious that there are no race conditions in the code.

For a producer-consumer process like this, I would use a pool of worker 
threads, all waiting for work to do from a Queue, and a main (producer) 
thread that puts tasks to be done into the Queue. The synchronization 
is already done for you, it's almost automatic.

I've modified my previous example to show this usage.

Unless you can bring more evidence that this is a real Python issue and 
not a race condition in your code or something, I'd recommend to close 
this as invalid.
msg79886 - (view) Author: Javen Wang (javen72) Date: 2009-01-15 04:36
I agree multithread programming is kind of difficult and tricky. But I
don't think there's a race condition in _MultiThreadBuildPlatform
method, because I do have only one producer. And the consumer consumes
the product only when it's done (i.e. the file is created and closed).
The only race condition, I think, it's in the Python or Windows file
system, in which the file created might not be seen by other process due
to the file buffer mechanism. I think the flush() method of file object
and os.fsync() are to remove the race condition. But it seems that they
are not working as they're supposed to be.

What I know is that os.fsync() calls the _commit() which calls
FlushFileBuffers(). Why no problem if I call the FlushFileBuffers()
directly? That's why I think the most possible race condition is in
Python file buffer operation which is out of the control of my Python code.

I'm sorry that I didn't realize there's 150M code to checkout. Thanks
for your patience. Actually they are not code of my application itself.
They are the code used to test my application because my application is
a build system which needs source code to build. The real code of my
application is in the, for my example, directory of
C:\test\tools\Source\Python with just about 3M source files :-) And I
think I have narrowed down the issue in the file creation in
SaveFileOnChange function in C:\test\tools\Source\Python\Common\Misc.py. 

I know it's very hard to reproduce issue in multi-thread context. And I
cannot say absolutely there's no bug in my code. It's OK for you to
close this tracker. But it would be better to let it open for a few days
so that I can do more investigation then. Anyway, thanks for the trying
again.
msg79940 - (view) Author: Javen Wang (javen72) Date: 2009-01-16 08:37
I narrowed down the root cause in the GIL of Python. I read the source
code of implementing os.fsync and found it's using
Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS to enclose the calling
of _commit() in order to be thread safe. I tried to add them in my C
extension and then the issue was reproduced. 

It looks like the thread state operation or GIL caused a race condition
and let os.fsync() not flush the makefile in buffer to disk before it
returns.

The multi-thread scenario in my application is 

    a. the main thread produces makefile
    b. once a makefile is generated, the main thread launches a new
thread in which a separated process is launched to run the makefile
until it exits.

I think there's no issue in this multi-thread scenario. And the Python
2.5.2 Maunal, Python/C API Ch8.1, says GIL must be taken care when doing
some blocking I/O operations, but it caused thread unsafe. So I still
think there's a bug in the Python.
msg79943 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-16 09:25
If you start the new thread *after* the file is closed, no race 
condition should exist. I'm sorry but I'm unable to follow the code you 
use to create threads and write the makefile; but you should verify 
that your original assertion "Once a makefile is generated, launch a 
separate process..." is true.

Alternatively, when using a Queue like in my last example, make sure 
the new job enters the queue *after* closing the file.
msg80171 - (view) Author: Javen Wang (javen72) Date: 2009-01-19 15:00
I'm quite certain that the makefile is generated before the "make"
program is launched in separated process. Follow is the original code
where the makefile is created (line 14) and a task is put in queue (line
19). It's executed in the main thread (no parallel execution here).
There's no way that the process is launched before the makefile generation. 
 
01    for Module in Pa.Platform.Modules:
02        Ma = ModuleAutoGen(Wa, Module, BuildTarget, ToolChain, Arch,
self.PlatformFile)
03        if Ma == None:
04            continue
05        # generate AutoGen files and Makefile here
06        if self.Target not in ['clean', 'cleanlib', 'cleanall', 'run',
'fds']:
07            # for target which must generate AutoGen code and makefile
08            if not self.SkipAutoGen or self.Target == 'genc':
09                Ma.CreateCodeFile(True)
10            if self.Target == "genc":
11                continue
12   
13            if not self.SkipAutoGen or self.Target == 'genmake':
14                Ma.CreateMakeFile(True)
15            if self.Target == "genmake":
16                continue
17
18        # Generate build task for the module which will be launched
separately
19        Bt = BuildTask.New(ModuleMakeUnit(Ma, self.Target))

I think the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS should not
be used for fsync in the Python core code (in posixmodule.c, right?).
Because fsync is mostly used to solve file operation issue in parallel
execution context, it should not give out CPU time to other threads or
processes when it's running.
msg80194 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-19 17:03
> I'm quite certain that the makefile is generated before
> the "make"
> program is launched in separated process. Follow is the
> original code
> where the makefile is created (line 14) and a task is put
> in queue (line
> 19). It's executed in the main thread (no parallel
> execution here).
> There's no way that the process is launched before the
> makefile generation. 

It isn't obvious that this is the case, at least to me, due to those "if" and "continue" in the code.
Try instrumenting it a bit and see what happens; a few "print" around lines 14 and 19 should suffice. 

Also, try to come with a *short* example that shows the problem. Remove all irrelevant details: if you write a "constant" makefile and still fails, omit all dependences for makefile generation; if you invoke a simple .exe (like the one I posted) and it still fails, replace the call to nmake.exe. And so on, until you can't remove anything. It may happen that you find yourself the problem doing this.

Another strategy is to start with a simple example that works (like the one I posted) and adding more and more things until it doesn't work anymore.

With the code as it is now, it's difficult to say whether this is a Python bug or not; there are tons of other factors involved.

> I think the Py_BEGIN_ALLOW_THREADS and Py_END_ALLOW_THREADS
> should not
> be used for fsync in the Python core code (in
> posixmodule.c, right?).
> Because fsync is mostly used to solve file operation issue
> in parallel
> execution context, it should not give out CPU time to other
> threads or
> processes when it's running.

Why not? All I/O operations release the GIL, and that's a good thing. In this case, if (as you assert) the other thread that accesses the file hasn't started yet, it doesn't matter if the GIL is released or not.

      Yahoo! Cocina
Recetas prácticas y comida saludable
http://ar.mujer.yahoo.com/cocina/
msg80237 - (view) Author: Javen Wang (javen72) Date: 2009-01-20 09:03
I did trace the order of file creation and process launch. It shows the
file is indeed created before the process launch.

I did another investigation. I added a new process, "cmd.exe /c copy
Makefile NewMakefile", to copy the the Makefile created, just before
launching the "nmake.exe" process. The strangest thing happened again:
each makefile was copied successful but there's still "makefile not
found" reported by "nmake.exe" process. I tried standalone copy
application "xcopy.exe" and the result was the same. So I guess that
"cmd.exe", "xcopy.exe" and "nmake.exe" might use different families of
API or different mode (sync vs async) to access file.

I decided to try the code you provided. In checkfile.c, fopen is used to
test the file existence. I changed it to the Windows native API
CreateFile and I also added a file write operation in order to make it
more similar to the real case. Eventually, the problem was reproduced in
your code. Following are the successful number of creating 1000 files 5
times in different file sizes:

Create file in 403 bytes,
  985, 992, 984, 989, 992 (no flush after creation)
  883, 886, 907, 909, 915 (flush after creation)
  
Create file in 4061 bytes
  983, 976, 982, 977, 983 (no flush after creation)
  654, 672, 684, 686, 648 (flush after creation)
  
Create file in 16461 bytes:
  970, 967, 963, 963, 971 (no flush after creation)
  598, 664, 711, 653, 623 (flush after creation)

In summary:
  a. Using fopen to test a file existence in check_file.c will never
report failure, no matter buffer is flushed or not.
  b. Using CreateFile (read mode) to test a file existence in
check_file.c will always report failure. The bigger the file size will
cause more failure reported; the flush operation after file creation in
test_file_flush.py will cause more failure reported; the flush operation
after new file creation in check_file.c will cause more failure
reported; no flush operation in both check_file.c and test_file_flush.py
almost cause no failure.

I don't know what's root cause: GIL, Python thread state switch, Python
file buffer or Windows FileCreate API. I'm just certain there's race
condition between Python and Windows.
msg80238 - (view) Author: Javen Wang (javen72) Date: 2009-01-20 09:05
The test code and script which can reproduce the problem has been uploaded.
msg80319 - (view) Author: Gabriel Genellina (ggenellina) Date: 2009-01-21 07:36
Thanks for adapting the smaller example. I think I figured out what's 
the problem.

The error reported by checkfile.c is 0x20 = ERROR_SHARING_VIOLATION
"The process cannot access the file because it is being used by 
another process."

I modified the subprocess call, adding the parameter close_fds=True -- 
and now I see no errors. 
Please do a similar change in your application and see if it works.
msg80358 - (view) Author: Javen Wang (javen72) Date: 2009-01-22 02:58
I'm using Python2.5 in which close_fds is not available in Windows. And
I cannot use Python2.6 because I need to redirect the stdout and stderr
and there's performance concern.

I have questions on the root cause:

  a. why doesn't fopen() has sharing issue?
  b. why don't os.close() and FileObject.close() really close the file?
msg80359 - (view) Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * (Python committer) Date: 2009-01-22 08:20
python2.6 is not so different from python2.5. Which performance concerns
do you have?
msg80371 - (view) Author: Javen Wang (javen72) Date: 2009-01-22 16:29
I don't have Python2.6 installed. I just have ever read a bench article
comparing the performance between different version of Python, including
2.5, 2.6 and 3.0. That article shows, for the same script, Python2.6 has
longer running time than Python2.5. My application is a build system and
the users are very care about the time spent in build. That's why I have
the performance concern.

What concerns me more is the stdout/stderr redirection. Python2.6 manual
says they cannot be redirected if close_fds is set to True. My
application relies on the I/O redirection to control the screen output
from subprocess.

Actually I have had a work-around for this issue. It works very well so
far. I reported a bug here just because I want the Python to be better.
I learned it one year ago but I love this language. I just hope nobody
else encounter such problem again. If you guys think it won't be fixed
in Python 2.5 or has been fixed in Python 2.6, it's OK to close this
tracker. Thanks for your time.
msg112743 - (view) Author: Terry J. Reedy (terry.reedy) * (Python committer) Date: 2010-08-04 03:23
To be a tracker bug, as opposed to feature request, behavior must disagree with the manual or doc string, not personal expectation. In any case, OP declined to test on 2.6 and suggested closing, so I am. This still remains on tracker to be searched.

Anyone who considers reopening should verify a discrepancy with 2.7 or later current version.
msg239306 - (view) Author: Emil Styrke (Emil.Styrke) Date: 2015-03-26 06:40
I have experienced this issue with Python 2.7.8 and 2.7.9.  It is almost the same issue as the OP experiences as far as I can tell: spawn some Python threads that each create a file, flush, fsync, close, then start a subprocess which uses the file through the Windows API.  One thing that may differ is that I create the file and spawn the child from the same thread.

I tried with close_fds=True, and it indeed works then, but like the OP, in my production code I need to get the output from the process, so it is not a usable workaround for me.

Test script and child program code is available at https://gist.github.com/anonymous/5d7d198cd3c22b31fa2a (the file upload button doesn't work for me it seems).  Running the script on my machine will print at least one failure most of the time, but not always.
msg239374 - (view) Author: Eryk Sun (eryksun) * (Python triager) Date: 2015-03-27 04:15
Emil,

Your example child process opens the file with only read sharing, which fails with a sharing violation if some other process inherits the file handle with write access. The `with` block only prevents this in a single-threaded environment. When you spawn 10 children, each from a separate thread, there's a good chance that one child will inherit a handle that triggers a sharing violation in another child.

Using close_fds is a blunt solution since it prevents inheriting all inheritable handles. What you really need here has actually already been done for you. Just use the file descriptor that mkstemp returns, i.e. use os.fdopen(infd, 'wb'). mkstemp opens the file with O_NOINHERIT set in the flags.
msg239375 - (view) Author: Emil Styrke (Emil.Styrke) Date: 2015-03-27 06:01
eryksun, thank you for the explanation, the penny finally dropped for me.  It's the *other* child processes that keep the file open, not the parent.  Actually this bug started hitting me after I *stopped* doing exactly what you suggest (fdopen on the temp file), and instead started using file names generated by my program (opened with regular "open") and now I see why that is.  If I use os.fdopen with the O_NOINHERIT flag, it works as expected.
History
Date User Action Args
2015-03-27 06:01:42Emil.Styrkesetmessages: + msg239375
2015-03-27 04:15:18eryksunsetnosy: + eryksun
messages: + msg239374
2015-03-26 06:40:11Emil.Styrkesetnosy: + Emil.Styrke

messages: + msg239306
versions: + Python 2.7
2010-08-04 03:23:06terry.reedysetstatus: open -> closed

nosy: + terry.reedy
messages: + msg112743

resolution: postponed
2010-01-29 01:13:10brian.curtinsetpriority: normal
nosy: + brian.curtin

stage: test needed
2009-01-22 16:29:10javen72setmessages: + msg80371
2009-01-22 08:20:17amaury.forgeotdarcsetmessages: + msg80359
2009-01-22 02:58:29javen72setmessages: + msg80358
2009-01-21 07:36:02ggenellinasetmessages: + msg80319
2009-01-20 09:05:39javen72setfiles: + test_file_flush.py
messages: + msg80238
2009-01-20 09:03:18javen72setfiles: + checkfile.c
messages: + msg80237
2009-01-19 17:03:57ggenellinasetmessages: + msg80194
2009-01-19 15:00:51javen72setmessages: + msg80171
2009-01-16 09:25:18ggenellinasetmessages: + msg79943
2009-01-16 08:37:32javen72setmessages: + msg79940
2009-01-15 04:36:28javen72setmessages: + msg79886
2009-01-15 03:29:22ggenellinasetfiles: + test_file_flush.py
messages: + msg79885
2009-01-15 02:05:08javen72setmessages: + msg79884
2009-01-14 16:33:15amaury.forgeotdarcsetmessages: + msg79858
2009-01-14 16:16:04javen72setmessages: + msg79857
2009-01-14 15:57:12javen72setmessages: + msg79856
2009-01-14 15:31:02javen72setmessages: + msg79854
2009-01-14 07:48:05amaury.forgeotdarcsetnosy: + amaury.forgeotdarc
messages: + msg79841
2009-01-14 06:42:37ggenellinasetfiles: + checkfile.c
2009-01-14 06:41:28ggenellinasetfiles: + test_file_flush.py
nosy: + ggenellina
messages: + msg79838
2009-01-14 06:08:06eckhardtsetnosy: + eckhardt
2009-01-14 04:42:59javen72create